Recovering a MongoDB Cluster Node Stuck in RECOVERING
Fault
A secondary stays in the RECOVERING state and cannot recover on its own. Its log reports that its data is too stale to sync from the primary.
shard3:PRIMARY> rs.status()
{
    "set" : "shard3",
    "date" : ISODate("2020-06-19T07:07:50.969Z"),
    "myState" : 1,
    "term" : NumberLong(16),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.1.101:22003",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 6642815,
            "optime" : {
                "ts" : Timestamp(1592550470, 135),
                "t" : NumberLong(16)
            },
            "optimeDate" : ISODate("2020-06-19T07:07:50Z"),
            "electionTime" : Timestamp(1585907679, 1),
            "electionDate" : ISODate("2020-04-03T09:54:39Z"),
            "configVersion" : 3,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "192.168.1.102:22003",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 6642802,
            "lastHeartbeat" : ISODate("2020-06-19T07:07:49.604Z"),
            "lastHeartbeatRecv" : ISODate("2020-06-19T07:07:47.392Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 3
        },
        {
            "_id" : 2,
            "name" : "192.168.1.103:22003",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 101645,
            "optime" : {
                "ts" : Timestamp(1560585165, 268),
                "t" : NumberLong(9)
            },
            "optimeDate" : ISODate("2019-06-15T07:52:45Z"),
            "lastHeartbeat" : ISODate("2020-06-19T07:07:49.071Z"),
            "lastHeartbeatRecv" : ISODate("2020-06-19T07:07:50.685Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 3
        }
    ],
    "ok" : 1
}
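To make the gap in this output concrete, here is a minimal Python sketch that computes how far each data-bearing member's optime trails the primary's. The `members` list is a hand-copied subset of the rs.status() output (member names and the seconds part of optime.ts only), and `optime_lag_seconds` is a hypothetical helper written for this illustration, not part of any MongoDB driver.

```python
def optime_lag_seconds(members):
    """Return {member name: seconds behind the primary's optime}.

    `members` is a list of dicts with "name", "stateStr" and, for
    data-bearing members, "optime_secs" (the seconds part of optime.ts).
    Arbiters carry no optime and are skipped.
    """
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optime_secs"] - m["optime_secs"]
        for m in members
        if "optime_secs" in m and m is not primary
    }


# Values copied by hand from the rs.status() output.
members = [
    {"name": "192.168.1.101:22003", "stateStr": "PRIMARY", "optime_secs": 1592550470},
    {"name": "192.168.1.102:22003", "stateStr": "ARBITER"},
    {"name": "192.168.1.103:22003", "stateStr": "RECOVERING", "optime_secs": 1560585165},
]

lag = optime_lag_seconds(members)
for name, seconds in lag.items():
    print(f"{name} is {seconds} s ({seconds / 86400:.1f} days) behind the primary")
```

For this output the lag of 192.168.1.103 comes to 31,965,305 seconds, roughly 370 days.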
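Solutions
Option 1
Delete all data on the secondary that is stuck in RECOVERING. When the node restarts with an empty data directory, MongoDB automatically resyncs it from the primary. Note: check how much data the primary holds first; if the data set is very large, Option 2 is recommended.
Option 2
When the data set is very large, copy the full data directory from a healthy member of the same replica set onto the node to be recovered. Take the copy from a cleanly stopped member or a filesystem snapshot rather than from the live files of a running mongod. Back up the existing data on the node being recovered first, so you can roll back if something goes wrong.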
Case study
Fault summary: node 192.168.1.103 stays in RECOVERING.
{
    "_id" : 2,
    "name" : "192.168.1.103:22003",
    "health" : 1,
    "state" : 3,
    "stateStr" : "RECOVERING",
    "uptime" : 101645,
    "optime" : {
        "ts" : Timestamp(1560585165, 268),
        "t" : NumberLong(9)
    },
    "optimeDate" : ISODate("2019-06-15T07:52:45Z"),
    "lastHeartbeat" : ISODate("2020-06-19T07:07:49.071Z"),
    "lastHeartbeatRecv" : ISODate("2020-06-19T07:07:50.685Z"),
    "pingMs" : NumberLong(0),
    "configVersion" : 3
}
The log on the secondary (192.168.1.103) keeps printing the following messages, which say that its data is too stale to sync from the primary.
2020-06-18T14:00:06.239+0800 I REPL [ReplicationExecutor] syncing from: 192.168.1.101:22003
2020-06-18T14:00:06.275+0800 W REPL [rsBackgroundSync] we are too stale to use 192.168.1.101:22003 as a sync source
2020-06-18T14:00:06.275+0800 I REPL [ReplicationExecutor] could not find member to sync from
2020-06-18T14:00:06.275+0800 E REPL [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2020-06-18T14:00:06.276+0800 I REPL [rsBackgroundSync] our last optime : (term: 9, timestamp: Jun 15 15:52:45:10c)
2020-06-18T14:00:06.276+0800 I REPL [rsBackgroundSync] oldest available is (term: 16, timestamp: Jun 16 18:56:27:b6e)
2020-06-18T14:00:06.276+0800 I REPL [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2020-06-18T14:00:06.276+0800 I REPL [ReplicationExecutor] going into maintenance mode with 506583 other maintenance mode tasks in progress
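The "too stale" condition in these log lines can be illustrated with a small Python sketch: a member can only catch up if its last optime is not older than the oldest entry still held in the sync source's oplog. The helpers `optime` and `is_too_stale` are hypothetical names for this illustration. The log prints no year, so the years below are assumptions inferred from the rs.status() output (optimeDate in 2019) and the date of the log lines (2020); the trailing field of each log timestamp is read as a hexadecimal increment.

```python
from datetime import datetime, timedelta, timezone

LOG_TZ = timezone(timedelta(hours=8))  # the log lines carry a +0800 offset


def optime(year, month, day, hour, minute, second, hex_increment):
    """Build a comparable (epoch seconds, increment) pair from a log optime."""
    dt = datetime(year, month, day, hour, minute, second, tzinfo=LOG_TZ)
    return (int(dt.timestamp()), int(hex_increment, 16))


def is_too_stale(last_optime, oldest_available):
    """True when the sync source no longer holds the entries needed next."""
    return last_optime < oldest_available


# "our last optime : (term: 9, timestamp: Jun 15 15:52:45:10c)", year assumed 2019
ours = optime(2019, 6, 15, 15, 52, 45, "10c")
# "oldest available is (term: 16, timestamp: Jun 16 18:56:27:b6e)", year assumed 2020
oldest = optime(2020, 6, 16, 18, 56, 27, "b6e")

print("our last optime:", ours)
print("oldest available:", oldest)
print("too stale:", is_too_stale(ours, oldest))
```

Under these assumptions the first pair matches the Timestamp(1560585165, 268) reported for this member by rs.status(), and the comparison confirms that the entries it needs have already been rolled out of the primary's oplog.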
Check the size of the data directory on the primary (192.168.1.101).
du -h --max-depth=1 /data/mongodb/shard3/data
Output
202M /data/mongodb/shard1/data/journal
280K /data/mongodb/shard1/data/rollback
93M /data/mongodb/shard1/data/diagnostic.data
110G /data/mongodb/shard1/data
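The primary holds about 110 GB in total, so the approach chosen here is to clear the data on the secondary (192.168.1.103) and let MongoDB resync it automatically.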
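One way to reason about this choice is to compare a rough resync time with how much history the primary's oplog retains, since the primary must keep its oplog entries for the whole duration of the resync or the node goes stale again. The Python sketch below is only a back-of-the-envelope estimate: the 50 MiB/s throughput and the safety factor are pure assumptions, the 43-hour window is read off the log (oldest available entry at Jun 16 18:56:27 versus log lines written at Jun 18 14:00:06), and both function names are hypothetical.

```python
def initial_sync_hours(data_gib, throughput_mib_per_s):
    """Rough copy time in hours, ignoring index builds and oplog replay."""
    return data_gib * 1024 / throughput_mib_per_s / 3600


def fits_oplog_window(sync_hours, oplog_window_hours, safety_factor=2.0):
    """True if the estimated sync, padded by a safety factor, fits the window."""
    return sync_hours * safety_factor <= oplog_window_hours


data_gib = 110             # from the du output
throughput = 50            # MiB/s, assumed; measure your own network and disks
oplog_window_hours = 43    # approximate, derived from the log timestamps

hours = initial_sync_hours(data_gib, throughput)
print(f"estimated sync time: {hours:.2f} h")
print("fits oplog window:", fits_oplog_window(hours, oplog_window_hours))
```

With these assumed numbers the copy takes well under an hour, which is consistent with choosing the automatic resync here; a much larger data set or a much shorter oplog window would point toward copying data files instead.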
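- Stop the mongod service on the secondary (192.168.1.103): connect to it and run db.shutdownServer() from the admin database.
- Move the old data aside as a backup: mv /data/mongodb/shard3/data /data/mongodb/shard3/data_backup. Assuming dbPath in the config file points at /data/mongodb/shard3/data, recreate the empty directory afterwards (mkdir /data/mongodb/shard3/data), because mongod does not start when its dbPath is missing.
- Start the mongod service on the secondary (192.168.1.103): /data/mongodb/bin/mongod -f /data/mongodb/config/mongod-shardsvr3.conf
- Wait for the sync to finish. The member reports STARTUP2 while the initial sync runs and becomes SECONDARY once it completes.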
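The final waiting step can be checked programmatically. The Python sketch below inspects a status document shaped like the rs.status() output and reports whether the resynced member has reached SECONDARY. The `status` dict is a trimmed, hand-written stand-in; with a driver such as PyMongo it could presumably be obtained from the replSetGetStatus command, which is an assumption and not something the original procedure uses. `member_state` and `resync_finished` are hypothetical helper names.

```python
def member_state(status, name):
    """Return the stateStr of the member called `name`."""
    for member in status["members"]:
        if member["name"] == name:
            return member["stateStr"]
    raise KeyError(f"no replica set member named {name!r}")


def resync_finished(status, name):
    """True once the member has left STARTUP2/RECOVERING and is SECONDARY."""
    return member_state(status, name) == "SECONDARY"


# Trimmed stand-in for the status document before the resync.
status = {
    "set": "shard3",
    "members": [
        {"name": "192.168.1.101:22003", "stateStr": "PRIMARY"},
        {"name": "192.168.1.102:22003", "stateStr": "ARBITER"},
        {"name": "192.168.1.103:22003", "stateStr": "RECOVERING"},
    ],
}

target = "192.168.1.103:22003"
print(member_state(status, target), resync_finished(status, target))
```

In practice this check would be repeated at an interval (for example once a minute) against a fresh status document until it returns True.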
Reference
https://docs.mongodb.com/manual/tutorial/resync-replica-set-member
Posted: 2020-06-19