Mongo: possible data corruption after returning a secondary to the replica set
I am trying to understand the source of some data corruption that occurred around the same time a secondary was returned to a replica set.
We have a production replica set with 4 nodes - 3 data-carrying nodes and an arbiter.
I took a secondary (call it X) out of the production replica set and used it to seed a new test replica set for some performance benchmarking. After seeding the new replica set, I put X back into the production replica set. Within about 10 hours we had complaints from customers that they had lost around 2 days of data. X had been out of production for 2 days as well. So we are wondering if re-introducing X caused some data reversion.
The timings line up very closely and we haven't been able to find any plausible alternative theory - hence this post.
The odd thing is that only some mongo collections were "reverted". Our database seems to be a mix of the primary and X.
In more detail, this is what I did:
1. rs.remove(X) to take X out of the production replica set.
2. Stopped mongod, removed the production replica set name from mongod.conf and restarted X as a standalone.
3. Connected to the local database and ran db.dropDatabase() to clean out the production replica set info.
4. Restarted mongod with the same mongod.conf but with a new replica set name.
5. Initiated the new test replica set, with X as its primary.
6. Seeded the other test nodes from X in the new replica set.
7. rs.stepDown() and rs.remove(X) to take X out of the test replica set.
8. Stopped mongod, edited mongod.conf and dropped the local database again.
9. Restarted mongod with the same mongod.conf but with the production replica set name.
10. rs.add(X) to add X back into the production replica set.
To clarify - no new data was added to X when it was the primary in the test replica set.
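For reference, a minimal sketch of the shell commands behind those steps. The hostname "X:27017" and the test replica set name "testSet" are placeholders I've made up here; the actual config changes were made in mongod.conf as described above.

// On the production primary: remove X from the production set
rs.remove("X:27017")

// On X, restarted as a standalone (replSetName removed from mongod.conf):
use local
db.dropDatabase()   // wipes the old replica set metadata and oplog

// On X, restarted with replSetName pointing at the new test set:
rs.initiate({ _id: "testSet", members: [{ _id: 0, host: "X:27017" }] })

// Back on the production primary, after X's local database has been
// dropped again and mongod.conf points at the production set name:
rs.add("X:27017")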
Here's some info which might be relevant:
All nodes use the MMAPv1 storage engine and run mongo 3.2.7.
After X was removed from the production replica set, its entry for the production primary in /etc/hosts accidentally got deleted. It was able to communicate directly with the other secondary and the arbiter but not with the primary. There were lots of heartbeat error logs.
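For what it's worth, this is the kind of check I would run in the mongo shell on X to see how it views the other members; rs.status() is standard, and the heartbeat message field is usually empty when a member is healthy:

// Summarise, from this node's point of view, each member's state and last heartbeat error
rs.status().members.forEach(function (m) {
    print(m.name + "  state=" + m.stateStr +
          "  lastHeartbeatMessage=" + (m.lastHeartbeatMessage || ""));
});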
I found these logs which seem to indicate that X's data got dropped when it re-entered the production replica set:
2017-01-13T10:00:59.497+0000 I REPL [ReplicationExecutor] syncing from: (other secondary)
2017-01-13T10:00:59.552+0000 I REPL [rsSync] initial sync drop all databases
2017-01-13T10:00:59.554+0000 I STORAGE [rsSync] dropAllDatabasesExceptLocal 3
2017-01-13T10:00:59.588+0000 I JOURNAL [rsSync] journalCleanup...
2017-01-13T10:00:59.588+0000 I JOURNAL [rsSync] removeJournalFiles
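Those lines are the start of an initial sync: the rejoining member drops its own databases and re-copies everything from its sync source. That seems expected here, because dropping X's local database removed its replication state. A rough way to check whether the two-day absence would also have exceeded the oplog window (run in the mongo shell; both helpers exist in 3.2):

// On the production primary: how much history the oplog covers
rs.printReplicationInfo()

// On any member: how far each secondary is behind the primary
rs.printSlaveReplicationInfo()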
Prior to all this, developers had also been reporting that the primary was sometimes unresponsive under higher loads. These are some errors from the reactivemongo driver:
No primary node is available!
The primary is unavailable, is there a network problem?
not authorized for query on [db]:[collection]
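When the primary appeared unresponsive, one thing I could have checked (a sketch only; the 5-second threshold is arbitrary) is whether long-running operations were piling up on it:

// Show operations on this node that have been running for more than 5 seconds
db.currentOp({ active: true, secs_running: { $gt: 5 } }).inprog.forEach(function (op) {
    print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s");
});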
The nodes are on AWS: the primary runs on an m3.xlarge, the secondaries on m3.large and the arbiter on an m3.medium.
About 30 hours after we got customer complaints, our replica set held an election and X became the primary. These are the logs:
2017-01-15T16:00:33.332+0000 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-01-15T16:00:33.333+0000 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-01-15T16:00:33.347+0000 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2017-01-15T16:00:33.370+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 2
2017-01-15T16:00:33.370+0000 I REPL [ReplicationExecutor] transition to PRIMARY
2017-01-15T16:00:33.502+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted
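Since X won this election before I knew its data might be suspect, one mitigation (a sketch, not something we had in place) would have been to lower X's election priority until its data had been verified:

// On the current primary: reduce X's chance of being elected
// (members[2] is a placeholder index - check rs.conf() for X's actual position)
cfg = rs.conf()
cfg.members[2].priority = 0.5
rs.reconfig(cfg)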
This happened before I realized the /etc/hosts file was broken on X.
I also found a lot of these errors in the logs when replicating one very large collection (260 million documents):
2017-01-13T13:01:35.576+0000 E REPL [repl writer worker 9] update of non-mod failed: { ts: Timestamp 1484301755000|10, t: 1, h: -7625794279778931676, v: 2, op: "u", ns: ...
This is a different collection, though, from the one that got corrupted.
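As I understand it, "update of non-mod failed" means an oplog update could not be applied to the corresponding document on the syncing member, which itself suggests that member's copy of the collection has diverged. A coarse way to compare members (the database and collection names below are placeholders; validate() can be very slow on a 260-million-document collection):

// Run on each member (rs.slaveOk() first on a secondary) and compare the counts
rs.slaveOk()
db.getSiblingDB("mydb").getCollection("mycoll").count()

// Deeper structural check on one member - slow on very large collections
db.getSiblingDB("mydb").getCollection("mycoll").validate(true)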