verifying data consistency between two postgresql databases
This is specifically about maintaining confidence in using various replication solutions that you'd be able to failover to the other server without data loss. Or in a master-master situation that you could know within a reasonable amount of time if one of the databases has fallen out of sync.
Are there any tools out there for this, or do people generally depend on the replication system itself to warn over inconsistencies? I'm currently most familiar with postgresql WAL shipping in a master-standby setup, but am considering a master-master setup with something like PgPool. However, as that solution is a little less directly tied with PostgreSQL itself (my basic understanding is that it provides the connection an app would use, thus intercepting the various SQL statements, and would then send them on to whatever servers were in its pool), it got me thinking more about actually verifying data consistency.
Specific requirements:
I'm not talking about just table structure. I'd want to know that actual record data is the same, so that I'd know if records were corrupted or missed (in which case, I would re-initialize the bad database with a recent backup + WAL files before bringing it back into the pool)
Databases are in the order of 30-50 GB. I'm doubting that raw SELECT queries would work very well.
I don't see the need for real-time checking (though it would, of course, be nice). Hourly or even daily would be better than nothing.
Block-level checking wouldn't work. It would be two databases with independent storage.
Or is this type of verification simply not realistic?
You can check the current WAL locations on both the machines... If they represent the same value, that means your underlying databases are consistent with each other...
$ psql -c "SELECT pg_current_xlog_location()" -h192.168.0.10 (do it on primary host)
pg_current_xlog_location
--------------------------
0/2000000
(1 row)
$ psql -c "select pg_last_xlog_receive_location()" -h192.168.0.20 (do it on standby host)
pg_last_xlog_receive_location
-------------------------------
0/2000000
(1 row)
$ psql -c "select pg_last_xlog_replay_location()" -h192.168.0.20 (do it on standby host)
pg_last_xlog_replay_location
------------------------------
0/2000000
(1 row)
you can also check this with the help of walsender and walreceiver processes:
[do it on primary] $ ps -ef | grep sender
postgres 6879 6831 0 10:31 ? 00:00:00 postgres: wal sender process postgres 127.0.0.1(44663) streaming 0/2000000
[ do it on standby] $ ps -ef | grep receiver
postgres 6878 6872 1 10:31 ? 00:00:01 postgres: wal receiver process streaming 0/2000000
If you are looking for the whole table you should be able to do something like this (assuming a table that quite easily fits in RAM):
SELECT md5(array_to_string(array_agg(mytable), ' '))
FROM mytable order by id;
That will give you a hash on the tuple representation on the tables.
Note that you could break this down by ranges, etc. Depending on the type of replication you could even break it down by page range (for streaming replication).
链接地址: http://www.djcxy.com/p/13544.html