How to properly handle asynchronous database replication?

I'm considering using Amazon RDS with read replicas to scale our database.

Some of our controllers in our web application are read/write, some of them are read-only. We already have an automated way for identifying which controllers are read-only, so my first approach would have been to open a connection to the master when requesting a read/write controller, else open a connection to a read replica when requesting a read-only controller.
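To make the approach concrete, here is a minimal sketch of that routing idea; the hostnames, the `READ_ONLY_CONTROLLERS` set, and `pick_database` are all hypothetical placeholders, not part of any real framework:

```python
import random

# Placeholder connection endpoints (assumptions, not real hosts).
MASTER_DSN = "master.example.internal"
REPLICA_DSNS = ["replica-1.example.internal", "replica-2.example.internal"]

# Assumed output of the automated read-only detection mentioned above.
READ_ONLY_CONTROLLERS = {"/member-area", "/profile"}

def pick_database(controller_path):
    """Route read-only controllers to a random replica, everything else to the master."""
    if controller_path in READ_ONLY_CONTROLLERS:
        return random.choice(REPLICA_DSNS)
    return MASTER_DSN
```

As the rest of the question explains, this naive split is exactly what breaks when replication lag comes into play.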

In theory, that sounds good. But then I stumbled upon the concept of replication lag: a replica can be several seconds behind the master.

Let's imagine the following use case then:

  • The browser posts to /create-account , which is read/write, thus connecting to the master.
  • The account is created, the transaction is committed, and the browser is redirected to /member-area .
  • The browser opens /member-area , which is read-only, thus connecting to a replica. If the replica is even slightly behind the master, the user account might not exist there yet, resulting in an error.

How do you realistically use read replicas in your application to avoid these potential issues?


    This is a hard problem, and there are many potential solutions. One approach is to look at what Facebook did:

    TL;DR: read requests get routed to the read-only copy, but if you do a write, then for the next 20 seconds all your reads go to the writable master.

    The other main problem we had to address was that only our master databases in California could accept write operations. This fact meant we needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to our master databases in California. Fortunately, our most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. The problem thus boiled down to, when a user makes a request for a page, how do we decide if it is "safe" to send to Virginia or if it must be routed to California?

    This question turned out to have a relatively straightforward answer. One of the first servers a user request to Facebook hits is called a load balancer; this machine's primary responsibility is picking a web server to handle the request but it also serves a number of other purposes: protecting against denial of service attacks and multiplexing user connections to name a few. This load balancer has the capability to run in Layer 7 mode where it can examine the URI a user is requesting and make routing decisions based on that information. This feature meant it was easy to tell the load balancer about our "safe" pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.

    There is another wrinkle to this problem, however. Let's say you go to editprofile.php to change your hometown. This page isn't marked as safe so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, we send you to Virginia. Because of the replication lag we mentioned earlier, however, you might not see the change you just made! This experience is very confusing for a user and also leads to double posting. We got around this concern by setting a cookie in your browser with the current time whenever you write something to our databases. The load balancer also looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California. Then when 20 seconds have passed and we're certain the data has replicated to Virginia, we'll allow you to go back for safe pages.
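The "write cookie" routing rule above can be sketched in a few lines. This is a minimal illustration, not Facebook's actual implementation; the function name, the cookie key `last_write`, and the `safe_pages` parameter are assumptions:

```python
import time

STICKY_SECONDS = 20  # matches the 20-second window described above

def route_request(path, cookies, safe_pages, now=None):
    """Decide whether a request may be served from the replica region.

    cookies: dict; 'last_write' is assumed to hold the unix timestamp of the
    user's most recent database write (set by the application on every write).
    """
    now = now if now is not None else time.time()
    wrote_recently = (
        "last_write" in cookies
        and now - float(cookies["last_write"]) < STICKY_SECONDS
    )
    if path in safe_pages and not wrote_recently:
        return "replica"   # e.g. Virginia
    return "master"        # e.g. California
```

In practice this check lives in the load balancer (Layer 7 mode) rather than in application code, but the decision logic is the same: a recent write pins all of the user's reads to the master until the replication window has safely passed.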


    I worked with an application that used pseudo-vertical partitioning. Since only a handful of the data was time-sensitive, the application usually fetched from slaves, and from the master only in selected cases.

    As an example: when the user updated their password, the application would always ask the master during authentication. When changing non-time-sensitive data (like user preferences), it would display a success dialog along with a note that it might take a while until everything is updated.

    Some other ideas which might or might not work depending on environment:

  • After an update, compute the entity checksum, store it in the application cache, and when fetching the data always check it against that checksum.
  • Use browser storage/cookies to store a delta, ensuring the user always sees the latest version.
  • Add an "up-to-date" flag and invalidate it synchronously on every slave node before/after an update.
  • Whatever solution you choose, keep in mind it is subject to the CAP theorem.
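The first idea in the list can be sketched as follows. This is a hypothetical illustration: the in-memory `cache` dict stands in for a real application cache (e.g. memcached), and all function names are made up for this example:

```python
import hashlib

# In-memory stand-in for an application cache; an assumption for this sketch.
cache = {}

def entity_checksum(entity):
    """Checksum over a stable serialization of the entity's fields."""
    payload = "|".join(f"{k}={entity[k]}" for k in sorted(entity))
    return hashlib.sha256(payload.encode()).hexdigest()

def record_write(entity_id, entity):
    """After writing to the master, remember what the entity should look like."""
    cache[entity_id] = entity_checksum(entity)

def is_replica_fresh(entity_id, replica_entity):
    """Compare a replica read against the checksum stored at write time."""
    expected = cache.get(entity_id)
    if expected is None:
        return True  # no recent write recorded; accept the replica's copy
    return entity_checksum(replica_entity) == expected
```

When `is_replica_fresh` returns False, the application would re-read from the master (or keep serving the cached copy) instead of showing the user stale data.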
