So, the state of open source database replication is pretty sad. MySQL replication just doesn't cut it in many serious environments because the slaves can't keep up with the write load on the master. So, PostgreSQL right? Well, not so fast. PostgreSQL replication is handled in one of two ways: Slony or PITR (point-in-time recovery).
Slony provides all the same features as MySQL's replication (except that it is much harder to setup and maintain), but also boasts the same acute performance issues -- a busy master can easily outpace slaves, leaving them in the dust. Query-log-based replication is pretty flawed and while creative people will do whack shit to try to push the envelope this still doesn't make it a good method.
PITR is much more like Oracle's replication mechanism. PITR takes the WAL (write-ahead logs) and ships them over to the slave to be reapplied. This leaves you with an identical database (block for block) and a weak machine can easily keep up with a beefy master. In Oracle terminology "WALs" are called "archive logs."
So, with PITR, all our problems are solved, right? No. When using a PITR-style warm standby configuration the database is in "recovery mode" all the time. This means the database is "sorta up" waiting for the next WAL log to appear so that it can play forward through the transactions and "catch up" to the master: "continuous recovery." This means the database isn't available for queries. Now, Oracle works the same way. While Oracle is recovering, you can't use the database. However, using Oracle you can cancel recover, mount the database read-only, do some queries, unmount and begin recovery again picking right up where you left off. In PostgreSQL you cannot open the database in read-only mode and then later continue recovery as the act of opening the database (even in read-only mode) will deviate from the path of the recovery -- can we say design flaw?
While Oracle's "got game" on PostgreSQL, the concept of stopping recovery so that we can run queries on the slave isn't ideal. If my queries are substantial my "warm standby" will get colder and colder as it sits neglecting to apply archive logs. So, I want my warm standby and I want to be able to run long, heavy queries against it. And someone's going to give it to me!
However, I'm impatient. So, I'm going to make it work myself. Using the power of ZFS, I'm going to snap my PITR slave and clone it into a "disposable" "point-in-time" copy. This is really useful for running heavy reports.
The basic concept is this:
Now, all I have to do is get a copy of the PITR stuff into that zone. There are a few caveats: (1) due to postgres' design the copy must be read-write as it will be destructive even in read-only mode and (2) it will still be in recovery mode, so I'll need the last WAL archive so that it can "finish" recovery before I bring it online.
ZFS gives us cheap, fast read-write clones of filesystems:
In the global zone:
zfs snapshot store2/postgres/82@clonebase
zfs clone store2/postgres/82@clonebase store2/clonedb/82
In the clonedb zone:
zfs mount /store2/postgres/82
zfs set mountpoint=/data/postgres/82 store2/clonedb/82
I need to make sure I copy the latest WAL archive from the PITR slave into the pg_walarchives directory on my clonedb zone. Then I just startup my postgres instance in the zone and touch the /data/postgres/82/failover file (this file tells my recovery script to stop recovering and start up normally). Viola.
It might sound complicated, but we just run ./clonedb_startclone.sh and in about one minute we have a fully operational read-write database and the PITR slave is merrily continuing recovery.
# ./clonedb_startclone.sh
[Mon Jul 2 20:37:14 EDT 2007] Stopping postgres in clonedb
[Mon Jul 2 20:37:20 EDT 2007] Dropping clone and base snapshot
[Mon Jul 2 20:37:38 EDT 2007] Snapshot store2/postgres/82
[Mon Jul 2 20:37:39 EDT 2007] Clone to store2/clonedb/82
[Mon Jul 2 20:37:41 EDT 2007] Mount store2/clonedb/82 at /data/postgres/82 in clonedb
[Mon Jul 2 20:37:43 EDT 2007] Copy last WAL [0000000100000016000000FA]
[Mon Jul 2 20:37:43 EDT 2007] Make it active [induce failover]
[Mon Jul 2 20:37:43 EDT 2007] Start postgres in clonedb
[Mon Jul 2 20:38:07 EDT 2007] System up
Now I can run long data mining queries and other complicated reports against my standby database. No load is induced on the master database at all (so no concern about negative production service effects) and the standby recovery is continuing on unaffected, so from the failover point-of-view nothing has changed either. I am not even limited to one zone! Any time I'd like to, I can just "snap" myself a new query slave. It's a cheap, mutable, entirely disposable copy. Nice.
It's worth noting that this same technique should work like a charm on Oracle as well. Also, it should work well with any other filesystem that supports copy-on-write cloning -- though I don't know of any other than ZFS.
This, in a long line of things, just lets you know that when your database doesn't quite have the spunk to finish the race, today's operating systems are actually powerful enough to drag them across the finish line.
Tuesday, July 3. 2007 at 05:28 (Reply)
Nice to see people do creative things with ZFS and PostgreSQL databases.
One detail though: by running all your clones in zones, you've put "all your eggs in one basket": hardware malfunctions or dies, and all your databases are gone. ZFS doesn't support the notion of a global (cluster) filesystem yet, so your data doesn't get physically replicated across nodes on the network.
However, perhaps something could be cooked up with `zfs send`...
Tuesday, July 3. 2007 at 07:40 (Reply)
You don't need ZFS-fu to do multiple slaves. You can use PITR that way. Nothing prevents you from having multiple slaves doing continuous recovery.
We do use `zfs send` to facilitate the initial build out of the slave: (1) pg_start_backup, (2) zfs snapshot, (3) pg_stop_backup. We then spool the snapshot to tape. When we want to build a new PITR slave, we just take the most recent snapshot used for backup purposes and send it over to the target and start continuous recovery.
Wednesday, July 4. 2007 at 09:01 (Link) (Reply)
So.. I take it that these DBs are running from an Opensolaris Box and thus the filesystems are ZFS on a SAN/NFS?? Or is it local?
Just wondering if it's possible to do this with a windows box running postgres but with a mounted NFS/SAN/SMB share w/ ZFS as the underlying filesystem?
Thursday, July 5. 2007 at 13:02 (Reply)
I did the proof of concept on an OpenSolaris box, but I rolled it out for real on a Solaris 10u3 box with fiber attached storage. You could call it a SAN, but it's an XServe RAID, so I refuse to give it an inflated sense of self-worth.
Tuesday, July 3. 2007 at 18:23 (Link) (Reply)
This sounds like a great technique. Offhand it seems like it should work with LVM snapshots also, for those of us who don't (yet?) have ZFS.
Thursday, July 5. 2007 at 13:04 (Reply)
That would be cool to see. I didn't see any support for using the LVM snapshots as a copy-on-write base for a second filesystem.
We use LVM snapshots extensively to backup our Linux based databases and mail systems. LVM snaps combined with box ext3 and XFS and mounting up the snap read-only to perform the backup. Works like charm.
This requires being about to mount up the snap read-write.
Saturday, July 7. 2007 at 19:01 (Link) (Reply)
Very interesting, and yet another sales job for Solaris :-).
I'll point out the obvious problem with this approach, which is that it's pretty intrusive on the "slave node." That is, when you update the replica, it requires shutting down the postmaster and starting it up again (against new data).
*Entirely* unsuitable for our application of replication where we point the WHOIS service at a "slave node." We need the slave to be up all the time, just like the master :-).
But I could see this being hugely useful for, say, daily reporting. You grab a snapshot at 12:00:01, and can do a bunch of work with it...
You'll be interested to know that the Slony-I team (I've been the release manager for Slony-I for the last year or so) has been talking about improvements for handling bulk changes (at least, for the ones where we could reasonably fold together a bunch of DELETE requests on the same table, and such like).
There is definitely a possibility of performance improvement, although it complicates code that already has too few that understand it, so there is some opposition...
Wednesday, November 28. 2007 at 10:49 (Link) (Reply)
Great article...
I've done something very similar but using a NetApp solution. It provides a nice central mgmt point but yet is relatively inexpensive compared to SAN solutions. It's really the way to go for your enterprise solution because it allows you to play at the hardware level instead of the software level.
I personally don't like replication topologies unless the schema is relatively mature and doesn't change. Most vendors (Oracle, SQL Server, PostgreSQL, etc) tie their replication scenario down into the physical model where it complicates things like management and deployments.
One way to make a hardware solution perform at near replication speeds is to deploy read-only farms behind a load balancer. I build upon the above direction and found that you could mitigate the downtime of restarting the postmaster if you placed many nodes behind a load balancer and then restarted them in pools (like 3+3 or 2+2+2). You can fit quite a few SLA's with that type of scenario. Not "real-time" data but "real-enough-time" for most solutions. Also hardware is cheap and managing software solutions is expense. :)
The nice part is that the data for those clones are all reading from the same snap so the disk space needed is not O(n) but O(1)... Basically just plug in CPU boxes to mount that clone and start PostgreSQL...
-Paul
Friday, January 18. 2008 at 07:42 (Reply)
Very good tips, I am going to set similar system with yours. I am wondering what value do you altually setup for the checkpoint
timeout in the postgres.conf file as well as archive timeout.
We have a relatively quiet database, but we want high availability up to 2 minutes data if server fail, I understand by default WAL log is write to hard disk when the RAM is filled up till 16 m and default check timeout is 5 minutes, I don't think our WAL can be filled up until 16m whthin 2 minutes for our company.
So that mean if I want up to 2 minutes data, I have to set the checkpoint timeout to 2min? and if the checkpoint is set too frequent, will this affect the master server performance a lot?
Any comments or help will be greatly appreciated.
-Raymond
Friday, January 18. 2008 at 11:32 (Reply)
Well, empirically, if you aren't filling the log within two minutes, your server is not so busy that setting the checkpoint interval down to two minutes would cause a performance issue.