The Domain

So, at the $DAYJOB, we were faced with building a large operational data store. Large has many meanings to many people. I’ve written about this before, but I’ll reiterate the scope: more than 1TB of data, thousands of tables, and several tables with around one billion rows. So, for a variety of reasons, we chose PostgreSQL. I’ve written about that choice a few times, but didn’t write about the choice to use Solaris.

A Bad Choice

So, I’ll start by saying we chose Linux: CentOS 4 (an RHEL4 clone). The box we chose was a 4-processor Opteron with 16GB of RAM and three internal drives, with the rest of the storage fiber-attached. While I like XFS a lot as a file system, I’ve seen some odd issues here and there (specifically on our 1TB mail server and our 1TB Subversion server). So, just to be safe, we used the tried-and-true ext3.

It started out well. One day /data/ (our large PostgreSQL data mount point) suddenly went read-only. Postgres, of course, was quite displeased with this occurrence. So, naturally, I tried to fix the problem by unmounting and remounting, all to no avail. It turns out that a reboot was required to rectify the issue. While this was disturbing, we rebooted and continued on with life. The more annoying issue was the subsequent 18 times this occurred. The showstopper was the 20th time it failed; upon reboot I found catastrophic data loss.

To set the scene appropriately: we run Linux in a lot of places, and we also run FreeBSD, OpenBSD, and Solaris. Where we run these, they run well, because when something doesn’t we replace it with something that does. I love FreeBSD, but I have some application code that will make it kernel panic inside 5 minutes. Long story short, fear PostgreSQL on 64-bit Linux. Once bitten, twice shy… 20 times you’re an idiot.

A Better Choice

So, we chose Solaris 10. Why? We’ve run a lot on Solaris and been quite pleased with its stability. It has excellent support for enterprise storage hardware and multi-processor AMD Opteron systems. We’ve used VxFS (Veritas File System) and it has been a “good life.” Solaris 10 sports a new file system called ZFS, which boasts a lot of the features of VxFS (but not quite the performance). ZFS’s volume management, built-in compression, snapshot capabilities, and simple administration make it a hands-down winner over LVM (the Linux Logical Volume Manager) and ext3. Now we have two 2.7TB ZFS pools, soon to add another 1.4TB pool, and management has been quite painless.
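To give a feel for what “simple management” means in practice, here is a minimal sketch of the kind of commands involved; the pool name, device names, and mount point below are made up for illustration and are not our actual layout:

    # create a mirrored pool from two fiber-attached LUNs (device names hypothetical)
    zpool create data mirror c2t0d0 c2t1d0

    # carve out a filesystem for the PostgreSQL cluster, compressed, mounted at /data
    zfs create data/pgdata
    zfs set compression=on data/pgdata
    zfs set mountpoint=/data data/pgdata

    # take a point-in-time snapshot before doing anything risky
    zfs snapshot data/pgdata@before-upgrade

That is the whole exercise; there is no separate volume manager, no mkfs, and no fstab editing.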

Another issue that we had with PostgreSQL (coming from Oracle) was its severe inability to introspect the database. Why is a query slow? How many disk reads did it do? Which locks did it acquire, when, and how long did it block waiting for each lock to be granted? If a query hits disk, which spindles were accessed? PostgreSQL simply does not provide an interface to this information. How, you ask, does Solaris help with this? Enter DTrace.

DTrace allows us to dynamically instrument the undercarriage of PostgreSQL (see what it is doing from the application level all the way down through the kernel). Because you can instrument application space right alongside kernel space, we can see the queries being performed, the SQL plan being executed, the memory allocations, lock acquisitions, disk reads and writes; simply put: everything. Using DTrace we were able to develop a PostgreSQL administration toolkit that provides information such as the number of blocks read and written to each individual spindle over the course of a given query.
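As a taste of what that looks like (this is a trivial one-liner for flavor, not our toolkit), you can aggregate the read and write system calls issued by every postgres backend on the box, live, with no restart and no code changes:

    # count read/write syscalls per postgres backend, broken down by pid and call
    dtrace -n 'syscall::read:entry, syscall::write:entry
        /execname == "postgres"/
        { @[pid, probefunc] = count(); }'

The real toolkit correlates this sort of data with the query being executed and the devices the I/O lands on, but the one-liner shows the basic idea: observe the running system, ask a question, get an answer.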

We also use SMF, but I honestly don’t think that buys us anything really special. It’s nice that it “just works,” but quite frankly my SysV init scripts “just work” as well.

In the works is a set of PostgreSQL administrator tools that provide transparent, database-level access to systemic information about PostgreSQL backend processes.

Not everything is peaches and cream.

One of the nice features of ZFS is its snapshot backups and the ability to “dump” the differences between one snapshot and the next. This, in essence, is a BLI: a block-level incremental backup. ZFS manages this at a “zobject” level as I understand it, but it should accomplish the same thing. It means that if you have only 1GB of changes on a 5TB filesystem, it is feasible to restore a 5TB base image and then a 1GB changeset to get a block-level accurate restore. This in turn means that you can back up 5TB once per week and then each day back up only the changes between that “level 0” backup and the current snapshot (which should be small and manageable). I love this feature, but I hate that it doesn’t work. On our system, at least, a “zfs send” of the differences between one snapshot and the next can take 40-50 hours. This appears to be a flaw in how ZFS send works, in that it doesn’t prioritize its I/O requests over normal traffic. I hope the Sun guys get to work on this one soon!
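For the record, the backup scheme itself is straightforward; here is a rough sketch of the weekly/daily cycle (the hostnames, pool, and snapshot names are made up, and this assumes the received copy is left untouched between runs):

    # weekly "level 0": snapshot the filesystem and ship the full stream off-box
    zfs snapshot data/pgdata@week0
    zfs send data/pgdata@week0 | ssh backuphost zfs receive tank/pgdata

    # daily incremental: send only the blocks changed since the weekly snapshot
    zfs snapshot data/pgdata@day1
    zfs send -i data/pgdata@week0 data/pgdata@day1 | ssh backuphost zfs receive tank/pgdata

It is that incremental “zfs send -i” step that takes 40-50 hours for us, even when the actual changeset is small.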

Review

After running the system for some time now, I can say that Solaris 10 was an excellent decision. We’ve since launched four more respectably sized PostgreSQL instances on Solaris 10 with equal success.