The Domain
So, at the $DAYJOB, we were faced with building a large operational data store. Large has many meanings to many people. I've written about this before, but I'll reiterate the scope: more than 1TB of data, thousands of tables, and several tables with around one billion rows. So, for a variety of reasons, we chose PostgreSQL. I've written about that choice a few times, but didn't write about the choice to use Solaris.
A Bad Choice
So, I'll start by saying we chose Linux -- CentOS 4 (an RHEL4 clone). The box we chose was a 4 processor Opteron with 16GB of RAM and three internal drives; the rest of the storage is fiber-attached. I like XFS a lot as a file system, but I've seen some odd issues here and there (specifically on our 1TB mail server here and our 1TB subversion server). So, just to be safe, we used the tried and true ext3.
It started out well. One day /data/ (our large postgresql data mount point) suddenly went read-only. Postgres, of course, was quite displeased with this occurrence. So, naturally, I tried to fix the problem by umounting and mounting again, all to no avail. It turns out that a reboot was required to rectify the issue. While this was disturbing, we rebooted and continued on with life. The more annoying issue was the subsequent 18 times this occurred. The show stopper was the 20th time it failed; upon reboot I found catastrophic data loss.
To set the scene appropriately: we run Linux in a lot of places and run FreeBSD and OpenBSD and Solaris. Where we run these, they run well -- because when something doesn't, we replace it with something that does. I love FreeBSD, but I have some application code that will make it kernel panic inside 5 minutes. Long story short, fear PostgreSQL on 64-bit Linux. Once bitten, twice shy... 20 times you're an idiot.
A Better Choice
So, we chose Solaris 10. Why? We've run a lot on Solaris and been quite pleased with its stability. It has excellent support for enterprise storage hardware and multi-processor AMD Opteron systems. We've used VxFS (Veritas File System) and it has been a "good life." Solaris 10 sports a new file system called ZFS, which boasts a lot of the features of the VxFS file system (but not quite the performance). ZFS's volume management, built-in compression, snapshot capabilities, and simple management make it a hands-down winner over LVM (Linux Logical Volume Manager) and ext3. Now we have two 2.7TB ZFS pools, soon to add another 1.4TB pool, and management has been quite painless.
Another issue that we had with PostgreSQL (coming from Oracle) was severe inability to introspect the database. Why is a query slow? How many disk reads did it do? Which locks did it acquire, when and how long did it block waiting for the lock to be granted? If a query hits disk, which spindles were accessed? PostgreSQL simply does not provide an interface to this information. How, you ask, does Solaris help with this? Enter DTrace.
DTrace allows us to dynamically instrument the undercarriage of PostgreSQL (see what it is doing from the application level all the way down through the kernel). Because you can instrument application space right alongside kernel space, we can see the queries being performed, the SQL plan being executed, the memory allocations, lock acquisitions, disk reads and writes. Simply put: everything. Using DTrace we were able to develop a PostgreSQL administration toolkit that provides information such as the number of blocks read and written to each individual spindle over the course of a given query.
We also use SMF, but I honestly don't think that buys us anything really special. It's nice that it "just works," but quite frankly you can make my SysV scripts "just work" as well.
In the works is a set of PostgreSQL administrator tools that provide transparent database-level access to systemic information about PostgreSQL back end processes.
Not everything is peaches and cream.
One of the nice features of ZFS is the snapshot backups and the ability to "dump" the differences between one snapshot and the next. This, in essence, is a BLI: block-level incremental backup. ZFS manages this on a "zobject" level as I understand it, but it should accomplish the same thing. It means that if you have only 1GB of changes on a 5TB filesystem, it is feasible to restore a 5TB base image and then a 1GB changeset to get a block-level accurate restore. This in turn means that you can back up 5TB once per week and then each day back up only the changes between that "level 0" backup and the current snapshot (which should be small and manageable). I love this feature -- but hate that it doesn't work. On our system, at least, a "zfs send" of the differences between one snapshot and the next can take 40-50 hours. This appears to be a flaw in how ZFS send works, in that it doesn't prioritize its I/O requests over normal traffic. I hope the Sun guys get to work on this one soon!
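To make the weekly level 0 plus daily incremental scheme concrete, here is a minimal Python sketch. It only builds the "zfs snapshot" and "zfs send -i" command lines (it does not run them) and works out the weekly backup volume. The dataset and snapshot names are hypothetical, 5TB is rounded to 5000GB, and the size estimate assumes each day's 1GB of changes touches different blocks. This is an illustration under those assumptions, not the scripts we actually run.

```python
# Sketch only: builds (but does not execute) zfs commands for a weekly
# "level 0" backup plus daily incrementals. The dataset and snapshot
# names used with these helpers are hypothetical.

def snapshot_cmd(dataset, name):
    """Command to create a snapshot, e.g. zfs snapshot tank/data@sun."""
    return ["zfs", "snapshot", f"{dataset}@{name}"]

def full_send_cmd(dataset, name):
    """Command to stream a complete snapshot (the level 0 backup).
    zfs send writes to stdout, so in practice it is redirected or piped."""
    return ["zfs", "send", f"{dataset}@{name}"]

def incremental_send_cmd(dataset, base, current):
    """Command to stream only the changes between two snapshots."""
    return ["zfs", "send", "-i", f"{dataset}@{base}", f"{dataset}@{current}"]

def weekly_backup_gb(full_gb, daily_change_gb, days=7):
    """Total GB written per week: one full backup plus an incremental on
    each remaining day. Every incremental is taken against the level 0,
    so day n carries n days of change (assumed not to overlap)."""
    incrementals = sum(daily_change_gb * n for n in range(1, days))
    return full_gb + incrementals

# One week of backups for a 5000GB filesystem changing 1GB per day,
# compared with taking a full backup every day.
print(weekly_backup_gb(5000, 1), "GB versus", 7 * 5000, "GB")
```

For example, incremental_send_cmd("tank/data", "sun", "wed") yields the command for Wednesday's changeset relative to Sunday's level 0. The savings on paper are large, which is why the 40-50 hour send times hurt so much.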
Review
After running the system for some time now, I can say that Solaris 10 was an excellent decision. We've since launched four more respectably sized PostgreSQL instances on Solaris 10 with equal success.
Wednesday, November 29. 2006 at 07:47 (Reply)
When ext3 remounts something read-only, it is because it encountered errors, which are quite usually hardware problems (maybe temporary ones, either because of some random memory error or because the hd relocated some sectors).
It's really probable that it was not Linux' fault. Weird you didn't know that given the big data volumes you seem to manage. OTOH it's not something you get to see every day.
You can also change that behaviour with a kernel parameter at mount time (check mount's manpage for "errors=continue|remount-ro|panic" option).
Anyway, I'm glad Solaris is working for you, I just thought I should mention that since from the article you seem to blame this on a Linux bug, while it's probably not.
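For readers who want to check whether a filesystem has already been flipped to read-only, here is a minimal Python sketch. It assumes the mount table is in the /proc/mounts layout (device, mount point, filesystem type, comma-separated options); the sample device names and mount points in the usage note are hypothetical.

```python
def read_only_mounts(mounts_text, fstype="ext3"):
    """Return mount points of the given filesystem type whose option
    list contains the bare 'ro' flag."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs, options = fields[:4]
        # Split on commas so that 'errors=remount-ro' is not mistaken
        # for the read-only flag itself.
        if fs == fstype and "ro" in options.split(","):
            found.append(mount_point)
    return found
```

On Linux this could be fed the contents of /proc/mounts. Given the two lines "/dev/sdb1 /data ext3 ro,data=ordered 0 0" and "/dev/sda1 / ext3 rw,errors=remount-ro 0 0", it reports only /data.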
Monday, January 28. 2008 at 15:04 (Reply)
"It's really probable that it was not Linux' fault."
I like your wording, but it is highly improbable that it wasn't Linux's fault.
"I just thought I should mention that since from the article you seem to blame this on a Linux bug, while it's probably not."
Sure, when a filesystem specifically developed for Linux, and running on an enterprise-certified Linux distribution, fails in a disastrous manner, it has nothing to do with Linux being the operating system, right? I mean, ext3 runs in kernelspace and fails, and then your zealotry pops up and all of a sudden you have this need to lie in order to defend something you love. Very weird behaviour from someone who apparently likes being technical and accurate.
"When ext3 remounts something read-only, it is because it encountered errors, which are quite usually hardware problems (maybe temporary ones, either because of some random memory error or because the hd relocated some sectors)."
A quad Opteron box with 16 gigabytes of RAM has ECC RAM, so you can't blame faulty RAM. You can't blame disk errors either, since every SCSI/ATA/FC disk controller can be queried for such things, and a decent block device layer knows how to handle such stuff.
"You can also change that behaviour with a kernel parameter at mount time (check mount's manpage for "errors=continue|remount-ro|panic" option)."
This is handy for emergency access to the data, but does nothing to alleviate the problem.
"OTOH it's not something you get to see every day."
That is plain crap. Ext3 is generally known for being one of the most unreliable filesystems.
Why are the majority of Linux enthusiasts always this blindsided?
Thursday, September 4. 2008 at 09:21 (Reply)
"That is plain crap. Ext3 is generally known for being one of the most unreliable filesystems."
Sources please.
You don't really think that just because you say it I'll believe it right?
Friday, September 5. 2008 at 00:00 (Link) (Reply)
Actually, that's the entire point of blogging. It sucks 'cause I say so. The evidence is my documented experience.
Wednesday, November 29. 2006 at 08:30 (Reply)
We're considering running Oracle atop of ZFS file systems, so this is very interesting. You're likely well aware of this, but for those folks who may not be there's good advice on ZFS (in general, and when used with databases) hidden amongst the Sun blogs. This link's a good one to get started with, but there are more if you dig around.
http://blogs.sun.com/realneel/entry/zfs_and_databases
Wednesday, November 29. 2006 at 11:51 (Link) (Reply)
Just to be explicit: on the same hardware, solaris 10 fixed your corruption/read-only /data problem?
Friday, December 1. 2006 at 01:16 (Reply)
Yes. Same exact hardware. We reinstalled Linux twice even to make sure there wasn't something wrong with the install. I've had lots of other people chime in reporting very similar problems.
Wednesday, November 29. 2006 at 15:53 (Link) (Reply)
Enjoyed the article.
I came to this $DAYJOB, as you put it, from a shop that used Solaris 8 and 9 and Oracle for an enormous geospatial database.
I used Debian and Postgres setting us up, but I'm looking to migrate to Solaris 10 given the stability experienced previously with 8 and 9.
Have you used any of the zones features yet?
Friday, December 1. 2006 at 01:20 (Reply)
We use zones extensively in some of our deployments. However, we find that whenever we expect that a given app will be pushing hard on that hardware, there's no point in using zones. For most of our database servers, we just run the DB in the global zone.
We have a new deployment coming up soon that will likely have DBs in zones as the box is "large" and it will be running PostgreSQL and two different versions of MySQL. I think leveraging zones to keep those things sufficiently separate will be good.
We're re-rolling our hosting infrastructure now to leverage zones, though the lack of a ZFS root on Solaris 10 is a bummer.
Thursday, November 30. 2006 at 03:01 (Link) (Reply)
We had a similar issue with ext3 being remounted read-only (although there appear to be at least a couple of different causes) with RHEL 4.0.
Once we got the kernels on these machines to RHEL 4.0 Update 3, the read only re-mounts stopped.
Anyways, Death to linux!
Wednesday, December 20. 2006 at 15:34 (Link) (Reply)
I took Centos 4.4 for a spin on 64-bit hardware (HP DL145 G2) also, since I was interested in a 'stable' linux distribution. This was a much smaller scale deployment, but suffice to say that after some head banging I figured out that Centos 4.4 shipped with a SATA driver that effectively gave me about 5 Megs/sec throughput. I tried to upgrade the kernel and the machine was not happy (and I've done _lots_ of kernel upgrades). That sent me running back to Gentoo, no problems so far (and no, I'm not a sushi chef). I've been hearing your Solaris 10 raves though and I'm close to drinking the Kool-Aid.
Wednesday, December 20. 2006 at 20:27 (Reply)
To keep it fair, we run a lot of Linux, FreeBSD and OpenBSD. There certainly isn't a perfect OS out there. Solaris' strong points do seem to really mesh well with core database needs.
Sunday, May 6. 2007 at 21:13 (Link) (Reply)
More on this issue. The gentoo kernel I was using on my 64 bit opteron proved to be more stable than the centos kernel (and I thought the problem was kernel specific), but I saw this exact same issue in extended production use. It's an artifact of high (read > 1) load on 64 bit linux. I've tried 2.6.15, 2.6.20, and 2.6.21, but this issue won't go away. Grrr... It only shows up occasionally. This is only the second time it has shown up in production over a 3 month period, but enough is enough. Come to think of it, I saw this issue three years ago on early opterons under very high load, I had thought this issue would have been fixed by now.
Saturday, December 23. 2006 at 13:35 (Link) (Reply)
Solaris 10 is cool. Good choice!
osgeek
Monday, March 26. 2007 at 16:01 (Reply)
Hi -
I think you might have made some bad choices. I don't know about CentOS, but I switched to Debian almost 10 years ago. As for RedHat, I got tired of it; it's broken most of the time.
Go check out http://www.top500.org/: over 400 of the world's most powerful systems are Linux based, and Solaris is only used on 6 out of the 500.
As for filesystems, here is what the #6 site is running: This is 120TB of Storage, try that on zfs
Thunderbird Linux Cluster
Initial System: 4,096 dual CPU compute nodes, 8 user/login nodes, 2 IB subnet management nodes, 16 administration/service nodes. Subsequently added 384 compute nodes for 1500. There are also 128 node and 32 node development clusters.
Nodes:
Dell PowerEdge 1850 1U Servers
Dual Intel Xeon EM64T 3.6GHz processors
6 GB RAM
73 GB SCSI Hard drive
System Parameters:
* 14.4 GF/s dual socket 3.6 GHz single core Intel SMP nodes with 6.4 GB/s memory BW, DDR-2 400 SDRAM (memory B:F=0.42, BW B:F=0.44)
* < 5.0 micro-sec MPI latency and 1.8 GB/s Bandwidth over 4X InfiniBand
* PCI-Express, 4x Single Data Rate with a 2:1 oversubscription (50% blocking)
* Gb-Enet and IB links from each Login node.
* Local disk for swap and root partitions, fallback OS image
* Remote/Network boot
* Serial over LAN
* Lustre storage ~120 TB capacity and 6.0 GB/s BW
* Panasas storage ~50 TB capacity and 4.0 GB/s BW
For more complete information see: http://tbirdweb.sandia.gov
Wednesday, September 16. 2009 at 19:43 (Link) (Reply)
"Hi - I think you might have made some bad choices, I don't know about CentOS but I switched to Debian almost 10 years ago, RedHat, I don't know about that, I got tired of it, its broken most of the time. Go check out http://www.top500.org/, over 400 of the worlds most powerful systems are linux based, Solaris is only used on 6 out of the 500."
First, CentOS is based on Red Hat. It's mostly built from the SRPMS that Red Hat distributes.
Second, this article is not about raw performance so the data point about Linux and top powerful systems is just moot. The point of this article was stability, not raw performance. A really fast system that is not very stable is worthless for production workloads.
Tuesday, May 1. 2007 at 00:59 (Reply)
Glad Solaris has worked for you, but I'd like to comment: it's interesting that you chose the default filesystem for /data when ext3 gives probably the worst performance. Stability yes, but performance no.
You chose not to use reiser or xfs. Based on that, I'd like to ask what kind of mount opts you chose with xfs and how the partitions were created, i.e. flags.
As for trying a kernel update, how? Stock kernel.org kernel? Patched kernel? Bleeding edge patched kernel?
Also sysctl.conf? Any changes, what version of Postgres? How was it compiled?
I am not criticizing, just asking some questions, since I have used XFS on high-bandwidth traffic sites for quite some time with large files from 5meg to 2g and never had any issues; performance was always fantastic.
Saturday, May 5. 2007 at 20:16 (Link) (Reply)
Actually ext3 gives excellent database performance; it is just a wee bit off raw I/O on Linux. Stability is where it doesn't shine so much:
http://www.redhat.com/magazine/013nov05/features/oracle/
I've seen enough data loss with reiser to turn my stomach. We use XFS on several of our large mail stores and it works great, but for databases ext3 is faster and more common.
We tried several kernels. Updates from redhat, stock kernels. But as for bleeding edge, you're kidding right? This is production.
Typical tunings. I believe the version of postgres on that box was the latest stable at the time which was 8.0.8. It was migrated later to 8.0.8 on Solaris and then export/imported into 8.1.2 (PAINFUL) and then upgraded and is currently humming at 8.1.8. The "upgrade" process from one version to the next is an export/import and with more than a terabyte of data, that's just not something one can do easily.
We have some high throughput XFS systems as well. They are pretty solid. I've had a lot of kernel panics with XFS that threatened data corruption, but I'm very pleased to say that I never once lost a single byte on XFS (on my other machines).
On the subject of filesystems, only ZFS makes me show my "O face."
Wednesday, November 21. 2007 at 23:11 (Reply)
You typed, "Once bitten, twice shy... 20 times _your_ an idiot."
What you meant to type is, "Once bitten, twice shy... 20 times _you're_ an idiot."
I don't know why that error bugs me so much. Heh.
Maybe I'm an idiot or it's past my bedtime, but I could not figure out where OmniTI is located geographically.
Thursday, November 22. 2007 at 13:23 (Reply)
Right you are. Typo corrected. My pet peeve is mixing 'then' and 'than'.
OmniTI's Location:
http://maps.google.com/maps?q=omniti
Though we're in New York as well now.
Friday, October 3. 2008 at 19:19 (Reply)
I've encountered the remount readonly bug as well on debian under vmware. After what I've read, kernel 2.6.22 fixed this bug by means of a workaround, and this fix was included in patches for RHEL 4 and other distributions using older kernels.
Possibly another one of Solaris's strong points vs Linux is scalability on ~8 cores and up. But that's another topic.
Monday, December 15. 2008 at 06:56 (Reply)
I've had this problem on linux a few times. It has always come down to Linux MPIO (SAN multipathing) software not doing its job. Did you check this?
Thursday, January 15. 2009 at 12:12 (Reply)
No idea why one would even use Linux (sorry, don't want to start a flame war), but if you are trying to run a server:
1. Solaris is just as free as Linux
2. Solaris will, in general, be easier to administer.
3. Solaris will scale better on multi-cpu motherboards
4. Solaris users use it because it's better, not because of baseless fanaticism