PostgreSQL performance through the eyes of DTrace
We've been doing a lot of PostgreSQL work lately and we have one largish system (terabytes) that runs on top of Apple Xserve RAIDs. While people argue that SATA is getting better, let it be understood that Fibre Channel SCSI drives rule. The difference between carrier-class storage and "enterprise" (a.k.a. commodity) is pretty tremendous.
While this system will eventually make good use of the Xserve RAIDs as long-term storage containers for write-once, read-many data tables (archives), the "fast" storage is currently tied up and we've been "making do" with all of our heavy read/write activity on 14 SATA drives. It is always advisable to understand the access patterns of your database so that you can put files that are heavily utilized at the same time on different spindles. This allows one set of drive heads to be occupied with one style of workload without being distracted by another. In big databases, getting more spindles is a must and usually not a budgetary concern. BUT, how do you know which files?
Oracle provides enough information inside the database itself for a DBA to assess which datafiles could benefit from running on separate spindles. PostgreSQL's statistics exist, but are very lacking in comparison. The rule of thumb is that you put indexes and data on different spindles. In big systems you don't follow "rules of thumb"; you instead use them as the basis for experimentation and empirical analysis. Big systems often vary widely in their configuration and stress points, and the "rules of thumb" are for generic problems; while they may hold true on your system, "don't jump."
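For reference, the statistics PostgreSQL does give you live in views like pg_statio_user_tables — per-table block counts, but no timings, which is exactly what's lacking:

```sql
-- Per-table block I/O counters: heap_blks_read are reads that went to
-- the OS, heap_blks_hit were satisfied by PostgreSQL's own buffer cache.
-- Counts only -- no latency, so you can't tell which files actually hurt.
SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
  FROM pg_statio_user_tables
 ORDER BY heap_blks_read DESC
 LIMIT 20;
```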
So, what is a PostgreSQL user to do? First, run Solaris 10. Second, leverage DTrace to really answer the age-old question "where does it hurt?" I scratched together a little Perl/DTrace to measure the min/avg/max read() and write() times induced by PostgreSQL (yes, it uses read() and write() for all of its I/O ops on UNIX) and break it down by filename. I initially used the io::: provider to measure only the reads that went to physical media (didn't come out of the buffer cache). While it is really cool that you can do this, we would miss the reads that would happen once we moved things around and blew our buffer cache -- so I opted for all reads (less cool : more useful).
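The D logic at the heart of such a script looks roughly like this (a sketch, not the exact script we ran; it assumes the backends show up as execname "postgres" and leaves the filename-to-object join to the Perl side):

```d
#!/usr/sbin/dtrace -qs
/* Sketch: per-file read()/write() turn-around time for postgres
   processes, using the syscall provider (all reads, cached or not)
   rather than io::: (physical I/O only). */

syscall::read:entry,
syscall::write:entry
/execname == "postgres"/
{
        /* fds[] translates the file descriptor in arg0 to file info */
        self->path = fds[arg0].fi_pathname;
        self->ts = timestamp;
}

syscall::read:return
/self->ts/
{
        this->ms = (timestamp - self->ts) / 1000000;
        @rcnt[self->path] = count();
        @rmin[self->path] = min(this->ms);
        @ravg[self->path] = avg(this->ms);
        @rmax[self->path] = max(this->ms);
        self->ts = 0;
        self->path = 0;
}

syscall::write:return
/self->ts/
{
        this->ms = (timestamp - self->ts) / 1000000;
        @wcnt[self->path] = count();
        @wmin[self->path] = min(this->ms);
        @wavg[self->path] = avg(this->ms);
        @wmax[self->path] = max(this->ms);
        self->ts = 0;
        self->path = 0;
}
```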
After we collect metrics from all the file I/O, we connect to PostgreSQL and ask it which database objects correspond to the files we see being read and written. The result is a simple prstat-like output by database object of the number of reads/writes performed along with the minimum, average and maximum turn-around time for the operation in milliseconds. Quite useful for understanding "what hurts."
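The filename-to-object mapping works because PostgreSQL names each relation's on-disk file after its relfilenode, so the lookup is something along these lines (a sketch; files with no match get reported under a raw "pg:" label, like pg:3083076 in the output):

```sql
-- Map an on-disk data file (its numeric relfilenode, taken from the
-- file name DTrace reported) back to the database object it stores.
SELECT relname
  FROM pg_class
 WHERE relfilenode = 3083076;
```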
So the command 'pg_file_stress -d dbname -u postgres -i 30 -n 20' would yield useful output like:

                                                  READS                  WRITES
FILENAME/DBOBJECT                           #   min  avg   max     #  min  avg  max
ods_users                                9494    84  144  1039     6    0   34  207
ods_tblfl_unsub_users_p_2006_10             1   201  201   201   127    0    0    0
ods_tblusersanswers_p2006_12_ix_timestamp   1   179  179   179     0    0    0    0
tbluser_address                           307    91   91  1076     0    0    0    0
pg_rewrite                                  8     0   93   749    12    0    4   59
ods_tblusersanswers_p2006_11_ix_timestamp   1    80   80    80     0    0    0    0
ods_tblhits_sum01_p2005_11_partner         20     0   60   897     0    0    0    0
pg_depend_reference_index                  38     0   46   700    48    0    0    0
ods_tblfl_unsub_users_p_2006_09             1    28   28    28    94    0    0    0
pg_depend_depender_index                   29     0   22   407    11    0    1   14
ods_users_pk                              306     0   16   777     0    0    0    0
ods_tblhits_sum01_p2005_10_partner          4     0   12    48     0    0    0    0
pg_statistic                              158     0    9   819    88    0    0    0
ods_tblhits_sum01_p2005_11               5279     0    4  1102     0    0    0    0
ods_tblhits_sum01_p2005_10               1277     0    3   755     0    0    0    0
ods_tblusersanswers                      4295     0    1  1422     0    0    0    0
pg_statistic_relid_att_index               41     0    1    67    31    0    0    0
pg:3083076                               1702     0    1  1015   808    0    0    0
mv_users                                 2751     0    1  1036    26    0    0    0
users_tx_p20061022                       2264     0    0   667     0    0    0    0
9494 read operations with an average turn-around time of 144ms. Sweet Jesus! I suppose that could use its own spindles. 307 ops at 91ms ain't pretty either.
More than demonstrating how to solve our problem, this demonstrates the acute need to decommission the box using our fast storage and swing those units over to our database of pain.