We've been doing a lot of PostgreSQL work lately and we have one largish system (terabytes) that runs on top of Apple XServe RAIDs. While people argue that SATA is getting better, let it be understood that Fibre Channel SCSI drives rule. The difference between carrier-class storage and "enterprise" (a.k.a. commodity) storage is pretty tremendous.

While this system will eventually make good use of the XServe RAIDs as long-term storage containers for write-once read-many data tables (archives), the "fast" storage is currently tied up and we've been "making do" with all of our heavy read/write activity on 14 SATA drives. It is always advisable to understand the access patterns of your database so that you can put files that are heavily utilized at the same time on different spindles. This allows a set of drive heads to be occupied with one style of workload without being distracted by another. In big databases, getting more spindles is a must and usually not a budgetary concern. BUT, how do you know which files?

Oracle provides enough information inside the database itself for a DBA to assess which datafiles could benefit from running on separate spindles. PostgreSQL's statistics exist, but are very lacking in comparison. The rule of thumb is that you put indexes and data on different spindles. In big systems you don't follow "rules of thumb"; you use them as the basis for experimentation and empirical analysis. Big systems vary widely in their configuration and stress points, and rules of thumb are written for the generic problem; they may hold true on your system, but "don't jump."
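
To be fair, PostgreSQL will hand you per-relation block read/hit counts through the pg_statio_* views; they're just counts, with no timings, which is exactly the gap. A quick DBI sketch of what that looks like (the database name "mydb" is a placeholder, and it assumes DBD::Pg is installed):

#!/usr/bin/perl
# Sketch: what PostgreSQL's own statistics offer -- per-relation block
# reads/hits from pg_statio_user_tables.  Counts only, no latency.
use strict;
use warnings;
use DBI;

my $dbh  = DBI->connect('dbi:Pg:dbname=mydb', 'postgres', '',
                        { RaiseError => 1 });
my $rows = $dbh->selectall_arrayref(q{
    SELECT relname,
           heap_blks_read, heap_blks_hit,
           COALESCE(idx_blks_read, 0), COALESCE(idx_blks_hit, 0)
      FROM pg_statio_user_tables
     ORDER BY heap_blks_read DESC
     LIMIT 10
});
printf "%-42s %12s %12s %12s %12s\n",
       qw(RELNAME HEAP_READ HEAP_HIT IDX_READ IDX_HIT);
printf "%-42s %12d %12d %12d %12d\n", @$_ for @$rows;
$dbh->disconnect;

Counts tell you who is busy; they don't tell you who is slow.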

So, what is a PostgreSQL user to do? First, run Solaris 10. Second, leverage DTrace to really answer the age-old question "where does it hurt?" I scratched together a little perl/dtrace to measure the min/avg/max read() and write() times induced by PostgreSQL (yes, it uses read() and write() for all of its I/O ops on UNIX) and break it down by filename. I initially used the io::: provider to measure only the reads that hit physical media (i.e., didn't come out of the buffer cache). While it is really cool that you can do this, we would miss the reads that would happen once we moved things around and blew our buffer cache, so I opted for all reads (less cool, more useful).
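
The dtrace half of that hack looks roughly like the sketch below. This isn't the actual pg_file_stress script; it assumes Solaris 10's fds[] array is available, that your backends show up with an execname of "postgres", and that you run it as root:

#!/usr/bin/perl
# Rough sketch of the measurement half: aggregate read()/write() latency
# per file touched by postgres backends.  Times are in milliseconds.
use strict;
use warnings;
use File::Temp qw(tempfile);

my $interval = shift || 30;   # seconds to sample

my $d = <<'EOD';
syscall::read:entry, syscall::write:entry
/execname == "postgres"/
{
    self->ts   = timestamp;
    self->path = fds[arg0].fi_pathname;
}

syscall::read:return, syscall::write:return
/self->ts/
{
    @ops[probefunc, self->path]   = count();
    @minms[probefunc, self->path] = min((timestamp - self->ts) / 1000000);
    @avgms[probefunc, self->path] = avg((timestamp - self->ts) / 1000000);
    @maxms[probefunc, self->path] = max((timestamp - self->ts) / 1000000);
    self->ts   = 0;
    self->path = 0;
}
EOD

# End the sample after $interval seconds; dtrace prints the aggregations
# on exit, and the real tool parses that output and joins it to pg_class.
$d .= "tick-${interval}sec { exit(0); }\n";

my ($fh, $script) = tempfile(SUFFIX => '.d', UNLINK => 1);
print $fh $d;
close $fh;
system('dtrace', '-s', $script) == 0 or die "dtrace failed: $?";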

After we collect metrics from all the file I/O, we connect to PostgreSQL and ask it which database objects correspond to the files we see being read and written. The result is a simple prstat-like output, by database object, of the number of reads/writes performed along with the minimum, average and maximum turn-around time for the operation in milliseconds. Quite useful for understanding "what hurts."
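
The lookup itself is nothing fancy: PostgreSQL names each relation's data file after its relfilenode, so you strip the path down to that number and ask pg_class for the matching relname. A sketch of that half (the database name is a placeholder, and anything that doesn't match falls back to a pg:<filenode> label, like the pg:3083076 line in the output below):

#!/usr/bin/perl
# Sketch of the mapping step: translate the numeric data file names DTrace
# reports (e.g. .../base/16384/3083076) back into table/index names via
# pg_class.relfilenode.  Assumes DBD::Pg; "mydb" is a placeholder.
use strict;
use warnings;
use DBI;
use File::Basename qw(basename);

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'postgres', '',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT relname FROM pg_class WHERE relfilenode = ?');

sub dbobject_for {
    my ($path) = @_;
    # Data files are named for their relfilenode, possibly with a ".1", ".2"
    # segment suffix; anything non-numeric isn't a relation file.
    my ($filenode) = basename($path) =~ /^(\d+)(?:\.\d+)?$/;
    return $path unless defined $filenode;
    $sth->execute($filenode);
    my ($relname) = $sth->fetchrow_array;
    $sth->finish;
    # Unmatched filenodes (objects in other databases, dropped relations, ...)
    # get a generic "pg:<filenode>" label.
    return defined $relname ? $relname : "pg:$filenode";
}

print dbobject_for('/data/base/16384/3083076'), "\n";
$dbh->disconnect;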

So the command 'pg_file_stress -d dbname -u postgres -i 30 -n 20' would yield useful output like:
FILENAME/DBOBJECT                                              READS                       WRITES
                                               #   min   avg   max        #   min   avg   max
ods_users                                   9494    84   144  1039        6     0    34   207
ods_tblfl_unsub_users_p_2006_10                1   201   201   201      127     0     0     0
ods_tblusersanswers_p2006_12_ix_timestamp      1   179   179   179        0     0     0     0
tbluser_address                              307    91    91  1076        0     0     0     0
pg_rewrite                                     8     0    93   749       12     0     4    59
ods_tblusersanswers_p2006_11_ix_timestamp      1    80    80    80        0     0     0     0
ods_tblhits_sum01_p2005_11_partner            20     0    60   897        0     0     0     0
pg_depend_reference_index                     38     0    46   700       48     0     0     0
ods_tblfl_unsub_users_p_2006_09                1    28    28    28       94     0     0     0
pg_depend_depender_index                      29     0    22   407       11     0     1    14
ods_users_pk                                 306     0    16   777        0     0     0     0
ods_tblhits_sum01_p2005_10_partner             4     0    12    48        0     0     0     0
pg_statistic                                 158     0     9   819       88     0     0     0
ods_tblhits_sum01_p2005_11                  5279     0     4  1102        0     0     0     0
ods_tblhits_sum01_p2005_10                  1277     0     3   755        0     0     0     0
ods_tblusersanswers                         4295     0     1  1422        0     0     0     0
pg_statistic_relid_att_index                  41     0     1    67       31     0     0     0
pg:3083076                                  1702     0     1  1015      808     0     0     0
mv_users                                    2751     0     1  1036       26     0     0     0
users_tx_p20061022                          2264     0     0   667        0     0     0     0

That's ods_users: 9494 read operations with an average turn-around time of 144ms. Sweet Jesus! I suppose that could use its own spindles. The 307 ops at an average of 91ms against tbluser_address ain't pretty either.
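
For what it's worth, the mechanical part of giving an object its own spindles is easy on PostgreSQL 8.0 and later: create a tablespace on the new drives and move the relation into it. A sketch only; the path and tablespace name are made up, and ods_users just stands in as the obvious candidate from the output above:

#!/usr/bin/perl
# Sketch: relocate a hot table onto its own spindles via a tablespace.
# The directory must already exist and be owned by the postgres user,
# and the move takes an exclusive lock while the table is rewritten.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'postgres', '',
                       { RaiseError => 1 });
$dbh->do(q{CREATE TABLESPACE fast_fc LOCATION '/fcraid/pgdata'});
$dbh->do(q{ALTER TABLE ods_users SET TABLESPACE fast_fc});
$dbh->disconnect;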

More than showing how to solve our problem, this demonstrates the acute need to decommission the box currently occupying our fast storage and swing those units over to our database of pain.