from pcap to postgres: forensic performance analysis
Last week Robert Treat told me it sure would be nice if we could reconstruct PostgreSQL logs from network captures (in the sort of antagonist way that is: "MySQL can do it, why can't we?"). With pgsniff, we can.
Well, it turns out that he was complaining for a reason: a client. Our friends over at Etsy have a server that is so blindingly busy selling handmade things that logging all queries on the box degrades performance unacceptably. This is actually a common problem for PostgreSQL and other databases alike. Most databases offer a feature to log all queries where a single execution exceeds some arbitrarily chosen wall clock time. It turns out this "solution" is acutely painful because it often causes engineers to look at incomplete data and draw (you guessed it) incorrect conclusions.
The concept of a "slow query log" is a woefully incomplete concept. Typically, such a log is configured to track queries that spend more than X amount of time executing, where X is some "small enough number." Some sites put this at 1 second. If I execute a query (with any sort of high frequency) that takes one second to complete, I'm flat-out screwed. So, by the laws of physics, the queries that get logged under such circumstances are outliers (in high transaction, OLTP systems). The common approach to address this is to set X down to 100ms or 10ms. Any good engineer should realize that by doing so you simply change the resolution of the problem and have fixed nothing. The database behaves how it does because of all that it does. The queries you really care about are those that consume the most cumulative execution time, but to determine this you must account for everything. This is why turning on full query logging is essential to thorough performance analysis.
Now, most of my readers know I'm a DTrace fan and, yes, I've been able to do this in real-time, in production on all our Solaris-based PostgreSQL servers for years. However, pgsniff is a callout to all you guys without DTrace. This tool takes pcap information (from a live interface or post-mortem tcpdump).
The tool is a bit rough around the edges and some of the tuple tracking is buggy, but it is already in a form that is spectacularly useful. You can now tcpdump on a machine that can see the traffic (db, web server, monitor port) and reconstruct every query (even in more detail that PostgreSQL's logs provide). On the TODO list: fix bugs, production trials, and bind variable reconstruction so that one can dive into a slow outlier and see the bind variables used during execution of prepared statements.
For now, you can get the code via Subversion here: https://labs.omniti.com/pgtreats/trunk/Sniffer-Postgres/. Once some of the features are fleshed out and it has more production run-time I will put it up on CPAN.