Last week Robert Treat told me it sure would be nice if we could reconstruct PostgreSQL logs from network captures (in the sort of antagonist way that is: "MySQL can do it, why can't we?"). With pgsniff, we can.
Well, it turns out that he was complaining for a reason: a client. Our friends over at Etsy have a server that is so blindingly busy selling handmade things that logging all queries on the box degrades performance unacceptably. This is actually a common problem for PostgreSQL and other databases alike. Most databases offer a feature to log all queries where a single execution exceeds some arbitrarily chosen wall clock time. It turns out this "solution" is acutely painful because it often causes engineers to look at incomplete data and draw (you guessed it) incorrect conclusions.
The concept of a "slow query log" is a woefully incomplete concept. Typically, such a log is configured to track queries that spend more than X amount of time executing, where X is some "small enough number." Some sites put this at 1 second. If I execute a query (with any sort of high frequency) that takes one second to complete, I'm flat-out screwed. So, by the laws of physics, the queries that get logged under such circumstances are outliers (in high transaction, OLTP systems). The common approach to address this is to set X down to 100ms or 10ms. Any good engineer should realize that by doing so you simply change the resolution of the problem and have fixed nothing. The database behaves how it does because of all that it does. The queries you really care about are those that consume the most cumulative execution time, but to determine this you must account for everything. This is why turning on full query logging is essential to thorough performance analysis.
Now, most of my readers know I'm a DTrace fan and, yes, I've been able to do this in real-time, in production on all our Solaris-based PostgreSQL servers for years. However, pgsniff is a callout to all you guys without DTrace. This tool takes pcap information (from a live interface or post-mortem tcpdump).
The tool is a bit rough around the edges and some of the tuple tracking is buggy, but it is already in a form that is spectacularly useful. You can now tcpdump on a machine that can see the traffic (db, web server, monitor port) and reconstruct every query (even in more detail that PostgreSQL's logs provide). On the TODO list: fix bugs, production trials, and bind variable reconstruction so that one can dive into a slow outlier and see the bind variables used during execution of prepared statements.
For now, you can get the code via Subversion here: https://labs.omniti.com/pgtreats/trunk/Sniffer-Postgres/. Once some of the features are fleshed out and it has more production run-time I will put it up on CPAN.
Friday, April 2. 2010 at 11:43 (Link) (Reply)
It's not quite right to say MySQL can do it. MySQL can't, but Maatkit can. And if your tool can output a faithful Postgres log format, then mk-query-digest can do about the best query analysis on the result that you can get anywhere.
Friday, April 2. 2010 at 11:46 (Reply)
Indeed. I added the -pg flag that should produce "faithful" postgres log output.
Saturday, April 3. 2010 at 03:40 (Link) (Reply)
Hmm, that's a shame. There was already a tool to do this on pgfoundry called pgspy using libpcap. That could probably use some love.
Saturday, April 3. 2010 at 11:21 (Reply)
I took a look at pgspy. Dissecting the complicated sequences of packets (with pgspy doesn't really do) was much easier for me to do quickly in perl. I think once the tool meets more people's needs, I'll likely reimplement it in C as it is *way* too slow in perl as it stands.
Unfortunately, taking pgspy from A to B was a bit too much work for me to tackle in C in the time I had.
Sunday, April 4. 2010 at 03:42 (Reply)
Another approach is possible in 8.4 thanks to Martin Pihlak's cool feature, tracking user-defined functions' call counts and runtimes.
All the statistics you need are available in the system view pg_stat_user_functions.
This is of course only possible if you only access the database solely through Stored Procedures, even for SELECT queries.
This is the case in the system I'm working with, which is extremely database centric.