I recently had the honor of talking about distributed tracing at CraftConf 2015. Wonderful conference, wonderful crowd, and the talk was well received. Best summary: “Worth watching, even if you are a vegan.”

Continue reading

mdb custom dmods

Picking up right where we left off in our previous exercises. We’ve got a core due to an error. We fix the error by removing line 31 from myprog.c and rebuilding. The program runs now… prints out some text and pauses… to simulate a long-running program that we need to debug without disrupting too much. Let’s get a core!

# UMEM_DEBUG=default ./myprog &
[1] 74502
read 25144 words.
# echo '::gcore' | mdb -p `pgrep myprog`
mdb: core.

Continue reading

So what’s this all about then? Debugging. I’ve written a lot of C, I still write a lot of C and I sure as hell end up debugging a lot of C. One thing that pisses me off is when I’ve got a core file, but I’ve no idea about the exact version or build of the ELF binary that produced it. The bottom line is that I still need to find the failure.

Continue reading

Last week Robert Treat told me it sure would be nice if we could reconstruct PostgreSQL logs from network captures (in that sort of antagonistic way: "MySQL can do it, why can't we?"). With pgsniff, we can. Well, it turns out that he was complaining for a reason: a client. Our friends over at Etsy have a server that is so blindingly busy selling handmade things that logging all queries on the box degrades performance unacceptably.

Continue reading

Hello from OSCON. I gave my full-stack introspection crash course talk today. It has been quite a while since I've presented anything in a 40-minute format, but I think the talk went quite well. I got a lot of positive feedback. I decided to take a risky approach inspired by dtrace.conf(08) by demonstrating dtrace on a live, mission-critical system we run at OmniTI. The risks of this are: network connections flake out, dtrace doesn't work correctly, or I do something stupid and cause some service unavailability.

Continue reading

We've been doing a lot of PostgreSQL work lately and we have one largish system (terabytes) that runs on top of Apple XServe RAIDs. While people argue that SATA is getting better, let it be understood that Fibre Channel SCSI drives rule. The difference between carrier-class storage and "enterprise" (a.k.a. commodity) is pretty tremendous. While this system will eventually make good use of the XServe RAIDs as long-term storage containers for write-once read-many data tables (archives), the "

Continue reading

Beware of strace

So I have this app… And it appears to be misbehaving. I can’t tell quite what it is blocking on (or momentarily pausing on, as the case may be) just by staring at top or its log files. It’s supposed to perform around 300 message submissions per second and appears to be doing like 30. So, where’s the problem? Or more importantly, how do we find the problem? DTrace is the right answer of course, but I’m on Linux and FreeBSD here.

Continue reading

Theo Schlossnagle

Distributed Systems, Scalability, and Operations.

CEO - Circonus

Maryland, USA