I recently had the honor of talking about distributed tracing at CraftConf 2015. Wonderful conference, wonderful crowd and the talk was well received. Best summary: “Worth watching, even if you are a vegan.”
Picking up right where we left off in our previous exercises. We’ve got a core due to an error. We fix the error by removing line 31 from myprog.c and rebuilding. The program runs now… prints out some text and pauses… to simulate a long-running program that we need to debug without disrupting too much.
Let’s get a core!
# UMEM_DEBUG=default ./myprog &
[1] 74502
read 25144 words.
# echo '::gcore' | mdb -p `pgrep myprog`
mdb: core.
So what’s this all about then? Debugging. I’ve written a lot of C, I still write a lot of C and I sure as hell end up debugging a lot of C. One thing that pisses me off is when I’ve got a core file, but I’ve no idea about the exact version or build of the ELF binary that produced it. The bottom line is that I still need to find the failure.
So I have this app… And it appears to be misbehaving. I can’t quite tell what it is blocking on (or momentarily pausing on, as the case may be) just by staring at top or its log files. It’s supposed to perform around 300 message submissions per second and appears to be doing like 30. So, where’s the problem? Or more importantly, how do we find the problem?
DTrace is the right answer of course, but I’m on Linux and FreeBSD here.