Reconnoiter is coming along. Unlike most open source projects, I tend not to talk about mine until they are really useful to people. Over the last year, I’ve adopted the unhealthy attitude that useful means “shiny front-end.” So, I’m blogging to break that attitude and talk a bit about a project that doesn’t have a shiny front-end… yet.

Reconnoiter is built out of years of frustration using tools like RRDTOOL, Munin, Cacti, ZenOSS, Nagios, etc. etc. I have a lot of problems with these tools. First, they are not efficient: I need a powerful machine to monitor a mere 10k services, and it actually becomes an engineering challenge to monitor 100k services with these tools. Also, the graphs are about 10 years old with respect to design and usability. I want something new, something fresh, and something that doesn’t need a damn web UI to configure. Several people have asked: why are you reinventing the wheel? Why not just improve an existing product? My answer is that I want a well-thought-out product foundation so that I can trust all the bits. I want responsibilities decoupled at the right spots. I want data in a form that the world can query, to run reports the likes of which I have not yet conceived. I don’t want the load on my monitoring machines to be 8. I want my monitoring system to check services and metrics when it planned to, not several minutes (or even 2 seconds) after it told me it would. Simply put, I expect it to work well, all the time. And, of course, I want it to work how I would expect it to work.

Reconnoiter was born out of the need to monitor the internals of many disconnected data centers with between 10 and 1000 machines in each facility. Monitoring can mean a lot of things; here I consider it to be the collection of metrics and awareness of their availability. In and of itself, monitoring is pretty useless, but it is the foundation for two critical pursuits in Internet infrastructure and business management: fault detection and trending.

Fault detection is as simple as understanding when something has faulted. However, knowing something is broken is easier than knowing something is about to break. Is it better to know that your machine just crashed because the chip slagged to the motherboard, or that the temperatures in rack 043 are rising unexpectedly? Answer: both, but I hope I only learn the latter and not the former. Truly, there are too many things to monitor… hundreds or thousands of metrics on each piece of equipment. I can’t reasonably go in and configure good/bad thresholds on each one. I want anomaly detection. I want a system I can tell: “this looks right, tell me when it stops looking right.” That, to me, is a much-needed companion to traditional fault detection.
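To make that concrete, here is a minimal sketch of the kind of detector I mean. This is illustrative C, not Reconnoiter code; every name and constant in it is made up. It learns a baseline as an exponentially weighted moving average and variance, and flags samples that stray too far from what it has learned:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical per-metric model: learn a baseline as an exponentially
 * weighted moving average (EWMA) and variance of the samples. */
typedef struct {
  double mean;   /* EWMA of the samples */
  double var;    /* EWMA of the squared deviations */
  double alpha;  /* smoothing factor, 0 < alpha < 1 */
  int primed;    /* set once we have seen the first sample */
} anomaly_model_t;

/* Feed one sample; returns 1 if it is more than nsigma standard
 * deviations away from the learned baseline, 0 otherwise. */
static int anomaly_update(anomaly_model_t *m, double x, double nsigma) {
  if(!m->primed) { m->mean = x; m->primed = 1; return 0; }
  double dev = x - m->mean;
  int odd = (m->var > 0.0) && (fabs(dev) > nsigma * sqrt(m->var));
  m->mean += m->alpha * dev;                           /* update baseline */
  m->var = (1.0 - m->alpha) * (m->var + m->alpha * dev * dev);
  return odd;
}

int main(void) {
  anomaly_model_t m = { 0.0, 0.0, 0.1, 0 };
  double samples[] = { 10, 12, 11, 13, 12, 48 };       /* 48 is the oddball */
  for(int i = 0; i < 6; i++)
    if(anomaly_update(&m, samples[i], 5.0))
      printf("sample %d (%.0f) stopped looking right\n", i, samples[i]);
  return 0;
}
```

The particular model doesn’t matter (real baselines are noisier and often periodic); the point is that the operator describes “looks right” once and the system does the threshold bookkeeping for every metric.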

To me, trending is much more than drawing graphs… it is about intelligent data correlation, regression analysis/curve fitting and looking into the past to see how much you fucked up getting where you are now – in the vain hope that you learn from your mistakes and plan better next time.
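Here, too, a toy example of what I mean by regression analysis: hypothetical C with invented numbers, fitting a least-squares line to recent samples of a metric and projecting when it will cross a limit, the classic “when does this disk hit 100%?” question:

```c
#include <stdio.h>

/* Ordinary least-squares fit of y = a + b*t over n samples. */
static void fit_line(const double *t, const double *y, int n,
                     double *a, double *b) {
  double st = 0, sy = 0, stt = 0, sty = 0;
  for(int i = 0; i < n; i++) {
    st  += t[i];
    sy  += y[i];
    stt += t[i] * t[i];
    sty += t[i] * y[i];
  }
  *b = (n * sty - st * sy) / (n * stt - st * st);  /* slope */
  *a = (sy - *b * st) / n;                         /* intercept */
}

int main(void) {
  /* disk usage (%), sampled once a day -- invented data */
  double t[] = { 0, 1, 2, 3, 4, 5 };
  double y[] = { 61.0, 63.5, 65.0, 68.0, 70.5, 72.0 };
  double a, b;
  fit_line(t, y, 6, &a, &b);
  printf("growing %.2f%%/day; hits 100%% in about %.1f days\n",
         b, (100.0 - a) / b - t[5]);
  return 0;
}
```

A straight line is the crudest possible fit, but even that beats eyeballing a graph, and it only works if the raw data lives somewhere a real regression can reach it.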

Reconnoiter is an attempt to build these things. Building a system requires starting with pain (need) and following with solid structure and plumbing (good engineering). So, Reconnoiter is underway, and this post catches it mid-step:

It started on OpenBSD, and support was added for FreeBSD, Mac OS X, and Linux.

As of changeset [292], we have Solaris/OpenSolaris support.

We have a pretty nice front-end for trending under construction, but it isn’t there yet. We’ll have numeric data combined with textual “event” data on the same graphs. All that convenient stuff. Here’s the rather plain-Jane graph you get now (because some people won’t even read a post if it doesn’t have a pretty graph):

[Graph: the plain-Jane trending graph from the current Reconnoiter front-end]

Honestly, I don’t know what the value of this post is, but people around here keep telling me that folks should be aware of an open-source tool like this, even if it isn’t finished (read: usable) yet. I say it isn’t usable yet, but on our development instances here we monitor 2892 production metrics across two data centers and the load never peaks past 0.10. I’m pretty excited about where this is going. Honestly, my favorite part right now is that I can configure and control the noitd checking nodes via a telnet console, and it acts as if it were a piece of network equipment rather than an “application” – as it should be, IMHO.