A little over a year ago, I started in on a project of significant scope. Not a few scripts hacked together, nor a conglomeration of pre-existing tools, but a carefully engineered product. What product is that? Reconnoiter.

About 10 years ago, we were neck-deep in large-scale e-mail architecture. We felt the pain; we were up at 3am every night attempting to make systems work. Finally, we decided enough was enough and started a skunkworks project to build a better e-mail server. Well, that turned out pretty well. It's got considerable momentum at this stage and is leading the industry as the most advanced digital messaging platform on the planet.

Over the past several (12) years, we have run operations for small and large sites alike. We're responsible for waking up and fixing things at 3am when they are not working. We're responsible not only for designing highly scalable architectures, but also for sticking around and seeing them through to the finish. Many people are writing tools for systems management; two in the recent spotlight are Puppet and Chef. We have had very little pain in the arena of provisioning and maintaining systems. I have a theory as to why that is, but that is a topic for another monologue. One of the distinct pains we have suffered since we began revolves around monitoring.

The first issue is that monitoring is two things:

  1. trend analysis for long-term capacity planning and postmortem analysis,
  2. fault detection.

There are many tools today that are hard to use and fail to address our needs for managing thousands of very different machines. Worse, the tools do only one or the other. This means we must configure a disk-space check in the fault detection tool to alert us when a volume is "too full" and then configure a similar check in a trending tool to show us the historical picture. Some patchwork has been bolted onto fault detection tools like Nagios to add trending features... and when I use them, it is clear trending was not central to the design.

I have a lot of gripes, but I won't go into all of them. Suffice it to say I have them, and I think they are the true fuel for developing a next-gen tool to make operations folk suffer less. Combine that with the combustible talent of the engineering group at OmniTI (and now a few outside it) and the oxygen that the open source community provides, and we'll be having a barbecue in no time.

So what's new?

A lot has happened in recent months on the Reconnoiter front. Here's a set of highlights:

  • OmniTI Labs got its IANA enterprise number. On top of that, we built SNMP trap support into noitd.
  • Reconnoiter got its IANA-assigned port number; no more picking random ports. 43919 or bust!
  • We support durable streaming for the collection and storage of metrics.
  • We support temporal streaming and check cloning for real-time, high-granularity metric collection. This means cross-check (even cross-datacenter) real-time graphs for pretty awesome event correlation.
  • We integrated Esper as a streaming database (complex event processor) as the foundation on which to build fault detection. There's a long way to go here, but the plumbing is in place (a rough sketch of what this looks like follows this list).
  • We built a custodial daemon capable of running Nagios-style checks to ease adoption -- even though it made me cry.
  • We started sitting in #noit on freenode, though it is still eerily quiet in there.
  • We got some new users with brilliant patience for alpha software, and they've given excellent feedback!
  • I'll be talking about Reconnoiter at the 2009 O'Reilly Open Source Convention.
  • It's starting to get some serious attention, so it's time we add some polish!
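To make the Esper item a little more concrete: the sketch below is not Reconnoiter code, just a minimal illustration of how a complex event processor like Esper can watch a metric stream and flag faults. The event type (MetricSample), its fields, the target names, and the 500 ms threshold are all invented for the example; only the Esper library calls themselves (Configuration, EPServiceProviderManager, createEPL, UpdateListener, sendEvent) are real.

    import com.espertech.esper.client.Configuration;
    import com.espertech.esper.client.EPServiceProvider;
    import com.espertech.esper.client.EPServiceProviderManager;
    import com.espertech.esper.client.EPStatement;
    import com.espertech.esper.client.EventBean;
    import com.espertech.esper.client.UpdateListener;

    // Hypothetical metric event; Reconnoiter's actual event model may differ.
    public class MetricSample {
        private final String target;   // host/check being measured
        private final double latency;  // response time in milliseconds

        public MetricSample(String target, double latency) {
            this.target = target;
            this.latency = latency;
        }

        public String getTarget()  { return target; }
        public double getLatency() { return latency; }

        public static void main(String[] args) {
            Configuration config = new Configuration();
            config.addEventType("MetricSample", MetricSample.class);
            EPServiceProvider epService =
                    EPServiceProviderManager.getDefaultProvider(config);

            // Flag any target whose average latency over a sliding 5-minute
            // window crosses an (arbitrary, illustrative) 500 ms threshold.
            EPStatement stmt = epService.getEPAdministrator().createEPL(
                    "select target, avg(latency) as avgLatency " +
                    "from MetricSample.win:time(5 min) " +
                    "group by target " +
                    "having avg(latency) > 500.0");

            stmt.addListener(new UpdateListener() {
                public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                    if (newEvents == null) return;
                    for (EventBean e : newEvents) {
                        System.out.println("FAULT: " + e.get("target")
                                + " avg latency " + e.get("avgLatency") + " ms");
                    }
                }
            });

            // In practice the stream would be fed by the collectors;
            // here we fake a couple of samples by hand.
            epService.getEPRuntime().sendEvent(new MetricSample("web01", 620.0));
            epService.getEPRuntime().sendEvent(new MetricSample("web02", 40.0));
        }
    }

The appeal of this approach is that the "rule" is just a declarative query over the live stream, so richer fault conditions (rates of change, cross-check correlation) can be expressed without rewriting the collection plumbing.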

We've been slowly introducing our managed clients to Reconnoiter and we have, at this point, about a terabyte of metric data. In Reconnoiter, there is no default action to discard data. Yes, that's right. Go buy more disk. It's cheap. You'll thank me the next time you see an anomaly that reminds you of one seven months ago... and when you go look at the graph, you actually find all the data at its original granularity.