Esoteric Curio
Want to work with me at the $DAYJOB?
I am the CEO of OmniTI where I do all sorts of stuff I find absolutely fascinating.
What's a DBA? Database administrator? Database architect? Whatever it is, we need one (or two or three) at OmniTI. The problem is that the Jurassic variant of the DBA simply can't cut it in today's world of the web giants (OmniTI services the web giants). What is it about the modern use of databases in the web world that makes traditional DBAs so inadequate? It's actually a variety of things. If you think tackling them is a challenge you're cut out for, send me your resume!
One might say "Scale? Scale? Are you kidding me? Traditional database problems come in various flavors of gigantic." This is true, but the unique dynamics of the web mean that not only is the data flowing in at uncontrollable rates, but tomorrow there is always a new data source with a new representation. The scale of external systems and your own makes for an interesting sandbox in which we play. A multi-terabyte database is quite common just about anywhere these days; however, ten of them running four different data management platforms is rather commonplace within large web infrastructures.
This is the real game changer. This is the mental paradigm shift that classical DBAs struggle with most. Try this on for size: "no maintenance windows." Okay, admittedly that is a stretch, but only a small stretch. How about 5 minutes of database downtime allowed every 4 weeks or, worse, every 52 weeks?
Coupled with this are daily (or more frequent) code rollouts, oftentimes requiring database schema changes, new triggers, and different replication dynamics.
The amount of seemingly ass-backwards engineering required to accomplish this feat is usually described as either: "excruciating pit of hell" or "fascinating enigma." (I think of it as both) What I do know is that I have seen some of the most profound data engineering creativity emerge from this insipid constraint.
While noSQL is a fanatical movement that needs to be kept in check, the problems it addresses are real and [some of] the solutions promoted are good. Understanding when data better fits in a key-value store is something that the classically trained DBA struggles with. It's all about data management: Dynomite, CouchDB, Cassandra, Oracle, PostgreSQL, MySQL… they all have a sweet spot; know it.
The other interesting dynamic has less to do with data management and more to do with the pace and culture of web companies. In traditional companies, roles are more disjointed; DBAs have a set of responsibilities that software engineers and system administrators do not have and those two groups have a set of responsibilities with which DBAs need not concern themselves. This "silo effect" spells absolute disaster for web companies.
The pace on the web is simply too fast (I often think it is a bit reckless, but that's another rant) to keep all the parts separate. An insufficient amount of time exists to define how one layer of an architecture will communicate with another layer. And while the layers meet at reasonable places most of the time, there are other places where the lines are blurred. These are the points in the architecture that stand to have the highest efficiency, but also are the hardest to troubleshoot; they require an organization of engineers that have no boundaries. In the case of a DBA… one that wants to look at the ORM code that builds the insanely awful SQL queries that beat the stuffing out of their database every second; one that is actually aware of the storage configuration and layer-2 fabric that sits behind the magical "remembering device" that the database commits things to all day long.
Around here, Robert Treat runs the show. The stick wielding and beatings are left to him (he is a father of three, so I'm sure he's more than capable).
I guess the question is… are you up for the challenge? Are you a DBA who thinks their job is about more than just ACID? Are you a developer or an SA that has decided to make a shift and tackle the data management layer with a unique perspective? Are you interested in mind bending problems that arrive on your plate a bit too quick for anyone's comfort level? I hope so, because we need people like that!
So... lock a dude on a flight with Internet access, free booze and free movies... yet, he codes. Nothing useful, of course.
One of my favorite programs from past days was the jive.l lexer (it's an Internet scavenger hunt to find it). Well, here's a jive bookmarklet. Try it out. Boredom on flights can be dangerous.
If you’re like me, you’ve read a handful of really useful articles about extending and embedding things like perl, Python, Java, lisp, scheme or lua. The slant there is a technical one: “So you need Java, here’s how to embed it.” I’ve embedded Python once, perl and Java countless times and, most recently, lua. Throughout this I realized that I don’t need the languages as much as I need a convenient extension out of C. In other words, if your foundation app is feature-rich, embedding isn't as much about adding the features of a specific language as it is about exposing the features of your system.
Given this context, there is a lack of articles about how to choose the appropriate embedding technology for a given project. Each language has its own advantages and people religiously argue about them seemingly to no end. Things get interesting when you look to embedding languages, as their internal implementation decisions start to influence what is and isn't possible. I’ll take you on a brief whirlwind tour of my personal experience embedding.
Many of you know I program in Perl quite often. Many people hate Perl and believe it to be an abomination. I love Perl for its flexibility, expressiveness and its terseness, aligning it with myself. However, the perl (5.10 and earlier) interpreter is a horrid, catastrophic example of how not to write a virtual machine for an interpreted language; spaghetti does not begin to describe the mess that is perl internals. The API is insufficiently documented, the internal data representations are indecently exposed and the way thread support was backed into the code (like a garbage truck through a toy shop) makes it more than a challenge to embed. Now, I’ve been writing XS code for years (the C-like language for extending perl itself) and I know and love the language itself, but I’d argue that it is rarely a good idea to embed perl in any application.
You may have noted that I claimed to have embedded perl several times in applications. Some of those turned out to be a good idea. Both (yes, only two) of those times, the reason it ended up being a good idea was that to make my application more useful I needed the wealth of CPAN (the Comprehensive Perl Archive Network, which contains a wealth of easy-to-use Perl wrappers around common needs). While in the end I won, embedding perl was an awful experience and I wouldn't recommend it to anyone. My specific problems came down to debugging memory usage as it crosses from C into the Perl heap and the painful problem that is threading within perl. Each time, integrating into perl’s threading system proved a waste of time and I ended up starting an interpreter for each thread (or had a resource pool of interpreters from which each thread would reserve one). So many useful CPAN modules aren’t thread-safe and were never developed with a mindset that they would be loaded into a perl that might be embedded in another system.
While I love the language, I hate the implementation.
While I like Python as a language, I personally find that if my job doesn’t fit in perl and that Python would be a better choice, I usually think a bit harder and realize that C would be even better. I realize this isn’t the same mental leap made by most people, but I love C. C is my favorite language. In fact, I wouldn’t be writing this article if I didn’t write so many applications in C that can benefit from an embedded interpreter to ease configuration and processing tasks.
In fact, I would argue that Python is a great language to embed because of the interpreter. Its implementation is cleaner than perl’s, though as I’ve mentioned the bar could not be set much lower. The fact is that embedding Python in a complicated multi-threaded and/or event-driven application really shows the inadequacies of its embedding design. That combined with the fact that I’d rather be coding directly in C means: "I wouldn’t embed that again."
Now, all the applications in which I typically embed interpreters or virtual machines are complicated. If your application is single-threaded and doesn’t do extensive co-routines (e.g. closure-oriented programming or event driven systems) then Python could really be an ideal embedding language.
Java is a complicated discussion. I have a lot of mixed feelings about Java. No language should have functions in its core distribution deprecated at the rate that Java boasts — it’s dysfunctional. Java as a language itself is quite nice. It is powerful, fast, and expressive. It has some annoyances like its verboseness and sprawling code bases. It has one true, deep negative: Java programmers. Now, not all Java programmers are bad, but I've seen far too much Java code produced by programmers far and wide that is of a quality that is flat-out unacceptably low.
When you’re embedding a language, some of those negatives don’t matter so much. We can instead concentrate on ease of embedding. Java has to be, hands-down, the easiest system to embed. The API is crystal clear and absolutely complete. It is all held in a single, small header file called jni.h. Every time I’ve embedded Java it has been simple and painless. JNI team, pat yourself on the back; I’ll happily buy a round of beers if we ever run into each other.
On top of the ease of embedding, the native support for mapping application (C) threads into JVM threads is concise and demonstrates pre-meditated good engineering.
One reason I don’t like Java for this is that it isn’t really “interpreted.” I have to compile (or auto-compile) my code to get it to run in my app. Now, perhaps I could use Java to pull in something like Jython or JRuby to make a rather obtuse, but convenient, system. Success via indirection has never been high on my list — if I ever try it, I’ll be sure to let you know how it went.
Too old, too arcane, AYFKM, leave it on the web guys, respectively.
Okay, in all fairness, I didn’t give Tcl its due. It’s likely the most widely embedded language. I just don’t like Tcl and it shows.
Javascript deserves its own section for arguments against embedding a language. If you have a single-threaded program that is not event driven, Javascript is a brilliant choice. I don’t remember the last time I wrote an application that was both that simple and was in need of an embedded language. So, the fact that Javascript has no support for threading makes it utterly painful and obtuse to embed in most systems. This saddens me. Of all the languages I program in, Javascript is likely my favorite. I know this must be on the merits of the syntax and expressiveness of the language because I loathe DOM and anything that touches it. Oh! how I wish Javascript embraced threading. Oh! how I wish Javascript had the embeddability of Lua.
Lua is my new love. It is the reason I decided to write this article. I actually don’t like the lua language. I don’t particularly like the internal implementation (garbage collected instead of ref counted). It doesn’t have particularly useful extensions to add feature value to your system. So, what is it about lua?
I’ve been working a lot on a system called Reconnoiter. noitd, the agent responsible for performing active checks against other systems, has a high-performance hybrid thread/event core. Code for a hybrid thread/event system can be quite mind bending to write and challenging to debug. Reconnoiter allows writing checks as modules that are dynamically loaded into the system at run-time. The hybrid thread/event core means that the core itself is capable of handling hundreds of thousands of open sockets (think network connections). Writing complex event-driven C code severely limits the audience that can contribute check code to this open source project. This frames the need.
Managing hundreds of thousands of concurrent threads in the system is a recipe for suboptimal performance. In order to remove the complexity of event-driven programming, I need to extend Reconnoiter’s core to a language that could provide the feel of a procedural (or OO) programming atmosphere while maintaining a non-blocking, continuation based implementation for high single-system concurrency.
Now, this is possible (sort of) with many of the embedding choices. When I get to a point in the interpreted code where I need to wait, I suspend what I’m doing and complete it later when actionable data is available (say, data from a network read). The issue is that in order to resume all these languages where they left off, I need to maintain their C stacks. This can be accomplished with a user-space threading implementation strategy using setjmp, longjmp and a healthy sprinkling of black magic. In a lot of ways, this approach doesn’t solve the problem of requiring an interpreter or VM thread for each concurrent operation, which would voraciously consume resources.
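To make the cost of "maintaining their C stacks" concrete, here is a minimal sketch of user-space threading. It is an illustration under assumptions, not code from Reconnoiter: it swaps in POSIX ucontext (getcontext, makecontext, swapcontext, as shipped with glibc on Linux) in place of the setjmp/longjmp trickery, because that is the shorter way to show the same thing. The names task_t, task_start, task_resume, task_yield, run_two_tasks and the 64 KB stack size are all made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_BYTES (64 * 1024)  /* assumed per-task stack size */

/* Hypothetical suspended task. Note the private C stack: this is the
   per-operation cost that grows with every concurrent task. */
typedef struct task {
    ucontext_t ctx;      /* saved registers and stack pointer of the task */
    ucontext_t *caller;  /* context to switch back to on yield or exit */
    char *stack;         /* the task's own C stack */
    int steps_done;      /* progress marker for the demo */
    int finished;
} task_t;

static task_t *current_task;  /* task currently running, set on resume */

/* Suspend the running task and return control to whoever resumed it. */
static void task_yield(void) {
    task_t *t = current_task;
    swapcontext(&t->ctx, t->caller);
}

/* Stand-in for interpreted code that blocks twice waiting for data. */
static void task_body(void) {
    task_t *t = current_task;  /* lives on the task's private stack */
    t->steps_done = 1;         /* e.g. request sent */
    task_yield();              /* wait for the network */
    t->steps_done = 2;         /* e.g. headers read */
    task_yield();
    t->steps_done = 3;         /* e.g. body read */
    t->finished = 1;
}

/* Allocate a stack and prepare the task; nothing runs until task_resume. */
static int task_start(task_t *t, ucontext_t *caller) {
    t->stack = malloc(STACK_BYTES);
    if (t->stack == NULL) return -1;
    t->caller = caller;
    t->steps_done = 0;
    t->finished = 0;
    if (getcontext(&t->ctx) != 0) { free(t->stack); return -1; }
    t->ctx.uc_stack.ss_sp = t->stack;
    t->ctx.uc_stack.ss_size = STACK_BYTES;
    t->ctx.uc_link = caller;   /* where to go when task_body returns */
    makecontext(&t->ctx, task_body, 0);
    return 0;
}

/* Run the task until its next yield or until it finishes. */
static void task_resume(task_t *t) {
    assert(!t->finished);
    current_task = t;
    swapcontext(t->caller, &t->ctx);
}

/* Interleave two tasks on one thread; returns the total steps completed. */
static int run_two_tasks(void) {
    ucontext_t main_ctx;
    task_t a, b;
    if (task_start(&a, &main_ctx) != 0) return -1;
    if (task_start(&b, &main_ctx) != 0) { free(a.stack); return -1; }
    while (!a.finished || !b.finished) {
        if (!a.finished) task_resume(&a);
        if (!b.finished) task_resume(&b);
    }
    int total = a.steps_done + b.steps_done;
    free(a.stack);
    free(b.stack);
    return total;
}
```

Calling run_two_tasks() from main interleaves the two tasks on a single thread. The thing to notice is the malloc in task_start: two tasks need two stacks, and 100k suspended operations would need 100k of them, on top of whatever state the interpreter itself keeps.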
Enter lua, champion of concurrency. lua is a stack based virtual machine specifically designed for integration with larger C systems. Within lua, assuming you don’t use any third-party extensions (of which there are few), I can leverage the lua_yield and lua_resume API calls to suspend the execution of a lua co-routine (lua-space thread) without the need of capturing a C stack. The state of lua’s execution is represented in the lua stack, so I can simply pick up later right where I left off. This is a technical nuance, but one that provides a considerable amount of freedom. In a single thread, I can open 100k network sockets and have them all independently driven by lua programs without maintaining 100k C stacks. Furthermore, the lua programs give the feel of blocking reads and writes making network programming once again bearable.
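The nuance is easier to see with a toy. The sketch below is plain C and does not use the Lua API at all (no lua_yield or lua_resume here); it is only an analogy for the idea that a suspended operation can live in a small heap record instead of a C stack. The conn_state_t record, conn_step, drive_connections, and the 40-byte and 100-byte figures are hypothetical, chosen just to make the loop terminate.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-connection record. Everything needed to resume lives
   here on the heap, loosely the way a Lua coroutine's state lives in its
   Lua stack rather than in a C stack. */
typedef struct conn_state {
    int pc;          /* which step to run next (a tiny "program counter") */
    int bytes_seen;  /* a "local variable" that must survive suspension */
} conn_state_t;

enum { STEP_YIELDED = 0, STEP_DONE = 1 };

/* Run one connection until it would block, then return to the caller.
   No C stack is kept between calls; only the record is. */
static int conn_step(conn_state_t *c, int bytes_available) {
    switch (c->pc) {
    case 0:                          /* "send request", then wait for data */
        c->pc = 1;
        return STEP_YIELDED;
    case 1:                          /* "read response" a piece at a time */
        c->bytes_seen += bytes_available;
        if (c->bytes_seen < 100) return STEP_YIELDED;
        c->pc = 2;
        return STEP_DONE;
    default:                         /* already finished */
        return STEP_DONE;
    }
}

/* Drive n connections round-robin from one thread and one C stack.
   Returns the number completed, or -1 if allocation fails. */
static long drive_connections(size_t n) {
    assert(n > 0);
    conn_state_t *conns = calloc(n, sizeof *conns);
    if (conns == NULL) return -1;
    long done = 0;
    size_t remaining = n;
    while (remaining > 0) {
        for (size_t i = 0; i < n; i++) {
            if (conns[i].pc == 2) continue;      /* finished earlier */
            /* Pretend 40 bytes arrived for this connection. */
            if (conn_step(&conns[i], 40) == STEP_DONE) {
                done++;
                remaining--;
            }
        }
    }
    free(conns);
    return done;
}
```

A call such as drive_connections(100000) walks 100k in-flight operations with a few bytes of state each. The catch is that conn_step has to be written as an explicit state machine, which is exactly the event-driven style that scares contributors away; the appeal of lua described above is that the VM does this bookkeeping while the check author writes straight-line code.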
After wrapping Reconnoiter’s libeventer and check systems with lua, I was able to replace the previous incarnations of our HTTP checker (based on libserf, then on libcurl) with a 170 line lua HttpClient that consumes less memory and less CPU! If you’re wondering, yes, the client supports chunked transfer encoding, gzip/deflate content encoding, arbitrary headers, client payloads and methods.
So, despite not really liking lua as a language (I find its syntax a bit painful), the simplicity of embedding it in my application (on par with Java), the fact that it is truly interpreted (no compiling lua before it will run) and the absolutely brilliant exposure of its continuations as first-class embedding APIs put me in bed with lua.
Never embed a language just to embed a language; always have a purpose in mind. Understand the host architecture in which you need to embed and attempt to match the strong points of a particular interpreter or VM to your needs. For example, if you have a heavily threaded host architecture, don’t embed something that doesn’t support threading well (or claims to but takes global locks).
Next time I embed, I will start with three preferences: Java, Javascript and Lua. Sadly, I predict Javascript will be disqualified quickly due to its deficiencies. Java and Lua both have their pros and cons. I just hope the next project finds Lua the best bedmate; I’m already looking forward to working with it again.
rubyrep looks neat. I suppose our next round of tests will have to use it on some billion row tables and see how it fares.

