A job, a mission, a career: all without a path or a name.
I'm sitting in the SFO airport waiting to sit on a plane for 6 hours to fly home from the O'Reilly Velocity Summit. Was it worth it? You betcha.
What is this Velocity Summit thing? It was a bunch of web architects from highly trafficked sites sitting around talkin' smack. It was operated in Foo style. However, one thing that made me really appreciate this meet-up was the lack of self-importance displayed by attendees. Everyone was just there to talk -- not to make people understand how much they knew. We were talking about The O'Reilly Velocity Web Performance and Operations Conference: what it should be and why.
Two things that I walked away with were (1) a realization of the lack of a career path for people who do what we do (no standard titles, no standard roles and responsibilities and certainly a lack of sex appeal) and (2) a clear lack of terminology for the technology requirements that are so common in these environments. Terminology is easy, in my opinion -- you just argue until someone wins. Of course, arguing is a hobby of mine, so I have a bias. On the other hand, defining an industry-accepted career path is hard.
The term Web Operations was used a lot during this event. While it isn't awful, I really don't like this term. The hard part is that the captains, superstars, or heroes in these roles are multidisciplinary experts. They have a deep understanding of networks, routing, switching, firewalls, load-balancing, high availability, disaster recovery, TCP & UDP services, NOC management, hardware specifications, several different flavors of UNIX, several web server technologies, caching technologies, several databases, storage infrastructure, cryptography, algorithms, trending and capacity planning. The issue: how can we expect to find good candidates who have fluency in all of those technologies? In the traditional enterprise, you have architects who are broad and shallow and their teams of experts who are focused and deep. However, here the expectation is that your "web operations" engineer be both broad and deep: fix your gigabit switch, optimize your MySQL database and guide the overall architecture design to meet scalability requirements.
I struggle with this. Not everyone can be a superstar. More importantly, no one can really start as a superstar. If we use an apprentice model (which is common in industries without institutional support) we limit the total number of able workers in this field. So, how do we (re)define the requirements for a junior web operations person?
We have to have a plan for hiring on people and progressing them through a career path to make this a legitimate discipline. One person said they just hire people that they think are agile -- "If I tell them to know IOS well enough to configure a router and troubleshoot a problem, I expect them to show up tomorrow with a basic understanding of IOS and ready to start typing in commands at a console." I agree this sort of "no boundaries" attitude is required for the job, but where do you start?
Another person mentioned that the lack of sex appeal in the position was due to popular attitude. Many people apply for development positions and "don't quite make the cut" and are instead offered system administration positions. I personally don't subscribe to this philosophy and we certainly don't operate like that at OmniTI, but I've seen it in other companies -- I hope it is not prevalent.
Basically, this is one of the few positions in the organization that has no boundaries of responsibility. If something breaks, it is your problem. Why isn't this the case throughout the organization -- why is it that even the most junior of developers doesn't wake up to fix their code when it breaks and causes service degradation in the middle of the night? It's uncommon that this level of responsibility is expected of developers, while it is a quite common expectation of the operations crew.
Circling back, I really don't like the term "web ops." I realize it is not far off, but it isn't sexy. Google has a few different roles with this level of responsibility. One I like is called: "Site Reliability Engineer." However, I'd like a set of job titles and a progression through them that makes this an appealing career path for young, ambitious geeks.
In order to define these roles, we should think about what they are responsible for. In our organization I see this as a few things:
On the junior level, they are responsible for learning. They are responsible for deploying new services and documenting such deployments. They are responsible for instrumenting deployments to make sure that faults are detected and trending is possible.
On the mid-level, they are responsible for all of the above, and more. Effective and complete troubleshooting of failures. Making sense of trending information. Understanding work loads that exist. Tuning systems to better accommodate current workloads and proactive tuning to handle known future workloads. One of the key differences between mid-level and junior is the ability to correctly prioritize remediation of issues during incident response. Staying calm, collected and executing with clarity of thought during an emergency.
What does "complete troubleshooting" mean? I mean troubleshooting without boundaries. I want no shyness in cracking open developer code and telling them what they did wrong and why. Finger pointing at people simply doesn't work; you have to point your finger at implementation problems, not people. To do that requires the skill to track a performance problem or reliability issue down to a specific line of code or approach.
On the senior side, technology research and selection is a must. Incorporating new technologies in the architecture to improve availability and reduce costs. Constant analysis of systems to improve efficiency. Capacity planning to understand growth well enough to ensure provisioning and deployment outpace need. Donald Knuth long said that premature optimization is the root of all evil; I've long said that the ability to accurately determine what is premature separates senior from junior.
One of the core responsibilities that must be handled on all levels is assessing the appropriateness of the technologies at hand. At the highest level, the "Web Architect," one must ensure that technology selection as well as development and deployment strategy match the business need. This is "hard."
Above and Beyond
This is a special role. In a lot of ways, this role isn't for failed developers; it is for developers/engineers who have outpaced their career path. They have a deep understanding of how things work: "a complete systemic view of general site architecture." However, they want more responsibility; they want to make sure that all of it works all of the time: the app, the stack, the hardware, the network. Whatever technology the business needs, it must work, it must perform and it must be able to meet demand. Lastly, in their heart of hearts, they must believe that all problems are equal in their need for resolution and that problem prioritization is dictated by business impact and not by flights of fancy (how cool or interesting the problem is).
It's an impossible job requirement: "Knows everything about all technologies deployed in Internet architectures." While no one fills this req., what I want is someone whose career goal is to find out how close they can get. You up to the challenge?