We see a nice peak, a nice valley. Thursday afternoon, we see a nice traffic spike. Well, this used to be what I called a traffic spike. Now, different services have different spike signatures. It resembles the traffic model of classic Internet advertising, except that there is genuine interest and thus dramatically higher conversion rates. It's a simple combination of placement, frequency and exposure. Because content, unlike ad banners, exists for an extended period of time (sometimes forever), the frequency is very high. Digg and Reddit have excellent placement with very little exposure (things move out quickly). A site like CNN or NYTimes usually provides mediocre placement (unless you are on the front page) and excellent exposure.
Lately, I see more sudden eyeballs, and what used to be an established trend seems to fall into a more chaotic pattern that is the aggregate of different spike signatures around a smooth curve. This graph is from two consecutive days where we have a beautiful comparison of a relatively uneventful day followed by a long-exposure spike (nytimes.com) compounded by a short-exposure spike (digg.com):
The disturbing part is that this occurs even on larger sites now due to the sheer magnitude of eyeballs looking at today's already popular sites. Long story short, this makes planning a real bitch.
And the interesting thing is perspective on what is large... People think Digg is popular -- it is. The New York Times is too, as are CNN and most other major news networks -- if they link to your site, you can expect to see a dramatic and very sudden increase in traffic. And this is just in the United States (and some other English-speaking countries)... there are others... and they're kinda big.
What isn't entirely obvious in the above graphs? These spikes happen inside 60 seconds. The idea of provisioning more servers (virtual or not) is unrealistic. Even in a cloud computing system, getting new system images up and integrated in 60 seconds is pushing the envelope, and that would assume a zero-second response time. This means it is about time to adjust what our systems architecture should support. The old rule of 70% utilization accommodating an unexpected 40% increase in traffic is unraveling. At least eight times in the past month, we've experienced sudden traffic increases of 100% to 1000% across many of our clients.
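To put rough numbers on that rule of thumb, here is a minimal sketch in Python. It assumes load grows linearly with traffic and that a system falls over at 100% utilization, which is a simplification (real systems usually degrade earlier). The function names are mine, purely for illustration.

```python
def headroom(utilization: float) -> float:
    """Largest fractional traffic increase absorbed before reaching 100% capacity.

    Assumes load scales linearly with traffic (a simplification).
    """
    return 1.0 / utilization - 1.0


def max_utilization(increase: float) -> float:
    """Highest steady-state utilization that still survives a fractional increase."""
    return 1.0 / (1.0 + increase)


# The old rule: running at 70% leaves room for roughly a 43% increase.
print(f"headroom at 70% utilization: {headroom(0.70):.0%}")

# The increases mentioned above: +40% (old rule), +100% and +1000%.
for increase in (0.4, 1.0, 10.0):
    print(f"+{increase:.0%} spike -> run at or below {max_utilization(increase):.1%}")
```

Under those assumptions, surviving a 100% increase means running at no more than 50% utilization, and surviving a 1000% increase means running at about 9%.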
I talk about scalability a lot. It's my job. It's my passion. I regularly emphasize that scalability and performance are truly different beasts. One key to scalability is that a "systems design" scales. Architectures are built to be able to scale, they are not built "at scale." It's just too expensive to build a system to serve a billion people (until you have a billion people). It's cheap to design a system to serve a billion people. Once you have a billion people accessing your site, you can likely justify executing on your design. Google is successful for this reason: their ideas scale and they can build into them as demand rises. On the flip side, traffic anomalies in the form of spikes are unexpected (by their definition) and scaling a system out to meet the unexpected demand is almost unreasonable. I would even argue that it is more of a performance-centric issue. I want every asset I serve to be as cheap to serve as possible, allowing me to handle larger and larger spikes.
The reason I find all of this stuff interesting is that understanding performance and scalability, understanding the principles of scalable systems design and having sound and efficient processes for handling performance issues is becoming crucial for sites regardless of their size. This takes insight and practice and it reminds me of Knuth's famous saying:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
That's all well and good, but which 97% of the time? My response to Knuth's statement (with which I completely agree) is:
Understanding what is and isn't "premature" is what separates senior engineers from junior engineers.
Let's add perspective on the word "sudden." Most network monitoring systems poll SNMP devices (like switches, load-balancers, and hosts) once every five minutes (we do this every 30 seconds in some environments). Some people say, "my site scales! bring it on." We see these spikes happen inside 60 seconds and they occasionally induce a ten-fold increase over trended peaks. Oftentimes, this spike can be well underway for several minutes before your graphing tools even pick up on it. Then, before you have time to analyze, diagnose and remediate... poof... it's gone. Be careful what you wish for.
This, in many ways, is like a tornado. Our ability to predict them sucks. Our responses are crude and they are quite damaging. However, predicting these Internet traffic events isn't even possible -- there are no building weather patterns or early warning signs. Instead we are forced to focus on different techniques for stability and safety. The idea of a DoS, a DDoS or the sometimes similar signature of a sudden popularity spike doesn't increase my heart rate anymore -- it's just another day on the job. However, I thought I'd share the four guidelines that I believe are key to my sanity in these situations:
- Be Alert: build automated systems to detect and pinpoint the cause of these issues quickly (in less than 60 seconds).
- Be Prepared: understand the bottlenecks of your service systemically. Understand your site inside and out. Contemplate how you would respond if a specific feature or set of features on your site were to get "suddenly popular."
- Perform Triage: understand the importance of the various services that make up your site. If you find yourself in a position to sacrifice one part to ensure continued service of another, you should already know their relative importance and not hesitate in the decision.
- Be Calm: any action that is not analytically driven is a waste of time and energy. Be quick, not rash.
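As a rough illustration of "Be Alert" and "Perform Triage," here is a minimal Python sketch. It is not what we run in production; every name, threshold and service in it is hypothetical. The detector assumes you sample request rates every few seconds (not every five minutes), so a full window still fits well inside 60 seconds. The triage helper assumes you have already ranked your services and know roughly what load each one carries.

```python
from collections import deque


class SpikeDetector:
    """Flags a sample that exceeds a multiple of the trailing average rate."""

    def __init__(self, window: int = 20, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # trailing per-interval request rates
        self.threshold = threshold           # spike = rate > threshold * baseline

    def observe(self, rate: float) -> bool:
        """Record one sample; return True if it looks like a spike."""
        is_spike = False
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            is_spike = baseline > 0 and rate > self.threshold * baseline
        if not is_spike:
            # Keep spike samples out of the baseline so a sustained
            # spike keeps being flagged instead of becoming "normal."
            self.samples.append(rate)
        return is_spike


def services_to_shed(services: dict[str, tuple[int, float]], capacity: float) -> list[str]:
    """Pick services to disable, least important first, until load fits capacity.

    `services` maps name -> (priority, load); a lower priority number
    means a more important service.
    """
    total = sum(load for _, load in services.values())
    shed = []
    by_least_important = sorted(services.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (_, load) in by_least_important:
        if total <= capacity:
            break
        shed.append(name)
        total -= load
    return shed


detector = SpikeDetector(window=5, threshold=3.0)
flags = [detector.observe(r) for r in [100, 110, 95, 105, 100, 102, 900]]
print(flags)

# Hypothetical services with made-up priorities and loads.
site = {
    "checkout": (0, 40.0),
    "search": (1, 30.0),
    "recommendations": (2, 50.0),
    "comments": (3, 20.0),
}
print(services_to_shed(site, capacity=80.0))
```

The point of deciding the priority table ahead of time is that, when the spike hits, the shedding order is a lookup and not a debate.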
Back to those other countries... Enter China and their recently lessened censorship and we have a looming tidal wave for smaller sites that achieve sudden popularity. Spikes of several hundred megabits per second are difficult to account for when your normal trend is around twenty megabits per second. The following graph is traffic induced from a link from a popular foreign news site (that I can't read). I call it: "ouch:"
Wednesday, May 21. 2008 at 22:55 (Link) (Reply)
Any further suggestions there? On the simplest, fastest level for a site, would it make sense to cache a page if it detected a referral from one of the big aggregators or NY Times and CNN?
Tuesday, May 27. 2008 at 17:27 (Link) (Reply)
We're working on some systems like these now. They'll be open when they have some more polish.
Thursday, May 22. 2008 at 16:31 (Reply)
looks similar.
Tuesday, May 27. 2008 at 17:48 (Link) (Reply)
Twitter is a great example of a site that valiantly struggles with scalability. You can say a lot of bad things about them. Their architecture isn't great (it grew rather organically after all) and, in light of that, I'd say they do a really impressive job of keeping it alive.
Monday, June 23. 2008 at 17:43 (Link) (Reply)
Sunday, June 29. 2008 at 10:02 (Link) (Reply)
Why did they last for only such an extremely short period of time?
Sunday, June 29. 2008 at 16:35 (Link) (Reply)
The Digg stuff has a much shorter time frame. It is the life of an entry on Digg's front page: anywhere from a few minutes to a few hours, but rarely longer than that.
Sunday, June 29. 2008 at 16:37 (Link) (Reply)
http://cparente.wordpress.com/2008/06/21/will-the-internet-break-under-peak-load/
Sunday, June 29. 2008 at 22:38 (Link) (Reply)
Knuth was quoting him.
Monday, June 30. 2008 at 08:18 (Reply)
Monday, June 30. 2008 at 12:23 (Link) (Reply)
http://labs.omniti.com/trac/reconnoiter
Tuesday, July 1. 2008 at 08:05 (Link) (Reply)
1) For the initial response to the spike, the user can indeed set up with "warm" capacity - and this works very well in a cloud computing environment. e.g. they can size their virtual servers for large CPU/RAM with a low base utilization - while they have low usage the hypervisor will share capacity with other VMs, but it is instantly available when the spike hits.
2) Although the spike hits instantly, the higher level of traffic lasts for hours afterwards. With a good monitoring and a good cloud vendor, the user will be aware and able to provision additional capacity within the first hour. This doesn't tackle the initial part of the spike (handled above), but does mean that many of the NYT/Digg readers will actually get a great experience when they arrive later in the day.
Tuesday, July 1. 2008 at 08:59 (Link) (Reply)
1. That was actually the point. The old rule of thumb is to maintain about 70% utilization and I'm saying it is time to rethink that paradigm. You echo my conclusion with using large instances with low utilization "just in case." However, if other VMs on the same hardware spike at the same time, you are limited due to oversaturation of the underlying hardware. Heavy-use machines typically use all of the hardware anyway.
2. It takes us far less than 60 minutes to bring online new systems from bare-metal, app install, test and integrate into the load balancer. So, it's really the first few minutes I'm interested in solving, the rest is canned.
I think elastic computing makes more sense the smaller you are. If I have 4 hosts, the ebb and flow is more dramatic and clouds can help a lot. When I already have more than a few racks of servers, I have my own elastic computing if I designed the infrastructure correctly. The premium on fully taxed VMs is pretty significant on the wallet.