Transparency in Cloud Services

37signals recently launched public “Uptime Reports” for their applications (announcement). The reaction on Hacker News was rather tepid, but I think it’s a positive development, and I applaud 37signals for stepping forward. Reliability of cloud applications is a real concern, and there’s not nearly enough hard data out there. Not all products are equally reliable; even within 37signals, the new reports show a 3:1 variation in downtime across apps.

That said, this is a fairly small step. For comparison, take a look at our public monitoring dashboard here at Scalyr. This is a work in progress, and not nearly as pretty as the 37signals page, but it contains far more information. As a user of a cloud application, I would like to know:

  1. Evaluation: how reliable is it?
  2. Diagnosis: if I’m experiencing a problem, is it at my end or their end?

The 37signals uptime reports begin to address the evaluation goal by reporting uptime over the last 12 months. If you’re only allowed to ask for one number, that’s probably the right one. It’s certainly a lot better than nothing, which is what you get from most providers. But it leaves many unanswered questions.

The heart of the problem is that “up” is not a binary variable. An application can be working for some users but not others. It might be flaky — say, failing 5% of requests, or taking an extra 10 seconds to respond — in a way that drives users nuts, but doesn’t trip an outage detector. Or a critical feature might be broken. These things all happen, and they go to the heart of evaluating the reliability of a cloud application. But they tend not to meet the definition of “downtime”.
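
To make the distinction concrete, here is a minimal sketch of a health check that can report “degraded” rather than a binary up/down answer; the 5% and 10-second thresholds echo the examples above and are arbitrary assumptions:

    # Sketch of a health check that distinguishes "degraded" from "down".
    # The 5% error and 10-second latency thresholds mirror the examples
    # above and are arbitrary; a real monitor would tune them per service.

    def classify(samples, error_threshold=0.05, slow_threshold_sec=10.0):
        """samples: list of (latency_sec, succeeded) pairs from recent requests."""
        if not samples:
            return "down"
        error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
        slow_rate = sum(1 for lat, _ in samples if lat > slow_threshold_sec) / len(samples)
        if error_rate == 1.0:
            return "down"       # every request failing: a classic outage
        if error_rate > error_threshold or slow_rate > error_threshold:
            return "degraded"   # users suffer, but a simple up/down check says "up"
        return "up"

    # 6% of requests failing: painful for users, invisible to an uptime counter.
    print(classify([(0.2, True)] * 94 + [(0.3, False)] * 6))  # -> "degraded"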

Then there’s the question of diagnosis. When an application starts misbehaving, is it because my network connection is flaky? Do I need to restart the browser? Or is it a problem at the provider’s end? An application health dashboard can help answer those questions, but only if it provides real-time data. The 37signals report, with one-day resolution, is not useful for diagnosis.

Transparency in backend services

At Scalyr, we’re building backend services, not user-facing applications. In a backend service, transparency is doubly important. Problems that a human user would consider a minor annoyance can wreak havoc on a downstream service. For instance, suppose latency jumps from 100ms to 400ms. This could cause the downstream service to have four times as many requests in flight, increasing memory usage by 4x. If the extra memory isn’t available, the server might crash, turning a minor latency hiccup into a complete outage.
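
To see why, recall Little’s Law: the average number of requests in flight is roughly the arrival rate multiplied by the latency. Here is a back-of-the-envelope sketch; the request rate and per-request memory figures are made-up assumptions, not measurements from any real service:

    # Rough capacity arithmetic using Little's Law:
    #   requests in flight ~= arrival rate * latency
    # All numbers below are illustrative assumptions, not measurements.

    def in_flight(requests_per_sec, latency_sec):
        """Average number of concurrent requests (Little's Law)."""
        return requests_per_sec * latency_sec

    RATE = 500            # requests/sec hitting the backend service (assumed)
    MEM_PER_REQ_MB = 2    # memory held per in-flight request (assumed)

    for latency_ms in (100, 400):
        concurrent = in_flight(RATE, latency_ms / 1000.0)
        print(f"{latency_ms}ms latency -> {concurrent:.0f} requests in flight, "
              f"~{concurrent * MEM_PER_REQ_MB:.0f} MB held")

    # 100ms -> 50 in flight (~100 MB); 400ms -> 200 in flight (~400 MB).
    # The same traffic needs 4x the memory once latency quadruples.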

This may sound extreme, but such things happen. So as a user of a backend service, I want a lot more than just an annual uptime statistic. I want to see latency histograms and error rates over time, at fine granularity, for each operation the service provides. This is the kind of information you’ll find on the Scalyr dashboard.
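
To illustrate the kind of data I have in mind (this is just a sketch, not how Scalyr’s pipeline actually works), here is how raw request records might be folded into per-minute latency percentiles and error rates for each operation:

    # Sketch: fold raw request records into per-minute, per-operation stats.
    # The record format and field names are assumptions made for illustration.
    from collections import defaultdict
    from statistics import quantiles

    # (timestamp_sec, operation, latency_ms, succeeded)
    requests = [
        (1000, "put", 12.0, True),
        (1001, "put", 640.0, False),
        (1002, "get", 8.5, True),
        (1061, "get", 9.1, True),
    ]

    buckets = defaultdict(list)   # (minute, operation) -> [(latency, ok), ...]
    for ts, op, latency, ok in requests:
        buckets[(ts // 60, op)].append((latency, ok))

    for (minute, op), samples in sorted(buckets.items()):
        lats = sorted(lat for lat, _ in samples)
        errors = sum(1 for _, ok in samples if not ok)
        p99 = quantiles(lats, n=100)[98] if len(lats) > 1 else lats[0]
        print(f"minute={minute} op={op} count={len(samples)} "
              f"p99={p99:.1f}ms error_rate={errors / len(samples):.0%}")

In practice you would want a full histogram rather than a single percentile, but even this level of detail goes far beyond a single annual uptime number.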

With backend services, diagnosis is also much more important. If you’re relying on multiple internal and external services, you need a way to quickly narrow down problems. A real-time monitoring dashboard for each service is invaluable. Happily, the data needed is the same — latency histograms and error rates.

Transparency is not just about data

Performance data tells you how a service has performed in the past, but that doesn’t always predict the future. A seemingly reliable service may have been on the edge of disaster; a previously unreliable service may have recently made improvements. To fully evaluate a service, you should also look for an architectural overview. Is it designed for reliability? Are there scaling limits? Single points of failure? How is failover implemented?

The April 2011 AWS outage provides a good example. One of Amazon’s claims for AWS is that the “availability zones” in each region are decoupled — failures in one shouldn’t affect another. As far as I know, that claim had held up for several years. However, the April outage affected EBS services across all zones in the US-East-1 region. According to Amazon’s postmortem, this was because an important subsystem — the “EBS control plane” — was in fact shared across all zones in the region. Thus, a problem in one zone ultimately cascaded into the other zones.

Quite a few sites went down because they had built their disaster recovery plans around Amazon’s promise of zone independence. Monitoring data prior to April 2011 would not have given any hint of this vulnerability. Only if Amazon had published the architectural details of EBS would customers have known to prepare for the possibility of simultaneous failure of EBS in multiple zones.

I don’t mean to pick on Amazon here; it simply happens that, due to their size, they provide a convenient example. To their credit, they did publish a detailed postmortem — another important form of transparency. By contrast, while 37signals reports that Basecamp was down for roughly 6 hours over the last year, I don’t see any postmortems on their site. Postmortems provide an important window into a service’s inner workings, the professionalism of the team, and the likelihood of repeat problems.

Collateral benefits of transparency

Transparency is not just about trust; it also helps to set expectations. What performance can I expect from this service? Is the latency I just measured in a benchmark likely to remain consistent over time? How will it change if I store 100 times more data, or change my query pattern? These questions are critical when designing your downstream application.

Transparency is also of value to the community at large: illustrating what real-world production environments look like, and providing points of comparison.

At Scalyr, we’re striving to raise the bar on service transparency. Our current dashboard is just a start. Over the next few months, look for detailed information regarding our internal architecture, even more detailed (and better documented) monitoring data, and a continuing series of posts on the challenges involved in running reliable services.

If you like these ideas, check out our first service, and stay tuned to this blog for bigger things to come!

Call to action

If you run a service, and you publish any sort of detailed uptime, performance, or architectural information, I’d love to hear from you — drop me a line at [email protected]. I’ll collect any interesting examples for a future post.

If you use a service, and you have interesting examples where transparency has come in handy — or the lack of it has bitten you — send me those, too.