tl;dr: "The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future."
AMZN has gotten a lot of flak over this outage, and rightly so. But I do want to dissuade anyone from thinking anybody else could do much better. I worked there 10 years ago, when they were closer to 200 engineers, and the caliber of people there at that point was insane. By far the smartest bunch I've ever worked with, and a place where I learned habits that serve me well to this day.
I know the guys that started the AWS group, and they were the best of that already insanely selective group. It is easy to be an armchair coach and scream that the network changes should have been automated in the first place, or that they should have predicted this storm, but that ignores just how fantastically hard what they are doing is and how fantastically well it works 99(how many 9's now?)% of the time.
In short, take my word for it, the people working on this are smarter than you and me, by an order of magnitude. There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.
Given a choice between hosting servers on AWS, and trying to build my own reliable infrastructure with a single sysadmin, I'll take AWS in a heartbeat. But I do want to quibble with one of your points:
It is easy to be an armchair coach and scream that... they should have predicted this storm
I'm not as smart as the AWS developers, and I have a lot less experience with large-scale distributed systems.
But thanks to my own cluelessness, I've blown up smaller distributed systems, and I've learned one important lesson: Almost nobody is smart enough to understand automatic error-recovery code. Features like automated volume remirroring or multi-AZ failover increase the load on an already stressed system, and they often cause this kind of "storm."
So I've learned to distrust intelligence in these matters. If you want to understand how your system reacts when things start going wrong, you have to find a way to simulate (or cause) large-scale failures:
This is something that Google does really, really well, by the way. I've watched them turn off 25 core routers simultaneously, carrying hundreds of gigabits worth of data, just to verify that what they think will happen, does happen. http://news.ycombinator.com/item?id=2475112
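In that spirit, a failure-injection harness can be surprisingly small. The sketch below is illustrative only, not anything Google or Amazon actually runs; `ToyCluster`, `chaos_test`, and all parameters are invented names standing in for real infrastructure hooks:

```python
import random

class ToyCluster:
    """Stand-in for a real system: serves requests while a quorum survives."""
    def __init__(self, n=9):
        self.n = n
        self.alive = set(range(n))

    def nodes(self):
        return sorted(self.alive)

    def kill(self, node):
        self.alive.discard(node)

    def restore(self, node):
        self.alive.add(node)

    def is_serving(self):
        return len(self.alive) > self.n // 2

def chaos_test(cluster, kill_fraction=0.1, rounds=100, seed=0):
    """Repeatedly fail random components and verify the system's invariant
    while the failure is live - i.e. check that what you think will
    happen actually happens."""
    rng = random.Random(seed)
    for _ in range(rounds):
        population = cluster.nodes()
        victims = rng.sample(population, max(1, int(kill_fraction * len(population))))
        for node in victims:
            cluster.kill(node)
        assert cluster.is_serving(), "invariant broken under injected failure"
        for node in victims:
            cluster.restore(node)
```

Against a real deployment, `kill` would mean pulling a router or killing a process; the point is that the assertion runs during the failure, not after a clean recovery.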
You also need to pay particular attention to components with substantial, ongoing problems, and make sure you don't let known issues linger:
I work at Amazon EC2 and I can tell you what's going on (thanks to this handy throwaway account). What's happening is the EBS team gets inundated with support tickets due to their half-assed product. Here's the hilarious part: whenever we've asked them why they don't fix the main issue, they keep telling us that they're too busy with tickets. What they don't seem to realize is that if they fixed the core issue the tickets would go away. http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...
Now, I'm not saying I could have done any better than Amazon (evidence suggests otherwise). But I do know that I'm not smart enough to understand these systems without testing them to destruction, and aggressively fixing the root causes of known problems.
> There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.
I can't disagree, but there is one key benefit to not using the cloud for some services.
When your company is working on an important deadline, your sysadmins could choose not to implement that pending network configuration change during that crucial period. You can control your own at-risk times, which you can't generally do with IaaS. As with everything, it's a trade off.
See, this makes me uneasy. The best and brightest work very hard to avoid memory errors in their C codebases then, when errors occur anyway, console themselves that no-one could have done any better. The less brilliant look into more automatic forms of memory management. AT&T hired the best and brightest to make heroic efforts so that the old circuit-switched Bell system could achieve mediocre reliability. I certainly hope that the best and brightest at AWS aren't intending to tackle their enormous control-backplane SPOF by assiduously patching every bug that turns up in its behaviour.
If it's too hard for the best then the concept is dead. But I don't believe it. They made some avoidable mistakes.
Just look at the pattern emerging from these kinds of incidents. There's an automatic cluster recovery mechanism that works for individual node failures but makes matters worse once a larger number of nodes fail.
I wonder whether they did extensive testing or simulation of that scenario. The initial root cause is probably unpredictable because there may be many, but what follows is not unpredictable.
I'm not ready to concede that because they are such an insanely smart elite group of people we just have to live with week long outages.
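The asymmetry is visible even in a deliberately crude toy model (all numbers and names below are invented, not drawn from the post-mortem): per-node recovery is cheap in isolation, but aggregate re-mirror demand shares one spare-capacity pool, so past a threshold volumes simply get stuck.

```python
def remirror_time(failed_nodes, per_node_tb=1.0, spare_capacity_tb=5.0,
                  net_tb_per_hour=2.0):
    """Toy model of a re-mirror storm.

    Each failed node's data must be re-replicated into a shared spare pool.
    A few failures fit and finish quickly; once aggregate demand exceeds
    the pool, volumes hunt for free space indefinitely ("stuck").
    """
    demand_tb = failed_nodes * per_node_tb
    if demand_tb > spare_capacity_tb:
        return float('inf')  # no free space to re-mirror into
    return demand_tb / net_tb_per_hour  # hours, assuming the network is the bottleneck
```

Under these made-up numbers, one failed node recovers in half an hour, while ten failed nodes never finish at all - the mechanism that is benign for individual failures is pathological for correlated ones, which is exactly the scenario worth simulating.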
I've been working at Google, and had a similar experience. But I do think in many cases you can actually do better. It's not that you're smarter than Amazon/Google/... engineers, it's that you're solving a problem that is orders of magnitude easier.
Your start-up is probably not setting up and maintaining 200K servers the way Amazon is, so up to a point you can actually do better on your own.
Agreed with the expertise thing, but to add to your summary of the post-mortem, it looks like human error compounded by:
1) An architecture bug (the EBS "control plane" cuts across Availability Zones and EBS clusters, leading to a single point of failure: this is what broke the "service contract"),
2) a spec/programming bug: No aggressive back-off on retry attempts of EBS ops, and
3) two separate logical bugs: the race condition in the EBS nodes & problems with MySQL replication.
I think that's everything. It just goes to show, most disasters in very well-engineered systems are generally the result of a series of things all going wrong at once, not individual failures...
My tl;dr reads differently: "we lacked appropriate congestion control in our EBS recovery algorithms."
The trigger was a bad and unexpected network configuration change, but the error was that the attempted recovery by the stuck volumes was uncontrolled.
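"Congestion control" here mostly means capped, jittered back-off on retries. A minimal sketch of the standard technique (function name and constants are mine, not anything from the actual EBS code):

```python
import random

def backoff_delay(attempt, base=0.1, cap=60.0):
    """Capped exponential backoff with full jitter.

    Doubling the worst-case delay per failed attempt keeps a retry storm
    from growing, and the random jitter keeps thousands of stuck volumes
    from retrying in lockstep against an already-degraded cluster.
    """
    worst_case = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, worst_case)
```

A client sleeps `backoff_delay(n)` seconds before its n-th retry; aggregate retry load then stays bounded no matter how many volumes are stuck at once.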
I don't think that anyone is knocking the intelligence of the AWS engineers, or saying that anyone else could do it better. Just like NASA engineers and scientists are incredibly intelligent and good at what they do, systems can become complicated enough that unexpected errors creep into the system, and not any particular component.
The best are human beings too! Even the best work in two ways: they build on best principles already known, and they make their own. The first depends on how much you know and how much time you have taken to learn, implement, practice and perfect. The second depends on how far you can see.
In either case you can always overlook, or fail to predict, even the easily foreseeable future. That happens for many reasons: plain human error, or even overconfidence, which is sometimes the case with the best.
The best way out of this problem is what Jeff Atwood blogged about some time back: keep failing, and keep failing in different ways. Each failure needs to be translated into lessons of some sort, and its solution into a best practice. Even the Netflix model of failing purposefully will do.
There is no way the best can be flawless. Nothing is flawless, as long as it is made by humans.
Agreed. It has actually given a very nice excuse to developers like me. Whenever your manager comes up to ask why something you are working on is not up, you can simply reply with something like this:
If AWS, with some of the smartest engineers, can be down for that long, do you think our crappy service will be up 100% of the time?
A single manual configuration mistake should never be able to cause this type of complete system failure. For AWS to be engineered this way, the engineers you met must not have been as smart as they seemed. EBS may be extremely sophisticated in other respects, but if this type of thing was even a possibility, the AWS team are very fallible, and probably far from the best in their field.
>In short, take my word for it, the people working on this are smarter than you and me, by an order of magnitude. There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.
Ever since the AWS outage, I've seen a number of these "the AWS guys are so smart, I've met them" type comments. And then, paraphrased: "There sort-of can't be that much to blame on them because of how smart they are, and they are so smart anyway, who could do better?" That's not a valid argument: not everyone is equally impressed by an individual's intelligence, and perhaps your assessment is wrong. Even if someone is insanely smart, they can still commit practical errors, which indicates they are smart but still flawed in their understanding of engineering in significant ways. Perhaps AWS simply does need a higher caliber of engineer who wouldn't miss these dead-simple safeguards that would have prevented this outage.
By doing what they do, they create the expectation that they _are_ doing better than everybody else.
I've been noticing a trend recently when reading about large scale failures of any system: it's never just one thing.
AWS EBS outage, Fukushima, Chernobyl, even the great Chicago Fire (forgive me for comparing AWS to those events).
Sure, there's always a "root" cause, but more importantly, it's the related events that keep adding up that make the failure even worse. I can only imagine how many minor failures happen worldwide on a daily basis where there's only a root cause and no further chain of events.
Once a system is sufficiently complex, I'm not sure it's possible to make it completely fault-tolerant. I'm starting to believe that there's always some chain of events which would lead to a massive failure. And the more complex a system is, the more "chains of failure" exist. It also becomes increasingly difficult to plan around failures.
edit: The Logic of Failure is recommended to anyone wanting to know more about this subject: http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Sit...
„The genius of a construction lies in its simplicity. Everybody can build complicated things."
A similar point is made in Gene Weingarten's "Fatal Distraction" (http://www.pulitzer.org/works/2010-Feature-Writing), which was about parents who forget a child in the car. Excerpt: "[British psychologist James Reason] likens the layers to slices of Swiss cheese, piled upon each other, five or six deep. The holes represent small, potentially insignificant weaknesses. Things will totally collapse only rarely, he says, but when they do, it is by coincidence -- when all the holes happen to align so that there is a breach through the entire system."
This is an interesting point you hit on and something that is stressed in scuba diving. Basically whenever you go out on a dive you only want to change one thing at a time. Only one thing can be "new" or "untested" or "new to you". Otherwise you run the risk of being task overloaded which leads to cascading failure - potentially catastrophic and/or nonrecoverable.
I've been an AWS (S3, EC2, SQS) user for over 3 years now and this article detailing their systems at a mid-level is kind of scaring me off of their platform. It just sounds so complicated and I'm not sure I want to rely on it for anything critical until I can really understand it myself.
Also, a couple of other complex systems for your trend: financial markets and commercial jets.
The examples he draws from are nuclear power plant failures (TMI in particular), civil aviation and oil transport. But the basics will be recognizable to anyone who has dealt with large computing installations; interactive complexity, tight coupling and cascading failures.
It is not a reassuring book; you won't be able to look at any complex system without asking yourself what sequence of simple, predictable failures of widely separated parts could tip it into a catastrophic failure mode.
http://www.amazon.com/Normal-Accidents-Living-High-Risk-Tech...
> The nodes in an EBS cluster are connected to each other via two networks. The primary network is a high bandwidth network... The secondary network, the replication network, is a lower capacity network used as a back-up network... This network is not designed to handle all traffic from the primary network but rather provide highly-reliable connectivity between EBS nodes inside of an EBS cluster.
During maintenance, instead of shifting traffic off of one of the redundant routers, the traffic was routed onto the lower-capacity network. There was human error involved, but the network issue only provoked latent bugs in the system that should have been caught during disaster recovery testing.
Automatic recovery that isn't properly tested is a dangerous beast; it can cause problems faster and more broadly than any team of humans is capable of handling.
Here's Twitter's back-off decider implementation (Java): https://github.com/twitter/commons/blob/master/src/java/com/...
>...one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly...
This supports the theory that between 50%-80% of outages are caused by human error, regardless of the resilience of the underlying infrastructure.
This supports the theory that between 50%-80% of outages are caused by human error
Not quite - in this case, a single human error then triggered a series of latent and undiscovered bugs in the system itself. It's a confluence of small events that makes for a large-scale problem like this.
Unfortunately, the humans in question here are supposed to be the best of the lot. This brings up the eternal question: during times of crisis, how are the best any different from the mediocre or the average?
If they aren't, then with a little hard work and smart work here and there, anybody can beat these 'best' during non-crisis times. And during a crisis, everyone is the same anyway.
Probably that's why there are a lot of successful companies even with averagely talented people.
I highly recommend that anyone who was surprised by this outage, or the description of the chain reaction of failures that lead to it, read Systemantics. It is a dry but amusing exploration of the seemingly universal fact that every complex system is always operating in a state of failure, but the complexity, failovers and multiple layers can hide this, until the last link in the chain finally breaks, usually with catastrophic results.
Oh yes. It's a classic that deserves to be much better known. Anybody engaged with complex systems - such as software or software projects - will find all kinds of suggestive things in there. As for "dry"... come now, it's hilarious and has cartoons.
Basically, just get it. Here, I'll help:
http://www.amazon.com/Systems-Bible-Beginners-Guide-Large/dp...
(They ruined the title but it's the same book.)
Before the incident, AWS was numero uno in terms of customer visibility and its image as a pathbreaking cloud service.
The lack of transparency in reaching out to customers is the biggest mistake AWS made. They will learn from their mistakes; their servers and networks will be more reliable than ever.
This incident has given people a reason to look at multi-cloud operation, for disaster recovery and backup reasons. The AWS monopoly will fade, and many new standards will be proposed to bring in interoperability and migration between clouds.
Seems there is always this issue. System fails. Systems try to repair themselves. Systems saturate something which stops them from repairing. Systems all loop aggressively bringing it all down.
There's a quote I found interesting that hasn't been noted here yet:
"This required the time-consuming process of physically relocating excess server capacity from across the US East Region and installing that capacity into the degraded EBS cluster."
And if I read this description of the re-mirror storm correctly, I think that implies Amazon had to increase the size of its EBS cluster in the affected zone by 13%, which, considering the timeline, seems fairly impressive.
I find it surprising that they did not, and do not plan to, employ any sort of interlocks/padded walls. What I mean is: if the system is exhibiting some very abnormal state (e.g. the re-mirror event count above a fixed threshold, or more than x standard deviations above average), then automated repair actions should probably stop and the issue should be escalated to a human.
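Such an interlock doesn't need to be sophisticated. A minimal sketch (the class, threshold, and window logic are all invented for illustration, not how EBS works):

```python
class RepairInterlock:
    """Halt automated repair when the failure rate looks abnormal.

    Below the threshold, automation handles routine single-node failures.
    Above it, the system assumes something systemic is wrong, stops
    digging, and escalates to a human.
    """
    def __init__(self, max_events_per_window=50):
        self.limit = max_events_per_window
        self.events = 0
        self.halted = False

    def record_remirror(self):
        """Called once per re-mirror attempt; trips the interlock past the limit."""
        self.events += 1
        if self.events > self.limit:
            self.halted = True  # escalate: page an operator

    def may_repair(self):
        return not self.halted

    def reset_window(self):
        """Called by a timer at each window boundary (e.g. every minute).
        Note that a tripped interlock stays halted until a human clears it."""
        self.events = 0
```

The design choice is deliberate: the interlock fails "stopped", because a stalled repair is recoverable while a runaway re-mirror storm may not be.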
I still don't see a good justification for keeping the ebs control plane exposed to failure across multiple availability zones in a region. Until that is fixed, I would not depend on AZs for real fault tolerance.
"We will look to provide customers with better tools to create multi-AZ applications that can support the loss of an entire Availability Zone without impacting application availability. We know we need to help customers design their application logic using common design patterns. In this event, some customers were seriously impacted, and yet others had resources that were impacted but saw nearly no impact on their applications."
Did I read this correctly in paragraph 2: " For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region."
Their "control plane" network for the EBS clusters span availability zones in a region? If so, this would be the fatal flaw.
I, for one, just want to say: claps to them for figuring this out and nailing the fix in just a few days. After reading this, it feels like an issue at such a massive level could take far longer to fix.
Really, the only purpose of an SLA penalty is to incentivize the provider to keep the network reliable.