Cascading errors caused AWS to go down

[+] eli|13 years ago|reply

In case your browser doesn't speak RSS:

Service is operating normally: Root cause for June 14 Service Event June 16, 2012 3:15 AM

We would like to share some detail about the Amazon Elastic Compute Cloud (EC2) service event last night when power was lost to some EC2 instances and Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the US East Region.

At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

The generator fan was fixed and the generator was restarted at 10:19PM PDT. Once power was restored, affected instances and volumes began to recover, with the majority of instances recovering by 10:50PM PDT. For EBS volumes (including boot volumes) that had inflight writes at the time of the power loss, those volumes had the potential to be in an inconsistent state. Rather than return those volumes in a potentially inconsistent state, EBS brings them back online in an impaired state where all I/O on the volume is paused. Customers can then verify the volume is consistent and resume using it. By 1:05AM PDT, over 99% of affected volumes had been returned to customers with a state 'impaired' and paused I/O to the instance.

Separate from the impact to the instances and volumes, the EBS-related EC2 API calls were impaired from 8:57PM PDT until 10:40PM PDT. Specifically, during this time period, mutable EBS calls (e.g. create, delete) were failing. This also affected the ability for customers to launch new EBS-backed EC2 instances. The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. The EBS datastore is used to store metadata for resources such as volumes and snapshots. One of the primary EBS datastores lost power because of the event. The datastore that lost power did not fail cleanly, leaving the system unable to flip the datastore to its replicas in another Availability Zone. To protect against datastore corruption, the system automatically flipped to read-only mode until power was restored to the affected Availability Zone. Once power was restored, we were able to get back into a consistent state and returned the datastore to read-write mode, which enabled the mutable EBS calls to succeed. We will be implementing changes to our replication to ensure that our datastores are not able to get into the state that prevented rapid failover.

Utility power has since been restored and all instances and volumes are now running with full power redundancy. We have also completed an audit of all our back-up power distribution circuits. We found one additional breaker that needed corrective action. We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.

We sincerely apologize for the inconvenience to those who were impacted by the event.

[+] yaix|13 years ago|reply

"service event"

Added to my list of favourite euphemisms.

[+] namidark|13 years ago|reply

Sounds like Amazon is doing something wrong, shouldn't it fail over to Battery then Generator?

[+] jrockway|13 years ago|reply

The RSS link was quite amusing. My Chrome instance downloaded the RSS file without displaying it. Then I clicked it to open, and it opened Firefox. Firefox showed its file download box, suggesting I open the RSS with Google Chrome.

Deadlock detected.

[+] tedunangst|13 years ago|reply

On Linux, right? Firefox, or whatever gnome/dbus/opendesktop/gtk fuckery it uses has all sorts of strange notions about file types. When I download a tar.gz file, it saves a copy to /tmp, then launches a new instance of firefox with a file:// url, which opens a save file dialog.

[+] nowarninglabel|13 years ago|reply

Does anyone know a solution for this? When I got upgraded to this version of Chrome (Version 20.0.1132.34 beta), it started downloading RSS feeds instead of displaying them. Much sadness has ensued :(

[+] Osiris|13 years ago|reply

Opera detected the link as an RSS feed and asked me to subscribe to it with the built-in RSS reader.

I wonder if other browsers would benefit from being able to at least detect RSS feeds and display appropriate information to user.

[+] forgotusername|13 years ago|reply

In my time at larger companies, DC power seems to be one of the weakest links in the reliability chain. Even planned maintenance often goes wrong ("well we started the generator test and the lights went out, that wasn't supposed to happen. Sorry your racks are dead").

Usually the root cause appears simple - a dead fan, breaker set to the wrong threshold, alarm that didn't trigger, incorrect component picked during design phase, or whatever else that gets the blame - things it would seem to a software guy that good processes could mitigate.

Can any electrical engineers elaborate on why power networks fail (in my experience at least) so frequently? I guess failure modes (e.g. lightning strike) are hard to test, but surely an industry this old has techniques. Is it perhaps a cost issue?

[+] mrkurt|13 years ago|reply

It's really incredibly complicated, and difficult to test fully. The bits of Amazon's DC that failed seem like stuff normal testing should catch, but the DC power failures I've dealt with in the past always had some really precise sequence of events that caused some strange failure no one expected.

As an example, Equinix in Chicago failed back in like 2005. Everything went really well, except there was some kind of crossover cable between generators that helped them balance load that failed because of a nick in its insulation. This caused some wonky failure cycle between generators that seemed straight out of Murphy's playbook.

They started doing IR scans of those cables regularly as part of their disaster prep. It's crazy how much power is moving around in these data centers, in a lot of way they're in thoroughly uncharted territory.

[+] rdl|13 years ago|reply

I assume you mean "Datacenter (conditioned) Power", not literally Direct Current power.

In my experience (in ~30 datacenters worldwide, and reading about more), the actual -48v Direct Current plant is usually ROCK SOLID, in comparison to the AC plant. It's almost always overprovisioned and underutilized, at least in older facilities, or those with telcos onsite (who, unlike crappy hosting customers, actually understand power).

My pro tip for datacenter reliability is to try to get as much of your core gear on the DC plant as possible -- core routers, and maybe even some of your infrastructure servers like RRs, monitoring, OOB management, etc. Ideally split stuff between DC and AC such that if either goes down, you're still sort of ok, or at least can recover quickly. DC and AC is even better than dual AC buses, since what starts out as dual AC can easily end up with a single point of failure later (like when they start running out of pdu space, power, or whatever), and dual AC is also more likely to have a closer upstream connection.

DC stuff is WAY simpler to make reliable and redundant, just uses larger amounts of copper and other materials.

[+] Spooky23|13 years ago|reply

Not an EE, but I've observed a few things about electrical infrastructure:

- The work is usually done by outside contractors, working off of specifications that may or may not make sense.

- Some aspects of testing have the potential to be dangerous to the people doing them. (ie. if network switch fails in testings, no big deal. If some types of electrical switch break during testing, the tester is dead.) High voltage electricity is not a toy.

- IT and facilities staff usually don't talk much, and often don't understand each other when they do.

- There's no instrumentation. I get an alert when IT systems aren't configured right. Nothing from the other stuff.

- There is a wide variance in quality of electrical infrastructure that isn't obvious to someone who isn't skilled in that area. IT folks don't need to deal with computers built in 1970. Electricians deal with ancient stuff that may be completely borked all of the time.

[+] nl|13 years ago|reply

Rackspace has a pretty detailed report[1] of their 2009 outages, which is surprisingly similar to the Amazon problems.

[1] http://broadcast.rackspace.com/downloads/pdfs/DFWIncidentRep...

[+] davps|13 years ago|reply

Power failures caused by lightning strike are relatively easy to test with platforms like RTDS [1] (I am not affiliated to RTDS).

I know that you can test in real time your electrical protection systems for almost all the possibilities you can imagine (thousands of them), for example: faults in your high voltage utility distribution system, breaker failures, coordination of the protection systems, lost of your back-up generator power. I don't know their systems or their philosophies, would be interesting for me to know why they don't parallelize groups of generators (at the backup system), so, when one generator fails, the power load are balanced to the others (and using well known schemes to avoid cascade failures).

[1] http://www.rtds.com/applications/applications.html

[+] jluxenberg|13 years ago|reply

"Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications"

"Meaningful disruption" is a bit of a weasel word; Amazon's own EBS API was down for almost two hours[1] despite being designed to use multiple AZs

[1] "the EBS-related EC2 API calls were impaired from 8:57PM PDT until 10:40PM PDT ... The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores"

Guess the moral of the story is, if you require high availability then you must test your system in the face of an availability zone outage.

[+] damian2000|13 years ago|reply

I love this sentence: Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

Translation: If you have a redundant (multiple-AZ) installation, then you were ok, if not then your server died.

[+] jtchang|13 years ago|reply

Data Center Operator:

We've lost our main power. No problem though we have a backup generator so we are good!

... 5 minutes later ...

Uhh boss, our backup generator's fan crapped out. But no worries we have a secondary generator just for this kind of scenarion!

...10 minutes later and lights go out...

"Well damn...looks like we configured the breaker wrong. This is not a good day."

[+] lutorm|13 years ago|reply

Things could be have been worse -- it could have been a nuclear power plant. oh wait...

[+] tysont|13 years ago|reply

On the plus side, the level of transparency that AWS displays and the detail that they provide seems above and beyond the call of duty. I find it refreshing and I hope that other companies follow suit so that customers can understand the details of operational issues, calibrate expectations appropriately, and make informed decisions.

[+] rdl|13 years ago|reply

They're less transparent and responsive than most datacenter or network providers -- it's just that most of those providers hide their outage information behind an NDA, so only customer contacts get it, vs. making it public.

[+] mleonhard|13 years ago|reply

The title is incorrect. It should say something more like "Cascading failures cause part of AWS to go down."

[+] mleonhard|13 years ago|reply

I'm running https://www.rootredirect.com/ and http://www.restbackup.com/ in us-east-1, in multiple availability zones. Both sites remained up with no problems.

[+] aparadja|13 years ago|reply

Does your rootredirect service actually attract paying customers? I'm genuinely interested to know.

[+] moe|13 years ago|reply

Can someone translate that to control rods and manifolds?

[+] tzury|13 years ago|reply

Seems like deploying on two _physical_ regions (or more) is the best and only proven approach.

That could be within the global AWS, or even say, one cluster at AWS and the other at RackSpace/Linode, etc.

[+] robryan|13 years ago|reply

Then you just need to worry about your application consistency with the replication lag. No silver bullet I guess.

[+] drags|13 years ago|reply

Did anyone else run into issues with ELB during the outage? We're multi-AZ and could access unaffected instances directly without a problem, but the load balancer kept claiming they were unhealthy.

[+] gosub|13 years ago|reply

Could it be possible to have power management the same way Erlang manages processes? Instead of 2 or 3 enormous backup power unit, hundreads of small ones to come in and out of use "fluently".

[+] anaheim|13 years ago|reply

TL; DR:

Shit happens. Don't use AWS as your only platform, you will get burned sometime. Guaranteed, you will also get burned if you try to host and run your own stuff. How competent you are determines which way you get burned less.

[+] starship|13 years ago|reply

Actually, starting right now, AWS is probably your best bet.

Old story about Chuck Yeager from the 1950's: one time shortly after take-off, Yeager's aircraft suffered an engine failure, and he had to do an emergency semi-crash landing. When he realized that a mechanic had put the wrong type of fuel in the plane, he went looking for the guy. The mechanic profusely apologized, said he would resign and never work in aviation again. Yeager replied something along the lines of "Nonsense. In fact, I need someone to refuel my plane right now, and I want you to be the one to fuel it. That's because of all the guys here, I know you'll be the one guy who'll be sure to do it right."

Probably apocryphal, but the point has merit.

[+] mhartl|13 years ago|reply

Or, if you use AWS as your only platform, accept that shit will happen from time to time. Unless your application is a matter of life and death, or unless billions of dollars are at stake, a little downtime now and then probably isn't that big a deal. (All my sites went down when Heroku did (including railstutorial.org, which pays my bills), but the losses are acceptable given the convenience of not having to run my own servers.)

[+] spartango|13 years ago|reply

While I'd agree with the general premise that diversification is a good thing in platform use if high-availability is a requirement, given that this outage was single-AZ, this particular outage should really highlight the point that your application should be multi-AZ scaled if it needs to be up.

[+] acdha|13 years ago|reply

More accurately: “don't trust any single data center”. All of the people who complained were directly ignoring Amazon's own advice, not to mention decades of engineering experience.

Going multi-AZ, multi-region or multi-cloud will help, each step up that list being significantly more work for increasingly small returns.

[+] scyscy|13 years ago|reply

Yes. Also, stop navel-gazing (usually that means stop reading Hacker News). Stop commenting on Hacker News as well. Funny thing about the Singularity/aliens/heaven--it'll come even if you don't spend a lot of time worrying about it.

[+] ferringham|13 years ago|reply

[deleted]

[+] heretohelp|13 years ago|reply

1 in a (million/billion/trillion) I guess.

That'll make for a great horror story to tell though.

[+] kbutler|13 years ago|reply

Isn't the moral of the story, "Check your backups"? There was a defective fan in one generator (sounds like it was findable via a test run?) and a misconfigured circuit breaker (sounds like it was findable by a test run).

Redundancy is only helpful if the redundant systems are actually functional.

87 comments