
Summary of June 29 AWS Service Event in US-East

118 points | rdl | 13 years ago | aws.amazon.com | reply

78 comments

[+] kiwidrew|13 years ago|reply
It seems to me that almost all of the issues revolved around the complicated EC2/EBS control systems. Time and time again, we hear about an AWS failure which resulted in brief instance outages. If it was just a dumb datacentre, the affected servers would just boot up, run an fsck on the disks, and return to service. But because of the huge complexity added by the AWS stack, the control systems inevitably start failing, preventing everything from starting up normally.

I can't help but get the feeling that, if it weren't for the fancy "elastic" API stuff, these outages would remain nothing more than just minor glitches. At this point, I don't see how you could possibly justify running on AWS. Far better to just fire up a few dedicated servers at a dumb host and let them do their job.

[+] rdl|13 years ago|reply
It's undeniable that the added complexity of AWS makes it harder to predict how it will behave in any specific failure condition, which is one of the basic things you want to work out when designing it. "Traditional" infrastructure is hard enough (especially networking), and still fails in unique ways, and we've had 10-100 years to characterize it.

Having a common provisioning API in a bunch of physically diverse centers IS a huge advantage for availability, cost, etc, though. If you have 100 hours to set up a system, and $10k/mo, you have two real choices:

1) Dedicated/conventional servers: Burn 10 hours on sales and contract negotiation, vendor selection, etc., plus ~10-20h on anything specific to the vendor, and then set up systems. Get a bunch of dedicated boxes (colocating your own gear may be better at scale, but it's 10-20x the time plus the upfront cost...), and set up single-site HA (hardware or software load balancer, A+B power, dual line cord servers for at least the databases, etc.).

2) Set up AWS. Since the lowest tiers are free, it's possible you have a lot of experience with it already. 1h to buy the product, extra time to script to the APIs, be able to be dynamic. You could probably be resilient against single-AZ failure in 100h with $10k/mo, although doing multi-region (due to DB issues and minimum spend per region) might be borderline.

In case #1, you're protected against a bunch of problems, but not against a plane hitting the building, someone hitting EPO by accident, etc. In case #2, you should be resilient against any single physical facility problem, but are exposed to novel software risk.

The best solution would be #3 -- some consistent API to order more conventional infrastructure in AWS-like timeframes. Arguably OpenStack or other systems could offer this (using real SANs instead of EBS, real hw load balancers in place of ELB, ...), and you could presumably do some kind of dedicated host provisioning using the same kind of APIs you use for VM provisioning (big hosting companies have done this with PXE for years; someone like Softlayer can provision a system in ~30 minutes on bare metal). Use virtualization when it makes sense, and bare metal at other times (the big Amazon Compute instances are pretty close) -- although the virtualization layer doesn't seem to be the real weakness, but rather all the other services like ELB/EBS, RDS, etc.
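A toy sketch of what that single provisioning API could look like, with VMs and bare metal behind the same call (all names here are invented; the ~30-minute bare-metal figure is the Softlayer one above):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    kind: str               # "vm" or "bare_metal"
    provision_minutes: int  # rough time until the box is usable

def provision(name: str, kind: str) -> Instance:
    # Same call regardless of what's underneath; bare metal just takes
    # longer (PXE-style installs run on the order of 30 minutes).
    minutes = {"vm": 2, "bare_metal": 30}[kind]
    return Instance(name, kind, minutes)
```

The point isn't the implementation, it's that callers stop caring whether the thing underneath is virtual.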

Basically, I want IaaS from a provider who recognizes that software, and especially big complex interconnected systems, is really hard, and who is willing to sacrifice some technical impressiveness and peak efficiency for reliability and easily characterized failure modes.

[+] zhoutong|13 years ago|reply
It takes time to boot up VMs, basically. System booting is one of the most expensive operations in virtualization, especially when the disk image is somewhere in the network. There will be a lot of random I/O, memory allocation and high CPU usage. It takes a few seconds of CPU time for a decent server to initialize a KVM VM. Try multiplying that by hundreds of thousands of instances.
[+] droithomme|13 years ago|reply
So generators at multiple sites all failed in the exact same way, being unable to produce a stable voltage, even though they are all nearly new, have low hours, and are regularly inspected and tested.

It can't just be an amazing coincidence that they all failed on the same day. The fact that they were all recently certified and tested means that that process doesn't ensure they will come on line, any more than the equivalent process worked at the Fukushima nuclear plant.

They don't give the manufacturer or model, and they say that they are going to have them recertified and continue to use them. So that means they are not going to fix the problem, because they don't know why they failed.

You cannot fix a problem if you do not know what caused it.

[+] jaylevitt|13 years ago|reply
To my ears - and maybe this is just wishful hearing - it sounded like they were very, VERY strongly pointing the finger at a certain unnamed generator manufacturer, but doing so in a way that incurred no legal liability.

That manufacturer is probably flying every single C-level exec out to the US-East data center, over the July 4th holiday, to personally disassemble the generator, polish each screw, and carefully put it all back together while singing an a cappella version of "Bohemian Rhapsody", including vocal percussion.

And if they do it to Amazon's satisfaction, Amazon has hinted that they might decide not to out them to the rest of the world. That's called leverage.

[+] ejdyksen|13 years ago|reply
The generators failed at only one site, but it was two generators at that site:

In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks.

It sounds like they were too optimistic about their generator startup times.

[+] ams6110|13 years ago|reply
The difference is that at Fukushima the generators were underwater from the tsunami. A bit difficult to get them started, in that condition.
[+] latch|13 years ago|reply
I don't know how big these data centers are, but why don't they just build a power plant right next to one, dedicated and with underground lines?

I'm a fan of decentralized power generation, and it would seem like large consumers would have the most to gain.

Is this a regulation issue? I imagine Amazon becoming a provider of electricity (even if just to itself) could become a political mess.

[+] cperciva|13 years ago|reply
build a power plant right next to it, dedicated and with underground lines

They did, and it's called "backup generators". :-)

More seriously, power plants go offline on a regular basis -- typical availability factors are around 90% due to the need for regular maintenance. You need to have a power grid in order to have any reasonable availability.
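Rough numbers, assuming independent failures (optimistic, but it shows why a grid of plants beats one dedicated plant):

```python
def combined_availability(per_plant: float, n: int) -> float:
    """Probability that at least one of n independent plants is online."""
    return 1 - (1 - per_plant) ** n

# One dedicated plant at a 90% availability factor: down ~36 days a year.
# Three independent plants feeding a shared grid: 1 - 0.1**3 = 99.9%.
```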

[+] rdl|13 years ago|reply
People normally try to site big datacenters with dual high voltage feeds from separate substations.

Taking down the entire grid causes many problems, and basically doesn't happen (though it did in the Northeast in 2003): http://en.wikipedia.org/wiki/Northeast_blackout_of_2003

The problem is diesel generators basically suck, especially when left powered off. In the long run, I predict fuel cells will take over in the standby power market.

[+] endersshadow|13 years ago|reply
I visited a large data center about five years ago, and they had a power plant on site. However, California has state regulations requiring them to draw a percentage of their power from the utility company (i.e., they can't go completely off grid).
[+] robryan|13 years ago|reply
I would assume it would be a price issue more than anything.
[+] drags|13 years ago|reply
For multi-Availability Zone ELBs, the ELB service maintains ELBs redundantly in the Availability Zones a customer requests them to be in so that failure of a single machine or datacenter won’t take down the end-point.

Based on how they behave in outages, I've always been curious (read: suspicious) about whether ELBs were redundant across AZs or hosted in a single AZ regardless of the AZs your instances are in.

It's good to hear that they are actually redundant and to understand how they're added/removed from circulation in the event of problems.

[+] WALoeIII|13 years ago|reply
In my experience you get an IP returned as an A record for each AZ you have instances in. Inside each AZ traffic is balanced equally across all instances attached to the ELB. The ELB service itself is implemented as a Java server running on EC2 instances, and it is scaled both vertically and horizontally to maximize throughput.
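From the client side, the difference between a well-behaved client and the single-IP consoles mentioned in the summary is roughly this (the ELB hostname below is made up):

```python
import socket

def resolve_all(hostname: str, port: int = 80) -> list[str]:
    """Return every A record for hostname, in resolver order."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    ips = []
    for *_, sockaddr in infos:
        if sockaddr[0] not in ips:  # deduplicate, keep order
            ips.append(sockaddr[0])
    return ips

# A resilient client tries each IP (one per AZ) until one works.
# A naive client does the equivalent of:
#     ip = resolve_all("my-elb.example.com")[0]
# pinning all of its traffic to a single AZ.
```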
[+] rdl|13 years ago|reply
I'm kind of confused why UPS failing doesn't lead to an emergency EBS shutdown procedure which is more graceful than just powering it all off. Blocking new writes, letting stuff complete, and unmount in the last 30 seconds would save a LOT of hassle later.
[+] EwanToo|13 years ago|reply
Blocking new writes on its own would cause instant filesystem corruption for all the hosts using EBS, unless they had already completed their writes and had time to flush their disk caches.

You'd need to integrate it with each running VM, possibly just by sending it the equivalent of a shutdown command from the console, so that it understood the disk was going away in X seconds and would shut down any databases immediately, flush all caches, unmount filesystems, and shut itself down.

It wouldn't be massively difficult, but it's not as simple as just shutting down EBS.
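A minimal sketch of that ordered, time-budgeted sequence (step names and durations are invented for illustration):

```python
# Steps a VM would run after being told its storage disappears in N seconds,
# with rough per-step costs in seconds.
SHUTDOWN_STEPS = [
    ("stop databases", 10.0),
    ("flush caches", 5.0),
    ("unmount filesystems", 3.0),
    ("power off", 1.0),
]

def plan_shutdown(budget: float, steps=SHUTDOWN_STEPS) -> list[str]:
    """Run steps in order, stopping when the time budget runs out."""
    done, elapsed = [], 0.0
    for name, cost in steps:
        if elapsed + cost > budget:
            break  # out of time; the remaining steps are lost
        done.append(name)
        elapsed += cost
    return done
```

With a 30-second warning everything completes; with 12 seconds only the databases get stopped cleanly, which is still a big improvement over a hard power-off.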

[+] cperciva|13 years ago|reply
I was wondering about that too -- I'd love to see the same thing happen with EC2 instances, sending an instance stop signal five minutes before power goes out so that daemons can be stopped and the OS can sync its filesystems to disk before shutting down.

My guess is that this functionality is missing for historical reasons -- the original S3-backed instances didn't have any concept of powering off an instance.

[+] zhoutong|13 years ago|reply
The problem is that a lot of inconsistencies are caused by the software, not I/O or cache. For example, if your database is in the middle of a transaction and everything is in memory, you can't "let the stuff complete" without installing software on the database server to monitor power failures.
[+] Maxious|13 years ago|reply
"many clients, especially game consoles and other consumer electronics, only use one IP address returned from the DNS query."

Is this referring to Netflix?

[+] EwanToo|13 years ago|reply
Netflix definitely had the ELB issues, I think it was described by them as "ELB routing issues" - the service was up, but some people were being sent to the wrong AZ.

That certainly sounds like this doesn't it?

[+] zhoutong|13 years ago|reply
AWS officially recommends CNAME records for ELBs, but the IP addresses don't change regularly and also CNAME for root host name won't work if other records are present, so many sysadmins straightaway use A records with the ELB IPs.
[+] pauly007|13 years ago|reply
So far the comments have focused on the technical aspects outlined as points of failure in Amazon's summary: grid failure, diesel generator failure, and the complexities of the amazon stack. What are your thoughts on Amazon's professionalism in their response and action plan going forward? If you're an AWS customer does this style of response keep you on board?
[+] rdl|13 years ago|reply
The level of clarity in Amazon's post-incident reporting is excellent. Their during-incident reporting is sub-par. Amazon seems to try to minimize any acknowledgement that more than a single AZ is affected in their realtime reporting during outages. There's also a disconnect between the graphics and the text.

What I don't like is that they make repeated promises about AZes and features which are repeatedly shown to be untrue. They also have never disclosed their testing methodology, which leads me to assume there isn't much of one. That makes me unlikely to rely on any AWS services I can't myself fully exercise, or which haven't shown themselves to be robust. S3, sure. EC2, sure (except don't depend on modifying instances on any given day during an outage). EBS, oh hell no. ELB, probably not my first choice, and certainly not my sole ft/ha layer. Route 53, which I haven't messed with, actually does seem really good, but since it's complex, I'm scared given the track record of the other complex components.

[+] giulianob|13 years ago|reply
Well, if you look at their status page ( http://status.aws.amazon.com/ ), none of the statuses shows red. Yellow clearly says it's for performance issues and red is for service disruptions. If this isn't a service disruption, then I don't know what is.
[+] Cloven|13 years ago|reply
Reading between the lines from both the posted note and their persistent failure to provide correct statuses: the ops guys over there are in full CYA desperation mode somewhere around 100% of the time, and a culture of 'it wasn't me', fuelled either by job fear, promotion fear, or fear of being noticed by Bezos, is in full bloom.
[+] rdl|13 years ago|reply
Impressive that the failure of grid power (frequent) and a SINGLE GENERATOR BANK causes this much chaos on the Internet.
[+] cperciva|13 years ago|reply
If I'm reading it right, it's a failure of grid power and two or more generator banks: "each generator independently failed to provide stable voltage".
[+] kalleboo|13 years ago|reply
The generator thing confused me a bit. It seems like the main issue wasn't that the generators failed, but that they took far longer to spin up than expected, so the automatic switchover failed to take, and they had to do a manual switchover at a later point (at which point the UPSes had already started failing). Or am I reading it wrong?

I wonder what the additional cost would be to leave a generator running 24/7 (at minimal load), so you never have an issue with spinning them up. Are they designed to run 24/7/365, or will they wear out too quickly?

[+] jpetazzo|13 years ago|reply
It's interesting to see that they had similar issues two weeks ago (a power outage), and it looks like nothing was done in those two weeks to address the issue, since it happened again this weekend.
[+] nmcfarl|13 years ago|reply
I like that all times are reported in PDT for an event that happened on the east coast - it says something about priorities.
[+] harshreality|13 years ago|reply
At least they specified the timezone, and at least they got the timezone correct (PDT, and not PST) [1].

[1] Not only do a lot of communications not specify a timezone at all, but some communications specify a timezone incorrectly. I recently submitted a ticket with Apple dev support, and the automated reply specifies their hours of operation as 7am to 5pm PST. They're probably clueless, and they mean PDT, but if someone takes them at their word, those hours of operation are off by an hour.

[+] smackfu|13 years ago|reply
If the outage happened in two time zones, which time zone should they use to report? The priority is to make the write-up clear, which means picking a single time zone and being consistent. I guess you could argue it should be UTC, but practically it makes no difference.
[+] res0nat0r|13 years ago|reply
The majority of the AWS team lives and works here in downtown Seattle, so all of the times are in PDT.
[+] WALoeIII|13 years ago|reply
It says that the company is headquartered in Seattle, Washington which is on Pacific Daylight Time.