Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)
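For anyone wanting to pull a suspect zone out of a classic ELB's rotation by hand rather than waiting on Amazon, here is a minimal sketch using boto3; the load balancer name and zone are hypothetical.

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    # Stop routing new traffic to the degraded zone (classic ELB API).
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName="my-elb",            # hypothetical name
        AvailabilityZones=["us-east-1a"],     # the affected zone
    )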
We pay a lot to stay multi-AZ, and it seems Amazon keeps finding ways to show us their single points of failure.
Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.
Sticky situation.
Did/does your standby replica in another AZ have any instance notifications stating there is a failure? The outage report claims there were EBS problems in only one AZ.
Every time (two out of two), by the time I click on an "X is down" link, the service/website is working again. Surely there is a better platform for alerting about outages than ycombinator?
Pingdom does a good job of it, if you point it at a public-facing web site you particularly care about. I'm not affiliated with them; I've just been woken up by them.
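If you'd rather roll your own, a dirt-simple external probe covers the basics. A minimal sketch in Python; the URL is hypothetical, and you'd run it on a cron from a box outside AWS, since a probe inside the affected zone tells you nothing.

    import urllib.request

    def is_up(url, timeout=10.0):
        # Treat any 2xx/3xx response within the timeout as "up";
        # HTTPError (4xx/5xx) and timeouts fall through to False.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except Exception:
            return False

    if not is_up("https://example.com/health"):  # hypothetical URL
        print("site appears down -- page someone")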
I was down for approximately three hours this morning. I don't know when this submission was posted, but I made one shortly after discovering the outage myself.
Either way, if you're using RDS, even if this didn't affect you, it's discussion-worthy. I was affected, and we're building a not-yet-launched product that allows us the time to consider "Is Amazon really where we want to be?". The more failures I'm aware of, the more informed that decision is.
EC2 comes with a free Chaos Monkey service. It's called EC2.
I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until I stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.
It's great when you suddenly need a hundred more servers, though.
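The stop-start dance is scriptable, at least for EBS-backed instances (a stop-start lands the instance on new underlying hardware; instance-store instances can't be stopped). A sketch with boto3; the instance ID is hypothetical.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical

    # Stop, wait, start, wait. Note the public IP changes on
    # stop-start unless the instance has an Elastic IP.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])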
I got notified by Pingdom that my domain was down before AWS had any info on that status page of theirs. IMHO, they should improve the latency of their alerts.
Same here. In fact, the AWS dashboard was still showing 2/2 checks passed for some 20 minutes after Pingdom told me my site was down.
Then the AWS dashboard finally updated and told me that my instances had become unreachable 3 minutes earlier. That is pretty poor. AWS should be able to know right away and email me themselves.
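You can at least poll those same status checks yourself instead of watching the console. A sketch with boto3 (the instance ID is hypothetical), though since these are the dashboard's own "2/2 checks" values, expect the same lag.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_instance_status(
        InstanceIds=["i-0123456789abcdef0"],  # hypothetical
        IncludeAllInstances=True,             # include non-running instances
    )
    for s in resp["InstanceStatuses"]:
        # Status is "ok", "impaired", "insufficient-data", ...
        print(s["InstanceId"],
              s["SystemStatus"]["Status"],
              s["InstanceStatus"]["Status"])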
SNS sent me an e-mail about my instance alarms pretty quickly.
EDIT: My status checks were slow to update, as the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh at a certain interval, but those aren't really meant for real-time monitoring AFAIK.
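Setting one of those up is a few lines. A minimal sketch with boto3 of a status-check alarm wired to an SNS topic; the topic ARN and instance ID are hypothetical.

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    # Fires within a minute or two of StatusCheckFailed going nonzero.
    cw.put_metric_alarm(
        AlarmName="instance-status-check-failed",
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )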
I feel like you can't really say you're in the green when you still have customers unable to use your service. My instance is still stuck in failover.
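If it stays stuck, a reboot with forced failover is one lever you can pull yourself; whether it helps during a zone-wide EBS event is another matter. A sketch with boto3, with the DB identifier hypothetical.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Check what state the instance is actually in first.
    desc = rds.describe_db_instances(DBInstanceIdentifier="mydb")
    print(desc["DBInstances"][0]["DBInstanceStatus"])

    # For Multi-AZ instances, force promotion of the standby.
    rds.reboot_db_instance(DBInstanceIdentifier="mydb", ForceFailover=True)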
"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."
"9:32 AM PDT Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes."
I'm still seeing issues: some instances aren't starting, and others I'm still not able to connect to. So I'm not sure what they are talking about.
Issue #3298392 for EC2 this month. This is ridiculous; so many websites rely on EC2, and it's proving to be extremely unreliable. Cloud computing is definitely not the answer to everything, it would seem.
    Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st

That 99.7%wa means the EBS subsystem is completely unreachable. I/O wait times are tanked across the board for me (I'm in US-EAST-1).
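For reference, that %wa column is the iowait figure from /proc/stat; a small Python sketch that computes it the same way top does. Near 100% with an otherwise idle CPU means everything is blocked on disk (here, on EBS).

    import time

    def cpu_times():
        # First line of /proc/stat: "cpu  user nice system idle iowait irq ..."
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    a = cpu_times()
    time.sleep(1)
    b = cpu_times()
    delta = [y - x for x, y in zip(a, b)]
    print("iowait: %.1f%%" % (100.0 * delta[4] / sum(delta)))  # index 4 = iowait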
After you list the permanent identifiers, you can match them up to find out if your us-east-1a matches my us-east-1d.
This Alestic article [0] shows how to label them all.
[0] "Matching EC2 Availability Zones Across AWS Accounts" http://alestic.com/2009/07/ec2-availability-zones
Fingers crossed (just deployed to AWS less than 2 weeks ago).
http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...
I have some live instances running without EBS disks that I cannot place behind the ELB because it is not working.
ELBs are sometimes EBS backed.