Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)
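For anyone wanting to pull a suspect zone out of a classic ELB's rotation by hand rather than waiting on Amazon, here is a minimal sketch using boto3; the load balancer name and zone are hypothetical.

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    # Stop routing new traffic to the degraded zone (classic ELB API).
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName="my-elb",            # hypothetical name
        AvailabilityZones=["us-east-1a"],     # the affected zone
    )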
We pay a lot to stay multi-AZ, and it seems Amazon keeps finding ways to show us their single points of failure.
Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.
Sticky situation.
Did/does your standby replica in another AZ have any instance notifications stating there is a failure? The outage report claims there were EBS problems in only one AZ.
Every time (two out of two), by the time I click on an "X is down" link, the service/website is working again. Surely there is a better platform for alerting about outages than ycombinator?
Pingdom does a good job of it, if you point it at a public-facing web site you particularly care about. I'm not affiliated with them; I've just been woken up by them.
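If you'd rather roll your own, a dirt-simple external probe covers the basics. A minimal sketch in Python; the URL is hypothetical, and you'd run it on a cron from a box outside AWS, since a probe inside the affected zone tells you nothing.

    import urllib.request

    def is_up(url, timeout=10.0):
        # Treat any 2xx/3xx response within the timeout as "up";
        # HTTPError (4xx/5xx) and timeouts fall through to False.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except Exception:
            return False

    if not is_up("https://example.com/health"):  # hypothetical URL
        print("site appears down -- page someone")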
I was down for approximately three hours this morning. I don't know when this submission was posted, but I made one shortly after discovering the outage myself.
Either way, if you're using RDS, even if this didn't affect you, it's discussion-worthy. I was affected, and we're building a not-yet-launched product that allows us the time to consider "Is Amazon really where we want to be?". The more failures I'm aware of, the more informed that decision is.
EC2 comes with a free Chaos Monkey service. It's called EC2.
I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until I stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.
It's great when you suddenly need a hundred more servers, though.
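The stop-start dance is scriptable, at least for EBS-backed instances (a stop-start lands the instance on new underlying hardware; instance-store instances can't be stopped). A sketch with boto3; the instance ID is hypothetical.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical

    # Stop, wait, start, wait. Note the public IP changes on
    # stop-start unless the instance has an Elastic IP.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])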
I got notified by Pingdom that my domain was down before AWS had any info on that status page of theirs. IMHO, they should improve the latency of their alerts.
Same here. In fact, the AWS dashboard was still showing 2/2 checks passed for some 20 minutes after Pingdom told me my site was down.
Then the AWS dashboard finally updated and told me that my instances had become unreachable 3 minutes earlier. That is pretty poor. AWS should be able to know right away and email me themselves.
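You can at least poll those same status checks yourself instead of watching the console. A sketch with boto3 (the instance ID is hypothetical), though since these are the dashboard's own "2/2 checks" values, expect the same lag.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_instance_status(
        InstanceIds=["i-0123456789abcdef0"],  # hypothetical
        IncludeAllInstances=True,             # include non-running instances
    )
    for s in resp["InstanceStatuses"]:
        # Status is "ok", "impaired", "insufficient-data", ...
        print(s["InstanceId"],
              s["SystemStatus"]["Status"],
              s["InstanceStatus"]["Status"])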
SNS sent me an e-mail about my instance alarms pretty quickly.
EDIT: My status checks were slow to update, as the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh at a certain interval, but those aren't really meant for real-time monitoring AFAIK.
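Setting one of those up is a few lines. A minimal sketch with boto3 of a status-check alarm wired to an SNS topic; the topic ARN and instance ID are hypothetical.

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    # Fires within a minute or two of StatusCheckFailed going nonzero.
    cw.put_metric_alarm(
        AlarmName="instance-status-check-failed",
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )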
I feel like you can't really say you're in the green when you still have customers unable to use your service. My instance is still stuck in failover.
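If it stays stuck, a reboot with forced failover is one lever you can pull yourself; whether it helps during a zone-wide EBS event is another matter. A sketch with boto3, with the DB identifier hypothetical.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Check what state the instance is actually in first.
    desc = rds.describe_db_instances(DBInstanceIdentifier="mydb")
    print(desc["DBInstances"][0]["DBInstanceStatus"])

    # For Multi-AZ instances, force promotion of the standby.
    rds.reboot_db_instance(DBInstanceIdentifier="mydb", ForceFailover=True)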
"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."
"9:32 AM PDT Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes."
I'm still seeing issues: some instances aren't starting, and others I'm still not able to connect to. So I'm not sure what they are talking about.
Issue #3298392 for EC2 this month. This is ridiculous; so many websites rely on EC2, and it's proving to be extremely unreliable. Cloud computing is definitely not the answer to everything, it would seem.
    Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st

That 99.7%wa means the EBS subsystem is completely unreachable. I/O wait times are tanked across the board for me (I'm in US-EAST-1).
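For reference, that %wa column is the iowait figure from /proc/stat; a small Python sketch that computes it the same way top does. Near 100% with an otherwise idle CPU means everything is blocked on disk (here, on EBS).

    import time

    def cpu_times():
        # First line of /proc/stat: "cpu  user nice system idle iowait irq ..."
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    a = cpu_times()
    time.sleep(1)
    b = cpu_times()
    delta = [y - x for x, y in zip(a, b)]
    print("iowait: %.1f%%" % (100.0 * delta[4] / sum(delta)))  # index 4 = iowait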
After you list the permanent identifiers, you can match them up to find out if your us-east-1a matches my us-east-1d.
This Alestic article [0] shows how to label them all.
[0] "Matching EC2 Availability Zones Across AWS Accounts" http://alestic.com/2009/07/ec2-availability-zones
Fingers crossed (just deployed to AWS less than 2 weeks ago).
http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...
I have some live instances running without EBS disks that I cannot place behind the ELB because it is not working.
ELBs are sometimes EBS backed.