
Show HN: AWS S3 outage test from across the world

77 points | sajal83 | 10 years ago | pulse.turbobytes.com

30 comments

[+] chr15 | 10 years ago
Is there really any way to design your application to handle S3 failures like this? S3's SLA promises 99.99% availability, but is there a way to handle the 1% so your application is not affected? Options I can think of:

  1. Using a CDN to serve files can help in some cases
  2. On-prem systems may be able to use gateway-cached volumes and use the local disk cache vs S3
Other ideas?
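The fallback idea in the options above can be sketched as an ordered-source read. This is purely illustrative; the function and source names are stand-ins, not a real S3 or CDN API:

```python
def fetch_with_fallback(key, sources):
    """Try each source in order; return the first successful read.

    `sources` is a list of callables (e.g. CDN edge, S3, local disk cache),
    each taking a key and returning bytes or raising on failure.
    """
    errors = []
    for source in sources:
        try:
            return source(key)
        except Exception as exc:  # in practice, catch the client's specific errors
            errors.append(exc)
    raise RuntimeError(f"all {len(sources)} sources failed for {key!r}: {errors}")

# Demo with stand-in sources: the primary fails, the cache answers.
def s3_read(key):
    raise IOError("503 Slow Down")  # simulate the outage

def cache_read(key):
    return b"cached copy of " + key.encode()

print(fetch_with_fallback("logo.png", [s3_read, cache_read]))
```

The ordering encodes the policy: cheapest/freshest source first, stalest-but-most-available source last.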
[+] untog | 10 years ago
Slightly OT, but there's an interesting phenomenon at work now that so much of the internet depends on Amazon's infrastructure. When it goes down you might not even need to worry about it that much, as so many sites/apps will be broken that most users will just assume that the internet is broken.
[+] merb | 10 years ago
Multi Cloud. Use S3 + Azure Cloud Files + Google Cloud Files
[+] thezilch | 10 years ago
S3's SLA is actually only three nines (99.9%), i.e. 8.76 hours/year or 43.8 minutes/month of downtime: https://aws.amazon.com/s3/sla/

CloudFront offers the same availability. Many CDNs offer no more than three nines. Some claim 100%, but there will eventually be faults. Most do really well at not having recognized outages, but I nonetheless think they offer a 100% guarantee so that you always get a credit for any downtime, rather than because they are never down.

You can look at replicating files to multiple providers; the following shows what kind of uptimes you can expect from the big players: https://cloudharmony.com/status-1year-of-storage

If you can live with a read-only state via a CDN, here's a similar report: https://cloudharmony.com/status-1year-of-cdn
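The nines arithmetic in these comments is easy to sanity-check; a minimal sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_per_year(availability_pct):
    """Minutes of allowed downtime per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100

print(downtime_per_year(99.9))   # three nines: ~525.6 minutes (8.76 hours/year)
print(downtime_per_year(99.99))  # four nines: ~52.56 minutes/year
```

Dividing the three-nines figure by 12 gives the 43.8 minutes/month quoted above.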

[+] count | 10 years ago
Understand that there isn't a single "S3 service". There are multiple S3 services in multiple Regions within AWS, all operated independently of each other (this goes for all other AWS services too) so that cascading failures don't spread between regions. So, use 2-3 different S3 regions, or some other multi-cloud solution...
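A hedged sketch of the multi-region idea as a write fan-out: replicate each object to several independent regions and call the write durable once enough of them accept it. The region names and `put` callables here are stand-ins, not the boto3 API:

```python
def replicate_put(key, data, clients, min_success=2):
    """Write an object to several independent S3 regions (or providers).

    `clients` maps region name -> a callable put(key, data) that raises
    on failure. The write counts as durable once `min_success` regions
    have accepted it; otherwise the whole operation fails.
    """
    ok, failed = [], []
    for region, put in clients.items():
        try:
            put(key, data)
            ok.append(region)
        except Exception as exc:  # in practice, catch the client's specific errors
            failed.append((region, exc))
    if len(ok) < min_success:
        raise RuntimeError(
            f"only {len(ok)} of {len(clients)} regions accepted the write: {failed}"
        )
    return ok

# Demo with stand-in clients: one region rejects the write.
def ok_put(key, data):
    pass  # pretend the upload succeeded

def bad_put(key, data):
    raise IOError("503 Slow Down")

regions = {"us-east-1": ok_put, "eu-west-1": ok_put, "ap-northeast-1": bad_put}
print(replicate_put("report.csv", b"...", regions))  # the two healthy regions accepted
```

Reads then become the mirror image: try regions in order until one answers.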
[+] ohitsdom | 10 years ago
Seconding this; I would love to hear how companies handle S3 outages. One correction, though: it's 0.01% that you need to handle (if they deliver on their availability promise). That's less than an hour a year.
[+] lukasm | 10 years ago
Replicate to a different data center, e.g. Azure.
[+] forrestthewoods | 10 years ago
I wonder, what's the yearly downtime of Amazon? If S3 were up 99.99% of the time, the remaining 0.01% would be only 52.56 minutes per year. S3 is actually down more than that. But how much exactly? It's impossible to discuss mitigation strategies if we don't even know exactly what issue we're mitigating!
[+] Sami_Lehtinen | 10 years ago
"Oh no, our server made a boo boo. Please try again."
[+] sajal83 | 10 years ago
Please try again. The server had crashed due to "too many open files"; I'm leaking file descriptors somewhere.
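For illustration, here is a common way descriptors leak in Python, next to the context-manager pattern that avoids it. This is a generic example, not the actual server code:

```python
import os
import tempfile

def read_leaky(path):
    # If anything after open() raises before the handle is closed, the
    # descriptor leaks; enough of these and the process hits EMFILE
    # ("too many open files").
    f = open(path)
    return f.read()  # `f` is closed only whenever the GC gets to it

def read_safe(path):
    # The context manager closes the descriptor even if .read() raises.
    with open(path) as f:
        return f.read()

# Demo with a throwaway file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("hello")
print(read_safe(path))  # hello
os.remove(path)
```

The same pattern applies to sockets and subprocess pipes, which are the usual leak sources in a network server.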
[+] imrehg | 10 years ago
Yeah, I got a 2.53% error rate, but that's nothing to worry about: out of 79 servers, that's exactly 2 errors, both of them in China, which makes it feel less like Amazon's fault than the Great Firewall's... Maybe there should be some more meaningful error measure than the raw failure percentage.
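One more meaningful measure would be a confidence interval on the failure proportion: with only 2 failures out of 79 probes, the raw 2.53% is a very noisy estimate. A sketch using the standard Wilson score interval (the probe counts are from the comment above; the function is illustrative):

```python
import math

def wilson_interval(failures, trials, z=1.96):
    """95% Wilson score interval for a failure proportion.

    With a handful of failures, the interval shows how wide the range
    of plausible underlying error rates really is.
    """
    p = failures / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

low, high = wilson_interval(2, 79)
print(f"raw rate: {2/79:.2%}, 95% interval: {low:.2%} - {high:.2%}")
```

The interval spans from well under 1% to several percent, which is why two probe failures say little about S3 itself.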
[+] scott_karana | 10 years ago
Vancouver, Tokyo, Cebu, Singapore, Bangkok, and Kharkiv all show EOF.

Paris, Roubaix, Manchester, Portsmouth, Budapest, Tokyo, Bilthoven, Vleuten, Utrecht, Manila, Sovetskaya Gavan, Tyuven, Singapore, Bangkok, Taipei City, Kiev, Kherson, Ashburn, Mountain View, Rowley, and Newark all show 503s.

It's more than just the Great Firewall. :-)

[+] mechadock | 10 years ago
Error rate: 41.98%, response time (avg): 5,518 ms for me.
[+] cddotdotslash | 10 years ago
I got some alerts this morning from a couple of services that rely on S3. Perhaps this is related, but everything is back up for now.