Is there really any way to design your application to handle S3 failures like this? S3's SLA promises 99.99% availability, but is there a way to handle the 1% so your application is not affected? Options I can think of:
1. Using a CDN to serve files can help in some cases
2. On-prem systems may be able to use gateway-cached volumes and use the local disk cache vs S3
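Option 2 is essentially a read-through cache: serve from local disk when possible, and only hit S3 on a cache miss. A minimal sketch, where the hypothetical `fetch_from_s3` callable stands in for a real S3 client call:

```python
import os
import tempfile

def cached_get(key, cache_dir, fetch_from_s3):
    """Read-through cache: return the local copy if present,
    otherwise fetch from the origin and store it on local disk."""
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    data = fetch_from_s3(key)   # may raise while S3 is down
    with open(path, "wb") as f:
        f.write(data)
    return data
```

The nice property during an outage: every key already in the cache keeps being served; only cold keys fail.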
Slightly OT, but there's an interesting phenomenon at work now that so much of the internet depends on Amazon's infrastructure. When it goes down you might not even need to worry about it that much, as so many sites/apps will be broken that most users will just assume that the internet is broken.
S3 SLA is actually for only three nines (99.9%) or 8.76 hours / year or 43.8 minutes / month of downtime: https://aws.amazon.com/s3/sla/
CloudFront offers the same availability. Many CDNs offer no more than three nines. Some claim 100%, but there will eventually be faults. Most do really well at avoiding recognized outages, but I nonetheless think they offer a 100% guarantee so that you always get credit for any downtime, rather than because they are truly never down.
Understand that there isn't 'An S3 service'. There are multiple S3 services in multiple Regions within AWS, and they're all operated independently of each other (this goes for all other AWS services too) so that cascading failures/etc. don't occur between regions.
So, use 2-3 different S3 regions, or some other multi-cloud solution...
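The multi-region read path can be as simple as trying each region in order and returning the first success. A sketch, with region clients abstracted as callables (the names are illustrative, not a real SDK API):

```python
def failover_get(key, fetchers):
    """Try each region/provider in order; return the first successful read.
    `fetchers` is an ordered list of callables, e.g. one per S3 region."""
    last_err = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as err:   # real code would catch the client's error type
            last_err = err
    raise last_err
```

Combined with cross-region replication on the write side, this survives any single-region S3 outage at the cost of extra storage and replication lag.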
Second this, I would love to hear how companies handle S3 outages. Although one correction, it's 0.01% that you need to handle (if they deliver on their availability promise). That's less than an hour a year.
I wonder, what's the yearly downtime of Amazon? If S3 were up 99.99% of the time, the remaining 0.01% is only 52.56 minutes per year. S3 is actually down more than that. But how much exactly? It's impossible to discuss mitigation strategies if we don't even know what the exact issue we're mitigating is!
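This arithmetic is easy to get wrong by a factor of ten; converting an availability figure into yearly downtime is just:

```python
def downtime_minutes_per_year(availability):
    """Minutes of allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * 365 * 24 * 60

# 99.99% leaves 0.01% of the year:
print(round(downtime_minutes_per_year(0.9999), 2))   # 52.56
# three nines (99.9%) leaves ten times as much:
print(round(downtime_minutes_per_year(0.999), 1))    # 525.6 (~8.76 hours)
```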
Yeah, I got a 2.53% error rate, but that's nothing to worry about - using 79 servers, that's exactly 2 errors, both of them in China, which kinda makes it feel less like Amazon's fault than the Great Firewall's... Maybe there should be some more meaningful error measure than the raw failure percentage.
You can look at replicating files to multiple providers; the following shows what kind of uptimes you can expect from the big players: https://cloudharmony.com/status-1year-of-storage
If you can live with read-only states with CDN; a similar report: https://cloudharmony.com/status-1year-of-cdn
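Replicating on write is the other half of that approach: push each object to every provider and treat the write as successful only if at least some minimum number accept it. A toy sketch with providers modelled as dicts (real clients would be the boto3 / GCS / Azure SDKs):

```python
def replicated_put(key, data, providers, min_acks=1):
    """Write `data` to every provider; succeed if at least `min_acks` accept."""
    acks = 0
    for store in providers:
        try:
            store[key] = data
            acks += 1
        except Exception:
            pass  # this provider is down; keep trying the rest
    if acks < min_acks:
        raise IOError("write failed on too many providers")
    return acks
```

With `min_acks=1` you stay writable through any single provider's outage, at the cost of having to reconcile the lagging provider afterwards.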
edit: S3 seems to be back up and running according to the AWS status page.
Paris, Roubaix, Manchester, Portsmouth, Budapest, Tokyo, Bilthoven, Vleuten, Utrecht, Manila, Sovetskaya Gavan, Tyuven, Singapore, Bangkok, Taipei City, Kiev, Kherson, Ashburn, Mountain View, Rowley, and Newark all show 503s.
It's more than just the Great Firewall. :-)