Is there really any way to design your application to handle S3 failures like this? S3's SLA promises 99.99% availability, but is there a way to handle the 1% so your application is not affected? Options I can think of:
1. Using a CDN to serve files can help in some cases
2. On-prem systems may be able to use gateway-cached volumes and use the local disk cache vs S3
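Option 2 is essentially a read-through cache: serve from local disk when possible, and only hit S3 on a cache miss. A minimal sketch, where the hypothetical `fetch_from_s3` callable stands in for a real S3 client call:

```python
import os
import tempfile

def cached_get(key, cache_dir, fetch_from_s3):
    """Read-through cache: return the local copy if present,
    otherwise fetch from the origin and store it on local disk."""
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    data = fetch_from_s3(key)   # may raise while S3 is down
    with open(path, "wb") as f:
        f.write(data)
    return data
```

The nice property during an outage: every key already in the cache keeps being served; only cold keys fail.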
Slightly OT, but there's an interesting phenomenon at work now that so much of the internet depends on Amazon's infrastructure. When it goes down you might not even need to worry about it that much, as so many sites/apps will be broken that most users will just assume that the internet is broken.
S3 SLA is actually for only three nines (99.9%) or 8.76 hours / year or 43.8 minutes / month of downtime: https://aws.amazon.com/s3/sla/
CloudFront offers the same availability. Many CDNs offer no more than three nines. Some claim 100%, but there will eventually be faults. Most do really well at avoiding recognized outages, but I nonetheless think they offer a 100% guarantee so that you always get credit for any downtime, rather than because they are truly never down.
Understand that there isn't 'An S3 service'. There are multiple S3 services in multiple Regions within AWS, and they're all operated independently of each other (this goes for all other AWS services too) so that cascading failures/etc. don't occur between regions.
So, use 2-3 different S3 regions, or some other multi-cloud solution...
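The multi-region read path can be as simple as trying each region in order and returning the first success. A sketch, with region clients abstracted as callables (the names are illustrative, not a real SDK API):

```python
def failover_get(key, fetchers):
    """Try each region/provider in order; return the first successful read.
    `fetchers` is an ordered list of callables, e.g. one per S3 region."""
    last_err = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as err:   # real code would catch the client's error type
            last_err = err
    raise last_err
```

Combined with cross-region replication on the write side, this survives any single-region S3 outage at the cost of extra storage and replication lag.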
Second this, I would love to hear how companies handle S3 outages. Although one correction, it's 0.01% that you need to handle (if they deliver on their availability promise). That's less than an hour a year.
I wonder, what's the yearly downtime of Amazon? If S3 were up 99.99% of the time, the remaining 0.01% is only 52.56 minutes per year. S3 is actually down more than that. But how much exactly? It's impossible to discuss mitigation strategies if we don't even know what the exact issue we're mitigating is!
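This arithmetic is easy to get wrong by a factor of ten; converting an availability figure into yearly downtime is just:

```python
def downtime_minutes_per_year(availability):
    """Minutes of allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * 365 * 24 * 60

# 99.99% leaves 0.01% of the year:
print(round(downtime_minutes_per_year(0.9999), 2))   # 52.56
# three nines (99.9%) leaves ten times as much:
print(round(downtime_minutes_per_year(0.999), 1))    # 525.6 (~8.76 hours)
```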
Yeah, I got a 2.53% error rate, but that's nothing to worry about - using 79 servers, that's exactly 2 errors, both of them in China, which kinda makes it feel less like Amazon's fault than the Great Firewall's... Maybe there should be some more meaningful error measure than the raw failure percentage.
You can look at replicating files to multiple providers; the following shows what kind of uptimes you can expect from the big players: https://cloudharmony.com/status-1year-of-storage
If you can live with read-only states with CDN; a similar report: https://cloudharmony.com/status-1year-of-cdn
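Replicating on write is the other half of that approach: push each object to every provider and treat the write as successful only if at least some minimum number accept it. A toy sketch with providers modelled as dicts (real clients would be the boto3 / GCS / Azure SDKs):

```python
def replicated_put(key, data, providers, min_acks=1):
    """Write `data` to every provider; succeed if at least `min_acks` accept."""
    acks = 0
    for store in providers:
        try:
            store[key] = data
            acks += 1
        except Exception:
            pass  # this provider is down; keep trying the rest
    if acks < min_acks:
        raise IOError("write failed on too many providers")
    return acks
```

With `min_acks=1` you stay writable through any single provider's outage, at the cost of having to reconcile the lagging provider afterwards.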
edit: S3 seems to be back up and running according to the AWS status page.
Paris, Roubaix, Manchester, Portsmouth, Budapest, Tokyo, Bilthoven, Vleuten, Utrecht, Manila, Sovetskaya Gavan, Tyuven, Singapore, Bangkok, Taipei City, Kiev, Kherson, Ashburn, Mountain View, Rowley, and Newark all show 503s.
It's more than just the Great Firewall. :-)