Slightly off-topic rant follows:
I don't see a lot of tech sites talk about the fact that Azure and GCP have multi-region outages. Everybody sees this kind of thing and goes "shrug, an outage". No, this is not okay. We have multiple regions for a reason. Making an application support multi-region is HARD and COSTLY. If I invest that into my app, I never want it to go down due to a configuration push. There has never been an AWS incident across multiple regions (us-east-1, us-west-2, etc.). That is a pretty big deal to me.
There are a handful of companies that will try and sell you this. However, I'd say anything that's simple enough to be expressed as a chart or one-page summary is not actually useful. Interesting outages have variable breadth, scope, and severity. It's usually some methods or a subset of customers that are impacted. That's really hard to communicate as a straight percentage. You need to map it back to your particular workload and dependencies. And the meaningful result is how your particular application or customer experience would be affected.
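To make the "map it back to your workload" point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Outage` type, the `affected_dependencies` helper, and the sample regions and services are made up for illustration and do not come from any vendor's status feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """A hypothetical outage record (illustrative, not a real status-page schema)."""
    region: str
    services: frozenset


def affected_dependencies(dependencies, outage):
    """Return the (region, service) pairs of a workload that the outage touches."""
    return sorted(
        (region, service)
        for (region, service) in dependencies
        if region == outage.region and service in outage.services
    )


# Hypothetical outage: S3 is degraded in us-east-1 only.
outage = Outage(region="us-east-1", services=frozenset({"S3"}))

# Two hypothetical workloads with different dependency footprints.
single_region_app = {("us-east-1", "S3"), ("us-east-1", "EC2")}
west_only_app = {("us-west-2", "S3"), ("us-west-2", "EC2")}

print(affected_dependencies(single_region_app, outage))  # [('us-east-1', 'S3')]
print(affected_dependencies(west_only_app, outage))      # []
```

The same outage is a real problem for the first workload and a non-event for the second, which is why a single vendor-wide percentage says little about either.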
tubaguy50035|6 years ago
Whenever I post this, somebody comes along and says "well, that one time us-east-1 went down and everybody was using the generic S3 endpoints, so it took everything down". This is true, and the ASG and EBS services in other regions apparently were affected too. BUT, if you invested the time to ensure your application could be multi-region and you hosted on AWS, you would not have seen an outage. Scaling and snapshots might not have worked, but it would not have been the 96.2% packet drop that GCP is showing here, and your end users likely would not have noticed.
The articles that track outages at the different cloud vendors really should be pushing this.
eeg3|6 years ago
GCP was basically even with AWS, and Microsoft was ~6x their downtime according to that article.
ti_ranger|6 years ago
> AWS has the most granular reporting, as it shows every service in every region. If an incident occurs that impacts three services, all three of those services would light up red. If those were unavailable for one hour, AWS would record three hours of downtime.
Was this reflected in their bar graph or not?
Also, GCP has had a number of global events, e.g. the inability to modify any load balancer for >3 hours last year, which AWS has NEVER had (unless you count when AWS was the only cloud with one region).
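For what it's worth, the per-service counting in the quoted passage is easy to check with a few lines of Python. This is only a sketch of the arithmetic: the `Incident` type, both helper functions, and the sample services are hypothetical, and nothing here says which method the article's bar graph actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One hypothetical status-page incident (illustrative data, not real)."""
    services: tuple  # services marked red during the incident
    hours: float     # how long the incident lasted


def per_service_hours(incidents):
    """Count each affected service separately, as the quoted passage describes."""
    return sum(len(i.services) * i.hours for i in incidents)


def per_incident_hours(incidents):
    """Count each incident once, regardless of how many services it touched."""
    return sum(i.hours for i in incidents)


# The example from the quote: three services unavailable for one hour.
incidents = [Incident(services=("EC2", "EBS", "RDS"), hours=1.0)]

print(per_service_hours(incidents))   # 3.0
print(per_incident_hours(incidents))  # 1.0
```

If one vendor's number is computed the first way and another's the second way, the totals are not comparable without normalizing first.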
donavanm|6 years ago
Source: I'm a principal at AWS, historically focused on infrastructure and availability/operations, have been on call for 20 years, and do some internal incident management as my job.