Semi-related: if you ever feel the need to report times to a global audience, not only make sure to always report the timezone (even if it is the same as the user's), but also use UTC offsets rather than timezone names.
Life is too short to remember what each timezone name means and to do the conversion; UTC offsets are much easier on the mental calculator.
The inconsistency with timezones across different services in the AWS console has always baffled and annoyed me. Some places show a time without a timezone, and I can never tell right away whether it's UTC, local time, or region time.
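The point about offsets can be made concrete with a small Python sketch (the incident timestamp here is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# A made-up incident timestamp, stored internally as UTC.
incident = datetime(2023, 6, 13, 19, 8, tzinfo=timezone.utc)

# "PDT" forces the reader to recall that it means UTC-7; printing the
# explicit offset makes the mental conversion back to UTC trivial.
pdt = timezone(timedelta(hours=-7))

print(incident.isoformat())                  # 2023-06-13T19:08:00+00:00
print(incident.astimezone(pdt).isoformat())  # 2023-06-13T12:08:00-07:00
```

Same instant both times, but the second line carries its own conversion key.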
Technically PDT is always 7 hours behind UTC. PST is always 8 hours behind. We just change which one we use twice a year. Pacific time makes sense when you realize Fremont is the center of the universe.
The outage is in Virginia so PDT isn't even local time. On their status page they are asking users to access the console via a region specific endpoint like https://us-west-2.console.aws.amazon.com. Wonder if the PDT timestamp is because they have to serve the status page from US West right now.
The fact that the timezone used in the announcement is the main complaint is a sign of progress... AWS announced it pretty quickly, gave nice updates, and seems to have fixed the problem quickly enough. I'm interested to see the postmortem...
When I was with AWS I advocated for ISO 8601 "Z" (UTC) timestamps whenever I could influence things, say for internal systems.
If all systems spoke that format we'd save tens of thousands of man-hours; just do the conversion for us mortals where needed. The tech side of incidents is definitely "system", and I'd argue more often than not consumers of AWS are also on the tech side, with systems running in UTC, so the health dashboards should also be a UTC-first system. Doubt this could get prioritized though.
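The "UTC-first, convert for the humans" idea is simple to sketch in Python; the function names here are mine, not any AWS tooling:

```python
from datetime import datetime, timedelta, timezone

def iso8601_z(dt: datetime) -> str:
    """Normalize any timezone-aware datetime to ISO 8601 with a 'Z' suffix."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def for_humans(dt: datetime, local_tz: timezone) -> str:
    """Systems speak UTC; do the conversion for us mortals at display time."""
    return dt.astimezone(local_tz).strftime("%Y-%m-%d %H:%M %Z")

# e.g. a status update logged at 12:08 PDT (UTC-7):
event = datetime(2023, 6, 13, 12, 8, tzinfo=timezone(timedelta(hours=-7)))
print(iso8601_z(event))  # 2023-06-13T19:08:00Z
```

Systems interchange only the "Z" form; the local rendering is a presentation concern.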
It doesn't matter if your infra is in another region, because there will almost always be transitive dependencies on us-east-1; IAM, for example, is deployed there.
I have never had a production issue in other regions due to a us-east-1 outage. The worst that ever happened was having to wait to update a CloudFront distribution because the control plane (based in us-east-1) was down, but the existing configuration kept working fine throughout.
I don't know what the architecture of IAM looks like, but somehow it's never suffered a global outage.
My Whole Foods grocery pickup order was affected by this outage. They couldn’t check me in. Groceries were packed in the fridge but they told me to come back later. What a waste of time.
I wonder if this is a coincidence or if us-east-1 is simply down enough that I'm just experiencing selection bias; but I posted a poll on twitter earlier today: https://twitter.com/dijit/status/1668678588713824257
I guess a demanding customer would have said 'you should have implemented disaster recovery so you could failover to us-east-2' but that's easier said than done. The more regional AWS services you adopt, the bigger the impact is. How does one recover from a regional outage if their pipeline is in that region?
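Failing over to another region really is easier said than done; the read path is the easy half. A minimal sketch, assuming the data is already replicated to the secondary region (which is the hard part); the fetcher callables are hypothetical stand-ins for real regional clients:

```python
class RegionDown(Exception):
    """Raised when a region's endpoint is unreachable or erroring."""
    pass

def fetch_with_failover(fetchers):
    """Try each (region, fetch) pair in order; return the first success."""
    last_err = None
    for region, fetch in fetchers:
        try:
            return region, fetch()
        except RegionDown as err:
            last_err = err  # fall through to the next region
    raise last_err

# Hypothetical regional fetchers for illustration:
def from_us_east_1():
    raise RegionDown("us-east-1 is having a bad day")

def from_us_east_2():
    return {"order": 42}

region, data = fetch_with_failover(
    [("us-east-1", from_us_east_1), ("us-east-2", from_us_east_2)]
)
print(region)  # us-east-2
```

The write path, replication lag, and failing back afterwards are where the real cost lives, which is exactly the point the comment makes.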
It's important to inform customers about the resiliency of their systems and let them pick how far they want to invest in it.
Then you get to eat popcorn when stuff explodes.
* single server event. $
* multi server event. $$
* single az event. $$$
* multi az event. $$$$
* global provider event. $$$$$
* cross provider event. $$$$$$
* alien invasion. $$$$$$$$$$$$$$
My employer lets customers choose which of our supported regions to run in and exempts cloud provider outages from our SLA (we’re on the hook for staying up for single AZ outages, but not multi AZ or region outages). We provide tools to help customers replicate their data so they can be multi-region or even multi provider if they want to.
AZs don't really help when it's AWS' own services across the entire region that break. Anecdotally, we have had customers accept outages that were out of our control without penalty.
Ok fine. Running your own datacenter in 2023 is incredibly risky. There's the upfront server cost and the ongoing maintenance cost. There's patching and staffing and disaster planning and all the other things that go into it. Plus there's the cyberinsurance, protections, and security components too.
Do you really think other (smaller) orgs can do a better job of hosting a datacenter than Amazon / Google / Microsoft / Cloudflare? They have some of the brightest minds in the industry working there, and they can price things far lower than anything you could build yourself.
Yes, I get it. All the computing power concentrated in a handful of actors' hands is probably not the most fantastic thing. However, with the price of some cloud vendors compared to the DIY approach, it's hard for organizations to ignore.
If you really want to combat this, make the cost of running your own data center less. Reduce risk. Reduce the amount of money it costs for hiring good people or MSP's. Reduce the cost of acquiring and installing hardware.
Organizations pay attention to dollars so if you want the trend to shift, come up with a less costly alternative to the current cloud offerings.
As a side note, I wonder whether businesses will even accept cash if they can't run it through their POS system. If not, it's a shame that these modern internet-connected POSs lock out something that basic.
It's fun watching each service fail sequentially while the aws service dashboard just updates them to "Informational" status, whatever that means.
Even the management console is down, and their suggested region-specific workaround does not work, at least for us-east-1. I can see some processes via the API, but I don't have code prepared for monitoring every service from my local machine.
intsunny | 2 years ago
AWS can do so many things, reporting critical outage updates in UTC is not one of those things.
kroltan | 2 years ago
rurp | 2 years ago
messe | 2 years ago
Thank you for reminding me about one of my biggest mild annoyances from working at AWS.
mulmen | 2 years ago
cogogo | 2 years ago
joshuanapoli | 2 years ago
utbabya | 2 years ago
adubashi | 2 years ago
luhn | 2 years ago
AWS is really, really good at regional isolation.
mooreds | 2 years ago
Our data plane was fine (for example, EC2 instances and S3 buckets in other regions were fine).
jedberg | 2 years ago
shepherdjerred | 2 years ago
Metaluim | 2 years ago
dveeden2 | 2 years ago
throw03172019 | 2 years ago
vvoyer | 2 years ago
dijit | 2 years ago
Contents:
> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?
> [ ] Yeah, outages free pass
> [ ] No, they say to use AZ's
robrtsql | 2 years ago
Using 3 AZs in us-east-1 won't save you.
kobalsky | 2 years ago
Johnny555 | 2 years ago
kinghajj | 2 years ago
paulddraper | 2 years ago
If your customers are tech, they're too busy running around with their hair on fire too.
tedmiston | 2 years ago
Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and whether they are on a contract tier it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]
Firebase Auth, for instance, offers no SLA at all [1].
I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.
[1]: https://stackoverflow.com/a/60500860/149428
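For context on what an SLA number actually buys you, availability targets translate into concrete downtime budgets. Simple arithmetic, assuming a 30-day month (the figures are generic, not any particular provider's terms):

```python
def allowed_downtime_minutes(availability: float,
                             month_minutes: int = 30 * 24 * 60) -> float:
    """Downtime budget per month for a given availability target."""
    return month_minutes * (1 - availability)

print(round(allowed_downtime_minutes(0.999), 2))   # "three nines": 43.2
print(round(allowed_downtime_minutes(0.9999), 2))  # "four nines": 4.32
```

A multi-hour regional outage blows through either budget, which is why upstream-outage carve-outs in SLAs matter so much.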
mrobins | 2 years ago
jmacjmac | 2 years ago
throaway87c10f0 | 2 years ago
Edit: for those who would downvote: HN _just_ yesterday: https://news.ycombinator.com/item?id=36295352 https://news.ycombinator.com/item?id=36295305
SamuelAdams | 2 years ago
dijit | 2 years ago
Everyone knows; nobody seems to care.
Another comment of mine in this thread asks whether you can excuse downtime of your service due to AWS outages.
Consensus seems to be: yes.
Which is a pretty huge deal, well worth the insane cost of AWS by itself; no other hosting provider would grant you such an excuse.
I would weep for the centralised future of the internet, but it's already here, so there's no point.
andersrs | 2 years ago
hx833001 | 2 years ago
unknown | 2 years ago
[deleted]
ulrashida | 2 years ago
FullyFunctional | 2 years ago
This is a good reminder to avoid cloud-centric products, but they are getting harder and harder to avoid.
MrBruh | 2 years ago
whoisjuan | 2 years ago
I have always stayed away from that region because it seems significantly less reliable than other regions.
assimpleaspossi | 2 years ago
noradbase | 2 years ago
arixzajicek | 2 years ago
chaosmachine | 2 years ago
https://ca-central-1.console.aws.amazon.com/console/home
This assumes you don't actually need anything from us-east-1, though :)
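The region-specific endpoints quoted in this thread follow a predictable pattern; a tiny helper (the URL scheme is inferred from the two endpoints mentioned here, not from AWS documentation):

```python
def console_url(region: str) -> str:
    """Build a region-specific AWS console URL, per the pattern above."""
    return f"https://{region}.console.aws.amazon.com/console/home"

print(console_url("ca-central-1"))
# https://ca-central-1.console.aws.amazon.com/console/home
```

Handy to keep around for the next time the default console entry point (which depends on us-east-1) is unreachable.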
impulser_ | 2 years ago
We don't use big cloud where I work, so maybe I'm missing something. Does us-east-1 offer something the others don't?
thedigitalone | 2 years ago
jjice | 2 years ago
xyst | 2 years ago
VirusNewbie | 2 years ago
gerenuk | 2 years ago
grumple | 2 years ago
nathants | 2 years ago
I wonder if it will work first try? The true test of devops culture.
gawshinde | 2 years ago
ciguy | 2 years ago