AWS us-east-1 down

658 points| rurp | 2 years ago

The status page says everything is fine though.

313 comments

[+] intsunny|2 years ago|reply
For those wondering: Currently PDT is 7 hours behind UTC.

AWS can do so many things, reporting critical outage updates in UTC is not one of those things.

[+] kroltan|2 years ago|reply
Semi-related: if you ever feel the need to report times to a global audience, not only make sure to always report the timezone (even if it is the same as the user's), but also use UTC offsets rather than timezone names.

Life is too short to remember what each timezone name means and to convert to it; UTC offsets are much easier on the mental calculator.
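For instance (a quick Python sketch of my own, nothing AWS-specific): an ISO 8601 timestamp with an explicit offset can be read anywhere without a timezone-name lookup:

```python
from datetime import datetime, timezone

# The instant AWS reported as "12:08 PM PDT", rendered with an explicit
# UTC offset instead of a timezone name. No mental lookup table required.
ts = datetime(2023, 6, 13, 19, 8, tzinfo=timezone.utc)
print(ts.isoformat())  # 2023-06-13T19:08:00+00:00
```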

[+] rurp|2 years ago|reply
The inconsistency with timezones across different services in the AWS console has always baffled and annoyed me. Some places show a time without a timezone, and I can never tell right away if it's UTC, local time, or region time.
[+] messe|2 years ago|reply
> AWS can do so many things, reporting critical outage updates in UTC is not one of those things.

Thank you for reminding me about one of my biggest mildest annoyances from working at AWS.

[+] mulmen|2 years ago|reply
Technically PDT is always 7 hours behind UTC. PST is always 8 hours behind. We just change which one we use twice a year. Pacific time makes sense when you realize Fremont is the center of the universe.
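This is easy to check with Python's stdlib `zoneinfo` (illustrative only): the `America/Los_Angeles` zone flips between the two fixed offsets, but each named abbreviation stays put:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

la = ZoneInfo("America/Los_Angeles")
summer = datetime(2023, 6, 13, tzinfo=la)  # DST in effect -> PDT
winter = datetime(2023, 1, 13, tzinfo=la)  # standard time -> PST

assert summer.utcoffset() == timedelta(hours=-7)  # PDT is always UTC-7
assert winter.utcoffset() == timedelta(hours=-8)  # PST is always UTC-8
assert summer.tzname() == "PDT" and winter.tzname() == "PST"
```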
[+] cogogo|2 years ago|reply
The outage is in Virginia so PDT isn't even local time. On their status page they are asking users to access the console via a region specific endpoint like https://us-west-2.console.aws.amazon.com. Wonder if the PDT timestamp is because they have to serve the status page from US West right now.
[+] joshuanapoli|2 years ago|reply
The fact that the choice of timezone is the complaint here is a sign of progress... AWS announced the outage pretty quickly, gave nice updates, and seems to have fixed the problem quickly enough. I'm interested to see the postmortem...
[+] utbabya|2 years ago|reply
When I was with AWS I advocated for ISO 8601 "Z" whenever I could or needed to influence something, say internal systems.

If all systems talked this, we'd save tens of thousands of man-hours; just do the conversion for us mortals where necessary. The tech side of incidents is definitely "systems", and I'd argue consumers of AWS are more often than not also on the tech side, with systems in UTC, so health dashboards should be a UTC-first system too. Doubt this could get prioritized though.
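A minimal sketch of the convention being advocated (my own illustration, not an AWS-internal API): normalize every outgoing timestamp to ISO 8601 UTC with a `Z` suffix:

```python
from datetime import datetime, timezone, timedelta

def iso_z(dt: datetime) -> str:
    """Render an aware datetime as ISO 8601 UTC with a 'Z' suffix."""
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")

# "Jun 13 12:08 PM PDT" from the status page, expressed as PDT (UTC-7):
pdt = timezone(timedelta(hours=-7))
print(iso_z(datetime(2023, 6, 13, 12, 8, tzinfo=pdt)))  # 2023-06-13T19:08:00Z
```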

[+] adubashi|2 years ago|reply
It doesn't matter if your infra is in another region, because there will almost always be transitive dependencies on us-east-1. IAM, for example, is deployed in us-east-1.
[+] luhn|2 years ago|reply
I have never had a production issue in other regions due to a us-east-1 outage. The worst that ever happened was I had to wait to update a Cloudfront distribution because the control plane (based in us-east-1) was down, but the existing configuration continued working fine throughout.

I don't know what the architecture of IAM looks like, but somehow it's never suffered a global outage.

AWS is really, really good at regional isolation.

[+] mooreds|2 years ago|reply
Control plane will almost always be impacted, I agree.

Our data plane was fine (for example, ec2 instances and s3 buckets in other regions were fine).

[+] jedberg|2 years ago|reply
Usually it only prevents changes, but the runtime isn't affected.
[+] shepherdjerred|2 years ago|reply
I thought there was some recent shift on making IAM multi-region?
[+] Metaluim|2 years ago|reply
So much for redundancy I guess.
[+] dveeden2|2 years ago|reply
https://health.aws.amazon.com/health/status reports:

  Increased Error Rates and Latencies
  Jun 13 12:08 PM PDT We are investigating increased error rates and latencies in the US-EAST-1 Region.
They list Lambda as the only affected service.
[+] throw03172019|2 years ago|reply
My Whole Foods grocery pickup order was affected by this outage. They couldn’t check me in. Groceries were packed in the fridge but they told me to come back later. What a waste of time.
[+] dijit|2 years ago|reply
I wonder if this is a coincidence or if us-east-1 is simply down enough that I'm just experiencing selection bias; but I posted a poll on twitter earlier today: https://twitter.com/dijit/status/1668678588713824257

Contents:

> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?

> [ ] Yeah, outages free pass

> [ ] No, they say to use AZ's

[+] robrtsql|2 years ago|reply
> No, they say to use AZ's

Using 3 AZs in us-east-1 won't save you.

I guess a demanding customer would have said 'you should have implemented disaster recovery so you could failover to us-east-2' but that's easier said than done. The more regional AWS services you adopt, the bigger the impact is. How does one recover from a regional outage if their pipeline is in that region?

[+] kobalsky|2 years ago|reply
it's important to inform customers about the resiliency of their systems and let them pick how far they are going to invest for it.

then you get to eat popcorn when stuff explodes.

  * single server event.   $
  * multi server event.    $$
  * single az event.       $$$
  * multi az event.        $$$$
  * global provider event. $$$$$
  * cross provider event.  $$$$$$
  * alien invasion.        $$$$$$$$$$$$$$
[+] Johnny555|2 years ago|reply
My employer lets customers choose which of our supported regions to run in and exempts cloud provider outages from our SLA (we’re on the hook for staying up for single AZ outages, but not multi AZ or region outages). We provide tools to help customers replicate their data so they can be multi-region or even multi provider if they want to.
[+] kinghajj|2 years ago|reply
AZs don't really help when it's AWS' own services across the entire region that break. Anecdotally, we have had customers accept outages that were out of our control without penalty.
[+] paulddraper|2 years ago|reply
Depends on your customers.

If your customers are tech, they're too busy running around with their own hair on fire.

[+] tedmiston|2 years ago|reply
> Has anyone ever actually had customers accept an outage because AWS was down...

Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and whether they are on a contract tier it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]

Firebase Auth, for instance, offers no SLA at all [1].

I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.

[1]: https://stackoverflow.com/a/60500860/149428

[+] mrobins|2 years ago|reply
I can think of more times when a whole region has had issues than times when just one AZ went dark and failover happened seamlessly.
[+] jmacjmac|2 years ago|reply
Maybe cheaper regions have more users and have higher outage rates
[+] throaway87c10f0|2 years ago|reply
Mysterious lack of "AWS is bad for the internet because it is so centralized" dialog up in here.

edit: for those that would downvote: HN _just_ yesterday: https://news.ycombinator.com/item?id=36295352 https://news.ycombinator.com/item?id=36295305

[+] SamuelAdams|2 years ago|reply
Ok fine. Running your own datacenter in 2023 is incredibly risky. There's the upfront server cost and the ongoing maintenance cost. There are patches and staffing and disaster planning and all the other things that go into it. Plus there's the cyberinsurance and protections and security components too.

Do you really think other (smaller) orgs can do a better job of hosting a datacenter than Amazon / Google / Microsoft / Cloudflare? They have some of the brightest minds in the industry working there, and they can price things far lower than anything you can build yourself.

Yes, I get it. All the computing power in a handful of actors' hands is probably not the most fantastic thing. However, with the price of some cloud vendors compared to the DIY approach, it's hard for organizations to ignore.

If you really want to combat this, make running your own data center cost less. Reduce risk. Reduce what it costs to hire good people or MSPs. Reduce the cost of acquiring and installing hardware.

Organizations pay attention to dollars so if you want the trend to shift, come up with a less costly alternative to the current cloud offerings.

[+] dijit|2 years ago|reply
It's just tired at this point.

Everyone knows; nobody seems to care.

Another comment of mine in this thread asks the question if you can excuse downtime of your service due to AWS outages.

Consensus seems to be: yes

which is a pretty huge deal, well worth the insane cost increase of AWS by itself. No other hosting provider would grant you such an excuse.

I would weep for the centralised future of the internet, but it's already here, so there's no point.

[+] andersrs|2 years ago|reply
It's a mob mentality. Safety in numbers. "Oh well, my site is down but so is my neighbour's so nobody will be that mad about it."
[+] hx833001|2 years ago|reply
Too techie and doing things the right way, so CF shouldn’t be successful? Therefore… jealousy? That’s my guess as to why all the hacker news hate.
[+] ulrashida|2 years ago|reply
Why a throwaway for this post? Not like this is some deep whistleblowing or career risk.
[+] FullyFunctional|2 years ago|reply
I "love" it when my vacuum stops working because an online bookseller's servers went down. #modernlife

This is a good reminder to avoid cloud-centric products, but they are getting harder and harder to avoid.

[+] MrBruh|2 years ago|reply
Did this actually happen? The vacuum part
[+] whoisjuan|2 years ago|reply
Why is it always us-east-1 though?

I have always stayed away from that region because it seems significantly less reliable than other regions.

[+] assimpleaspossi|2 years ago|reply
My son delivers part-time for Amazon and all the drivers at his warehouse were sent home. So if your delivery is late or non-existent today....
[+] noradbase|2 years ago|reply
I guess I'll use the downtime to see what's new on Reddi... oh... yeah.
[+] arixzajicek|2 years ago|reply
This happened less than an hour after I altered our prod schema; I thought I had brought down production. What a relief.
[+] impulser_|2 years ago|reply
Why does everyone keep deploying their products to this one region when it always seems to be the one that fails?

We don't use big cloud where I work, so maybe I'm missing something. Does us-east-1 offer something the others don't?

[+] thedigitalone|2 years ago|reply
Toast POS is down 100%, don't go out to lunch.
[+] jjice|2 years ago|reply
As a side note, I wonder whether businesses will even accept cash if they can't go through their POS system. If not, it's a shame that these modern internet-connected POSes lock out stuff like that.
[+] xyst|2 years ago|reply
I can see some restaurants just comp’ing the tickets out and having toast foot the bill in lost sales
[+] grumple|2 years ago|reply
It's fun watching each service fail sequentially while the aws service dashboard just updates them to "Informational" status, whatever that means.

Even the management console is down, and their suggested region-specific workaround does not work, at least for us-east-1. I can see some processes via the API, but I don't have code prepared for monitoring every service from my local machine.

[+] nathants|2 years ago|reply
finally an opportunity to test a full deploy from scratch, and restore from backup, in a new region.

i wonder if it will work first try? the true test of devops culture.

[+] ciguy|2 years ago|reply
It appears to be an outage in IAM which is trickling down to every service which relies on IAM auth.