ipmb|4 years ago

Looks like they've acknowledged it on the status page now. https://status.aws.amazon.com/

> 8:22 AM PST We are investigating increased error rates for the AWS Management Console.

> 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/
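
If you want to check which regional console endpoints are actually reachable, a quick unauthenticated probe from the shell does the job (a minimal sketch; the region list is just an example):

    # Probe a few region-specific console endpoints and print the HTTP status
    for region in us-west-2 eu-west-1 ap-southeast-1; do
      curl -s -o /dev/null -w "%{http_code}  ${region}\n" \
        "https://${region}.console.aws.amazon.com/"
    done

An unauthenticated request normally gets a 3xx redirect to the sign-in page, so anything in the 5xx range suggests that endpoint itself is unhealthy.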

jesboat|4 years ago

> This issue is affecting the global console landing page, which is also hosted in US-EAST-1

Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.

all_usernames|4 years ago

Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.

tekromancr|4 years ago

I just want to serve 5 terabytes of data

ithkuil|4 years ago

One region? I forgot how to count that low

sangnoir|4 years ago

> At a different (unnamed) FAANG

I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.

1. https://www.youtube.com/watch?v=3t6L-FlfeaI

alfiedotwtf|4 years ago

Maybe it has something to do with CloudFront mandating that certs live in us-east-1?
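
For anyone who hasn't hit this: ACM certificates used with CloudFront must be requested in us-east-1 no matter where the rest of your stack lives, e.g. (the domain here is a placeholder):

    # CloudFront only accepts ACM certificates issued in us-east-1,
    # so the request has to target that region explicitly
    aws acm request-certificate \
      --domain-name example.com \
      --validation-method DNS \
      --region us-east-1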

ehsankia|4 years ago

Forget the number of regions. Monitoring for X shouldn't even be hosted on X at all...

hericium|4 years ago

> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

They're cheap. HA is something their customers pay extra for, not something Amazon applies to itself, and Amazon often lies during major outages. They would lose money building HA for themselves and they would lose money acknowledging downtime, so they will lie as long as they benefit from it.

sheenobu|4 years ago

I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder weren't multi-region for a long time. The servers were, because they were stateless and that made multi-region easy, but the actual data wasn't until we replaced the storage service.

balls187|4 years ago

MAANG*

How long before Meta takes over for Facebook?

jabiko|4 years ago

Yeah, but I still have a different understanding of what "Increased Error Rates" means.

IMHO it should mean that the rate of errors is elevated but the service can still serve a substantial amount of traffic. If the error rate is higher than, say, 90%, that's not an increased error rate; that's an outage.
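
As a rough illustration of the distinction, assuming a combined-format access log with the status code in field 9, something like this computes the actual error rate:

    # Count 5xx responses as a share of all requests (field 9 = status code)
    awk '$9 >= 500 { errors++ } { total++ }
         END { printf "error rate: %.1f%%\n", 100 * errors / total }' access.log

A couple of percent arguably qualifies as "increased error rates"; 90%+ is an outage by any honest definition.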

thallium205|4 years ago

They say that to try to avoid their SLA commitments.

jeremyjh|4 years ago

They are still lying about it: the issues are affecting not only the console but also AWS operations such as S3 puts. S3 still shows green.

lsaferite|4 years ago

It's certainly affecting a wider range of stuff from what I've seen. I'm personally having issues with API Gateway, CloudFormation, S3, and SQS.

packetslave|4 years ago

IAM is a "global" service for AWS, where "global" means "it lives in us-east-1".

STS at least has recently started supporting regional endpoints, but most things involving users, groups, roles, and authentication are completely dependent on us-east-1.
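
If you're on a reasonably recent SDK or CLI, you can opt into those regional STS endpoints; a sketch with the AWS CLI:

    # Prefer the regional STS endpoint over the global one in us-east-1
    aws configure set sts_regional_endpoints regional

    # Or as a per-process override
    export AWS_STS_REGIONAL_ENDPOINTS=regional

    # Token requests then hit sts.<region>.amazonaws.com, e.g.:
    aws sts get-caller-identity --region us-west-2

That only helps with token issuance, though; IAM control-plane calls for users, groups, roles, and policies still go through us-east-1.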

Rantenki|4 years ago

Yep, I am seeing failures on IAM as well:

    aws iam list-policies

    An error occurred (503) when calling the ListPolicies operation (reached max retries: 2): Service Unavailable
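
For transient 503s (as opposed to a full outage like this one), bumping the CLI's built-in retry settings can smooth things over; a sketch:

    # Adaptive retry mode adds client-side rate limiting,
    # and max_attempts raises the retry ceiling per call
    aws configure set retry_mode adaptive
    aws configure set max_attempts 5

It won't save you when the service itself is down, but it helps with brief error spikes.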

Rapzid|4 years ago

I still can't create/destroy/etc CloudFront distros. They are stuck in "pending" indefinitely.
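
If it helps anyone else keep an eye on theirs, the deployment state is visible from the CLI (the distribution ID here is a placeholder):

    # Status reads "InProgress" while a change is deploying, "Deployed" when done
    aws cloudfront get-distribution \
      --id E1A2B3C4D5E6F7 \
      --query 'Distribution.Status' \
      --output text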

stephenr|4 years ago

When I brought up the status page (because we were seeing failures trying to use Amazon Pay), it showed EC2 and the Management Console with issues.

I opened it again just now (maybe 10 minutes later) and it now shows DynamoDB has issues.

If past incidents are anything to go by, it's going to get worse before it gets better. Rube Goldberg machines aren't known for their resilience to internal faults.

JPKab|4 years ago

As a user of Sagemaker in us-east-1, I deeply fucking resent AWS claiming the service is normal. I have extremely sensitive data, so Sagemaker notebooks and certain studio tools make sense for me. Or DID. After this I'm going back to my previous formula of EC2 and hosting my own GPU boxes.

Sagemaker is not working: I can't get to my work (my notebook instance is frozen on launch, with zero way to stop or restart it), and Sagemaker Studio is also broken right now.

The length of this outage has blown my mind.

wahern|4 years ago

You don't use AWS because it has better uptime. If you've been around the block enough times, this story has always rung hollow.

Rather, you use AWS because when it is down, it's down for everybody else as well. (Or at least they can nod their head in sympathy for the transient flakiness everybody experiences.) Then it comes back up and everybody forgets about the outage like it was just background noise. This is what's meant by "nobody ever got fired for buying (IBM|Microsoft)". The point is that when those products failed, you wouldn't get blamed for making that choice; in their time they were the one choice everybody excused even when it was an objectively poor choice.

As for me, I prefer hosting all my own stuff. My e-mail uptime is better than GMail's, for example. However, when it is down or mail does bounce, I can't pass the buck.

markus_zhang|4 years ago

Looks like they removed some 9s from their availability in one day. I wonder if more people are considering moving away from the cloud.

guenthert|4 years ago

Uh, four minutes to identify the root cause? Damn, those guys are on fire.

Frost1x|4 years ago

Identify, or publicly acknowledge? Chances are the technical teams noticed this fairly quickly and had been working on the issue for some time. They probably didn't acknowledge it publicly until they had identified the root cause and had a handful of mitigation strategies they were confident in, to save face.

I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It lets you maintain an image of expertise to outsiders who care about the broken thing but aren't savvy to what's broken or why. Meanwhile you spend hours, days, or weeks addressing the issue, then pull a magic solution out of your hat and look like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing, which is very valuable if breaking something carried real risk to you.

czbond|4 years ago

:) I imagine it went something like this theoretical Slack conversation:

> Dev1: Pushing code for branch "master" to "AWS API".

> <slackbot> Your deploy finished in 4 minutes

> Dev2: I can't reach the API in east-1

> Dev1: Works from my computer

flerchin|4 years ago

The outage started at 7:31 PST according to our monitoring. They are on fire, but not in a good way.

tonyhb|4 years ago

It was down as of 7:45am (we posted in our engineering channel), so that's a good 40 minutes of public errors before the root cause was figured out.

giorgioz|4 years ago

I'm trying to log in to the AWS Console from other regions but I'm getting HTTP 500. Has anyone managed to log in to another region's console? Which ones?

Our backend is failing; it's in us-east-1, using AWS Lambda, API Gateway, and S3.

pbreit|4 years ago

I like how, 6 hours in, it's "Many services have already recovered".

bobviolier|4 years ago

https://status.aws.amazon.com/ still shows all green for me

banana_giraffe|4 years ago

It's acting odd for me. Shows all green in Firefox, but shows the error in Chrome even after some refreshes. Not sure what's caching where to cause that.
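
One way to take the browser out of the equation is to fetch the page headers directly and look at the cache-related ones (a rough check, assuming a CDN sits in front):

    # Inspect caching headers to see whether a stale copy is being served
    curl -sI https://status.aws.amazon.com/ \
      | grep -iE '^(age|cache-control|x-cache|etag):'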