I can't help but wonder, with the increase in attrition across the industry, whether we're hitting some kind of tipping point where the institutional knowledge in these massive tech corporations is disappearing.
Mistakes happen all the time but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.
I've been an SRE on a tier 1 GCP product for over three years, and this is not the case. In my experience, our systems have only gotten more reliable and easier to understand since I joined.
It's not like there are only a few key knowledge holders who single-handedly prevent outages from happening. In reality, you don't need to know shit about how a system works internally to prevent outages if things are set up correctly.
In theory, my dog should be able to sit on my laptop and submit a prod-breaking change without any fear that it will make it to prod and damage the system, because automated testing/canarying should catch it and, if it does make it to prod, we should be able to detect and mitigate it before the change affects more users, using probes or whitebox monitoring.
This happens for 99.9% of potential issues and is completely invisible to users. However, it's what's not caught (the remaining 0.1%) that actually matters.
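The canary gate described above can be sketched in a few lines; the thresholds, counts, and function name below are made-up illustrations, not GCP's actual rollout logic:

```python
# Minimal sketch of a canary error-rate gate: compare the canary's
# error rate against the stable baseline and block the rollout if it
# regresses past a tolerance. All numbers here are illustrative.

def canary_passes(baseline_errors: int, baseline_total: int,
                  canary_errors: int, canary_total: int,
                  tolerance: float = 0.001) -> bool:
    """Return True if the canary may proceed to full rollout."""
    if canary_total == 0:
        return False  # no traffic observed; don't promote blindly
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate + tolerance

# A change that pushes errors from 0.1% to 5% gets stopped:
print(canary_passes(10, 10_000, 50, 1_000))  # False
print(canary_passes(10, 10_000, 1, 1_000))   # True
```

The point of the dog-on-laptop argument is exactly this: the gate needs no knowledge of the system's internals, only of its externally observable error rate.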
Outside of the mega-FAANG world, I'm wondering the same thing.
The Great Resignation had to have taken a huge toll on regular enterprises. There are probably going to be some unlucky (or lucky, depending on how hardcore they are) people in the position of maintaining aging legacy systems and retrofitting them into the future.
COBOL, for example, is becoming a lucrative language for people in the financial and insurance industries. Legacy Java is all over the place, I'm sure. Legacy .NET is in the middle of a huge industry retrofit (.NET 5 was the official post-legacy rebrand, and they're on to .NET 6+ now).
You're right, but that's been true since the beginning of the tech boom (and isn't exclusive to tech): no one works at one place for several decades anymore. Companies weather this in different ways, but attrition has always been around.
What's causing people to believe that the latest round of attrition is any different?
Yes, absolutely.
Within my own org of ~50 people, 15% resigned or had contracts end during Q1 (after 15% in Q4).
Of the remaining 85% of the org, 20% have been around since before COVID and 65% joined during COVID.
Of our senior engineers and team leads, 70% joined in the last 6-9 months.
Only 3 full-time senior engineers have 2+ years of tenure.
We've grown during COVID but we've also just burned through people.
Turnover has hit the point where we stopped doing going-away Zoom toasts; people just sort of disappeared.
I think this is a transient issue. When you're in growth mode you make a huge series of hacks just to keep things running, and then when you leave... well, it's a problem. But if the business is robust and lives beyond you, what replaces your work is better documented, better tested, and more maintainable.
That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.
Fundamentally I think the cloud business is robust; it's a reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.
I've been in this situation at much smaller scales, and yes, you'll see a massive drop in productivity, but that's the cost of going from prototype to product.
Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted in the "Personal Health Dashboard" in the AWS console.
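A check like the telemetry described, distinguishing an NXDOMAIN from other resolver failures, can be sketched as follows; the function name and classification buckets are my own, not anyone's actual monitoring code:

```python
import socket

def classify_lookup(hostname: str) -> str:
    """Classify a DNS lookup the way outage telemetry might."""
    try:
        socket.getaddrinfo(hostname, 443)
        return "ok"
    except socket.gaierror as e:
        # getaddrinfo surfaces an NXDOMAIN as EAI_NONAME;
        # anything else (e.g. EAI_AGAIN) is some other resolver failure.
        if e.errno == socket.EAI_NONAME:
            return "nxdomain"
        return "dns-error"

# In normal times: classify_lookup("sqs.us-east-1.amazonaws.com") -> "ok"
print(classify_lookup("localhost"))
```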
It takes a while to find a Vice President, I guess.
What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data. Are there issues affecting the internet backbone or something? Or just a coincidence?
Important to keep in mind that AWS has 250 services in 84 Availability Zones in 26 regions.
This outage is reportedly impacting 5 services in 1 region.
For those impacted, pretty terrible. But as a heavy user of AWS, I’ve seen these notices posted multiple times on HN and haven’t been impacted by one yet.
Protip for anyone building new infrastructure on AWS: if you're only going to use one US region, make it us-east-2 or us-west-2. us-east-1 is their oldest and biggest region but also the least stable in the US (OK, technically us-west-1 is worse, but you can't get that one anymore).
Somehow AWS managed to make their new status page more opaque than the old one. It's like they want you to scroll through their gigantic list so they can fix the issue before you find the right line.
Given the total amount of money I've lost due to a single AZ being down, it has been totally worth it NOT to go multi-AZ or multi-region so far.
Multi-AZ isn't that hard, but it generally adds costs (one NAT gateway per AZ, etc.).
But multi-region on AWS is a royal pain in the ass. Many services (like SSO) don't play well with multi-region setups, which complicates things even if you've IaC'd your whole stack.
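To put a rough number on the "one NAT gateway per AZ" point, here's a back-of-the-envelope sketch; the hourly price is an assumed illustrative figure, not a current quote, and data-processing charges are ignored:

```python
# Ballpark the fixed monthly cost of running one NAT gateway per AZ.
HOURLY_NAT_GW = 0.045   # assumed $/hour per NAT gateway (illustrative)
HOURS_PER_MONTH = 730   # average hours in a month

def monthly_nat_cost(num_azs: int) -> float:
    """Fixed monthly NAT gateway cost for num_azs AZs, data transfer excluded."""
    return num_azs * HOURLY_NAT_GW * HOURS_PER_MONTH

print(round(monthly_nat_cost(1), 2))  # single-AZ setup
print(round(monthly_nat_cost(3), 2))  # typical 3-AZ setup
```

Under these assumptions, going from one AZ to three roughly triples a cost that exists purely for redundancy, which is the commenter's complaint in miniature.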
Seems like there's a conflict of interest between making a single AZ more robust (so it never goes down, or has its own internal redundancy) and the increased revenue from multi-AZ deployments.
What's the point of the cloud if we have to manage the robustness of their infrastructure ourselves? I can understand it for natural disasters and earthquakes, but the idea should be that a single AZ never goes down barring extraordinary circumstances. AWS should be auto-balancing, handling the downtime of a single AZ without the customer ever noticing.
It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, traffic is automatically routed through others, transparent and painless to the customer. I understand AWS is huge and different services have different redundancy mechanisms, but conceptually it feels like they have a conflict of interest around increasing the robustness of their data centers: "We told you to have a multi-AZ deployment, not our fault."
Another way to put this: as an AWS customer, make sure to multiply all costs by 3x, plus the management overhead of a multi-AZ deployment, in your total cost estimate.
I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.
Might be true for running stuff in different regions/AZs, but if the provisioning region is down (e.g. for deploying Lambda@Edge), one does not really have an alternative.
In our case (Apify.com) there was a complete outage of SQS (15+ minutes), most likely DNS problems; EC2 instances also got restarted, probably as a result of the SQS outage.
EDIT: AWS Lambda also seems to be down, and the AWS EC2 APIs are showing a very high error rate and slow machine startup times.
Noticed issues with SQS for a couple of minutes. Errors from the Java SDK: `com.amazonaws.SdkClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com`
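A generic retry-with-exponential-backoff wrapper of the sort that rides out a brief endpoint blip like this might look as follows; note that the real AWS SDKs ship their own configurable retry policies, so this is only an illustration:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 0.1):
    """Call fn(), retrying transient I/O errors with exponential backoff.

    OSError covers ConnectionError and socket-level failures; the last
    failure is re-raised so callers still see persistent outages.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo: a call that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("sqs.us-east-1.amazonaws.com unreachable")
    return "ok"

print(with_retries(flaky, base_delay=0.05))  # ok
```

For a couple-of-minutes outage like the one described, a wrapper like this only helps callers that can afford to wait; longer incidents still surface as errors.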
nyellin | 4 years ago:
We're slowly but surely converting the world's institutional technical knowledge into re-usable and automated runbooks.
easton | 4 years ago:
"If you're having SLA problems I feel bad for you son
I got two 9 problems cuz of us-east-1"
frays | 4 years ago:
A lot of institutional knowledge in these massive tech corporations is disappearing and we're starting to reach the tipping point.
thethethethe | 4 years ago:
Source?
9wzYQbTYsAIc | 4 years ago:
https://www.cisa.gov/shields-up
etaioinshrdlu | 4 years ago:
Do they acknowledge the problem?
It's been a joke for years how bad us-east-1 is.
consumer451 | 4 years ago:
It's the only way to be sure