I can't help but wonder, with the increase in attrition across the industry, whether we're hitting some kind of tipping point where the institutional knowledge in these massive tech corporations is disappearing.
Mistakes happen all the time but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.
I've been an SRE on a tier 1 GCP product for over three years, and this is not the case. In my experience, our systems have only gotten more reliable and easier to understand since I joined.
It's not like there are only a few key knowledge holders who single-handedly prevent outages from happening. In reality, you don't need to know shit about how a system works internally to prevent outages if things are set up correctly.
In theory, my dog should be able to sit on my laptop and submit a prod-breaking change without any fear that it will make it to prod and damage the system, because automated testing/canarying should catch it and, if it does make it to prod, we should be able to detect and mitigate it before the change affects more users, using probes or whitebox monitoring.
This happens for 99.9% of potential issues and is completely invisible to users. However, it's what's not caught (the remaining 0.1%) that actually matters.
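The canary gate described above can be sketched in a few lines; the thresholds, counts, and function name below are made-up illustrations, not GCP's actual rollout logic:

```python
# Minimal sketch of a canary error-rate gate: compare the canary's
# error rate against the stable baseline and block the rollout if it
# regresses past a tolerance. All numbers here are illustrative.

def canary_passes(baseline_errors: int, baseline_total: int,
                  canary_errors: int, canary_total: int,
                  tolerance: float = 0.001) -> bool:
    """Return True if the canary may proceed to full rollout."""
    if canary_total == 0:
        return False  # no traffic observed; don't promote blindly
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate + tolerance

# A change that pushes errors from 0.1% to 5% gets stopped:
print(canary_passes(10, 10_000, 50, 1_000))  # False
print(canary_passes(10, 10_000, 1, 1_000))   # True
```

The point of the dog-on-laptop argument is exactly this: the gate needs no knowledge of the system's internals, only of its externally observable error rate.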
Outside of the mega-FAANG world, I'm wondering the same thing.
The Great Resignation had to have taken a huge toll on regular enterprises. There are probably going to be some unlucky (or lucky, depending on how hardcore they are) people in the position of maintaining aging legacy systems and retrofitting them into the future.
COBOL, for example, is becoming a lucrative language for people in the financial and insurance industries. Legacy Java is all over the place, I'm sure. Legacy .NET is in the middle of a huge industry retrofit (.NET 5 was the official post-legacy rebrand, and they're on to .NET 6+ now).
You're right, but that's been true since the beginning of the tech boom (and isn't exclusive to tech): no one works at one place for several decades anymore. Companies weather this in different ways, but attrition has always been around.
What's causing people to believe that the latest round of attrition is any different?
Yes, absolutely.
Within my own org of ~50 people, 15% resigned or had contracts end during Q1 (after 15% in Q4).
Of the remaining 85% of the org, 20% have been around since before COVID and 65% joined during COVID.
Of our senior engineers and team leads, 70% joined in the last 6-9 months.
Only 3 full-time senior engineers have 2+ years of tenure.
We've grown during COVID but we've also just burned through people.
Turnover has hit the point where we stopped doing going-away Zoom toasts; people just sort of disappeared.
I think this is a transient issue. When you're in growth mode you make a huge series of hacks just to keep things running, and then when you leave... well, it's a problem. But if the business is robust and lives beyond you, what replaces your work is better documented, better tested, and more maintainable.
That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.
Fundamentally I think the cloud business is robust; it's a reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.
I've been in this situation at much smaller scales, and yes, you'll see a massive drop in productivity, but that's the cost of going from prototype to product.
Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted in the "Personal Health Dashboard" in the AWS console.
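A check like the telemetry described, distinguishing an NXDOMAIN from other resolver failures, can be sketched as follows; the function name and classification buckets are my own, not anyone's actual monitoring code:

```python
import socket

def classify_lookup(hostname: str) -> str:
    """Classify a DNS lookup the way outage telemetry might."""
    try:
        socket.getaddrinfo(hostname, 443)
        return "ok"
    except socket.gaierror as e:
        # getaddrinfo surfaces an NXDOMAIN as EAI_NONAME;
        # anything else (e.g. EAI_AGAIN) is some other resolver failure.
        if e.errno == socket.EAI_NONAME:
            return "nxdomain"
        return "dns-error"

# In normal times: classify_lookup("sqs.us-east-1.amazonaws.com") -> "ok"
print(classify_lookup("localhost"))
```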
It takes a while to find a Vice President, I guess.
What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data. Are there issues affecting the internet backbone or something? Or just a coincidence?
Important to keep in mind that AWS has 250 services in 84 Availability Zones in 26 regions.
This outage is reportedly impacting 5 services in 1 region.
For those impacted, pretty terrible. But as a heavy user of AWS, I’ve seen these notices posted multiple times on HN and haven’t been impacted by one yet.
Protip for anyone building new infrastructure on AWS: if you're only going to use one US region, make it us-east-2 or us-west-2. us-east-1 is their oldest and biggest region but also the least stable in the US (OK, technically us-west-1 is worse, but you can't get that one anymore).
Somehow AWS managed to make their new status page more opaque than the old one. It's like they want you to scroll through their gigantic list so they can fix the issue before you find the right line.
Given the total amount of money I've lost due to a single AZ being down, it has been totally worth it NOT to go multi-AZ or multi-region so far.
Multi-AZ isn't that hard, but it generally adds costs (one NAT gateway per AZ, etc.).
But multi-region on AWS is a royal pain in the ass. Many services (like SSO) don't play well with multi-region setups, which complicates things even if you've IaC'd your whole stack.
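To put a rough number on the "one NAT gateway per AZ" point, here's a back-of-the-envelope sketch; the hourly price is an assumed illustrative figure, not a current quote, and data-processing charges are ignored:

```python
# Ballpark the fixed monthly cost of running one NAT gateway per AZ.
HOURLY_NAT_GW = 0.045   # assumed $/hour per NAT gateway (illustrative)
HOURS_PER_MONTH = 730   # average hours in a month

def monthly_nat_cost(num_azs: int) -> float:
    """Fixed monthly NAT gateway cost for num_azs AZs, data transfer excluded."""
    return num_azs * HOURLY_NAT_GW * HOURS_PER_MONTH

print(round(monthly_nat_cost(1), 2))  # single-AZ setup
print(round(monthly_nat_cost(3), 2))  # typical 3-AZ setup
```

Under these assumptions, going from one AZ to three roughly triples a cost that exists purely for redundancy, which is the commenter's complaint in miniature.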
Seems like there's a conflict of interest between making a single AZ more robust (so it never goes down, or has its own internal redundancy) and the increased revenue from multi-AZ deployments.
What's the point of the cloud if we have to manage the robustness of their infrastructure ourselves? I can understand it for natural disasters and earthquakes, but the idea should be that a single AZ never goes down barring extraordinary circumstances. AWS should be auto-balancing, handling the downtime of a single AZ without the customer ever noticing.
It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, traffic is automatically routed through others, transparent and painless to the customer. I understand AWS is huge and different services have different redundancy mechanisms, but conceptually it feels like they have a conflict of interest around increasing the robustness of their data centers: "We told you to have a multi-AZ deployment, not our fault."
Another way to put this: as an AWS customer, make sure to multiply all costs by 3x, plus the management overhead of a multi-AZ deployment, in your total cost estimate.
I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.
Might be true for running stuff in different regions/AZs, but if the provisioning region is down (e.g. for deploying Lambda@Edge), one does not really have an alternative.
In our case (Apify.com) there was a complete outage of SQS (15+ minutes), most likely DNS problems; EC2 instances also got restarted, probably as a result of the SQS outage.
EDIT: AWS Lambda also seems to be down, and the AWS EC2 APIs are showing a very high error rate and slow machine startup times.
Noticed issues with SQS for a couple of minutes. Errors from the Java SDK: `com.amazonaws.SdkClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com`
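A generic retry-with-exponential-backoff wrapper of the sort that rides out a brief endpoint blip like this might look as follows; note that the real AWS SDKs ship their own configurable retry policies, so this is only an illustration:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 0.1):
    """Call fn(), retrying transient I/O errors with exponential backoff.

    OSError covers ConnectionError and socket-level failures; the last
    failure is re-raised so callers still see persistent outages.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo: a call that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("sqs.us-east-1.amazonaws.com unreachable")
    return "ok"

print(with_retries(flaky, base_delay=0.05))  # ok
```

For a couple-of-minutes outage like the one described, a wrapper like this only helps callers that can afford to wait; longer incidents still surface as errors.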
nyellin | 4 years ago:
We're slowly but surely converting the world's institutional technical knowledge into re-usable and automated runbooks.
easton | 4 years ago:
"If you're having SLA problems I feel bad for you son
I got two 9 problems cuz of us-east-1"
frays | 4 years ago:
A lot of institutional knowledge in these massive tech corporations is disappearing and we're starting to reach the tipping point.
thethethethe | 4 years ago:
Source?
9wzYQbTYsAIc | 4 years ago:
https://www.cisa.gov/shields-up
etaioinshrdlu | 4 years ago:
Do they acknowledge the problem?
It's been a joke for years how bad us-east-1 is.
consumer451 | 4 years ago:
It's the only way to be sure