jetru | 4 years ago
We ran into a very similar issue, but at the database layer in our company literally 2 weeks ago, where connections to our MySQL exploded and completely took down our data tier and caused a multi-hour outage, compounded by retries and thundering herds. Understanding this problem under the stressful scenario is extremely difficult and a harrowing experience. Anticipating this kind of issue is very very tricky.
Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and they're difficult to model and predict, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but when you integrate them, things get out of whack. I can't stress enough how difficult it is to model or even think about these systems. They're very, very hard. On top of that, the knowledge is distributed among many people, so you're dealing with not only distributed systems but also distributed people, which makes it even harder to wrap your head around.
Outrage is the easy response. Empathy and learning is the valuable one. Hugs to the AWS team, and good learnings for everyone.
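To make "compounded by retries and thundering herds" concrete: when every client retries a failed connection on the same schedule, the retries arrive in synchronized waves and keep the database pinned. One common mitigation is capped exponential backoff with random jitter. The following is only a minimal sketch of that general technique, not what was actually running in the incident above; `backoff_delays`, `call_with_retries`, and all parameter values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=10.0, rng=None):
    """Capped exponential backoff with "full jitter".

    The ceiling doubles on each attempt (base, 2*base, 4*base, ...) up to
    `cap`, and the actual delay is drawn uniformly from [0, ceiling] so that
    many clients do not retry at the same instant.
    """
    rng = rng or random.Random()
    return [
        rng.uniform(0, min(cap, base * 2 ** attempt))
        for attempt in range(max_retries)
    ]


def call_with_retries(operation, max_retries=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying on ConnectionError with jittered backoff.

    `operation` is any zero-argument callable (hypothetical: e.g. a function
    that opens a database connection). After `max_retries` failed attempts,
    one final attempt is made and its error is allowed to propagate.
    """
    for delay in backoff_delays(max_retries, base, cap, rng):
        try:
            return operation()
        except ConnectionError:
            sleep(delay)  # wait a randomized amount before trying again
    return operation()
```

Jitter only spreads the load out. It does not reduce the total number of retries, so it is usually paired with a limit on attempts, as `max_retries` does here.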
Ensorceled|4 years ago
I'm outraged that AWS, as a company policy, continues to lie about the status of their systems during outages, making it hard for me to communicate to my stakeholders.
Empathy? For AWS? AWS is part of a mega corporation that is closing in on 2 TRILLION dollars in market cap. It's not a person. I can empathize with individuals who work for AWS, but it's weird to ask us to have empathy for a massive, faceless, ruthless, relentless, multinational juggernaut.
ithkuil|4 years ago
I may be wrong, but I try to apply the https://en.m.wikipedia.org/wiki/Principle_of_charity
Jgrubb|4 years ago
fastball|4 years ago
That and the apparent policy that a VP must sign off on changing status pages, which is... backwards to say the least.
amzn-throw|4 years ago
I think most people's experience with "VPs" makes them not realize what AWS VPs do.
VPs here are not sitting in an executive lounge wining and dining customers, chomping on cigars and telling minions to "Call me when the data center is back up and running again!"
They are on the tech call, working with the engineers, evaluating the problem, gathering the customer impact, and attempting to balance communicating too early with being precise.
Is there room for improvement? Yes. I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
But the reason why we don't doesn't have anything to do with having to get VP approval to put that message up. The VPs are there in the trenches most of the time.
jetru|4 years ago
For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS. It's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down. But it's actually not. What's the best answer to make everyone happy here? I can't think of one.
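One way to express the "effectively down, but technically not" distinction is to report a service's own health separately from the health of what it depends on. The sketch below is purely illustrative: the `Status` values, the `effective_status` function, and the dependency map are all assumptions made up for this example, and the only fact taken from the comment above is that STS being down blocks access to S3. It says nothing about how AWS actually computes its status page.

```python
from enum import Enum


class Status(Enum):
    OK = "operational"
    IMPAIRED = "impaired by a dependency"
    DOWN = "down"


def effective_status(service, own_health, depends_on, _seen=frozenset()):
    """Return (status, blocking_dependencies) for `service`.

    own_health: dict mapping service name -> True if the service itself works.
    depends_on: dict mapping service name -> list of services it needs.
    A service that is healthy itself but has an unhealthy dependency is
    reported as IMPAIRED rather than DOWN.
    """
    if not own_health[service]:
        return Status.DOWN, []
    seen = _seen | {service}  # guards against cycles in the dependency map
    blockers = [
        dep for dep in depends_on.get(service, [])
        if dep not in seen
        and effective_status(dep, own_health, depends_on, seen)[0] is not Status.OK
    ]
    if blockers:
        return Status.IMPAIRED, blockers
    return Status.OK, []


# Hypothetical snapshot matching the scenario described above.
own_health = {"S3": True, "STS": False}
depends_on = {"S3": ["STS"]}
s3_status, s3_blockers = effective_status("S3", own_health, depends_on)
print(f"S3: {s3_status.value} (blocked by: {', '.join(s3_blockers)})")
```

This does not answer the organizational question of who owns the message, but it does give customers a third answer besides "up" and "down".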
simonbarker87|4 years ago
linsomniac|4 years ago
A couple years ago all our services at our data center just vanished. I call the data center and they start creating a ticket. "Can you tell me if there is a data center outage?" "We are currently investigating and I don't have any information I can give you." "Listen, if this is a problem isolated to our cabinet, I need to get in the car. I'm trying to decide if I need to drive 60 miles in a blizzard."
That facility has been pretty good to us over a decade, but they were frustratingly tight-lipped about an entire room of the facility losing power because one of their power feeder lines was down.
Could AWS improve? Yes. Does avoiding AWS solve these sorts of problems? No.
nix23|4 years ago
Personally, I am a believer in mixed environments: public web servers etc. in the "cloud", and locally used systems and backups "in house" with a second location (both in data centers, or at least one). And no, I'm not talking about the next Google, but about the 99% of businesses.
ranguna|4 years ago
My company was not affected by this outage because we are multi region. Cheapest and quickest option if you want to have at least some fault tolerance.
tuldia|4 years ago
It is naive to assume people bashing AWS are incapable of running things better, cheaper, faster, across many other vendors, on-prem, colocation or what not.
> Outrage is the easy response.
That is what made AWS get the marketshare it has now in the first place, the easy responses.
The main selling point of AWS in the beginning was "how easy it is to spin up a virtual machine". After basically every layman started recommending AWS and we flocked there, AWS started making things more complex than they should be. Was that to make it harder to get out? IDK.
> Empathy and learning is the valuable one.
When you run your own infrastructure and something fails and you are not transparent, your users will bash you, regardless of who you are.
And that was another "easy response" used to drive companies towards AWS. We developers were echoing that "having an infrastructure team or person is not necessary", etc.
Now we are stuck in this learned helplessness where every outage is a complete disaster in terms of transparency: multiple services fail, even for multi-region and multi-AZ customers, we say "this service here is also not working", and AWS simply states that the service was fine, not affected, up and running.
If a sysadmin did that, people would be asking for his/her neck with pitchforks.
noahtallen|4 years ago
I don’t think this is fair for a couple reasons:
1. AWS would have had to scale regardless just because of the number of customers. Even without adding features. This means many data centers, complex virtual networking, internal networks, etc. These are solving very real problems that happen when you have millions of virtual servers.
2. AWS hosts many large, complex systems like Netflix. Companies like Netflix are going to require more advanced features out of AWS, and this will result in more features being added. While this is added complexity, it’s also solving a customer problem.
My point is that complexity is inherent to the benefits of the platform.
metb|4 years ago
danjac|4 years ago
iso1631|4 years ago
The disdain I saw was towards those claiming that all you need is AWS, that AWS never goes down, and don't bother planning for what happens when AWS goes down.
AWS is an amazing accomplishment, but it's still a single point of failure. If you are a company relying on a single supplier and you don't have any backup plans for that supplier being unavailable, that is ridiculous and worthy of laughter.
qaq|4 years ago
spfzero|4 years ago
Totally understand that complex systems behave in incomprehensible ways (hopefully only temporarily incomprehensible). But they're selling people on the idea of trading your complex system for their far more complex system, which they manage with such great expertise that it is more reliable.
bradknowles|4 years ago
They have SLAs. And there are clauses that cover the weird edge cases for when the SLAs are not met.
xamde|4 years ago
raffraffraff|4 years ago