jetru | 4 years ago

Complex systems are really, really hard. I'm not a big fan of seeing all these folks bash AWS for this, without really understanding the complexity or nastiness of situations like this. Running the kind of services they run, for the kind of customers they serve, this is a VERY hard problem.

We ran into a very similar issue at the database layer in our company literally two weeks ago: connections to our MySQL exploded and completely took down our data tier, causing a multi-hour outage compounded by retries and thundering herds. Understanding this kind of problem in a stressful scenario is extremely difficult and a harrowing experience. Anticipating this kind of issue is very, very tricky.
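
For context on the retry/thundering-herd dynamic described here: the usual mitigation is exponential backoff with jitter, so that clients retrying after a failure don't all reconnect at the same instant. A minimal sketch in Python (the "connect" callable stands in for any connection factory, e.g. a MySQL client; it is hypothetical here):

    import random
    import time

    def connect_with_backoff(connect, base=0.5, cap=30.0, attempts=8):
        """Retry `connect` with exponential backoff and full jitter."""
        for attempt in range(attempts):
            try:
                return connect()
            except Exception:
                # Sleep a random time in [0, min(cap, base * 2^attempt)]
                # so a herd of failed clients spreads out its retries.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        raise RuntimeError("gave up after repeated connection failures")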

Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and it's difficult to model and predict them, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but when you integrate them, things get out of whack. I can't stress enough how difficult it is to model or even think about these systems; they're very, very hard. Combine this with the knowledge being distributed among many people, and you're dealing with not only distributed systems but also distributed people, which adds more difficulty in wrapping your head around it all.

Outrage is the easy response. Empathy and learning is the valuable one. Hugs to the AWS team, and good learnings for everyone.

Ensorceled | 4 years ago

> Outrage is the easy response. Empathy and learning is the valuable one.

I'm outraged that AWS, as a company policy, continues to lie about the status of their systems during outages, making it hard for me to communicate with my stakeholders.

Empathy? For AWS? AWS is part of a mega corporation that is closing in on 2 TRILLION dollars in market cap. It's not a person. I can empathize with individuals who work for AWS, but it's weird to ask us to have empathy for a massive, faceless, ruthless, relentless, multinational juggernaut.

ithkuil | 4 years ago

My reading of GP's comment is that the empathy should be directed towards AWS' team, the people who are building the system and handling the fallout, not AWS the corporate entity.

I may be wrong, but I try to apply the https://en.m.wikipedia.org/wiki/Principle_of_charity

Jgrubb | 4 years ago

It seems obvious to me that they're specifically talking about having empathy for the people who work there, the people who designed and built these systems, and yes, empathy even for the people who might not be sure what to put on their absolutely humongous status page until they're sure.

fastball | 4 years ago

I think most of the outrage is not because "it happened" but because AWS is saying things like "S3 was unaffected" when the anecdotal experience of many in this thread suggests the opposite.

That and the apparent policy that a VP must sign off on changing status pages, which is... backwards to say the least.

amzn-throw | 4 years ago

> a VP must sign off on changing status pages, which is... backwards to say the least.

I think most people's experience with "VPs" means they don't realize what AWS VPs do.

VPs here are not sitting in an executive lounge wining and dining customers, chomping on cigars and telling minions to "Call me when the data center is back up and running again!"

They are on the tech call, working with the engineers, evaluating the problem, gathering the customer impact, and attempting to balance communicating early against communicating precisely.

Is there room for improvement? Yes. I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.

But the reason we don't has nothing to do with having to get VP approval to put that message up. The VPs are there in the trenches most of the time.

jetru | 4 years ago

There's definitely miscommunication around this. I know I've miscommunicated impact, or had my communication misinterpreted, across the 2 or 3 people it had to jump through before hitting the status page.

For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS; it's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down. But it's actually not. What's the best answer to make everyone happy here? I can't think of one.
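
To make the dependency concrete: a client whose S3 access goes through STS fails at the credential step when STS is down, even though S3's storage layer is healthy. A minimal sketch with boto3 (the role ARN and bucket names are hypothetical):

    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(  # this call fails first if STS is down
        RoleArn="arn:aws:iam::123456789012:role/example",
        RoleSessionName="example",
    )["Credentials"]

    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # Never reached during an STS outage, even though S3 itself is up.
    s3.get_object(Bucket="example-bucket", Key="example-key")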

simonbarker87 | 4 years ago

I’m not all that angry over the situation, but more disappointed that we’ve all collectively handed the keys over to AWS because “servers are hard”. Yeah, they are, but it’s not like locking ourselves into one vendor with flaky docs and a black box of bugs is any better. At least when your own servers go down it’s on you, and you don’t take out half of North America.

linsomniac | 4 years ago

If you aren't going to rely on external vendors, servers are really, really hard. Redundancy in power, cooling, networking? Those get expensive fast. Drop your servers into a data center and you're in a similar situation to dropping them into AWS.

A couple years ago all our services at our data center just vanished. I call the data center and they start creating a ticket. "Can you tell me if there is a data center outage?" "We are currently investigating and I don't have any information I can give you." "Listen, if this is a problem isolated to our cabinet, I need to get in the car. I'm trying to decide if I need to drive 60 miles in a blizzard."

That facility has been pretty good to us over a decade, but they were frustratingly tight-lipped about an entire room of the facility losing power because one of their power feeder lines was down.

Could AWS improve? Yes. Does avoiding AWS solve these sorts of problems? No.

nix23 | 4 years ago

Servers are not hard if you have a dedicated person (long ago known as a systems administrator), and fun fact: it's sometimes even much cheaper and more reliable than having everything in the "cloud".

Personally I am a believer in mixed environments: public webservers etc. in the "cloud", locally used systems and backups "in house" with a second location (both in data centers, or at least one). And no, I'm not talking about the next Google, but about the 99% of businesses.

ranguna | 4 years ago

You can either pay a dedicated team to manage your on-prem solution, go multi-cloud, or simply go multi-region on AWS.

My company was not affected by this outage because we are multi-region. It's the cheapest and quickest option if you want at least some fault tolerance.
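
As an illustration of the multi-region approach, one simple pattern is client-side failover between regions, assuming the data is already replicated to a bucket in a second region (a sketch only; the regions and bucket names are hypothetical):

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    # (region, replicated bucket) pairs, in order of preference.
    REGIONS = [("us-east-1", "example-bucket-use1"),
               ("us-west-2", "example-bucket-usw2")]

    def fetch(key):
        last_error = None
        for region, bucket in REGIONS:
            try:
                s3 = boto3.client("s3", region_name=region)
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (BotoCoreError, ClientError) as exc:
                last_error = exc  # this region failed; try the next one
        raise last_error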

tuldia | 4 years ago

Excuse me, but do we need all that complexity? And is saying that it is "hard" justifiable?

It is naive to assume that the people bashing AWS are incapable of running things better, cheaper, and faster across many other vendors, on-prem, colocation, or what not.

> Outrage is the easy response.

That is what made AWS get the market share it has now in the first place: the easy responses.

The main selling point of AWS in the beginning was "how easy it is to spin up a virtual machine". After basically every layman started recommending AWS and we flocked there, AWS started making things more complex than it should. Was that to make it harder to get out? IDK.

> Empathy and learning is the valuable one.

When you run your own infrastructure and something fails and you are not transparent, your users will bash you, no matter who you are.

And that was another "easy response" used to drive companies towards AWS. We developers were echoing that "having an infrastructure team or person is not necessary", etc.

Now we are stuck in this learned helplessness where every outage is a complete disaster in terms of transparency: multiple services failing, even for multi-region and multi-AZ customers; us saying "this service here is also not working" while AWS simply states that the service was fine, not affected, up and running.

If it were a sysadmin doing that, people would be coming for his/her neck with pitchforks.

noahtallen | 4 years ago

> AWS started making things more complex than it should

I don’t think this is fair for a couple reasons:

1. AWS would have had to scale regardless, just because of the number of customers, even without adding features. This means many data centers, complex virtual networking, internal networks, etc. These are solutions to very real problems that happen when you have millions of virtual servers.

2. AWS hosts many large, complex systems like Netflix. Companies like Netflix are going to require more advanced features out of AWS, and this will result in more features being added. While this is added complexity, it’s also solving a customer problem.

My point is that complexity is inherent to the benefits of the platform.

metb | 4 years ago

Thanks for these thoughts. They resonated well with me. I feel we are sleepwalking into major fiascos when a simple doorbell needs to sit on top of this level of complexity. It's in our best interest not to tie every small thing into layers and layers of complexity. Mundane things like doorbells need at least to have their fallbacks done properly, so they can function locally without relying on complex cloud systems.

danjac | 4 years ago

The problem isn't AWS per se. The problem is it's become too big to fail. Maybe in the past an outage might take down a few sites, or one hospital, or one government service. Now one outage takes out all the sites, all the hospitals and all the government services. Plus your coffee machine stops working.

iso1631 | 4 years ago

> I'm not a big fan of seeing all these folks bash AWS for this,

The disdain I saw was towards those claiming that all you need is AWS, that AWS never goes down, and don't bother planning for what happens when AWS goes down.

AWS is an amazing accomplishment, but it's still a single point of failure. If you are a company relying on a single supplier and you don't have any backup plans for that supplier being unavailable, that is ridiculous and worthy of laughter.

qaq | 4 years ago

A very good summary of why small projects need to think real hard before jumping onto the microservices bandwagon.

spfzero | 4 years ago

But Amazon advertises that they DO understand the complexity of this, and that their understanding, knowledge, and experience are so deep that they are a safe place to put your critical applications, and that you should therefore pay them lots of money to do so.

Totally understand that complex systems behave in incomprehensible ways (hopefully only temporarily incomprehensible). But they're selling people on the idea of trading your complex system for their far more complex system, which they manage with such great expertise that it is more reliable.

bradknowles | 4 years ago

They don’t sell “guaranteed no downtime throughout the history of the universe”.

They have SLAs. And there are clauses that cover the weird edge cases for when the SLAs are not met.
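
For a sense of scale, it's easy to compute what an availability SLA actually permits over a 30-day month:

    # Downtime allowed by an availability target over a 30-day month.
    for sla in (0.999, 0.9999):
        minutes = (1 - sla) * 30 * 24 * 60
        print(f"{sla:.2%} -> {minutes:.1f} minutes/month")
    # 99.90% -> 43.2 minutes/month
    # 99.99% -> 4.3 minutes/month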

raffraffraff | 4 years ago

Interesting. Just wondering: do you guys have a dedicated DBA?

raffraffraff | 4 years ago

Not sure why I got downvoted for an honest question. Most start-ups are founders, developers, sales, and marketing. Dedicated infrastructure, network, and database specialists don't get factored in because "smart CS graduates can figure that stuff out". I've worked at companies that held onto that false notion way too long and almost lost everything as a result (a "company extinction event", like losing a lot of customer data).