top | item 45653014

ed_elliott_asc | 4 months ago

It’s a bit of a guess though isn’t it?

HelloNurse|4 months ago

It's the most plausible, fact-based guess, beating other competing theories.

Understaffing and absences would clearly lead to delayed incident response, but a responsible cloud provider would have avoided such obvious negligence and breach of contract by keeping a supposedly adequate number of people on duty.

An exceptionally challenging problem is unlikely to be enough to cause so much fumbling because, regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place.

AWS engineers being formerly competent but currently stupid, without organizational issues, might be explained by brain damage. "RTO" might have caused collective chronic poisoning, e.g. lead in drinking water, but I doubt Amazon is so cheap.

sofixa|4 months ago

> An exceptionally challenging problem is unlikely to be enough to cause so much fumbling because, regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place

You seem to be misunderstanding the nature of the issue.

The DNS records for DynamoDB's API disappeared. They resolve to a dynamic bunch of IPs that constantly change.

A ton of AWS services that use DynamoDB could no longer do so. Hardcoding IPs wasn't an option. Nor could clients do anything on their side.
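To illustrate why hardcoding wasn't an option, here is a minimal Python sketch, not anything AWS actually runs. It compares a pinned list of addresses against whatever the resolver currently returns. The function names and the pinned addresses are hypothetical; the only real API used is `socket.getaddrinfo` from the standard library. If the record is missing entirely, none of the pinned addresses can be confirmed.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the system resolver currently gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return {info[4][0] for info in infos}


def unconfirmed_addresses(
    pinned: Iterable[str],
    hostname: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,  # injectable for testing
) -> Set[str]:
    """Return the pinned addresses that the current DNS answer does not contain."""
    try:
        current = resolver(hostname)
    except socket.gaierror:
        # The name does not resolve at all (e.g. the record disappeared),
        # so no pinned address can be confirmed as still valid.
        return set(pinned)
    return set(pinned) - current
```

With an endpoint whose address set keeps rotating, a pinned list tends to drift into the "unconfirmed" set over time, and with the record gone there is nothing to check it against.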

acdha|4 months ago

> a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses)

Did you consider that DNS might’ve been a symptom? If the DynamoDB DNS records use a health check, switching DNS servers will not resolve the issue and might make it worse by directing an unusually high volume of traffic at static IPs without autoscaling or fault recovery.

almostgotcaught|4 months ago

> It's the most plausible, fact-based guess, beating other competing theories.

"My wildly conjectural and self-serving theory is not only correct, it is the most correct".

Lol perfectly represents the arrogance of hn.

zer0tonin|4 months ago

We witnessed someone repeatedly shoot themselves in the foot a few months ago. It is indeed a guess that this caused their current foot pain, but it is a rather safe one.