gnat | 7 months ago

Ugh. Each of these points is a classic reliability precaution – yet all were missed simultaneously. As one analyst put it, Google had “written the book on Site Reliability Engineering” but still deployed code that could not handle null inputs. In hindsight, this outage looks like a string of simple errors aligning by unfortunate chance.

Yes, that's how major outages happen. At this stage of maturity, any single failure generally doesn't break things dramatically. When things go this wrong, it's ALWAYS a combination of failures: a failure of the recovery system, an omission in detection systems, a gap in automated review, an oversight in ...

The vacuous gotcha language is indicative of the low quality of the whole article. As Metalnem says in the comments here, see the official incident report for a better writeup and more insight: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

jaymzcampbell|7 months ago

I really love the "Swiss Cheese model" for showing this in a very explicit way: it's easy to see how the most improbable thing could happen.

https://en.wikipedia.org/wiki/Swiss_cheese_model
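
To put rough numbers on the model, here is a toy sketch. The layer names and miss rates are made up for illustration (they are not from the incident report), and it assumes the layers fail independently, which real systems often don't.

```python
from math import prod

# Hypothetical miss rates for four defensive layers.
# Illustrative numbers only, not taken from any incident report.
miss_rates = {
    "code review": 0.05,
    "testing": 0.05,
    "staged rollout": 0.02,
    "automated recovery": 0.10,
}


def p_all_layers_miss(rates):
    """Chance that every layer misses the same fault, assuming independence."""
    return prod(rates)


def p_at_least_one_slips(p_per_change, n_changes):
    """Chance that at least one of n independent changes gets past all layers."""
    return 1 - (1 - p_per_change) ** n_changes


p = p_all_layers_miss(miss_rates.values())
print(f"per change: {p:.2e}")
print(f"over 100,000 changes: {p_at_least_one_slips(p, 100_000):.1%}")
```

Each hole is small and the per-change odds of all of them lining up are tiny, but at a large enough volume of changes the alignment stops being surprising.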