
Unreliability at Scale

175 points | mooreds | 4 years ago | blog.dshr.org | reply

48 comments

[+] vishnugupta|4 years ago|reply
One of the most severe S3 outages[1] was triggered by a single NIC flipping bits. S3 maintained (at least back then) its routing table (the object-to-node mapping) in a Merkle tree, kept in sync across machines via a gossip protocol. The bit-flipping machine, however, kept spreading misinformation about the routing table, forcing all its neighbours (and their neighbours, and so on) to re-sync their Merkle trees. So a large number of machines spent more time gossiping than serving customer requests.
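
A minimal sketch of that failure mode (hypothetical names and structure, not S3's actual implementation): each node summarises its routing table as a Merkle root and gossips it, and one machine whose hardware corrupts its table keeps advertising roots that never match, so its peers burn their time re-syncing instead of serving requests.

```python
import hashlib
import random

def merkle_root(entries):
    """Hash each (key_range, node) entry, then fold pairwise up to a single root."""
    level = [hashlib.sha256(repr(e).encode()).digest() for e in sorted(entries)]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

class Node:
    def __init__(self, name, routing_table, faulty=False):
        self.name, self.table, self.faulty = name, dict(routing_table), faulty
        self.syncs = 0          # rounds spent re-syncing instead of serving requests

    def advertised_root(self):
        table = self.table
        if self.faulty:         # a corrupted entry, standing in for NIC-flipped bits
            table = dict(table)
            key = random.choice(list(table))
            table[key] = table[key] + "?"
        return merkle_root(table.items())

    def gossip_with(self, peer):
        # Mismatched roots force both sides into an expensive re-sync.
        if self.advertised_root() != peer.advertised_root():
            self.syncs += 1
            peer.syncs += 1

routing = {f"range-{i:02d}": f"node-{i % 5}" for i in range(32)}
fleet = [Node(f"n{i}", routing, faulty=(i == 0)) for i in range(10)]
for _ in range(200):            # gossip rounds between random pairs
    a, b = random.sample(fleet, 2)
    a.gossip_with(b)
print({n.name: n.syncs for n in fleet})   # n0 and whoever gossips with it keep re-syncing
```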

I vividly remember the outage as I was at Amazon at the time; in fact I was in the AWS group, though on a different team. I guess the ticket was escalated to sev-1, as a large number of tier-1 services had begun depending on S3. The COE action items took months to roll out.

While it's one thing to say "be ready for all kinds of failure, including hardware", it takes a massive failure to internalise the lesson. AWS is now paranoid about being resilient against all kinds of failures, including earthquakes and lightning.

[1] https://status.aws.amazon.com/s3-20080720.html

[+] sokoloff|4 years ago|reply
> So a large number of machines spent more time gossiping than serving the customer request.

Machines are becoming increasingly capable of automating jobs previously only done by humans.

[+] pm90|4 years ago|reply
It always takes that one incident to reprioritize reliability. Kudos to AWS for actually taking action to change the org to be more resilient, though. Most organizations with reliability issues will hire a QE expert or launch some quality initiatives that start hot but fizzle out as leadership loses interest, until another disaster hits.
[+] willvarfar|4 years ago|reply
This topic was raised earlier this week too:

https://news.ycombinator.com/item?id=27484866 Silent Data Corruptions at Scale

In the other comments, there was talk of having some kind of active search-for-unreliability as a background task in clusters. I was kinda pleased with my suggestion in that thread:

There are programs, used for testing and “fuzzing” compilers, that generate random programs based on a seed.

When a node runs a random program, it doesn’t know if the output is correct. But it could report the seed and result to a central database.

Then, if you had several nodes, and two nodes ran the same seed but got different output, that would mean something was wrong and needed investigation.

There are also programs for reducing such programs down to a minimum test case. So once a discrepancy is found, it can be reduced to some small program that recreates it.

I once worked on a compiler backend, and a CI job generated random C programs and compared the output on x86 against the output from the novel CPU's simulator. Any discrepancies found were auto-reduced by these tools and a ticket was created automatically. Lots of our bugs were found and fixed this way.

(My memory is we used C-reduce for the reductions. I can’t remember the tool we used for generating the test programs, but there are several.)
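
A rough sketch of the scheme as I understood it (plain Python with made-up names, standing in for a real program generator like the C fuzzers mentioned): every node derives a deterministic workload from a seed, reports (seed, result hash) to a central store, and any seed for which two nodes disagree flags a machine for investigation.

```python
import hashlib
import random
from collections import defaultdict

def run_seeded_workload(seed: int) -> str:
    """Deterministic stand-in for 'generate a random program from a seed and run it'.
    On healthy hardware, every node gets an identical result for a given seed."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(10_000):
        x = rng.uniform(-1e6, 1e6)
        acc += x * x - int(x) * 1.000001
    return hashlib.sha256(repr(acc).encode()).hexdigest()

# Central database: seed -> {result_hash: [nodes that reported it]}
reports = defaultdict(lambda: defaultdict(list))

def report(node: str, seed: int, result_hash: str) -> None:
    reports[seed][result_hash].append(node)

def suspicious_seeds():
    """Seeds where nodes disagree: something (quite possibly hardware) is wrong."""
    return {seed: dict(by_hash) for seed, by_hash in reports.items() if len(by_hash) > 1}

# Simulated fleet: node-7 silently corrupts its result for seed 3.
for node in [f"node-{i}" for i in range(10)]:
    for seed in range(5):
        result = run_seeded_workload(seed)
        if node == "node-7" and seed == 3:
            result = hashlib.sha256(b"flipped bit").hexdigest()
        report(node, seed, result)

print(suspicious_seeds())   # {3: {<good hash>: [nine nodes], <bad hash>: ['node-7']}}
```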

[+] pm90|4 years ago|reply
Nice. Automated testing like this is going to be absolutely essential as the systems we build get more complex and operate at scales humans find hard to work with.
[+] crocal|4 years ago|reply
There is a solution to this problem, known for a few decades now: the vital coded processor. The basic idea: each operation simultaneously computes a result and an arithmetic « signature » that can be pre-calculated for a program. The length in bits of the signature determines the probability of missing an error in the underlying hardware. Think ECC, but where the error-correcting code covers the correct _execution_ of some machine code.

For some reason it has never caught on in the mainstream, but it truly would make sense for such an application, albeit at the cost of computational overhead and more work for the compiler.
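
A toy illustration of the underlying idea, using the simplest arithmetic code (an AN code) rather than the actual vital-coded-processor scheme: values are carried as A·x, correct execution preserves divisibility by A, and a random hardware fault almost certainly breaks that invariant at check time.

```python
A = 58659   # code constant; in real schemes A is chosen to catch common fault patterns

def encode(x: int) -> int:
    return A * x

def add(cx: int, cy: int) -> int:
    # Addition commutes with the encoding: A*x + A*y == A*(x + y).
    return cx + cy

def checked_decode(cx: int) -> int:
    # The "signature" check: any correct execution yields a multiple of A.
    if cx % A != 0:
        raise RuntimeError("arithmetic check failed: probable hardware fault")
    return cx // A

a, b = encode(1234), encode(5678)
print(checked_decode(add(a, b)))        # 6912, check passes

faulty = add(a, b) ^ (1 << 17)          # simulate one flipped bit in the result
print(checked_decode(faulty))           # raises: the flip breaks divisibility by A
```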

[+] Animats|4 years ago|reply
And that's why IBM mainframes have duplicate CPUs with comparison. [1]

[1] https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=c...

[+] dimtion|4 years ago|reply
In the Facebook bug listed in the post, this specific mitigation would probably not have been enough, since the bug was due to invalid instructions emitted by the JIT.

Under the exact same workload, two duplicate JITs running on two duplicate CPUs would most likely have emitted the same erroneous code.

[+] throwawaysea|4 years ago|reply
Fascinating read. Kudos to the folks who no doubt had to work hard to identify these esoteric bugs. I do wonder if it is worth it for these companies to identify these issues - it sounds more economical to just isolate the faulty machine and get rid of it instead of bothering to investigate. On the other hand, they perhaps can't rule out that the issue is something more widespread.
[+] mikeytown2|4 years ago|reply
At scales this large, finding the root cause is the best choice. Once you know the cause, you can run tests on all boxes via cron to make sure everything works as expected, allowing you to pull hardware before an issue becomes widespread; you can also rerun a box's inputs on another machine if it was doing more than serving HTTP requests. Robust hardware validation and testing is a must when dealing with millions of cores.
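
A hedged sketch of what such a cron check might look like (the tests, golden values, and reporting are all invented): run a few known-answer computations on every box and flag any machine whose results drift from the expected digests.

```python
import hashlib
import math
import socket

# Known-answer tests: computations whose results are fixed on healthy hardware.
KNOWN_ANSWER_TESTS = {
    "fp_pow":   lambda: repr(sum(math.pow(1.1, n) for n in range(200))),
    "int_mix":  lambda: repr(sum((i * 2654435761) % (2**32) for i in range(100_000))),
    "hash_1mb": lambda: hashlib.sha256(b"\xa5" * 1_000_000).hexdigest(),
}

# Golden digests. Computed here only to keep the sketch self-contained; in a real
# fleet these would be constants produced once on trusted hardware and shipped out.
GOLDEN = {name: hashlib.sha256(fn().encode()).hexdigest()
          for name, fn in KNOWN_ANSWER_TESTS.items()}

def run_self_test() -> list:
    """Return the names of tests whose result no longer matches the golden digest."""
    return [name for name, fn in KNOWN_ANSWER_TESTS.items()
            if hashlib.sha256(fn().encode()).hexdigest() != GOLDEN[name]]

if __name__ == "__main__":
    failures = run_self_test()
    if failures:
        # In a real fleet: open a ticket, drain the host, rerun its work elsewhere.
        print(f"{socket.gethostname()}: FAILED {failures}")
```
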
[+] yuliyp|4 years ago|reply
What is a faulty machine? How do you find one? Some of the failures experienced happened only with specific data and workloads assigned to specific cores of specific machines. How do you tell software bugs from hardware issues? Given a suspect machine, how do you figure out whether it's actually the machine's fault? Is it the CPU? The RAM? The motherboard?
[+] cratermoon|4 years ago|reply
The issue at scale is that you don't know if there's an underlying flaw that will manifest all at once across enough machines to cause a real outage or major service disruption.
[+] sellyme|4 years ago|reply
Somewhat of a deviation from the topic of this article, but I enjoyed this terminology from the Facebook bug analysis:

> After a few iterations, it became obvious that the computation of Int(1.1^53) as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU

The use of the word "obvious" here reminds me very heavily of this particular SMBC:

https://www.smbc-comics.com/comic/how-math-works

Not two paragraphs earlier they're talking about how discovering this required multiple entire teams of engineers. Lots of things are obvious after you've done all the hard work involved in understanding it!

[+] vlovich123|4 years ago|reply
I read it differently. More like after a bunch of debugging it was clear there was one and only one possible problematic instruction (vs any of several).
[+] hinkley|4 years ago|reply
I never cease to be amazed at how awful we all are with statistics. I struggled in school with that class, and I think that made me more sensitive to this phenomenon.

You can't move anyone to action with conversations about failure percentages or probabilities. The light bulbs only ever click on when you rephrase in terms of the interval between incidents. If I tell you that you'll probably get pre-empted today to deal with operational bullshit, people don't like that; state the same thing as a fraction and it just isn't compelling to most people.

Be it servers or deployments or CI issues, one negative event per day can be a very small proportion as you scale up, and one substantial event per week is an even lower probability. People also don't get that 'all the time' really means 'once a week, and three times in a row while I was in a bad mood'. So even if your incidence isn't that bad, statistical clustering will earn you enemies in other departments or among your customers, and your perceived rate of failure ends up much worse than your actual one.
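
A back-of-envelope illustration of that reframing, with invented numbers: the same failure rate stated as a percentage versus as an expected interval, plus how readily independent incidents cluster into a "bad week".

```python
import math

p_failure = 0.002      # "0.2% of deploys/CI runs go sideways" -- sounds negligible
events_per_day = 40    # deploys + CI runs + scale-ups across the org

incidents_per_day = p_failure * events_per_day
print(f"about one incident every {1 / incidents_per_day:.1f} days")   # ~12.5 days

# Clustering: with Poisson arrivals, a "three incidents in one bad week" run
# is individually unlikely but close to certain somewhere in a year.
lam = incidents_per_day * 7
p_bad_week = 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(3))
print(f"P(3+ incidents in a given week) = {p_bad_week:.3f}")                      # ~0.02
print(f"P(at least one such week in a year) = {1 - (1 - p_bad_week) ** 52:.2f}")  # ~0.64
```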

[+] Futurebot|4 years ago|reply
Another in a long line of unexpected causes for system errors, the classic being the cosmic ray:

"When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe.

While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices. It's called a single-event upset or SEU.

During an SEU, particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet."

https://www.computerworld.com/article/3171677/computer-crash...

Those of us who do high-level development generally treat these underlying systems as infallible, but as we continue to scale - and more money and lives are on the line - we'll not only need to get used to not trusting the underlying hardware, as the article states, but may have to get to the point where we have "ECC at the system level." We already have this in various distributed-systems tech, but it tends to be application-specific. The next step would be to incorporate it directly into datastores generally.

This also suggests that heterogeneous hardware architectures can have an advantage where data integrity is critical, even with the increased administration, hardware, and ops costs. Finally, it highlights the importance of regular data audits and reconciliations, even for non-suspect data, preferably with the aforementioned heterogeneous setup.

[+] hinkley|4 years ago|reply
In a cloud environment, I worry how often there is a Market for Lemons situation going on.

If I have a machine that seems to be misbehaving, am I going to spend as much time doing root-cause analysis as the people in this article did? Or am I going to allocate a new VM and dump the old one? I would expect the permanently allocated machines to accumulate a lower density of bad hardware over time, while the pool of available machines ends up with a higher-than-average rate of hardware failures.

This is bad both for newcomers and for anyone who tries to autoscale hardware - because hardware error density now increases with request rates, instead of remaining stable.

[+] skinkestek|4 years ago|reply
I think there is a quote from "Release It" that says something along the lines of: in our field, so many computations are done as to make astronomical coincidences happen on a daily basis.

BTW: That entire book is great IMO.

[+] formerly_proven|4 years ago|reply
Hardware failures become much more obvious once you hash your data and check the hashes on each access. You start to see that computers, even servers with ECC memory, aren't quite as reliable as you wished.
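
A minimal sketch of that pattern (a toy in-memory store, not any particular system): keep a checksum alongside every value and verify it on each read, so silent corruption turns into a loud error.

```python
import hashlib

class ChecksummedStore:
    """Toy key-value store that detects (but does not correct) silent corruption."""

    def __init__(self):
        self._data = {}     # key -> (payload bytes, sha256 hex digest)

    def put(self, key, payload: bytes):
        self._data[key] = (payload, hashlib.sha256(payload).hexdigest())

    def get(self, key) -> bytes:
        payload, digest = self._data[key]
        if hashlib.sha256(payload).hexdigest() != digest:
            # In production: read a replica, repair, and flag the machine.
            raise IOError(f"checksum mismatch for {key!r}: silent corruption detected")
        return payload

store = ChecksummedStore()
store.put("user:42", b"important record")
good, digest = store._data["user:42"]
store._data["user:42"] = (good[:3] + b"\x00" + good[4:], digest)   # simulate a flipped byte
store.get("user:42")    # raises instead of quietly returning bad data
```
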
[+] rho4|4 years ago|reply
Hopefully these efforts from the big players result in more reliable hardware for everyone.
[+] hulitu|4 years ago|reply
How about reliable software ?
[+] kthejoker2|4 years ago|reply
Not the only reason, to be sure, but this is my go-to explanation for why software engineering doesn't have the same discipline, apprenticeship, professional licensing, S&P, etc. as other engineering fields.

The systems we work on are fundamentally unsustainable beyond very limited scales and periods of time compared to other disciplines, due to the nature of the medium.

If you had these same failure characteristics in chemistry or civil engineering, you wouldn't have nuclear power or the Brooklyn Bridge.

[+] twotwotwo|4 years ago|reply
The "petabyte for a century" problem statement he mentions near the top is fun: can you preserve 1 PB with better-than-even odds of all the bits coming back right in a century? He wrote something about how he saw the durability situation in 2010: https://queue.acm.org/detail.cfm?id=1866298 . He seems to define the problem to allow for maintenance, e.g. check-and-repair operations are discussed.

Broadening a little, I read it as "if you're trying to be more cautious than you usually want for commercial/academic online storage or even backup, what do you do? And would it work?"

A lot (but not all) of the author's Queue piece talks about stats from online storage, which doesn't have some wins you can get if you're entirely about long-term durability.

In online systems, heavy ECC seems out of favor compared to replication for performance reasons, but additional LDPC or RS at the app layer can absorb a substantial % of your volumes having problems, or a ton of random bad blocks. (In a more near-line context, Backblaze uses RS: https://www.backblaze.com/blog/vault-cloud-storage-architect... ) Same for tape -- offline LTO is slow and drives are costly, but the cost/TB and the rated lifespan seem like advantages over HDDs for this specific goal, at least in a narrow engineering sense.

A pile of cryptographic hashes that fits on one device can help you check for and localize errors without sending everything over a network/doing the full ECC dance. If the hash being broken and data tampered with is in your threat model (hey, weird things can happen in a century), you can also hash with a secret nonce you keep separate.
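
A hedged sketch of that manifest idea (chunk size, paths, and key handling are placeholders): hash the archive in fixed-size chunks, keyed with a secret, so a periodic scrub both detects corruption and says which chunk to re-fetch from another copy.

```python
import hashlib
import hmac

CHUNK = 64 * 1024 * 1024    # 64 MiB chunks: ~16M digests (roughly 1 GB of manifest) per PB

def build_manifest(path: str, secret: bytes) -> dict:
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            digests.append(hmac.new(secret, chunk, hashlib.sha256).hexdigest())
    return {"path": path, "chunk_size": CHUNK, "digests": digests}

def scrub(path: str, secret: bytes, manifest: dict) -> list:
    """Return indices of chunks that no longer verify: detect *and localize* damage."""
    bad = []
    with open(path, "rb") as f:
        for i, want in enumerate(manifest["digests"]):
            got = hmac.new(secret, f.read(manifest["chunk_size"]), hashlib.sha256).hexdigest()
            if not hmac.compare_digest(got, want):
                bad.append(i)
    return bad

# Usage sketch (hypothetical paths): the manifest and the secret live apart from the
# archive; a periodic scrub re-reads the volume and reports exactly which chunks to
# re-fetch from another copy.
#   secret = open("/secure/nonce", "rb").read()
#   manifest = build_manifest("/archive/volume-0001", secret)
#   print(scrub("/archive/volume-0001", secret, manifest))
```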

Initially loading a PB with a good chance of no mistakes is a thing too, and durability measures don't totally address that. Maybe your multi-site strategy loads up the original on different hardware at different locations with independent software implementations, and you compare those hashes after you do it.

With all that, the hundred-years part is still deeply tricky in a couple of ways.

One, "lasting a hundred years" is just technologically different from "very low error rate at 5 years." Widely used media like LTO tape seem to max out at a 30-year rating, and exotic archival media like the "hardened film" in GitHub's code vault have the big disadvantage of no ecosystem to read them. So it seems like you really want refreshes of some sort at intervals, and being sure a task will be done decades from now is hard (assuming high-tech civilization is around and all that--some things just have to be outside the scope of the problem for it to be meaningful).

From that angle, maybe having an online copy of the data is a better investment than the pure engineering perspective would suggest: if other folks can grab a copy of the archive it has a better chance of outliving your organization.

Two, a lot of unknown unknowns crop up at that timescale. A couple of decades back we didn't have the experience at scale that we do now, and CEEs and other causes of SDC were less on anyone's radar. We could discover something else significant after it's too late to fix. The world can also change in ways that substantially disrupt the durability picture (changing laws or disaster risks, say), short of the types of change that make the whole problem meaningless.

Anyway, fun question, and if you find it fun too, you might like https://www.youtube.com/watch?v=eNliOm9NtCM , a talk on backups from someone at Google that covers the tape restore after the big Gmail glitch and various dimensions of resilience. And I'm sure there are storage papers and Long Now-ish work on things like this that I'm not plugged into; I wouldn't mind hearing about it.