top | item 25624068


maria_weber23 | 5 years ago

ECC memory can't eliminate the chances of these failures entirely. They can still happen. Making software resilient against bit flips in memory seems very difficult, though, since they affect not only data but also code. So in theory the behavior of software under random bit flips is, well... random. You would probably have to use multiple computers doing the same calculation and then take the answer from the quorum. I could imagine that doing so would still be cheaper than using ECC RAM, at least around 2000.
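That multi-computer quorum could look something like this, a toy Python sketch (the function name and the strict-majority rule are my own assumptions, not anything from the article):

```python
from collections import Counter

def quorum(results):
    """Return the strict-majority answer from redundant runs of the
    same calculation; raise if no strict majority exists."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no quorum among results: %r" % (results,))
    return value

# Three redundant runs; one machine's answer was corrupted by a bit flip.
assert quorum([42, 42, 41]) == 42
```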

Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC RAM is the opposite: you make these errors so unlikely that you will generally not encounter them at scale anymore, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.

Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since an attacker now has to make a quorum of machines give the same "wrong" answer for the attack to work.


giantrobot|5 years ago

I don't think ECC is going to give anyone a false sense of security. The issue at Google's scale is they had to spend thousands of person-hours implementing in software what they would have gotten for "free" with ECC RAM. Lacking ECC (and generally using consumer-level hardware) compounded scale and reliability problems, or at least made them more expensive than they might otherwise have been.

Using consumer hardware and making up for reliability with redundancy and software was not a bad idea for early Google, but it did come with an unforeseen cost. Even a thousand machines in a cosmic-ray-proof bunker will end up with memory errors that ECC would correct for free. It's just reducing the surface area of "potential problems".

Animats|5 years ago

> consumer hardware...

That's Intel's PR. Only "enterprise hardware", with a bigger markup, supports ECC memory. Adding ECC today should add only 12% to memory cost.

AMD decided to break Intel's pricing model. Good for them. Now if we can get ECC at the retail level...

The original IBM PC AT had parity in memory.

AaronFriel|5 years ago

It can't eliminate it but:

1. Single-bit-flip correction, along with Google's metrics, could help them identify algorithms they've got, customers' VMs that are causing bit flips via rowhammer, and machines which have errors regardless of workload

2. Double-bit-flip detection lets Google decide if they, say, want to panic at that point and take the machine out of service, and they can report on what software was running and why. Their SREs are world-class and may be able to deduce if this was a fluke (orders of magnitude less likely than a single bit flip), if a workload caused it, or if hardware caused it.
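A toy version of that decision logic might look like this (in Python; the thresholds and action names are invented for illustration, not anyone's actual SRE playbook):

```python
def classify_memory_event(correctable: bool, ce_count_24h: int,
                          ce_threshold: int = 1000) -> str:
    """Hypothetical fleet policy for a memory-error report.

    correctable:  True for a single-bit (corrected) error, False for an
                  uncorrectable one (e.g. a double flip under SECDED).
    ce_count_24h: corrected-error count for this machine in the last day.
    """
    if not correctable:
        # Data may already be corrupt; stop trusting the machine.
        return "drain-and-repair"
    if ce_count_24h > ce_threshold:
        # A burst of corrected errors often predicts worse to come.
        return "schedule-maintenance"
    return "log-and-continue"
```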

The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?

I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.

tjoff|5 years ago

ECC seems like a trivial thing to log and keep track of. Surely any Fortune 500 could do it and would have enough scale to get meaningful data out of it?

saagarjha|5 years ago

There was an interesting challenge at DEF CON CTF a while back that tested this, actually. It turns out that it is possible to write x86 code that is 1-bit-flip tolerant–that is, a bit flip anywhere in its code can be detected and recovered from with the same output. Of course, finding the sequence took (or so I hear) something like 3600 cores running for a day to discover it ;)

rfoo|5 years ago

Nit: not a day, more like 8 hours, and only that long because we were lazy and somebody said he "just happened" to have a cluster with unbalanced resources (mainly used for deep learning, but with all GPUs occupied and quite a lot of CPU/RAM left), so we decided to brute-force the last 16 bits :)

Also, the challenge host left useful state (which bit was flipped) in registers before running teams' code; without that, I'm not sure it would even be possible.

tomxor|5 years ago

> Making software resilient against bitflips in memory seems very difficult though, since it not only affects data, but also code.

There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running, to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!

Although it was intended to protect against bad software, I don't see why it wouldn't also go a long way toward protecting the OS against bit flips. Minix 3 uses a microkernel with a "reincarnation server", which means it can automatically reload any misbehaving code outside the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple-redundancy mechanism, much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR, userland software could in theory also benefit, provided it was written in such a way that it could continue gracefully after being reloaded.

Jedd|5 years ago

At some point, whatever's watching the watchers is going to be vulnerable to bitflip and similar problems.

Even with a triple-redundant quorum mechanism, slightly further up the stack you're going to have some bit of code running that processes the three returned results - if the memory that code is sitting on gets corrupted, you're back where you started.

slumdev|5 years ago

Error-correcting code (the "ECC" in "ECC RAM") is just a quorum at the bit level.
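For the simplest possible such code, a triple-repetition code, that's literally bitwise majority voting over three copies of a word (a quick Python illustration; real ECC DIMMs use Hamming-style codes, not repetition):

```python
def bit_majority(a: int, b: int, c: int) -> int:
    # Each output bit is the majority vote of the corresponding bits
    # of the three copies: (a & b) | (a & c) | (b & c).
    return (a & b) | (a & c) | (b & c)

word = 0b10110100
flipped = word ^ 0b00001000       # one copy suffers a single bit flip
assert bit_majority(word, word, flipped) == word   # flip voted away
```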

eevilspock|5 years ago

I'm surprised that the other replies don't grasp this. This is the proper level to do the quorum.

Doing quorum at the computer level would require synchronizing parallel computers, and unless that synchronization were to happen for each low-level instruction, it would have to be written into the software to take a vote at critical points. That would be hugely detrimental to throughput and would add a lot of software complexity.

I guess you could implement the quorum at the CPU level... e.g. have redundant cores each with their own memory. But unless there was a need to protect against CPU cores themselves being unreliable, I don't see this making sense either.

At the end of the day, at some level, it will always come down to probabilities. "Software engineering principles" will never eliminate that.

sobriquet9|5 years ago

Modern error correction codes can do much better than that.
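For example, a classic Hamming(7,4) code corrects any single flipped bit while storing only 3 check bits per 4 data bits, versus the 8 extra bits a triple-copy quorum would need. A small Python sketch of the textbook construction:

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7,
    parity bits at positions 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Fix at most one flipped bit, then return the 4 data bits."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

cw = hamming74_encode([1, 0, 1, 1])
cw[4] ^= 1                                # flip one bit in transit
assert hamming74_correct(cw) == [1, 0, 1, 1]
```

(Real ECC DIMMs use wider SECDED codes, e.g. 8 check bits per 64 data bits, so they also *detect* double flips; the principle is the same.)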

DSingularity|5 years ago

You need two alpha particles hitting the same rank of memory for a failure to happen. Although super rare, even that is still correctable. You need three before you get silent data corruption. Silent corruption is what you get with non-ECC memory from even a single flip.

klodolph|5 years ago

Where are you getting this from? My understanding is that these errors are predominantly caused by secondary particles from cosmic rays hitting individual memory cells, and I've never heard something so precise as "you need two alpha particles". Aren't the capacitances in modern DRAM chips extremely small?

hn3333|5 years ago

Bit flips can happen, but regardless of whether they can be repaired by the ECC logic or not, the OS is notified, IIRC. It will signal a corruption to the process that has the faulty address mapped. I suppose that if the memory contains code, the process is killed (if ECC correction failed).

wtallis|5 years ago

> I suppose that if the memory contains code, the process is killed (if ECC correction failed).

Generally, it would make the most sense to kill the process if the corrupted page is data, but if it's code, then maybe re-load that page from the executable file on non-volatile storage. (You might also be able to rescue some data pages from swap space this way.)

colejohnson66|5 years ago

> You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum.

The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.

EvanAnderson|5 years ago

The Space Shuttle had redundant computers. The Apollo Guidance Computer was not redundant (though there were two AGCs onboard-- one in the CM and one in the LEM). The aerospace industry has a history of using redundant dissimilar computers (different CPU architectures, multiple implementations of the control software developed by separate teams in different languages, etc) in voting-based architectures to hedge against various failure modes.

patates|5 years ago

Forgive my ignorance, but wouldn't the computer actually reacting to the calculation (and sending a command or displaying the data) still be very vulnerable to bit-flips? Or were they displaying the results from multiple machines to humans?

haolez|5 years ago

Sounds similar to smart contracts running on a blockchain :)

sobriquet9|5 years ago

If you use multiple computers doing the same calculation and then take the answer from the quorum, how do you ensure the computer that does the comparison is not affected by memory failures? Remember that all queries have to go through it, so it has to be comparable in scale and power.

rovr138|5 years ago

> how do you ensure the computer that does the comparison is not affected by memory failures?

You do the comparison on multiple nodes too. Get the calculations, pass them to multiple nodes, validate again, and if it all matches, you use it.
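Sketching that voting-on-the-voters idea in Python (the "nodes" here are just repeated calls in one process, so this only illustrates the logic, not real fault isolation):

```python
from collections import Counter

def majority(values):
    """Strict-majority winner, or None if there isn't one."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None

def replicated_quorum(worker_results, n_comparators=3):
    """Run the comparison itself on several 'nodes' (simulated),
    then vote on the comparators' verdicts."""
    verdicts = [majority(worker_results) for _ in range(n_comparators)]
    return majority(verdicts)

assert replicated_quorum([7, 7, 9]) == 7      # one corrupted worker outvoted
```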