maria_weber23 | 5 years ago
Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC RAM is the opposite: you make the errors so unlikely that you will generally not encounter them at scale anymore, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.
Another interesting side effect of quorum is that it also makes certain attacks harder to pull off, since now an attacker has to make a quorum of machines give the same "wrong" answer for the attack to work.
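As a rough illustration of the idea (a toy sketch, not any particular system's implementation), a majority-vote read only returns a value if a quorum of replicas agrees, so a single corrupted or malicious replica cannot change the result:

```python
from collections import Counter

def quorum_read(replies, quorum):
    """Return the value reported by at least `quorum` replicas, else None.

    `replies` holds one value per replica. A lone replica returning a
    flipped or "wrong" answer cannot win unless it is echoed by a
    quorum of machines: that is what makes the attack harder.
    """
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= quorum else None

# 2 of 3 replicas agree: the lone bad answer is outvoted
assert quorum_read([42, 42, 41], quorum=2) == 42
# no quorum: refuse to answer rather than guess
assert quorum_read([1, 2, 3], quorum=2) is None
```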
giantrobot|5 years ago
Using consumer hardware and making up for reliability with redundancy and software was not a bad idea for early Google, but it did come with an unforeseen cost. Even a thousand machines in a cosmic-ray-proof bunker will end up with memory errors that ECC would correct for free. It's simply reducing the surface area of potential problems.
Animats|5 years ago
That's Intel's PR. Only "enterprise hardware", with a bigger markup, supports ECC memory. Adding ECC today should add only 12% to memory cost.
AMD decided to break Intel's pricing model. Good for them. Now if we can get ECC at the retail level...
The original IBM PC AT had parity in memory.
AaronFriel|5 years ago
1. Single-bitflip correction, along with Google's metrics, could help them identify algorithms of their own and customers' VMs that are causing bitflips via rowhammer, as well as machines that have errors regardless of workload.
2. Double-bitflip detection lets Google decide if, say, they want to panic at that point and take the machine out of service, and they can report on what software was running and why. Their SREs are world-class and may be able to deduce whether it was a fluke (orders of magnitude less likely than a single bitflip), whether a workload caused it, or whether hardware caused it.
The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?
I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.
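The single/double-bitflip handling described above could be sketched roughly like this (hypothetical thresholds and actions, not Google's actual policy): count corrected single-bit errors per machine, and drain a machine on an uncorrectable double-bit error or when corrections pile up suspiciously:

```python
from collections import defaultdict

CORRECTABLE_THRESHOLD = 100  # hypothetical per-machine limit

corrected = defaultdict(int)

def on_ecc_event(machine, uncorrectable):
    """Decide what to do with an ECC event reported by `machine`.

    Single-bit errors are corrected by the hardware, so we just count
    them and watch for machines that err regardless of workload.
    A double-bit (uncorrectable) error is grounds to drain the machine.
    """
    if uncorrectable:
        return "drain"            # take the machine out of service
    corrected[machine] += 1
    if corrected[machine] > CORRECTABLE_THRESHOLD:
        return "drain"            # flaky DIMM: drain proactively
    return "log"                  # keep the signal, don't silence it

assert on_ecc_event("m1", uncorrectable=True) == "drain"
assert on_ecc_event("m2", uncorrectable=False) == "log"
```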
rfoo|5 years ago
Also, the challenge host left useful state (which bit was flipped) in registers before running the teams' code; without that, I'm not sure it would even be possible.
tomxor|5 years ago
There is an OS that pretty much fits the bill here. There was a demonstration in which Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running, to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!
Although it was intended to protect against bad software, I don't see why it wouldn't also go a long way toward protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server", which means it can automatically reload, on the fly, any misbehaving code outside the core kernel (which, for Minix, is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple-redundancy mechanism, much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR, userland software could in theory also benefit, provided it was written in such a way that it could continue gracefully after being reloaded.
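My crude understanding of the reincarnation idea, as a toy user-space sketch (assumption: a crashed component simply gets reloaded and retried; real Minix 3 does this for drivers and servers at the microkernel level):

```python
def supervise(component, max_restarts=3):
    """Run `component`, reloading (restarting) it each time it crashes.

    Loosely mimics Minix 3's reincarnation server: instead of letting a
    misbehaving module take the system down, it is reloaded on the fly.
    """
    restarts = 0
    while True:
        try:
            return component()      # component finished normally
        except Exception:
            restarts += 1           # "reincarnate" the component
            if restarts > max_restarts:
                raise               # persistent fault: give up

failures = {"left": 2}              # crash twice, then succeed

def flaky_disk_driver():
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("bitflip corrupted driver state")
    return "read ok"

assert supervise(flaky_disk_driver) == "read ok"
```

A transient fault (like a single bitflip) goes away on reload; a persistent one eventually exhausts the restart budget and surfaces as a real failure.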
Jedd|5 years ago
Even with a triple-redundant quorum mechanism, slightly further up the stack you're going to have some bit of code that processes the three returned results; if the memory that code is sitting in gets corrupted, you're back where you started.
eevilspock|5 years ago
Doing quorum at the computer level would require synchronizing parallel computers, and unless that synchronization happened for every low-level instruction, the software would have to be written to take a vote at critical points. This would badly hurt throughput and greatly increase software complexity.
I guess you could implement the quorum at the CPU level, e.g. redundant cores, each with its own memory. But unless there were a need to protect against the CPU cores themselves being unreliable, I don't see this making sense either.
At the end of the day, at some level, it will always come down to probabilities. "Software engineering principles" will never eliminate that.
wtallis|5 years ago
Generally, it would make the most sense to kill the process if the corrupted page is data, but if it's code, then maybe re-load that page from the executable file on non-volatile storage. (You might also be able to rescue some data pages from swap space this way.)
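That recovery policy could be sketched as a toy decision procedure (hypothetical page metadata; a real kernel would consult the VMA and page cache):

```python
def handle_corrupt_page(page):
    """Decide how to recover a page that failed its memory check.

    `page` is a dict with (hypothetical) fields:
      kind:    "code" or "data"
      backing: path of the file the page was mapped from, or None
      dirty:   whether the page was modified since it was loaded
    """
    if page["kind"] == "code" and page["backing"]:
        # text pages are read-only copies: reload from the executable
        return "reload from " + page["backing"]
    if page["kind"] == "data" and page["backing"] and not page["dirty"]:
        # clean data page: the on-disk (or swap) copy is still valid
        return "reload from " + page["backing"]
    return "kill process"   # no good copy exists anywhere

assert handle_corrupt_page(
    {"kind": "code", "backing": "/bin/ls", "dirty": False}
) == "reload from /bin/ls"
assert handle_corrupt_page(
    {"kind": "data", "backing": None, "dirty": True}
) == "kill process"
```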
colejohnson66|5 years ago
The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.
rovr138|5 years ago
You do the comparison on multiple nodes too: get the calculations, pass them to multiple nodes, validate again, and if it all matches, use the result.
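That two-stage scheme could look roughly like this (a made-up sketch: compute on several nodes, then have several independent validators check agreement before the result is used):

```python
def compute_stage(nodes, task):
    """Run `task` independently on every compute node; collect answers."""
    return [node(task) for node in nodes]

def validate_stage(validators, results):
    """Every validator must independently agree the results match."""
    return all(validator(results) for validator in validators)

def all_equal(results):
    # one possible validation rule: unanimous agreement
    return len(set(results)) == 1

# three "nodes" computing the same thing, two independent validators
nodes = [lambda t: t * t] * 3
results = compute_stage(nodes, 7)
assert results == [49, 49, 49]
assert validate_stage([all_equal, all_equal], results)  # matches: use it
```

Note this only moves the problem up a level, as pointed out elsewhere in the thread: the validators themselves still run somewhere, so at some point you fall back on probabilities.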