MertsA | 2 years ago
Yes but you need to catch it first to know what to take out of production.
>That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.
That's kind of my point. Either it's a heisenbug and you never see those results again when you repeat the original program, or it's permanently broken and you need to swap out the sketchy CPU. If you only care about the first case then you only need one core. If you care about the second case then you need three if you want to come up with an accurate result instead of just determining that one of them is faulty. It's like that old adage about clocks on ships: take one clock or take three, never two.
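The one-vs-three argument above can be sketched in code. This is a hypothetical illustration (the function names are mine, and real TMR runs on independent hardware, not three calls on one core): with three runs, two-of-three agreement both yields a trusted answer and identifies the faulty run, which two runs alone cannot do.

```python
def run_with_tmr(compute, *args):
    """Run the same computation three times and majority-vote.

    Returns (result, faulty_indices). With only two runs you can
    detect divergence but not decide which result to trust; with
    three, 2-of-3 agreement gives an answer and names the outlier.
    """
    results = [compute(*args) for _ in range(3)]
    for candidate in results:
        if results.count(candidate) >= 2:
            # Majority found: accept it and flag any dissenting run.
            faulty = [i for i, r in enumerate(results) if r != candidate]
            return candidate, faulty
    # All three disagree: the hardware is too broken to vote with.
    raise RuntimeError("no majority: all three results diverge")
```

A persistent fault shows up as the same index repeatedly landing in `faulty`; a heisenbug shows up once and never again.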
namibj | 2 years ago
MertsA | 2 years ago
> 2 will tell you if they diverge, but you lose both if they do. 3 lets you retain 2 in operation if one does diverge.
If you care about resilience then you either settle for one and accept that you can't catch the class of errors that are persistent, or go with three if you actually need resilience to those failures as well. If you don't need that kind of resilience the way an aerospace application would, then you're probably better off catching this at a higher layer in the overall distributed systems design. Rather than trying to make a single server resilient and perfectly accurate, design your service to be resilient to hardware faults and stack checksums on checksums so you can catch errors (whether HW or software) wherever some invariant is violated.

Meta also has a paper on their "Tectonic filesystem" where there's a checksum of every 4K chunk fragment, a checksum of the whole chunk, and a checksum of the erasure encoded block constructed out of the chunks. Once you add yet another layer of replication above this, then even when some machine is computing corrupt checksums, or inconsistent checksums where both the checksum and the data are corrupt, you can still catch it and you have a separate copy to avoid data loss.
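The layered-checksum idea can be sketched like this. It's a loose illustration of checksums-on-checksums, not Tectonic's actual on-disk format (the names, CRC32, and layout here are all my own assumptions): each 4K fragment carries its own checksum, and the whole chunk carries another, so corruption at either granularity trips at least one layer on read.

```python
import zlib

FRAGMENT_SIZE = 4096  # per the 4K fragment granularity described above

def write_chunk(data: bytes) -> dict:
    """Store a chunk with per-fragment checksums plus a whole-chunk checksum."""
    frags = [data[i:i + FRAGMENT_SIZE]
             for i in range(0, len(data), FRAGMENT_SIZE)]
    return {
        "fragments": frags,
        "frag_crcs": [zlib.crc32(f) for f in frags],
        "chunk_crc": zlib.crc32(data),
    }

def verify_chunk(chunk: dict) -> str:
    """Check every layer; a mismatch at any layer localizes the corruption."""
    for i, (frag, crc) in enumerate(zip(chunk["fragments"],
                                        chunk["frag_crcs"])):
        if zlib.crc32(frag) != crc:
            return f"fragment {i} corrupt"
    if zlib.crc32(b"".join(chunk["fragments"])) != chunk["chunk_crc"]:
        return "chunk checksum mismatch"
    return "ok"
```

A machine writing corrupt data with matching corrupt checksums slips past its own layer, which is why the replication layer above (a second, independently written copy) is what ultimately prevents data loss.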