item 36680535

We put a distributed database in the browser and made a game of it

239 points | BratishkaErik | 2 years ago | tigerbeetle.com | reply

62 comments

[+] petrosagg|2 years ago|reply
> Sure, we’re not yet injecting storage faults, but then formal proofs for protocols like Raft and Paxos assume that disks are perfect, and depend on this for correctness? After all, you can always run your database over RAID, right? Right?

> If your distributed database was designed before 2018, you probably couldn’t have done much. The research didn’t exist.

I'm trying to understand this part but something seems off. It seems to imply that those proofs do not apply to the real world because disks are not perfect, but neither is RAM. The hardware for both RAM and disks has an inherent error rate, which can be brought down to arbitrarily low levels by using error correction codes (e.g. ECC RAM).

I'm assuming that the TigerBeetle correctness proofs are predicated on perfect RAM even though in reality there is a small probability of errors. This tells me that there is an error rate which they consider negligible. If that's the case what is the difference between:

* TigerBeetle's storage approach

* Paxos or Raft with enough error correction on disk writes that the probability of errors equals that of ECC RAM which is considered negligible

I've probably made a logical error in my reasoning but I can't see it. Can someone enlighten me?

[+] chubot|2 years ago|reply
Not an expert in this area, but I think disks have correlated failure modes whereas CPUs and memory generally don't.

Especially spinning platter disks, not sure about SSDs.

The difference in failure rates could be orders of magnitude ... Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)

But disks will have non-random manufacturing issues. I'd be interested in more info too, but my impression is that the data on these issues is pretty thin. FoundationDB mentioned it ~10 years ago and Google published data >10 years ago, but hardware has changed a lot since then

Software redundancy will take care of non-correlated failures, but it fails precisely when there are correlated ones

[+] jorangreef|2 years ago|reply
Thanks for the question! Joran from TigerBeetle here.

The research in question is the 2018 paper from UW-Madison, “Protocol-Aware Recovery for Consensus-Based Storage” (PAR) [0] by Ram Alagappan, Aishwarya Ganesan, as well as Remzi and Andrea Arpaci-Dusseau (who you may recognize as authors of OSTEP).

PAR won best paper at FAST '18 for showing that a single disk sector fault, in the write-ahead log (WAL) of a single replica, could propagate through the Raft or MultiPaxos consensus protocol to cause global cluster data loss.

This was counter-intuitive at the time, because PAR showed that the redundancy of these consensus and replication protocols did not in fact always imply fault-tolerance, as had previously been assumed.

The reason, and we cover this in depth in our recent QCon London talk [1], is that it was assumed that checksums alone would be sufficient to detect and recover from storage faults.

However, while checksums can be used under the “Crash Consistency Model” to solve consistency through power loss, PAR showed that checksums are not sufficient to be able to distinguish between a torn write at the end of the (uncommitted) WAL caused by power loss, and a torn write in the middle of the (committed) WAL caused by bitrot.

What you tend to find is that many of these implementations will truncate the WAL at the first sign of a checksum mismatch, conflating the mismatch with power loss when it might be bitrot, thereby truncating committed transactions and undermining quorum votes in the Raft or MultiPaxos implementation.
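Here's a minimal Python sketch of that naive truncate-on-mismatch recovery policy (illustrative only, not TigerBeetle's or any real system's code), showing how it conflates the two cases:

```python
import zlib

def entry(payload: bytes):
    """A WAL entry written correctly: payload plus matching checksum."""
    return (payload, zlib.crc32(payload))

def rotted(payload: bytes):
    """Simulate bitrot or a torn write: stored checksum no longer matches."""
    return (payload, zlib.crc32(payload) ^ 1)

def naive_wal_recover(entries):
    """Keep entries until the first checksum mismatch, then truncate,
    assuming the mismatch must be a torn write from power loss."""
    recovered = []
    for payload, stored in entries:
        if zlib.crc32(payload) != stored:
            break  # conflates bitrot in the middle with a torn tail
        recovered.append(payload)
    return recovered

# Power loss: torn write at the (uncommitted) tail -- truncation is safe.
assert naive_wal_recover([entry(b"t1"), entry(b"t2"), rotted(b"t3")]) == [b"t1", b"t2"]

# Bitrot: corruption in the middle of the (committed) log -- the very same
# policy silently discards the committed entries t2 AND t3.
assert naive_wal_recover([entry(b"t1"), rotted(b"t2"), entry(b"t3")]) == [b"t1"]
```

The checksum tells you *that* an entry is bad, but not *why*, which is exactly the distinction PAR shows the recovery protocol needs.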

RAID solutions don't always help here, either. See "Parity Lost and Parity Regained" [2] for more details. RAID-Z is better here, and ZFS is a huge inspiration, but with local redundancy under ZFS you're still not leveraging the global redundancy of the consensus protocol as well as you could be.

To summarize PAR:

Fundamental design changes need to be made to both the global consensus protocol and the local storage engine if the storage fault model of PAR (and TigerBeetle) is to be handled correctly.

Furthermore, few simulators even test for these kinds of storage faults. For example, misdirected I/O, where the disk writes or reads to or from the wrong location on disk, which may still carry a valid checksum.

However, this is important, because disks fail in the real world. A single disk has on the order of a 0.5-1% chance of corruption over a 2-year period [3], so a 5-node cluster has roughly a 2.5-5% chance of a single disk sector fault, which, again in terms of PAR, can lead to global cluster data loss.
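The back-of-envelope arithmetic here, assuming independent per-disk faults, is just 1 - (1 - p)^n:

```python
def cluster_corruption_prob(per_disk_p: float, replicas: int) -> float:
    """Chance that at least one replica suffers disk corruption,
    assuming faults are independent across replicas."""
    return 1.0 - (1.0 - per_disk_p) ** replicas

# 0.5%-1% per disk over ~2 years [3], across a 5-node cluster:
print(f"{cluster_corruption_prob(0.005, 5):.2%}")  # 2.48%
print(f"{cluster_corruption_prob(0.01, 5):.2%}")   # 4.90%
```

Which is where the 2.5-5% figure above comes from.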

On the other hand, memory (or even CPU) faults, assuming ECC, are not in the same order of magnitude of probability, and therefore TigerBeetle's memory fault model requires ECC memory.

But, again, to be crystal clear, checksums alone are not sufficient to solve the consensus corruption issue. The fix requires protocol changes at the design level, for the consensus protocol to be made storage fault-aware.

Thanks for the question and happy to answer more!

[0] “Protocol-Aware Recovery for Consensus-Based Storage” https://www.usenix.org/conference/fast18/presentation/alagap...

[1] “A New Era for Database Design” (we also dive into the research surrounding Fsyncgate, looking into the latent correctness issues that remain) https://www.youtube.com/watch?v=_jfOk4L7CiY

[2] “Parity Lost and Parity Regained” https://www.usenix.org/conference/fast-08/parity-lost-and-pa...

[3] “An Analysis of Data Corruption in the Storage Stack” https://www.cs.toronto.edu/~bianca/papers/fast08.pdf

[+] convolvatron|2 years ago|reply
you just bury these and play games with p. just like distributed consensus, there is no perfect storage medium. 1 bit flip a week too much for you? add secded. secded on memory. interleaving to spread out correlated errors. etc.
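for the curious, secded is just extended hamming. a toy hamming(8,4) sketch in python (illustrative only, not what real ecc hardware runs):

```python
def hamming84_encode(nibble: int):
    """Encode 4 data bits as extended Hamming(8,4): single-error
    correction, double-error detection (SECDED).
    Returns a list of 8 bits: index 0 is the overall parity bit,
    indices 1..7 are the classic Hamming(7,4) positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d            # data bits
    c[1] = c[3] ^ c[5] ^ c[7]             # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]             # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]             # parity over positions with bit 2 set
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]  # overall parity
    return c

def hamming84_decode(c):
    """Return (nibble, status): 'ok', 'corrected', or 'double-error-detected'."""
    syndrome = 0
    for i in range(1, 8):
        if c[i]:
            syndrome ^= i                 # XOR of set positions locates the error
    parity = 0
    for bit in c:
        parity ^= bit                     # valid codewords have even total parity
    if syndrome and parity:               # single error: flip it back
        c = list(c)
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome and not parity:         # two errors: detectable, not correctable
        return None, "double-error-detected"
    elif parity:                          # error in the overall parity bit itself
        status = "corrected"
    else:
        status = "ok"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status
```

flip any one of the 8 bits and the nibble comes back intact; flip two and you get a detection instead of silent corruption.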

at some point you run into the probability of the earth being coincident with the sun and you call it good.

none of viewstamped replication, paxos, or raft deals with storage errors.

where it does get interesting is that in the low-p regime you can subvert the correctness of these protocols by fuzzing them. and then you get into byzantine protocols.

[+] pohl|2 years ago|reply
This is a sweet idea and a nice, game-esque implementation.

It could definitely use some onboarding.

There's nothing to give the "player" a hint as to what they should do. What is my goal? Am I trying to defeat the communication of the nodes, or help them? If it's the former, why did I seem to win the first level after doing nothing? I started by trying to mash some keys. Eventually I saw that there were tools in the upper-right, but it wasn't clear what to do with them. It's a bit frustrating that they disappear after you grab them if your click is too far off the target. After successfully applying one of them to one of the targets, it's not clear what one should do next. Was it a good move? Who knows, because there's no feedback.

[+] jorangreef|2 years ago|reply
Thanks for the feedback! This is the first version of our educational SimTigerBeetle frontend.

The goal is to see and interact with a distributed database running visually under Easy/Medium/Hard levels of fault injection, giving you the opportunity to explore how the consensus handles failures, poke the otherwise deterministically simulated world, or inspect progress of the replicas.

So, it's a game in the sim genre, and we've got more to come, but the significance is hopefully what we tried to explain in the post:

You're seeing a new design of distributed database survive cosmic levels of fault injection (8% storage corruption on the read path across all replicas in the Radioactive level!), in your browser... and you can smash things with a hammer. ;)

You might also appreciate the video [0] at the end of our post, where we dive into the three insights behind deterministic simulation testing, and how DST moves beyond chaos engineering.

[0] SimTigerBeetle (Director's Cut!) https://www.youtube.com/watch?v=Vch4BWUVzMM

[+] kristoff_it|2 years ago|reply
Well, the point is that the replication protocol is going to fix all issues no matter what, so you could say that the only winning move is to enjoy watching TigerBeetle move forward, impervious to both scripted and human-injected faults.

In other words, you're going to 'win' no matter what :^)

[+] schmichael|2 years ago|reply
This is incredible! I am curious about this paragraph though:

> You’re going to see view changes when the primary crashes or is partitioned, and VSR’s telltale round robin rotation of the new primary among replicas, until a new primary is established. This is in contrast to Raft, which elects a primary at random, but then suffers from the risk of (or increased latency to mitigate) dueling leaders.

It seems like regardless of primary/leader selection mechanism used (deterministic/roundrobin vs voting), you still need a quorum of nodes to agree (and for the minority side to know they can't proceed). Surely in a round robin selection mechanism some sort of vote or liveness check must be performed before it is safe for the #2 node to promote itself to #1/primary? Otherwise if the link between #2 and #1/primary is partitioned, #2 could unilaterally assume it was the primary/leader, even if the rest of the nodes could still communicate with #1 (the original primary). I don't understand how round robin solves the agreement aspect that leader election does.

The simulation seems to only partition nodes and not links so I'm not sure it exercises asymmetric connectivity between members. Although it does mention flaky links, so perhaps they do cover this case.

Edit: I have been informed there is still a vote. :)

[+] jorangreef|2 years ago|reply
Thanks, great to hear you enjoyed it!

The round robin "view change" in VSR is still consensus, and uses quorums to do fault isolation of the old primary, and to preserve the intersection property, to ensure that the committed log survives into the new view.

What's cool about VSR's consensus, though, is that the dice are loaded ahead of time, so that there's more information baked into the protocol than with Raft or MultiPaxos, which means you can neatly sidestep their dueling-leader problem, or the latency padding that is often added to mitigate it.
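Concretely, the reason no election is needed is that the primary is a pure function of the view number, while entering the new view still requires a quorum. A tiny sketch (illustrative, not TigerBeetle's actual code):

```python
def primary_for_view(view: int, replica_count: int) -> int:
    # The primary is derived from the view number itself:
    # view 0 -> replica 0, view 1 -> replica 1, ... wrapping around.
    # Every replica computes the same answer with no vote on *who* leads.
    return view % replica_count

def has_view_change_quorum(acks: set, replica_count: int) -> bool:
    # A replica still needs a majority of StartViewChange acks before
    # entering the new view, so a partitioned minority (e.g. replica #2
    # cut off from the primary) cannot unilaterally promote itself.
    return len(acks) > replica_count // 2

assert primary_for_view(7, 5) == 2          # deterministic round robin
assert has_view_change_quorum({0, 1, 2}, 5) # majority: view change proceeds
assert not has_view_change_quorum({1, 2}, 5)
```

So the quorum still decides *whether* to change views; round robin only removes the question of *who* the next primary is.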

You can read more about VSR's intuitive view change consensus here: http://pmg.csail.mit.edu/papers/vr-revisited.pdf

The round robin view change can also make VSR (slightly) more resilient to weird network faults, like you describe. For example, variants of VSR using stable storage, as hinted at by the '12 paper, are in fact able to survive most of the OmniPaxos liveness scenarios (we submitted some errata to the paper's authors for this).

We haven't yet explored visualizing asymmetric connectivity in SimTigerBeetle (we tried to start with the big things!), however we recently wrote about how we test for this in our VOPR simulator here: https://tigerbeetle.com/blog/2023-07-06-simulation-testing-f...

[+] Digikid13|2 years ago|reply
It's very confusing that y'all are calling this a "game". There's nothing to play, it's just a simulation to watch.
[+] cstrahan|2 years ago|reply
"Simulation" doesn't necessarily entail stimulating interactivity and a pleasing aesthetic. As of writing, I don't know if there's a better term than "game" for what this is. (Maybe we can coin a new term, like "game-lite" -- somewhat akin to what "rogue-lite" is to the "rogue" genre).

See "walking simulators":

https://en.wikipedia.org/wiki/What_Remains_of_Edith_Finch

https://en.wikipedia.org/wiki/The_Stanley_Parable

https://en.wikipedia.org/wiki/Thirty_Flights_of_Loving

Plenty argue that these aren't games either (usual complaints involve lack of problem(s) to resolve, and no win-lose dynamic); but, then, what are these? The closest category I can think of would be "computer-animated film", but... these are interactive, and you can navigate and look in any direction you want, which yields a very different experience than watching a film like "Toy Story".

[+] captainhorst|2 years ago|reply
There are tools in the upper right corner. You can pick one and use it on a beetle. ;) This introduces various faults in the simulation that the database has to recover from.

Minor spoiler: there's another tiny game hidden at the end of the third level.

[+] cdchn|2 years ago|reply
This is pretty cool from a database perspective but I'm really interested in what they used to write this 'game.'
[+] eatonphil|2 years ago|reply
It's covered a little bit in the Evolution section (add #evolution to the URL and hit enter).

If that doesn't answer everything (it's not a long section, granted) Fabio (captainhorst) is on HN answering questions in this thread already so feel free to ask!

[+] muhaaa|2 years ago|reply
I thought tigerbeetle distributes the db on web clients? I skimmed through the docs and cannot find how to design a web app with tigerbeetle like the game https://sim.tigerbeetle.com/
[+] eatonphil|2 years ago|reply
Hey! I'm not completely sure I follow the question. But the game is built a little differently from how you'd typically run TigerBeetle in production. In production you'd run a cluster of replicas (see https://docs.tigerbeetle.com/deploy/hardware#cluster-of-repl...) and then you'd have your application connect to the cluster using one of our client libraries.

The game is more a reflection of how we do simulation testing (as the post mentioned). You can find the code for that here: https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/s....

[+] moralestapia|2 years ago|reply
Nice work @eatonphil and TB team. Thanks for sharing.

I'm eagerly awaiting a production-ready release (or an RC or something) of TB; I've tried it and it truly excels in its domain.

I have a few clients that could make good use of it but I have to hold my horses a little bit.

[+] jorangreef|2 years ago|reply
Thanks, Alex! Awesome to hear that. SimTigerBeetle has been a skunkworks project, on the side, the past 12 months. We're looking forward to sharing the production release of TigerBeetle with you soon when it's ready.
[+] brimstedt|2 years ago|reply
Is this something similar to couch/pouchdb?
[+] eatonphil|2 years ago|reply
TigerBeetle is more domain-specific, i.e. focused on financial transactions, high performance, and high availability. There are just two entity types in the database: accounts and transfers between accounts. More details here https://docs.tigerbeetle.com/design/data-modeling if you're interested!
[+] charbuff|2 years ago|reply
This is a really cool sales prop