Heidi Howard is incredible, and her work on distributed consensus is illuminating. I think she has actually cracked the problem of making consensus easy. https://www.youtube.com/watch?v=KTHOwgpMIiU
The part I found non-obvious is how to snapshot the unbounded vectors of immutable registers, which is a requirement for any real system. But I never gave it a whole lot of thought either.
To snapshot, I either need to find an epoch such that no older epoch could ever have relevant information for the current node, or I need the nodes to agree on a snapshot. Both seem complicated.
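One concrete shape the first option can take, borrowing Raft-style "match index" bookkeeping rather than Howard's register model (all names here are illustrative, not from her talk): a leader that tracks how far each peer has durably acknowledged can discard any log prefix that every peer has stored and the local state machine has applied.

```python
# Hypothetical sketch: picking a safe log-truncation (snapshot) point.
# Assumes the leader tracks, per peer, the highest log index that peer
# has durably acknowledged ("match index", in Raft terms).

def safe_snapshot_index(match_index, applied_index):
    """A log prefix can be discarded once every peer has acknowledged it
    AND the local state machine has applied it."""
    return min(min(match_index.values()), applied_index)

# Peers b and c have acknowledged up to 90 and 75; we've applied to 80.
print(safe_snapshot_index({"b": 90, "c": 75}, applied_index=80))  # -> 75
```

This only works while all peers are reachable, which hints at why the second option (agreeing on the snapshot through the protocol itself) exists at all.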
I’m more interested in reaching consensus on “libpaxos” vs. “libraft”: Which algorithm is more amenable to practical abstraction, so that the distributed system developer need only register a handful of callbacks for the actual mechanisms of, e.g., persistent local logging?
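As a sketch of what that callback surface might look like (every name below is invented for illustration; no existing libpaxos/libraft exposes this API, and replication, elections, and failure handling are elided):

```python
# Hypothetical sketch of a callback-based consensus library API: the
# library owns the protocol; the application registers callbacks for the
# mechanisms it must supply (durable logging, applying entries, transport).

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ConsensusCallbacks:
    append_to_log: Callable[[int, bytes], None]    # persist entry durably
    apply_committed: Callable[[int, bytes], None]  # feed entry to the state machine
    send_message: Callable[[str, bytes], None]     # transport to a named peer

@dataclass
class ConsensusNode:
    callbacks: ConsensusCallbacks
    log: list = field(default_factory=list)
    commit_index: int = -1

    def propose(self, entry: bytes) -> None:
        index = len(self.log)
        self.log.append(entry)
        self.callbacks.append_to_log(index, entry)
        # A real library would replicate to a quorum before this point.
        self.commit_index = index
        self.callbacks.apply_committed(index, entry)

applied = []
node = ConsensusNode(ConsensusCallbacks(
    append_to_log=lambda i, e: None,             # e.g. fsync to a local WAL
    apply_committed=lambda i, e: applied.append(e),
    send_message=lambda peer, msg: None,
))
node.propose(b"set x=1")
print(applied)  # -> [b'set x=1']
```

The hard part, as the Kudu discussion below suggests, is that real storage engines keep leaking requirements through exactly this boundary.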
From experience working in this area, I believe there's a significant tradeoff between performance, flexibility, and time to delivery when it comes to consensus and the things it's used for, like database replication. It's like: "good, fast, or cheap, pick two".
As one of the core authors of the Apache Kudu Raft implementation <http://kudu.apache.org/> (which is written in C++) I know that we tried to design it to be a pretty standalone subsystem, but didn't try to actually provide a libraft per se. We wanted to reuse the Raft write-ahead log as the database write-ahead log (as a performance optimization) which is one reason that making the log API completely generic eluded us a little.
That said, I'm currently at Facebook helping to adapt that implementation for another database. We are trying to make it database agnostic, and we continue to find cases where we need some extra metadata from the storage engine, new kinds of callbacks, or hacks to deal with various cases that just work differently than the Kudu storage engine. It would likely take anybody several real world integrations to get the APIs right (I'm hopeful that we eventually will :)
If the big corporations were willing to open-source their Paxos libraries (I'm pretty sure Microsoft, Google, and Amazon all have their own implementations), there would be no need for any other option; the only real obstacle is competitive considerations.
Shameless plug: I've written a writeup of the different implementations/variations of Paxos [0] if you'd like to see more of the ecosystem.
I'm actually kind of wondering why this is news now. The author of this paper is absolutely right, of course, but I thought this was common knowledge for anyone who knew what Raft was. Raft is less robust than Paxos but simpler to implement (and implement correctly), which is why projects choose it. Basically: chat (run a Paxos round) to decide a leader, then build the log (the history of everything that has ever happened in the system; see WALs for a similar concept) at the chosen leader, instead of chatting about every single change or batch of changes.
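A toy model of that pattern (names invented; no terms, no failures, direct method calls standing in for RPCs): once a leader exists, it alone orders entries, and an entry commits when a majority of the cluster, counting the leader, has stored it.

```python
# Toy sketch of leader-based log replication: the leader orders the log;
# an entry is committed once a majority of the cluster has stored it.
# Leader election, terms, and failure handling are all elided.

class Follower:
    def __init__(self):
        self.log = []

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []          # the write-ahead log: everything that happened
        self.committed = []

    def replicate(self, entry):
        self.log.append(entry)
        acks = 1  # the leader counts itself
        for f in self.followers:
            f.log.append(entry)  # stands in for an AppendEntries-style RPC
            acks += 1
        if acks > (len(self.followers) + 1) // 2:
            self.committed.append(entry)

leader = Leader([Follower(), Follower()])
leader.replicate("x = 1")
print(leader.committed)  # -> ['x = 1']
```

The point of the pattern is that the per-entry Paxos chatter disappears: ordering is free once everyone agrees who the leader is.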
I find all these kinds of papers confusing. Comparing an RSM (replicated state machine) to Paxos is like comparing a car to an engine: it makes little sense.
In Lamport's "Paxos Made Simple" (https://lamport.azurewebsites.net/pubs/paxos-simple.pdf), part 3 (the RSM part) is not explained in much depth. There are countless ways to use Paxos to implement an RSM; Multi-Paxos, Raft, and EPaxos all try to fill in that gap.
By any measure, Paxos itself is 10x simpler than Raft or whatever. Every time I hear a "distributed systems" engineer say Paxos is complicated, I know they don't have much experience in the field, or at least have never implemented the core consensus part...
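For what it's worth, the single-decree acceptor (the "core consensus part" referred to above) really is small. A minimal sketch, very much a toy, since real acceptors must persist their state before replying:

```python
# Minimal single-decree Paxos acceptor: never go back on a promise, and
# remember the highest-ballot value accepted so far.

class Acceptor:
    def __init__(self):
        self.promised = -1           # highest ballot promised
        self.accepted = (-1, None)   # (ballot, value) last accepted

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted   # proposer must adopt this value
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

a = Acceptor()
ok, _ = a.prepare(1)
print(ok, a.accept(1, "v"))   # -> True True
print(a.prepare(0)[0])        # -> False (stale ballot rejected)
```

Everything hard lives outside this class: quorums, retries, durability, and turning one decided value into a replicated log.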
Indeed, in the paper they're comparing MultiPaxos to Raft.
EDIT: For others, here's a very comprehensive (as of ~2018) review of Paxos-related distributed consensus algorithms with an exposition for each one: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/ That's 17 in all, excluding the original Paxos paper. IMO, it should be linked anywhere Paxos is discussed. The link has been posted twice before by others on HN, but unfortunately hasn't seen any discussion, perhaps because it speaks for itself.
The paper's main conclusion is accurate: Raft is more understandable because of the clarity of the paper, but implementation is very tricky. As I've written elsewhere, it takes weeks or months to write a solid implementation from scratch.
Raft itself - rather than any framework in which you would actually want to use it - is quite simple to implement. A few classmates of mine and I implemented a barebones Raft instance in about a weekend.
The conclusion of the paper is that there isn't actually a significant difference between the algorithms. The Raft paper is much clearer about implementation, but (as Heidi says) the impl ideas from the Raft paper can be applied to Paxos in many cases. Raft's leader election _is_ a bit more elegant and results in a less complex implementation. The paper was a great read!
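The rule behind that elegance is compact enough to show directly: a follower grants its vote only to a candidate whose log is at least as up to date as its own, comparing the last entry's term first and the log length second. A sketch, with the surrounding term bookkeeping elided:

```python
# Raft's vote-granting log check ("up-to-dateness"), as a sketch.

def grant_vote(my_last_term, my_last_index, cand_last_term, cand_last_index):
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term   # higher last term wins
    return cand_last_index >= my_last_index    # same term: longer log wins

print(grant_vote(2, 5, 3, 1))  # -> True  (candidate has a newer last term)
print(grant_vote(2, 5, 2, 4))  # -> False (same term, but a shorter log)
```

This is what lets Raft skip Paxos-style state transfer after election: only a node that already holds every committed entry can win.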
I haven't found a problem well suited to either as a solution. Either the performance constraints are too tight for industrial applications, or you're in a P2P space, where both algorithms' assumption that nodes aren't hostile rules them out.
In practice it just seems most efficient to be tolerant of consensus failures and focus on cAP (give up strict consistency, keep availability and partition tolerance).
A lot of real production systems (Spanner, CockroachDB, TiDB) take the opposite approach and sacrifice availability for consistency. Scaling constraints are usually solved via sharding (running multiple Raft/Paxos state machines, one per slice of the dataset).
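The sharding half of that can be sketched in a few lines (routing only; each group's internal Raft/Paxos machinery is elided, and the key names are made up):

```python
# Sketch of key-hash sharding across independent consensus groups:
# each shard runs its own Raft/Paxos instance over its own log.

import hashlib

NUM_SHARDS = 4

def shard_for(key):
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# One replicated log (i.e. one consensus group) per shard:
logs = {i: [] for i in range(NUM_SHARDS)}
logs[shard_for("user:42")].append("PUT user:42")
print(sum(len(l) for l in logs.values()))  # -> 1
```

Each group only has to agree on its own slice of the keyspace, which is how these systems keep consensus off the scaling-bottleneck path.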
jganetsk | 6 years ago
dnautics | 6 years ago
gitgudnubs | 6 years ago
sudofail | 6 years ago
flexd | 6 years ago
enitihas | 6 years ago
throw0101a | 6 years ago
* https://www.youtube.com/watch?v=JQss0uQUc6o
See also her PhD dissertation, "Distributed consensus revised":
* https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.html
anogrebattle | 6 years ago
twoodfin | 6 years ago
mpercy | 6 years ago
justicezyx | 6 years ago
hardwaresofton | 6 years ago
[0]: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/
judofyr | 6 years ago
- Compartmentalized Paxos (https://mwhittaker.github.io/publications/compartmentalized_...) shows how you can compartamentalize the various roles of a Paxos node to increase scalability.
- Matchmaker Paxos (https://mwhittaker.github.io/publications/matchmaker_paxos.p...) introduces a separate set of nodes called matchmakers which are used during reconfiguration.
- PigPaxos (https://arxiv.org/abs/2003.07760) places the nodes into relay groups and revises the communication pattern to improve scalability. This seems very similar to Compartmentalized Paxos.
- Linearizable Quorum Reads in Paxos (https://www.usenix.org/system/files/hotstorage19-paper-chara...) shows how you can do linearizable quorum reads in Paxos.
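To give a flavor of the last item, here is the core of a quorum read, heavily simplified from the paper (the follow-up "rinse" phase and all failure handling are omitted): ask a majority of replicas and return the value attached to the highest ballot seen.

```python
# Simplified quorum read: among replies from a majority of replicas,
# the value carried by the highest ballot is the most recent write.

def quorum_read(replies):
    """replies: (ballot, value) pairs from a majority of replicas."""
    ballot, value = max(replies, key=lambda r: r[0])
    return value

print(quorum_read([(3, "new"), (2, "old"), (3, "new")]))  # -> new
```

The payoff is that reads avoid the leader entirely, at the cost of contacting a quorum instead of one node.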
ideal0227 | 6 years ago
wahern | 6 years ago
gitgudnubs | 6 years ago
I've implemented the consensus part of Paxos in 10 minutes. But it's a toy. It's totally useless without the other stuff.
sriram_malhar | 6 years ago
codepie | 6 years ago
ccleve | 6 years ago
prismatk | 6 years ago
enitihas | 6 years ago
kevindeasis | 6 years ago
Paxos vs Raft: Have we reached consensus on distributed consensus? — Heidi Howard
mirimir | 6 years ago
But anyway, maybe this could be applied to helping people work together cooperatively.
chapium | 6 years ago
cmckn | 6 years ago
stelfer | 6 years ago
blamestross | 6 years ago
dilyevsky | 6 years ago