
Paxos vs. Raft: Have we reached consensus on distributed consensus?

343 points | rbanffy | 6 years ago | arxiv.org | reply

65 comments

[+] jganetsk|6 years ago|reply
Heidi Howard is incredible and her work on distributed consensus is illuminating. I think she has actually successfully cracked the cookie of "making consensus easy". https://www.youtube.com/watch?v=KTHOwgpMIiU
[+] dnautics|6 years ago|reply
That was fantastic! Do you know what happened with her Ios project?
[+] gitgudnubs|6 years ago|reply
The part I found non-obvious is how to snapshot the unbounded vectors of immutable registers, which is a requirement for any real system. But I never gave it a whole lot of thought either.

To snapshot, I either need to find an epoch such that no older epoch could ever have relevant information for the current node, or I need the nodes to agree on a snapshot. Both seem complicated.
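The first option (find an epoch no node could still need) can be sketched in miniature. This is a hypothetical illustration, not something from Howard's papers; the names (`registers`, `acked_epochs`) are invented:

```python
# Hypothetical sketch: compacting an unbounded vector of write-once
# registers once every node has acknowledged applying a snapshot point.

def compact(registers, acked_epochs):
    """registers: dict mapping epoch -> decided value.
    acked_epochs: per-node epoch up to which a snapshot has been applied.
    Keep only epochs that some node might still need."""
    safe_point = min(acked_epochs.values())  # no node needs anything older
    return {e: v for e, v in registers.items() if e >= safe_point}

registers = {0: "a", 1: "b", 2: "c", 3: "d"}
acked = {"n1": 2, "n2": 3, "n3": 2}
compacted = compact(registers, acked)
# epochs 0 and 1 fall below every node's snapshot point and are dropped
```

The hard part, of course, is the agreement on `acked_epochs` itself, which is exactly the complication the parent points out.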

[+] sudofail|6 years ago|reply
This was a very good talk. Thanks for sharing!
[+] flexd|6 years ago|reply
I can't help myself: Wouldn't we need a third consensus algorithm to reach consensus? :D
[+] enitihas|6 years ago|reply
You just got your wish in the form of Viewstamped Replication.
[+] twoodfin|6 years ago|reply
I’m more interested in reaching consensus on “libpaxos” vs. “libraft”: Which algorithm is more amenable to practical abstraction, so that the distributed system developer need only register a handful of callbacks for the actual mechanisms of, e.g., persistent local logging?
[+] mpercy|6 years ago|reply
From experience working in this area, I believe there's a significant tradeoff between performance, flexibility, and time to delivery when it comes to consensus and the things it's used for, like database replication. It's like: "good, fast, or cheap, pick two".

As one of the core authors of the Apache Kudu Raft implementation <http://kudu.apache.org/> (which is written in C++) I know that we tried to design it to be a pretty standalone subsystem, but didn't try to actually provide a libraft per se. We wanted to reuse the Raft write-ahead log as the database write-ahead log (as a performance optimization) which is one reason that making the log API completely generic eluded us a little.

That said, I'm currently at Facebook helping to adapt that implementation for another database. We are trying to make it database agnostic, and we continue to find cases where we need some extra metadata from the storage engine, new kinds of callbacks, or hacks to deal with various cases that just work differently than the Kudu storage engine. It would likely take anybody several real world integrations to get the APIs right (I'm hopeful that we eventually will :)

[+] justicezyx|6 years ago|reply
If some of the big corps were willing to open source their Paxos libraries (I'm pretty sure Microsoft, Google, and Amazon all have their own implementations), there would be little need for any competing ones.
[+] hardwaresofton|6 years ago|reply
Shameless plug: I've written a writeup of the different implementations/variations of Paxos [0] if you'd like to see more of the ecosystem.

I'm actually kind of wondering why this is news now -- the author of this paper is absolutely right, of course, but I thought this was common knowledge for anyone who knew what Raft was. Raft is less robust than Paxos but simpler to implement (and implement correctly), which is why projects choose it. Basically: chat (run a Paxos round) to decide a leader, then build the log (the history of everything that has ever happened in the system; see WALs for a similar concept) at the chosen leader, instead of chatting about every single change or batch of changes.

[0]: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/
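The elect-once-then-append shape described above can be sketched roughly as follows. This is an illustrative toy (all names invented, no value recovery, persistence, or networking), not a faithful MultiPaxos:

```python
# Toy sketch: one Paxos-style ballot round elects a leader; afterwards
# the leader appends log entries without a fresh round per entry.

class Node:
    def __init__(self, name):
        self.name = name
        self.promised = 0   # highest ballot this node has promised
        self.log = []       # replicated log of commands

def elect(nodes, ballot):
    """Phase-1-style election: the candidate wins if a majority promises."""
    promises = 0
    for n in nodes:
        if ballot > n.promised:
            n.promised = ballot
            promises += 1
    return promises > len(nodes) // 2

def append(nodes, ballot, cmd):
    """Steady state: replicate cmd; committed once a majority acks."""
    acks = 0
    for n in nodes:
        if n.promised == ballot:   # follower still recognizes this leader
            n.log.append(cmd)
            acks += 1
    return acks > len(nodes) // 2

nodes = [Node(f"n{i}") for i in range(3)]
assert elect(nodes, ballot=1)
assert append(nodes, 1, "set x=1")
```

The point is structural: the per-entry path has no prepare phase, which is where MultiPaxos (and Raft) get their steady-state efficiency.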

[+] judofyr|6 years ago|reply
That's a great article! Thanks a lot. I know about some more variants you might want to check out:

- Compartmentalized Paxos (https://mwhittaker.github.io/publications/compartmentalized_...) shows how you can compartmentalize the various roles of a Paxos node to increase scalability.

- Matchmaker Paxos (https://mwhittaker.github.io/publications/matchmaker_paxos.p...) introduces a separate set of nodes called matchmakers which are used during reconfiguration.

- PigPaxos (https://arxiv.org/abs/2003.07760) places the nodes into relay groups and revises the communication pattern to improve scalability. This seems very similar to Compartmentalized Paxos.

- Linearizable Quorum Reads in Paxos (https://www.usenix.org/system/files/hotstorage19-paper-chara...) shows how you can do linearizable quorum reads in Paxos.

[+] ideal0227|6 years ago|reply
I think all these kinds of papers are very confusing. Comparing RSM (replicated state machine) protocols to Paxos is like comparing a car to an engine. It makes little to no sense.

In the original Paxos paper (https://lamport.azurewebsites.net/pubs/paxos-simple.pdf), part 3 (the RSM construction) is not extensively explained. There are countless ways to use Paxos to implement an RSM; MultiPaxos, Raft, and EPaxos all try to fill in that gap.

By any measure, Paxos itself is 10x simpler than Raft or anything else. Every time I hear a "distributed systems" engineer say Paxos is complicated, I know they don't have much experience in the field, or at least have never implemented the core consensus part...
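For a sense of how small that "core consensus part" is, here is a minimal single-decree Paxos acceptor. A toy sketch only: it omits persistence, networking, and the proposer side:

```python
# Minimal single-decree Paxos acceptor (illustrative only).

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted = (-1, None)  # (ballot, value) accepted so far

    def prepare(self, ballot):
        """Phase 1b: promise to ignore lower ballots; report prior accept."""
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        """Phase 2b: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("accepted", ballot)
        return ("nack", None)

a = Acceptor()
assert a.prepare(1) == ("promise", (-1, None))
assert a.accept(1, "v") == ("accepted", 1)
assert a.prepare(0) == ("nack", None)   # stale ballot rejected
```

As the sibling comments note, the gap between this and a production system (leader election, log replication, reconfiguration, snapshots) is where all the real work lives.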

[+] wahern|6 years ago|reply
Indeed, in the paper they're comparing MultiPaxos to Raft.

EDIT: For others, here's a very comprehensive (as of ~2018) review of Paxos-related distributed consensus algorithms with an exposition for each one: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/ That's 17 in all, excluding the original Paxos paper. IMO, it should be linked anywhere Paxos is discussed. The link has been posted twice before by others on HN, but unfortunately hasn't seen any discussion, perhaps because it speaks for itself.

[+] gitgudnubs|6 years ago|reply
That, or "Paxos" is a fantastic name, and "replicated state machine" is a mouthful.

I've implemented the consensus part of Paxos in 10 minutes. But it's a toy. It's totally useless without the other stuff.

[+] sriram_malhar|6 years ago|reply
No, simply not true at all. I've done both a few times. And yes, I have relevant experience.
[+] ccleve|6 years ago|reply
The paper's main conclusion is accurate. Raft is more understandable because of the clarity of the paper. But implementation is very tricky. As I've written elsewhere it takes weeks or months to write a solid implementation from scratch.
[+] prismatk|6 years ago|reply
Raft itself - rather than any framework in which you would actually want to use it - is quite simple to implement. A few classmates of mine and I implemented a barebones Raft instance in about a weekend.
[+] enitihas|6 years ago|reply
Is there any open source, widely used implementation of MultiPaxos? Single-decree Paxos doesn't seem useful by itself.
[+] mirimir|6 years ago|reply
The title is ~amusing for many who have experienced consensus-based decision making.

But anyway, maybe this could be applied to helping people work together cooperatively.

[+] chapium|6 years ago|reply
I thought raft was the obvious choice since it is a far simpler framework.
[+] cmckn|6 years ago|reply
The conclusion of the paper is that there isn't actually a significant difference between the algorithms. The Raft paper is much clearer about implementation, but (as Heidi says) the impl ideas from the Raft paper can be applied to Paxos in many cases. Raft's leader election _is_ a bit more elegant and results in a less complex implementation. The paper was a great read!
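Raft's election rule that the parent calls elegant fits in a few lines: one vote per term, granted only to candidates whose log is at least as up to date. A hedged sketch (field names invented, no RPC or timer machinery):

```python
# Sketch of a Raft-style RequestVote handler.

def request_vote(state, term, candidate, cand_last_term, cand_last_index):
    """state: dict with 'term', 'voted_for', 'last_term', 'last_index'."""
    if term < state["term"]:
        return False                                    # stale candidate
    if term > state["term"]:
        state["term"], state["voted_for"] = term, None  # new term, reset vote
    # Candidate's log must be at least as up to date (term, then index).
    log_ok = (cand_last_term, cand_last_index) >= (state["last_term"],
                                                   state["last_index"])
    if state["voted_for"] in (None, candidate) and log_ok:
        state["voted_for"] = candidate
        return True
    return False

s = {"term": 1, "voted_for": None, "last_term": 1, "last_index": 5}
assert request_vote(s, 2, "A", 1, 5)        # up-to-date candidate gets the vote
assert not request_vote(s, 2, "B", 1, 5)    # only one vote per term
```

Compare this with a MultiPaxos leader change, where the new leader must also recover possibly-chosen values from phase 1 responses; Raft pushes that work into the log-comparison check above.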
[+] stelfer|6 years ago|reply
We have theorems that say that's impossible, so I guess the answer is no? Or maybe Betteridge's law?
[+] blamestross|6 years ago|reply
I haven't found a problem well suited to either as a solution. Either the performance constraints are too tight for industrial applications or you are in a P2P space and they are both predicated on nodes not being hostile so you can't use them.

In practice it just seems most efficient to be tolerant of consensus failures and focus on cAP.

[+] dilyevsky|6 years ago|reply
A lot of real production systems (Spanner, CockroachDB, TiDB) take the opposite approach: sacrifice availability for consistency. Scaling constraints are usually solved via sharding (running multiple Raft/Paxos FSMs per dataset).
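The sharding idea in miniature: each key deterministically maps to one of several independent consensus groups, so no single Raft/Paxos log becomes a bottleneck. The group count and hashing scheme here are illustrative:

```python
# Route each key to one of N independent Raft/Paxos groups.

import hashlib

def group_for(key, n_groups):
    """Deterministically assign a key to a consensus group."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % n_groups

N_GROUPS = 4
# Same key always lands in the same group, so per-key operations stay
# linearizable within that group's log.
assert group_for("user:42", N_GROUPS) == group_for("user:42", N_GROUPS)
assert 0 <= group_for("order:7", N_GROUPS) < N_GROUPS
```

Real systems typically shard by key range rather than hash (to keep scans local) and rebalance groups dynamically, but the principle is the same.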