

nano_o | 6 years ago

Great idea and great work!

A couple of nitpicks: it would be nice to see what happens when the leader fails. Optimizing for the case of a stable leader might have an impact on recovery time.

Another important aspect of fault tolerance is whether you can really survive any minority crashing. For example, if only the strictly necessary number of nodes (a bare majority) keeps up with the leader and most of those crash, the system will have a really hard time recovering: the slow nodes have accumulated a backlog and now need to catch up before the system can continue operating.
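To make the backlog argument concrete, here is a rough sketch in Python. It is not taken from the paper. It assumes a simplified model in which a node must hold the full log prefix before it can count toward a quorum for new entries, and the node names, log lengths, and the helper `catch_up_backlog` are all hypothetical.

```python
from typing import Dict, Set


def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority quorum of n."""
    return n // 2 + 1


def catch_up_backlog(log_lengths: Dict[str, int], crashed: Set[str]) -> int:
    """Entries the survivors must replay before a full quorum is caught up.

    Assumption (simplified model): a node only counts toward the quorum
    once its log matches the longest surviving log.
    Returns -1 if fewer than a majority of nodes survive.
    """
    quorum = majority(len(log_lengths))
    # Log lengths of surviving nodes, most up to date first.
    survivors = sorted(
        (length for node, length in log_lengths.items() if node not in crashed),
        reverse=True,
    )
    if len(survivors) < quorum:
        return -1
    target = survivors[0]
    # Only the `quorum` most up-to-date survivors need to reach the target.
    return sum(target - length for length in survivors[:quorum])


# Hypothetical 5-node cluster: only a bare majority keeps up with the leader.
logs = {"leader": 1000, "a": 1000, "b": 1000, "c": 200, "d": 150}

print(catch_up_backlog(logs, {"c", "d"}))  # slow nodes crash: 0
print(catch_up_backlog(logs, {"a", "b"}))  # fast followers crash: 1650
```

Both failure cases crash a minority of two nodes, but only the second one forces the slow nodes to replay their backlog before the cluster can make progress again. A benchmark that only exercises the first case would not show that cost.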

A performance number that does not take those things into account may not be very realistic. Nevertheless the idea is pretty good.


tptacek|6 years ago

Doesn't Multi-Paxos already have stable leaders? My understanding was that the innovation here was to relay prepare/promise/accept/accepted across a random relay network.

nano_o|6 years ago

Yes, it's a nitpick. The comparison to Multi-Paxos seems fair because it makes similar assumptions (unless re-configuring the relay network after a leader failure is somehow difficult, but I wouldn't expect that).

My point is that it would be nice if benchmarks took into account the issues I brought up and measured what happens in the worst failure scenarios the protocols are supposed to tolerate. Otherwise we get a false sense of what performance can be achieved if one really cares about fault tolerance.

This small issue does not diminish the main contribution of the paper in any way.