top | item 23002382

(no title)

mpercy | 5 years ago

From experience working in this area, I believe there's a significant tradeoff between performance, flexibility, and time to delivery when it comes to consensus and the things it's used for, like database replication. It's like: "good, fast, or cheap, pick two".

As one of the core authors of the Apache Kudu Raft implementation <http://kudu.apache.org/> (which is written in C++) I know that we tried to design it to be a pretty standalone subsystem, but didn't try to actually provide a libraft per se. We wanted to reuse the Raft write-ahead log as the database write-ahead log (as a performance optimization) which is one reason that making the log API completely generic eluded us a little.

That said, I'm currently at Facebook helping to adapt that implementation for another database. We are trying to make it database agnostic, and we continue to find cases where we need some extra metadata from the storage engine, new kinds of callbacks, or hacks to deal with various cases that just work differently than the Kudu storage engine. It would likely take anybody several real world integrations to get the APIs right (I'm hopeful that we eventually will :)

discuss

matthewaveryusa|5 years ago

I created a toy raft implementation in typescript, mostly to learn typescript. my interface for the datastore is actually pretty small:

https://github.com/matthewaveryusa/raft.ts/blob/master/src/i...

What kind of metadata do you need exactly from the datastore?

mpercy|5 years ago

Sure, here are a couple examples of things we've had to include in the log APIs:

1. Kudu uses a special "commit" record that goes into the log for crash consistency, related to storage engine buffer flushes. So we need an API to write those into the log. They don't have a term and an index, since they are a local-engine thing, so they have to be skipped when replicating data to other nodes in the case of the current node being the leader. If we were not sharing the log with the engine, we wouldn't need this.

2. Another database I'm working with requires file format information to be written at the top of every log segment, and it has to match the version of the log events following it. That info has to be communicated to the follower up-front even when the follower resumes replicating from the middle of the leader's log. So we need plugin callbacks on both sides to handle this, in terms of packing this as the leader and unpacking it as a follower into the wire protocol metadata.

Requirements like these will come up and either you hack around them by making some kind of out-of-band call (not ideal for multiple reasons) or you bake the capability into the plugin APIs and the communication protocol.

Frankly, designing generic APIs is also one of the less sexy aspects to consider because we spend so much of our time dreaming about and building all the cool distributed systems capabilities like leader elections, dynamic membership changes, flexible quorums, proxying/forwarding messages, rack/region awareness, etc etc etc. :)

The details of long-tail stuff like this is often hammered out as it comes up during implementation.

mpercy|5 years ago

Since I mentioned performance, one other area that makes flexibility nontrivial is how you decide to serialize different types of messages on the wire and on disk. If you don't need extensibility, it's easy to keep things pretty efficient just by using e.g. gRPC and protobuf out of the box. If you want complete flexibility, the simplest thing to do is to give your plugin interfaces blobs to write into and you end up double-serializing everything.