(no title)
danielmewes | 10 years ago
The wall-clock time comes from the server that processes that query.
Whenever the epoch timestamp changes, replicas will get a fresh set of Raft member IDs, and it's expected that they start with an empty Raft log.
Where exactly the epoch timestamps come from is not really relevant to this bug. With the bug fixed, each node will only accept multi_table_manager_t actions whose epoch timestamp is strictly larger than the one it currently has. That is enough to guarantee that a node never goes back to a previous configuration, and never rejoins a Raft cluster with its old member ID but a wiped Raft log.
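To make the acceptance rule concrete, here is a rough sketch in Python. This is not RethinkDB's actual code: the `Node` class, its fields, and `maybe_apply` are hypothetical names made up for illustration. The only part taken from the description above is the rule itself: accept an action only if its epoch timestamp is strictly larger, and start with fresh member IDs and an empty Raft log when the epoch changes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one server's per-table state."""
    epoch_timestamp: int
    raft_member_id: str
    raft_log: list

    def maybe_apply(self, action_timestamp, new_member_id):
        """Apply an action only if its epoch timestamp is strictly newer."""
        # Equal or older timestamps are rejected, so the node can never
        # return to a previous configuration.
        if action_timestamp <= self.epoch_timestamp:
            return False
        # A new epoch means a fresh member ID and an empty Raft log,
        # so the old ID is never paired with a wiped log.
        self.epoch_timestamp = action_timestamp
        self.raft_member_id = new_member_id
        self.raft_log = []
        return True
```

Because the comparison is strict, replaying an action for the epoch the node is already in is a no-op rather than a reset.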
codemac|10 years ago
EDIT: Or is it that it's not required to show forward progress? Still reading the rethinkdb source & docs, thanks for the information so far.
timmaxw|10 years ago
However, a Raft cluster will get stuck if half or more of the members are permanently lost. RethinkDB offers a manual recovery mechanism called "emergency repair" in this case. When the administrator executes an emergency repair, RethinkDB discards the old Raft cluster and starts a completely new Raft cluster, with an empty log and so on. However, some servers might not find out about the emergency repair operation immediately. So we would end up with some servers that were in the new Raft cluster and some that were still in the old Raft cluster. We want the ones in the old Raft cluster to discard their old Raft metadata and join the new Raft cluster.
The process of having those servers join the new Raft cluster is managed using the epoch_t struct. An epoch_t is a unique identifier for a Raft cluster. It contains a wall-clock timestamp and a random UUID. When the user runs an emergency repair, the new epoch's timestamp is set to max(current_time(), prev_epoch.timestamp+1), so it is always strictly larger than the previous one even if the clock has gone backwards. When two servers detect that they are in different Raft clusters, the one that has the lower epoch timestamp discards its data and joins the other one's Raft cluster. The UUID is used for tiebreaking in the unlikely event that the timestamps are the same.
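A small Python sketch of that scheme, for illustration only. The real epoch_t is a C++ struct; `Epoch`, `next_epoch`, and `winner` here are hypothetical names, and the direction of the UUID tiebreak (larger UUID wins) is an assumption, since all that matters is that both servers pick the same side.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Epoch:
    """Hypothetical model of epoch_t: wall-clock timestamp plus random UUID."""
    timestamp: int
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def sort_key(self):
        # Timestamp decides first; the UUID only breaks exact ties.
        return (self.timestamp, self.id.int)


def next_epoch(prev, now=None):
    """Epoch created by an emergency repair: max(current_time, prev + 1)."""
    current = int(time.time()) if now is None else now
    return Epoch(timestamp=max(current, prev.timestamp + 1))


def winner(a, b):
    """Return the epoch whose cluster survives; the other side discards its data."""
    # Assumption: on a timestamp tie, the larger UUID wins.
    return a if a.sort_key() >= b.sort_key() else b
```

Note that `next_epoch` still moves forward when the local clock is behind the previous epoch, which is why a skewed clock cannot make the repaired cluster lose to the one it replaced.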
So the clock isn't being used as a trusted source of truth; it's being used as part of a quick-and-dirty emergency repair mechanism for the case where the Raft cluster is permanently broken. The emergency repair mechanism isn't guaranteed to preserve any consistency invariants (as the documentation clearly warns).