top | item 38809402

(no title)

hlship | 2 years ago

In such a system, how does a reader find the root node? I'd be concerned about a) readers seeing a partial write to disk (or page buffer) during an update or b) a crash while the writer writes to disk.

I could imagine using two files, one containing the actual b-tree, the second containing the offset to the latest root note; the second file gets overwritten only after a successful write is verifiably written to disk.

Datomic's (https://www.datomic.com/) architecture is similar to this, but uses many write-once "segments" (which could be files in EFS or S3, or rows in DynamoDB or other stores).

discuss

ww520|2 years ago

The pointer to the root tree node is stored in the last committed metadata page. A read transaction starts with the reading of the metadata page. Reaching the partial written pages is impossible as the writer has not committed the latest metadata page yet. The transaction is committed when the latest metadata page is written as the last step.

hyc_symas|2 years ago

A read txn starts with reading the last page of the file and searching backward for a valid metapage. This can be time-consuming if the previous use crashed while a large write was in progress. Yet another reason LMDB doesn't use append-only design.

twoodfin|2 years ago

There’s a “canonical” pointer to the root node which is updated atomically on every append.

For an in-memory database, a CAS is adequate. Persistent stores ultimately need some kind of file-level locking.

If you look at the Apache Iceberg spec, you get a good idea of how this works: The only “mutability” in that universe is the root table pointer in the catalog.