top | item 40853015

awitt | 1 year ago

The original purpose of an LSM-tree is to take I/O off the critical path of a write (there are other reasons to use them though, for example reducing space amplification).

I would argue that by definition an LSM-tree buffers committed writes in memory, and that means you need a WAL for recovery.

If you are going to immediately flush the memtable then I/O is on the critical path anyway. And if you have fine-grained updates you'll end up with lots of small files, which seems like a bad thing. It could be reasonable if you only receive batch updates.
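The memtable-plus-WAL shape being argued for can be sketched in a few lines. This is a toy model with invented names (MiniLSM, flush_threshold), not any real engine: sequential appends to the WAL are the only I/O on the write path, and the WAL exists purely so the buffered memtable survives a crash.

```python
# Toy LSM sketch (hypothetical names): WAL append, in-memory memtable,
# periodic flush to an immutable sorted run.
import json

class MiniLSM:
    def __init__(self, wal_path, flush_threshold=4):
        self.wal_path = wal_path
        self.flush_threshold = flush_threshold
        self.memtable = {}   # buffered committed writes (lost on crash)
        self.sstables = []   # immutable sorted runs (here: in-process dicts)

    def put(self, key, value):
        # Sequential I/O on the critical path: append to the WAL first...
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps([key, value]) + "\n")
        # ...then buffer in memory; random-I/O work is deferred to flush.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as one sorted run, then truncate the WAL,
        # since its entries are now durable in the run.
        if not self.memtable:
            return
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        open(self.wal_path, "w").close()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            if key in run:
                return run[key]
        return None

    def recover(self):
        # After a crash, replay the WAL to rebuild the lost memtable.
        with open(self.wal_path) as wal:
            for line in wal:
                key, value = json.loads(line)
                self.memtable[key] = value
```

Dropping the WAL from this sketch while keeping the memtable is exactly the window the comment above describes: any write buffered but not yet flushed is gone on crash.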

7e|1 year ago

Any durable commit is going to have I/O in the critical path unless you're Paxos/Raft replicating in-memory across failure domains (which we're not discussing here), but I think you mean it takes random I/O out of the critical path. You can get that without a WAL, though: just have the LSM keep appending out of order to a growing file and keep the in-memory index. That's the exact same I/O pattern the WAL would generate, there just isn't an immediate compaction. The in-memory index will stay fragmented for longer, though (which is why I call the WAL a performance optimization above). I suppose the WAL-less design lets you defer compaction for longer, which might be an advantage if you have lots of disk and lots of RAM but don't want two-thirds of your throughput (read + write) taken away at critical moments.
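The WAL-less variant described here, append records out of order to one growing file and keep only an in-memory offset index, can be sketched like so (hypothetical names; a real store would add checksums and framing). Recovery rebuilds the index by scanning the file, since the file itself is the only durable state:

```python
# Sketch of an append-only store with a RAM-only index: the same
# sequential-append I/O a WAL would generate, but the appended file
# *is* the data file, so there is no separate log to replay.
import json

class AppendOnlyStore:
    def __init__(self, path):
        self.path = path
        self.index = {}              # key -> byte offset of latest record
        self.f = open(path, "a+")    # append-mode writes, seekable reads

    def put(self, key, value):
        self.f.seek(0, 2)            # position at end to record the offset
        offset = self.f.tell()
        self.f.write(json.dumps([key, value]) + "\n")
        self.f.flush()
        self.index[key] = offset     # latest write wins in the index

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        return json.loads(self.f.readline())[1]

    def recover(self):
        # Losing RAM loses only the index; rebuild it by scanning the file.
        self.index = {}
        self.f.seek(0)
        while True:
            offset = self.f.tell()
            line = self.f.readline()
            if not line:
                break
            self.index[json.loads(line)[0]] = offset
```

The deferred-compaction trade-off is visible here: superseded records stay in the file until some later rewrite, so the file and the recovery scan both keep growing.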

awitt|1 year ago

I actually think you're describing a WAL.

hodgesrm|1 year ago

> I would argue that by definition an LSM-tree buffers committed writes in memory, and that means you need a WAL for recovery.

This is true, but note that the WAL does not need to be in the database. You can use an event stream like Kafka and replay blocks of events after a failure. ClickHouse has a feature to deduplicate blocks it has seen before, even if they land on a separate server in a cluster. You still need to store checksums of the previously seen blocks, which is what ClickHouse does. It does put the onus on users to regenerate blocks byte-identically, but the overhead is far lower.
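The replay-with-dedup idea can be sketched as follows. The names (DedupSink, apply_block) are mine, not ClickHouse's API; the point is only that storing a checksum per applied block makes replay from the upstream stream idempotent:

```python
# Hedged sketch: a sink that records a checksum of every block it has
# applied, so blocks re-delivered during replay are applied exactly once.
import hashlib, json

class DedupSink:
    def __init__(self):
        self.seen = set()   # checksums of blocks already applied
        self.rows = []

    def apply_block(self, block):
        # Dedup only works if a regenerated block hashes identically,
        # hence a canonical serialization before hashing.
        digest = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()
        ).hexdigest()
        if digest in self.seen:
            return False    # duplicate block: skip it
        self.rows.extend(block)
        self.seen.add(digest)
        return True
```

This is also why the burden falls on users: if a producer regenerates a "same" block with rows in a different order or values formatted differently, its checksum changes and the block is applied twice.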

valyala|1 year ago

Batched inserts to LSM-based databases aren't so uncommon. For example, ClickHouse expects batched writes - https://clickhouse.com/docs/en/optimize/bulk-inserts . It also supports asynchronous inserts, with in-memory buffering of recently received data - https://clickhouse.com/docs/en/cloud/bestpractices/asynchron...
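The batching pattern that bulk-insert guidance implies can be shown with a tiny client-side helper. This is a generic sketch with invented names (InsertBuffer, max_rows), not a ClickHouse client API: rows accumulate in memory and reach the database as one large insert per batch.

```python
# Hypothetical client-side batcher: accumulate rows, hand them to the
# database as one bulk INSERT once the batch is full.
class InsertBuffer:
    def __init__(self, send, max_rows=1000):
        self.send = send          # callable performing one bulk insert
        self.max_rows = max_rows
        self.pending = []

    def add(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(self.pending)
            self.pending = []     # rebind; the sent batch is untouched
```

The trade-off mirrors the WAL discussion above: rows sitting in `pending` are not durable, which is exactly the window a server-side async-insert buffer (or a WAL) has to cover.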