I get excited every time I see a paper from Badrish. I've learned so much about high-performance software from studying systems he has worked on (not to diminish his many co-authors and contributors).
No multithreaded write benchmarks. That's a major omission, given that's where you'll see the biggest difference between b-trees and LSM trees.
The paper also talks about the overhead of the mapping table for node lookups, and says "Bf-Tree by default pins the inner nodes in memory and uses direct pointer addresses to reference them. This allows a simpler inner node implementation, efficient node access, and reduced contention on the mapping table".
But you don't have to pin nodes in memory to use direct pointer lookups. Reserve a field in your key/value pair for a direct in-memory pointer, and after chasing it check that you got the node you expect; only fall back to the mapping table (i.e. hash table of cached nodes) if the pointer is uninitialized or you don't get the node you expect.
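A minimal sketch of that validate-then-fall-back scheme (all names are hypothetical, and an arena index stands in for a raw in-memory pointer):

```rust
use std::collections::HashMap;

// Nodes carry the logical page id they currently hold, so a stale direct
// pointer can be detected after the chase.
struct Node { page_id: u64, payload: &'static str }

struct Cache {
    arena: Vec<Node>,             // in-memory node storage
    mapping: HashMap<u64, usize>, // the mapping table: page id -> arena slot
}

impl Cache {
    // Fast path: chase the cached direct "pointer" and validate the node.
    // Slow path: consult the mapping table only when the hint is missing
    // or the node there no longer holds the expected page.
    fn lookup(&self, page_id: u64, hint: Option<usize>) -> Option<&Node> {
        if let Some(slot) = hint {
            if let Some(node) = self.arena.get(slot) {
                if node.page_id == page_id {
                    return Some(node); // no mapping-table contention
                }
            }
        }
        self.mapping.get(&page_id).map(|&slot| &self.arena[slot])
    }
}
```

In a real tree the slow path would also refresh the stored hint so the next lookup takes the fast path again.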
"For write, conventional B-Tree performs the worst, as a single
record update would incur a full page write, as evidenced by the
highest disk IO per operation."
Only with a random distribution, but early on the paper talks about benchmarking with a Zipf distribution. Err?
The benchmark does look like a purely random distribution, which is not terribly realistic for most use cases. The line about "a single record update incurring a full page write" also ignores the effect of cache size vs. working-set size, which is a rather important detail. I can't say I trust the benchmark numbers.
Prefix compression - nice to see this popping up.
"hybrid latching" - basically, they're doing what the Linux kernel calls seqlocks for interior nodes. This is smart, but given that their b-tree implementation doesn't use it, you shouldn't trust the b-tree benchmarks.
However, I found that approach problematic - it's basically software transactional memory, with all the complexity that implies, and it bleeds out into too much of the rest of your b-tree code. Using a different type of lock for interior nodes where read locks only use percpu counters gives the same performance (read locks touch no cachelines shared by other CPUs) for much lower complexity.
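A toy model of that alternative, with plain integers standing in for real percpu counters (no atomics or barriers, which the real thing needs):

```rust
// Each reader touches only its own per-CPU shard, so read locks never write
// a cacheline shared with other CPUs; a writer must see every shard at zero.
struct ShardedRwLock { readers: Vec<i64> } // one counter per CPU

impl ShardedRwLock {
    fn new(shards: usize) -> Self { Self { readers: vec![0; shards] } }
    fn read_lock(&mut self, cpu: usize) { self.readers[cpu] += 1; }
    fn read_unlock(&mut self, cpu: usize) { self.readers[cpu] -= 1; }
    // Writers sum all shards; zero means no reader holds the lock.
    fn try_write(&self) -> bool { self.readers.iter().sum::<i64>() == 0 }
}
```

The write path pays for summing the shards, but interior-node writes are rare compared to reads, which is the whole point of the trade.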
Not entirely state of the art, and I see a lot of focus on optimizations that likely wouldn't survive in a larger system, but it does look like a real improvement over LSM trees.
Sure, but in principle, looking at the paper, I'd expect it to outperform B-trees, since write amplification is generally reduced. Are you thinking about cases requiring ordering of writes to a given record (lock contention)?
I think a fair comparison would be against a whitepaper? Clearly this is an exploratory venture and not production-grade software.
You managed to clone the repo and run your test yourself; whatever the outcome is, this is a plus compared to the standard model for scientific research.
Also, it's a breath of fresh air among all the Show HN threads using hundreds of adjectives to describe the "behavior" of a fully vibed system. I think this is a good model for presenting engineering projects.
Interesting. I've been working on something on and off, though it's an append-only hash index (based on Erlang's Bitcask): https://git.sr.ht/~tombert/feocask
I haven't done any benchmarks on it yet so I have no idea how well it performs, but I have a very strict "no explicit lock" policy, so everything is handled with separate files with a reconciliation service that uses a vector clock to determine ordering.
NOTE: A lot of this is AI-assisted, treat what I have written as more of a whiteboard proof of concept. I'm in the process of rewriting it by hand to do some more aggressive optimizations.
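The vector-clock ordering step described above might look roughly like this (illustrative sketch, not code from feocask):

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

// Map from writer id to that writer's event count.
type VClock = HashMap<String, u64>;

// Some(ordering) when one clock causally dominates the other; None when the
// two are concurrent and the reconciliation service must resolve a conflict.
fn compare(a: &VClock, b: &VClock) -> Option<Ordering> {
    let (mut a_gt, mut b_gt) = (false, false);
    for k in a.keys().chain(b.keys()) {
        let av = *a.get(k).unwrap_or(&0);
        let bv = *b.get(k).unwrap_or(&0);
        if av > bv { a_gt = true; }
        if bv > av { b_gt = true; }
    }
    match (a_gt, b_gt) {
        (true, true) => None, // concurrent writes: needs reconciliation
        (true, false) => Some(Ordering::Greater),
        (false, true) => Some(Ordering::Less),
        (false, false) => Some(Ordering::Equal),
    }
}
```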
Why not add an LSM memtable on top of the cow b+ tree?
Use the skiplist as a write buffer and write to the b+ tree in batches when the skiplist is frozen.
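A minimal single-threaded sketch of that idea, with `BTreeMap` standing in for both the skiplist buffer and the CoW B+ tree:

```rust
use std::collections::BTreeMap;

// Writes land in an ordered in-memory buffer and are merged into the tree
// in one sorted batch when the buffer reaches its freeze threshold.
struct Store {
    memtable: BTreeMap<String, String>, // write buffer ("memtable")
    tree: BTreeMap<String, String>,     // stand-in for the CoW B+ tree
    limit: usize,                       // freeze threshold
}

impl Store {
    fn put(&mut self, k: String, v: String) {
        self.memtable.insert(k, v);
        if self.memtable.len() >= self.limit {
            // Freeze and batch-apply: one sorted pass amortizes page writes.
            let frozen = std::mem::take(&mut self.memtable);
            self.tree.extend(frozen);
        }
    }

    fn get(&self, k: &str) -> Option<&String> {
        // Reads check the buffer first, then the tree.
        self.memtable.get(k).or_else(|| self.tree.get(k))
    }
}
```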
Bf-Tree solves one non-obvious pain point: caching when your data set is random (the key is a hash) and records are smaller than the page size. LSM reads are done at block granularity, and so is caching. So if your record is 8 bytes, you end up caching the remaining ~4 KB of the block, and the hit rate will be pretty low.
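The arithmetic behind that hit-rate point, with illustrative numbers (4 KiB blocks, 8-byte records, 1 GiB of cache):

```rust
fn main() {
    let block = 4096u64;    // cache/read granularity
    let record = 8u64;      // actual hot data per block
    let cache = 1u64 << 30; // 1 GiB of cache

    // Record-granular caching could keep this many hot records resident...
    println!("record-granular: {}", cache / record); // 134217728
    // ...but with one hot record per randomly-keyed block, block-granular
    // caching keeps only this many, wasting the rest of each block:
    println!("block-granular:  {}", cache / block); // 262144
    println!("useful bytes per cached block: {:.2}%",
             100.0 * record as f64 / block as f64); // 0.20%
}
```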
algorithmsRcool|1 month ago
Some of his other projects:
[0] https://github.com/microsoft/garnet [1] https://github.com/microsoft/FASTER [2] https://github.com/microsoft/Trill
reactordev|1 month ago
https://github.com/dotnet/orleans
Retr0id|1 month ago
> The ‘f’ in Bf-Tree stands for “faster”
pgwhalen|1 month ago
I'll admit that extra hop was a little confusing to me, though. I guess people just like GitHub and don't like PDFs.