top | item 28008541

B-tree Path Hints

282 points | eatonphil | 4 years ago | github.com | reply

62 comments

[+] jvanderbot|4 years ago|reply
In the mid 80s to 90s, there was a lot of interest around "query responsive" data structures. There still is, I'm sure, but that was what I studied and I've moved on since then.

Of these, the Fibonacci and pairing heaps were the best-known products, I reckon.

They essentially move "Active" parts of the data structure upwards near the top of the tree. The use case is different (being heaps), but I wouldn't be surprised if that line of R&D has continued, or if there's a huge gap in performance that can be reclaimed with "intelligent" guesses like this one.

I'm rambling and conjecturing, but maybe someday branch predictors will get a huge boost to their already tangible smarts and we'll see another explosion in practical CPU throughput, combined with a little help from algs and data structures, of course.

[+] nmwnmw|4 years ago|reply
My personal favorite of these is the Splay Tree[1]. I've never used it in production, but it's really simple to implement and reason about. Though my main experience was trying (and failing) to prove amortized bounds on the operations.

[1] https://en.wikipedia.org/wiki/Splay_tree

[+] scottlamb|4 years ago|reply
Does that mean any access to these data structures needs an exclusive lock? One advantage to the "path hints" approach described in the article is that the path hints can be stored separately from the btree itself. The author writes:

> For single-threaded programs, it’s possible to use one shared path hint per B-tree for the life of the program.

> For multi-threaded programs, I find it best to use one path hint per B-tree, per thread.

> For server-client programs, one path hint per B-tree, per client should suffice.

Seems likely multiple threads/clients would have different access patterns anyway, so the hints might actually be better this way, in addition to being more concurrent.

edit: also, this approach feels a lot lighter-weight to me than mutating the tree: I can imagine the mutations involving frequent allocation/deallocation where this hint is just a vector that grows to the maximum depth and stays there.
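The per-node half of the technique can be sketched roughly like this (a minimal sketch with hypothetical names, simplified from the article's C library, not its actual code): the hint stores one guessed index per tree level, the search tries the guessed slot first, and a miss falls back to a normal binary search that also corrects the hint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: a hint remembers the index found at each depth on the previous
// search. If the hinted slot still brackets the key we skip the binary
// search entirely; otherwise we search normally and correct the hint.
struct PathHint {
    // One guessed index per tree level; a real B-tree would size this to
    // the maximum depth. Only depth 0 is exercised for a single node.
    uint8_t index[8] = {0};
};

// Returns the position of the first key >= target in `keys`, consulting
// and updating hint->index[depth].
static std::size_t hinted_search(const std::vector<int>& keys, int target,
                                 PathHint* hint, int depth) {
    std::size_t n = keys.size();
    std::size_t i = hint->index[depth];
    if (i < n) {
        bool low_ok  = (i == 0) || keys[i - 1] < target;
        bool high_ok = target <= keys[i];
        if (low_ok && high_ok) return i;  // hint was right: O(1), no search
    }
    // Hint missed: ordinary binary search, then store the corrected index.
    std::size_t lo = 0, hi = n;
    while (lo < hi) {
        std::size_t mid = (lo + hi) / 2;
        if (keys[mid] < target) lo = mid + 1; else hi = mid;
    }
    hint->index[depth] = static_cast<uint8_t>(lo);
    return lo;
}
```

Because `PathHint` holds only small integers and no pointers into the tree, a stale hint is merely a slow path, never a correctness problem, which is what makes the per-thread/per-client sharing above safe.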

[+] gopalv|4 years ago|reply
> we'll see another explosion in practical CPU throughput, combined with a little help from algs and data structures

The learned index[1] is basically a variation of this sort of "hint" which relies on getting the right answer most of the time, but from learning the distribution function as a model instead of relying on the programmer to heuristic it.

(from that paper)

> The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records

I haven't seen anything mix this sort of work with an Eytzinger traversal, but there are always diminishing returns for these.

[1] - https://arxiv.org/abs/1712.01208

[+] ccleve|4 years ago|reply
It's a cool idea, but you have to keep in mind that binary searches are not optimal for b-tree nodes unless the node is really big. If you have a smaller node, < 32 or 64 elements, a simple linear search is faster. The reason has something to do with processor cache, branch mis-prediction, and pipelining.

A linear search can also be optimized to use SIMD, which is a big win.

If you really want to use a path hint, you could just say "start my linear search at the point where I left off last time", and if the first element is too big, wrap around to the beginning. But that might not buy you much.
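That wrap-around idea might look something like this (a hypothetical sketch of the comment's suggestion, not code from the article; all names are made up):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Resume a linear scan from where the previous search stopped, wrapping
// to the front if the resume point already overshoots the target.
// Returns the index of the first key >= target (n if none).
static std::size_t linear_search_from(const std::vector<int>& keys,
                                      int target, std::size_t* last) {
    std::size_t n = keys.size();
    // Only reuse the saved position if it hasn't passed the target.
    std::size_t i = (*last < n && keys[*last] <= target) ? *last : 0;
    while (i < n && keys[i] < target) ++i;
    *last = (i < n) ? i : 0;  // remember where we stopped for next time
    return i;
}
```

For ascending access patterns this degenerates to a short forward scan per lookup, which is the cheap case the comment is describing; for random access it's just a linear search with one extra comparison.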

[+] psykotic|4 years ago|reply
A fast binary search intended for B-tree key array searches does not have branches at all [1]. A B-tree interior node's key array is usually sized to fit in one cache line, so there isn't a difference in cache access pattern.

With 8-byte keys and the usual 64-byte cache line size, you end up with 8 keys per interior node's key array. That's a pretty bad range for the "simple linear search" most people would write, which is going to be dominated by the expected branch mispredict cost. The idea of using a hint to guess a starting position for either linear or binary search has the same issue. You need branches to capitalize on the best case vs worst case spread, and the mispredict cost will dominate for smaller arrays like this.

If you care about speed for this case, use a fast branchless binary search or a vectorized linear search (the latter will have an edge). An additional minor consideration for B-tree traversal is the latency of vector vs scalar loads since you don't know the load address (except for the root node) until you've finished searching the parent node's key array. That is not necessarily true for vectorized searches of small arrays in general, where the vector load addresses may be key independent (like the B-tree root node is here) and so in that case the few extra cycles of L1 load-to-use latency for vector loads (and higher load latencies in general) can often be hidden.

[1] By B-tree I am referring to in-memory B+ trees. A disk-based B+ tree would usually be built around page-granularity nodes. In that regime a branchless binary search for integer/float keys (with software prefetch to populate the cache ahead of the data dependency chains) that finishes off with a vectorized linear search is likely going to be the winner. The B-tree slack interval (M/2, M] helps since it guarantees a fully unrolled binary search with lg(M) steps won't waste any steps. You can still support "underflowing" B-tree nodes with the same branchless code path.
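One common branchless lower-bound formulation looks like this (a sketch, not the code behind the footnote): each step halves the remaining length and advances a base pointer through a conditional move rather than a taken/not-taken branch, so there is no data-dependent branch for the predictor to miss.

```cpp
#include <cassert>
#include <cstddef>

// Branchless lower bound: returns the index of the first key >= target.
// The only branch left is the loop counter, whose trip count depends on
// n, not on the data, so it predicts perfectly for a fixed node size.
static std::size_t branchless_lower_bound(const int* keys, std::size_t n,
                                          int target) {
    const int* base = keys;
    std::size_t len = n;
    while (len > 1) {
        std::size_t half = len / 2;
        // Compilers typically lower this ternary to a cmov, not a branch.
        base += (base[half - 1] < target) ? half : 0;
        len -= half;
    }
    return static_cast<std::size_t>(base - keys) +
           ((n > 0 && *base < target) ? 1 : 0);
}
```

For a fixed M (e.g. 8 keys per node, per the comment above) the loop can be fully unrolled to exactly lg(M) cmov steps, which is the unrolled variant the footnote alludes to.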

[+] scottlamb|4 years ago|reply
> It's a cool idea, but you have to keep in mind that binary searches are not optimal for b-tree nodes unless the node is really big. If you have a smaller node, < 32 or 64 elements, a simple linear search is faster. The reason has something to do with processor cache, branch mis-prediction, and pipelining.

I've heard that—and iirc Rust's BTreeMap type uses linear search—but does this depend on the type of key? Easy for me to see how that might be true for a BTreeMap<u64, V> but not a BTreeMap<String, V>. In the latter case, there's a pointer to chase for each comparison (and more/variable data to process; SIMD over multiple keys doesn't seem likely).

[+] nextaccountic|4 years ago|reply
Isn't the binary search here taken as the whole tree, spanning multiple nodes? Each time we follow a pointer we are discarding a huge swath of the tree.

In a single node, a linear search is probably better, yes.

[+] jleahy|4 years ago|reply
The something is data dependencies.
[+] ardit33|4 years ago|reply
I agree... the whole point of a B-tree is to already subdivide the dataset into smaller, manageable sets. A binary search won't be much help on those slices if the tree is constructed properly.
[+] mabbo|4 years ago|reply
I love little optimizations like this.

What I'm reading is that the node in the b-tree remembers the most recently found index during search, and if the next search happens to be for the same value or one very close, the binary search goes much faster. This only helps in situations where searches correlated in time tend to have nearby key values, but that's probably very common.

[+] ot|4 years ago|reply
The C++ STL has something similar, insert operations accept an iterator as hint:

https://en.cppreference.com/w/cpp/container/map/insert

In common implementations iterators are just node pointers, which is enough because nodes have parent pointers, so you can recover the whole path.

In extreme cases like ordered insertion this effectively shaves off a log(n) factor.
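For ordered insertion that shaved log(n) looks like this (a small illustration; `ordered_fill` is a hypothetical helper, not part of any library):

```cpp
#include <cassert>
#include <map>

// For keys arriving in ascending order, end() is always the correct hint,
// so common implementations insert each element in amortized O(1) rather
// than O(log n).
static std::map<int, int> ordered_fill(int n) {
    std::map<int, int> m;
    for (int k = 0; k < n; ++k) {
        // emplace_hint is amortized constant when the new element belongs
        // immediately before the hint -- always true for ascending keys.
        m.emplace_hint(m.end(), k, 2 * k);
    }
    return m;
}
```

A wrong hint is harmless: the standard requires the element to end up in the right place regardless, the hint only affects the cost, much like the path hints in the article.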

[+] uyt|4 years ago|reply
I really like that the hint is optional.

You don't always want the hint because if you're jumping d steps ahead, you actually need to touch 2 log(d) nodes, once to walk up, and another to walk down. This means anytime you're moving more than d=sqrt(n) steps away you're better off just walking down from the root instead.

For example if n=1000000, if the hot section is wider than d=1000, you're better off going from root all the time. Of course if you access stuff that's directly adjacent you should still use in-order iteration for that (but I see that as a ++ operator, not a separate lookup).

[+] ntonozzi|4 years ago|reply
A similar idea is finger search (https://en.wikipedia.org/wiki/Finger_search), where you can turn a traversal of a tree from O(log(number of items in the tree)) into O(log(number of items between the finger and the key)).
[+] namibj|4 years ago|reply
A related thing: exponential search. It's the sorted-array equivalent.
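A sketch of that sorted-array equivalent (hypothetical names): double the probe distance until you overshoot the target, then binary search only the bracketed range, giving O(log d) in the distance d from the starting point rather than O(log n).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Exponential (galloping) search: returns the index of the first
// key >= target. Cheap when the target is near the front; run from a
// saved finger position instead of index 0 to get the finger-search
// behavior discussed above.
static std::size_t exponential_search(const std::vector<int>& keys,
                                      int target) {
    std::size_t n = keys.size();
    if (n == 0 || keys[0] >= target) return 0;
    std::size_t bound = 1;
    while (bound < n && keys[bound] < target) bound *= 2;
    // The answer lies in (bound/2, min(bound, n-1)]; binary search it.
    std::size_t lo = bound / 2, hi = std::min(bound + 1, n);
    while (lo < hi) {
        std::size_t mid = (lo + hi) / 2;
        if (keys[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;
}
```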
[+] uyt|4 years ago|reply
I was going to link the same thing except I would say he is talking about exactly that without knowing the academic name for it.
[+] bob1029|4 years ago|reply
I think you would find the splay tree provides similar properties, potentially in an even more elegant way.

For applications I have used it for, this technique results in the hottest data eventually being condensed into a smaller number of contiguous storage blocks.

The power of this technique is that it amortizes the optimization of the tree across all access (reads), not just inserts & deletions (or other special-purpose re-balancing operations).

[+] jsnell|4 years ago|reply
I don't think this is about hot-data optimization as such though, but about very short-term local clustering of keys.

I.e. it's not that a:b:c and a:b:d are globally accessed more often than x:y:z, but that after a:b:c is accessed it's more likely that the next access is a:b:d than x:y:z.

[+] foobiekr|4 years ago|reply
I love splay trees, skew heaps, etc. data structures that get very little love.

But I’ve actually tried to use splay trees in production code. The performance benefits versus a simple Patricia tree seemed totally absent in actual practice even for things like very frequent adjacent queries. The structure is neat but paying attention to memory layout and cache line size is far, far more important.

[+] vvern|4 years ago|reply
It feels to me like this is just a kludge added to deal with the lack of a stateful iterator on top of the tree. In this use case, the author indicates that it is about exploiting locality in a request path. Imagine that the tree offered an iterator that maintained a stack of its path and that thing had a seek method. You could optimize that thing to only go back up toward the root as far as it needs to. Stateful iterators make for a nice pairing with btrees. These other hints are papering over the lack of that abstraction as far as I can tell.
[+] anderskaseorg|4 years ago|reply
With a stateful iterator, you might need to worry about your cached pointers being invalidated by mutations from other callers.
[+] jeremyis|4 years ago|reply
I really enjoyed this post and hope I get more like it in the future:

- It gave a high level, concise background overview of the base concept (B-trees) complete with diagrams

- It talked from first principles about how things worked without really any jargon

- It explained simply a clever optimization

Thanks for posting!

[+] Symmetry|4 years ago|reply
I'm sort of surprised that the gains are so high. I'd tend to expect the cost of a pointer dereference in a large data structure to be bigger than the comparisons, even with the attendant branch mispredictions.

EDIT: Oh, duh, if the path hints are useful then cache locality means that the data you're looking for is already nearby. If the hint works the cost of the lookup will actually be dominated by the comparisons.

[+] srcreigh|4 years ago|reply
I'm not sure this is useful for a traditional database. There are lots of costs that are orders of magnitude more important.

This is a CPU optimization; in a DB, however, the major costs are 1) pulling blocks from disk and 2) transferring results over a network.

Even if you assume nodes are cached in memory, a B-tree node for a database probably fits in L1 cache. SQLite page size is 4096 bytes by default vs an 8KB+ L1 cache size. Wouldn't you pay 100x the cost just to load the next node from memory, if you have a large tree? In theory that's <1% faster overall.

I'm curious whether 3x improvement is overall, or just from the binary search. (according to the benchmark it is overall 2x faster - very curious).

Also curious the size of the tree in benchmarks and the spread factor (ie how many children each B-Tree node has). I couldn't find this info from the repo. This could affect the % of time you can find the next node already in a cache.

[+] srcreigh|4 years ago|reply
I found the answers - the tree uses 256 children by default, implying ~16kb node size on 64 bit architecture. The size of the tree in the benchmark is 1M nodes, so ~16GB tree. Pretty hefty tree! :)

It's the sequential insert/get benchmarks which benefit the most from path hints. You'd only have 1 hot path down the tree in this setup (the rightmost path). Is it possible the CPU is caching multiple non-contiguous blocks (ie the hot path)? That would explain why memory access doesn't have as much of a factor as I originally thought.

In other words - this technique may be useful in DBs! Disk and memory are both heavily cached in the CPU, so CPU optimizations can help for sequential/repeated local accesses.

[+] kazinator|4 years ago|reply
An interesting idea would be to imitate filesystems and provide an actual "current working node" concept to the application. The current working node is a hint which contains a pointer to a BTree node. This could be the leaf node where some item was previously found, or one node higher above the leaf level.

When the search for an item fails from the current working node, it falls back on earnestly searching from the root.

If a node is split or merged by insertions or deletions, then if any current working node pointers to it exist, they get invalidated or updated somehow. (The B-Tree object could carry a small dynamic set of current working node items which it can update when the tree mutates.)

[+] willvarfar|4 years ago|reply
Wow! Obvious in hindsight!

Do any databases use this or similar tricks?

[+] jlokier|4 years ago|reply
Yes. Even better, a good database will avoid traversing the blocks and path, when the key is close enough to the previous lookup's hint, making all but the first lookup in a nearly-consecutive sequence take O(1) instead of O(log N) where N is the size of the tree, and runs of n nearly-consecutive operations take O(n + log N) instead of O(n log N).
[+] warmfuzzykitten|4 years ago|reply
The "path hint" sounds a lot like a cursor. Since a lot of b-tree traversal is ordered, knowing the last position in the tree is often valuable information.
[+] vecter|4 years ago|reply
In case others are unclear since it doesn't seem to be explicitly stated in the docs, you update the path hint to be the path of the most recently found item. I think this is implied when he says "My path may be wrong, and if so please provide me with the correct path so I get it right the next time.", but you can see where `hint` is set in the implementation also.
[+] nomy99|4 years ago|reply
I think OP is using the wrong data structure, and his fix is bringing him closer to the ideal solution. You need a hashmap function that returns a search range to direct the binary search on the B-tree. This lets you add logic for how you want data to be searched in the future. In the base case it will work as the B-tree hint.
[+] kolbe|4 years ago|reply
I love this. In general, I've found you can do amazing things to speed up common algorithms by tailoring them to your use case. It usually feels hacky, but I've found it to be very effective.
[+] sroussey|4 years ago|reply
Back in the 1990s, I did something similar with SQL and it worked wonders.
[+] iratewizard|4 years ago|reply
I wonder how this would compare to a B-Tree where you've explicitly loaded large branches of the tree into cpu cache and perform multiple comparisons in a single cpu cycle.