Totally off topic, but I've started to notice that more and more algorithmic low-ish level content assumes Rust by default. My entire life I've been used to anything algorithmic being either written in sciency pseudocode (invariably in latex-generated PDFs), or plain old C(++), the vulgar latin of computer programming. I know C and I don't really know Rust but nevertheless I love that this is changing! I'd never advise a new programmer interested in low-level/fast stuff to begin with C now, so much good content is written around Rust. Just like with the old C examples, the post doesn't even say "this is Rust"; you're simply expected to know that it is.
And, well the language choice doesn't really matter for the core subject here. I could follow this post fine even though I don't really know Rust (just like I assume a Rust programmer could follow a well written algorithms deep dive with examples in C with a bit of dedication). EDIT okok that was a lie, I could follow the first half of this post, ish, and then it got way too deep for me.
But anyway, standardize on Rust, I like it. I'd love if a bit more Zig was thrown into the mix here but either of them feel like a better default than C at this point to me. After decades of C being The Standard for this stuff, I love that this is finally changing.
I don't know if it's good or bad: with pseudocode you put no constraints on how it should be implemented. It's known that some kinds of algorithms are really hard to implement in Rust (everyone uses the linked list data structure as an example). So having the article use Rust is good, to see that it can fit Rust's constraints, but at the same time, do those constraints limit the algorithm itself?
Quite honestly, Rust doesn't seem like a win here over C++. In fact, C++ has fewer sigils, which makes it somewhat easier to read, sort of striking a balance between simplicity (C-like) and abstraction (templates). Readability-wise, I would have preferred Julia, as that would have been the cleanest explanation of the code through the different topics at a sufficient level of detail. Alas, it seems to have stalled somewhat (relatively speaking). It also doesn't help that every Rust article has Rust advocates jumping on you with "WHAT ABOUT SAFETY", which is off-putting, since not every program needs to deal with all the complexity of the Rust memory/lifetime model to achieve safety (simple bounds checking etc. might be sufficient).
One dimension that is not explored is partitioning the queries in batches. The primary cost is doing lookups on the out-of-cache table, so if you have a sufficiently large amount of queries you can resolve a couple layers of the tree in one step while those layers are in cache, grouping them based on where they land deeper in the tree, and then resolving all those queries that touch the same deeper part of the tree in one batch as well.
In theory, with an infinitely large amount of input queries you will never have to do an out-of-cache lookup, it can be completely amortized away into linear scans.
That said, you now end up with a bunch of results that need to be put back into the correct output order, which likely makes it not worth it. But if the operation can be fused into a reduction (e.g. find the sum of the binary search results) or the output order does not matter in some other way then all of a sudden it might make sense.
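Roughly the shape I have in mind, as a sketch (mine, not from the post; the queries are pre-sorted and the results come out in sorted-query order):

```rust
// Answer a batch of lower_bound queries by recursing on (haystack slice, query slice)
// pairs, so each level of the haystack is touched once per *group* of queries instead
// of once per query. `queries` must be sorted; `base` is the index of hay[0] in the
// full array; results are lower_bound indices, emitted in sorted-query order.
fn batched_lower_bound(hay: &[u32], base: usize, queries: &[u32], out: &mut Vec<usize>) {
    if queries.is_empty() {
        return;
    }
    if hay.is_empty() {
        // Every remaining query resolves to the end of this (now empty) slice.
        out.extend(std::iter::repeat(base).take(queries.len()));
        return;
    }
    let mid = hay.len() / 2;
    let pivot = hay[mid];
    // Queries <= pivot continue in the left half (the answer may be `mid` itself),
    // the rest continue strictly to the right of `mid`.
    let split = queries.partition_point(|&q| q <= pivot);
    batched_lower_bound(&hay[..mid], base, &queries[..split], out);
    batched_lower_bound(&hay[mid + 1..], base + mid + 1, &queries[split..], out);
}
```

In practice you'd cut over to ordinary per-query search once a group gets small, but that's the basic idea: the top of the tree is shared across the whole batch.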
I implemented this method in Dyalog 18.0 with BlockQuicksort-like partitioning, using vectorized comparison with bit-boolean output. It's faster than you'd expect, maybe twice as slow as regular binary search when searched values are in cache, and better once they fall out of L2. But Dyalog reverted to 17.1 for later versions so you won't see it in a new download. It'll probably make it into CBQN eventually, perhaps with radix partitioning. Note that both quicksort and radix partitioning can be done and undone in a cache-friendly way.
Unlike quicksort, there's no issue of pivot selection since you always choose the midpoint of the searched values. However, there's a complementary issue of memory if the partitions become unbalanced, because the larger partition can require saved memory of roughly the depth times the number of elements. With a bit per comparison it's bounded by the size of the input.
There are theory papers on "buffer trees"---B-trees where each node is augmented with an O(B)-length array of pending updates and queries. I believe there were also some attempts at building SQL databases based on them. It sounds like you're reaching for the same trick.
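Roughly, a node in those papers looks something like this (my own hypothetical sketch, not taken from any of them):

```rust
// Hypothetical buffer-tree node: a B-tree node whose buffer collects O(B) pending
// operations, which are pushed down to the children in bulk once it fills up,
// amortizing the cost of touching cold nodes over many operations.
enum PendingOp {
    Insert(u32),
    Delete(u32),
    Query { key: u32, id: u64 }, // `id` says who gets the answer once it resolves
}

struct BufferTreeNode {
    keys: Vec<u32>,                     // usual B-tree separator keys
    children: Vec<Box<BufferTreeNode>>, // empty at the leaves
    buffer: Vec<PendingOp>,             // pending ops, flushed when it reaches O(B)
}
```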
Interesting stuff, definitely the kind of real-world optimization that happens when you're able to look at actual access characteristics rather than highly abstracted models.
At one level you could just be saving the results into a hashtable to bubble them out again, but at the extreme end the lookup is actually spread across multiple machines in a cluster, so it’s more like throwing the query to the proper one with the right chunk of the index, along with instructions about where the final RPC is supposed to go with the result.
There was some work a while back on streaming data past queries, but you need a fairly bounded data set for that to work. Having ten years of historical data in a data set would gum that up severely.
It enhances throughput (on average everyone waits less) at the price of a higher max latency (someone whose request hits a very-recently-out-of-cache part of the index will wait way, way longer...), doesn't it? In the real world those worst cases quite often kill such optimizations.
The (data size / cache size) ratio and the temporal dispersion of the queries are key.
Ah yes, I did consider sorting the queries at some point but I guess I then forgot about it again. If the queries are random and there are far fewer of them than input elements, probably most queries will hit different cache lines in the final layer, and I suspect benefits will be limited to maybe 2x best case, since the last layer is where the bottleneck is.
Also, (radix) sorting is usually very memory bound, and we probably need to sort into 16^2=256 buckets or so to get sufficient reuse of the higher layers, but I don't have the numbers on what a round of radix sort takes. (My guess is on the order of 1ns per query? Maybe I'll find time to investigate and add it to the post.)
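For concreteness, by one radix round I mean something like the sketch below (bucketing by the top byte only really works for roughly uniformly distributed keys):

```rust
// One radix pass over the queries: scatter them into 256 buckets by their top
// 8 bits, so queries headed for the same part of the tree are processed together.
fn bucket_by_top_byte(queries: &[u32]) -> Vec<Vec<u32>> {
    let mut buckets: Vec<Vec<u32>> = vec![Vec::new(); 256];
    for &q in queries {
        buckets[(q >> 24) as usize].push(q);
    }
    buckets
}
```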
Amazingly thorough! I love how the author leaves no stone unturned. I had no idea you could do this kind of low-level efficiency shaving in Rust.
I wonder how a C++ implementation with https://github.com/jfalcou/eve would compare.
Thanks!
It's somewhat tiring to not have loose ends, but I agree it pays off :)
Doing this stuff in Rust is absolutely possible, and I'd do it again since my C++ days are now past me, but the endless transmuting between portable-simd types, plain rust arrays, and intrinsic types is quite annoying.
Also, Rust is not made for convenient raw pointer arithmetic; there, plain C would be much more to the point.
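To illustrate, it's this kind of back-and-forth (a nightly-only toy of my own; `_mm256_or_si256` just stands in for some intrinsic without a portable equivalent, and AVX2 is assumed):

```rust
#![feature(portable_simd)] // nightly only
use std::simd::Simd;

#[cfg(target_arch = "x86_64")]
fn round_trip(v: Simd<u32, 8>) -> [u32; 8] {
    use std::arch::x86_64::{__m256i, _mm256_or_si256};
    // portable-simd vector -> vendor intrinsic type
    let raw: __m256i = unsafe { std::mem::transmute(v) };
    // ...call an intrinsic that portable-simd doesn't expose (stand-in here)...
    let r = unsafe { _mm256_or_si256(raw, raw) };
    // vendor type -> portable-simd vector -> plain array
    let back: Simd<u32, 8> = unsafe { std::mem::transmute(r) };
    back.to_array()
}
```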
There are stones still unturned. For typical Unicode table lookups you'd need batches for all the characters of a word. It would be much faster to look up e.g. 16 chars at once, to optimize cache misses.
But when I tested this, even Eytzinger was too slow.
Nifty thing about eytzinger trees is that there's a formula for converting from a node index to its position in an inorder traversal.
This is useful if your primary keystore is a normal sorted set of keys - easier to work with - you then don't need to store explicit pointers.
Also, he didn't mention that with eytzinger you can prefetch multiple loop iterations in advance - use 1-based indexing so that sibling nodes line up nicely on cachelines.
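Roughly like this (my own sketch of the well-known 1-based version, prefetching four levels ahead):

```rust
// Eytzinger lower_bound with 1-based indexing: tree[0] is unused and the children
// of node k are 2k and 2k+1, so the 16 consecutive u32s starting at 16k (one cache
// line) hold all of k's descendants four levels down, which the prefetch pulls in early.
fn eytzinger_lower_bound(tree: &[u32], x: u32) -> usize {
    let n = tree.len() - 1; // number of real elements; assumes tree.len() >= 1
    let mut k = 1;
    while k <= n {
        #[cfg(target_arch = "x86_64")]
        unsafe {
            use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
            _mm_prefetch::<{ _MM_HINT_T0 }>(tree.as_ptr().add((16 * k).min(n)).cast());
        }
        // Branchless descent: go right exactly when tree[k] < x.
        k = 2 * k + (tree[k] < x) as usize;
    }
    // The index fixup: strip the trailing 1-bits (right turns taken past the leaves)
    // plus one more bit to recover the Eytzinger index of the answer (0 if none).
    k >> (k.trailing_ones() + 1)
}
```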
You're right. While I wasn't very explicit about this (because algorithmica and the linked paper spend plenty of words on it already), the code snippet for Eytzinger does indeed do both the prefetching and the formula.
In fact, I'm pretty sure a similar formula could be made to work for higher branching factors, although it would surely be slower. (Probably it depends on the number of times 17 divides the final index or so, which is not great, but with B=15 it would be the number of factors of 16, which is easy again.)
The more annoying issue with not storing everything in the final layer is that we have to keep a running-answer for each query as we go down the tree, which I suspect will add measurable runtime overhead. But maybe that's a tradeoff one is willing to make to avoid the 6.25% overhead.
this is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.
it's interesting that the wins for batching start diminishing at 8. I'm curious then how the subsequent optimizations fare with batch size 8 (rather than 128).
smaller batch sizes are nice since it determines how much request throughput we'd need to saturate this system. at batch size 8, we need 1s / ~30ns * 8 = 266M searches per second to fully utilize this algorithm.
the multithreading results are also interesting -- going from 1 to 6 threads only improves overhead by 4x. curious how this fares on a much higher core count machine.
Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :)
I suspect that at higher core counts, we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains with additional cores will be very small.
That's good though, because that gives CPU time to work on the bigger problem to determine the right queries, and to deal with the outputs (as long as that is not too memory bound in itself, although it probably is).
The number of unique int32 values is not that large, and a full bitset would only be 512MB. Hard to beat a dense bitset in performance.
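For plain membership that's just bit indexing (sketch below); rank/lower_bound style queries would still need rank machinery on top.

```rust
// Dense bitset over all u32 values: 2^32 bits = 2^26 u64 words = 512 MiB.
struct U32Bitset {
    words: Vec<u64>,
}

impl U32Bitset {
    fn new() -> Self {
        Self { words: vec![0u64; 1 << 26] } // allocates the full 512 MiB up front
    }
    fn insert(&mut self, x: u32) {
        self.words[(x >> 6) as usize] |= 1u64 << (x & 63);
    }
    fn contains(&self, x: u32) -> bool {
        (self.words[(x >> 6) as usize] >> (x & 63)) & 1 != 0
    }
}
```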
As a general purpose data structure, I would look at roaring bitmaps for this purpose, which has good trade-offs with great performance for most use-cases.
If only simple lookups are needed over a static set, then it's worth looking at minimal perfect hash functions (https://sux.di.unimi.it).
Hmm, bitmaps is an interesting idea!
If the data is dense enough, then yeah I guess a quick linear scan would work.
There's also the extreme of simply storing the answer for each possible query as a u32 and just indexing the array, but there the memory overhead is much larger (2^32 entries of 4 bytes each is 16 GiB).
Rank-select is also interesting, but I doubt it comes anywhere close in performance.
I happen to also be working on a minimal perfect hashing project that has way higher throughput than other methods (mostly because batching), see https://curiouscoding.nl/posts/ptrhash-paper/ and the first post linked from there.
This is really interesting and a thorough write up. Thanks to the author for sharing their work.
Whenever I read about super low-level optimisation, my immediate feeling is that of gratitude to the author for spending so much time shaving off nanoseconds which the entire SE community gets to enjoy.
I wonder how much time humanity has collectively saved simply as a result of how all these optimisations stack up.
Assuming this is going to save someone 10ns per query in the end and I spent 150h on coding and writing this up, we only(?) need around 10^14 queries to make it worth it!
But yes, as new technologies stack up, we achieve exponential growth in throughput in the end, which usually enables new science :)
Interesting. If you need to support occasional writes, you could have a large static search tree and a small writable tree that you check afterwards. Then when you have enough changes, you could occasionally ship those changes into a new version of the static tree.
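Something like this, I imagine (hypothetical sketch, ignoring deletions; the sorted `static_keys` vector stands in for the big static search tree):

```rust
use std::collections::BTreeSet;

// A large, rarely rebuilt static side plus a small mutable overlay for recent writes.
struct HybridSet {
    static_keys: Vec<u32>, // sorted; imagine the static search tree here
    recent: BTreeSet<u32>, // small and cache-resident; absorbs writes
}

impl HybridSet {
    fn contains(&self, x: u32) -> bool {
        // Check the static side, then the small writable side.
        self.static_keys.binary_search(&x).is_ok() || self.recent.contains(&x)
    }

    fn insert(&mut self, x: u32) {
        self.recent.insert(x);
        if self.recent.len() > 1 << 20 {
            self.rebuild(); // ship the accumulated changes into a new static version
        }
    }

    fn rebuild(&mut self) {
        self.static_keys.extend(self.recent.iter().copied());
        self.static_keys.sort_unstable();
        self.static_keys.dedup();
        self.recent.clear();
    }
}
```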
Recently I tried to optimize a set implementation and found that interpolation search works quite well, being competitive with HashSet in C# (both single digit ns for my use case), with zero memory overhead. Unfortunately, it only works well with uniform data, so I guess I'll give static search trees a try. This explanation was clear and extremely well detailed.
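For reference, the textbook version meant here is roughly this (a sketch of my own, assuming roughly uniform u32 keys):

```rust
// Interpolation search over a sorted slice: instead of probing the middle, guess
// the position of `x` by linearly interpolating between a[lo] and a[hi].
fn interpolation_search(a: &[u32], x: u32) -> Option<usize> {
    let mut lo = 0usize;
    let mut hi = a.len().checked_sub(1)?;
    while lo <= hi && a[lo] <= x && x <= a[hi] {
        if a[hi] == a[lo] {
            // The whole remaining range is one value (a plateau).
            return (a[lo] == x).then_some(lo);
        }
        let num = (x - a[lo]) as u64 * (hi - lo) as u64;
        let den = (a[hi] - a[lo]) as u64;
        let mid = lo + (num / den) as usize;
        match a[mid].cmp(&x) {
            std::cmp::Ordering::Equal => return Some(mid),
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid.checked_sub(1)?,
        }
    }
    None
}
```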
At some point I was playing around with interpolation search on the human genome dataset, and it was really really terrible.
It had some very bad worst case behaviour when the input data has big plateaus and you're searching for something around the edges of the plateau. This can be fixed by only using interpolation for the first few iterations and doing binary search for as long as needed afterwards, but that kills a lot of the predictability of the current approach. So while I really like it in theory, in practice I'm not quite convinced it can be made to work efficiently.
Given the fact that the keys are fixed-size integers with a not-too-terrible distribution, I would consider chasing the partitioning idea a bit harder, and possibly even removing all the fancy S-tree parts. Some ideas:
For 32 bits, using the first few or even a lot (16? 20?) of the bits to index an array and then chasing down the remaining bits somehow (adaptively, depending on the size of that partition?) seems like it could work; a rough sketch is below.
In this direction, and with really impressive asymptotic performance, there’s the van Emde Boas tree and its variants.
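A rough sketch of the first idea (my own, with made-up names): an array of bucket offsets indexed by the top 16 bits, and a plain binary search inside the bucket.

```rust
// Partition the sorted keys by their top 16 bits: starts[p]..starts[p+1] is the
// range of keys whose top 16 bits equal p, so a lookup only searches that slice.
struct TopBitsIndex {
    keys: Vec<u32>,     // sorted input
    starts: Vec<usize>, // 2^16 + 1 offsets into `keys`
}

impl TopBitsIndex {
    fn new(mut keys: Vec<u32>) -> Self {
        keys.sort_unstable();
        let mut starts = vec![0usize; (1 << 16) + 1];
        for &k in &keys {
            starts[(k >> 16) as usize + 1] += 1; // count keys per top-16-bit bucket
        }
        for i in 1..starts.len() {
            starts[i] += starts[i - 1]; // prefix sum -> bucket start offsets
        }
        Self { keys, starts }
    }

    fn contains(&self, x: u32) -> bool {
        let p = (x >> 16) as usize;
        self.keys[self.starts[p]..self.starts[p + 1]].binary_search(&x).is_ok()
    }
}
```

How the remaining 16 bits get chased down could then be adapted per bucket, as suggested above.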
Nice post. Would the larger amount of code result in different performance in a scenario where other code is being run as well, or would the instruction cache be large enough to make this a non-issue?
Jup. I've now added to the future work section that the plan is to use this to speed up suffix array searching. Suffix arrays are pretty regularly used to build indices over, say, a human genome, that can then be queried.
And since DNA sequencers are still increasing their throughput and accuracy, it's always good to have faster algorithms as well.
I'm using a variant of static search trees in my search engine index. In this scenario the main bottleneck is block reads from disk, but the thinking is relatively similar to cache lines.
Use case is to query hundreds of gigabytes of data in tens of milliseconds.
Is binary search on such large integer data sets common? I guess maybe time series data. But I would think that this isn't amenable to representing strings?
Scientific stuff (scipy) and now all the ML stuff too
Walk the larger tree, using the smaller tree.
That's not terribly common.
https://github.com/google/highway
Edit: Also, 1.3 figure 1: Should the y-axis label be "Inverse throughput" to match later figures?
(Skiplists are also used by some search engines.)
You can replace string with any data structure you can hash or index really.
Then again, Eytzinger looks like binary heaps.