Amazingly thorough ! I love how the author leaves no stone unturned. I had no idea you could do the kind of low level efficiency shaving in Rust.
I wonder how a C++ implementation with https://github.com/jfalcou/eve would compare.
Thanks!
It's somewhat tiring to not have loose ends, but I agree it pays off :)
Doing this stuff in Rust is absolutely possible, and I'd do it again since my C++ days are now past me, but the endless transmuting between portable-simd types, plain rust arrays, and intrinsic types is quite annoying.
Also rust is not made for convenient raw pointer arithmetic. There plain C would be much more to the point.
You kind of lost me towards the end, so I’m not sure whether you attempted it, but I was wondering whether it would be possible for the internal nodes to store only the upper 8/16 bits (that are nonzero for the current subtree). This implies that one 64 byte cache line stores 32 or 64 entries instead of 16 (or better: 31 or 62/63, since u may need some bookkeeping dat).
The next level would be to keep track of how many prefix bits are implied by the parents, so leaf nodes could perhaps also only use 8/12/16 bits, if the the higher bits are implied by the parent. or instead of bit masks, use offsets (i.e. leaves store k bits offset from parent value).
That may screw around with the branch predictor, and may work very good with evenly distributed data vs uneven data.
There are stones still unturned. For typical unicode table lookup's you'd need batches for all characters of a word. It would be much faster to lookup eg 16 chars at once, to optimize cache misses.
But when I tested this even Eytzinger was too slow.
indeed, highway is the popularity leader, it implements lots of SIMD operations and supports lots of hardware including SVE/RVV with runtime SIMD width, although IMHO it has a bit longer learning curve
curiouscoding|1 year ago
Doing this stuff in Rust is absolutely possible, and I'd do it again since my C++ days are now past me, but the endless transmuting between portable-simd types, plain rust arrays, and intrinsic types is quite annoying.
Also rust is not made for convenient raw pointer arithmetic. There plain C would be much more to the point.
ant6n|1 year ago
The next level would be to keep track of how many prefix bits are implied by the parents, so leaf nodes could perhaps also only use 8/12/16 bits, if the the higher bits are implied by the parent. or instead of bit masks, use offsets (i.e. leaves store k bits offset from parent value).
That may screw around with the branch predictor, and may work very good with evenly distributed data vs uneven data.
rurban|1 year ago
But when I tested this even Eytzinger was too slow.
secondcoming|1 year ago
wolfgangK|1 year ago
npalli|1 year ago
https://github.com/google/highway
Bulat_Ziganshin|1 year ago
indeed, highway is the popularity leader, it implements lots of SIMD operations and supports lots of hardware including SVE/RVV with runtime SIMD width, although IMHO it has a bit longer learning curve