top | item 45366707

(no title)

Phemist | 5 months ago

I am handling a lot of vectors of size 6 or less with integers of values lower than 500.000. I am packing individual vectors in a u128, which comes down to 6*19 + a 3 bit length field = 117 bits with 11 bit left to spare.

I chose u128 instead of an array of [u64;2] due to the boundary crossing issue that I saw would happen, which I could avoid using u128.

I am not very familiar with these specific bit manipulation instruction sets, so would you know if there is something similar that works on u128 instead of just u32/u64?

discuss

order

Const-me|5 months ago

> would you know if there is something similar that works on u128 instead of just u32/u64?

Not as far as I’m aware, but I think your use case is handled by the u64 version rather well. Instead of u128, use array of two uint64 integers, pack the length into unused high bits of one of them.

Here’s example C++ https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2, unpack is scalar code based on that BMI1 instruction.

Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh Might be faster if you have a lot of these packed vectors, extracting numbers in a tight loop, and s_extractElements lookup table remains in L1D cache.

P.S. I’ve tested that code just a couple of times, might be bugs

anematode|5 months ago

I don't think there's a single instruction to do this, but you could probably do it with a combination of shld + bzhi + cmov. rustc already seems to do a great job, and whatever I could come up with that assumes [src, src + len] is always in bounds isn't that much better.

Edit: https://godbolt.org/z/rrhW6T7Mc

Phemist|5 months ago

Hvala puno. Very interesting! I will see how my implementation compares in asm.