(no title)
Phemist | 5 months ago
I chose u128 instead of an array of [u64;2] due to the boundary crossing issue that I saw would happen, which I could avoid using u128.
I am not very familiar with these specific bit manipulation instruction sets, so would you know if there is something similar that works on u128 instead of just u32/u64?
Const-me|5 months ago
Not as far as I’m aware, but I think your use case is handled by the u64 version rather well. Instead of u128, use array of two uint64 integers, pack the length into unused high bits of one of them.
Here’s example C++ https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2, unpack is scalar code based on that BMI1 instruction.
Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh Might be faster if you have a lot of these packed vectors, extracting numbers in a tight loop, and s_extractElements lookup table remains in L1D cache.
P.S. I’ve tested that code just a couple of times, might be bugs
anematode|5 months ago
Edit: https://godbolt.org/z/rrhW6T7Mc
Phemist|5 months ago