top | item 45527386

(no title)

Rarebox | 4 months ago

Interesting example! I've been learning AVX512 by using it to optimize Huffman coding. I found _mm512_permutexvar_epi8 and used it to do byte-indexed lookups, but _mm512_permutex2var_epi8 means I can get by with 2 shuffles and 1 comparison instead of 4 shuffles and 3 comparisons. In the end, on my CPU (AMD 9950x), changing to _mm512_permutex2var_epi8 only sped up compression by ~2%.

Compared to Huff0[1] (used by Zstd), my AVX512 code is currently ~40% faster at both compression and decompression. This requires using 32 datastreams instead of 4 used by Huff0.

[1] https://github.com/Cyan4973/FiniteStateEntropy

discuss

order

camel-cdr|4 months ago

Oh, this is cool, I wanted to look into using SIMD for huffman as well.

For decode, do you use AVX512 to speedup the decode via caching the decode of small codewords?

Do you decode serially or use the self syncronizing nature of huffman codes to decode the stream from multiple offsets in parallel? I haven't seen the later done in SIMD before.

Are there any new SIMD instructions you'd like to see in future ISA extensions?

OpenPower has proposed a scalar instruction to speedup prefix-code decoding: https://libre-soc.org/openpower/prefix_codes/