top | item 45519829

Remnant44 | 4 months ago

I'm just happy that finally, with the popularity of Zen 4 and Zen 5 chips, AVX512 is at around 20% of the running hardware in the Steam hardware survey. It's going to be a long while before it gets to a majority - Intel still isn't shipping its own instruction set in consumer CPUs - but it's going in the right direction.

Compared to the weird, lumpy lego set of avx1/2, avx512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.

Personal example: the double-width byte shuffle (_mm512_permutex2var_epi8), which takes 128 bytes of table as input in two registers. I had a critical inner loop that uses a 256-byte lookup table; running an upper and a lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on Zen 5 (which has two shuffle units), which is pretty incredible, and on its own produced a 4x speedup for the kernel as a whole.

Rarebox|4 months ago

Interesting example! I've been learning AVX512 by using it to optimize Huffman coding. I found _mm512_permutexvar_epi8 and used it to do byte-indexed lookups, but _mm512_permutex2var_epi8 means I can get by with 2 shuffles and 1 comparison instead of 4 shuffles and 3 comparisons. In the end, on my CPU (AMD 9950x), changing to _mm512_permutex2var_epi8 only sped up compression by ~2%.
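A rough sketch of where the "4 shuffles and 3 comparisons" versus "2 shuffles and 1 comparison" counts can come from, assuming a 256-entry byte table. This is not the actual Huffman code: the two instructions are modelled with plain loops so it runs anywhere, and `Vec`, `select_ge`, `lookup256_four_shuffles` and `lookup256_two_shuffles` are made-up names for illustration.

```cpp
#include <array>
#include <cstdint>

using Vec = std::array<std::uint8_t, 64>;  // stands in for one __m512i of bytes

// Scalar model of _mm512_permutexvar_epi8: one 64-byte table, low 6 index bits.
Vec permutexvar(const Vec& idx, const Vec& a) {
  Vec dst{};
  for (int j = 0; j < 64; ++j) dst[j] = a[idx[j] & 63];
  return dst;
}

// Scalar model of _mm512_permutex2var_epi8: bit 6 picks a or b,
// the low 6 bits pick the byte within that register.
Vec permutex2var(const Vec& a, const Vec& idx, const Vec& b) {
  Vec dst{};
  for (int j = 0; j < 64; ++j) {
    const Vec& src = (idx[j] & 64) ? b : a;
    dst[j] = src[idx[j] & 63];
  }
  return dst;
}

// Models one compare plus a masked blend: hi[j] where x[j] >= threshold, else lo[j].
Vec select_ge(const Vec& x, std::uint8_t threshold, const Vec& lo, const Vec& hi) {
  Vec dst{};
  for (int j = 0; j < 64; ++j) dst[j] = (x[j] >= threshold) ? hi[j] : lo[j];
  return dst;
}

// 256-entry lookup with single-table shuffles: 4 shuffles, 3 comparisons.
Vec lookup256_four_shuffles(const std::array<Vec, 4>& t, const Vec& x) {
  Vec r = permutexvar(x, t[0]);
  r = select_ge(x, 64, r, permutexvar(x, t[1]));
  r = select_ge(x, 128, r, permutexvar(x, t[2]));
  r = select_ge(x, 192, r, permutexvar(x, t[3]));
  return r;
}

// Same lookup with two-table shuffles: 2 shuffles, 1 comparison.
Vec lookup256_two_shuffles(const std::array<Vec, 4>& t, const Vec& x) {
  const Vec lo = permutex2var(t[0], x, t[1]);
  const Vec hi = permutex2var(t[2], x, t[3]);
  return select_ge(x, 128, lo, hi);
}
```

Both functions return the same 64 results; the second just lets the instruction's own index bit do half of the selection work.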

Compared to Huff0[1] (used by Zstd), my AVX512 code is currently ~40% faster at both compression and decompression. This requires using 32 datastreams instead of 4 used by Huff0.

[1] https://github.com/Cyan4973/FiniteStateEntropy

camel-cdr|4 months ago

Oh, this is cool, I wanted to look into using SIMD for huffman as well.

For decode, do you use AVX512 to speed up the decode by caching the decode of small codewords?

Do you decode serially, or use the self-synchronizing nature of Huffman codes to decode the stream from multiple offsets in parallel? I haven't seen the latter done in SIMD before.

Are there any new SIMD instructions you'd like to see in future ISA extensions?

OpenPower has proposed a scalar instruction to speed up prefix-code decoding: https://libre-soc.org/openpower/prefix_codes/

shihab|4 months ago

Could you please elaborate on your example? Thanks.

Remnant44|4 months ago

Sure.. in detail and abstracted slightly, the byte table problem:

Maybe you're remapping RGB values [0..255] with a tone curve in graphics, or doing a mapping lookup of IDs to indexes in a set, or a permutation table, or .. well, there are a lot of use cases, right? This is essentially an arbitrary function lookup where the domain and range are both bytes.

It looks like this in scalar code:

  void transform_lut(byte* dest, const byte* src, int size, const byte* lut) {
    for (int i = 0; i < size; i++) {
      dest[i] = lut[src[i]];
    }
  }

The function above is basically load/store limited - it's doing negligible arithmetic, just loading a byte from the source, using that to index a load into the table, and then storing the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 per cycle there)

Here's a snippet of the AVX512 version.

You load the lookup table into 4 registers outside the loop:

  __m512i p0, p1, p2, p3;
  p0 = _mm512_loadu_si512(lut);
  p1 = _mm512_loadu_si512(lut + 64);
  p2 = _mm512_loadu_si512(lut + 128);
  p3 = _mm512_loadu_si512(lut + 192);

Then, for each SIMD vector of 64 elements (x below, loaded from the source), use each lane's value as an index into the lookup table, just like the scalar version. Since one permute can only index 128 bytes of table, we DO have to do it twice, once for the lower half and again for the upper half, and use a mask to choose between them on a per-element basis.

  auto tLow  = _mm512_permutex2var_epi8(p0, x, p1);
  auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);

You can use _mm512_movepi8_mask to load the mask register. That instruction sets a lane's mask bit if the high bit of its byte is set, which is exactly the split between the lower and upper halves of our table. You could use the mask register directly on the second shuffle instruction or on a later blend instruction; it doesn't really matter.
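Putting those pieces together, a minimal sketch of the whole loop might look like the following. It assumes unaligned loads and stores, a compiler flag such as -mavx512vbmi (the intrinsics path is only compiled when `__AVX512VBMI__` and `__AVX512BW__` are defined), and a scalar loop for the tail. The name `transform_lut_avx512` and the tail handling are illustrative additions, not the original kernel.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX512VBMI__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

// Hypothetical name: computes dest[i] = lut[src[i]] for a 256-byte table.
void transform_lut_avx512(std::uint8_t* dest, const std::uint8_t* src,
                          std::size_t size, const std::uint8_t* lut) {
  std::size_t i = 0;
#if defined(__AVX512VBMI__) && defined(__AVX512BW__)
  // Keep the whole 256-byte table in four registers for the duration of the loop.
  const __m512i p0 = _mm512_loadu_si512(lut);
  const __m512i p1 = _mm512_loadu_si512(lut + 64);
  const __m512i p2 = _mm512_loadu_si512(lut + 128);
  const __m512i p3 = _mm512_loadu_si512(lut + 192);
  for (; i + 64 <= size; i += 64) {
    const __m512i x = _mm512_loadu_si512(src + i);
    // Bit 6 of each index picks p0 vs p1 (or p2 vs p3); bits 0..5 pick the byte.
    const __m512i tLow = _mm512_permutex2var_epi8(p0, x, p1);
    const __m512i tHigh = _mm512_permutex2var_epi8(p2, x, p3);
    // Bit 7 set means the index is >= 128, so take the upper-half result.
    const __mmask64 upper = _mm512_movepi8_mask(x);
    _mm512_storeu_si512(dest + i, _mm512_mask_blend_epi8(upper, tLow, tHigh));
  }
#endif
  // Scalar tail (and the whole input when AVX-512 VBMI is not enabled).
  for (; i < size; ++i) dest[i] = lut[src[i]];
}
```

The blend is written as a separate instruction here for readability; folding the mask into the second permute should be equivalent.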

For every 64 bytes, the AVX512 version does one load, one store and two permutes, and Zen 5 can do permutes at 2 a cycle. So 64 elements per cycle.

So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but the size of the lookup table is too small to really be useful. Being able to do an arbitrary super-fast byte-byte transform is incredibly useful.

kbolino|4 months ago

Here's a non-parallel and unoptimized implementation of that operation in Go:

  func _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 {
    var dst [64]uint8
    for j := 0; j < 64; j++ {
      i := idx[j]
      src := a
      if i&0b0100_0000 != 0 {
        src = b
      }
      dst[j] = src[i&0b0011_1111]
    }
    return dst
  }

Basically, for a lookup table of 8-bit values, you need only 1 instruction to perform up to 64 lookups simultaneously, for each 128 bytes of table.