top | item 32984235

(no title)

giyanani | 3 years ago

Why do you say shuffle is “SIMD’s killer app”? I’ve only dabbled in vector instructions from a learning perspective, and seen others mention it’s important too, but have yet to understand why.

discuss

Veliladon|3 years ago

Because you can do things using bitmasks and single instructions instead of brute forcing using multiple instructions.

Let's say you have a whole heap of 8-bit numbers you want to multiply by 2 and you have a set of 256-bit registers and a nice SIMD multiply command. If you don't have a shuffle you need to assemble your series of 2s for the second operand for each lane before you can even start. This is going to take hundreds of instructions and hundreds of clocks. Shuffle means you load up lane 0 with the "2" and then splat the contents of lane 0 across the other 31 lanes in two instructions and a few clocks using the shuffle unit.

N.B. Shuffle isn't just about splatting. There's a whole heap of different operations it can do that are useful. I just picked a simple example with an obvious massive performance increase for illustrative purposes.

stabbles|3 years ago

You're not talking about shuffle, you're talking about broadcast. Shuffle instructions is where you take one or two vectors, and output a third with elements from any index of the input. So for example `out = [in[2], in[1]]` is a shuffle of a vector of length 2.

It's useful for example if you have say RGB color data stored contiguously in memory as say RGBRGBRGBRGB..., and you want to vectorize operations on R, B and G separately. You can load a few registers like [RGBR][GBRG][BRGB], and then shuffle them to [RRRR][BBBB][GGGG]. In fact it's not entirely trivial how to shuffle optimally, it takes a few shuffles to get there.

More generally, if you have an array of structs, you often need to go to struct of arrays to do vectorized operations on the array, before returning to an array of struct again.

Another example is fast matrix transpose (in fact you can think of the RGB example a 3 by N matrix transpose to N by 3, where N is the vector width -- AoS -> SoA is a transpose too, in a sense). Suppose you have a matrix of size N by N where N is the vector width, you need N lg N shuffles to transpose the matrix.

Dylan16807|3 years ago

I think that example is too simple to show the benefit of shuffle. It's like explaining the benefit of an adder by showing how you can move a value with X = Y + 0. Especially since there's also a (much simpler) piece of hardware dedicated to ultra-fast splat/broadcast (under the right conditions).

magicalhippo|3 years ago

It's basically several moves for the price of one. Given that you operate on multiple values at once, being able to shuffle or duplicate values comes up all the time.

For example if you're filtering four image lines at a time using a 1D filter kernel, you'll want to replicate the filter coefficient to each SIMD element, so that you can multiply each of the four pixel values with the same coefficient. Shuffle lets you replicate a single coefficient value into all the elements of a register in one instruction.

daniel-cussen|3 years ago

Which is the point of SIMD. Several moves for the price of one.

MrBuddyCasino|3 years ago

This Rust issue [0] was the best short summary of what an SIMD Shuffle is I could find:

„A "shuffle", in SIMD terms, takes a SIMD vector (or possibly two vectors) and a pattern of source lane indexes (usually as an immediate), and then produces a new SIMD vector where the output is the source lane values in the pattern given.“

[0] https://github.com/rust-lang/portable-simd/issues/11

demindiro|3 years ago

I use PSHUFB to convert 24-bit RGB to 32-bit RGBX or BGRX. Without a shuffle instruction it'd be quite a bit harder.

zX41ZdbW|3 years ago

Here is an overview of the usage of the shuffle instruction to speed up decompression in ClickHouse: https://habr.com/ru/company/yandex/blog/457612/