top | item 39269725

(no title)

Whether the technique described here is actually faster depends heavily on the application. On x86, shuffle instructions are the bottleneck for many algorithms (at least the kind that I often work with). Storing constants this way adds an extra shuffle every time you need to broadcast one of the constants back into a vector register, which makes that bottleneck worse. In those cases, I’ve found that lightly spilling to the stack actually performs better.
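To make the trade-off concrete, here is a minimal sketch in C with SSE intrinsics, assuming four float constants packed into one register. The function names (`pack_constants`, `broadcast_lane2`, `broadcast_from_memory`) are hypothetical and only for illustration; they are not from the article under discussion. The first path pays a shuffle for every broadcast, while the second reads the constant back from memory.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics, baseline on x86-64 */

/* Hypothetical helper: pack four scalar constants into one vector
 * register so that they occupy a single register instead of four. */
static inline __m128 pack_constants(float c0, float c1, float c2, float c3) {
    return _mm_setr_ps(c0, c1, c2, c3);  /* c0 lands in lane 0 */
}

/* Option A: broadcast lane 2 of the packed register.
 * This costs one shuffle (shufps) on every use. */
static inline __m128 broadcast_lane2(__m128 packed) {
    return _mm_shuffle_ps(packed, packed, _MM_SHUFFLE(2, 2, 2, 2));
}

/* Option B: keep the constants in memory (for example, spilled to the
 * stack) and broadcast with a load. Whether this avoids the shuffle
 * unit depends on the target and on what the compiler emits. */
static inline __m128 broadcast_from_memory(const float *consts, int i) {
    assert(i >= 0 && i < 4);
    return _mm_load1_ps(&consts[i]);
}

/* Returns 1 if both options produce the same broadcast vector. */
static int broadcasts_agree(void) {
    const float consts[4] = {1.5f, 2.5f, 3.5f, 4.5f};
    __m128 packed = pack_constants(consts[0], consts[1], consts[2], consts[3]);
    float a[4], b[4];
    _mm_storeu_ps(a, broadcast_lane2(packed));
    _mm_storeu_ps(b, broadcast_from_memory(consts, 2));
    for (int i = 0; i < 4; i++) {
        if (a[i] != 3.5f || b[i] != 3.5f) return 0;
    }
    return 1;
}
```

Both paths give the same result; the difference is which execution resources they use. As far as I know, a build with AVX enabled can turn the memory broadcast into a single `vbroadcastss` from memory, which on many Intel cores runs on the load ports, while an SSE-only build typically lowers it to a load plus a shuffle. Treat that as an assumption to check in the generated assembly and in a profile on the target chip.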


pixelesque|2 years ago

Yeah, I haven't checked more recent Intel/AMD processors within the last few years, but it used to be that on Intel CPUs only port 5 could execute shuffles, so code with fairly heavy shuffle usage could bottleneck on that port.

ack_complete|2 years ago

It's better now that Ice Lake+ can do some shuffles and unpack operations on two ports, but bottlenecking on the shuffle ports can still be a problem.

vardump|2 years ago

It's another tool in the toolbox. Can't have too many of these!

In the end, you'll need to profile it on your target chips.
