Whether the technique described here is actually faster depends heavily on the application. The problem is that, on x86, shuffle instructions are the bottleneck for many algorithms (at least the kind I often work with). Storing constants this way adds an extra shuffle every time you need to broadcast one of the constants back to a full vector register, which makes that bottleneck worse. In those cases, I’ve found that lightly spilling to the stack actually performs better.
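To make the trade-off concrete, here's a rough sketch of the two options, assuming plain SSE on x86-64. The function names (`scale_packed`, `scale_spilled`, `variants_agree`) and the constant values are made up for illustration and aren't from the article. The first variant keeps four constants packed in one register and pays a shuffle per use; the second reads an already-broadcast copy from memory, which is roughly what a compiler's stack spill would look like.

```c
#include <xmmintrin.h>  /* SSE: _mm_shuffle_ps, _mm_mul_ps, _mm_loadu_ps, ... */

/* Variant A: four constants share one register.
 * Every use needs a shuffle to broadcast the wanted lane. */
static __m128 scale_packed(__m128 x, __m128 packed_consts)
{
    /* Copy lane 2 of packed_consts into all four lanes (one shuffle). */
    __m128 c = _mm_shuffle_ps(packed_consts, packed_consts,
                              _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_mul_ps(x, c);
}

/* Variant B: the constant was broadcast once and lives in memory
 * (standing in for a stack spill). Each use is a load, not a shuffle. */
static __m128 scale_spilled(__m128 x, const float *bcast)
{
    return _mm_mul_ps(x, _mm_loadu_ps(bcast));
}

/* Returns 1 if both variants produce identical lanes, 0 otherwise. */
static int variants_agree(void)
{
    /* Hypothetical constants; lane 2 holds 2.0f. */
    __m128 packed = _mm_setr_ps(0.5f, 1.5f, 2.0f, 4.0f);
    float spilled[4] = { 2.0f, 2.0f, 2.0f, 2.0f };
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);

    float a[4], b[4];
    _mm_storeu_ps(a, scale_packed(x, packed));
    _mm_storeu_ps(b, scale_spilled(x, spilled));

    for (int i = 0; i < 4; ++i) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

Both variants compute the same thing; the difference is only which execution resource they lean on. If the surrounding loop is already saturating the shuffle unit, Variant A adds to that pressure while Variant B moves the cost onto the load path, and which one wins is something only a profile can tell you.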
pixelesque|2 years ago
ack_complete|2 years ago
vardump|2 years ago
In the end, you'll need to profile it on your target chips.