top | item 23773732

(no title)

ethelward | 5 years ago

Although those are AVX instructions, they are scalar. The SS suffix stands for Scalar Single-precision, so only one lane of the whole SIMD register is used.


jeffbee|5 years ago

You're absolutely right, I wasn't reading carefully. The minimal flags that seem to get GCC 10.1 to vectorize are -O3, optionally with -mavx to go wider. Clang doesn't want to vectorize until you give it -ffast-math.

creato|5 years ago

This can't really be vectorized without -ffast-math regardless. Notice that even with gcc, it's only vectorizing the multiply, and the adds are still scalar. This probably isn't that much of an improvement over the scalar code.
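To illustrate why the adds are the sticking point, here is a minimal sketch in plain C. It is not the code from the thread; the function names and the input values are made up for illustration. It sums the same array in source order and with two interleaved partial sums, which is the kind of regrouping a vectorized reduction performs.

```c
#include <stddef.h>

/* Sum strictly left to right: ((x[0] + x[1]) + x[2]) + ...
   Under strict IEEE 754 rules the compiler has to keep this order. */
float sum_in_order(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Sum with two interleaved partial sums (even and odd indices), roughly
   what a 2-lane vector reduction would compute. Hypothetical helper,
   written by hand to show the regrouping. */
float sum_two_lanes(const float *x, size_t n) {
    float even = 0.0f, odd = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += x[i];
        odd += x[i + 1];
    }
    if (i < n) {
        even += x[i]; /* leftover element when n is odd */
    }
    return even + odd;
}
```

With an input such as {1e8f, 1.0f, -1e8f, 1.0f}, the two functions can return different values on a typical single-precision target, because the 1.0f added to 1e8f is rounded away in one ordering and not in the other. That difference is what the compiler is not allowed to introduce on its own.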

-ffast-math allows the multiply-adds to be reassociated, enabling much better approaches. Clang with -O2 -ffast-math produces good code (vfmadd132ps with 4 independent accumulators), but I can't get GCC to produce good code with any flags.
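For reference, a rough sketch of what "independent accumulators" means, written out by hand in scalar C. The loop under discussion isn't shown in this thread, so this assumes it is a float dot product; dot_simple and dot_4acc are hypothetical names. The compiler's version would use vector registers as accumulators, but the regrouping idea is the same.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous add, so the loop
   forms a single serial dependency chain. */
float dot_simple(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Four independent accumulators: the four multiply-adds in each
   iteration do not depend on each other. This changes the order of the
   additions, so it may round differently from dot_simple. */
float dot_4acc(const float *a, const float *b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    float acc = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; i++) {
        acc += a[i] * b[i]; /* tail when n is not a multiple of 4 */
    }
    return acc;
}
```

Writing the reassociation out in the source like this means the regrouped order no longer depends on -ffast-math, at the cost of a result that is not guaranteed to match the single-accumulator loop bit for bit.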