Although AVX instructions, those are scalar. The SS pattern at the end is for Scalar Single-precision, so only one lane of the whole SIMD register is used.
You're absolutely right, I wasn't reading. The minimal flags that seem to get GCC 10.1 to vectorize are -O3 with optionally -mavx to go wider. Clang doesn't want to vectorize until you give -ffast-math
This can't really be vectorized without -ffast-math regardless. Notice that even with gcc, it's only vectorizing the multiply, and the adds are still scalar. This probably isn't that much of an improvement over the scalar code.
-ffast-math allows the multiply-adds to be reassociated, enabling much better approaches. Clang with -O2 -ffast-math produces good code (vfmadd132ps with 4 independent accumulators), I can't get GCC to produce good code with any flags.
jeffbee|5 years ago
creato|5 years ago
-ffast-math allows the multiply-adds to be reassociated, enabling much better approaches. Clang with -O2 -ffast-math produces good code (vfmadd132ps with 4 independent accumulators), I can't get GCC to produce good code with any flags.