It's almost certainly more about scheduling than vectorization. The data dependencies are going to constantly stall the CPU pipeline, so it's just not able to retire instructions very quickly. The SIMD part is almost certainly a red herring. It's helping, but it's far from why it's so much faster. Tiger Lake can retire 4 plain ol' ADD operations per clock[1] - you don't need SIMD / vectorization to get instruction level parallelism. But you do need to ensure there are no data dependencies. The data dependency here is the 90% cost. The SIMD is just the cherry on top.
kllrnohj|3 years ago
1: https://www.agner.org/optimize/instruction_tables.pdf