(no title)
Remnant44 | 1 month ago
The penalty on modern machines is an extra cycle of latency and, when crossing a cacheline, half the throughput (AVX512 always crosses a cacheline since they are cacheline sized!). These are pretty mild penalties given what you gain! So while it's true that peak L1 cache performance is gained when everything is aligned.. the blocker is elsewhere for most real code.
Sesse__|1 month ago
Hah, TIL. Too used to SSE, I guess. (My main target platform is, unfortunately, still limited to SSE3, not even SSSE3.)