silentvoice|6 years ago
If it's OK I'll link to this comment as the inspiration.
On iterators versus raw loops: for some reason, when I use a raw indexed loop, _nothing_ vectorizes, not even the obvious loop. From what I read online, the bounds checks stay inside the loop body because the compiler can't prove that the indices are in range. Using iterators instead is supposed to avoid this, and it did seem to in my experiments.
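To make the difference concrete, here is a minimal sketch of the two styles. The function names and the axpy-style kernel are made up for illustration and are not the code from my benchmark. Whether a given loop actually vectorizes depends on the compiler version and flags, so check the assembly instead of trusting this.

```rust
/// Indexed version (hypothetical example). `n` is passed separately, so the
/// compiler cannot assume `n <= x.len()` or `n <= y.len()`; each `x[i]` and
/// `y[i]` keeps a bounds check (and a possible panic) inside the loop body.
pub fn axpy_indexed(a: f32, x: &[f32], y: &mut [f32], n: usize) {
    for i in 0..n {
        y[i] += a * x[i];
    }
}

/// Iterator version (hypothetical example). `zip` stops at the shorter of the
/// two slices, so there is no per-element index to check.
pub fn axpy_iter(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x.iter()) {
        *yi += a * *xi;
    }
}
```

Both compute the same result when `n` equals the slice lengths; the only difference is how much the compiler knows about the indices.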
lovasoa|6 years ago
See the generated assembler here: https://rust.godbolt.org/z/G5A2u0
silentvoice|6 years ago
I've tried using mul_add, but so far performance isn't much better. I also noticed someone else running a big parallel build on my machine, so I'll wait until that's done and then rerun the full sweep over the problem sizes with mul_add.
So it seems FMA itself didn't really have a performance implication; its presence just confirmed that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" also lets the compiler reassociate floating-point additions, which is what allows reductions to vectorize.
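A rough sketch of what I mean, with made-up function names and not the code I benchmarked. The first function is a plain dot product using `mul_add`: there is a single accumulator, and without "fast math" the additions have to happen in source order. The second does the reassociation by hand with four independent accumulators, which is one way to give the compiler something it may vectorize without any fast-math flag. The lane count of 4 is an arbitrary assumption, and the two versions can differ in the last bits of the result because the summation order differs.

```rust
/// Single-accumulator dot product using fused multiply-add (hypothetical example).
/// Each step depends on the previous one, so the sum is strictly ordered.
pub fn dot_fma(x: &[f64], y: &[f64]) -> f64 {
    x.iter()
        .zip(y)
        .fold(0.0_f64, |acc, (&a, &b)| a.mul_add(b, acc))
}

/// Dot product with four independent partial sums (hypothetical example).
/// This changes the order of the additions explicitly in the source.
pub fn dot_lanes(x: &[f64], y: &[f64]) -> f64 {
    // Trim both slices to a common length so the chunks line up.
    let n = x.len().min(y.len());
    let (x, y) = (&x[..n], &y[..n]);

    let xc = x.chunks_exact(4);
    let yc = y.chunks_exact(4);
    // Elements left over after the last full chunk of 4.
    let (xr, yr) = (xc.remainder(), yc.remainder());

    let mut acc = [0.0_f64; 4];
    for (cx, cy) in xc.zip(yc) {
        for k in 0..4 {
            acc[k] += cx[k] * cy[k];
        }
    }

    let tail: f64 = xr.iter().zip(yr).map(|(a, b)| a * b).sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

On inputs with small integer values both functions agree exactly; on general data expect small rounding differences, which is precisely why the compiler won't make this transformation on its own.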
tom_mellior|6 years ago