top | item 21347954

(no title)

silentvoice | 6 years ago

Hi I wrote the blog post linked - and I feel a little silly that I didn't check that _both_ loops vectorized. So I fixed the Rust implementation to keep a running vector of partial sums which I finish up at the end - this one did vectorize. The result was a 2X performance bump, which I'm about to include in the blog post as an update.

If it's OK I'll link to this comment as the inspiration.

On the iterators versus loop: for some reason when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online was that bounds checking happens inside the loop body because Rust doesn't know where those indices are coming from. Using iterators instead is supposed to fix this, and it did seem to in my experiments.

discuss

order

lovasoa|6 years ago

I liked your trick to iterate on chunks to force the compiler to vectorize the code ! Now that the code is properly vectorized, you can add the `mul_add` function, and this time you'll see a significant speedup. I tried it on my machine and it made the code 20% faster.

See the generated assembler here: https://rust.godbolt.org/z/G5A2u0

silentvoice|6 years ago

Thanks! The chunks trick was a fairly straightforward translation of what I would do in C++ if the compiler wouldn't vectorize the reduction for some reason. These days most compilers will do it if you pass enough flags, a fact I really took for granted when doing this because Rust is more conservative.

I've tried using mul_add, but at the moment performance isn't much better. But I also noticed someone else on my machine running a big parallel build, so I'll wait a little later and run the full sweep over the problem sizes with mul_add.

So really the existence of FMA didn't have a performance implication it seems except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" will also allow vectorization of reductions.

tom_mellior|6 years ago

Great to hear that you managed another 2x speedup! Sure, feel free to link my comment if you like.