top | item 43311933

(no title)

nicula | 11 months ago

Some people already mentioned this in the r/cpp discussion. Small correction: 256 is not the correct number of iterations, since if all elements in that slice are even, then your 8-bit counter will wrap-around to zero, which can lead to a wrong answer. What you want is 255 iterations.

I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.

discuss

ack_complete|11 months ago

255 is not ideal either because it results in a partial vector at the end of either 15 or 31 elements. 256-V where V is the vector size is better, so 240 for SSE2/NEON or 224 for AVX2.

This is still lower than optimal because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value by _mm_sad_epu8 and vpadal_u8, respectively. This allows for 255 iterations in the inner loop instead of 15 or 7.

nicula|11 months ago

Great observations, thanks!

I wrote the code that you suggested (LMK if I understood your points): https://godbolt.org/z/jW4o3cnh3

And here's the benchmark output, on my machine: https://0x0.st/8SsG.txt (v1 is std::count_if(), v2 is the optimization from my blog post, and v3 is what you suggested).

v2 is faster, but v3 is still quite fast.