(no title)
nicula | 11 months ago
I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
nicula | 11 months ago
I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
ack_complete|11 months ago
This is still lower than optimal because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value by _mm_sad_epu8 and vpadal_u8, respectively. This allows for 255 iterations in the inner loop instead of 15 or 7.
nicula|11 months ago
I wrote the code that you suggested (LMK if I understood your points): https://godbolt.org/z/jW4o3cnh3
And here's the benchmark output, on my machine: https://0x0.st/8SsG.txt (v1 is std::count_if(), v2 is the optimization from my blog post, and v3 is what you suggested).
v2 is faster, but v3 is still quite fast.