(no title)
fuber2018 | 2 years ago
You can speed up the code by unrolling your inner loop a few times (try 4x or 8x). It does mean that your overflow-prevention limit is lowered (to a multiple of the unroll factor), so the overflow handling runs a few more times. But the speedup offsets the increased bookkeeping.
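A rough sketch of the unrolling, assuming the code in question is an Adler-32-style checksum with a deferred modulo (the thread doesn't say so explicitly). The name `checksum_unrolled4` and the macros are made up for illustration; 65521 and 5552 are the modulus and block limit zlib uses for Adler-32.

```c
#include <stddef.h>
#include <stdint.h>

#define MOD    65521u  /* Adler-32 modulus (assumed checksum) */
#define NMAX   5552u   /* most bytes the 32-bit sums can absorb before a modulo is needed */
#define UNROLL 4u      /* unroll factor; hypothetical choice */
/* Round the overflow limit down to a multiple of the unroll factor. */
#define BLOCK_LIMIT (NMAX - NMAX % UNROLL)

/* Hypothetical helper: Adler-32-style checksum, inner loop unrolled 4x. */
uint32_t checksum_unrolled4(const uint8_t *p, size_t len) {
    uint32_t a = 1, b = 0;
    while (len > 0) {
        /* Process at most BLOCK_LIMIT bytes between modulo steps. */
        size_t n = len < BLOCK_LIMIT ? len : BLOCK_LIMIT;
        len -= n;
        /* Unrolled body: four bytes per iteration, no modulo inside. */
        for (; n >= UNROLL; n -= UNROLL, p += UNROLL) {
            a += p[0]; b += a;
            a += p[1]; b += a;
            a += p[2]; b += a;
            a += p[3]; b += a;
        }
        /* Leftover bytes (only possible in the final block). */
        for (; n > 0; n--) {
            a += *p++; b += a;
        }
        /* Deferred overflow prevention: one modulo per block. */
        a %= MOD;
        b %= MOD;
    }
    return (b << 16) | a;
}
```

With an unroll factor of 4 or 8 the limit doesn't actually shrink here, since 5552 is already a multiple of both; other factors would round it down slightly.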
A version I played with ran faster when it saved the in-progress accumulation in an array and did the final accumulation after the main loop was done. But that may be due to the CPU arch/compiler I'm using.
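A minimal sketch of the array idea, assuming a plain byte sum rather than the full checksum (the original version isn't shown, so this is only one reading of it). `sum_bytes_lanes` is a hypothetical name. Each array slot forms its own dependency chain, which may be what lets some CPUs and compilers overlap or vectorize the additions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum bytes using four independent partial sums. */
uint64_t sum_bytes_lanes(const uint8_t *p, size_t len) {
    uint64_t lane[4] = {0, 0, 0, 0};  /* in-progress accumulation */
    size_t i = 0;
    /* Main loop: each lane only ever depends on itself. */
    for (; i + 4 <= len; i += 4) {
        lane[0] += p[i];
        lane[1] += p[i + 1];
        lane[2] += p[i + 2];
        lane[3] += p[i + 3];
    }
    /* Leftover bytes go into the first lane. */
    for (; i < len; i++) {
        lane[0] += p[i];
    }
    /* Final accumulation, done once after the main loop. */
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```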
fuber2018 | 2 years ago
But if your code is cross-platform (running on different OSes or CPU architectures), then a SWAR version may be more consistently performant - no need to guess whether the compiler's optimization heuristics went with the general-purpose CPU registers or the faster SIMD registers.
Downside is that the devs are exposed to the gnarly optimized code.
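To give a feel for what SWAR looks like, here is a small sketch for a plain byte sum, not the full checksum; `sum_bytes_swar` and `hsum16x4` are hypothetical names. It packs four 16-bit partial sums into one ordinary 64-bit register and flushes them every 256 words so no lane can overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add up the four 16-bit lanes packed in a 64-bit word. */
static uint64_t hsum16x4(uint64_t v) {
    return (v & 0xFFFF) + ((v >> 16) & 0xFFFF) + ((v >> 32) & 0xFFFF) + (v >> 48);
}

/* Hypothetical helper: SWAR byte sum using only general-purpose registers. */
uint64_t sum_bytes_swar(const uint8_t *p, size_t len) {
    const uint64_t mask = 0x00FF00FF00FF00FFull;
    uint64_t total = 0;
    while (len >= 8) {
        uint64_t even = 0, odd = 0;
        /* Overflow limit: 256 words * 255 = 65280 fits in a 16-bit lane. */
        size_t words = len / 8;
        if (words > 256) words = 256;
        for (size_t w = 0; w < words; w++) {
            uint64_t x;
            memcpy(&x, p, 8);          /* unaligned-safe 8-byte load */
            even += x & mask;          /* bytes 0, 2, 4, 6 into 16-bit lanes */
            odd  += (x >> 8) & mask;   /* bytes 1, 3, 5, 7 into 16-bit lanes */
            p += 8;
        }
        len -= words * 8;
        /* Flush the lanes into the wide total before they can overflow. */
        total += hsum16x4(even) + hsum16x4(odd);
    }
    /* Tail: fewer than 8 bytes left. */
    for (; len > 0; len--) {
        total += *p++;
    }
    return total;
}
```

The result doesn't depend on byte order, since every byte ends up in the sum regardless of which lane it lands in. The masks and shifts are the "gnarly" part that readers of the code have to live with.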