

sudara | 2 years ago

Cache locality, and the vertical pass in particular, was top of mind when I was trying to come up with good ways to vectorize. In the end (at least in my vector implementations) the difference between the passes wasn't too large. But most of them ended up having to do things like first convert the incoming row/col to its own float vector.
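To make the row/col conversion concrete, here is a minimal sketch, assuming a single-channel 8-bit image stored row-major. The names `loadRowAsFloat` and `loadColAsFloat` are hypothetical and not taken from any of the implementations mentioned above. The row copy reads contiguous bytes, while the column copy reads one byte every `width` bytes, which is where the cache locality worry for the vertical pass comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copy one row of a row-major, single-channel
// 8-bit image into its own float buffer. Reads are contiguous.
std::vector<float> loadRowAsFloat(const std::vector<std::uint8_t>& image,
                                  std::size_t width, std::size_t row)
{
    assert((row + 1) * width <= image.size());
    std::vector<float> out(width);
    const std::uint8_t* src = image.data() + row * width;
    for (std::size_t x = 0; x < width; ++x)
        out[x] = static_cast<float>(src[x]);
    return out;
}

// Hypothetical helper: same idea for one column. Each read is `width`
// bytes after the previous one, so the access pattern is strided.
std::vector<float> loadColAsFloat(const std::vector<std::uint8_t>& image,
                                  std::size_t width, std::size_t height,
                                  std::size_t col)
{
    assert(col < width && width * height <= image.size());
    std::vector<float> out(height);
    for (std::size_t y = 0; y < height; ++y)
        out[y] = static_cast<float>(image[y * width + col]);
    return out;
}
```

Once the data sits in a contiguous float buffer, both passes can run the same inner loop, which may be part of why the pass timings ended up close.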

One main issue I never resolved: in the middle of the main loop, data has to be converted and written back to the source image, and the incoming pixels have to be converted and loaded in. Even when doing all rows or cols in bulk (which was somehow always faster than doing batches of 32/64), that seemed pretty brutal.
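The shape of that main loop can be sketched with a plain sliding-window box blur over one line. This is a stand-in chosen for brevity, not the algorithm discussed above, and `boxBlurLineInPlace` is a hypothetical name. It shows the two conversions that sit inside every iteration: float to 8-bit for the write-back, and 8-bit to float for the incoming pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: in-place box blur of one line of 8-bit pixels,
// with reads clamped to the line's edges.
void boxBlurLineInPlace(std::uint8_t* line, std::size_t n, std::size_t radius)
{
    assert(line != nullptr && n > 0);
    const std::size_t window = 2 * radius + 1;
    std::vector<float> queue(window); // float copies of the pixels in the window
    float sum = 0.0f;

    // Prime the window centered on pixel 0.
    for (std::size_t k = 0; k < window; ++k) {
        const std::size_t idx = k > radius ? std::min(k - radius, n - 1) : 0;
        queue[k] = static_cast<float>(line[idx]);
        sum += queue[k];
    }

    std::size_t oldest = 0; // ring-buffer slot holding the outgoing pixel
    for (std::size_t i = 0; i < n; ++i) {
        // Convert the result and write it back to the source line.
        const float avg = sum / static_cast<float>(window);
        line[i] = static_cast<std::uint8_t>(avg + 0.5f);

        if (i + 1 < n) {
            // Convert and load the incoming pixel. Its index is always
            // greater than i, so it has not been overwritten yet.
            const std::size_t in = std::min(i + 1 + radius, n - 1);
            const float incoming = static_cast<float>(line[in]);
            sum += incoming - queue[oldest];
            queue[oldest] = incoming;
            oldest = (oldest + 1) % window;
        }
    }
}
```

For example, running it with radius 1 on the line {0, 0, 90, 0, 0} leaves {0, 30, 30, 30, 0}. The store and the load in the middle of the loop are scalar and type-converting, which is the part that resists vectorizing along the line.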

I also wondered whether it might be more efficient to rotate the entire image before and after the vertical pass, but in my implementations at least, there wasn't a huge difference in the pass timings.
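A rough sketch of that idea, using a transpose rather than a true rotation (a swapped-in simplification), and assuming a single-channel 8-bit row-major image. `transpose` and `verticalPassViaTranspose` are hypothetical names, and `rowPass` stands for whatever horizontal pass is already available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: transpose a width x height image, so that
// former columns become contiguous rows.
std::vector<std::uint8_t> transpose(const std::vector<std::uint8_t>& src,
                                    std::size_t width, std::size_t height)
{
    assert(src.size() == width * height);
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            dst[x * height + y] = src[y * width + x];
    return dst;
}

// Hypothetical wrapper: run a horizontal pass over the transposed image,
// then transpose back. rowPass is called as rowPass(pointer, length).
template <typename RowPass>
std::vector<std::uint8_t> verticalPassViaTranspose(
    const std::vector<std::uint8_t>& image,
    std::size_t width, std::size_t height, RowPass rowPass)
{
    // After the transpose there are `width` rows of `height` pixels each.
    std::vector<std::uint8_t> t = transpose(image, width, height);
    for (std::size_t r = 0; r < width; ++r)
        rowPass(t.data() + r * height, height);
    return transpose(t, height, width);
}
```

The two transposes each touch every pixel with one strided side, so the cost of the strided access moves out of the blur loop rather than disappearing. That would be consistent with the pass timings not changing much.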
