mlochbaum | 8 months ago

The relevant operations for matrix multiply are leading-axis extension, shown near the end of [0], and Insert (+˝), shown in [1]. Both are for floats; in matrix multiply the leading-axis operation is ×, but with floating-point SIMD that's the same speed as +. We don't handle either of these all that well, with needless copying in × and a lot of per-row overhead in +˝, but of course it's way better than scalar evaluation.

[0] https://mlochbaum.github.io/bencharray/pages/arith.html

[1] https://mlochbaum.github.io/bencharray/pages/fold.html
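To make the two steps concrete, here is a minimal Python sketch of how a matrix product can be decomposed into them. It is only an illustration of the idea, not how CBQN implements anything; the function names `extend_multiply`, `insert_add` and `matmul` are made up for this example, and it assumes non-empty, rectangular lists of floats.

```python
def extend_multiply(vec, mat):
    # Leading-axis extension: vec[k] is paired with the whole row mat[k],
    # so each scalar multiplies an entire row at once.
    return [[v * x for x in row] for v, row in zip(vec, mat)]


def insert_add(mat):
    # Insert with + : combine the rows of mat, adding whole rows at a time.
    # (Evaluation order is simplified here; it only affects float rounding.)
    acc = list(mat[0])
    for row in mat[1:]:
        acc = [a + b for a, b in zip(acc, row)]
    return acc


def matmul(a, b):
    # Each row of a is extended against b, then the products are summed by rows.
    return [insert_add(extend_multiply(row, b)) for row in a]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

Both inner steps work on whole rows, which is where the SIMD-friendly arithmetic and the per-row overhead mentioned above come from.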


mlochbaum | 8 months ago

And the reason +˝ is fairly fast for long rows, despite that page claiming no optimizations, is that ˝ is defined to split its argument into cells (e.g. the rows of a matrix) and apply + with those as arguments. So + is able to apply its ordinary vectorization, which it can't in some other situations where it's applied element-wise. This still doesn't make great use of cache. I do have some special code working for floats that does much better with a tiling pattern, but I wanted to improve +˝ for integers along with it and haven't finished that part (widening on overflow is complicated).
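A rough Python sketch of the difference, for anyone who wants to see the access patterns side by side. This is not the CBQN code: `add_rows`, `insert_add` and `insert_add_tiled` are hypothetical names, the tile width is arbitrary, and the tiled version is just one guess at what "a tiling pattern" could look like. It assumes a non-empty rectangular matrix of floats.

```python
from functools import reduce


def add_rows(x, y):
    # One whole-row addition; in an array language this is the vectorized +.
    return [a + b for a, b in zip(x, y)]


def insert_add(mat):
    # Insert as defined on cells: fold + over the rows, starting from the
    # last row (BQN folds from the right). Every step reads and writes a
    # full-width accumulator, so long rows fall out of cache between steps.
    return reduce(lambda acc, row: add_rows(row, acc), reversed(mat))


def insert_add_tiled(mat, tile=4):
    # Assumed tiling: finish a narrow block of columns across all rows
    # before moving on, so the accumulator for that block stays small.
    width = len(mat[0])
    out = [0.0] * width
    for start in range(0, width, tile):
        stop = min(start + tile, width)
        acc = list(mat[-1][start:stop])
        for row in reversed(mat[:-1]):
            for j in range(stop - start):
                acc[j] = row[start + j] + acc[j]
        out[start:stop] = acc
    return out


mat = [[1.0, 2.0, 3.0, 4.0, 5.0],
       [10.0, 20.0, 30.0, 40.0, 50.0],
       [100.0, 200.0, 300.0, 400.0, 500.0]]
print(insert_add(mat))           # [111.0, 222.0, 333.0, 444.0, 555.0]
print(insert_add_tiled(mat, 2))  # same result, different traversal order
```

Both versions add each column's values in the same order, so they give the same floats; only the memory traversal differs. The integer case would also need to widen the accumulator type on overflow, which this sketch ignores.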