The relevant operations for matrix multiply are leading-axis extension, shown near the end of [0], and Insert +˝, shown in [1]. Both are on floats; the leading-axis operation is ×, but that's the same speed as + with floating-point SIMD. We don't handle these all that well, with needless copying in × and a lot of per-row overhead in +˝, but of course it's way better than scalar evaluation.

[0] https://mlochbaum.github.io/bencharray/pages/arith.html
[1] https://mlochbaum.github.io/bencharray/pages/fold.html
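
To make the decomposition concrete, here's a rough sketch in plain Python (not BQN, and not how the implementation actually does it). The names scale_rows, sum_rows, and matmul are made up for illustration: scale_rows stands in for × with leading-axis extension, and sum_rows stands in for +˝.

```python
def scale_rows(weights, rows):
    # Stand-in for × with leading-axis extension:
    # weights[k] multiplies every element of rows[k].
    return [[w * x for x in row] for w, row in zip(weights, rows)]


def sum_rows(rows):
    # Stand-in for Insert +˝: add the rows together elementwise,
    # collapsing the leading axis into a single row.
    return [sum(column) for column in zip(*rows)]


def matmul(a, b):
    # Each result row i is the sum over k of a[i][k] * b[k],
    # so one scale_rows and one sum_rows call per row of a.
    return [sum_rows(scale_rows(a_row, b)) for a_row in a]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = matmul(a, b)
print(product)
```

Written this way, each row of the left argument costs one whole-array multiply plus one reduction over rows, which is where the copying in × and the per-row overhead in +˝ show up.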
mlochbaum|8 months ago