(no title)
dsharlet | 2 years ago
As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS generates. There's disassembly in the comments of that file to show what code it produces; I'd love to know what you think is lacking! The only differences between the code I linked and this are prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...
bjourne | 2 years ago
On my machine, single-threaded OpenBLAS (called via NumPy) multiplies two single-precision 4096x4096 matrices in 0.95 seconds. Your code takes over 30 seconds when compiled with clang++. And yes, I used -O3, -march=native, and all that jazz. Btw, your code crashes g++, which doesn't necessarily mean that it is incorrect, but it does suggest the code is difficult for the compiler to optimize. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...), run in single-threaded mode, takes 0.89 seconds. That actually beats OpenBLAS, though OpenBLAS retakes the lead on larger arrays once multi-threading is added. You can look at my code for how to write a decent inner kernel. Writing it in pure C without intrinsics and hoping that the compiler will optimize it definitely will not work.
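For readers following along: the "decent inner kernel" both commenters are circling around is a register-blocked micro-kernel. This is a hypothetical scalar sketch (the function name, tile size, and leading-dimension parameters are illustrative, not from either linked repo); a fast kernel would hold the 16 accumulators in SIMD registers via AVX/FMA intrinsics, which is exactly the step bjourne says the compiler won't reliably do for you.

```c
#include <stddef.h>

/* Sketch of a 4x4 register-blocked SGEMM micro-kernel: C_tile += A_panel * B_panel.
 * a: k x 4 panel, row p starts at a[p * lda]
 * b: k x 4 panel, row p starts at b[p * ldb]
 * c: 4 x 4 output tile with row stride ldc
 * A real kernel would replace acc[][] with AVX/FMA intrinsics so all 16
 * accumulators live in vector registers across the entire k loop. */
static void micro_kernel_4x4(size_t k,
                             const float *a, size_t lda,
                             const float *b, size_t ldb,
                             float *c, size_t ldc) {
  float acc[4][4] = {{0}};
  for (size_t p = 0; p < k; ++p) {
    /* Rank-1 update: outer product of a 4-element column of A with a
     * 4-element row of B. Keeping all 16 accumulators live lets them
     * stay in registers, so each loaded element is reused 4 times. */
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        acc[i][j] += a[p * lda + i] * b[p * ldb + j];
  }
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      c[i * ldc + j] += acc[i][j];
}
```

The design point is arithmetic intensity: with a 4x4 accumulator tile you do 16 multiply-adds per 8 loads, instead of 1 per 2 loads in a naive triple loop, which is what makes the kernel compute-bound rather than load-bound.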
It is also not true that "Parallelism would be easy to add". Unless your algorithm is designed from the start to exploit multi-threading, attempting to bolt it on later will not yield good performance.
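One common way to "design parallelism in from the start" is to partition C into disjoint row blocks so each thread owns its output and the hot loop needs no synchronization. A minimal sketch, assuming square row-major matrices and a made-up block size (a real implementation would also give each thread its own packing buffers):

```c
#include <stddef.h>

/* Hypothetical sketch: parallelize over independent row blocks of C.
 * Each iteration of the i0 loop writes a disjoint slice of C, so threads
 * never contend. Without -fopenmp the pragma is ignored and the code
 * runs serially with identical results. */
void matmul_blocked(size_t n, const float *a, const float *b, float *c) {
  enum { BLOCK = 64 };  /* illustrative row-block size, not tuned */
  #pragma omp parallel for schedule(static)
  for (size_t i0 = 0; i0 < n; i0 += BLOCK) {
    size_t imax = i0 + BLOCK < n ? i0 + BLOCK : n;
    for (size_t i = i0; i < imax; ++i)
      for (size_t p = 0; p < n; ++p) {
        /* k loop outside j: reuse a[i][p] across a whole row of B,
         * and stream c[i][*] and b[p][*] contiguously. */
        float aip = a[i * n + p];
        for (size_t j = 0; j < n; ++j)
          c[i * n + j] += aip * b[p * n + j];
      }
  }
}
```

The "bolt it on later" failure mode this avoids: if threads share packing buffers or write interleaved rows of C, you get synchronization and false sharing that eat the speedup, which is why the decomposition has to be part of the blocking scheme rather than an afterthought.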
dsharlet | 2 years ago
Here's what I see:
My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b...

BLAS inner loop in perf (a chunk of it; it is unrolled massively): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...

Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.
Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.