top | item 38861265

(no title)

dsharlet | 2 years ago

This gets to 90% of BLAS: https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...

The less involved versions still get ~70%.

But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.

discuss

order

bjourne|2 years ago

I don't believe you. OpenBLAS is multithreaded but the code you posted is single-threaded. The inner kernel isn't very well optimized either. So, no way.

dsharlet|2 years ago

I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.

As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...