dsharlet | 2 years ago

BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane), so it's not slow. I mean, just look at the profiler output...

You're probably now comparing parallel code to single-threaded code.
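
For anyone who wants to check the peak arithmetic against their own machine, here is a rough sketch of the formula above. The 3.0 GHz clock is a made-up placeholder, not the frequency of the machine discussed here, and the function names are just for illustration. The matmul FLOP count uses the usual 2*m*n*k convention (one multiply plus one add per inner-product term).

```python
# Single-core peak from the formula in the comment above:
#   frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane
def peak_flops(freq_hz, fma_per_cycle=2, lanes=8, ops_per_lane=2):
    """Theoretical peak FLOP/s for one core."""
    return freq_hz * fma_per_cycle * lanes * ops_per_lane


def matmul_flops(m, n, k):
    """FLOPs in an (m x k) @ (k x n) multiply: one mul + one add per term."""
    return 2 * m * n * k


def fraction_of_peak(m, n, k, seconds, freq_hz):
    """Achieved FLOP/s divided by the single-core theoretical peak."""
    achieved = matmul_flops(m, n, k) / seconds
    return achieved / peak_flops(freq_hz)


# Hypothetical 3.0 GHz core (an assumption, plug in your own clock).
ASSUMED_FREQ_HZ = 3_000_000_000
print(peak_flops(ASSUMED_FREQ_HZ) / 1e9, "GFLOP/s peak per core")
```

A result well above 1.0 from `fraction_of_peak` is a good hint that the timing came from a multi-threaded run, since the peak here is per core.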

bjourne | 2 years ago

No, multi-threaded OpenBLAS improves performance to 0.15s.

dsharlet | 2 years ago

I dunno man. My claim was that for specific cases with unique properties, it's not hard to beat BLAS, without getting too exotic with your code. BLAS doesn't have routines for multiplies with non-contiguous data, various patterns of sparsity, mixed precision inputs/outputs, etc. The example I gave is for a specific case close-ish to the case I cared about.

You're changing it to a very different case, presumably one that you cared about, although 4096x4096 is oddly square and a very clean power of 2... I said right at the beginning of this long digression that what is hard about reproducing BLAS is its generality.
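
To make the non-contiguous case concrete, here is a minimal sketch of a multiply over flat buffers where both the row and the column stride are arbitrary. Standard GEMM takes a leading dimension but assumes unit stride along the other axis, so data laid out like `a` below would normally have to be copied first. This is plain, unoptimized Python meant only to show the indexing; `strided_matmul` and its parameters are hypothetical names, not anything from BLAS or from the code discussed in this thread.

```python
def strided_matmul(a, a_rs, a_cs, b, b_rs, b_cs, m, n, k):
    """Compute C = A @ B where A (m x k) and B (k x n) live in flat
    buffers with arbitrary row strides (rs) and column strides (cs)."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i * a_rs + p * a_cs]  # element A[i, p]
            for j in range(n):
                c[i][j] += a_ip * b[p * b_rs + j * b_cs]  # A[i, p] * B[p, j]
    return c


# A is the 2x2 matrix [[1, 2], [3, 4]] stored with a gap after every
# element (column stride 2, row stride 4), so it is not contiguous.
a = [1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 4.0, 0.0]
# B is a contiguous row-major 2x2 identity.
b = [1.0, 0.0, 0.0, 1.0]

print(strided_matmul(a, 4, 2, b, 2, 1, 2, 2, 2))
```

A tuned kernel for this layout would vectorize and block the loops, but the point is that the strides are just extra index arithmetic once you stop requiring the generic interface.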