top | item 37054187

(no title)

mukel | 2 years ago

Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and I got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think that performance is constrained by memory bandwidth which is saturated with a small number of threads, regardless of vectorization.

discuss

No comments yet.