(no title)
mukel
|
2 years ago
Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and I got a ~4X speedup with a single thread, but the speedup fades the more threads you add. I think that performance is constrained by memory bandwidth which is saturated with a small number of threads, regardless of vectorization.
No comments yet.