What is the point of making the matrix multiplication itself multithreaded (other than benchmarking)? Wouldn't it be more beneficial in practice to put the multithreading in the algorithm that uses the multiplication?
That's indeed what's typically done in HPC. However, simply substituting a parallel BLAS can speed up the right sort of R code, for instance, with no changes to the code itself. HPC codes, though, typically aren't bottlenecked on GEMM.
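To make the "parallelism in the caller" idea concrete, here is a minimal sketch in pure Python. The names `matmul` and `batched_matmul` are hypothetical and only for illustration. `matmul` stands in for a serial GEMM call, and the threads are spread over independent products rather than used inside one product. With CPython's GIL this pure-Python version shows the structure only, not a speedup; the assumption is that in real code the inner call would be a serial BLAS routine.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Serial matrix product of two lists of rows (stand-in for a serial GEMM)."""
    cols = list(zip(*b))  # columns of b, so each entry is a row-by-column dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def batched_matmul(pairs, workers=4):
    """Run one independent serial product per task; the parallelism lives in the caller."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(lambda pair: matmul(*pair), pairs))


# Four independent products: a fixed 2x2 matrix times k * identity, k = 1..4
pairs = [([[1, 2], [3, 4]], [[k, 0], [0, k]]) for k in range(1, 5)]
results = batched_matmul(pairs)
```

The opposite approach, swapping in a threaded BLAS, leaves the calling code single-threaded and unchanged, which is why it is the easy win for code that spends most of its time in a few large multiplications.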
gnufx|1 year ago