(no title)
ndh2 | 7 years ago
Another optimization is locality. The basic algorithm loops over i and j and multiplies A's row i with B's column j: an inner loop over k computes the sum of Aᵢₖ∙Bₖⱼ, and the result is stored in Cᵢⱼ. But what if a single row or column is already too big for the cache? Also, once you've invested in loading data into the cache, you want to make the best use of it and avoid loading it again. So you limit the loop ranges for i, j, and k such that the memory accessed within one block of iterations fits into the cache. But what are the optimal loop sizes?
The answer depends on the CPU architecture (cache sizes) and probably also on the memory that you're using.
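To make the blocking idea concrete, here is a minimal sketch in plain Python. The names `blocked_matmul` and `naive_matmul` and the tile-size parameter `bs` are made up for this illustration; real BLAS libraries do this in tuned C or assembly, with tile sizes picked per cache level. Pure Python won't show the speedup, it only shows the loop structure.

```python
def naive_matmul(A, B):
    """Reference version: for each (i, j), sum A[i][k] * B[k][j] over k."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def blocked_matmul(A, B, bs=2):
    """Same result, but the i, j, k ranges are cut into tiles of size bs.

    bs is a hypothetical tuning knob: in a real implementation it would be
    chosen so that the tiles of A, B and C touched together fit in cache.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    # Outer three loops walk over tiles.
    for i0 in range(0, n, bs):
        for j0 in range(0, p, bs):
            for k0 in range(0, m, bs):
                # Inner three loops stay inside one tile; min() handles
                # matrix sizes that are not a multiple of bs.
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, p)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + bs, m)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Each entry of C is accumulated across several k-tiles, which is why the partial sum is read back from `C[i][j]` before the innermost loop. Finding a good `bs` for a given machine is usually done by measuring, not by formula.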
gnufx | 7 years ago
(GEMM shouldn't be main memory limited, if that's what you mean.)