top | item 18199372


ndh2 | 7 years ago

My understanding is that this is only about memory access patterns, and has nothing to do with multithreading. The basic example is matrix multiplication, C=A∙B. If the matrices are too big to fit in the L2/L3 cache, you have to think about how to optimize memory access. So instead of having the compiler just generate instructions to read/write memory, you want it to interleave instructions that prefetch memory ahead of time. But how do you do that? What is the best point in time to prefetch?
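To make the prefetching idea concrete, here is a minimal C sketch using the `__builtin_prefetch` builtin (available in GCC and Clang). The prefetch distance of 16 elements is an assumption for illustration; the whole point of the parent comment is that the right distance depends on the CPU and memory, and a compiler or autotuner would have to pick it.

```c
#include <stddef.h>

/* Sketch: stream over a large array, issuing a prefetch a fixed
   distance ahead of the current element. The distance (16 doubles
   here) is a guess that would need per-machine tuning. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 16;  /* prefetch distance: an assumed tuning value */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```

On hardware with a decent stride prefetcher this explicit hint may not help at all for a simple linear scan; it matters more for access patterns the hardware cannot predict.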

Another optimization is locality: the basic algorithm is to loop over i and j, then multiply A's row i with B's column j. You do that with another loop over k to compute the sum over Aᵢₖ∙Bₖⱼ, then store the result in Cᵢⱼ. But what if one row or column is already too big for the cache? Also, once you've invested in loading the data into the cache, you want to make the best use of it. You don't want to load it again if you can avoid it. So what you do is limit the loop ranges for i, j, and k such that the overall memory accessed fits into the cache. But what are the optimal loop sizes?
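The "limit the loop ranges" idea is usually called loop tiling or blocking. A minimal C sketch for row-major n×n matrices, with a tile size of 32 picked purely as an example (choosing it well is exactly the open question above):

```c
#include <stddef.h>

#define BLK 32  /* tile size: an assumed value, should be tuned to cache size */

/* Sketch: tiled C = A*B for row-major n x n matrices. Each trio of
   outer loops selects a BLK x BLK tile of A, B, and C so that the
   working set of the inner loops fits in cache and gets reused. */
void matmul_tiled(const double *A, const double *B, double *C, size_t n) {
    for (size_t i = 0; i < n * n; i++) C[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t kk = 0; kk < n; kk += BLK)
            for (size_t jj = 0; jj < n; jj += BLK)
                for (size_t i = ii; i < ii + BLK && i < n; i++)
                    for (size_t k = kk; k < kk + BLK && k < n; k++) {
                        double aik = A[i * n + k];  /* reused across j */
                        for (size_t j = jj; j < jj + BLK && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

Real BLAS libraries go further (multiple tile levels matched to L1/L2/L3, packing, vectorized microkernels), but this is the core transformation the polyhedral tools reason about.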

The answer depends on the CPU architecture (cache sizes) and probably also on the memory that you're using.
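On Linux with glibc, some of those machine parameters can even be queried at runtime via `sysconf`; the `_SC_LEVEL*` names below are glibc extensions (not POSIX), and may report -1 or 0 when the libc cannot determine the value:

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch: read cache sizes via glibc's sysconf extensions.
   A value of -1 or 0 means the size could not be determined. */
void print_cache_sizes(void) {
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);
    printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes\n", l1, l2, l3);
}
```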


gnufx | 7 years ago

I don't know about this specific example, but polyhedral optimization generally, at least, is useful for parallelization of such loops, e.g. http://pluto-compiler.sourceforge.net/. However, serial GEMM is typically most important in HPC, with parallelization at a higher level.

(GEMM shouldn't be main memory limited, if that's what you mean.)