>> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references
It's a collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page faults, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if that gives better spatial locality, and so on.
The trick is indeed to imagine how the CPU works with the Lx caches and keep as much of the working set in them as possible. So it's not only about exploiting fancy instructions, but also about thinking in engineering terms. Most software written in higher-level languages cannot use L1/L2 effectively, which is why algorithms of similar asymptotic complexity constantly end up slower in practice.
kpw94|1 year ago
https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...
larodi|1 year ago