top | item 46841699

Archit3ch | 29 days ago

Is there a version of this that supports sparse LU solves?

coherentpony | 29 days ago

I’m trying to make sense of this question.

GEMM is dense O(N^3) work with roughly the same access pattern and data-reuse properties regardless of the matrices involved. Of course, I'm simplifying a lot here; tall-skinny and short-fat shapes are much harder to get performance out of, but the spirit of the approach is the same as for big square matrices.
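To make the "dense O(N^3) work with high reuse" point concrete, here's a naive triple-loop GEMM sketch (illustrative only; a real BLAS tiles this for caches and vector units):

```python
# Naive GEMM: C = A @ B for N x N matrices. The triple loop makes the
# O(N^3) multiply-add count explicit, and every element of A and B is
# reused N times -- that reuse is what gives dense GEMM its high
# arithmetic intensity, independent of the values in the matrices.
def gemm(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):          # i-k-j order: row-wise access to B and C
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C
```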

Sparse LU solves have a different character. There is nowhere near O(N^3) work; you typically expect something closer to O(N^2). But getting performance out of these operations is notoriously difficult because it depends heavily on the sparsity pattern of the linear system. Making matters worse, you may commonly have a sparse A whose L and/or U factors come out dense (fill-in).
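The classic textbook illustration of that last point is an arrowhead matrix: nonzeros only on the diagonal, first row, and first column, yet eliminating the first pivot fills in the entire trailing block. A small sketch using plain Doolittle LU without pivoting (reversing the ordering, arrow at the bottom-right corner, would produce no fill at all):

```python
def lu_nnz(A):
    """LU factorization without pivoting; return nonzero count of L+U."""
    n = len(A)
    M = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            M[i][k] /= M[k][k]              # multiplier, stored as L
            for j in range(k + 1, n):
                M[i][j] -= M[i][k] * M[k][j]
    return sum(1 for row in M for x in row if abs(x) > 1e-12)

# 6x6 arrowhead: 16 nonzeros out of 36, but LU makes it fully dense.
n = 6
arrow = [[0.0] * n for _ in range(n)]
for i in range(n):
    arrow[0][i] = arrow[i][0] = 1.0
    arrow[i][i] = 4.0                       # set last so the corner pivot is 4
```

Here `lu_nnz(arrow)` comes out to 36: every entry of L and U is nonzero even though A was 56% zeros.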

Archit3ch | 28 days ago

I'm working with small matrices (e.g. 10x10 to 100x100), where I believe cache/pipeline/register effects dominate before the O(N^2)-vs-O(N^3) asymptotics matter. Dispatching to the hardware accelerators (SME2 FMLA or AMX FMA) and doing a _dense_ solve with 512-bit vectors could still beat a sparse solve, or a NEON implementation, at these sizes.

Though as mentioned elsewhere in the thread, these accelerators only offer throughput, and latency suffers...
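At these sizes a dense solve is a small, regular kernel: one O(N^3) factorization plus O(N^2) triangular solves over contiguous rows, which is exactly the shape of work wide-vector hardware likes. A minimal pure-Python sketch of LU with partial pivoting (library code would instead call LAPACK's dgetrf/dgetrs, e.g. via Accelerate on Apple platforms):

```python
# Dense solve of A x = b via Gaussian elimination with partial pivoting.
# The forward substitution is fused into the elimination by carrying b
# along as an augmented column; back substitution finishes the solve.
def lu_solve(A, b):
    n = len(A)
    M = [row[:] for row in A]
    x = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k + 1, n):
                M[i][j] -= m * M[k][j]
            x[i] -= m * x[k]                 # fused forward substitution
    for i in reversed(range(n)):             # back substitution
        x[i] = (x[i] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x
```

For repeated solves against the same A, the factorization would of course be done once and only the O(N^2) triangular solves repeated.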