Huh? I thought the issue before RingAttention was the memory requirement of the softmax layer, since you have to load the whole matrix in at once? It's O(s^2), no?
chillee|2 years ago
But no, FlashAttention already solved the memory requirements of attention. RingAttention is primarily useful for parallelizing across the sequence component.
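A minimal NumPy sketch of the online-softmax trick that lets FlashAttention avoid materializing the full s x s score matrix (function name and block size here are illustrative, not from any library):

```python
import numpy as np

def blockwise_attention(q, K, V, block=2):
    # Attention for a single query vector q, streaming over K/V in
    # blocks: only O(block) scores live at once, never the full
    # O(s^2) matrix. Partial sums are rescaled as the running max
    # changes (the "online softmax" recurrence).
    m = -np.inf                  # running max of scores seen so far
    denom = 0.0                  # running softmax denominator
    acc = np.zeros(V.shape[1])   # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q             # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale earlier partial sums
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[i:i+block]
        m = m_new
    return acc / denom

# Agrees with naive attention that builds all scores at once:
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
w = np.exp(K @ q - (K @ q).max())
naive = (w / w.sum()) @ V
assert np.allclose(blockwise_attention(q, K, V), naive)
```

RingAttention then shards this same blockwise loop across devices, passing K/V blocks around a ring, which is why its contribution is parallelism over the sequence rather than the memory fix itself.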
casercaramel144|2 years ago
How do you do matrix-vector attention without keeping the full matrix in cache? Surely you don't just load and unload it a million times.