Huh? I thought the issue before RingAttention was the memory requirement of the softmax layer, since you have to load the whole matrix in at once? It's O(s^2), no?
chillee|2 years ago
But no, FlashAttention already solved the memory requirements of attention. RingAttention is primarily useful for parallelizing across the sequence component.
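A minimal NumPy sketch of the online-softmax trick that lets FlashAttention avoid materializing the full s x s score matrix (function name and block size here are illustrative, not from any library):

```python
import numpy as np

def blockwise_attention(q, K, V, block=2):
    # Attention for a single query vector q, streaming over K/V in
    # blocks: only O(block) scores live at once, never the full
    # O(s^2) matrix. Partial sums are rescaled as the running max
    # changes (the "online softmax" recurrence).
    m = -np.inf                  # running max of scores seen so far
    denom = 0.0                  # running softmax denominator
    acc = np.zeros(V.shape[1])   # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q             # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale earlier partial sums
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[i:i+block]
        m = m_new
    return acc / denom

# Agrees with naive attention that builds all scores at once:
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
w = np.exp(K @ q - (K @ q).max())
naive = (w / w.sum()) @ V
assert np.allclose(blockwise_attention(q, K, V), naive)
```

RingAttention then shards this same blockwise loop across devices, passing K/V blocks around a ring, which is why its contribution is parallelism over the sequence rather than the memory fix itself.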
casercaramel144|2 years ago
How do you do matrix-vector attention without keeping the full matrix in cache? Surely you don't just load and unload it a million times.