cochne | 2 years ago

I think they are correct. Do you have a source? To my knowledge, the only other components are the fully connected networks, which are not big contributors.

imtringued | 2 years ago

It's quadratic because of the dot product in the attention mechanism.

You can use KV caching to get rid of a lot of the redundant matrix multiplications during generation, but even after everything is cached you still need to compute the dot product k_i * q_j for every pair of token indices i, j. With n tokens, that gives O(n^2).

But you have to remember that this is only on the order of n^2 dot products. It's not exactly the end of the world at context sizes of 32k, for example. It only gets nasty in the hundreds of thousands to millions.
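
A minimal numpy sketch of that counting argument (sizes here are illustrative, not taken from any particular model): even with every key cached, each newly generated token still has to be dotted against all previous keys.

    import numpy as np

    n, d = 32_000, 64                # context length and head dimension (illustrative)
    K_cache = np.random.randn(n, d)  # keys cached from the previous n tokens

    def scores_for_new_token(q, K):
        # One new query against every cached key: n dot products of length d.
        return (K @ q) / np.sqrt(K.shape[1])

    q = np.random.randn(d)
    s = scores_for_new_token(q, K_cache)  # shape (n,); softmax and values omitted

Generating all n tokens repeats this step n times, which is where the O(n^2) total comes from even with the cache.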

Here is the source I used: https://sebastianraschka.com/blog/2023/self-attention-from-s...

lumost | 2 years ago

For small values of N, the linear terms of the transformer dominate. At the end of the day, a double layer of 764*2048 is still north of 3.1 million FLOPs/token/layer.
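
Rough arithmetic behind that crossover, taking the 764*2048 sizes from the comment above as given and counting one multiply per weight (which reproduces the ~3.1M figure); QKV/output projections and softmax are ignored for simplicity:

    d_model, d_ff = 764, 2048

    # FFN: two matmuls (up and down projection), one multiply per weight.
    ffn_mults_per_token = 2 * d_model * d_ff   # ~3.13M, constant in n

    def attn_score_mults_per_token(n):
        # q . k against n cached keys; across all heads the dot products
        # sum to length d_model, so n * d_model multiplies per new token.
        return n * d_model

    for n in (1_000, 4_096, 32_000, 1_000_000):
        print(n, ffn_mults_per_token, attn_score_mults_per_token(n))

Under this counting, the attention scores only overtake the FFN around n = 2 * d_ff, roughly 4k tokens here, which is the sense in which the linear terms dominate for small N.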