kherud|2 years ago
Why is this the expected result? The original transformer algorithm has n^2 computational complexity, where n is the number of tokens. As far as I know, there are some improvements that bring it down to something like n*log(n), but linear complexity seems surprising. Is the reason that the attention computation can be fully parallelized on decent hardware, so the response time stays linear?
jiggawatts|2 years ago
Disclaimer: I’m not an expert, this is just what I’ve picked up reading about the technology.
smpanaro|2 years ago
If you are generating token-by-token naively, you do pay the n^2 cost, since every token must attend to all prior tokens. Generating a sequence of 5 tokens starting from 2 (infer 3rd, infer 4th, infer 5th...) will be much faster than starting from 1024 (infer 1025th, infer 1026th, ...) since your n is smaller. But each step it is n^2.
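A rough back-of-the-envelope sketch of that point (this is my own toy accounting, not anyone's actual implementation): count how many pairwise attention scores a naive, cache-free decoder computes per generated token.

```python
def naive_decode_cost(prefix_len, new_tokens):
    """Count pairwise attention scores computed when generating
    token-by-token with no caching: each step rebuilds the full
    (n x n) attention matrix, so the step at length n costs n*n."""
    cost = 0
    n = prefix_len
    for _ in range(new_tokens):
        n += 1          # sequence grows by one token
        cost += n * n   # full attention matrix recomputed this step
    return cost

# Starting from 2 tokens vs. 1024, generating 3 tokens each:
short = naive_decode_cost(2, 3)     # 3*3 + 4*4 + 5*5 = 50
long_ = naive_decode_cost(1024, 3)  # 1025^2 + 1026^2 + 1027^2 ≈ 3.16M
```

Same number of generated tokens, roughly 60,000x the work, which is why the long-prefix case feels so much slower.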
However, that is a naive approach. There is a common optimization, KV caching[1] (on by default for HuggingFace models), that caches all the work from prior steps so you only have to compute attention for the new token. So you get something like (infer the 1025th token, cache its key/value, attend the 1 new token over the 1025 cached entries, cache the 1026th, attend 1 new token over 1026, ...). Not quite constant time per step, but linear in n rather than n^2.
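A minimal single-head sketch of the mechanism, assuming toy random projection matrices in place of learned weights (names like `step_with_cache` are mine, not from any library): only the new token's query/key/value are computed each step, while prior keys and values are reused from the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension

# Fixed random projections standing in for learned weight matrices
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step_with_cache(x_new, cache):
    """One decoding step with a KV cache: project only the NEW token,
    append its key/value to the cache, and score the single new query
    against all cached keys -- O(n) work instead of O(n^2)."""
    q = x_new @ Wq                  # query for the new token only
    cache["k"].append(x_new @ Wk)   # key/value computed once, then reused
    cache["v"].append(x_new @ Wv)
    K = np.stack(cache["k"])        # (n, d) cached keys
    V = np.stack(cache["v"])        # (n, d) cached values
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V              # attention output for the new token

cache = {"k": [], "v": []}
for _ in range(5):                  # generate 5 toy "tokens"
    out = step_with_cache(rng.standard_normal(d), cache)
# Exactly one key/value pair was computed per token, ever.
```

Each step still touches all n cached keys, so generation gets slightly slower as the sequence grows, just linearly instead of quadratically.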
I would imagine there are other optimizations too, but this is the one I've heard of.
[1] fairly code-y, but links to some other posts at the start: https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-k...
Tostino|2 years ago