kherud|2 years ago
Why is this the expected result? The original transformer algorithm has n^2 computational complexity, where n is the number of tokens. As far as I know, there are some improvements that bring it down to something like n*log(n), but linear complexity seems surprising. Is the reason that the attention computation can be fully parallelized on decent hardware, so the response time stays linear?
jiggawatts|2 years ago
Disclaimer: I’m not an expert, this is just what I’ve picked up reading about the technology.
smpanaro|2 years ago
If you are generating token-by-token naively, you do pay the n^2 cost, since every token must attend to all prior tokens. Generating a sequence of 5 tokens starting from 2 (infer 3rd, infer 4th, infer 5th...) will be much faster than starting from 1024 (infer 1025th, infer 1026th, ...) since your n is smaller. But each step it is n^2.
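A rough back-of-the-envelope sketch of that point (this is my own toy accounting, not anyone's actual implementation): count how many pairwise attention scores a naive, cache-free decoder computes per generated token.

```python
def naive_decode_cost(prefix_len, new_tokens):
    """Count pairwise attention scores computed when generating
    token-by-token with no caching: each step rebuilds the full
    (n x n) attention matrix, so the step at length n costs n*n."""
    cost = 0
    n = prefix_len
    for _ in range(new_tokens):
        n += 1          # sequence grows by one token
        cost += n * n   # full attention matrix recomputed this step
    return cost

# Starting from 2 tokens vs. 1024, generating 3 tokens each:
short = naive_decode_cost(2, 3)     # 3*3 + 4*4 + 5*5 = 50
long_ = naive_decode_cost(1024, 3)  # 1025^2 + 1026^2 + 1027^2 ≈ 3.16M
```

Same number of generated tokens, roughly 60,000x the work, which is why the long-prefix case feels so much slower.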
However, that is a naive approach. There is a common optimization, KV caching[1] (on by default for HuggingFace models), that caches all the work from prior steps so you only have to compute attention for the new token. So you get something like (infer the 1025th token, cache its key/value, attend the 1 new token over the 1025 cached entries, cache the 1026th, attend 1 new token over 1026, ...). Not quite constant time per step, but linear in n rather than n^2.
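A minimal single-head sketch of the mechanism, assuming toy random projection matrices in place of learned weights (names like `step_with_cache` are mine, not from any library): only the new token's query/key/value are computed each step, while prior keys and values are reused from the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension

# Fixed random projections standing in for learned weight matrices
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step_with_cache(x_new, cache):
    """One decoding step with a KV cache: project only the NEW token,
    append its key/value to the cache, and score the single new query
    against all cached keys -- O(n) work instead of O(n^2)."""
    q = x_new @ Wq                  # query for the new token only
    cache["k"].append(x_new @ Wk)   # key/value computed once, then reused
    cache["v"].append(x_new @ Wv)
    K = np.stack(cache["k"])        # (n, d) cached keys
    V = np.stack(cache["v"])        # (n, d) cached values
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V              # attention output for the new token

cache = {"k": [], "v": []}
for _ in range(5):                  # generate 5 toy "tokens"
    out = step_with_cache(rng.standard_normal(d), cache)
# Exactly one key/value pair was computed per token, ever.
```

Each step still touches all n cached keys, so generation gets slightly slower as the sequence grows, just linearly instead of quadratically.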
I would imagine there are other optimizations too, but this is the one I've heard of.
[1] fairly code-y, but links to some other posts at the start: https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-k...
Tostino|2 years ago