Fazebooking | 1 month ago
Context is limited.
You do not want the cloud provider running context compaction if you can control it much better yourself.
There are even tips on where to place the question, e.g. "send the content first, then ask the question" vs. "ask the question first, then send the content".
bluegatty | 1 month ago
So if the conversation is already A + A1 + B + B1 + C + C1 and you ask 'D'... well, [A->C1] is saved as state. It costs 10ms to prepare. Then they append 'D' as your question, and that is processed all tokens at once, in bulk, which is fast.
Then, when they generate D1 (the response), they have to do it one token at a time, which is slow. Each token has to be processed separately.
Also, even if they had to redo all of [A->C1] from scratch, it's not that slow, because the entire block of tokens can be processed in one pass.
'Prefill' (i.e. processing A->C1) is fast, which by the way is why it's 10x cheaper.
So prefill is 10x faster than generation, and cache is 10x cheaper than prefill, as a very general rule of thumb.
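The cost model above can be sketched as a toy calculation. All numbers here are illustrative assumptions (the per-token costs, the 10ms restore cost, the token counts), not measurements of any real provider; only the ratio follows the rule of thumb in the comment, with prefill roughly 10x faster per token than generation.

```python
# Toy latency model for the prefill/decode split described above.
PREFILL_MS_PER_TOKEN = 0.1   # whole prompt processed in one parallel pass -> cheap per token
DECODE_MS_PER_TOKEN = 1.0    # response generated one token at a time -> ~10x slower
CACHE_RESTORE_MS = 10.0      # flat cost to restore the saved [A->C1] state

def turn_latency_ms(history_tokens: int, question_tokens: int,
                    answer_tokens: int, cache_hit: bool) -> float:
    """Rough latency for one turn: process history + new question, then generate the answer."""
    if cache_hit:
        # [A->C1] is restored as saved state; only the new question 'D' is prefilled.
        prefill = CACHE_RESTORE_MS + question_tokens * PREFILL_MS_PER_TOKEN
    else:
        # Redo the whole conversation from scratch -- still a single bulk pass.
        prefill = (history_tokens + question_tokens) * PREFILL_MS_PER_TOKEN
    decode = answer_tokens * DECODE_MS_PER_TOKEN  # D1, token by token
    return prefill + decode

print(turn_latency_ms(6000, 50, 500, cache_hit=True))   # -> 515.0
print(turn_latency_ms(6000, 50, 500, cache_hit=False))  # -> 1105.0
```

Note that even the cache miss is dominated by decode only when the history is short; with a long history, the bulk prefill pass is still far cheaper than generating the same number of tokens one at a time would be.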
Fazebooking | 1 month ago