Fazebooking | 1 month ago
Context is limited.
You do not want the cloud provider running context compaction if you can control it much better yourself.
There are even tips on where to place the question, e.g. "send the content first, then ask the question" vs. "ask the question first, then send the content".
bluegatty | 1 month ago
So if the conversation is already A + A1 + B + B1 + C + C1 and you ask 'D'... well, [A->C1] is saved as state. It costs 10ms to prepare. Then they append 'D' as your question, and that is processed all tokens at once, in bulk, which is fast.
Then, when they generate D1 (the response), they have to do it one token at a time, which is slow. Each token has to be processed separately.
Also, even if they had to redo all of [A->C1] from scratch, it's not that slow, because the entire block of tokens can be processed in one pass.
'Prefill' (i.e. processing A->C1) is fast, which by the way is why it's 10x cheaper.
So prefill is 10x faster than generation, and cache is 10x cheaper than prefill, as a very general rule of thumb.
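The cost model above can be sketched as a toy calculation. All numbers here are illustrative assumptions (the per-token costs, the 10ms restore cost, the token counts), not measurements of any real provider; only the ratio follows the rule of thumb in the comment, with prefill roughly 10x faster per token than generation.

```python
# Toy latency model for the prefill/decode split described above.
PREFILL_MS_PER_TOKEN = 0.1   # whole prompt processed in one parallel pass -> cheap per token
DECODE_MS_PER_TOKEN = 1.0    # response generated one token at a time -> ~10x slower
CACHE_RESTORE_MS = 10.0      # flat cost to restore the saved [A->C1] state

def turn_latency_ms(history_tokens: int, question_tokens: int,
                    answer_tokens: int, cache_hit: bool) -> float:
    """Rough latency for one turn: process history + new question, then generate the answer."""
    if cache_hit:
        # [A->C1] is restored as saved state; only the new question 'D' is prefilled.
        prefill = CACHE_RESTORE_MS + question_tokens * PREFILL_MS_PER_TOKEN
    else:
        # Redo the whole conversation from scratch -- still a single bulk pass.
        prefill = (history_tokens + question_tokens) * PREFILL_MS_PER_TOKEN
    decode = answer_tokens * DECODE_MS_PER_TOKEN  # D1, token by token
    return prefill + decode

print(turn_latency_ms(6000, 50, 500, cache_hit=True))   # -> 515.0
print(turn_latency_ms(6000, 50, 500, cache_hit=False))  # -> 1105.0
```

Note that even the cache miss is dominated by decode only when the history is short; with a long history, the bulk prefill pass is still far cheaper than generating the same number of tokens one at a time would be.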
Fazebooking | 1 month ago