top | item 39558205


meandmycode | 2 years ago

Also, while the cost of tokens is lower, let's say it's cheap enough not to care. But surely reading 1M tokens isn't realistic latency-wise?



hansonw | 2 years ago

If sub-quadratic architectures (e.g. Mamba) become a thing, it will become feasible to precompute most of the work for a fixed prefix (i.e. the system prompt), and the latency can be pretty minimal. Even with current transformers, if you have a fixed system prompt, you can save the KV cache and it helps a lot (though the inference time of each incremental token still grows linearly with context length).
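A toy sketch of the KV-cache idea above, using a single attention head in NumPy (all names and shapes here are illustrative, not any real model's API): the prefix's key/value rows are computed once and reused, so each new token only adds one row and one attention pass, and the result matches a full recompute.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # head dimension (toy size)
prefix_len = 5  # tokens in the fixed "system prompt"

# random projection matrices standing in for learned Q/K/V weights
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = (K @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Precompute and cache K/V for the fixed prefix ONCE.
prefix = rng.standard_normal((prefix_len, d))
K_cache = prefix @ Wk
V_cache = prefix @ Wv

# Incremental decoding: the new token only computes its own K/V row
# and attends over the cached prefix plus itself.
new_tok = rng.standard_normal(d)
K = np.vstack([K_cache, (new_tok @ Wk)[None, :]])
V = np.vstack([V_cache, (new_tok @ Wv)[None, :]])
out_cached = attend(new_tok @ Wq, K, V)

# Recomputing everything from scratch gives the same output,
# just with prefix_len extra rows of wasted work per step.
full = np.vstack([prefix, new_tok[None, :]])
out_full = attend(new_tok @ Wq, full @ Wk, full @ Wv)
assert np.allclose(out_cached, out_full)
```

This is why a saved cache helps so much for a shared prompt: the per-step cost drops to one new K/V row plus one attention pass, though that pass still scans every cached row, which is the linear-per-token cost mentioned above.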