That’s what this is: it’s caching the model’s state (the attention KV cache) after the prompt tokens have been processed, so the same prefix doesn’t have to be recomputed on the next request. Reduces latency and cost dramatically. The cache usually has a ~5 minute TTL.
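To make that concrete, here’s a toy sketch of the mechanism in Python (not any provider’s actual implementation; compute_state is a made-up stand-in for the expensive forward pass over the prompt):

    import hashlib
    import time

    # Toy prompt cache: the expensive part of inference is running the
    # prompt tokens through the model to build the attention key/value
    # state. If the same prefix arrives again within the TTL, reuse that
    # state instead of recomputing it.
    TTL_SECONDS = 5 * 60  # the ~5 minute expiry mentioned above
    _cache = {}  # prefix hash -> (timestamp, kv_state)

    def _key(prompt_prefix):
        return hashlib.sha256(prompt_prefix.encode()).hexdigest()

    def get_kv_state(prompt_prefix, compute_state):
        """Return cached KV state for this prefix, recomputing on miss or expiry."""
        now = time.monotonic()
        entry = _cache.get(_key(prompt_prefix))
        if entry is not None and now - entry[0] < TTL_SECONDS:
            return entry[1]  # cache hit: skip re-processing the prefix
        state = compute_state(prompt_prefix)  # expensive forward pass
        _cache[_key(prompt_prefix)] = (now, state)
        return state

Note the tokens themselves aren’t discarded: the cached state is just the result of having already attended to them.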
Interesting! I’m wondering: does caching the model state mean the tokens are no longer directly visible to the model? I.e., if you asked it to print out the input tokens verbatim (assuming no security layer blocks this, and it has no tool available to pull in the input tokens), could it do it?
It depends on what front end you use. In text-generation-webui, for example, Prompt Caching is simply a checkbox under the Model tab that you can select before you click "load model".
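If you drive llama.cpp from Python instead of the webui, llama-cpp-python exposes the same idea programmatically. A minimal sketch, assuming its LlamaRAMCache API and a placeholder model path:

    from llama_cpp import Llama, LlamaRAMCache

    # Load the model, then attach an in-RAM prompt cache so repeated
    # prompt prefixes skip re-evaluation. The model path is a placeholder.
    llm = Llama(model_path="./models/model.gguf")
    llm.set_cache(LlamaRAMCache(capacity_bytes=2 << 30))  # ~2 GiB cache

    # Two calls sharing a long prefix: the second reuses the cached KV state.
    prefix = "You are a helpful assistant. " * 50
    print(llm(prefix + "Say hi.", max_tokens=16)["choices"][0]["text"])
    print(llm(prefix + "Say bye.", max_tokens=16)["choices"][0]["text"])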