That’s what this is: it’s caching the model’s state (the attention KV cache) after the prompt tokens have been processed, so the same prefix doesn’t have to be recomputed on the next request. Reduces latency and cost dramatically. The cache usually has a ~5 minute TTL.
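To make that concrete, here’s a toy sketch of the mechanism in Python (not any provider’s actual implementation; compute_state is a made-up stand-in for the expensive forward pass over the prompt):

    import hashlib
    import time

    # Toy prompt cache: the expensive part of inference is running the
    # prompt tokens through the model to build the attention key/value
    # state. If the same prefix arrives again within the TTL, reuse that
    # state instead of recomputing it.
    TTL_SECONDS = 5 * 60  # the ~5 minute expiry mentioned above
    _cache = {}  # prefix hash -> (timestamp, kv_state)

    def _key(prompt_prefix):
        return hashlib.sha256(prompt_prefix.encode()).hexdigest()

    def get_kv_state(prompt_prefix, compute_state):
        """Return cached KV state for this prefix, recomputing on miss or expiry."""
        now = time.monotonic()
        entry = _cache.get(_key(prompt_prefix))
        if entry is not None and now - entry[0] < TTL_SECONDS:
            return entry[1]  # cache hit: skip re-processing the prefix
        state = compute_state(prompt_prefix)  # expensive forward pass
        _cache[_key(prompt_prefix)] = (now, state)
        return state

Note the tokens themselves aren’t discarded: the cached state is just the result of having already attended to them.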
Interesting! I’m wondering: does caching the model state mean the tokens are no longer directly visible to the model? I.e., if you asked it to print out the input tokens verbatim (assuming no security layer blocks this, and it has no tool available to pull in the input tokens), could it do it?
It depends on what front end you use. In text-generation-webui, for example, Prompt Caching is simply a checkbox under the Model tab that you can select before you click "load model".
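If you drive llama.cpp from Python instead of the webui, llama-cpp-python exposes the same idea programmatically. A minimal sketch, assuming its LlamaRAMCache API and a placeholder model path:

    from llama_cpp import Llama, LlamaRAMCache

    # Load the model, then attach an in-RAM prompt cache so repeated
    # prompt prefixes skip re-evaluation. The model path is a placeholder.
    llm = Llama(model_path="./models/model.gguf")
    llm.set_cache(LlamaRAMCache(capacity_bytes=2 << 30))  # ~2 GiB cache

    # Two calls sharing a long prefix: the second reuses the cached KV state.
    prefix = "You are a helpful assistant. " * 50
    print(llm(prefix + "Say hi.", max_tokens=16)["choices"][0]["text"])
    print(llm(prefix + "Say bye.", max_tokens=16)["choices"][0]["text"])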