dipampaul17 | 9 months ago
This works because the KV cache is created during inference as tokens are processed, entirely separate from the model weights themselves. The --kvq-key and --kvq-val flags simply tell llama.cpp which format to use when storing these intermediate tensors in memory.
I've tested it successfully with:
- Llama-3 models
- Mistral models
- Phi-2 / Phi-3
- TinyLlama
- Qwen variants
The main limitations are that it requires llama.cpp's Metal backend and that you need to disable Flash Attention with -fa 0, since the current FA implementation in llama.cpp bypasses the custom KV cache format. The technique itself should work with any transformer architecture that uses a standard attention mechanism.
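For anyone who wants to try it, an invocation looks roughly like the sketch below. The binary name, model path, context size, and the numeric precision values are placeholders on my part; only the flag names (--kvq-key, --kvq-val) and -fa 0 are taken from the description above.

```sh
# Rough sketch of running the patched llama.cpp build with a separately
# quantized KV cache. Paths and values are placeholders, not a prescription.
# -fa 0 disables Flash Attention so the custom KV cache format is honored.
./llama-cli \
  -m ./models/llama-3-8b-instruct.Q4_K_M.gguf \
  -p "Explain how the KV cache grows with context length." \
  -c 4096 \
  -fa 0 \
  --kvq-key 8 \
  --kvq-val 4
```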
fennecbutt | 9 months ago