dipampaul17 | 9 months ago
This works because the KV cache is created during inference as tokens are processed, entirely separate from the model weights themselves. The --kvq-key and --kvq-val flags simply tell llama.cpp which format to use when storing these intermediate tensors in memory.
I've tested it successfully with:
- Llama-3 models
- Mistral models
- Phi-2 / Phi-3
- TinyLlama
- Qwen variants
The main limitations are that it requires llama.cpp's Metal backend and that you need to disable Flash Attention with -fa 0, since the current FA implementation in llama.cpp bypasses the custom KV cache format. The technique itself should work with any transformer architecture that uses a standard attention mechanism.
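For anyone who wants to try it, an invocation looks roughly like the sketch below. The binary name, model path, context size, and the numeric precision values are placeholders on my part; only the flag names (--kvq-key, --kvq-val) and -fa 0 are taken from the description above.

```sh
# Rough sketch of running the patched llama.cpp build with a separately
# quantized KV cache. Paths and values are placeholders, not a prescription.
# -fa 0 disables Flash Attention so the custom KV cache format is honored.
./llama-cli \
  -m ./models/llama-3-8b-instruct.Q4_K_M.gguf \
  -p "Explain how the KV cache grows with context length." \
  -c 4096 \
  -fa 0 \
  --kvq-key 8 \
  --kvq-val 4
```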
fennecbutt | 9 months ago