(no title)
jmorgan | 1 year ago
1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is to use KV cache quantization, which has recently been added behind a flag [3] and will 1/4 the memory footprint if 4-bit KV cache quantization is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve)
[1] https://arxiv.org/pdf/2309.06180
[2] https://github.com/microsoft/vattention
[3] https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
No comments yet.