top | item 36411268

(no title)

This approach to managing the KV cache can work with 4-bit models. Imagine the speedup of PagedAttention combined with quantization.


zhisbug|2 years ago

Yep, it is agonistic to 4-bit. You can deploy a 4-bit model and still use vLLM + PagedAttention to double or even triple your serving throughput.
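A rough sketch of why the two are independent: the memory the KV cache allocator manages depends on the sequence lengths and the KV element size, not on the precision of the model weights. The model shape, block size, and helper names below are hypothetical, and the sketch assumes the KV cache stays in fp16 even when the weights are 4-bit, as is common for weight-only quantization. It is not vLLM's actual implementation.

```python
import math

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # One key vector and one value vector per layer per KV head.
    # bytes_per_elem=2 assumes an fp16 KV cache (an assumption, see above).
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def contiguous_reserved_tokens(seq_lens, max_len):
    # Naive scheme: reserve max_len token slots per sequence up front.
    return len(seq_lens) * max_len

def paged_reserved_tokens(seq_lens, block_size=16):
    # Paged scheme: allocate fixed-size blocks on demand, so only the
    # last block of each sequence can be partially empty.
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
per_token = kv_bytes_per_token(32, 32, 128)
seq_lens = [100, 300, 50]  # current lengths of three in-flight requests
naive = contiguous_reserved_tokens(seq_lens, max_len=2048)
paged = paged_reserved_tokens(seq_lens, block_size=16)

print(f"KV bytes per token: {per_token}")
print(f"token slots reserved, contiguous: {naive}")
print(f"token slots reserved, paged:      {paged}")
```

Weight precision does not appear anywhere in these functions. Quantizing the weights frees GPU memory for more KV blocks, while paging reduces the waste inside the KV cache itself, so the two savings can stack.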

ynniv|2 years ago

If this were submitted as a new comment it would be at the top of the page.

You mean like, theoretically, in the future? Or you mean today?

ipsum2|2 years ago

You probably meant "agnostic"; "agonistic" implies the opposite.