top | item 36411257

bioemerl | 2 years ago

I'm spoiled by 4-bit, and unfortunately it doesn't appear to be supported here, so this isn't of much use to me, but it's awesome to see people working on the inference speed side of things regardless.

george_123|2 years ago

This approach to managing the KV cache can work with 4-bit. Imagine the speedup of PagedAttention combined with quantization.

zhisbug|2 years ago

Yep, it is agnostic to 4-bit. You can deploy a 4-bit model and still use vLLM + PagedAttention to double or even triple your serving throughput.
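
To illustrate why the two are independent, here is a rough toy sketch of paged KV-cache bookkeeping. It is not vLLM's actual code: the class name `PagedKVCache`, the method names, and the block size of 16 are all assumptions made up for this example. The point is only that the allocator tracks which fixed-size blocks hold each sequence's cached tokens and never looks at how the model weights are stored, so 4-bit weights would not change it.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM default claim)


class PagedKVCache:
    """Toy block-table allocator for a paged KV cache (hypothetical sketch)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens cached so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:
            # The last block is full (or the sequence is new): grab a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(17):  # 17 tokens need two 16-token blocks
    cache.append_token("a")
print(len(cache.block_tables["a"]), len(cache.free_blocks))
```

Because blocks are handed out on demand rather than reserved up front for the maximum sequence length, memory is only wasted in the last partially filled block of each sequence, which is where the throughput gain in the sketch would come from.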