bioemerl | 2 years ago
I'm spoiled by 4-bit, and unfortunately it doesn't appear to be supported here, so this isn't of much use to me. But it's awesome to see people working on the inference-speed side of things regardless.

george_123 | 2 years ago
This approach to managing the KV cache can work with 4-bit. Imagine the speedup of PagedAttention combined with quantization.

zhisbug | 2 years ago
Yep, it is agnostic to 4-bit. You can deploy a 4-bit model and still use vLLM + PagedAttention to double or even triple your serving throughput.
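The reason the two techniques compose is that PagedAttention changes how KV-cache memory is *allocated* (in fixed-size blocks, mapped per sequence), while 4-bit quantization changes how the weights are *stored*. Below is a minimal toy sketch of block-based KV-cache allocation in that spirit; it is an illustration of the idea only, not vLLM's actual implementation, and all names (`PagedKVCache`, `append_token`, `block_size`) are invented for this example.

```python
class PagedKVCache:
    """Toy sketch of paged KV-cache bookkeeping (the idea behind
    PagedAttention): sequences get physical cache blocks on demand
    instead of one large contiguous pre-reserved region."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Record one more token's KV entry for a sequence, allocating
        a new physical block only when the last block is full."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Usage: with block_size=2, a 3-token sequence needs 2 physical blocks,
# and freeing it returns both blocks to the pool.
cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("seq-a")
print(len(cache.block_tables["seq-a"]))  # 2
cache.free("seq-a")
print(len(cache.free_blocks))            # 4
```

Nothing in this bookkeeping touches the weight format, which is why a 4-bit quantized model can sit underneath it unchanged.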