
abcdabcd987 | 2 years ago

Thanks for your encouragement! We are working on quantization as well. We recently submitted a paper, Atom [1], that uses 4-bit quantization, delivering 7.73× the throughput of FP16 and 2.53× that of INT8. Atom maintains a perplexity (i.e., model accuracy) close to FP16, outperforming existing quantization approaches.
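For readers unfamiliar with what 4-bit quantization means numerically, here is a minimal NumPy sketch of symmetric per-tensor INT4 quantization. This illustrates only the generic idea (map floats to 16 integer levels via a scale factor); it is not Atom's actual scheme, which is considerably more involved.

```python
import numpy as np

def quantize_int4(x):
    # Symmetric quantization: map floats to the signed 4-bit range [-8, 7]
    # using a single per-tensor scale derived from the max magnitude.
    scale = np.max(np.abs(x)) / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # Recover approximate float values from the 4-bit integers.
    return q.astype(np.float32) * scale

x = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, s = quantize_int4(x)
x_hat = dequantize_int4(q, s)
```

The round-trip error is bounded by half the scale per element, which is why keeping scales fine-grained (per group or per channel, rather than per tensor as above) is key to preserving accuracy at 4 bits.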

We are polishing the 4-bit code. It will be added to the Punica codebase soon. Please stay tuned :)

[1] https://arxiv.org/abs/2310.19102


Palmik | 2 years ago

Added to my reading list! The world of quantization is moving so fast that even TheBloke might not be able to keep up!

So Atom base models would be compatible with Punica?

I also wonder: many people already train LoRAs with the base model in 8-bit or even 4-bit precision. Would it make sense to match the quantization algorithm used during training with the one used at inference?