
KVFinn | 3 years ago

Very cool. I've seen some people running 4-bit 65B on dual 3090s, but didn't notice a benchmark yet to compare.

It looks like this uses regular 4-bit quantization rather than GPTQ 4-bit? If so, there may be some output-quality loss, but we'll have to test.

>4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.

https://github.com/ggerganov/llama.cpp/issues/9
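For context, "regular" 4-bit usually means round-to-nearest (RTN) quantization: each group of weights is scaled by its absolute max and rounded to a 4-bit integer, with no calibration data involved. This is a minimal illustrative sketch of that baseline (function names and the group size are my own, not from llama.cpp or the GPTQ code); GPTQ instead solves a per-layer reconstruction problem to pick better rounded values.

```python
# Sketch of round-to-nearest (RTN) 4-bit quantization with per-group
# absmax scaling -- the baseline that GPTQ improves on. Illustrative only.
import numpy as np

def quantize_rtn_4bit(w, group_size=32):
    """Quantize a flat fp32 weight vector to 4-bit ints plus fp scales."""
    groups = w.reshape(-1, group_size)
    # One fp scale per group so the largest weight maps to +/-7.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Reconstruct approximate fp32 weights from ints and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale = quantize_rtn_4bit(w)
w_hat = dequantize_4bit(q, scale)
# The rounding error per weight is bounded by half a quantization step
# (scale / 2); GPTQ's gains come from distributing this error more
# carefully across weights, not from a finer grid.
mean_err = np.abs(w - w_hat).mean()
```

The quality gap the quote describes comes entirely from how the rounding decisions are made, since both methods store the same 4-bit integers plus scales.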
