The more recent 4-bit quantizations are roughly along these lines. Q4_1 in ggml, for example, takes a block of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight recovered from a quantized weight 'q' is q * d + m. Using a relatively small block size makes it more likely that all the weights in a block fall within a reasonable quantization range. Notably, d and m can be stored at higher precision without sacrificing too much space, since their overhead is amortized over 32 weights.

Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to those, so the final weight is (q * d + m) * d_s + m_s, and the per-block factors are stored as 6 bits instead of 4. In practice this seems to get very good results while being cheap to implement and relatively space efficient; Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
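To make the q * d + m scheme concrete, here is a minimal NumPy sketch of Q4_1-style block quantization. This is an illustration of the arithmetic only, not ggml's actual bit packing or memory layout; the function names and the choice of fp32 for d and m are assumptions for the example.

```python
import numpy as np

def quantize_q4_1(weights):
    # Q4_1-style sketch: 32 weights per block, 4-bit codes (16 levels),
    # per-block scale 'd' and minimum 'm' so that w ~= q * d + m.
    blocks = weights.reshape(-1, 32)
    m = blocks.min(axis=1, keepdims=True)                # per-block minimum -> quantized '0'
    d = (blocks.max(axis=1, keepdims=True) - m) / 15.0   # spread the range over 16 levels
    d = np.where(d == 0, 1.0, d)                         # guard against constant blocks
    q = np.clip(np.round((blocks - m) / d), 0, 15).astype(np.uint8)
    return q, d, m

def dequantize_q4_1(q, d, m):
    # Reconstruct: each weight is its 4-bit code times the block scale, plus the block minimum.
    return q * d + m

w = np.random.randn(64).astype(np.float32)
q, d, m = quantize_q4_1(w)
w_hat = dequantize_q4_1(q, d, m).reshape(-1)
# Rounding error per weight is at most d/2 for that weight's block.
```

Q4_K's superblock step works the same way one level up: the 8 per-block (d, m) pairs would themselves be quantized to 6 bits against a superblock scale and minimum, which is where the (q * d + m) * d_s + m_s form comes from.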
vmirnv|2 years ago
It has the best balance between quality and model size, and is almost indistinguishable from the original f16: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...
LoganDark|2 years ago