top | item 44659007

nhecker | 7 months ago

> dynamically assign precision to the layers that need them

Well now I'm curious; how is a layer judged on its relative need for precision? I guess I still have a lot of learning to do w.r.t. how quantization is done. I was under the impression it was done once, statically, and produced a new giant GGUF blob or whatever format your weights are in. Does that assumption still hold true for the approach you're describing?

irthomasthomas | 7 months ago

Last I checked, they run some sort of evals before and after quantisation and measure the effect. E.g. ExLlama-v2 measures the loss while reciting Wikipedia articles.
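A toy sketch of that idea (not ExLlama-v2's actual algorithm, and the function names are made up): quantise each layer at several candidate bit-widths, measure the reconstruction error, and keep the cheapest width whose error stays under a per-layer budget. Real pipelines score perplexity/loss on calibration text rather than raw weight error.

```python
import math
import random

def quantize(weights, bits):
    """Uniform symmetric quantisation to 2**(bits-1)-1 levels per side, then dequantise."""
    scale = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    return [round(w / scale) * scale for w in weights]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def assign_precision(layers, budget, candidate_bits=(3, 4, 5, 6, 8)):
    """Pick the lowest candidate bit-width per layer whose RMS error fits the budget."""
    plan = {}
    for name, weights in layers.items():
        for bits in candidate_bits:
            if rms_error(weights, quantize(weights, bits)) <= budget:
                plan[name] = bits
                break
        else:
            plan[name] = 16  # nothing fit: fall back to half precision
    return plan

random.seed(0)
layers = {
    "token_embd": [random.gauss(0, 1.0) for _ in range(256)],   # wide value range: needs more bits
    "blk.0.ffn":  [random.gauss(0, 0.05) for _ in range(256)],  # narrow range: quantises cheaply
}
plan = assign_precision(layers, budget=0.02)
print(plan)
```

The same loop generalises to any error metric: swap `rms_error` for a calibration-set loss delta and you get the eval-before/eval-after scheme described above.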

smcleod | 7 months ago

Within a GGUF file (and some other formats) you'll see that each layer gets its own quantisation; for example, embedding layers are usually more sensitive to quantisation and as such are often kept at Q8 or FP16. If you run gguf-dump, or click the GGUF icon on a model page on Hugging Face, you'll see this per-layer breakdown.
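To make the "each tensor carries its own type" point concrete, here's a hedged, stdlib-only sketch that builds and parses a minimal GGUF-style tensor-info section. The field layout follows the GGUF spec (little-endian), but the metadata KV section, alignment, and the actual weight data are omitted, so treat it as illustrative rather than a real GGUF reader (in practice you'd just use gguf-dump or the `gguf` Python package).

```python
import struct

# Subset of ggml type IDs (F16=1, Q8_0=8, Q4_K=12 match the ggml enum).
GGML_TYPES = {0: "F32", 1: "F16", 2: "Q4_0", 8: "Q8_0", 12: "Q4_K", 14: "Q6_K"}

def tensor_info(name: str, dims, ggml_type: int, offset: int) -> bytes:
    """Serialise one tensor-info record: name, shape, ggml type, data offset."""
    raw = name.encode()
    out = struct.pack("<Q", len(raw)) + raw               # length-prefixed name
    out += struct.pack("<I", len(dims))                   # number of dimensions
    out += b"".join(struct.pack("<Q", d) for d in dims)   # each dimension
    out += struct.pack("<IQ", ggml_type, offset)          # per-tensor type + data offset
    return out

# Toy file image: magic, version, tensor count, KV count (0), then the records.
# Note the embedding tensor is stored at Q8_0 while a bulk layer sits at Q4_K.
blob = b"GGUF" + struct.pack("<IQQ", 3, 2, 0)
blob += tensor_info("token_embd.weight", (4096, 32000), 8, 0)       # Q8_0
blob += tensor_info("blk.0.attn_q.weight", (4096, 4096), 12, 1024)  # Q4_K

def list_tensor_types(data: bytes):
    """Walk the tensor-info records and report (name, quantisation type) pairs."""
    magic = data[:4]
    assert magic == b"GGUF"
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    pos, out = 24, []
    for _ in range(n_tensors):
        (name_len,) = struct.unpack_from("<Q", data, pos); pos += 8
        name = data[pos:pos + name_len].decode(); pos += name_len
        (n_dims,) = struct.unpack_from("<I", data, pos); pos += 4
        pos += 8 * n_dims                                  # skip the shape
        (ttype,) = struct.unpack_from("<I", data, pos); pos += 12  # type, skip offset
        out.append((name, GGML_TYPES.get(ttype, f"type {ttype}")))
    return out

for name, qtype in list_tensor_types(blob):
    print(f"{name}: {qtype}")
```

This is the same information gguf-dump surfaces: a per-tensor type column, which is how you can see embeddings held at Q8_0/F16 while most block weights sit in a lower-bit K-quant.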