Can any kind soul explain the difference between GGUF, GGML, and all the other model packaging formats I am seeing these days? I was used to .pth and the format TF uses. Is this all to support inference or quantization? Who manages these formats, or are they evolving organically?
austinvhuang|2 years ago
My personal way of understanding it is this - the original sin of model weight format complexity is that NNs are both data and computation.
Representing the computation as data is the hard part, and that's where the simplicity falls apart. Do you embed the compute graph? If so, what do you do about different frameworks supporting overlapping but distinct sets of operations? Do you need the artifact to make training reproducible? Well, that's an even more complex computation that you have to serialize as data. And so on...
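To make the "overlapping but distinct operations" problem concrete, here is a minimal sketch (the op names and the record layout are hypothetical, not any real framework's): once a compute graph is serialized as data, a loader for a different framework can hit ops it simply doesn't implement.

```python
import json

# Hypothetical op sets for two frameworks: they overlap but are not identical.
FRAMEWORK_A_OPS = {"matmul", "add", "relu", "softmax", "gelu"}
FRAMEWORK_B_OPS = {"matmul", "add", "relu", "softmax", "layer_norm"}

# A compute graph serialized as data: a list of {output, op, inputs} records.
graph_json = json.dumps([
    {"out": "h", "op": "matmul", "in": ["x", "W"]},
    {"out": "a", "op": "gelu",   "in": ["h"]},  # supported by A, not by B
])

def unsupported_ops(serialized_graph, supported):
    """Return the ops in the serialized graph that the target framework lacks."""
    return sorted({node["op"] for node in json.loads(serialized_graph)} - supported)

print(unsupported_ops(graph_json, FRAMEWORK_A_OPS))  # []
print(unsupported_ops(graph_json, FRAMEWORK_B_OPS))  # ['gelu']
```

A weights-only format (safetensors, GGUF) sidesteps this by leaving the computation in code and shipping only the tensors, at the cost of every runtime reimplementing the architecture.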
moffkalast|2 years ago
GGML and GGUF are the same lineage; GGUF is the newer version that adds more metadata about the model, making it easy to support multiple architectures, and it also includes prompt templates. These can run CPU-only or be partially or fully offloaded to a GPU. With K-quants you can get anywhere from a 2-bit to an 8-bit GGUF.
GPTQ was the GPU-only optimized quantization method; it was superseded by AWQ, which is roughly 2x faster, and now by EXL2, which is better still. These are usually 4-bit only.
Safetensors and PyTorch .bin files hold raw float16 weights; these are only really used for continued fine-tuning.
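For a sense of what the extra metadata in GGUF looks like at the byte level, here is a minimal sketch of parsing just the fixed-size GGUF preamble (magic, version, tensor count, metadata key/value count, all little-endian per the GGUF spec); the demo bytes are synthetic, not from a real model file:

```python
import struct

def parse_gguf_header(buf: bytes):
    """Parse the fixed-size GGUF preamble: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header for demonstration (version 3, 201 tensors, 25 metadata keys):
demo = struct.pack("<4sIQQ", b"GGUF", 3, 201, 25)
print(parse_gguf_header(demo))
```

The typed key/value metadata section that follows this preamble is where things like the architecture name and the chat/prompt template live, which is what lets one loader handle many model families.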
Gracana|2 years ago
That sounds very convenient. What software makes use of the built-in prompt template?
liuliu|2 years ago
GGUF is just weights; safetensors is the same thing. GGUF doesn't need a JSON decoder to parse the format, while safetensors does.
I personally think requiring a JSON decoder is not a big deal, and it makes the format more amenable to change, given that GGUF evolves too.
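The JSON dependency being discussed is visible in the safetensors layout itself: an 8-byte little-endian length prefix followed by that many bytes of UTF-8 JSON describing each tensor's dtype, shape, and byte offsets. A minimal sketch, using a synthetic single-tensor buffer rather than a real checkpoint:

```python
import json
import struct

def read_safetensors_header(buf: bytes):
    """Read a safetensors header: an 8-byte little-endian length prefix,
    then that many bytes of UTF-8 JSON (hence the JSON decoder)."""
    (n,) = struct.unpack_from("<Q", buf, 0)
    return json.loads(buf[8 : 8 + n].decode("utf-8"))

# Synthetic single-tensor file for demonstration: a 4x8 float16 tensor
# occupies 4 * 8 * 2 = 64 bytes after the header.
header = json.dumps({
    "model.embed.weight": {"dtype": "F16", "shape": [4, 8],
                           "data_offsets": [0, 64]},
}).encode("utf-8")
demo = struct.pack("<Q", len(header)) + header + b"\x00" * 64

print(read_safetensors_header(demo)["model.embed.weight"]["shape"])  # [4, 8]
```

Since the header is plain JSON, unknown keys can be added without breaking old readers, which is the extensibility trade-off being weighed against GGUF's decoder-free binary layout.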