One thing to consider is that this version is a new architecture, so it'll take time for llama.cpp to get updated, similar to how it was with Qwen Next.
There are a bunch of 4-bit quants at the GGUF link, and 0xSero has some smaller ones too. They might still be too big, though, and you'll need to un-GPU-poor yourself.
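As a rough back-of-the-envelope for whether a quant fits your VRAM, you can estimate file size from parameter count and bits per weight. This is a sketch with hypothetical numbers (the 30B parameter count and the ~10% overhead factor are assumptions, not figures for this model); the bits-per-weight values are typical for common GGUF quant types.

```python
def quant_size_gb(n_params: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough size estimate for a quantized model: params * bits / 8,
    plus ~10% headroom (a guess) for embeddings, scales, and metadata."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# Hypothetical 30B-parameter model at common GGUF quant levels:
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name}: ~{quant_size_gb(30e9, bpw):.1f} GB")
```

Compare the estimate against your VRAM minus a few GB for KV cache and context; anything over that means CPU offload or a smaller quant.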
a_e_k|1 month ago
https://huggingface.co/models?other=base_model:quantized:zai...
Probably as:
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
homarp|1 month ago
issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931