For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.
I've been running it with llama-server from llama.cpp (compiled for CUDA backend, but there are also prebuilt binaries and instructions for other backends in the README) using the Q4_K_M quant from ngxson on Lubuntu with an RTX 3090:
Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.
I think the recently introduced -fit option which is on by default means it's no longer necesary to -ngl, can also probably drop -c which is "0" by default and reads metadata from the gguf to get the model's advertised context size
It's available (with tool parsing, etc.): https://ollama.com/library/glm-4.7-flash but requires 0.14.3 which is in pre-release (and available on Ollama's GitHub repo)
johndough|1 month ago
https://github.com/ggml-org/llama.cpp/releases
https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...
You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completionsSeems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.
mistercheph|1 month ago
ljouhet|1 month ago
jmorgan|1 month ago
zackify|1 month ago
pixelmelt|1 month ago