Great, I've been experimenting with OpenCode, running local 30B-A3B models on llama.cpp (4-bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-Coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better, but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which aren't covered yet". Will give this a try once someone has quantized it in ~4-bit GGUF.
Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
Update: I'm experiencing issues with OpenCode and this model. I have built the latest llama.cpp and followed the Unsloth guide, but it's not usable at the moment because of:
- Tool calling doesn't work properly with OpenCode
- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher
- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md
I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.
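For context on what that flag does: my reading of the DRY ("Don't Repeat Yourself") sampler is that it subtracts a penalty from the logit of any token that would extend an already-repeated sequence, and the penalty grows exponentially with the repeat length. A rough sketch, assuming llama.cpp's defaults of base 1.75 and allowed length 2:

```python
def dry_penalty(multiplier: float, match_len: int,
                base: float = 1.75, allowed_length: int = 2) -> float:
    """Penalty subtracted from a token's logit when sampling it would
    extend a repetition of match_len tokens (sketch of the DRY sampler).

    Repeats up to allowed_length are free; beyond that the penalty
    scales as multiplier * base ** (match_len - allowed_length).
    """
    if match_len < allowed_length:
        return 0.0
    return multiplier * base ** (match_len - allowed_length)

# with --dry-multiplier 1.1: a 2-token repeat costs 1.1 logits,
# and each further repeated token multiplies the cost by 1.75
print(dry_penalty(1.1, 2))  # 1.1
```

So the multiplier mostly sets how harshly short repeats are punished; the exponential base is what stops long loops.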
> Codex is notably higher quality but also has me waiting forever.
And while it usually leads to higher-quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.
I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designed to work much better with Anthropic models).
Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
Same, I got 12 months of subscription for $28 total (promo offer), with 5x the usage limits of the $20/month Claude Pro plan. I have only used it with claude code so far.
Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to self-host. It’s a good candidate for a cerebras endpoint in my mind - getting sonnet 4.x (x<5) quality with ultra low latency seems appealing.
I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have rate limits per-minute and it counts cached tokens against it so you'll get limited in the first few seconds of every minute, then you have to wait the rest of the minute. So they're "fast" at 1000 tok/sec - but not really for practical usage. You effectively get <50 tok/sec with rate limits and being penalized for cached tokens.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 for 1 relatively simple coding task - would've cost <$0.50 using GPT-5.2-Codex or any other model besides Opus and maybe Sonnet that supports caching. And it would've been much faster.
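To make the "fast but rate-limited" point concrete, here's the back-of-envelope arithmetic (the quota numbers below are made up for illustration; I don't know Cerebras's actual limits, and counting cached tokens against the quota only makes it worse):

```python
def effective_tok_per_sec(burst_tok_per_sec: float, quota_tok_per_min: float) -> float:
    """Sustained throughput when a per-minute token quota gates a fast burst:
    you stream at burst speed until the quota runs out, then idle for the
    rest of the minute, so the sustained rate is capped at quota / 60."""
    return min(burst_tok_per_sec, quota_tok_per_min / 60.0)

# 1000 tok/s burst against a (hypothetical) 3000-token/min quota
print(effective_tok_per_sec(1000, 3000))  # 50.0 tok/s sustained
```

The burst speed is irrelevant once the quota binds; only quota/60 matters.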
I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.
Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I'd previously written a spec for with Claude (with some changes to the architecture this time; adjacent, not the same).
My gut feel? I prefer MiniMax M2.1 with OpenCode to Claude. Easiest boycott ever.
(I even picked the $10 plan; it was fine for now.)
Note that this is the Flash variant, which is only 31B parameters in total.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
I think most have moved past SWE-Bench Verified as a benchmark worth tracking -- it only tracks a few repos, contains only a small number of languages, and probably more importantly papers have come out showing a significant degree of memorization in current models, e.g. models knowing the filepath of the file containing the bug when prompted only with the issue description and without having access to the actual filesystem. SWE-Bench Pro seems much more promising though doesn't avoid all of the problems with the above.
For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.
I've been running it with llama-server from llama.cpp (compiled for CUDA backend, but there are also prebuilt binaries and instructions for other backends in the README) using the Q4_K_M quant from ngxson on Lubuntu with an RTX 3090:
Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.
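If it helps, here's a minimal stdlib-only Python sketch of talking to llama-server's OpenAI-compatible endpoint; the model alias is an assumption (single-model servers often ignore it, but clients require the field):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST against an OpenAI-compatible chat-completions endpoint."""
    payload = {
        "model": model,  # placeholder alias; use whatever your server reports
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://127.0.0.1:8080", "glm-4.7-flash", "Hello!")
# with llama-server running: body = urllib.request.urlopen(req).read()
print(req.full_url)  # http://127.0.0.1:8080/v1/chat/completions
```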
We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.
The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
Tried it within LMStudio on my m4 macbook pro – it feels dramatically worse than gpt-oss-20b. Of the two (code) prompts I've tried so far, it started spitting out invalid code and got stuck in a repeating loop for both. It's possible that LMStudio quantizes the model in such a manner that it explodes, but so far not a great first impression.
Excited to finally be able to give this a try today. I'm documenting my experience using aoe + OpenCode + LM Studio + GLM-4.7 Flash + Mac Mini M4 Pro 64GB RAM on this thread, if anyone wants to follow along and/or give me advice about how badly I'm messing up the settings.
Interesting they are releasing a tiny (30B) variant, unlike the 4.5-air distill which was 106B parameters. It must be competing with gpt mini and nano models, which personally I have found to be pretty weak. But this could be perfect for local LLM use cases.
In my experience, small-tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. The 70B class and above is where models really start to shine.
Comparison to GPT-OSS-20B (irrespective of how you feel that model actually performs) doesn't fill me with confidence. Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5, I would have hoped that their flash model would run circles around GPT-OSS-120B. I do wish they would provide an Aider result for comparison. Aider may be saturated among SotA models, but it's not at this size.
Hoping a 30B-A3B runs circles around a 117B-A5.1B is a bit of wishful thinking, especially when you're testing embedded knowledge. From the numbers, I think this model excels at agentic tool calls compared to GPT-OSS-20B. The rest is about the same in terms of performance.
The benchmarks lie. I've been using GLM 4.7 and it's pretty okay with simple tasks, but it's nowhere near Sonnet. Still useful and good value, but it's not even close.
Maybe someone here has tackled this before. I’m trying to connect Antigravity or Cursor with GLM/Qwen coding models, but haven’t had any luck so far. I can easily run Open-WebUI + LLaMA on my 5090 Ubuntu box without issues. However, when I try to point Antigravity or Cursor to those models, they don’t seem to recognize or access them. Has anyone successfully set this up?
I don't believe Antigravity or Cursor work well with pluggable models. It seems to be impossible with Antigravity and with Cursor while you can change the OAI compatible API endpoint to one of your choice, not all features may work as expected.
My recommendation would be to use other tools built to support pluggable model backends better. If you're looking for a Claude Code alternative, I've been liking OpenCode so far lately, and if you're looking for a Cursor alternative, I've heard great things about Roo/Cline/KiloCode although I personally still just use Continue out of habit.
What is the state of using quants? For chat models, a few errors or lost intelligence may matter a little. But what is happening to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent?
I'm interested in whether I can run it on a 24GB RTX 4090.
I like the byteshape quantizations: dynamic variable-quantization weights tuned for quality vs. overall size. They seem to make fewer errors at lower "average" quantizations than the Unsloth 4-bit quants. I think this is similar to variable-bitrate video compression, where you can keep more bits where it helps overall model accuracy.
Should be able to run this in 22GB vram so your 4090 (and a 3090) would be safe. This model also uses MLA so you can run pretty large context windows without eating up a ton of extra vram.
edit: 19GB vram for a Q4_K_M - MLX4 is around 21GB so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB so probably not viable.
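Those numbers line up with simple bits-per-weight arithmetic (weights only; the KV cache and activations come on top, though MLA keeps the cache small). The ~4.8 effective bits/weight for a Q4_K_M is my assumption:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM/disk size of the weights alone:
    1e9 * params_billion weights * bits/8 bytes each ~= GB."""
    return params_billion * bits_per_weight / 8

print(weight_gb(30, 4.8))   # 18.0 GB -- roughly Q4_K_M territory
print(weight_gb(30, 16.0))  # 60.0 GB -- full BF16
```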
It's in the ollama library at q4_K_M, which doesn't quite fit on my 4090 with the default context length. But it only offloads 8 layers to the CPU for me. I'm getting usable enough token rates. That's probably the easiest way to get it. Not tried it with vllm but if it proves good enough to stick with then I might give it a try.
What's the minimum hardware you need to run this at a reasonable speed?
My Mac Mini probably isn't up for the task, but in the future I might be interested in a Mac Studio just to churn at long-running data enrichment types of projects
Gave it four of my vibe questions around general knowledge and it didn’t do great. Maybe expected with a model as small as this one. Once support in llama.cpp is out I will take it for a spin.
It actually seems worse: gpt-20b is only 11 GB because it is pre-quantized in mxfp4, while GLM-4.7-Flash is 62 GB. In that sense GLM is closer to gpt-120b (59 GB), and in fact slightly larger.
Also, according to the gpt-oss model card, 20b scores 60.7 on SWE-Bench Verified (GLM claims 34 for that model) and 120b scores 62.7, vs. the 59.7 GLM reports for itself.
It may be worth taking a look at LFM [1]. I haven't had the need to use it so far (running on Apple silicon on a day to day basis so my dailies are usually the 30B+ MoEs) but I've heard good things from the internet from folks using it as a daily on their phones. YMMV.
We don't have a lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.
ssh admin.hotaisle.app
Yes, this should be made easier to just get a VM with it pre-installed. Working on that.
latchkey|1 month ago
This user has also done a bunch of good quants:
https://huggingface.co/0xSero
Workaccount2|1 month ago
People talk about these models like they are "catching up", they don't see that they are just trailers hooked up to a truck, pulling them along.
behnamoh|1 month ago
This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.
montroser|1 month ago
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
primaprashant|1 month ago
[1]: https://mistral.ai/news/devstral-2-vibe-cli
epolanski|1 month ago
You can get LLM as a service for cheaper.
E.g. This model costs less than a tenth of Haiku 4.5.
johndough|1 month ago
https://github.com/ggml-org/llama.cpp/releases
https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...
You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
cmrdporcupine|1 month ago
HEAD of ollama with Q8_0 vs vLLM with BF16 and FP8 after.
BF16 predictably bad. Surprised FP8 performed so poorly, but I might not have things tuned that well. New at this.
Most importantly, it actually worked nicely in opencode, which I couldn't get Nemotron to do.
esafak|1 month ago
GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.
river_otter|1 month ago
https://x.com/natebrake/status/2013978241573204246
Thus far, the 6-bit quant MLX weights were too much and crashed LMS with OOM
unsupp0rted|1 month ago
Not for code. The quality is so low, it's roughly on par with Sonnet 3.5
arbuge|1 month ago
https://huggingface.co/inference/models?model=zai-org%2FGLM-...
Mattwmaster58|1 month ago
Slow inference is also present on z.ai; eyeballing it, the 4.7 Flash model was twice as slow as regular 4.7 right now.
veselin|1 month ago
Also, would vllm be a good option?
karmakaze|1 month ago
I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.
[0] https://z.ai/blog/glm-4.7
yowlingcat|1 month ago
[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
cipehr|1 month ago
https://huggingface.co/EssentialAI/rnj-1
PhilippGille|1 month ago
https://openrouter.ai/z-ai/glm-4.7-flash/providers