tarruda|10 days ago
- Very context efficient: SWA by default, so on a 128 GB Mac I can run the full 256k context or two 128k context streams.
- Good speeds on Macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp. These speeds also degrade very slowly as context increases: at 100k prefill, it still does 20 t/s tg and 129 t/s pp.
- Trained for agentic coding. I think it is trained to be compatible with Claude Code, but it works fine with other CLI harnesses except for Codex (due to the patch edit tool, which can confuse it).
This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I had with a local LLM doing agentic coding.
There are a few drawbacks though:
- It can generate some very long reasoning chains.
- The current release has a bug where it sometimes goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
Hopefully StepFun will do a new release which addresses these issues.
BTW StepFun seems to be the same company that released ACEStep (very good music generation model). At least StepFun is mentioned in ComfyUI docs https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1
sosodev|10 days ago
I had tried Nemotron 3 Nano with OpenCode, and while it kinda worked, its tool use was seriously lacking because it leans on the shell tool for most things. For example, instead of using a dedicated tool to edit a file, it would just run sed on it through the shell tool.
That's the primary issue I've noticed with agentic open-weight models in my limited testing: they seem hesitant to call tools they don't recognize unless explicitly instructed to do so.
tarruda|10 days ago
ipython|10 days ago
It's my layman understanding that this would have to be fixed in the model weights themselves?
tarruda|10 days ago
sosodev|10 days ago
It can also be a bug in the model weights because the model is just failing to generate the appropriate "I'm done thinking" indicator.
You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635
Apparently Step 3.5 Flash uses an odd format for its reasoning tags, so llama.cpp just doesn't handle it correctly.
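To make the failure mode concrete, here's a minimal sketch of how a harness typically splits reasoning from the final answer. The `<think>...</think>` format here is a common convention used purely for illustration; Step 3.5 Flash's actual tags differ, which is exactly the kind of mismatch that trips up a parser expecting the usual format.

```python
import re

# Hypothetical reasoning-tag format for illustration only; each model
# family uses its own markers, and parsers must match them exactly.
REASONING_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer.

    If the model never emits the closing tag (or emits one the parser
    doesn't recognize), no match is found: everything stays stuck in
    the reasoning channel and the harness sees no answer at all.
    """
    m = REASONING_RE.search(text)
    if m is None:
        return "", text
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer
```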
petethepig|10 days ago
Did anyone do this kind of math?
tarruda|10 days ago
However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans. It would take you forever to reach $10k in API spend.
And of course this is not even considering energy costs of running inference on your own hardware (though Macs should be quite efficient there).
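As a back-of-the-envelope sketch (every number here is an assumption for illustration, not a real quote: a roughly $4,800 128 GB Mac and roughly $0.60 per million output tokens for a cheap API; plug in your own figures):

```python
# All figures below are illustrative assumptions, not real prices.
mac_cost_usd = 4800          # assumed cost of a 128 GB Mac
api_price_per_mtok = 0.60    # assumed API price per million output tokens

# How many million output tokens the API would have to serve before
# it costs as much as the hardware.
breakeven_mtok = mac_cost_usd / api_price_per_mtok
print(f"break-even: {breakeven_mtok:,.0f}M output tokens")

# At the 36 t/s generation speed quoted above, generating non-stop:
tg_speed = 36  # tokens/s
years = breakeven_mtok * 1e6 / tg_speed / (3600 * 24 * 365)
print(f"~{years:.1f} years of continuous generation")
```

Under those assumed prices, break-even is billions of tokens, i.e. years of continuous local generation, which is why the energy cost caveat matters less than it might seem.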
terhechte|10 days ago
tarruda|10 days ago
- OpenAI completions endpoint
- Anthropic messages endpoint
- OpenAI responses endpoint
- A slick-looking web UI
Without having to install anything else.
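For anyone who hasn't tried it, a minimal sketch of hitting the OpenAI-style chat completions endpoint with only the Python standard library (host/port are llama-server's defaults; the "model" string is an arbitrary placeholder since the server serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt: str) -> dict:
    # Standard OpenAI chat-completions payload shape. The model name
    # is a placeholder; llama-server ignores it.
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint speaks the OpenAI wire format, any existing OpenAI client library pointed at `http://localhost:8080/v1` should also work unchanged.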
KerrAvon|10 days ago
lostmsu|10 days ago
tarruda|10 days ago
For example, when I tried gpt-oss 120b with Codex, it would very easily forget something present in the system prompt, like "use the `rg` command to search and list files".
I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both models viable for long agentic coding sessions.