kir-gadjello | 1 day ago
I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and StepFun powers through.
I have no relation to StepFun; I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.
kir-gadjello|18 hours ago
Even purely pragmatically, StepFun covers 95% of my research+SWE coding needs, and for the remaining 5% I can access the large frontier models. I was surprised StepFun is even decent at planning and research, so it is possible to get by with it and nothing else (1), but ofc for minmaxing the best frontier model is still the best planner (although the latest deepseek is surprisingly good too).
Finally we are at a point where there is a clear separation of labor between frontier & strong+fast models, but tbh shoehorning StepFun into this "strong+fast" category feels limiting; I think it has greater potential.
CapsAdmin|18 hours ago
Claude Code always gives me rate limits. Claude through Copilot is a bit slow, and Copilot has constant network request issues or something, but at least I don't get rate limited as often.
At least local models always work, are faster (50+ tps with qwen3.5 35b a4b on a 4090), and most importantly never hit a rate limit.
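Something like this is all it takes to point a script or agent at the local server instead of a hosted API (a minimal sketch, assuming a llama.cpp llama-server or vLLM instance exposing the OpenAI-compatible chat endpoint on localhost:8080; the port and model name are placeholders, adjust to your setup):

    # Minimal sketch: call a local OpenAI-compatible server directly,
    # so there are no provider-side rate limits.
    # Assumes llama.cpp's llama-server (or vLLM) is listening on localhost:8080.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "qwen3.5-35b-a4b",  # placeholder; must match whatever the server has loaded
            "messages": [
                {"role": "user", "content": "Summarize this borrow-checker error: ..."}
            ],
            "temperature": 0.2,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])

No API keys, no per-minute quotas; if the box is up, the endpoint answers.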
nodakai|19 hours ago
It’s 2× faster than its competitors. For tasks where “one-shotting” is unrealistic, a fast iteration loop makes a measurable difference in productivity.
Aurornis|9 hours ago
To be clear I never said they weren’t strong or useful. I use them for some small tasks too.
I said they’re not equivalent to SOTA models from 6 months ago, which is what is always claimed.
Then it turns into a motte-and-bailey game where that argument is replaced with the simpler argument that they're useful for open-weights models. I'm not disagreeing with that part. I'm disagreeing with the first assertion that they're equivalent to Sonnet 4.5.
kir-gadjello|9 hours ago
Maybe my detailed, requirement-based/spec-based prompting style shrinks the gap between Anthropic's and OSS models, and people simply like how good Anthropic's models are at reading the programmer's intent from short, concise prompts.
Frankly, I think 1:1 equivalence is an impossible standard given the set of priorities and decisions frontier labs make when setting up their pre-, mid- and post-training pipelines, but benchmark-wise it is achievable for a smaller OSS model to align with Sonnet 4.5 even on hard benchmarks.
Given the relatively underwhelming Sonnet 4.5 benchmarks [1], I think StepFun might have an edge over it, esp. in Math/STEM [2]; even an old DeepSeek 3.2 (not Speciale!) had a similar aggregate score. With 4.6, Anthropic ofc vastly improved their benchmark game, and it now truly looks like a frontier model.
1. https://artificialanalysis.ai/models/claude-4-5-sonnet-think...
2. https://matharena.ai/models/stepfun_3_5_flash