They are not 1:1 equivalent, esp. in knowledge coverage (given the order-of-magnitude difference in parameter count) and in taste (Sonnet wins, though for taste one can also use Kimi K2.5). But in my hardcore use (high-performance realtime simulations of various kinds) I would strongly prefer StepFun-3.5-Flash to Sonnet 4, and prefer it to 4.5 often enough that I see no decisive advantage in using Sonnet 4.5 exclusively. For truly hard tasks or specifications I would of course turn to 5.2 or 5.3-codex, but one KPI for the quality of my work as a lead engineer is to ensure that truly hard tasks are known, bounded and planned for in advance.

Maybe my detailed, requirement-based/spec-based prompting style narrows the gap between Anthropic's and OSS models, and people just like how good Anthropic's models are at reading the programmer's intent from short, concise prompts.

Frankly, I think 1:1 equivalence is an impossible standard given the set of priorities and decisions frontier labs make when setting up their pre-, mid- and post-training pipelines. Benchmark-wise, though, it is achievable for a smaller OSS model to match Sonnet 4.5 even on hard benchmarks.

Given the relatively underwhelming Sonnet 4.5 benchmarks [1], I think StepFun might have an edge over it, esp. in Math/STEM [2]; even an old deepseek-3.2 (not speciale!) had a similar aggregate score. With 4.6, Anthropic ofc vastly improved their benchmark game, and it now truly looks like a frontier model.

1. https://artificialanalysis.ai/models/claude-4-5-sonnet-think...
2. https://matharena.ai/models/stepfun_3_5_flash