> "It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase."

SWE-bench went from ~30-40% to ~70-80% this year.
elcritch|9 months ago
bckr|9 months ago
Yes. You must guide coding agents at the level of modules and above. In fact, you have to know good coding patterns and make these patterns explicit.
Claude 4 won’t use uv, pytest, pydantic, mypy, classes, small methods, and small files unless you tell it to.
Once you tell it to, it will do a fantastic job generating well-structured, type-checked Python.
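A stdlib-only sketch of the "small, typed, testable" style the comment describes — dataclasses standing in for pydantic models here so the snippet is self-contained; all names are illustrative, not from the thread:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    """Small, explicit data shape (a pydantic BaseModel would add validation)."""
    subtotal_cents: int
    tax_rate: float


def total_cents(invoice: Invoice) -> int:
    """Single-purpose function with full type hints, so mypy can check callers."""
    return round(invoice.subtotal_cents * (1 + invoice.tax_rate))


def test_total_cents() -> None:
    # pytest-style test: plain asserts, discoverable by `pytest`,
    # but also runnable directly without any framework.
    assert total_cents(Invoice(subtotal_cents=1000, tax_rate=0.1)) == 1100
```

The point of making such patterns explicit in the prompt is that each piece stays small enough to review, and the type checker catches interface drift between generated modules.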
viraptor|9 months ago
avs733|9 months ago
40% to 80% is a 2x improvement
It’s not that the second leap isn’t impressive, it just doesn’t change your perspective on reality in the same way.
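A quick arithmetic sketch of why the same benchmark jump reads differently depending on framing — doubling the solve rate is simultaneously a 3x cut in the failure rate (figures are the approximate ones from the thread):

```python
# Approximate SWE-bench solve rates quoted in the thread.
before, after = 0.40, 0.80

# Success framing: twice as many tasks solved.
success_gain = after / before

# Failure framing: residual failures shrink from 60% to 20%, a 3x cut.
failure_drop = (1 - before) / (1 - after)

assert success_gain == 2.0
assert abs(failure_drop - 3.0) < 1e-9
```

The failure-rate framing is why the remaining 20% can feel like the more meaningful number: halving it again would be another large reliability gain even though the headline score barely moves.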
viraptor|9 months ago
It really depends on how that remaining improvement happens. We'll see it soon though - every benchmark nearing 90% is being replaced with something new. SWE-bench Verified is almost dead now.
energy123|9 months ago
andyferris|9 months ago
A 20% risk seems more manageable, and the improvements speak to better code and problem-solving skills all around.