(no title)
carsoon | 3 months ago
Before it felt like they were good for very specific usecases and common frameworks (Python and nextjs) but still made tons of mistakes constantly.
Now they work with novel frameworks and are very good at correcting themselves using linting errors, debugging themselves by reading files and querying databases and these models are affordable enough for many different usecases.
justanotherunit|3 months ago
NitpickLawyer|3 months ago
Check out the exercise from the swe-agent people who released a mini agent that's "terminal in a loop" and that started to get close to the engineered agents this year.
https://github.com/SWE-agent/mini-swe-agent
carsoon|2 months ago
But these raw models (which i test through direct api calls) are much better. The biggest change with regards to price was through mixture of experts which allowed keeping quality very similar and dropping compute 10x. (This is what allowed deepseek v3 to have similar quality to gpt-4o at such a lower price.)
This same tech has most likely been applied to these new models and now we have 1T-100T? parameter models with the same cost as 4o through mixture of experts. (this is what I'd guess at least)
ACCount37|3 months ago
"A well crafted layer of business logic" just doesn't exist. The amount of "business logic" involved in frontier LLMs is surprisingly low, and mostly comes down to prompting and how tools like search or memory are implemented.
Things like RAG never quite took off in frontier labs, and the agentic scaffolding they use is quite barebones. They bet on improving the model's own capabilities instead, and they're winning on that bet.