Is it the models, though? With every release (multimodal, etc.) it's just a well-crafted layer of business logic between the user and the LLM. Sometimes I feel like we lose track of what the LLM does and what the API in front of it does.
It's 100% the models. Terminal-Bench is a good indication of this: there, the agents get "just a terminal tool," and yet they still solve lots and lots of tasks. Last year you needed lots of glue, and two years ago you needed monstrosities like LangChain that worked maybe once in a blue moon, if you didn't look at it funny.
Check out the exercise from the SWE-agent people, who released a mini agent that's just "a terminal in a loop" and that started to get close to the heavily engineered agents this year.
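To see how little scaffolding that takes, here's a minimal sketch of the "terminal in a loop" idea. This is not the actual mini-swe-agent code; `query_model` is a placeholder for any callable that maps a chat-message list to the model's reply text (e.g. a thin wrapper around a chat-completions API), and the DONE convention is an assumption for illustration:

```python
import re
import subprocess

FENCE = "`" * 3  # literal triple-backtick marker the model wraps commands in

def run_agent(query_model, task, max_steps=10):
    """Minimal 'terminal in a loop' agent, in the spirit of mini-swe-agent.

    Each turn: ask the model for one shell command, run it, feed the
    output back as the next user message, and repeat until it says DONE.
    """
    messages = [
        {"role": "system", "content":
            f"Solve the task. Reply with ONE shell command per turn inside "
            f"a {FENCE}bash fenced block. Reply DONE when finished."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        match = re.search(FENCE + r"bash\n(.*?)" + FENCE, reply, re.DOTALL)
        if not match:
            continue  # no command this turn; let the model try again
        result = subprocess.run(match.group(1), shell=True, text=True,
                                capture_output=True, timeout=60)
        messages.append({"role": "user", "content":
                         f"exit={result.returncode}\n{result.stdout}{result.stderr}"})
    return messages
```

That's essentially the whole "agent" — everything else is the model.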
It's definitely a mix; we've been co-developing better models and better frameworks/systems to improve the outputs. Now we have llms.txt, MCP servers, structured outputs, better context-management systems, and augmented retrieval through file indexing, search, and documentation indexing.
But the raw models (which I test through direct API calls) are much better too. The biggest change with regard to price was mixture of experts, which kept quality very similar while dropping compute ~10x. (This is what allowed DeepSeek-V3 to match GPT-4o's quality at a much lower price.)
This same technique has most likely been applied to the new models, and now we have 1T-100T? parameter models at the same cost as 4o through mixture of experts. (This is what I'd guess, at least.)
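The compute savings come from routing: a MoE layer holds many expert weight matrices, but each token only runs through its top-k experts, so per-token FLOPs scale with the *active* parameters, not the total. (DeepSeek-V3, for instance, reportedly activates ~37B of 671B total parameters.) A toy NumPy sketch of the routing idea — not any lab's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2  # toy sizes; real MoE layers are far larger

# One weight matrix per expert, plus a router that scores experts per token.
experts = rng.standard_normal((n_experts, d, d)) / np.sqrt(d)
router = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_forward(x):
    """Route each token to its top-k experts; only those experts run."""
    scores = x @ router                            # (tokens, n_experts)
    top = np.argsort(scores, axis=-1)[:, -top_k:]  # chosen expert indices
    sel = np.take_along_axis(scores, top, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))   # softmax over the
    w /= w.sum(-1, keepdims=True)                  # selected scores only
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        for j, e in enumerate(top[i]):
            out[i] += w[i, j] * (tok @ experts[e])  # only top-k matmuls run
    return out

y = moe_forward(rng.standard_normal((4, d)))
active_fraction = top_k / n_experts  # 0.25 here: 4x fewer FLOPs than dense
```

The total parameter count (and memory) is 8 experts' worth, but each token pays for only 2 — that's the quality-at-lower-compute trade the comment describes.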
"A well crafted layer of business logic" just doesn't exist. The amount of "business logic" involved in frontier LLMs is surprisingly low, and mostly comes down to prompting and how tools like search or memory are implemented.
Things like RAG never quite took off in frontier labs, and the agentic scaffolding they use is quite barebones. They bet on improving the model's own capabilities instead, and they're winning on that bet.
So how would you explain how an output of tokens can call a function, or even generate an image, since that requires a whole different kind of compute? There's still a layer in front of the model that acts as a parser to enable these capabilities.
Maybe “business” is a bad term for it, but the actual output of the model still needs to be interpreted.
Maybe I'm way out of line here, since this is not my field and I'm doing my best to understand these layers. But in your terms, are you perhaps speaking of the model as an application?
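The "interpretation layer" you're describing is real, but it's thin: tool calling is literally the model emitting structured text (usually JSON) that a small harness parses and dispatches. A minimal sketch, with hypothetical tool names of my own invention:

```python
import json

# Hypothetical tools the harness exposes; the model only sees their
# names and schemas in its prompt, never the implementations.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def dispatch(model_output: str):
    """Interpret one model turn: either plain text or a JSON tool call.

    The model never executes anything itself; it just emits tokens like
    {"tool": "add", "args": {"a": 2, "b": 3}} and this layer does the rest.
    The result is fed back to the model as its next input message.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return ("text", model_output)  # ordinary assistant text
    result = TOOLS[call["tool"]](**call["args"])
    return ("tool_result", result)
```

Image generation works the same way at this level: the tokens route the request to a different backend (a separate image model), but the routing logic itself is just parse-and-dispatch, not deep business logic.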
https://github.com/SWE-agent/mini-swe-agent