item 46117253

carsoon | 3 months ago

Yeah, this latest generation of models (Opus 4.5, GPT-5.1, and Gemini 3 Pro) is the biggest breakthrough since GPT-4o, in my mind.

Before, it felt like they were only good for very specific use cases and common frameworks (Python, Next.js), and they still made tons of mistakes constantly.

Now they work with novel frameworks, they're very good at correcting themselves from linting errors and debugging themselves by reading files and querying databases, and these models are affordable enough for many different use cases.
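As a rough sketch of the self-correction loop being described here (with `fake_model` and `fake_linter` as hypothetical stubs standing in for a real LLM call and a real linter like ruff or pyflakes):

```python
def fake_linter(code: str) -> list[str]:
    # Stand-in for a real linter: flags a misspelled builtin.
    return ["undefined name 'pront'"] if "pront" in code else []

def fake_model(code: str, errors: list[str]) -> str:
    # A real agent would send the code plus linter output back to an LLM;
    # here we simulate one specific fix for the demo.
    if any("pront" in e for e in errors):
        return code.replace("pront", "print")
    return code

def lint_fix_loop(code: str, max_rounds: int = 3) -> str:
    """Run the linter, feed errors back to the model, repeat until clean."""
    for _ in range(max_rounds):
        errors = fake_linter(code)
        if not errors:
            break
        code = fake_model(code, errors)
    return code

print(lint_fix_loop("pront('hello')"))  # the loop repairs the typo
```

The point is that the loop itself is trivial; the models carry the weight of actually interpreting the linter output.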


justanotherunit|3 months ago

Is it the models, though? With every release (multimodal etc.), it's just a well-crafted layer of business logic between the user and the LLM. Sometimes I feel like we lose track of what the LLM does versus what the API in front of it does.

NitpickLawyer|3 months ago

It's 100% the models. Terminal-Bench is a good indication of this: there, the agents get just a terminal tool, and yet they can still solve lots and lots of tasks. Last year you needed lots of glue, and two years ago you needed monstrosities like LangChain that worked maybe once in a blue moon, if you didn't look at them funny.

Check out the exercise from the SWE-agent people, who released a mini agent that's just "a terminal in a loop" and that started to get close to the fully engineered agents this year.

https://github.com/SWE-agent/mini-swe-agent
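The "terminal in a loop" idea really is that small. A minimal sketch (with `fake_model` as a hypothetical stub standing in for the LLM that proposes the next shell command given the transcript so far):

```python
import subprocess

def fake_model(transcript: list[str]) -> str:
    # Stand-in for an LLM: given the transcript so far, propose the
    # next shell command, or "DONE" when the task looks finished.
    if not transcript:
        return "echo hello"
    return "DONE"

def terminal_in_a_loop(max_steps: int = 5) -> list[str]:
    """Run model-proposed shell commands, feeding output back each step."""
    transcript: list[str] = []
    for _ in range(max_steps):
        cmd = fake_model(transcript)
        if cmd == "DONE":
            break
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        transcript.append(f"$ {cmd}\n{result.stdout}{result.stderr}")
    return transcript

for entry in terminal_in_a_loop():
    print(entry)
```

Everything interesting (planning, error recovery, knowing when to stop) lives inside the model call, not the loop.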

carsoon|2 months ago

It's definitely a mix; we've been co-developing better models and frameworks/systems to improve the outputs. Now we have llms.txt, MCP servers, structured outputs, better context-management systems, and augmented retrieval through file indexing, search, and documentation indexing.
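"Structured outputs" mostly means constraining the model's reply to a schema and validating it before use. A minimal sketch with a hypothetical two-field schema, using only the stdlib:

```python
import json

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "age": int}

def parse_structured_reply(reply: str) -> dict:
    """Parse a model reply expected to be JSON and check required fields."""
    data = json.loads(reply)
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data or not isinstance(data[field], ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

record = parse_structured_reply('{"name": "Ada", "age": 36}')
print(record["name"])
```

Real APIs push the constraint into decoding itself (the model can only emit schema-valid tokens), but the consumer-side shape is the same: parse, validate, retry on failure.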

But the raw models (which I test through direct API calls) are much better too. The biggest change with regard to price came from mixture-of-experts, which kept quality very similar while dropping compute roughly 10x. (This is what allowed DeepSeek-V3 to match GPT-4o's quality at such a lower price.)

This same technique has most likely been applied to these new models, so we now have 1T-100T? parameter models at roughly the same cost as 4o through mixture-of-experts. (That's my guess, at least.)
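The back-of-envelope math behind the "10x" claim: per-token compute scales with *activated* parameters, not total parameters, so a sparse MoE can carry a huge total parameter count cheaply. Using DeepSeek-V3's published figures (671B total, ~37B activated per token):

```python
# Per-token FLOPs scale with activated parameters, so compare a sparse
# MoE against a hypothetical dense model of the same total size.
total_params_b = 671   # billions of parameters, all experts combined
active_params_b = 37   # billions activated per token

compute_ratio = total_params_b / active_params_b
print(f"~{compute_ratio:.0f}x fewer FLOPs per token than a dense 671B model")
```

That works out to roughly 18x, which is the right ballpark for the "keep quality, drop compute ~10x" claim above (the dense model you'd actually need for equal quality is somewhat smaller than the MoE's total size, so the real saving is lower than the raw ratio).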

ACCount37|3 months ago

It's the models.

"A well crafted layer of business logic" just doesn't exist. The amount of "business logic" involved in frontier LLMs is surprisingly low, and mostly comes down to prompting and how tools like search or memory are implemented.

Things like RAG never quite took off in frontier labs, and the agentic scaffolding they use is quite barebones. They bet on improving the models' own capabilities instead, and they're winning that bet.