top | item 46640068

(no title)

I've yet to be convinced by any article, including this one, that attempts to draw boxes around what coding agents are and aren't good at in a way that is robust on a 6 to 12 month horizon.

I agree that the examples listed here are relatable, and I've seen similar in my uses of various coding harnesses, including, to some degree, ones driven by opus 4.5. But my general experience with using LLMs for development over the last few years has been that:

1. Initially models could at best assemble a simple procedural or compositional sequences of commands or functions to accomplish a basic goal, perhaps meeting tests or type checking, but with no overall coherence,

2. To being able to structure small functions reasonably,

3. To being able to structure large functions reasonably,

4. To being able to structure medium-sized files reasonably,

5. To being able to structure large files, and small multi-file subsystems, somewhat reasonably.

So the idea that they are now falling down on the multi-module or multi-file or multi-microservice level is both not particularly surprising to me and also both not particularly indicative of future performance. There is a hierarchy of scales at which abstraction can be applied, and it seems plausible to me that the march of capability improvement is a continuous push upwards in the scale at which agents can reasonably abstract code.

Alternatively, there could be that there is a legitimate discontinuity here, at which anything resembling current approaches will max out, but I don't see strong evidence for it here.

discuss

Uehreka|1 month ago

It feels like a lot of people keep falling into the trap of thinking we’ve hit a plateau, and that they can shift from “aggressively explore and learn the thing” mode to “teach people solid facts” mode.

A week ago Scott Hanselman went on the Stack Overflow podcast to talk about AI-assisted coding. I generally respect that guy a lot, so I tuned in and… well it was kind of jarring. The dude kept saying things in this really confident and didactic (teacherly) tone that were months out of date.

In particular I recall him making the “You’re absolutely right!” joke and asserting that LLMs are generally very sycophantic, and I was like “Ah, I guess he’s still on Claude Code and hasn’t tried Codex with GPT 5”. I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong and not flattering my every decision. But that news (which would probably be really valuable to many people listening) wasn’t mentioned on the podcast I guess because neither of the guys had tried Codex recently.

And I can’t say I blame them: It’s really tough to keep up with all the changes but also spend enough time in one place to learn anything deeply. But I think a lot of people who are used to “playing the teacher role” may need to eat a slice of humble pie and get used to speaking in uncertain terms until such a time as this all starts to slow down.

orbital-decay|1 month ago

> in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong

That's just a different bias purposefully baked into GPT-5's engineered personality on post-training. It always tries to contradict the user, including the cases where it's confidently wrong, and keeps justifying the wrong result in a funny manner if pressed or argued with (as in, it would have never made that obvious mistake if it wasn't bickering with the user). GPT-5.0 in particular was extremely strongly finetuned to do this. And in longer replies or multiturn convos, it falls into a loop on contradictory behavior far too easily. This is no better than sycophancy. LLMs need an order of magnitude better nuance/calibration/training, this requires human involvement and scales poorly.

Fundamental LLM phenomena (ICL, repetition, serial position biases, consequences of RL-based reasoning etc) haven't really changed, and they're worth studying for a layman to get some intuition. However, they vary a lot model to model due to subtle architectural and training differences, and impossible to keep up because there are so many models and so few benchmarks that measure these phenomena.

aeneas_ory|1 month ago

"Still on Claude Code" is a funny statement, given that the industry is agreeing that Anthropic has the lead in software generation while others (OpenAI) are lagging behind or have significant quality issues (Google) in their tooling (not the models). And Anthropic frontier models are generally "You're absolutely right - I apologize. I need to ..." everytime they fuck something up.

zeroonetwothree|1 month ago

Why is it every time anyone has a critique someone has to say “oh but you aren’t using model X, which clearly never has this problem and is far better”?

Yet the data doesn’t show all that much difference between SOTA models. So I have a hard time believing it.

raincole|1 month ago

People desperately want 'the plateau' to be true because it means our jobs would be safe and we could call ourselves experts again. If the ground is keep moving then no one is truly an expert. There is just no enough time to achieve expertise when the paradigm shifts every six months.

alternatetwo|1 month ago

Claude is still just like that once you’re deep enough in the valley of the conversation. not exactly that phrase but things like that’s the smoking gun or so. nothing has changed.

PaulDavisThe1st|1 month ago

> I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x

It said precisely that to me 3 or 4 days ago when I questioned its labelling of algebraic terms (even though it was actually correct).

overgard|1 month ago

I don't see a reason to think we're not going to hit a plateua sooner or later (and probably sooner). You can't scale your way out of hallucinations, and you can't keep raising tens of billions to train these things without investors wanting a return. Once you use up the entire internets worth of stack overflow responses and public github repositories you run into the fact that these things aren't good at doing things outside their training dataset.

Long story short, predicting perpetual growth is also a trap.

Q6T46nT668w6i3m|1 month ago

On balance, there’s far more evidence to support the conclusion that language models have reached a plateau.

jgalt212|1 month ago

To me this seems like a classic LLM defense.

A doesn't work. You must frontier model 4.

A works on 4, but B doesn't work on 4. You doing it wrong, you must use frontier model 5.

Ok, now I use 5, A and B work, but C doesn't work. Fool, you must use frontier model 6.

Ok, I'm on 6, but now A is not working as it good as it did on A. Only fools are still trying to do A.

soulofmischief|1 month ago

Opus 4.5 seems to be better than GPT 5.2 or 5.2 Codex at using tools and working for long stretches on complex tasks.

MoltenMan|1 month ago

I agree with a lot of what you've said, but I completely disagree that LLM's are no longer sycophantic. GPT-5 is definitely still very sycophantic, 'You're absolutely right!' still happens, etc. It's true it happens far less in a pure coding context (Claude Code / Codex) but I suspect only because of the system prompts, and those tools are by far in the minority of LLM usage.

I think it's enlightening to open up ChatGPT on the web with no custom instructions and just send a regular request and see the way it responds.

danpalmer|1 month ago

I used to get made up APIs in functions, now I get them in modules. I used to get confidently incorrect assertions in files now I get them across codebases.

Hell, I get poorly defined APIs across files and still get them between functions. LLMs aren't good at writing well defined APIs at any level of the stack. They can attempt it at levels of the stack they couldn't a year ago, but they're still terrible at it unless the problem is so well known enough that they can regurgitate well reviewed code.

refactor_master|1 month ago

I still get made-up Python types all the time with Gemini. Really quite distracting when your codebase is massive and triggers a type error, and Gemini says

"To solve it you just need to use WrongType[ThisCannotBeUsedHere[Object]]"

and then I spend 15 minutes running in circles, because everything from there on is just a downward spiral, until I shut off the AI noise and just read the docs.

conradfr|1 month ago

Yeah I've been trying Claude Code for a week (mostly Opus) and in a C++ Juce project it kept hallucinating functions for a simple task ("retrieve DAW track name if available") and actually never got it right.

It also failed a lot to modify a simple Caddyfile.

On the other hand it sometimes blows me away and offers to correct mistakes I coded myself. It's really good on web code I guess as that must be the most public code available (Vue3 and elixir in my case).

measurablefunc|1 month ago

This is the right answer. Unless there is some equivalent of it on the open internet which their search engine can find you should not expect a good outcome.

groby_b|1 month ago

LLMs are bad at creating abstraction boundaries since inception. People have been calling it out since inception. (Heck, even I got a twitter post somewhere >12 months old calling that out, and I'm not exactly a leading light of the effort)

It is in no way size-related. The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

TeMPOraL|1 month ago

> The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

That statement is way too strong, as it implies either that humans cannot create new concepts/abstractions, or that magic exists.

w0m|1 month ago

I believe his argument is that now that you've defined the limitation, it's a ceiling that will likely be cracked in the relatively near future.

131hn|1 month ago

There’s only one way to implement a mission, an algorithm, a task. But there’s an infinity of path, inconsistants, fuzzy and always subjective way to live. Thàt’s our lives, that’s the code LLM are trained on. I do not think, and hope, it will ever change much

wouldbecouldbe|1 month ago

I feel like the main challenge is where to be "loose" and where to be "strict", Claude takes too much liberty often. Assuming things, adding some mock data to make it work, using local storage because there is no db. This makes it work well out of the box, and means I can prompt half ass and have great results. But it also long term causes issues. It can be prompted away, but it needs constant reminder. This seems like a hard problem to solve. I feel like it can already almost do everything if you have the correct vision / structure in mind and have the patience to prompt properly.

It's worst feature is debugging hard errors, it will just keep trying everything and can get pretty wild instead of entering plan mode and really discuss & think things true.

pankajdoharey|1 month ago

Claude is overrated premium piece of developer tech, i have produced equally good results from Gemini and Way better with GPT - medium. And GPT Medium is a really good model at assembling and debugging stuff than Claude. Claude hallucinates when asked why something is correct or should be done. All Models fail equally in some or the other aspect, which point to the fact that these models have strength's and weaknesses, and GPT just happens to be a good overall model. But dev community is so stuck up on Claude for no good reason other than shiny tooling : "Claude Code", besides that the models can be equally worse as the competition. The Benchmarks do not explain the full story. In general though the Thumb rule is if the Model says you are Brilliant, Thats genius or Now thats a deep and insightful question you asked... Its time to start a new session.

skybrian|1 month ago

The article is mostly reporting on the present. (Note the "yet" in the title.)

There's only one sentence where it handwaves about the future. I do think that line should have been cut.