top | item 46618042

Claude is good at assembling blocks, but still falls apart at creating them

315 points | bblcla | 1 month ago | approachwithalacrity.com

237 comments


woeirua|1 month ago

It's just amazing to me how fast the goal posts are moving. Four years ago, if you had told someone that an LLM would be able to one-shot either of those first two tasks they would've said you're crazy. The tech is moving so fast. I slept on Opus 4.5 because GPT 5 was kind of an air ball, and just started using it in the past few weeks. It's so good. Way better than almost anything that's come before it. It can one-shot tasks that we never would've considered possible before.

skue|1 month ago

> Four years ago, if you had told someone that an LLM would be able to one-shot either of those first two tasks they would've said you're crazy.

Four years ago, they would have likely asked what in the world is an LLM? ChatGPT is barely 3 years old.

enraged_camel|1 month ago

It literally saved my small startup six-figures and months of work. I've written about it extensively and posted it (it's in my submissions).

ranyume|1 month ago

There are certain things/llm-phenomena that haven't changed since their introduction.

Madmallard|1 month ago

Idk, I was using ChatGPT 3.5 to do stuff and it was pretty helpful then

utopiah|1 month ago

> The tech is moving so fast.

Well, that's exactly the problem: how can one say that?

The entire process of evaluating what "it" actually does has been a problem from the start. Input text, output text... OK, but what if the training data includes the evaluation? This was ridiculous a few years ago, but then the scale went from some curated text datasets to... most of the Web as text, to most of the Web as text including transcriptions from videos, to most of the Web plus some non-public databases, to all that PLUS (and that's just cheating) tests that were supposed to be designed to NOT be present elsewhere.

So again, that's the crux of the problem: WHAT does it actually do? Is it "just" search? Is it semantic search with search-and-replace? Is it that plus evaluation that it runs?

Sure, the scaffolding becomes bigger, the available dataset becomes larger, the available compute keeps on increasing, but it STILL does not answer the fundamental question, namely what is being done. The assumption here is that because the output text does solve the question asked, then "it" works, it "solved" the problem. The problem is that by definition the entire setup has been made in order to look as plausible as possible. So it's not luck that it initially appears realistic. It's not luck that it can thus pass some dedicated benchmark, but it is also NOT solving the problem.

So yes, sure, the "tech" is moving "so fast", but we still can't agree on what it does, we keep on having no good benchmarks, we keep on having that jagged frontier https://www.hbs.edu/faculty/Pages/item.aspx?num=64700 that makes it so challenging to make more meaningful statements than "moving so fast", which sounds like a marketing claim.

simonw|1 month ago

I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code:

> But in context, this was obviously insane. I knew that key and id came from the same upstream source. So the correct solution was to have the upstream source also pass id to the code that had key, to let it do a fast lookup.

I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

My guess is that Claude is trained to bias towards making minimal edits to solve problems. This is a desirable property, because six months ago a common complaint about LLMs was that you'd ask for a small change and they would rewrite dozens of additional lines of code.

I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

bblcla|1 month ago

(Author here)

> I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code

Yeah, that's fair - a friend of mine also called this out on Twitter (https://x.com/konstiwohlwend/status/2010799158261936281) and I went into more technical detail about the specific problem there.

> I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

> I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

I'm not sure about this! There are a few things Claude does that seem unfixable even by updating CLAUDE.md.

Some other footguns I keep seeing in Python and constantly have to fix despite CLAUDE.md instructions are:

- writing lots of nested if clauses instead of writing simple functions by returning early

- putting imports in functions instead of at the top-level

- swallowing exceptions instead of raising (constantly a huge problem)

These are small, but I think it's informative about what the models can do that even Opus 4.5 still fails at these simple tasks.
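For concreteness, here's a minimal Python sketch (the function and its fields are hypothetical, not from any real codebase) of what fixing all three footguns at once looks like: the import at the top level, early raises instead of nested if clauses, and errors propagating instead of being swallowed:

```python
import json  # top-level import, not buried inside the function


def parse_config(raw: str) -> dict:
    """Parse a JSON config string, validating as we go.

    Early raises keep the happy path flat instead of nesting
    if-clauses, and errors propagate instead of being silenced
    by a bare `except: pass`.
    """
    if not raw.strip():
        raise ValueError("config is empty")  # raise, don't silently return {}

    config = json.loads(raw)  # let json.JSONDecodeError propagate to the caller

    if "name" not in config:
        raise KeyError("config missing required 'name' field")

    return config
```

The version Claude tends to write instead wraps the whole body in nested `if raw: if config: ...` blocks with a trailing `except Exception: return {}`, which hides the failure from the caller entirely.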

Kuinox|1 month ago

> My guess is that Claude is trained to bias towards making minimal edits to solve problems.

I don't have the same feeling. I find that claude tends to produce wayyyyy too much code to solve a problem, compared to other LLMs.

joshribakoff|1 month ago

I expect that adding instructions that attempt to undo training produces worse results than not including the overbroad generalization in the training in the first place. I think the author isn't making a complaint; they're documenting a tradeoff.

AIorNot|1 month ago

Well, yes, but the wider point is that it takes new human skills to manage them, like a pair of horses under your bridle, so to speak.

When it comes down to it, these AI tools are like the jump from artisanal-era hand tools to power tools and machines

- like going from a surgical knife to a machine gun - so they operate at a faster pace without comprehending like humans, and without allowing humans time to comprehend all the side effects and massive assumptions they make on every run in their context window.

Humans have to adapt to managing them correctly and at the right scale to be effective, and that becomes something you learn.

threethirtytwo|1 month ago

Definitely. The training parameters encourage this. The AI is actually also deliberately trying to trick you, and we know that for a fact.

Problems with solutions too complicated to explain or to output in one sitting are out of the question. The AI will still bias towards one shot solutions if given one of these problems because all the training is biased towards a short solution.

It's not really practical to give it training data with multi-step, ultra-complicated solutions. Think about it: the thousands of questions given to it for reinforcement... the trainer is going to be trying to knock those out as efficiently as possible, so they have to be readable problems with shorter readable solutions. So we know AI biases towards shorter readable solutions.

Second, any solution that tricks the reader will pass training. There is for sure a subset of question/solution pairs that meets this criterion by definition, because WE as trainers are simply unaware we are being tricked. So this data leaks into the training, and as a result AI will bias towards deception as well.

So all in all it is trained to trick you and give you the best solution that can fit into a context that is readable in one sitting.

In theory we can get it to do what we want only if we had perfect reinforcement data. The reliability we're looking for seems to be just right over this hump.

maxilevi|1 month ago

LLMs are just really good search. Ask it to create something and it's searching within the pretrained weights. Ask it to find something and it's semantically searching within your codebase. Ask it to modify something and it will do both. Once you understand it's just search, you can get really good results.

fennecbutt|1 month ago

I agree somewhat, but more when it comes to its use of logic - it only gleans logic from human language which as we know is a fucking mess.

I've commented before on my belief that the majority of human activity is derivative. If you ask someone to think of a new kind of animal, alien or random object they will always base it off things that they have seen before. Truly original thoughts and things in this world are an absolute rarity and the majority of supposed original thought riffs on what we see others make, and those people look to nature and the natural world for inspiration.

We're very good at taking thing a and thing b and slapping them together and announcing we've made something new. Someone please reply with a wholly original concept. I had the same issue recently when trying to build a magic based physics system for a game I was thinking of prototyping.

bhadass|1 month ago

better mental model: it's a lossy compression of human knowledge that can decompress and recombine in novel (sometimes useful, sometimes sloppy) ways.

classical search simply retrieves, llms can synthesize as well.

dcre|1 month ago

I really don’t think search captures the thing’s ability to understand complex relationships. Finding real bugs in 2000 line PRs isn’t search.

andoando|1 month ago

I'm not sure how anyone can say this. It is really good search, but it's also able to combine ideas, reason about things, and perform fairly complex logic on tasks that surely no one has asked before.

__MatrixMan__|1 month ago

It's a very useful model but not a complete one. You just gotta acknowledge that if you're making something new it's gonna take all day and require a lot of guard rails, but then you can search for that concept later (add the repo to the workspace and prompt at it) and the agent will apply it elsewhere as if it were a pattern in widespread use. "Just search" doesn't quite fit. I've never wondered how best to use a search engine to make something in a way that will be easily searchable later.

johnisgood|1 month ago

Calling it "just search" is like calling a compiler "just string manipulation". Not false, but aggressively missing the point.

godelski|1 month ago

  > Once you understand its just search, you can get really good results.
I think this is understating the issue, ignoring context. It reminds me of how easy people claim searching with search engines is. But there are so many variables that can make results change dramatically. Just like Google search, two people can type in the exact same query and get very different results. But probably the bigger difference is in what people are searching for.

What's problematic with these types of claims is that they just come off as calling anyone who thinks differently dumb. It's as disconnected as saying "It's intuitive" in one breath and "You're holding it wrong" in another. It's a bad mindset to be in as an engineer, because someone presents a problem and, instead of anyone trying to address it, it gets dismissed. If someone is holding it wrong, it probably isn't intuitive[0]. Even if they can't explain the problem correctly, they are telling you a problem exists[1]. That's like 80% of the job of an engineer: figuring out what the actual problem is.

As maybe an illustrative example people joke that a lot of programming is "copy pasting from stack overflow". We all know the memes. There's definitely times where I've found this to be a close approximation to writing an acceptable program. But there's many other times where I've found that to be far from possible. There's definitely a strong correlation to what type of programming I'm doing, as in what kind of program I'm writing. Honestly, I find this categorical distinction not being discussed enough with things like LLMs. Yet, we should expect there to be a major difference. Frankly, there are just different amounts of information on different topics. Just like how LLMs seem to be better with more common languages like Python than less common languages (and also worse at just more complicated languages like C or Rust).

[0] You cannot make something that's intuitive to all people. But you can make it intuitive for most people. We're going to ignore the former case because the size should be very small. If 10% of your users are "holding it wrong" then the answer is not "10% of your users are absolute morons" it is "your product is not as intuitive as you think." If 0.1% of your users are "holding it wrong" then well... they might be absolute morons.

[1] I think I'm not alone in being frustrated with the LLM discourse as it often feels like people trying to gaslight me into believing the problems I experience do not exist. Why is it so surprising that people have vastly differing experiences? *How can we even go about solving problems if we're unwilling to acknowledge their existence?*

disconcision|1 month ago

I've yet to be convinced by any article, including this one, that attempts to draw boxes around what coding agents are and aren't good at in a way that is robust on a 6 to 12 month horizon.

I agree that the examples listed here are relatable, and I've seen similar in my uses of various coding harnesses, including, to some degree, ones driven by opus 4.5. But my general experience with using LLMs for development over the last few years has been that:

1. Initially, models could at best assemble a simple procedural or compositional sequence of commands or functions to accomplish a basic goal, perhaps meeting tests or type checks, but with no overall coherence,

2. To being able to structure small functions reasonably,

3. To being able to structure large functions reasonably,

4. To being able to structure medium-sized files reasonably,

5. To being able to structure large files, and small multi-file subsystems, somewhat reasonably.

So the idea that they are now falling down on the multi-module or multi-file or multi-microservice level is both not particularly surprising to me and also both not particularly indicative of future performance. There is a hierarchy of scales at which abstraction can be applied, and it seems plausible to me that the march of capability improvement is a continuous push upwards in the scale at which agents can reasonably abstract code.

Alternatively, it could be that there is a legitimate discontinuity here, at which anything resembling current approaches will max out, but I don't see strong evidence for that here.

Uehreka|1 month ago

It feels like a lot of people keep falling into the trap of thinking we’ve hit a plateau, and that they can shift from “aggressively explore and learn the thing” mode to “teach people solid facts” mode.

A week ago Scott Hanselman went on the Stack Overflow podcast to talk about AI-assisted coding. I generally respect that guy a lot, so I tuned in and… well it was kind of jarring. The dude kept saying things in this really confident and didactic (teacherly) tone that were months out of date.

In particular I recall him making the “You’re absolutely right!” joke and asserting that LLMs are generally very sycophantic, and I was like “Ah, I guess he’s still on Claude Code and hasn’t tried Codex with GPT 5”. I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong and not flattering my every decision. But that news (which would probably be really valuable to many people listening) wasn’t mentioned on the podcast I guess because neither of the guys had tried Codex recently.

And I can’t say I blame them: It’s really tough to keep up with all the changes but also spend enough time in one place to learn anything deeply. But I think a lot of people who are used to “playing the teacher role” may need to eat a slice of humble pie and get used to speaking in uncertain terms until such a time as this all starts to slow down.

danpalmer|1 month ago

I used to get made up APIs in functions, now I get them in modules. I used to get confidently incorrect assertions in files now I get them across codebases.

Hell, I get poorly defined APIs across files and still get them between functions. LLMs aren't good at writing well-defined APIs at any level of the stack. They can attempt it at levels of the stack they couldn't a year ago, but they're still terrible at it unless the problem is well known enough that they can regurgitate well-reviewed code.

groby_b|1 month ago

LLMs have been bad at creating abstraction boundaries since inception. People have been calling it out since inception. (Heck, even I have a Twitter post somewhere >12 months old calling that out, and I'm not exactly a leading light of the effort.)

It is in no way size-related. The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

wouldbecouldbe|1 month ago

I feel like the main challenge is where to be "loose" and where to be "strict"; Claude often takes too much liberty. Assuming things, adding some mock data to make it work, using local storage because there is no DB. This makes it work well out of the box, and means I can prompt half-assed and have great results. But long term it also causes issues. It can be prompted away, but it needs constant reminders. This seems like a hard problem to solve. I feel like it can already almost do everything if you have the correct vision/structure in mind and have the patience to prompt properly.

Its worst feature is debugging hard errors: it will just keep trying everything and can get pretty wild, instead of entering plan mode and really discussing & thinking things through.

pankajdoharey|1 month ago

Claude is an overrated premium piece of developer tech. I have produced equally good results from Gemini and way better ones with GPT medium, and GPT medium is a really good model at assembling and debugging stuff compared to Claude. Claude hallucinates when asked why something is correct or should be done. All models fail equally in some aspect or other, which points to the fact that these models have strengths and weaknesses, and GPT just happens to be a good overall model. But the dev community is so stuck on Claude for no good reason other than shiny tooling ("Claude Code"); beyond that, the models can be just as bad as the competition. The benchmarks do not tell the full story. In general, though, the rule of thumb is: if the model says "you are brilliant", "that's genius", or "now that's a deep and insightful question", it's time to start a new session.

skybrian|1 month ago

The article is mostly reporting on the present. (Note the "yet" in the title.)

There's only one sentence where it handwaves about the future. I do think that line should have been cut.

lordnacho|1 month ago

By and large, I agree with the article. Claude is great and fast at doing low-level dev work: getting the syntax right in some complicated mechanism, executing an edit-execute-readlog loop, making multi-file edits.

This is exactly why I love it. It's smart enough to do my donkey work.

I've revisited the idea that typing speed doesn't matter for programmers. I think it's still an odd thing to judge a candidate on, but appreciate it in another way now. Being able to type quickly and accurately reduces frustration, and people who foresee less frustration are more likely to try the thing they are thinking about.

With LLMs, I have been able to try so many things that I never tried before. I feel that I'm learning faster because I'm not tripping over silly little things.

bossyTeacher|1 month ago

> I feel that I'm learning faster

Yes, you are feeling that. But is that real? If I take all LLMs from you right now, is your current you still better than your pre-LLM you? When I dream I feel that I can fly and as long as I am dreaming, this feeling is true. But the subject of this feeling never was.

onemoresoop|1 month ago

It’s a bit like the shift from film to digital in one very specific sense: the marginal cost of trying again virtually collapsed. When every take cost money and setup time, creators pre-optimized in their heads and often never explored half their ideas. When takes became cheap, creators externalized thought: they could try, look, adjust, and discover things they wouldn’t otherwise. Creators could wander more. They could afford to be wrong because they weren’t constantly paying a tax for being clumsy or incomplete; they became more willing to follow a hunch, and that’s valuable space to explore.

Digital didn’t magically improve art, but it let many more creatives enter the loop of idea, attempt, and feedback. LLMs feel similar: they don’t give you better ideas by themselves, but they remove the friction that used to stop you from even finding out whether an idea was viable. That changes how often you learn, and how far you’re willing to push a thought before abandoning it. I’ve done so many little projects myself that I would never have had time for, and I feel that I learned something from them. Of course not as much as if I’d had all the pre-LLM friction, but it should still count for something, as I would never have attempted them without this assistance.

Edit: However, the danger isn’t that we’ll have too many ideas, it’s that we’ll confuse movement with progress.

When friction is high, we’re forced to pre-compress thought, to rehearse internally, to notice contradictions before externalizing them. That marination phase (when doing something slowly) does real work: it builds mental models, sharpens the taste and teaches us what not to bother to try. Some of that vanishes when the loop becomes cheap enough that we can just spray possibilities into the world and see what sticks.

A low-friction loop biases us toward breadth over depth. We can skim the surface of many directions without ever sitting long enough in one to feel its resistance. The skill of holding a half-formed idea in our head, letting it collide with other thoughts, noticing where it feels weak, atrophies if every vague notion immediately becomes a prompt.

There’s also a cultural effect. When everyone can produce endlessly, the environment fills with half-baked or shallow artifacts. Discovery becomes harder as signal to noise drops.

And on a personal level, it can hollow out satisfaction. Friction used to give weight to output. Finishing something meant you had wrestled with it. If every idea can be instantiated in seconds, each one feels disposable. You can end up in a state of perpetual prototyping, never committing long enough for anything to become yours.

So the slippery slope is not laziness, it is shallowness, not that people won’t think, but people won’t sit with thoughts. The challenge here is to preserve deliberate slowness inside a world that no longer requires it: to use the cheap loop for exploration, while still cultivating the ability to pause, compress, and choose what deserves to exist at all.

imiric|1 month ago

> Being able to type quickly and accurately reduces

LLMs can generate code quickly. But there's no guarantee that it's syntactically, let alone semantically, accurate.

> I feel that I'm learning faster because I'm not tripping over silly little things.

I'm curious: what have you actually learned from using LLMs to generate code for you? My experience is completely the opposite. I learn nothing from running generated code, unless I dig in and try to understand it. Which happens more often than not, since I'm forced to review and fix it anyway. So in practice, it rarely saves me time and energy.

I do use LLMs for learning and understanding code, i.e. as an interactive documentation server, but this is not the use case you're describing. And even then, I have to confirm the information with the real API and usage documentation, since it's often hallucinated, outdated, or plain wrong.

mikece|1 month ago

In my experience Claude is like a "good junior developer" -- it can do some things really well and FUBARs other things, but on the whole it's something to which tasks can be delegated if things are well explained. If/when it gets to the ability level of a mid-level engineer it will be revolutionary. Typically a mid-level engineer can be relied upon to do the right thing with no/minimal oversight, can figure out incomplete instructions, and can deliver quality results (and even train up the juniors on some things). At that point the only reason to have human junior engineers is so they can learn their way up the ladder to being an architect responsible for coordinating swarms of Claude Agents to develop whole applications and complete complex tasks and initiatives.

Beyond that what can Claude do... analyze the business and market as a whole and decide on product features, industry inefficiencies, gap analysis, and then define projects to address those and coordinate fleets of agents to change or even radically pivot an entire business?

I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.

alfalfasprout|1 month ago

> I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.

At that point, why do you even need the CEO?

0x457|1 month ago

My experience with Claude (and other agents, but mostly Claude) is such a mixed bag. Sometimes it takes a minimal prompt and 20 minutes later produces a neat PR and all is good; sometimes it doesn't. Sometimes it takes in a large prompt (be it your own, created by another LLM, or produced by plan mode) and likewise either succeeds or fails.

For me, most of the failure cases are where Claude couldn't figure something out due to conflicting information in context, and instead of just stopping and telling me, it tries to solve it in an entirely wrong way. It doesn't help that it often makes the same assumptions I would, so when I read the plan it looks fine.

Level of effort is also hard to gauge, because it can finish in an hour things that would take me a week, or take an hour to do something I can do in 20 minutes.

It's almost like you have to enforce two levels of compliance: does the code do what the business demands, and does the code align with the codebase. The first one is relatively easy, but doing only that will produce odd results where Claude generates +1 KLOC because it didn't look at some_file.{your favorite language extension} during exploration.

Or it creates 5 versions of legacy code on the same feature branch. My brother in Christ, what are you trying to stay compatible with? A commit that's about to be squashed and forgotten? Then it's going to do a compaction, forget which of these 5 versions is "live", and update the wrong one.

It might do good junior dev work, but it must be reviewed as if it came from a junior dev who got hired today and this is their first PR.

imiric|1 month ago

> In my experience Claude is like a "good junior developer"

We've been saying this for years at this point. I don't disagree with you[1], but when will these tools graduate to "great senior developer", at the very least?

Where are the "superhuman coders by end of 2025" that Sam Altman has promised us? Why is there such a large disconnect between the benchmarks these companies keep promoting, and the actual real world performance of these tools? I mean, I know why, but the grift and gaslighting are exhausting.

[1]: Actually, I wouldn't describe them as "good" junior either. I've worked with good junior developers, and they're far more capable than any "AI" system.

ChicagoDave|1 month ago

I have several projects that counter this article. Not sure why, but I’ve extracted clean, readable, well-constructed, and well-tested code.

I might write something up at some point, but I can share this:

https://github.com/chicagodave/devarch/

New repo with guides for how I use Claude Code.

Scrapemist|1 month ago

Interesting. So you put these into the project folder for Claude to follow?

michalsustr|1 month ago

This article resonates exactly how I think about it as well. For example, at minfx.ai (a Neptune/wandb alternative), we cache time series that can contain millions of floats for fast access. Any engineer worth their title would never make a copy of these and would pass around pointers for access. Opus, when stuck in a place where passing the pointer was a bit more difficult (due to async and Rust lifetimes), would just make the copy, rather than rearchitect or at least stop and notify user. Many such examples of ‘lazy’ and thus bad design.

alphazard|1 month ago

This sounds suspiciously like the average developer, which is what the transformer models have been trained to emulate.

Designing good APIs is hard, being good at it is rare. That's why most APIs suck, and all of us have a negative prior about calling out to an API or adding a dependency on a new one. It takes a strong theory of mind, a resistance to the curse of knowledge, and experience working on both sides of the boundary, to make a good API. It's no surprise that Claude isn't good at it, most humans aren't either.

joshcsimmons|1 month ago

IDK I've been using opus 4.5 to create a UI library and it's been doing pretty well: https://simsies.xyz/ (still early days)

Granted, it was building on top of Tailwind (shifting over to Radix after the layoff news). Which begs the question: what is a Lego?

threethirtytwo|1 month ago

I don't know how someone can look at what you built and conclude LLMs are still Google search. It boggles the mind how much hatred people have for AI, to the point of self-deception. The evidence is placed right in front of you, in your lap, with that link, and people still deny it.

subdavis|1 month ago

FYI the cursor animation runs before the font loads if the font isn’t ready yet.

dehugger|1 month ago

Your GitHub repo was highly entertaining. Thanks for making my day a bit brighter :)

Scrapemist|1 month ago

Eventually you can show Claude how you solve problems, and explain the thought process behind it. It can apply these learnings but it will encounter new challenges in doing so. It would be nice if Claude could instigate a conversation to go over the issues in depth. Now it wants quick confirmation to plough ahead.

fennecbutt|1 month ago

Well, I feel like this is because a better system would distill such learning into tokens not tied to any human language, and those could represent logic better than using English etc. for it.

I don't have the GPUs or time to experiment though :(

0xbadcafebee|1 month ago

I don't think it's possible to make an AI a "Senior Engineer", or even a good engineer, by training it on random crap from the internet. It's got a million brains' worth of code in it. That means bad patterns as well as good. You'd need to remove the bad patterns for it not to "remember" and regurgitate them. I don't think prompts help with this either, it's like putting a band-aid on head trauma.

HarHarVeryFunny|1 month ago

It's also rather like trying to learn flint knapping just by looking at examples of knapped flint (maybe some better than others), rather than having access to descriptions of how to do it, let alone any practice doing it.

You could also use cooking as an analogy - trying to learn to cook by looking at pictures of cooked food rather than by having gone to culinary school and learnt the principles of how to actually plan and cook good food.

So, we're trying to train LLMs to code, by giving them "pictures" of code that someone else built, rather than by teaching them the principles involved in creating it, and then having them practice themselves.

Havoc|1 month ago

> Claude can’t create good abstractions on its own

LLMs definitely can create abstractions and boundaries. e.g. most will lean towards a pretty clean front end vs backend split even without hints. Or work out a data structure that fits the need. Or splits things into free standing modules. Or structure a plan into phases.

So this really just boils down to "good" abstractions, which is subject to model improvement.

I really don’t see a durable moat for us meatbags in this line of reasoning

HarHarVeryFunny|1 month ago

There's a difference between "can generate" and "can create [from scratch]". Of course LLMs can generate code that reflects common patterns in the stuff it was trained, such as frontend/backend splits, since this is precisely what they are trained to be able to do.

Coming up with a new design from scratch, designing (or understanding) a high level architecture based on some principled reasoning, rather than cargo cult coding by mimicking common patterns in the training data, is a different matter.

LLMs are getting much better at reasoning/planning (or at least something that looks like it), especially for programming & math, but this is still based on post-training, mostly RL, and what they learn obviously depends on what they are trained on. If you wanted LLMs to learn principles of software architecture and abstraction/etc, then you would need to train on human (or synthetic) "reasoning traces" of how humans make those decisions, but it seems that currently RL-training for programming is mostly based on artifacts of reasoning (i.e. code), not the reasoning traces themselves that went into designing that code, so this (coding vs design reasoning) is what they learn.

I would guess that companies like Anthropic are trying to address this paucity of "reasoning traces" for program design, perhaps via synthetic data, since this is not something that occurs much in the wild, especially as you move up the scale of complexity from small problems (student assignments, stack overflow advice) to large systems (which are anyways mostly commercial, hence private). You can find a smallish number of open source large projects like gcc, linux, but what is missing are the reasoning traces of how the designers went from requirements to designing these systems the way they did (sometimes in questionable fashion!).

Humans of course learn software architecture in a much different way. As with anything, you can read any number of books, attend any number of lectures, on design principles and software patterns, but developing the skill for yourself requires hands-on personal practice. There is a fundamental difference between memory (of what you read/etc) and acquired skill, both in level of detail and fundamental nature (skills being based on action selection, not just declarative recall).

The way a human senior developer/systems architect acquires the skill of design is by practice, by a career of progressively more complex projects, successes and failures/difficulties, and learning from the process. By learning from your own experience you are of course privy to your own prior "reasoning traces" and will learn which of those lead to good or bad outcomes. Of course learning anything "on the job" requires continual learning, and things like curiosity and autonomy, which LLMs/AI do not yet have.

Yes, us senior meatbags, will eventually be having to compete with, or be displaced by, machines that are the equal of us (which is how I would define AGI), but we're not there yet, and I'd predict it's at least 10-20 years out, not least because it seems most of the AI companies are still LLM-pilled and are trying to cash in on the low-hanging fruit.

Software design and development is a strange endeavor since, as we have learnt, one of the big lessons of LLMs (in general, not just apropos coding), is how much of what we do is (trigger alert) parroting to one degree or another, rather than green field reasoning and exploration. At the same time, software development, as one gets away from boilerplate solutions to larger custom systems, is probably one of the more complex and reasoning-intensive things that humans do, and therefore may end up being one of the last, rather than first, to completely fall to AI. It may well be AI managers, not humans, who finally say that at last AI has reached human parity at software design, able to design systems of arbitrary complexity based on principled reasoning and accumulated experience.

machiaweliczny|1 month ago

Yeah, that's my current gripe, but I think this just needs some good examples in AGENTS.md (I've done some for hooks and it kinda works, but I need to remind it). I need a good AGENTS.md that explains what a good abstraction boundary is and how to define it. The problem is I'm not sure I know how to put it into words; if anyone has an idea, please let me know.
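One possible way to put it into words — a hypothetical AGENTS.md fragment, not a tested recipe, with the rules of thumb invented for illustration:

```markdown
## Abstraction boundaries

- A module's public interface should be describable in one sentence
  without naming its internals.
- Callers must not need to know *how* a module does its job, only
  *what* it promises. If a change inside the module forces edits in
  its callers, the boundary is in the wrong place.
- Prefer a few stable interfaces over many narrow, chatty ones.
- Before adding a parameter to a public function, ask whether the
  caller should even know about that concept.
```

Whether an agent actually follows rules like these without per-task reminders is exactly the open question in this thread.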

EGreg|1 month ago

This is exactly what we found out a year ago for all AI builders. But what is the best way to convince early investors of this thesis? They seem to be all-in on just building everything from scratch end-to-end. Here is what we built:

https://engageusers.ai/ecosystem.pdf

malka1986|1 month ago

I am making an app in Elixir.

100% of code is made by Claude.

It is damn good at making "blocks".

However, Elixir seems to be a language that works very well for LLMs, cf. https://elixirforum.com/t/llm-coding-benchmark-by-language/7...

redfloatplane|1 month ago

Hmm, that benchmark seems a little flawed (as pointed out in the paper). Seems like it may give easier problems for "low-resource" languages such as Elixir and Racket and so forth since their difficulty filter couldn't solve harder problems in the first place. FTA:

> Section 3.3:

> Besides, since we use the moderately capable DeepSeek-Coder-V2-Lite to filter simple problems, the Pass@1 scores of top models on popular languages are relatively low. However, these models perform significantly better on low-resource languages. This indicates that the performance gap between models of different sizes is more pronounced on low-resource languages, likely because DeepSeek-Coder-V2-Lite struggles to filter out simple problems in these scenarios due to its limited capability in handling low-resource languages.

It's also now a little bit old, as with every AI paper the second they are published, so I'd be curious to see a newer version.

But, I would agree in general that Elixir makes a lot of sense for agent-driven development. Hot code reloading and "let it crash" are useful traits in that regard, I think

joduplessis|1 month ago

Recently I've put Claude/others to use in some agentic workflows with easy menial/repetitive tasks. I just don't understand how people are using these agents in production. The automation is absolutely great, but it requires an insane amount of hand-holding and cleanup.

baq|1 month ago

Automate the hand-holding and cleanup, obviously. (Also known as a "harness".)

iamleppert|1 month ago

I use Claude daily and I 100% disagree with the author. The article reeks of someone who doesn't understand how to manage context appropriately or describe their requirements, or know how to build up a task iteratively with a coding agent. If you have certain requirements or want things done in a certain way, you need to be explicit and the order of operations you do things in matters a lot in how efficient it completes the task, and the quality of the final output. It's very good at doing the least amount of work to just make something work by default, but that's not always what you want. Sometimes it is. I'd much rather prefer that as the default mode of operation than something that makes a project out of every little change.

The developers who aren't figuring out how to leverage AI tools and make them work for them are going to get left behind very quickly. Unless you're in the top tier of engineers, I'm not sure how one can blame the tools at this point.

anshumankmr|1 month ago

IDK, it's been pretty solid (but it does mess up, which is where I come in). It has helped me work with Databricks (reading/writing from it) and train a model using it for some of our customers, though it's NOT in prod.

doug_durham|1 month ago

Did the author ask it to make new abstractions? In my experience, when it produces output that I don't like, I ask it to refactor it. These models have an understanding of all modern design patterns. Just ask them to adopt one.

bblcla|1 month ago

(Author here)

I have! I agree it's very good at applying abstractions, if you know exactly what you want. What I notice is that Claude has almost no ability to surface those abstractions on its own.

When I started having it write React, Claude produced incredibly buggy spaghetti code. I had to spend 3 weeks learning the fundamentals of React (how to use hooks, providers, stores, etc.) before I knew how to prompt it to write better code. Now that I've done that, it's great. But it's meaningful that someone who doesn't know how to write well-abstracted React code can't get Claude to produce it on their own.

esafak|1 month ago

> Claude doesn’t have a soul. It doesn't want anything.

Ha! I don't know what that has to do with anything, but this is exactly what I thought while watching Pluribus.

jondwillis|1 month ago

Regardless, yet another path to the middle class is closing for a lot of people. RIP (probably me too)

geldedus|1 month ago

The level of anti-AI cope is so entertaining!

lxe|1 month ago

Eh. This is yet another "I tried AI to do a thing, and it didn't do it the way I wanted it, therefore I'm convinced that's just how it is... here's a blog about it" article.

"Claude tries to write React, and fails"... how many times? what's the rate of failure? What have you tried to guide it to perform better.

These articles are similar to HN 15 years ago when people wrote "Node.JS is slow and bad"

MarginalGainz|1 month ago

This mirrors my experience trying to integrate LLMs into production pipelines.

The issue seems to be that LLMs treat code as a literary exercise rather than a graph problem. Claude is fantastic at the syntax and local logic ('assembling blocks'), but it lacks the persistent global state required to understand how a change in module A implicitly breaks a constraint in module Z.

Until we stop treating coding agents as 'text predictors' and start grounding them in an actual AST (Abstract Syntax Tree) or dependency graph, they will remain helpful juniors rather than architects.
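A minimal sketch of that kind of grounding, using Python's stdlib `ast` module to build the import graph and ask which modules a change could ripple into (module names and sources here are invented for illustration):

```python
# Ground "what does changing module A affect?" in the import graph
# rather than raw text: parse imports with the stdlib `ast` module,
# invert the graph, and walk it transitively.
import ast
from collections import defaultdict

def imported_modules(source: str) -> set[str]:
    """Return top-level module names imported by a piece of Python source."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module.split(".")[0])
    return deps

def affected_by(change: str, sources: dict[str, str]) -> set[str]:
    """Modules that (transitively) import `change` and so may break."""
    importers = defaultdict(set)  # module -> modules that import it
    for mod, src in sources.items():
        for dep in imported_modules(src):
            importers[dep].add(mod)
    seen, stack = set(), [change]
    while stack:
        for mod in importers[stack.pop()]:
            if mod not in seen:
                seen.add(mod)
                stack.append(mod)
    return seen
```

This only captures import-level coupling, not the implicit constraints the parent comment mentions, but even a blast-radius report like `affected_by("a", sources)` is a structural signal an agent could be handed alongside the diff.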