totallykvothe | 1 month ago

I don't understand the stance that AI currently is able to automate away non-trivial coding tasks. I've tried this consistently since GPT 3.5 came out, with every single SOTA model up to GPT 5.1 Codex Max and Opus 4.5. Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing. So many important details are subpar about the AI solution, and many times fundamental architectural issues cripple any attempt at prompting my way out of it, even though I've been quite involved step-by-step through the whole prototyping phase.

I just have to conclude 1 of 2 things:

1) I'm not good at prompting, even though I'm one of the earliest adopters of AI for coding that I know, and I have been at it consistently for years. So I find this hard to accept.

2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.

I'm not sure what else I can take from the situation. For context, I work on a 15-year-old Java Spring + React (with some old pages still in Thymeleaf) web application. There are many sub-services, two separate databases, and the application also needs a two-way interface with customer hardware. So, not a simple project, but still. I can't imagine it's way more complicated than most enterprise/legacy projects...


unyttigfjelltol|1 month ago

> non-trivial coding tasks

I’ve come back to the idea LLMs are super search engines. If you ask it a narrow, specific question, with one answer, you may well get the answer. For the “non-trivial” questions, there always will be multiple answers, and you’ll get from the LLM all of these depending on the precise words you use to prompt it. You won’t get the best answer, and in a complex scenario requiring highly recursive cross-checks— some answers you get won’t be functional.

It’s not readily apparent at first blush the LLM is doing this, giving all the answers. And, for a novice who doesn’t know the options, or an expert who can scan a list of options quickly and steer the LLM, it’s incredibly useful. But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.

20k|1 month ago

I wish LLMs were good at search. I've tried to evaluate them many times for their quality at answering research questions for astrophysics (specifically numerical relativity). If they were good at answering questions, I'd use them in a heartbeat.

Without exception, for every technical question I've ever asked an LLM that I know the answer to, the answer has been substantially wrong in some fashion. This makes it just... absolutely useless for research. In some cases I've spotted it straight up plagiarising from the original sources, with random capitalisation giving it away.

The issue is that once you get even slightly into a niche, they fall apart, because the training data just doesn't exist. But they don't say "sorry, there's insufficient training data to give you an answer"; they just make shit up and state it confidently.

PeterStuer|1 month ago

An example I had last month. A code package (dealing with PDFs) ran into a resource problem in production. The LLM suggested an adaptation to the segment that caused the problem, but that code pulled in 3 new non-trivial dependencies. I added constraints, and in the next iteration it dropped 1 of the 3. I pushed further, and it confirmed my suggestion that the 2 remaining dependencies could be covered just by specifying an already existing parameter in the constructor.

The real problem, btw, was a bug introduced in the PDF handling package 2 versions ago that caused resource handling problems in some contexts, and the real solution was rolling back to the version before the bug.

I'm still using AI daily in my development though; as long as you sort of know what you are doing and have enough knowledge to evaluate it, it is very much a net productivity multiplier for me.

friendzis|1 month ago

> But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.

Typical iterative-circular process "write code -> QA -> fix remarks" works because the code is analyzable and "fix" is on average cheaper than "write", therefore the process, eventually, converges on a "correct" solution.

LLM prompting is on average much less analyzable (if at all) and therefore the process "prompt LLM -> QA -> fix prompt" falls somewhere between "does not converge" and "convergence tail is much longer".

This is consistent with the typical observations of where LLMs work better: greenfield "slap something together" implementations, and modifying a well-structured, uncoupled existing codebase. Both are situations where convergence is easier in the first place, i.e. low existing entropy.

dividedbyzero|1 month ago

They don't even really do that, IME. If I ask Claude or ChatGPT to generate Terraform for non-trivial but by no means obscure or highly unusual setups, they almost invariably hallucinate part of the answer, even when a documented solution exists that isn't even that difficult. Maybe vibe coding JavaScript is that much better, or I'm just hopeless at prompting, but I feel a few dozen lines of fairly straightforward Terraform config shouldn't require elaborate prompt setups; I might as well save some brain cycles by writing it myself.

IAmGraydon|1 month ago

>I’ve come back to the idea LLMs are super search engines.

Yes! This is exactly what it is. A search engine with a lossy-compressed dataset of most public human knowledge, which can return the results in natural language. This is the realization that will pop the AI bubble, if the public could ever bring themselves to ponder it en masse. Is such a thing useful? Hell yes! Is such a thing intelligent? Certainly NO!

PunchyHamster|1 month ago

That would be true if not for LLMs making up answers where none exist.

Like, I've seen Claude go thru the source code of a program, correctly (!) telling me which counters in the code return the value I need (I just wanted to look at some packet metrics), then inventing an entirely fake CLI command to extract those metrics.

carlmr|1 month ago

>It’s not readily apparent at first blush the LLM is doing this, giving all the answers.

Now I'm wondering if I'm prompting wrong. I usually get one answer. Maybe a few options but rarely the whole picture.

I do like the super search engine view though. I often know what I want, but e.g. work with a language or library I'm not super familiar with. So then I ask how do I do x in this setting. It's really great for getting an initial idea here.

Then it gives me maybe one or two options, but they're verbose or add unneeded complexity. Then I start probing asking if this could be done another way, or if there's a simpler solution to this.

Then I ask what are the trade-offs between solutions. Etc.

It's maybe a mix of search engine and rubber ducking.

Agents are, like for OP, a complete failure for me though. Still can't get them to not run off into a completely strange direction, leaving a minefield of subtle coding errors and spaghetti behind.

XenophileJKO|1 month ago

I'm not going to argue about how capable the models are, I personally think they are pretty capable.

What I will argue is that the LLMs are not just search engines. They have "compressed" knowledge. When they do this, they learn relations between all kinds of different levels of abstractions and meta patterns.

It is really important to understand that the model can follow logical rules and has some map of meta relationships between concepts.

Thinking of an LLM as a "search engine" is just fundamentally wrong about how they work, especially when connected to external context like code bases or live information.

daxfohl|1 month ago

Agreed, but:

There's been a notable jump over the course of the last few months, to where I'd say it's inevitable. For a while I was holding out for them to hit a ceiling where we'd look back and laugh at the idea they'd ever replace human coders. Now, it seems much more like a matter of time.

Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well. The idea being, they market directly to the product team, not to engineers who may have specific experience with one language, framework, database, or whatever.

I also think we'll see a revival of monolithic architectures. Right now, services are split up mainly because project/team workflows are also distributed so they can be done in parallel while minimizing conflicts. As AI makes dev cycles faster that will be far less useful, while having a single house for all your logic will be a huge benefit for AI analysis.

sublinear|1 month ago

This doesn't make any sense. If the business can get rid of their engineers, then why can't the user get rid of the business providing the software? Why can't the user use AI to write it themselves?

I think instead the value is in getting a computer to execute domain-specific knowledge organized in a way that makes sense for the business, and in the context of those private computing resources.

It's not about the ability to write code. There are already many businesses running low-code and no-code solutions, yet they still have software engineers writing integration code, debugging and making tweaks, in touch with vendor support, etc. This has been true for at least a decade!

That integration work and domain-specific knowledge is already distilled out at a lot of places, but it's still not trivial. It's actually the opposite. AI doesn't help when you've finally shaved the yak smooth.

concats|1 month ago

> Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement"

The way I see it, there will always be a layer in the corporate organization where someone has to interact with the machine. The transitioning layer from humans to AIs. This is true no matter how high up the hierarchy you replace the humans, be it the engineers layer, the engineering managers, or even their managers.

Given the above, it feels reasonable to believe that whoever is responsible for converting human management's ideas into prompts (or whatever the future replaces text prompts with) will do a better job if they have a high degree of technical competence. That is to say, I believe most companies will still want, and benefit from, those employees being engineers, converting non-technical CEO fever dreams and ambitions into strict technical specifications and prompts.

What this means for us, our careers, or Anthropic's marketing department, I cannot say.

catlifeonmars|1 month ago

I actually think it’s the opposite. We’ll see fewer monorepos because small, scoped repos are the easiest way to keep an agent focused and reduce the blast radius of their changes. Monorepos exist to help teams of humans keep track of things.

ericmcer|1 month ago

If you research how something like Cursor works I don't think you would believe it is inevitable. The jump that would have to happen for it to replace engineers entirely is insurmountable. They can keep expanding contexts and coming up with clever ways to augment generation but I don't see it ever actually having full vision on the system, product and users.

Beyond that, it is incredibly biased towards existing code & prompt content. If you wanted to build a voice chat app and asked "should I use websockets or http?", it would say websockets. It won't override you and say "use neither, you should use WebRTC", but an experienced engineer would instantly spot that the prompt itself is flawed. LLMs will just bias towards existing tokens in the prompt and won't surface data that would challenge the question itself.

smt88|1 month ago

There's no chance LLMs will be an engineering team replacement. The hallucination problem is unsolvable and catastrophic in some edge cases. Any company using such a team would be uninsurable and sued into oblivion.

twelvedogs|1 month ago

honestly i think they got the low-hanging fruit already. they're bumping up against the limits of what they can do, and while it's impressive, it's not spectacular

mvkel|1 month ago

> Other people are just less picky than I am

I think this is part of it.

When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.

If you put those choices aside and simply say: does it accomplish the goal per the spec (and is safe and scalable[0]), then you can get away with a lot more without the end user ever having a clue.

Sure, there's the argument for maintainability, and vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC. But it used to be 2,000 LOC just a couple of years ago. Temporary problem.

[0]insisting that something be scalable isn't even necessary imo

newsoftheday|1 month ago

> When coding style has been established

It feels like you're diminishing the parent commenter's views, reducing it to the perspective of style. Their comment didn't mention style.

matsemann|1 month ago

> with no detectable benefit to the user

Except the fact that the idioms and patterns used means that I can jump in and understand any part of the codebase, as I know it will be wired up and work the same as any other part.

eru|1 month ago

> When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.

Morphing an already decent PR into a different coding style is actually something that LLMs should excel at.

rezonant|1 month ago

I've seen vibe coding fall apart at 600 lines of code. It turns out lines of code is not a good metric for this or any other purpose.

saltyoutburst|1 month ago

Do you have any references for "vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC"? I haven't personally vibed up anything with that many LOC, so I'm legitimately curious if we have solid numbers yet for when this starts to happen (and for which definitions of "collapse").

upcoming-sesame|1 month ago

you don't even have to put these choices aside too much, you can have very detailed linting rules that nudge the LLM towards the style you want.

lopatin|1 month ago

At work, I have the same difficulty using AI as you. When working on deep Jiras that require a lot of domain knowledge, bespoke testing tools, but maybe just a few lines of actual code changes across a vast codebase, I have not been able to use it effectively.

For personal projects, on the other hand, it has sped me up by what, 10x? 30x? It's not measurable. My output has been so much more than what would have been possible earlier that there is no benchmark, because projects at this level would not have been getting completed in the first place.

Back to using at work: I think it's a skill issue. Both on my end and yours. We haven't found a way to encode our domain knowledge into AI and transcend into orchestrators of that AI.

nikita2206|1 month ago

> deep Jiras that require a lot of domain knowledge, bespoke testing tools, but maybe just a few lines of actual code changes

How do new hires onboard? Do you spend days of your own time guiding them in person, do they just figure things out on their own after a few quarters of working on small tickets, or are things documented? Basically, AI, when working on a codebase, has the same level of context that a new hire would have, so if you want it to get started faster, provide it with ample documentation.

antirez|1 month ago

After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?

Also: in my experience, 1 and 2 are not needed for you to have bad results. The existing code base is a fundamental variable: the more complex / convoluted it is, the worse the result. Also, in my experience, LLMs are consistently better at producing C code than anything else (Python included).

I have the feeling that the simplicity of the code bases I produced over the years, and that now I modify with LLMs, and the fact they are mostly in C, is a big factor why LLMs appear to work so well for me.

Another thing: Opus 4.5 for me is bad on the web, compared to Gemini 3 Pro / GPT 5.2, and very good if used with Claude Code, since it needs to reiterate to reach the solution, while the others are sometimes better first-shotters. If you generate code via the web interface, this could be another cause.

There are tons of variables.

dspillett|1 month ago

> After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?

This is one of my problems with the whole thing, at least from a programming PoV. Even though superficially it seems like the ST:TNG approach of using an intelligent but not aware computer as a tool to collaboratively solve a problem, it is really more like guiding a junior through something complex. Guiding a junior (or even some future AGI) that way is definitely a good thing: if I am a good guide they will learn from the experience, so it is a useful knowledge-sharing process. That isn't a factor for an LLM (at least not the current generations). But if I understand the issue well enough to be a good guide, and there is no teaching benefit external to me, I'd rather do it myself and at most use the LLM as a glorified search engine to help muddle through bad documentation for hidden details.

That and TBH I got into techie things because I like tinkering with the details. If I thought I'd not dislike guiding others doing the actual job, I'd have not resisted becoming a manager throughout all these years!

embedding-shape|1 month ago

> After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?

I think this is the wrong approach; having "wrong code" in the context already makes every response after it worse.

Instead, try restarting, but this time specify exactly how you expected that 70% of the code to actually have worked, from the get go. Often, LLMs seem to make choices because they have to, and if you think they made the wrong choice, you can often find that you didn't actually specify something well enough, hence the LLM had to do something, since apparently the single most important thing for them is that they finish something, no matter how right or wrong.
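
To make that concrete: instead of iterating on the bad output, a restart prompt spells the choices out up front. A made-up illustration (RetryPolicy and PaymentUnavailableException are hypothetical names, not from any real codebase):

    Add retry logic to the payment client.
    - Reuse the existing RetryPolicy class; do not write a new one.
    - Retry only on timeouts and 5xx responses, max 3 attempts,
      exponential backoff.
    - No new dependencies. No comments. Surface the final failure
      as PaymentUnavailableException.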

After a while, you'll get better at knowing what you have to be precise, specific and "extra verbose" about, compared to other things. That also seems to depend on the model: with Gemini you can have 5 variations of "Don't add any comments" and it adds them anyway, but say it once to the GPT/Claude family of models and they seem to get it at once.

vignesh37|1 month ago

The biggest frustration with LLMs for me is people telling me I'm not prompting it in a good way. Think about any other product where they sell you something half-baked and then repeatedly tell you that you are not using it properly.

simonw|1 month ago

But that's not how most products work.

If you buy a table saw and can't figure out how to cut a straight line in a piece of wood with it - or keep cutting your fingers off - but didn't take any time at all to learn how to use it, that's on you.

Likewise a car, you have to take lessons and a test before you can use those!

Why should LLMs be any different?

AuryGlenz|1 month ago

Have you seen the way some people google/prompt? It can be a murder scene.

Not coding related but my wife is certainly better than most and yet I’ve had to reprompt certain questions she’s asked ChatGPT because she gave it inadequate context. People are awful at that. Us coders are probably better off than most but just as with human communication if you’re not explaining things correctly you’re going to get garbage back.

tomjen3|1 month ago

If my mum buys a copy of Visual Studio, is it their fault if she cannot code?

khafra|1 month ago

> Non-trivial coding tasks

A coding agent just beat every human in the AtCoder Heuristic optimization contest. It also beat the solution that the production team for the contest put together. https://sakana.ai/ahc058/

It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.

tete|1 month ago

> AtCoder Heuristic optimization contest

Optimization space that has been automated before LLMs. Big surprise, machines are still better at this.

This feels a bit like comparing programming teams to automated fuzzing.

In fact, developing algorithms has not infrequently involved some kind of automated algorithm testing, where the algorithm is permuted in an automatic manner.

It's also a bit like how OCR and a couple of other fields (protein folding) are better done in an automated manner.

The fact that now this is done by an LLM, another machine isn't exactly surprising. Nobody claims that computers aren't good at these kinds of tasks.

fmbb|1 month ago

> It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.

Optimization is a very simple problem though.

Maintaining a random CRUD app from some startup is harder work.

PunchyHamster|1 month ago

Compilers beat most coders before LLMs were even popular.

tripzilch|1 month ago

had to scroll far to find the problem description

> AHC058, held on December 14, 2025, was conducted over a 4-hour competition window. The problem involved a setting where participants could produce machines with hierarchical relationships, such as multiple types of “apple-producing machines” and “machines that build those machines.” The objective was to construct an efficient production planning algorithm by determining which types and hierarchies of machines to upgrade and in what specific order.

... so not a CRUD app but it beat humans at Cookie Clicker? :-)

sublinear|1 month ago

I think you're spot on.

So many people hyping AI are only thinking about new projects and don't even distinguish between what is a product and what is a service.

Most software devs employed today work on maintaining services that have a ton of deliberate decisions baked in that were decided outside of that codebase and driven by business needs.

They are not building shiny new products. That's why most of the positive hype about AI doesn't make sense when you're actually at work and not just playing around with personal projects or startup POCs.

torginus|1 month ago

Personally I've yet to see any high-profile programming person (who's not directly invested in AI) endorse coding purely by prompting.

Experienced coders that I follow who do use AI tend to focus on tight and fast feedback loops and precise edits (or maybe exploratory coding), rather than agentic fire-and-forget workflows.

Also, an interesting side note: I expected programmers I think of as highly skilled, who I know personally, to reject AI out of personal pride - that has not been the case. However, 2 criticisms I've heard consistently from this crowd (besides the thing I mentioned before) were:

- AI makes hosting and participating in coding competitions impossible, and denies them brain-teasers and a way to hone their skills.

- A lot of them are concerned about the ethics of training on large codebases - and consider AI plagiarism as much of an issue as artists do.

PunchyHamster|1 month ago

It's the second.

Like, yes, prompting is a skill and you need to learn it for AI to do something useful, but usefulness quickly falls off a cliff once you go past "greenfield implementation", "basically example code", or "the thing done so often that the AI has a lot of references to pull from"; it quickly gets into a kinda-sorta-but-not-really-working state.

It can still be used effectively on smaller parts of the codebase (I used it a lot to generate boilerplate to run tests, even if I had to rewrite a bunch of the actual tests), but as a whole it is very, very overrated by the AI peddlers.

And that probably stems from the fact that for the clueless ones it looks like an amazing productivity boost, because they go from "not even knowing the framework" to "somewhat working app".

nicce|1 month ago

People here already say that they don't even look at the code anymore. "That is AI's job." As long as there is a spec and tests pass, they are happy! I just can't do that.

saxenaabhi|1 month ago

Why not post a github gist with prompt and code so that people here can give you their opinion?

Madmallard|1 month ago

Those just don't appear at all on HackerNews

Gee I wonder why

Balinares|1 month ago

That's been pretty much exactly my experience too.

For what it's worth, multiple times in my career, I've worked at shops that once thought they could do it quick and cheap and it would be good enough, and then had to hire someone 'picky' like me to sort out the inevitable money-losing mess.

From what I've seen even Opus 4.5 spit out, the "picky" are going to remain in demand for a little while longer still. Will that last? No clue. We'll see.

cpursley|1 month ago

You can be picky with Opus, just yell at it to refactor a few times. To reduce refactor cycles, give it correct and enough context before you start along with expected code style, etc. These things aren't one shot magic machines.

AlexCoventry|1 month ago

> I don't understand the stance that AI currently is able to automate away non-trivial coding tasks.

I'm happy enough for it to automate away the trivial coding tasks. That's an immense force multiplier in its own right.

KronisLV|1 month ago

> I end up rewriting about 70% of the thing.

Doesn't match my experience; that figure is closer to 20-40% for me, and a lot of the changes I want are achievable by just further prompting, OR by turning to a different model, or by adding some automated checks that promptly fail so the AI can do a few more loops of fixes.

> Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.

This is also likely; or you are just doing stuff that is less well represented in the training data, or working on novel things where the output isn't as good. But I'm leaning towards people just being picky about what they view as "good code" (or underspecifying how the AI is supposed to output it), at least since roughly Sonnet 4; with some people I work with, code review is just endless and oftentimes meaningless discussion and bikeshedding.

You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."

9dev|1 month ago

> You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."

…and then add that to your CLAUDE.md, and never worry about having to say it again manually.
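
As a minimal sketch (the paths and naming conventions here are invented for illustration, not a recommendation):

    ## Code style
    - Follow the patterns in src/main/java/com/example/billing:
      service + repository split, constructor injection, no static state.
    - Match the existing naming: *Service, *Repository, *Dto.
    - New endpoints get tests next to the existing ones.
    - Don't add comments unless the logic is non-obvious.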

jstummbillig|1 month ago

> Every single time [...] I end up rewriting about 70% of the thing

If that number has not significantly changed since GPT 3.5, I think it's safe to assume that something very weird is happening on your end.

dns_snek|1 month ago

I think I know what they mean; I share a similar experience. It has changed: 3.5 couldn't even attempt to solve non-trivial tasks, so it was a 100% failure; now it's 70%.

willtemperley|1 month ago

I get the best results when using code to demonstrate my intention to an LLM, rather than try and explain it. It doesn't have to be working code.

I think that mentally estimating the problem space helps. These things are probabilistic models, and if there are a million solutions, the chance of getting the right one is clearly low.

Feeding back results from tests really helps too.

virgildotcodes|1 month ago

On the subpar code, would the code work, albeit suboptimally?

I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.

I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.

FloorEgg|1 month ago

There is some truth to your point, but you might want to consider that seniors concerned with code quality often aren't being pedantic about artisanal craft; they are worried about the consequences of bad code:

- it becomes brittle and rigid (can't change it, can't add to it)

- it becomes buggy and impossible to fix one bug without creating another

- it becomes harder to tell what it's doing

- plus it can be inefficient / slow / insecure, etc.

The problem with your analogy is that toasters are quite simple. The better example would be your computer, and if you want your computer to just run your programs and not break, then these things matter.

zbentley|1 month ago

> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.

A consumer or junior engineer cares whether the toaster toasts the bread and doesn’t break.

Someone who cares about their craft also cares about:

- If I turn the toaster on and leave, can it burn my house down, or just set off the smoke alarm?

- Can it toast more than sliced uniform-thickness bread?

- What if I stick a fork in the toaster? What happens if I drop it in the bathtub while on? Have I made the risks of doing that clear in such a way that my company cannot be sued into oblivion when someone inevitably electrocutes themselves?

- Does it work sideways?

- When it fills up with crumbs after a few months of use, is it obvious (without knowing that this needs to be done or reading the manual) that this should be addressed, and how?

- When should the toaster be replaced? After a certain amount of time? When a certain misbehavior starts happening?

Those aren’t contrived questions in service to a tortured metaphor. They’re things that I would expect every company selling toasters to have dedicated extensive expertise to answering.

PunchyHamster|1 month ago

>I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.

Then you haven't been a senior dev long enough.

We want code that is good enough because we will have to maintain it for years (or inherit its maintenance from someone else); we want it clean enough that adding new features isn't a pain, and architected well enough that it doesn't need a major rewrite to do so.

Of course if code is throwaway that doesn't matter but if you're making long term product, making shit code now is taking on the debt you will have to pay off.

That is not to say "don't use AI for that"; that is to say "actually go thru the AI code and review whether it is done well enough". But many AI-first developers just ship the first thing that compiles or passes tests, without looking.

> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.

...well if you want it to not break (and still be cheap) you have to put quite a bit of engineering into it.

aperrien|1 month ago

Have you tried asking one of your peers who claims to get good results to run a test with you? Where you both try to create the same project, and share your results?

totallykvothe|1 month ago

I and one or two others are _the_ AI use experts at my org, and I was by far the earliest adopter here. So I don't really have anyone else with significantly different experiences than me that I could ask.

rbbydotdev|1 month ago

> I end up rewriting about 70% of the thing.

I think this touches on the root of the issue: I am seeing results-over-process winning. Code quality will decline. Out-of-touch or apathetic project managers who prioritize results are now even more emboldened to accept tech-debt-riddled code.

alexsmirnov|1 month ago

This is exactly the impression that I got. Every question or task given to an LLM returns a pretty reasonable, but flawed, result. For coding, those are hard-to-spot but dangerous mistakes. They all look good and perfectly reasonable, but are just wrong. Anthropic compared Claude Code to a "slot machine", and I feel that AI coding now is something close to a gambling addiction. As small wins keep a gambler making more bets, correct results from AI keep developers using it: "I see it made a correct solution, let's try again!" As a startup CTO, I review most of the pull requests from team members, and the team uses AI tools actively. The overall picture strongly confirms your second conclusion.

simonw|1 month ago

If someone gives you access to a slot machine which is weighted such that it pays out way more than you put into it, my advice is to start cranking that lever.

If it does indeed start costing more than it's paying out, step away.

f1shy|1 month ago

I'm exactly on the same boat.

To anybody who wants to try, a concrete example that I have tested in all available LLMs:

Make a prompt to get a Common Lisp application which draws a "hello triangle" in OpenGL, without using SDL or any framework, only OpenGL and GLFW bindings.

None of the replies even compiled. I kept asking, at least 5 times, with error feedback, to see if the AI could do it. It didn't work. Never.

The best I got was from Gemini: code where I had to change about 10 lines, absolutely non-trivial changes that required familiarity with OpenGL and Lisp. After making the changes I asked it what it thought of them; it replied that I was wrong and that with those changes it would never work.

If anybody can make a prompt that gets me that, please let me know...

Philpax|1 month ago

It sounds like you're using LLMs directly, instead of a coding agent. Agents are capable of testing their own code and using that to fix issues, which is what makes them so powerful.

Using Claude Code, I was able to successfully produce the Hello Triangle you asked for (note that I have never used CL before): https://github.com/philpax/hello-triangle-cl

For reference, here is the transcript of the entire interaction I had with CC (produced with simonw's excellent claude-code-transcripts): https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...

Edit: To better contextualise what it's doing, the detailed transcript page may be useful: https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...

james_a_craig|1 month ago

"Please write me a program in Common LISP (SBCL is installed) which will render a simple "hello world" triangle in OpenGL. You should use only OpenGL and GLFW (using sbcl's FFI) for this, not any other existing 3D graphics framework."

This worked in codex-cli, though it took three rounds of passing back the errors. https://gist.github.com/jamesacraig/9ae0e5ed8ebae3e7fe157f67... has the resulting code.

Aeolun|1 month ago

Maybe if your coding style is already close to what an LLM like Claude outputs, you’ll never have these issues? At least it generally seems to be doing what I would do myself.

Most of the architectural failures come from it still not having the whole codebase in mind when changing stuff.

trueno|1 month ago

I actually think it's less about code style and more about the disjointed way end outcomes seem to be the culmination of a lot of prompt attempts over the course of a project/implementation.

The funny thing is reviewing stuff claude has made isn't actually unfamiliar to me in the slightest. It's something I'm intimately familiar with and have been intimately familiar with for many years, long before this AI stuff blew up...

..it's what the code I've reviewed/maintained/rejected looks like when a consulting company was brought on board to build something: the sort of company that leverages probably underpaid and overworked laborers, both overseas and US-based workers on visas. The delivered documentation/code is noisy and disjointed.

locknitpicker|1 month ago

> Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing.

You might want to review how you approach these tools. Complaining that you need to rewrite 70% of the code screams of poor prompting: too-vague inputs, no constraints, and no feedback at all.

Using agents to help you write code is far from a one-shot task, but throwing out 70% of what you create screams that you are prompting the agent to create crap.

> 1) I'm not good at prompting, even though I'm one of the earliest adopters of AI for coding that I know, and I have been at it consistently for years. So I find this hard to accept.

I think you need to take a humble pill, review how you are putting together these prompts, figure out what you are doing wrong in prompts and processes, and work up from where you are at this point. If 70% of your output is crap, the problem is in your input.

I recommend you spend 20 minutes with your agent of choice, prompting it to help you improve your prompts. Check instruction files, spec-driven approaches, context files, etc. Even a plain old README.md helps a lot; prompt your agent to generate it for you. From there, instead of one-shot prompts, try to break a task down into multiple sub-steps with small deliverables. Always iterate on your instruction files. If you spend a few minutes on this, you will quickly halve your churn rate.

prezk|1 month ago

Maybe LLMs are the next evolution of the rubber ducky: you can talk to it, and it's very helpful, just don't expect that IT will give you the final answer.

FloorEgg|1 month ago

I have been doing the same since GPT-3. I remember a time, probably around 4o, when it started to get useful for some things, like small React projects, but was useless for others, like Firestore rules. I think that surface is still jagged; it's just less obviously useless in the areas where it's weaker.

Things really broke open for me when I adopted Windsurf with Opus 4, and then again with Opus 4.5. I think the way the IDE manages the context and breaks down tasks extends LLM usefulness a lot, but I haven't tried Cursor and haven't really tried to get good at Claude Code.

All that said, I have a lot of experience writing in business contexts and I think when I really try I am a pretty good communicator. I find when I am sloppy with prompts I leave a lot more to chance and more often I don't get what I want, but when I'm clear and precise I get what I want. E.g. if it's using sloppy patterns and making bad architectural choices, I've found that I can avoid that by explaining more about what I want and why I want it, or just being explicit about those decisions.

Also, I'm working on smaller projects with less legacy code.

So in summary, it might be a combination of 1, 2 and the age/complexity of the project you're working on.

h14h|1 month ago

In my experience, AI coding agents need highly specific success criteria, and an easy way to verify their output against those criteria.

My biggest successes have come when I take a TDD approach. First I carve a subset of my work out into a module with an API that can be easily tested, then I collaborate with the agent on writing correct test cases, and finally I tell it to implement the module such that the test cases pass without any lint or typing errors.
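
As a minimal sketch of that handoff in Java/JUnit 5 terms (the SlugGenerator module is a hypothetical example, not a real project): I write the contract and the tests, and the agent's only job is to make them pass.

    // Written by hand: the contract and the tests.
    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    interface SlugGenerator {
        String toSlug(String title);
    }

    class SlugGeneratorTest {
        // The agent implements DefaultSlugGenerator until these pass
        // with no lint or typing errors.
        private final SlugGenerator slugs = new DefaultSlugGenerator();

        @Test
        void lowercasesAndHyphenates() {
            assertEquals("hello-world", slugs.toSlug("Hello World"));
        }

        @Test
        void stripsPunctuationAndCollapsesWhitespace() {
            assertEquals("its-a-test", slugs.toSlug("  It's a   test! "));
        }
    }

    // One implementation the agent might converge on:
    class DefaultSlugGenerator implements SlugGenerator {
        public String toSlug(String title) {
            return title.toLowerCase()
                        .replaceAll("[^a-z0-9\\s-]", "") // drop punctuation
                        .trim()
                        .replaceAll("\\s+", "-");        // collapse whitespace
        }
    }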

It forces me to spend much more time thinking about use cases, project architecture, and test coverage than about nitty-gritty implementation details. I can imagine that in a system that evolved over time without a clear testing strategy, AI would struggle mightily to be even barely useful.

Not saying this applies to your system, but I've definitely worked on systems in the past that fit the "big ball of mud" description pretty neatly, and I have zero clue how I'd have been able to make effective use of these AI tools.

hansvm|1 month ago

You alluded to it, but also:

3) Not everyone codes the same things

4) It's easy to get too excited about the tech and ignore its failure modes when describing your experiences later

I use AI a lot. With your own control plane (as opposed to a generic Claude Code or whatever) you can fully automate a lot more things. It's still fundamentally incapable of doing tons of tasks though at any acceptable quality level, and I strongly suspect all of (2,3,4) are guiding the disconnect you're seeing.

Take the two things I've been working on this morning as an example.

One was a one-off query. I told it the databases it should consider, a few relevant files, roughly how that part of the business works, and asked it to come back when it finished. When it was done I had it patch up the output format. It two-shot (with a lot of helpful context) something that would have taken me an hour or more.

Another is more R&D-heavy. It pointed me to a new subroutine I needed (it couldn't implement it correctly though) and is otherwise largely useless. It's actively harmful to have it try to do any of the work.

It's possible that (1) matters more than you suspect too. AI has certain coding patterns it likes to use a lot which won't work in my codebase. Moreover, it can't one-shot the things I want. It can, however, follow a generic step-by-step guide for generating those better ideas, translating worse ideas into things that will be close enough to what I need, identifying where it messed up, and refactoring into something suitable, especially if you take care to keep context usage low and whatnot. A lot of people seem to be able to get away with CLAUDE.md or whatever, but I like having more granular control of what the thing is going to be doing.

Exoristos|1 month ago

I think the answer will lie somewhere closer to social psychology and modern economics than to anything in software engineering.

growt|1 month ago

It might be 1); being an early adopter doesn't help much with AI, since so much is changing constantly. If you put a good description of your architecture and coding guidelines in the right .md files and work on your prompts, the output should be much better. On the other hand, your project being legacy code probably doesn't help either.

lmeyerov|1 month ago

We find across our team different people are able to use these things at different levels. Unsurprisingly, more senior coders with both more experience in general and more experience in ai coding are able to do more with ai and get more ambitious things done more quickly.

A bummer is that we have a genai team (louie.ai) and a gpu/viz/graph analytics team (graphistry), and those who have spent the last 2-3 years doing genai daily have a higher uptake rate here than those who haven't. I wouldn't say team 1 is better than team 2 in general: these are tools, and different people have different engineering skill and AI coding skill, including different amounts of time spent on both.

What was a revelation for me personally was taking 1-2 months early in Claude Code's release to go full cold turkey on manual coding, similar to getting immersed in a foreign language. That forced eliminating a lot of bad habits wrt effective AI coding, both personally and in the state of our repo tooling. Since then, it's been steady work to accelerate and smooth that loop, eg, moving from vibe coding/engineering to now more eval-driven AI coding loops: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . That takes a LOT of buildout.

fmbb|1 month ago

Do you have links to texts that describe which markdown files to use, and what to write in them? What is good and what is bad, etc.?

zmj|1 month ago

My experience with agents in larger / older codebases is that feedback loops are critical. They'll get it somewhere in the neighborhood of right on the first attempt; it's up to your prompt and tooling to guide them to improve it on correctness and quality. Basic checks: can the agent run the app, interact with it, and observe its state? If not, you probably won't get working code. Quality checks: by default, you'll get the same code quality as the code the agent reads while it's working; if your linters and prompts don't guide it towards your desired style, you won't get it.

To put that another way: one-shot attempts aren't where the win is in big codebases. Repeat iteration is, as long as your tooling steers it in the right direction.

dspillett|1 month ago

> 1) I'm not good at prompting,

I assume this is part of the problem (though I've mostly avoided using LLMs, so I can't comment with any true confidence here), but to a large extent this is blaming you for a suboptimal interface when the interface is the problem.

That some people seem to get much better results than others, and that the distinction does not map well to differences in ability elsewhere, suggests to me that the issue is people thinking slightly differently and the training data for the models somehow being biased to those who operate in certain ways.

> 2) Other people are just less picky than I am

That is almost certainly a much larger part of the problem. “Fuck it, it'll do, someone else can tidy it later if they are bothered enough” attitudes were rampant long before people started outsourcing work to LLMs.

furyofantares|1 month ago

I think you should try harder to find their limits. Be as picky as you want, but don't just take over after it gave you something you didn't like. Try again with a prompt that talks about the parts you think were bad the first time. I don't mean iterate with it, I mean start over with a brand new prompt. Try to figure out if there is a prompt that would have given you the result you wanted from the start.

It won't be worth it the first few times you try this, and you may not get it to where you want it. I think you might be pickier than others and you might be giving it harder problems, but I also bet you could get better results out of the box after you do this with a few problems.

parliament32|1 month ago

Not even coding tasks. Just getting an LLM to help me put together a PromQL query to do something somewhat non-standard takes dozens of tries and copy/pasting back error messages... and these aren't complex errors, just trivial things like missing closing brackets and the like.

I know the usual clapback is "you're just missing this magical workflow" or "you need to prompt better", but... do I really need to prompt "make sure your syntax is correct"? Shouldn't that be, ya know, a given for a prompt that starts with "Help me put together a PromQL query that..."?

simonw|1 month ago

Yes, you're missing a magic workflow.

If you find yourself having to copy and paste errors back and forth, you need to upgrade to a coding agent harness like Claude Code so the LLM can try things out and then fix the errors on its own.

If you're not willing to do that you can also fix this by preparing a text file with a few examples of correctly formatted queries and pasting that in at the start of your session, or putting it in a skill markdown file.
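
For the PromQL case, that might be a promql-examples.md holding a handful of known-good queries. These two are generic illustrations, not from any particular setup:

    # p95 request latency per service over a 5m window:
    histogram_quantile(0.95,
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

    # Error ratio per instance:
    sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (instance) (rate(http_requests_total[5m]))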

mark_l_watson|1 month ago

I think you are correct, with one large caveat:

With very good tooling (e.g., Google Antigravity, Claude Code, OpenAI's Codex, and several open platforms), no concern about your monthly API and subscription costs, very long-running trial and error, and tools for testing code changes, some degree of real autonomy is possible.

But, do we want to work like this? I don’t.

I feel very good about using strong AI for research and learning new things (self improvement) and I also feel good about using strong AI as a ‘minor partner’ in coding.

Closi|1 month ago

Try learning to vibe code on something totally greenfield without looking at the code, and see if it changes your mind. Ignore code quality; "does it work" and "am I happy with the app" are the only metrics.

Code quality is a concern you need to set aside when vibe coding; if code quality is important to you or your project, then vibe coding isn't a fit. But if you abandon that concept and build things small enough or modular enough, then speed gains await!

IMO codebases can be architected for LLMs to work better in them, but this is harder in brownfield apps.

ahtihn|1 month ago

If you start greenfield and ignore the code quality, how do you know you can maintain it long term?

Greenfield is fundamentally easier than maintaining existing software. Once software exists, users expect it to behave a certain way and they expect their data to remain usable in new versions.

The existing software now imposes all sorts of constraints that may not be explicit in the spec. Some of these constraints end up making certain changes very hard. Bad assumptions in data modeling can make migrations a nightmare.

You can't just write entirely new software every time the requirements change.

littlestymaar|1 month ago

> 2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.

Given how consistently terrible the code of the Claude Code-d projects posted here has been, I think this is it.

I find LLMs pretty useful for coding, for multiple things (writing boilerplate, as an idiomatic design-pattern search engine, as a rubber duck, helping me name things, explaining unclear error messages, etc.), but I find the grandiose claims a bit ridiculous.

eloisant|1 month ago

You can definitely use AI for non-trivial tasks.

It's not just about better prompting, but using better tools. Tools that will turn a bad prompt into a good prompt.

For example there is the plan mode for Cursor. Or just ask the AI: "make a plan to do this task", then you review the plan before asking it to implement. Configure the AI to ask you clarification questions instead of assuming things.

It's still evolving pretty quickly, so it's worth staying up to date with that.

xmodem|1 month ago

I have not been as aggressive as GP in trying new AI tools. But the last few months I have been trying more and more and I'm just not seeing it.

One project I tried out recently I took a test-driven approach. I built out the test suite while asking the AI to do the actual implementation. This was one of my more successful attempts, and may have saved me 20-30% time overall - but I still had to throw out 80% of what it built because the agent just refused to implement the architecture I was describing.

It's at its most useful if I'm trying to bootstrap something new on a stack I barely know, OR if I decide I just don't care about the quality of the output.

I have tried different CLI tools, IDE tools. Overall I've had the best success with Claude Code but I'm open to trying new things.

Do you have any good resources you would recommend for getting LLMs to perform better, or for staying up to date on the field in general?

davidguetta|1 month ago

I think you are not hardcore enough. I paste entire files, or 2-3 files at once, and ask it to rewrite everything.

Then you review it and in general have to ask it to remove some stuff. And then it's (good enough). You have to accept not nitpicking some parts (like random functions being generated) as long as your test suite passes; otherwise, of course, you will end up rewriting everything.

It also depends on your setting; some areas (web vs AI vs robotics) can be more suited than others.

nvarsj|1 month ago

It is pretty simple, imo. AI (just like humans!) does best on well-written, self-contained code bases. Which is a very small niche, but one overrepresented in open source, and subsequently by tech celebrities, who tend not to work on "ugly code".

I work on a giant legacy code base at big tech, which is one piece of many distributed systems. LLM is helpful for localised, well defined work, but nowhere close to what the TFA describes.

jryle70|1 month ago

If you follow antirez's post history, he was a skeptic until maybe a year ago. Why don't you look at his recent commits and judge for yourself? I suppose the majority of his most recent code is relevant to this discussion.

https://github.com/antirez?tab=overview&from=2026-01-01&to=2...

totallykvothe|1 month ago

I don't think I'd be a good judge because I don't have the years of familiarity and expertise in his repos that I do at my job. A lot of the value of me specifically vs an LLM at my job is that I have the tribal knowledge and the LLM does not. We have gotten a lot better at documentation, but I don't think we can _ever_ truly eliminate that factor.

cm2187|1 month ago

Not trying to back the AI hype, but most pre-AI auto-generated code is garbage (like WinForms auto-generated code or Entity Framework SQL in the .NET world). But that's fine; it's not meant to be read by humans. If you want to change it, you can regenerate it. It may be that AI just moves the line between what developers should care about and look at vs. the boring boilerplate code that adds little value.

camdenreslink|1 month ago

But those code generators were deterministic (and indeed caused huge headaches if the generated code changed between versions). Seems like a totally different thing.

Chris911|1 month ago

Instead of rewriting it yourself, have you tried telling the agent what it did wrong and doing the rewrite with it? Then, at the end of the session, ask it to extract a set of rules that would have helped it get things right the first time. Save those in AGENTS.md. If you and your team do this a few times, it can lead to only having to rewrite 5% of the code instead of 70%.

newsoftheday|1 month ago

> Instead of rewriting it yourself, have you tried telling the agent what it did wrong and doing the rewrite with it?

I have, it becomes a race to the bottom.

ninininino|1 month ago

How much buggy / incorrect Java written by first-year computer science university students is there on Stack Overflow (in SO post bodies)? Decades of it.

Ask the same question of Golang, or Rust, or Typescript.

I have a theory that the large dichotomy in how people experience AI coding has to do with the quality of the training corpus for each language online.

onlyrealcuzzo|1 month ago

I'm not sure if I got into this weird LLM bubble where they give me bad advice to drive engagement, because I can't resist trying to correct them and tell them how absurdly wrong they are.

But it is astounding how terrible they are at debugging non-trivial assembly in my experience.

Anyone else have input here?

Am I in a weird bubble? Or is this just not their forte?

It's truly incredible how thoughtless they can be, so I think I'm in a bubble.

selestify|1 month ago

> I can't resist trying to correct them and tell them how absurdly wrong they are.

Oh god I thought I was the only one. Do you find yourself getting mad at them too?

smj-edison|1 month ago

I've tried to use Claude Code with Sonnet 4.5 for implementing a new interpreter, and man is it bad with reference counting. Granted, I'm doing it in Zig, so there's not as much training data, but Claude will suggest the most stupid changes. All it does is make the rare case of incorrect reference counting rarer, not fix the underlying problem. It kept heaping on more and more hacks, until I decided enough was enough and rolled up my sleeves. I still can't tell if it makes me faster, or if I'm faster.

Even when refactoring, it would change all my comments, which is really annoying, as I put a lot of thought into them. Plus, the time it took to do each refactoring step was about how long it would have taken me, and when I do it myself I get the additional benefit of feeling when I'm repeating code too often.

So, I'm not using it for now, except for isolating bugs. It's addicting having it work on it for me, but I end up feeling disconnected and then something inevitably goes wrong.

intended|1 month ago

Thank you for providing data which can actually be used to collate! I strongly suspect that experience is a huge determinant of what utility is seen from LLMs.

It seems that there are more people writing and finishing projects, but not many have reached the point where they have to maintain their code / deal with the tech debt.

crassus_ed|1 month ago

Genuine question: doesn't this apply more to coding style than actual results? The same applies to writing style. LLMs manage to write great stories, but they don't suit my writing style. When generating code it doesn't always suit my coding style, but the code it generates functions fine.

eru|1 month ago

> Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing.

Have another model review the code, and use that review as automatic feedback?

jmalicki|1 month ago

CodeRabbit in particular is gold here. I don't know what they do, but it is far better at reviewing than any AI model I've seen. Given the kinds of deep issues it finds, I strongly suspect they have a lot of agents routing code to extremely specialized subagents that can find subtle concurrency bugs, misuse of some deep APIs, etc. I often have to do the architectural/big-picture/how-this-fits-into-the-project-vision review myself, but for finding actual bugs in code, or things that would be self-evident from reading one file, it is extremely good.

robertfw|1 month ago

I've been using a `/feedback ...` command with Claude Code where I give it either positive or negative feedback about some action it just took, and it'll look through the session to make some educated guesses about why it acted that way - notably, checking for "there was guidance for this, but I didn't follow it" versus "there was no guidance for this".

The outcome is usually a new or tweaked skill file.

It doesn't always fix the problem, but it's definitely been making some great improvements.
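
For anyone who wants to try it: Claude Code picks up custom slash commands from markdown files under .claude/commands/, with $ARGUMENTS standing in for whatever you type after the command. Mine is roughly this (paraphrased from memory, so treat it as a sketch):

    The user has feedback on your last action: $ARGUMENTS
    Look back through this session and decide which case applies:
    1. Guidance existed (CLAUDE.md or a skill file) but you didn't follow it.
    2. No guidance covered this case.
    Then propose a new or tweaked skill file that would have produced the
    right behavior the first time, and ask before writing it.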

kristofferR|1 month ago

That is actually a gold tip. Codex CLI is way less pleasant to use than Opus, but way better at finding bugs, so I combine them.

redox99|1 month ago

It sounds harsh but you're most likely using it wrong.

1) Have an AGENTS.md that describes not just the project structure, but also the product and business (what it does, who it's for, etc.). People expect LLMs to read a snippet of code and be as good as an employee who has an implicit understanding of the whole business. You must give it all that information. Tell it to use good practices (DRY, KISS, etc.). Add patterns it should use or avoid as you go. (A sketch of such a file follows at the end of this comment.)

2) It must have source access to everything it interacts with. Use a monorepo, workspaces, etc.

3) Most important of all, everything must be set up so the agent can iterate, test, and validate its changes. It will make mistakes all the time, just like a human does (even basic syntax errors), but it will iterate and end up at a good solution. It's incorrect to assume it will write perfect code blindly, without building, linting, testing, and iterating; no human would either. The LLM should be able to determine whether a task was completed successfully or not.

4) It is not expected to always one-shot perfect code. If you value quality, you will glance at it and sometimes have to reply: make it this other way, extract this, refactor that. Having said that, you shouldn't need to write a single line of code (I haven't for months).

Using LLMs correctly allows you to complete tasks in minutes that would otherwise take hours, days, or even weeks, with higher quality and fewer errors.

Use Opus 4.5 with other LLMs as a fallback when Opus is being dumb.
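
To make points 1 and 3 concrete, here's a minimal sketch of such a file (all project details invented):

    # AGENTS.md (sketch)

    ## Product
    Inventory-management web app for retail chains. Backend is Java
    Spring; frontend is React. Users are store managers, not developers.

    ## Conventions
    - KISS and DRY; reuse existing helpers before writing new ones.
    - Follow the existing service/repository layering.

    ## Validation (a task is done only when all of these pass)
    - Build: ./mvnw -q compile
    - Test:  ./mvnw -q test
    - Lint:  ./mvnw -q checkstyle:check  (assumes Checkstyle is configured)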

matwood|1 month ago

> Most important of all, everything must be set up so the agent can iterate, test, and validate its changes.

This was the biggest unlock for me. When I receive a bug report, I have the LLM tell me where it thinks the source of the bug is located, write a test that triggers the bug (i.e., fails), design a fix, and finally implement the fix, then repeat. I'm routinely surprised by how good it is at doing this, and by the speed with which it works. So even if I have to manually tweak a few things, I've moved much faster than I would have without the LLM.
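
To make the "write a test that triggers the bug" step concrete, the first artifact out of the LLM might be a failing JUnit test along these lines (the class and the bug are invented for illustration):

    // Hypothetical reproduction test -- it should fail until the fix lands.
    // InvoiceTotals and the rounding bug are invented for illustration.
    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.math.BigDecimal;
    import org.junit.jupiter.api.Test;

    class InvoiceTotalsBugTest {
        @Test
        void sumOfThreeDimesIsExactlyThirtyCents() {
            // Bug report: some invoice totals drift by a fraction of a cent.
            BigDecimal total = InvoiceTotals.sum(
                new BigDecimal("0.10"),
                new BigDecimal("0.10"),
                new BigDecimal("0.10"));
            assertEquals(new BigDecimal("0.30"), total);
        }
    }

Once this fails for the right reason, the fix-and-rerun loop has an objective target.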

Madmallard|1 month ago

"The LLM should be able to determine if a task was completed successfully or not."

Writing logic that verifies something complex requires basically solving the problem entirely already.

zelphirkalt|1 month ago

That's the curse of the expert: you see many of the shortcomings that someone less experienced might not even think about when they go to social media and blurt out that AI can now fully replace them.

austin-cheney|1 month ago

As with anything else, the people best positioned to enjoy the output are the people least well positioned to criticize it. This is as true of AI as it is of eating at restaurants or enjoying movie dramas.

javier2|1 month ago

This is also my experience with enterprise Java. LLMs have done much better with slightly less convoluted codebases in Go. They're currently clearly better at Go and TypeScript than at Java, in my view.

teunispeters|1 month ago

Wow, only 70%? So far I've had to drop the result and rewrite from scratch every time. Mind you, I work in C/embedded spaces, and current LLMs are just horrible at any code in that space.

My vote is with (2).

wanderlust123|1 month ago

Do you have an example of something that was subpar and needed a 70% rewrite?

aurizon|1 month ago

AI is a house painter: wall to wall, with missed spots and drips. Good coders are artists. That said, artists have been known to use assistants for backgrounds. Perhaps the end state is a similar coder/AI collaboration?

perrygeo|1 month ago

LLMs tend to rise to the level of complexity of the codebase. They are probabilistic pattern-matching machines, after all. It's rare to have a 15-year-old repo without significant complexity; is it possible that the reason LLMs have trouble with complex codebases is simply that the codebases are complex?

IMO it has nothing to do with LLMs. They just mirror the patterns they see - don't get upset when you don't like your own reflection! Software complexity is still bad; LLMs just shove it back in our faces.

Implications: AI is always going to feel more effective on brand-new codebases without any legacy weight, and less effective on "real" apps where the details matter.

The bias is strongly evident - you rarely hear anyone talking about how they vibe-coded a coherent changeset into an existing repo.

otabdeveloper4|1 month ago

> I don't understand the stance that AI currently is able to automate away non-trivial coding tasks

It's just the Dunning-Kruger effect. People who think AI is the bee's knees are precisely the dudes who are least qualified to judge its effectiveness.

egorfine|1 month ago

Same experience. The better the model, the more complicated the bugs and brain damage it introduces.

Perhaps one has to be a skilled programmer in the first place to spot the problems, which is not easy when the program apparently runs fine.

Things like mocked tests, you know. Who would care about that.

CuriouslyC|1 month ago

I think it comes down to what you mean by subpar code. If you're talking about a mess of bubble sorts and other algorithmic problems, that's probably a prompting issue. If you're talking "I just don't like the style of the code, it looks inelegant", that's not really a prompting issue; models will veer toward common patterns in a way that's hard to avoid with prompts.

Think about it like compiler output. Literally nobody cares whether that is well formatted. They just care that they can get fairly performant code without having to write assembly. People still drop down to assembly (very, very infrequently now) for really fine performance optimizations, but people used to write large programs in it (miserably).

__float|1 month ago

There's a huge amount you're missing by boiling down their complaint to "bubble sorts or inelegant code". The architecture of the new code, how it fits into the existing system, whether it makes use of existing utility code (IMO this is a huge downside; LLMs seem to love to rewrite a little helper function 100x over), etc.

These are all important when you consider the long-term viability of a change. If you're working in a greenfield project where requirements are constantly changing and you plan on throwing this away in 3 months, maybe it works out fine. But not everyone is doing that, and I'd estimate most professional SWEs are not doing that, even!

luckilydiscrete|1 month ago

It's a combination of being bad at prompting and having different expectations of the tool. You expect it to one-shot the task, and then you rewrite the things that don't match what you want.

Instead, I recommend using LLMs to fix the problems they introduce as well; over time you'll get better at spotting the parts that will confuse the LLM. My hunch is that you'll find your descriptions of what to implement were vaguer than you thought, and as you iterate, you'll learn to be a lot more specific. Basically, you'll find that your taste was more subjective than you thought, and you'll rid yourself of the expectation that the LLM magically understands it.