LLMs work best when the user defines their acceptance criteria first

[+] pornel|7 days ago|reply

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

[+] unlikelytomato|7 days ago|reply

This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

[+] stingraycharles|7 days ago|reply

> If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

Nevermind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

[+] marginalia_nu|7 days ago|reply

My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

[+] vannevar|7 days ago|reply

I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.

[+] Implicated|7 days ago|reply

Not trying to be snarky, with all due respect... this is a skill issue.

It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

> If you tell them the code is slow

Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't do "no-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).

> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

This is an area I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it keeping digging or chasing it's tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set the more sticky they'll be.

And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is setup and any time you run into an issue testing related have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.

Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

The more you stick to convention the better off you'll be. And use planning mode.

[+] joquarky|7 days ago|reply

Don't let it deteriorate so far that it can't recover in one session.

Perform regular sessions dedicated to cleaning up tech debt (including docs).

[+] MattGaiser|7 days ago|reply

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Are you using plan mode? I used to experience the do a poor approach and dig issue, but with planning that seems to have gone away?

[+] bryanrasmussen|7 days ago|reply

maybe there should be an LLM trained on a corpus of a deletions and cleanup of code.

[+] enraged_camel|6 days ago|reply

I have no idea what I'm doing differently because I haven't experienced this since Opus 4.5. Even with Sonnet 4.5, providing explicit instructions along the lines of "reuse code where sensible, then run static analysis tools at the end and delete unused code it flags" worked really well.

I always watch Opus work, and it is pretty good with "add code, re-read the module, realize some pre-existing code (either it wrote, or was already there) is no longer needed and delete it", even without my explicit prompts.

[+] carlosjobim|6 days ago|reply

Yes, this is exactly the experience I have had with LLMs as a non-programmer trying to make code. When it gets too deep into the weeds I have to ask it to get back a few steps.

[+] m3kw9|6 days ago|reply

Yes that’s my observation too. I have to be double careful the longer they run a task. They like to hack and patch stuff even when I tell it I don’t prefer it.

[+] karussell|6 days ago|reply

The solution is to know when to use an existing solution like sqlite and when to create your own. So the biggest problem with LLMs is that they don't repel or remind you about possible consequences (too often). But if they would, I would find it even more awkward... and this is one of the reasons I prefer Claude Code over Codex.

[+] bluepoint|5 days ago|reply

Actually I get improvements when I ask two llms to simplify each other’s work repeatedly.

[+] unknown|6 days ago|reply

[deleted]

[+] codebolt|7 days ago|reply

I use the restore checkpoint/fork conversation feature in GitHub Copilot heavily because of this. Most of the time it's better to just rewind than to salvage something that's gone off track.

[+] ThrowawayTestr|6 days ago|reply

I feel like there's two types of LLM users. Those that understand it's limitations, and those that ask it to solve a millennium problem on the first try.

[+] esafak|7 days ago|reply

I have run into this too. Some of it is because models lack the big picture; so called agentic search (aka grep) is myopic.

[+] cyanydeez|6 days ago|reply

The reason theyre not intelligent is becaise they want to predict the next token, so verbosity is baked in.

[+] bgitarts|6 days ago|reply

have you tired adding to your agents file: "Prefer solutions that reduce lines of code over adding lines of code"?

[+] leke|7 days ago|reply

i wonder if the solution is to just ask it to refactor its code once it's working.

[+] fmbb|7 days ago|reply

It’s in the name, isn’t it?

Generative AI.

[+] treetalker|6 days ago|reply

This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.

LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it take little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / The Bullshit Asymmetry Principle.) Thus justice suffers.

I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.

[+] roarcher|6 days ago|reply

> LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it take little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute.

The same goes for coding. I have coworkers who use it to generate entire PRs. They can crank out two thousand lines of code that includes tests "proving" that it works, but may or may not actually be nonsense, in minutes. And then some poor bastard like me has to spend half a day reviewing it.

When code is written by a human that I know and trust, I can assume that they at least made reasonable, if not always correct, decisions. I can't assume that with AI, so I have to scrutinize every single line. And when it inevitably turns out that the AI has come up with some ass-backwards architecture, the burden is on me to understand it and explain why it's wrong and how to fix it to the "developer" who hasn't bothered to even read his own PR.

I'm seriously considering proposing that if you use AI to generate a PR at my company, the story points get credited to the reviewer.

[+] deaux|6 days ago|reply

> This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

Correct, and this of course extends past just laws, into the whole scope of rules and regulations described in human languages. It will by its nature imply things that aren't explicitly stated nor can be derived with certainty, just because they're very plausible. And those implications can be wrong.

Now I've had decent success with having LLMs then review these LLM-generated texts to flag such occurences where things aren't directly supported by the source material. But human review is still necessary.

The cases I've been dealing with are also based on relatively small sets of regulations compared the scope of the law involved with many legal cases. So I imagine that in the domain you're working on, much more needs flagging.

[+] FpUser|6 days ago|reply

>" justice suffers"

Possible. It also suffers when majority simply can not afford proper representation

[+] grey-area|7 days ago|reply

This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow). Doesn't use is_ipk to identify primary keys, uses fsync on every statement. The problem with larger projects like this even if you are competent is that there are just too many lines of code to read it properly and understand it all. Bravo to the author for taking the time to read this project, most people never will (clearly including the author of it).

I find LLMs at present work best as autocomplete -

The chunks of code are small and can be carefully reviewed at the point of writing

Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs but ugly code, slow code, duplicate code or incorrect code.

Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

[+] theshrike79|5 days ago|reply

> This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow).

Why isn't requirements testing automated? Benchmarking the speed isn't rocket science. At worst a nightly build should run a benchmark and log it so you can find any anomalies.

[+] KatanaLarp|5 days ago|reply

[deleted]

[+] consumer451|7 days ago|reply

Nitpick/question: the "LLM" is what you get via raw API call, correct?

If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.

I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should be the term, or terms, be? LLM-based agents?

[+] dragonwriter|7 days ago|reply

> Nit pick/question: The LLM is what you get via raw API call, correct?

You always need a harness of some kind to interact with an LLM. Normal web APIs (especially for hosted commercial systems) wrapped around LLMs are non-minimal harnesses, that have built in tools, interpretation of tool calls, application of what is exposed in local toolchains as “prompt templates” to transform the context structure in the API call into a prompt (in some cases even supporting managing some of the conversation state that is used to construct the prompt on the backend.)

> If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

You are essentially always using something more than an LLM (unless “you” are the person writing the whole software stack, and the only thing you are consuming is the model weights, or arguably a truly minimal harness that just takes setting and a prompt that is not transformed in any way before tokenization, and returns the result after no transformations or filtering other than mapping back from tokens to text.)

But, yes, if you are using an elaborate frontend of the type you enumerate (whether web or CLI or something else), you are probably using substantially more stuff on top of the LLM than if you are using the providers web API.

[+] xlth|7 days ago|reply

You're not off-base at all. The way I think about it:

- LLM = the model itself (stateless, no tools, just text in/text out) - LLM + system prompt + conversation history = chatbot (what most people interact with via ChatGPT, Claude, etc.) - LLM + tools + memory + orchestration = agent (can take actions, persist state, use APIs)

When someone says "LLMs have no memory" they're correct about the raw model, but Claude Code or Cursor are agents - they have context, tool access, and can maintain state across interactions.

The industry seems to be settling on "agentic system" or just "agent" for that last category, and "chatbot" or "assistant" for the middle one. The confusion comes from product names (ChatGPT, Claude) blurring these boundaries - people say "LLM" when they mean the whole stack.

[+] simonw|6 days ago|reply

I like to use the term "coding agents" for LLM harnesses that have the ability to directly execute code.

This is an important distinction because if they can execute the code they can test it themselves and iterate on it until it works.

The ChatGPT and Claude chatbot consumer apps do actually have this ability now so they technically class as "coding agents", but Claude Code and Codex CLI are more obvious examples as that's their key defining feature, not a hidden capability that many people haven't spotted yet.

[+] alexhans|7 days ago|reply

> The vibes are not enough. Define what correct means. Then measure.

Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.

The Author mentions evaluations, which in this context are often called AI evals [1] and one thing I'd love to see is those evals become a common language of actually provable user stories instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy and a software developer.

The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.

- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )

[+] D-Machine|7 days ago|reply

This article is great. And the blog-article headline is interesting, but wrong. LLM's don't in general write plausible code (as a rule) either.

They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

[+] comex|7 days ago|reply

Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

https://news.ycombinator.com/item?id=47176209

[+] flerchin|7 days ago|reply

Yes plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep hidden requirements.

[+] seanmcdirmid|6 days ago|reply

Ok, I’ll bite: how is that different from humans?

[+] andai|6 days ago|reply

It writes statistically represented code, which is why (unless instructed otherwise) everything defaults to enterprisey, OOP, "I installed 10 trendy dependencies, please hire me" type code.

[+] siliconc0w|6 days ago|reply

Just a recent anecdote, I asked the newest Codex to create a UI element that would persist its value on change. I'm using Datastar and have the manual saved on-disk and linked from the AGENTS.md. It's a simple html element with an annotation, a new backend route, and updating a data model. And there are even examples of this elsewhere in the page/app.

I've asked it to do why harder things so I thought it'd easily one-shot this but for some reason it absolutely ate it on this task. I tried to re-prompt it several times but it kept digging a hole for itself, adding more and more in-line javascript and backend code (and not even cleaning up the old code).

It's hard to appreciate how unintuitive the failure modes are. It can do things probably only a handful of specialists can do but it can also critical fail on what is a straightforward junior programming task.

[+] ollybrinkman|6 days ago|reply

This maps directly to the shift happening in API design for agent-to-agent communication.

Traditional API contracts assume a human reads docs and writes code once. But when agents are calling agents, the "contract" needs to be machine-verifiable in real-time.

The pattern I've seen work: explicit acceptance criteria in API responses themselves. Not just status codes, but structured metadata: "This response meets JSON Schema v2.1, latency was 180ms, data freshness is 3 seconds."

Lets the calling agent programmatically verify "did I get what I paid for?" without human intervention. The measurement problem becomes the automation problem.

Similar to how distributed systems moved from "hope it works" to explicit SLOs and circuit breakers. Agents need that, but at the individual request level.

[+] swiftcoder|6 days ago|reply

What's up with the (somewhat odd) title HN has gone with for this article? it's implying a very different article than the one I just read

[+] lukeify|7 days ago|reply

Most humans also write plausible code.

[+] jqpabc123|7 days ago|reply

LLMs have no idea what "correct" means.

Anything they happen to get "correct" is the result of probability applied to their large training database.

Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in it's training data. The user has no way to know if this is the case so they are basically flying blind and hoping for the best.

Relying on an LLM for anything "serious" is a liability issue waiting to happen.

[+] seanmcdirmid|7 days ago|reply

I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests, we need pretty good coverage since if some edge case isn't covered, it likely will wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time.

I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.

[+] 88j88|7 days ago|reply

100% I found that you think you are smarter than the LLM and knowing what you want, but this is not the case. Give the LLM some leeway to come up with solution based on what you are looking to achieve- give requirements, but don't ask it to produce the solution that you would have because then the response is forced and it is lower quality.

[+] cadamsdotcom|6 days ago|reply

This is a bit unfair - to generate a bunch of code but not give the model data/tools and direct it to optimize it; then compare it to the optimized work of thousands over decades.

Feels like an extremely high effort hit piece, even though I know it’s not.

[+] bitwize|6 days ago|reply

You: Claude, do you know how to program?

Claude: No, but if you hum a few bars I can fake it!

Except "faking it" turns out to be good enough, especially if you can fake it at speed and get feedback as to whether it works. You can then just hillclimb your way to an acceptable solution.

[+] andai|6 days ago|reply

Iterative Faking™ — now with plausible-looking test suite!

[+] plandis|6 days ago|reply

I’ve found this to be critical for having any chance of getting agents to generate code that is actually usable.

The more frequently you can verify correctness in some automated way the more likely the overall solution will be correct.

I’ve found that with good enough acceptance criteria (both positive and negative) it’s usually sufficient for agents to complete one off tasks without a human making a lot of changes. Essentially, if you’re willing to give up maintainability and other related properties, this works fairly well.

I’ve yet to find agents good enough to generate code that needs to be maintained long term without a ton of human feedback or manual code changes.

424 comments