
The highest quality codebase

641 points | Gricha | 2 months ago | gricha.dev

393 comments



xnorswap|2 months ago

Claude is really good at specific analysis, but really terrible at open-ended problems.

"Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

"Hey claude, anything I could do to improve Y?", and it'll struggle beyond the basics that a linter might suggest.

It enthusiastically suggested a library for <work domain> and was all "Recommended" about it, but when I pointed out that the library had been considered and rejected because <issue>, it understood and wrote up why that library suffered from that issue and why it was therefore unsuitable.

There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.

That may well change, so I don't want to embed that thought too deeply into my own priors, because the LLM space seems to evolve rapidly. I wouldn't want to find myself blind to the progress because I write it off from a class of problems.

But right now, the best way to help an LLM is to have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.

pdntspa|2 months ago

That's why you treat it like a junior dev. You do the fun stuff of supervising the product, overseeing design and implementation, breaking up the work, and reviewing the outputs. It does the boring stuff of actually writing the code.

I am phenomenally productive this way, I am happier at my job, and its quality of work is extremely high as long as I occasionally have it stop and self-review its progress against the style principles articulated in its AGENTS.md file. (It tends to forget a lot of rules, like DRY.)

order-matters|2 months ago

TBH I think its ability to structure unstructured data is what makes it a powerhouse tool, and there is so much juice to squeeze there that we can make process improvements for years even if it doesn't get any better at general intelligence.

If I had a PDF printout of a table, the workflow I used to need to get that back into a table data structure for automation was hard (annoying): dedicated OCR tools with limitations on inputs, multiple models in that tool for the different ways the paper the table was on might be formatted. It took hours for a new input format.

Now I can take a photo of something with my phone and get a data table in like 30 seconds.

People seem so desperate to outsource their thinking to these models and operate at the limits of their capability, but I have been having a blast using it to cut through so much tedium that wasn't an unsolved problem, but required enough specialized tooling and custom config to be left alone unless you really had to.

This fits into what you're saying about using it to do the grunt work I find boring, I suppose, but it feels like a bit more than that: it has opened a lot of doors to spaces with grunt work that wasn't worth doing for the end result previously, but now it is.
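
The photo-to-table step leans on the model, but the cleanup after it is plain code. A minimal sketch of the post-processing half, assuming the model returns its answer as a markdown table (the helper and sample response here are illustrative, not from any real API):

```python
def parse_markdown_table(text: str) -> list[dict]:
    """Parse a markdown table (as an LLM typically returns one) into rows of dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip().startswith("|")]
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator line
    return [dict(zip(header, row)) for row in body]

# A plausible model response for a photographed receipt:
response = """
| Item   | Qty | Price |
|--------|-----|-------|
| Widget | 2   | 9.99  |
| Gadget | 1   | 4.50  |
"""
table = parse_markdown_table(response)
# table is now a list of dicts ready for whatever automation comes next.
```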

mbesto|2 months ago

> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.

While this is true in my experience, the opposite is not: LLMs are very good at helping me go through a structured process of thinking about architectural and structural design, and then helping build a corresponding specification.

More specifically the "idea honing" part of this proposed process works REALLY well: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/

This part: "Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let's do this iteratively and dig into every relevant detail. Remember, only one question at a time."

asmor|2 months ago

This is it. It doesn't replace the higher level knowledge part very well.

I asked Claude to fix a pet peeve of mine: spawning a second process inside an existing Wine session (pretty hard if you use umu, since it runs in a user namespace). I asked Claude to write me a Python server to spawn another process to pass through a file handler "in Proton", and it proceeded into a long loop of trying to find a way to launch into an existing Wine session from Linux with tons of environment variables that didn't exist.

Then I specified "server to run in Wine using Windows Python" and it got more things right. Except it tried to use named pipes for IPC. Which, surprise surprise, doesn't work for talking to the Linux piece. Only after I specified "local TCP socket" did it start to go right. Had I written all those technical constraints and made the design decisions in the first message, it'd have been a one-hit success.
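
The shape of the fix that finally worked can be sketched roughly as follows: a plain loopback TCP socket, which crosses the Wine/Linux boundary where a named pipe can't. The command and port handling are made up for illustration; the server thread stands in for the Windows-Python side:

```python
import socket
import threading

# Stands in for the server running under Windows Python inside Wine.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))        # 0 = let the OS pick a free port
port = srv.getsockname()[1]
srv.listen(1)

def serve_once() -> None:
    conn, _ = srv.accept()
    with conn:
        cmd = conn.recv(1024).decode()
        # The real server would spawn `cmd` inside the existing Wine session here.
        conn.sendall(f"spawned: {cmd}".encode())

threading.Thread(target=serve_once, daemon=True).start()

# The Linux side: a plain TCP client works where a named pipe cannot cross over.
with socket.create_connection(("127.0.0.1", port), timeout=5) as cli:
    cli.sendall(b"notepad.exe")
    reply = cli.recv(1024).decode()
srv.close()
```

Localhost TCP is visible to both worlds because Wine shares the host's network stack, which is exactly what the named-pipe attempt lacked.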

james_marks|2 months ago

This is a key part of the AI love/hate flame war.

Very easy to write it off when it spins out on the open-ended problems, without seeing just how effective it can be once you zoom in.

Of course, zooming in that far gives back some of the promised gains.

Edit: typo

ericmcer|2 months ago

Exactly. If you visualize software as a bunch of separate "states" (UI state, app state, DB state), then our job is to mutate states and synchronize those mutations across the system. LLMs are good at mutating a specific state in a specific way. They are trash at designing what data shape a state should be, and they are bad at figuring out how/why to propagate mutations across a system.

dolftax|2 months ago

The structured vs open-ended distinction here applies to code review too. When you ask an LLM to "find issues in this code", it'll happily find something to say, even if the code is fine. And when there are actual security vulnerabilities, it often gets distracted by style nitpicks and misses the real issues.

Static analysis has the opposite problem - very structured and deterministic, but limited to predefined patterns, and it overwhelms you with false positives.

The sweet spot seems to be to give structure to what the LLM should look for, rather than letting it roam free on an open-ended "review this" prompt.

We built Autofix Bot[1] around this idea.

[1] https://autofix.bot (disclosure: founder)

BatteryMountain|2 months ago

It works great in C# (where you have strong typing + strict compiler).

Try this:

Have a look at xyz.cs. Do a full audit of the file and look for any database operations in loops that can be pre-filtered.

Or:

Have a look at folder /folderX/ and add .AsNoTracking() to all read-only database queries. When you are done, run the compiler and fix the errors. Only modify files in /folderX/ and do not go deeper in the call hierarchy. Once you are done, do a full audit of each file and make sure you did not accidentally add .AsNoTracking() to tracked entities. Do not create any new files or make backups, I already created a git branch for you. Do not make any git commits.

Or:

Have a look at the /Controllers/ folder. Do a full audit of each controller file and make sure there are no hard-coded credentials, usernames, passwords, or tokens.

Or: Have a look at folder /folderX/. Find any repeated hard-coded values, magic values and literals that will make good candidates to extract to Constants.cs. Make sure to add XML comments to the Constants.cs file to document what the value is for. You may create classes within Constants.cs to better group certain values, like AccountingConstants or SystemConstants etc.

These kinds of tasks work amazingly well in Claude Code and can often be one-shotted. Make sure you check your git diffs - you cannot and should not blame AI for shitty code - it's your name next to the commit, so make sure it is correct. You can even ask Claude to review the file with you afterwards. I've used this kind of approach to greatly increase our overall code quality & performance tuning - I really don't understand all the negative comments, as this approach has chopped days' worth of refactorings down to minutes and hours.

In places where you see your coding assistant is slow, or making mistakes, or going line by line where you know a simple regex find/replace would work instantly, ask it to help you create a shell script that does task xyz, as a tool for itself to call. I've made a couple of scripts using this approach that Claude can call locally to fix certain code patterns in 5 seconds, where it would've taken it (and me checking it) 30 minutes at least, and it won't eat up context or tokens.
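
As a hedged illustration of that kind of helper tool (sketched here in Python rather than shell, and with an invented pattern - migrating a hypothetical deprecated `Log.Write(...)` call to `Logger.Info(...)` - not anything from the parent's codebase):

```python
import re
from pathlib import Path

# Hypothetical rewrite: the pattern and replacement are illustrative only.
PATTERN = re.compile(r"Log\.Write\(")
REPLACEMENT = "Logger.Info("

def fix_file(path: Path) -> int:
    """Apply the rewrite to one file and return how many call sites changed."""
    text = path.read_text()
    new_text, count = PATTERN.subn(REPLACEMENT, text)
    if count:
        path.write_text(new_text)
    return count

def fix_tree(root: Path, glob: str = "*.cs") -> int:
    """Run the fix over a whole folder - the kind of tool an agent can invoke in seconds."""
    return sum(fix_file(p) for p in root.rglob(glob))
```

The point is that a deterministic script like this runs in milliseconds and burns no context, while the agent editing line by line would burn both time and tokens.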

plufz|2 months ago

I think slash commands are great to help Claude with this. I have many like /code:dry, /code:clean-code etc. that have a semi-long prompt and references to longer docs, to review code from a specific perspective. I think it at least improves Claude a bit in this area. Like processes or templates for thinking in broader ways. But yes, I agree it struggles a lot in this area.

lazarus01|2 months ago

>> But right now, the best way to help an LLM is to have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.

This is exactly how I use it. I prefer Gemini 3 personally.

I try to learn as much as I can about different architectures, usually by reading books or other implementations and coding first principles to build a mental model. I apply the architecture to the problem and the AI fills in the gaps. I try my best to focus on and cover those gaps.

The reason I think it is inconsistent in nailing a variety of tasks is the recipe for training LLMs, which is pre-training + RL. The RL environment sends a training signal to update all the weights in its trajectory for the successful response. Karpathy calls it “sucking supervision through a straw”. This breaks other parts of the model.

d-lisp|2 months ago

I remember a problem I had while quickly testing notcurses. I tried ChatGPT, which produced a lot of weird but kinda believable statements: that I had to include wchar and define a specific preprocessor macro, AND that I had to place the includes for notcurses, the other includes, and the macros in a specific order.

My sentiment was "that's obviously a weird, non-intended hack", but I wanted to test quickly, and well... it worked. Later, reading the man pages, I learned that I needed to pass specific flags to gcc in place of the GPT-advised solution.

I think these kinds of value-based judgements are hard to emulate for LLMs; it's hard for them to identify a single source as the most authoritative in a sea of less authoritative (but numerous) sources.

charleshn|2 months ago

It's fundamentally because of verifier's law [0].

Current AI, and in particular RL-based AI, has already achieved or will soon achieve superhuman performance on problems that can be quickly verified and measured.

So maths, algorithms, etc., and well-defined bugs fall into that category.

However, architectural decisions, design, and long-term planning - where there is little data, no model allowing synthetic data generation, and long iteration cycles - are not so amenable to it.

[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

cyral|2 months ago

Using the plan mode in cursor (or asking claude to first come up with a plan) makes it pretty good at generic "how can I improve" prompts. It can spend more effort exploring the codebase and thinking before implementing.

giancarlostoro|2 months ago

> "Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

This is true. As for "open-ended", I use Beads with Claude Code: I ask it to identify things based on criteria (even if it's open-ended), then I ask it to make tasks, then when it's done I ask it to research and ask clarifying questions for those tasks. This works really well.

lucideer|2 months ago

> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving

I'd hesitate to call this a blind spot. LLMs have a lot of actual blind spots - things the people developing them overlook or deprioritize. This strikes me more as something they're acutely aware of & failing at, despite significant efforts to solve.

cultofmetatron|2 months ago

> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving.

That's called job security!

mkw5053|2 months ago

I’ve had reasonable success having it ultrathink through every possible X (exhaustively) and their trade-offs, and then give me a ranked list and rationale for its top recommendations. I almost always choose the top, but just reading the list and then giving it next steps has worked really well for me.

kccqzy|2 months ago

Not at all my experience. I’ve often tried things like telling Claude this SIMD code I wrote performed poorly and I needed some ideas to make it go faster. Claude usually does a good job rewriting the SIMD to use different and faster operations.

theshrike79|2 months ago

Codex is better for the latter style. It takes its time, mulls about and investigates and sometimes finds a nugget of gold.

Claude is for getting shit done, it's not at its best at long research tasks.

andai|2 months ago

The current paradigm is we sorta-kinda got AGI by putting dodgy AI in a loop:

until works { try again }

The stuff is getting so cheap and so fast... a sufficient increment in quantity can produce a phase change in quality.
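
That loop can be sketched with stubs standing in for the model call and the verifier (in a real harness, `works` would run the tests or the compiler, and `generate` would be an actual API call):

```python
import itertools

def generate(attempt: int) -> str:
    # Stub for the "dodgy AI" call: fails twice, then produces something that works.
    return "ok" if attempt >= 3 else "broken"

def works(candidate: str) -> bool:
    # Stub verifier: in practice, run the tests / compiler / linter here.
    return candidate == "ok"

# until works { try again } -- with a cap so the loop can't spin forever.
for attempt in itertools.count(1):
    candidate = generate(attempt)
    if works(candidate) or attempt >= 10:
        break
```

The whole "phase change" argument rests on each iteration being cheap enough that a handful of retries costs less than one human attempt.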

awesome_dude|2 months ago

My experience with Claude has been that having it "review" my code has produced some helpful feedback and refactoring suggestions, but it also falls short in other areas.

fudged71|2 months ago

This tells me that we need to build 1000 more linters of all kinds

ludicrousdispla|2 months ago

>> "Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

Back in the day, we would just do this with a search engine.

ljm|2 months ago

I am basically rawdogging Claude these days, I don’t use MCPs or anything else, I just lay down all of the requirements and the suggestions and the hints, and let it go to work.

When I see my colleagues use an LLM they are treating it like a mind reader and their prompts are, frankly, dogshit.

It shows that articulating a problem is an important skill.

postalcoder|2 months ago

One of my favorite personal evals for LLMs is testing their stability as a reviewer.

The basic gist is to give the LLM some code to review and have it assign a grade multiple times. How much variance is there in the grade?

Then, prompt the same LLM to be a "critical" reviewer with the same code, again multiple times. How much does the average critical grade change?

A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.

I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.

You could also interpret these results to be a proxy for obsequiousness.

Edit: One major part of the eval I left out is "can an LLM converge on an 'A'?" Let's say the LLM gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?

It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
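
A rough sketch of such a stability harness. `grade_code` here is a noisy stub standing in for the actual model call (with the instability and "critical" drop baked in for demonstration), so the numbers illustrate only the harness shape, not any real model's behavior:

```python
import random
import statistics

def grade_code(code: str, critical: bool, rng: random.Random) -> float:
    # Noisy stub for an LLM grading call; this fake "reviewer" is unstable
    # and drops its grade hard when asked to be critical.
    base = 7.0 - (2.0 if critical else 0.0)
    return max(0.0, min(10.0, base + rng.gauss(0, 1.5)))

def stability(code: str, n: int = 50, seed: int = 0) -> dict:
    rng = random.Random(seed)
    plain = [grade_code(code, False, rng) for _ in range(n)]
    crit = [grade_code(code, True, rng) for _ in range(n)]
    return {
        "variance": statistics.variance(plain),                            # low = stable grades
        "critical_delta": statistics.mean(plain) - statistics.mean(crit),  # low = not obsequious
    }

report = stability("def f(): return 42")
```

Swap the stub for real API calls and the two numbers directly encode the parent's two signals: grade variance across generations, and the plain-vs-critical delta.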

lemming|2 months ago

I agree, I mostly use Claude for writing code, but I always get GPT5 to review it. Like you, I find it astonishingly consistent and useful, especially compared to Claude. I like to reset my context frequently, so I’ll often paste the problems from GPT into Claude, then get it to review those fixes (going around that loop a few times), then reset the context and get it to do a new full review. It’s very reassuring how consistent the results are.

adastra22|2 months ago

You mean literally assign a grade, like B+? This is unlikely to work based on how token prediction & temperature work. You're going to get a probability distribution in the end that is reflective of the model runtime parameters, not the intelligence of the model.

OsrsNeedsf2P|2 months ago

How is this different than testing the temperature?

guluarte|2 months ago

My experience with PR review is that sometimes it says a PR is perfect with some nitpicks, and other times it says the same PR is trash and needs a lot of work.

elzbardico|2 months ago

LLMs have this strong bias towards generating code, because writing code is the default behavior from pre-training.

Removing code, renaming files, condensing, and other edits are mostly post-training stuff, supervised-learning behavior. You have armies of developers across the world making 17 to 35 dollars an hour solving tasks step by step, which are then basically used to generate prompt/response pairs of desired behavior for a lot of common development situations, adding desired output for things like tool calling, which is needed for things like deleting code.

A typical post-training dataset-generation task would involve a scenario like: given this Dockerfile for a Python application, when we try to run pytest it fails with the exception "foo not found". The human will notice that package foo is not installed, change the requirements.txt file, and write this down; then they will try pip install, and notice that the foo package requires a certain native library to be installed. The final output of this will be a response with the appropriate tool calls in a structured format.

Given that the amount of unsupervised learning is way bigger than the amount spent on fine-tuning for most models, it is no surprise that, given any ambiguous situation, the model will default to what it knows best.

More post-training will usually improve this, but the quality of the human generated dataset probably will be the upper bound of the output quality, not to mention the risk of overfitting if the foundation model labs embrace SFT too enthusiastically.

hackernewds|2 months ago

> Writing code is the default behavior from pre-training

what does this even mean? could you expand on it

f311a|2 months ago

I like to ask LLMs to find problems or improvements in 1-2 files. They are pretty good at finding bugs, but for general code improvements, 50-60% of the edits are trash. They add completely unnecessary stuff. If you ask them to improve pretty well-written code, they rarely say it's good enough already.

For example, in a functional-style codebase, they will try to rewrite everything into classes. I have to adjust the prompt to list things that I'm not interested in. And some inexperienced people are trying to write better code by learning from such LLM edits...

ryandrake|2 months ago

I asked Claude the other day to look at one of my hobby projects that has a client/server architecture and a bespoke network protocol, and brainstorm ideas for converting it over to HTTP, JSON-RPC, or something else standards-based. I specifically told it to "go wild" and really explore the space. It thought for a while and provided a decent number of suggestions (several I was unaware of) with "verdicts". Ultimately, though, it concluded that none of them were ideal, and that the custom wire protocol was fine and appropriate for the project. I was kind of shocked at this conclusion: I expected it to behave like that eager intern persona we all have come to expect--ready to rip up the code and "do things."

pawelduda|2 months ago

If you just ask it to find problems, it will do its best to find them - like running a while loop with no return condition. That's why I put a breaker in the prompt, which in this case would be "don't make any improvements if the positive impact is marginal". I've mostly seen it do nothing and just summarize why, followed by some suggestions in case I still want to force the issue.

kderbyma|2 months ago

Yeah. I noticed Claude suffers when it reaches context overload - it's too opinionated, so it shortens its own context with decisions I would not ever make, yet I see it telling itself that the shortcuts are a good idea because the project is complex... Then it gets into a loop where it second-guesses its own decisions, forgets the context, and continues to spiral uncontrollably into deeper and deeper failures - often missing the obvious glitch and instead looking into imaginary land for answers, constantly diverting the solution from patching to completely rewriting...

I think it suffers from performance anxiety...

----

The only solution I have found is to - rewrite the prompt from scratch, change the context myself, and then clear any "history or memories" and then try again.

I have even gone so far as to open nested folders in separate windows to "lock in" scope better.

As soon as I see the agent say "Wait, that doesn't make sense, let me review the code again", it's cooked.

embedding-shape|2 months ago

> Yeah. I noticed Claude suffers when it reaches context overload

All LLMs degrade in quality as soon as you go beyond one user message and one assistant response. If you're looking for accuracy and highest possible quality, you need to constantly redo the conversations from scratch, never go beyond one user message.

If the LLM gets it wrong in its first response, instead of saying "No, what I meant was...", you need to edit your first message and re-generate; otherwise the conversation becomes "poisoned" almost immediately, and every token generated after that will suffer.
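
In message-list terms, the discipline looks like this (the `regenerate` stub stands in for a real chat-completion call; the prompts are made up):

```python
def regenerate(messages: list[dict]) -> str:
    # Stub for a chat-completion call; a real client would send `messages` to the model.
    return f"answer to: {messages[0]['content']}"

# First attempt: one user message, one response.
conversation = [{"role": "user", "content": "Parse this log file"}]
reply = regenerate(conversation)

# The model misunderstood. Instead of appending "No, what I meant was..."
# (which leaves the wrong turn in context), rewrite the one user message
# to carry the clarification, and regenerate from scratch:
conversation = [{"role": "user", "content": "Parse this log file (it's JSON lines, not CSV)"}]
reply = regenerate(conversation)
```

The history stays at a single user turn, so the model never sees its own wrong answer.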

SV_BubbleTime|2 months ago

I’m keeping Claude’s tasks small and focused, then if I can I clear between.

It’s REAL FUCKING TEMPTING to say ”hey Claude, go do this thing that would take me hours and you seconds” because he will happily, and it’ll kinda work. But one way or another you are going to put those hours in.

It’s like programming… is proof of work.

someguyiguess|2 months ago

There’s definitely a certain point I reach when using Claude code where I have to make the specifications so specific that it becomes more work than just writing the code myself

rtp4me|2 months ago

For me, too many compactions throughout the day eventually lead to a decline in Claude's thinking ability. And, during that time, I have given it so much context to help drive the coding interaction. Thus, restarting Claude requires me to remember the small bits of "nuggets" we discovered during the last session so I find myself repeating the same things every day (my server IP is: xxx, my client IP is: yyy, the code should live in directory: a/b/c). Using the resume feature with Claude simply brings back the same decline in thinking that led me to stop it in the first place. I am sure there is a better way to remember these nuggets between sessions but I have not found it yet.

snarf21|2 months ago

That has been my greatest stumbling block with these AI agents: context. I was trying to have one help vibe code a puzzle game and most of the time I added a new rule it broke 5 existing rules. It also never approached the rules engine with a context of building a reusable abstraction, just Hammer meet Nail.

flowerthoughts|2 months ago

There's no -c on the command line, so I'm guessing this is starting fresh every iteration, unless claude(1) has changed the default lately.

iambateman|2 months ago

The point he’s making - that LLMs aren’t ready for broadly unsupervised software development - is well made.

It still requires an exhausting amount of thought and energy to make the LLM go in the direction I want, which is to say a direction that considers the code outside the current context window.

I suspect that we will not solve the context window problem for a long time. But we will see a tremendous growth in “on demand tooling” for things which do fit into a context window and for which we can let the AI “do whatever it wants.”

For me, my work product needs to conform to existing design standards and I can’t figure out how to get Claude to not just wire up its own button styles.

But it’s remarkable how—despite all of the nonsense—these tools remain an irreplaceable part of my work life.

spaceywilly|2 months ago

I feel like I’ve figured out a good workflow with AI coding tools now. I use it in “Planning mode” to describe the feature or whatever I am working on and break it down into phases. I iterate on the planning doc until it matches what I want to build.

Then, I ask it to execute each phase from the doc one at a time. I review all the code it writes or sometimes just write it myself. When it is done it updates the plan with what was accomplished and what needs to be done next.

This has worked for me because:

- it forces the planning part to happen before coding. A lot of Claude’s “wtf” moments can be caught in this phase, before it writes a ton of gobbledygook code that I then have to clean up

- the code is written in small chunks, usually one or two functions at a time. It’s small enough that I can review all the code and understand before I click accept. There’s no blindly accepting junk code.

- the only context is the planning doc. Claude captures everything it needs there, and it’s able to pick right up from a new chat and keep working.

- it helps my distraction-prone brain make plans and keep track of what I was doing. Even without Claude writing any code, this alone is a huge productivity boost for me. It’s like having a magic notebook that keeps track of where I was in my projects so I can pick them up again easily.

torginus|2 months ago

Which is why I think agentic software development is not really worth it today. It can solve well-defined problems and work through issues by rote, but if you give it some task and have it work for a couple of hours, you then have to come in and fix it up.

I think LLMs are still at the 'advanced autocomplete' stage, where the most productive way to use them is to have a human in the loop.

For this, accuracy in following instructions and short feedback time are much more important than semi-decent behavior over long-horizon tasks.

samuelknight|2 months ago

This is an interesting experiment that we can summarize as "I gave a smart model a bad objective", with the key result at the end

"...oh and the app still works, there's no new features, and just a few new bugs."

Nobody thinks that doing 200 improvement passes on a functioning code base is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility of the principal engineer, but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see different behavior if the prompt were changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.

In my experience with CC, I get great results when I ask an open-ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.

torginus|2 months ago

I've heard a very apt criticism of the current batch of LLMs:

LLMs are incapable of reducing entropy in a code base

I've always had this nagging feeling, but I think this really captures the essence of it succinctly.

keeda|2 months ago

Hilarious! Kinda reinforces the idea that LLMs are like junior engineers with infinite energy.

But just telling an AI it's a principal engineer does not make it a principal engineer. Firstly, that is such a broad, vaguely defined term, and secondly, typically that level of engineering involves dealing with organizational and industry issues rather than just technical ones.

And so absent a clear definition, it will settle on the lowest common denominator of code quality, which would be test coverage -- likely because that is the most common topic in its training data -- and extrapolate from that.

The other thing is, of course, the RL'd sycophancy which compels it to do something, anything, to obey the prompt. I wonder what would happen if you tweaked the prompt just a little bit to say something like "Use your best judgement and feel free to change nothing."

mbesto|2 months ago

While there are justifiable comments here about how LLMs behave, I want to point out something else:

There is no consensus on what constitutes a high quality codebase.

Said differently - even if you asked 200 humans to do this same exercise, you would get 200 different outputs.

m101|2 months ago

This is a great example of there being no intelligence under the hood.

xixixao|2 months ago

Would a human perform very differently? A human who must obey orders (like maybe they are paid to follow the prompt). With some "magnitude of work" enforced at each step.

I'm not sure there's much to learn here, besides it's kinda fun, since no real human was forced to suffer through this exercise on the implementor side.

Terretta|2 months ago

Just as enterprise software is proof positive of no intelligence under the hood.

I don't mean the code producers, I mean the enterprise itself is not intelligent yet it (the enterprise) is described as developing the software. And it behaves exactly like this, right down to deeply enjoying inflicting bad development/software metrics (aka BD/SM) on itself, inevitably resulting in:

https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

SV_BubbleTime|2 months ago

Well… it’s more a great example that great output is a good model with the right context at the right time.

Take away everything else: there’s a product that is really good at small tasks, but it doesn’t mean that chaining those small tasks together to make a big task should work.

hazmazlaz|2 months ago

Well of course it produced bad results... it was given a bad prompt. Imagine how things would have turned out if you had given the same instructions to a skilled but naive contractor who contractually couldn't say no and couldn't question you. Probably pretty similar.

mainmailman|2 months ago

Yeah I don't see the utility in doing this hundreds of times back to back. A few iterations can tell us some things about how Claude optimizes code, but an open ended prompt to endlessly "improve" the code sounds like a bad boss making huge demands. I don't blame the AI for adding BS down the line.

dcchuck|2 months ago

I spent some time last night "over iterating" on a plan to do some refactoring in a large codebase.

I created the original plan with a very specific ask - create an abstraction to remove some tight coupling. Small problem that had a big surface area. The planning/brainstorming was great and I like the plan we came up with.

I then tried to use a prompt like OP's to improve it (as I said, large surface area so I wanted to review it) - "Please review PLAN_DOC.md - is it a comprehensive plan for this project?". I'd run it -> get feedback -> give it back to Claude to improve the plan.

I (naively perhaps) expected this process to converge to a "perfect plan". At this point I think of it more like a probability tree where there's a chance of improving the plan, but a non-zero chance of getting off the rails. And once you go off the rails, you only veer further and further from the truth.

There are certainly problems where "throwing compute" at it and continuing to iterate with an LLM will work great. I would expect those to have firm success criteria. Providing definitions of quality would significantly improve the output here as well (or decrease the probability of going off the rails, I suppose). Otherwise Claude will get confused about what quality means, as we see here.

Shout out OP for sharing their work and moving us forward.

Gricha|2 months ago

I think I end up doing that with plans inadvertently too. Oftentimes I'll iterate on a plan too many times, and only recognize that it's too far gone and needs a restart with more direction after sinking 15 minutes into it.

elzbardico|2 months ago

Small errors compound over time.

jedberg|2 months ago

You know how when you hear how many engineers are working on a product, you think to yourself, "but I could do that with like three people!"? Now you know why they have so many people. Because they did this with their codebase, but with humans.

Or I should say, they kept hiring the humans who needed something to do, and basically did what this AI did.

minimaxir|2 months ago

About a year ago I wrote a blog post (HN discussion: https://news.ycombinator.com/item?id=42584400) experimenting with whether asking Claude to "write code better" repeatedly would indeed cause it to write better code, determined by speed, as better code implies more efficient algorithms. I found that it did indeed work (at n=5 iterations), and that additionally providing a system prompt improved it further.

Given with what I've seen from Claude 4.5 Opus, I suspect the following test would be interesting: attempt to have Claude Code + Haiku/Sonnet/Opus implement and benchmark an algorithm with:

- no CLAUDE.md file

- a basic CLAUDE.md file

- an overly nuanced CLAUDE.md file

And then both test the algorithm speed and number of turns it takes to hit that algorithm speed.

maddmann|2 months ago

lol 5000 tests. Agentic code tools have a significant bias toward adding versus removing/condensing. This leads to a lot of bloat and orphaned code. Definitely something that still needs to be solved by agentic tools.

oofbey|2 months ago

Oh I’ve had agents remove tests plenty of times. Or cripple the tests so they pass but are useless - more common and harder to prompt against.

thomassmith65|2 months ago

With a good programmer, if they do multiple passes of a refactor, each pass makes the code more elegant, and the next pass easier to understand and further improve.

Claude has a bias to add lines of code to a project, rather than make it more concise. Consequently, each refactoring pass becomes more difficult to untangle, and harder to improve.

Ideally, in this experiment, only the first few passes would result in changes - mostly shrinking the project size - and from then on, Claude would change nothing, just like a very good programmer.

This is the biggest problem with developing with Claude, by far. Anthropic should laser focus on fixing it.

failuremode|2 months ago

> We went from around 700 to a whooping 5369 tests

> Tons of tests got added, but some tests that mattered the most (maestro e2e tests that validated the app still works) were forgotten.

I've seen many LLM proponents often cite the number of tests as a positive signal.

This smells, to me, like people who tout lines of code.

When you are counting tests in the thousands, I think it's a negative signal.

You should be writing property based tests rather than 'assert x=1', 'assert x=2', 'assert x=-1' and on and on.

If LLMs are incapable of acknowledging that then add it to the long list of 'failure modes'.

Bombthecat|2 months ago

Story of AI:

For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.

bulletsvshumans|2 months ago

I think the prompt is a major source of the issue. "We need to improve the quality of this codebase" implicitly indicates that there is something wrong with the codebase. I would be curious to see if it would reach a point of convergence with a prompt that allowed for it. Something like "Improve the quality of this codebase, or tell me that it is already in an optimal state."

whalesalad|2 months ago

I would love to see an experiment done like this with an arena of principal engineer agents. Give each of them a unique personality: this one likes shiny new objects and is willing to deal with early adopter pain, this one is a neckbeard who uses emacs as pid 1 and sends email via usb thumbdrive, and the third is a pragmatic middle of the road person who can help be the glue between them. All decisions need to reach a quorum before continuing. Better yet: each agent is running on a completely different model from a different provider. 3 can be a knob you dial up to 5, 10, etc. Each of these agents can spawn sub-agents, to reach out to professionals like a CSS expert, or a DBA.

I think prompt engineering could help here a bit: adding some context on what a quality codebase is, removing everything that is not necessary, considering future maintainability (20->84k lines is a smell). All of these are smells that a simple supervisor agent could have caught.

websiteapi|2 months ago

you gotta be strategic about it. so for example for tests, tell it to use equivalence testing and to prove it, e.g. create a graph of permutations of arguments and their equivalences from the underlying code, and then use that to generate the tests.

telling it to do better without any feedback obviously is going to go nowhere fast.

ttul|2 months ago

Have you tried writing into the AGENTS.md something like, "Always be on the lookout for dead code, copy-pasta, and other opportunities to optimize and trim the codebase in a sensible way."

In my experience, adding this kind of instruction to the context window causes SOTA coding models to actually undertake that kind of optimization while development carries on. You can also periodically chuck your entire codebase into Gemini-3 (with its massive context window) and ask it to write a refactoring plan; then, pass that refactoring plan back into your day-to-day coding environment such as Cursor or Codex and get it to take a few turns working away at the plan.

As with human coders, if you let them run wild "improving" things without specifically instructing them to also pay attention to bloat, bloat is precisely what you will get.

elzbardico|2 months ago

Funniest part:

> ..oh and the app still works, there's no new features, and just a few new bugs.

written-beyond|2 months ago

> I like Rust's result-handling system, I don't think it works very well if you try to bring it to the entire ecosystem that already is standardized on error throwing.

I disagree, it's very useful even in languages that have exception-throwing conventions. It's good enough to be the return type of the Promise.allSettled API.

The problem is that when I don't have a result type, I end up approximating it anyway in other ways. For a quick project I'd stick with exceptions, but depending on my codebase I usually use the Go-style ok, err tuple (it's usually clunkier in TS though) or a Rust-style Result ok/err enum.
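For reference, a minimal Rust-style Result in TypeScript can be sketched like this (the `ok`/`err`/`tryCatch` helpers are illustrative, not from any particular library):

```typescript
// Discriminated union: the compiler forces a check on `ok`
// before `value` or `error` can be touched.
type Ok<T> = { ok: true; value: T };
type Err<E> = { ok: false; error: E };
type Result<T, E> = Ok<T> | Err<E>;

const ok = <T>(value: T): Ok<T> => ({ ok: true, value });
const err = <E>(error: E): Err<E> => ({ ok: false, error });

// Wrap a throwing API at the boundary so callers see a value,
// not an exception.
function tryCatch<T>(fn: () => T): Result<T, Error> {
  try {
    return ok(fn());
  } catch (e) {
    return err(e instanceof Error ? e : new Error(String(e)));
  }
}

const parsed = tryCatch(() => JSON.parse('{"a": 1}') as { a: number });
if (parsed.ok) {
  // Type-narrowed to Ok here; no try/catch at the call site.
  console.log(parsed.value.a);
}
```

The discriminated union gives the exhaustive-handling ergonomics exceptions don't, while `tryCatch` keeps throwing third-party APIs contained at the boundary.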

turboponyy|2 months ago

I have the same disagreement. TypeScript with its structural and pseudo-dependent typing, somewhat-functionally disposed language primitives (e.g. first-class functions as values, currying) and standard library interfaces (filter, reduce, flatMap et al), and ecosystem make propagating information using values extremely ergonomic.

Embracing a functional style in TypeScript is probably the most productive I've felt in any mainstream programming language. It's a shame that the language was defiled with try/catch, classes and other unnecessary cruft so third party libraries are still an annoying boundary you have to worry about, but oh well.

The language is so well-suited for this that you can even model side effects as values, do away with try/catch, if/else and mutation a la Haskell, if you want[1].

[1] https://effect.website/

tracker1|2 months ago

On the Result<TR, TE> responses... I've seen this a few times. I think it works well in Rust or other languages that don't have the ability to "throw" baked in. However, when you bolt it onto a language that can implicitly throw, you're now doing twice the work, since you have to handle both the explicit error result and thrown exceptions.

I worked in a C# codebase with Result responses all over the place, and it just really complicated every use case all around. Combined with Promises (TS) it's worse still.

mrsmrtss|2 months ago

The Result pattern also works exceptionally well with C#, provided you ensure that code returning a Result object never throws an exception. Of course, there are still some exceptional things that can throw, but this is essentially the same situation as dealing with Rust panics.

hamasho|2 months ago

One and a half years ago, on Japanese Twitter, this method gathered a bit of attention. It's called the pawahara prompt (パワハラプロンプト, power harassment prompt) because it's like your asshole boss repeatedly saying "can you improve this more?" without any helpful suggestions until the employees break down. Even then, many people found it could improve the codebase to a point; I think it works much better now.

Hammershaft|2 months ago

Impressive that the app still works! Did not expect that.

elzbardico|2 months ago

Probably being a very simple application and starting with an already big testing suite helped.

blobbers|2 months ago

I'm curious if anyone has written a "Principal Engineer" agents.md or CLAUDE.md style file that yields better results than the 'junior dev' results people are seeing here.

I've worked on writing some as a data scientist, and I have gotten the basic Claude output to be much better; it makes some saner decisions, it validates and circles back to fix issues, etc.

surprisetalk|2 months ago

This reflects my experience with human programmers. So many devs are taught to add layers of complexity in pursuit of "best practices". I think the LLM was trained to behave this way.

In my experience, Claude can actually clean up a repo rather nicely if you ask it to (1) shrink source code size (LOC or total bytes), (2) reduce dependencies, and (3) maintain integration tests.

rvz|2 months ago

> ...oh and the app still works, there's no new features, and just a few new bugs.

Many apps out there have developers religiously worshipping high quality and over-engineering a single app with fewer than 10 users, or, if they are lucky, just over 1,000 users.

…and all of that and not a single dollar was made. Might as well have donated it to Anthropic.

barbazoo|2 months ago

> I can sort of respect that the dependency list is pretty small, but at the cost of very unmaintainable 20k+ lines of utilities. I guess it really wanted to avoid supply-chain attacks.

> Some of them are really unnecessary and could be replaced with off the shelf solution

Lots of people would regard this as a good thing. Surely the LLM can't guess which kind you are.

devy|2 months ago

> Read and summarize the project

> Implement a fresh project based off of this description

Genuine question, if we were to ask AI to do those two steps to generate a different code base from scratch entirely, does it qualify for a "clean room" design legally speaking?

Havoc|2 months ago

My current fav improvement strategy is

1) Run multiple code analysis tools over it and have the LLM aggregate it with suggestions

2) ask the LLM an open-ended question to list potential improvements, and pick by hand which ones I want

And usually repeat the process with a completely different model (ie diff company trained it)

Any more and yeah they end up going in circles

WhitneyLand|2 months ago

It can be difficult to explain to management why, in certain scenarios, AI can seem to work coding miracles, yet this still doesn't mean it's always going to speed up development 10x, especially for an established code base.

Tangible examples like this seem like a useful way to show some of the limitations.

swiftcoder|2 months ago

> The version "pre improving quality" was already pretty large. We are talking around 20k lines of TS

Even before the rest of it, 10k lines of code for an app with 5 screens is... yeah. Reminds me of "enterprise" Java codebases from 15 years ago.

maerF0x0|2 months ago

I would love to see someone do a longitudinal study of the incident/error rate of a canary container in prod that is managed by claude. Basically doing a control/experimental group to prove who does better the Humans or the AI?

fauigerzigerk|2 months ago

What would happen if you gave the same task to 200 human contractors?

I suspect SLOC growth wouldn't be quite as dramatic but things like converting everything to Rust's error handling approach could easily happen.

layer8|2 months ago

This makes me wonder what the result would be of having an AI turn a code base into literate-programming style, and have it iterate on that to improve the “literacy”.

orliesaurus|2 months ago

Ok SRS question: What's the best "Code Review" Skill/Agent/Prompt that I can use these days? Curious to see even paid options if anyone knows?

keepamovin|2 months ago

This is actually a great idea. It's like those "AI resampled this image 10,000 times" or "JPEG iteratively compressed this picture 1 million times" experiments.

g947o|2 months ago

When I ask coding agents to add tests, they often come up with something like this:

    const x = new NewClass();
    assert.ok(x instanceof NewClass);
So I am not at all surprised about Claude adding 5x tests, most of which are useless.

It's going to be fun to look back at this and see how much slop these coding agents created.

bitwize|2 months ago

There's probably a human manager going "Great! How come I can't get my engineering team to ship this much QUALITY?"

gm678|2 months ago

"Core Functional Utilities: Identity function - returns its input unchanged." is one of my favorites from `lib/functional.ts`.

GuB-42|2 months ago

It is something I noticed when talking to LLMs, if they don't get it right the first time, they probably never will, and if you really insist, the quality starts to degrade.

It is not unlike people, the difference being that if you ask someone the same thing 200 times, he is probably going to tell you to go fuck yourself or, if unable to, turn to malicious compliance. These AIs will always be diligent. Or, a human may use the opportunity to educate himself, but again, LLMs don't learn by doing; they have a distinct training phase that involves ingesting pretty much everything humanity has produced, and your little conversation will not have a significant effect, if any.

grvdrm|2 months ago

I use a new chat every time that happens and try to improve my prompt to get a better result. It sometimes works, and the multiple-chats approach annoys me less than one laborious long chat.

phildougherty|2 months ago

Pasting this whole article into Claude Code: "improve my codebase taking this article into account"

minimaxir|2 months ago

You can just give Claude Code/any modern Agent a URL and it'll retrieve it.

v3xro|2 months ago

Would be nice if every article about LLM/AI had that as a tag so you could skip past them...

mvanbaak|2 months ago

`--dangerously-skip-permissions` why?

minimaxir|2 months ago

It's necessary to allow Claude Code to be fully autonomous, otherwise it will stop and ask you to run commands.

chr15m|2 months ago

It behaved exactly like 99% of developers, introducing unnecessary complexity.

just6979|2 months ago

'In some iterations, coding agent put on a hat of security engineer. For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.'

Yes, you do know why. Because somewhere in its training, that functionality was linked to "quality" or "improvement". Remember what these things do at their core: really good auto-complete.

'The prompt, in all its versions, always focuses on us improving the codebase quality. It was disappointing to see how that metric is perceived by AI agent.'

Really? It's disappointing to see how that metric is perceived by humans, and the AIs are trained on things humans made. If people can't agree on "codebase quality", especially the ones who write loudly about it on the internet, it's going to be impossible for AI agents to agree. A prompt actually specifying what _you_ consider to be improvements would have been so much better: perhaps minimize 3rd-party deps, or minimize local utils reimplementing existing 3rd-party libs, or add quality typechecks.

'The leading principle was to define a few vanity metrics and push for "more is better".'

Yeah, because this is probably the most common thing it saw in training. Programmers actually making codebase quality improvements are just quietly doing it, while the ones shouting on the internet (hence into the training data) about how their [bad] techniques [appear to] improve quality are also the ones picking vanity metrics and pushing for "more is better".

'I've prompted Claude Code to failure here'

Not really a failure: it did exactly what you asked and improved "codebase quality" according to its training data. If you _required_ a human engineer to do the same thing 200 times, you'd get similar results as they run out of real improvements and start scouring the web for anything that anybody ever considered an "improvement", which very definitely includes vanity metrics and "more is better" regarding test count and coverage. You just showed that these AIs aren't much more than their training data. It's not actually thinking about quality, it's just barfing up things it has seen called "codebase quality improvements", regardless of the actual quality of those improvements.

jesse__|2 months ago

> This app is around 4-5 screens. The version "pre improving quality" was already pretty large. We are talking around 20k lines of TS

Fucking yikes dude. When's the last time it took you 4500 lines per screen, 9000 including the JSON data in the repo????? This is already absolute insanity.

I bet I could do this entire app in easily less than half, probably less than a tenth, of that.

VikingCoder|2 months ago

You need to scroll the windows to see all the numbers. (Why??)

simonw|2 months ago

The prompt was:

  Ultrathink. You're a principal engineer. Do not ask me any
  questions. We need to improve the quality of this codebase.
  Implement improvements to codebase quality.
I'm a little disappointed that Claude didn't eventually decide to start removing all of the cruft it had added to improve the quality that way instead.

Gricha|2 months ago

Yeah, the best it did on some iterations is that it claimed the codebase was already in a good state and didn't produce changes - but that was 1 in many.

pawelduda|2 months ago

Did it create 200 CODE_QUALITY_IMPROVEMENTS.md files by chance?

29athrowaway|2 months ago

Don't use cloc in 2025. Use tokei or whatever.

thald|2 months ago

Interesting experiment. Looking at this, I immediately thought of a similar experiment run by Google: AlphaEvolve. Throwing LLM compute at problems might work if the problem is well defined and the result can be objectively measured.

As for this experiment: what does quality even mean? Most human devs will have different opinions on it. If you asked 200 different devs (Claude starts from 0 after each iteration) to do the same, I have doubts the code would look much better.

I am also wondering what would happen if Claude had an option to just walk away from the code if it's "good enough". For each problem, most human devs run a cost->benefit equation in their head; only worthy ideas are realized. Claude does not do this: the cost of writing code is very low on its side, and the prompt does not allow any graceful exit :)

6LLvveMx2koXfwn|2 months ago

for all the bad code havoc was most certainly not 'wrecked', it may have been 'wreaked' though . . .

guluarte|2 months ago

that's my experience with AI: most times it creates an overengineered solution unless told to keep it simple

SKILNER|2 months ago

This strikes me as a very solid methodology for improving the results of all AI coding tools. I hope Anthropic, etc take this up.

Rather than converging on optimal code (Occam's Razor for both maintainability and performance) they are just spewing code all over the scene. I've noticed that myself, of course, but this technique helps to magnify and highlight the problem areas.

It makes you wonder how much training material was/is available for code optimization relative to training material for just coding to meet functional requirements. And therefore, what's the relative weight of optimizing code baked into the LLMs.

arconis987|2 months ago

next time, have the LLM alternate between these two steps:

- Do some work

- Critique the work

it will converge better

etamponi|2 months ago

Am I the only one that is surprised that the app still works?!

stavros|2 months ago

Well, given it can't say "no, I think it's good enough now", you'll just get madness, no?

minimaxir|2 months ago

That's the point. Sometimes madness is interesting.

KronisLV|2 months ago

> In message log, the agent often boasts about the number of tests added, or that code coverage (ugh) is over some arbitrary percentage. We end up with an absolute moloch of unmaintainable code in the name of quality. But hey, the number is going up.

Oh hey, just like real developers!

timtas|2 months ago

This reinforces my standard explanation of Claude Code: Claude is exactly like a junior engineer who is simultaneously brilliant and retarded.

It can do great things but needs close supervision. Claude doesn’t write code, Claude recommends code.

smallpipe|2 months ago

The viewport of this website is quite infuriating. I have to scroll horizontally to see the `cloc` output, but there's 3x the empty space on either side.

lubesGordi|2 months ago

So now you know. You can get claude to write you a ton of unit tests and also improve your static typing situation. Now you can restrict your prompt!

nadis|2 months ago

20K --> 84K lines of ts for a simple app is bananas. Much madness indeed! But also super interesting, thanks for sharing the experiment.

jcalvinowens|2 months ago

This really mirrors my experience trying to get LLMs to clean up kernel driver code, they seem utterly incapable of simplifying things.

mgrat|2 months ago

[flagged]

tomhow|2 months ago

Please don't fulminate on HN. We're here for curious conversation, not rage. This question has been debated here for the past couple of years now, and that debate will no doubt continue. This kind of indignant rhetorical question adds little of value to what is an important topic. Please make an effort to observe the guidelines if you want to participate here. https://news.ycombinator.com/newsguidelines.html

credit_guy|2 months ago

I see this sentiment quite often. The Economist chose the "word of the year"; it is "slop". Everybody hates AI slop.

And lots of people who use AI coding assistants go through a phase of pushing AI slop in prod. I know I did that. Some of it still bites me to this day.

But here's the thing: AI coding assistants did not exist two years ago. We are critical of them based on unfounded expectations. They are tools, and they have limitations. They are far, very, very far, from being perfect. They will not replace us for 20 years, at least.

But are they useful? Yes. Can you learn usage patterns to eliminate as much AI slop as possible? I personally hope I did; I think quite a lot of people who use AI coding assistants have found ways to tame the beast.

_jzlw|2 months ago

[deleted]

krupan|2 months ago

Just the headline sounds like a YouTube brain rot video title:

"I spent 200 days in the woods"

"I Google translated this 200 times"

"I hit myself with this golf club 200 times"

Is this really what hacker news is for now?

havkom|2 months ago

There are fundamental differences. Many people expect a positive gradient of quality from AI overhaul of projects. For translating back and forth, it is obvious from the outset that there is a negative gradient of quality (the Chinese whispers game).

jmkni|2 months ago

If you reverse the order this could be a very interesting Youtube series