totallykvothe | 1 month ago
I just have to conclude 1 of 2 things:
1) I'm not good at prompting, even though I'm one of the earliest adopters of AI in coding that I know, and have been using it consistently for years. So I find this hard to accept.
2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.
I'm not sure what else I can take from the situation. For context, I work on a 15-year-old Java Spring + React (with some old pages still in Thymeleaf) web application. There are many sub-services, two separate databases, and the application also needs a two-way interface with customer hardware. So, not a simple project, but still. I can't imagine it's way more complicated than most enterprise/legacy projects...
unyttigfjelltol|1 month ago
I’ve come back to the idea LLMs are super search engines. If you ask it a narrow, specific question, with one answer, you may well get the answer. For the “non-trivial” questions, there always will be multiple answers, and you’ll get from the LLM all of these depending on the precise words you use to prompt it. You won’t get the best answer, and in a complex scenario requiring highly recursive cross-checks— some answers you get won’t be functional.
It’s not readily apparent at first blush the LLM is doing this, giving all the answers. And, for a novice who doesn’t know the options, or an expert who can scan a list of options quickly and steer the LLM, it’s incredibly useful. But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.
20k|1 month ago
Without exception, every technical question I've ever asked an LLM that I know the answer to has been substantially wrong in some fashion. This makes it just... absolutely useless for research. In some cases I've spotted it straight up plagiarising from the original sources, with random capitalisation giving it away.
The issue is that once you get even slightly into a niche, they fall apart because the training data just doesn't exist. But they don't say "sorry, there's insufficient training data to give you an answer"; they just make shit up and state it confidently, however incorrect.
PeterStuer|1 month ago
The real problem, btw, was a bug introduced in the PDF handling package two versions ago that caused resource handling problems in some contexts, and the real solution was rolling back to the version before the bug.
I'm still using AI daily in my development though; as long as you sort of know what you are doing and have enough knowledge to evaluate the output, it is very much a net productivity multiplier for me.
friendzis|1 month ago
Typical iterative-circular process "write code -> QA -> fix remarks" works because the code is analyzable and "fix" is on average cheaper than "write", therefore the process, eventually, converges on a "correct" solution.
LLM prompting is on average much less analyzable (if at all) and therefore the process "prompt LLM -> QA -> fix prompt" falls somewhere between "does not converge" and "convergence tail is much longer".
This is consistent with typical observation where LLMs are working better: greenfield implementations of "slap something together" and "modify well structured, uncoupled existing codebase", both situations where convergence is easier in the first place, i.e. low existing entropy.
dividedbyzero|1 month ago
IAmGraydon|1 month ago
Yes! This is exactly what it is. A search engine with a lossy-compressed dataset of most public human knowledge, which can return the results in natural language. This is the realization that will pop the AI bubble if the public could ever bring themselves to ponder it en masse. Is such a thing useful? Hell yes! Is such a thing intelligent? Certainly NO!
PunchyHamster|1 month ago
Like, I've seen Claude go through the source code of a program, telling me (correctly!) which counters in the code return the value I need (I just wanted to look at some packet metrics), then inventing an entirely fake CLI command to extract those metrics.
carlmr|1 month ago
Now I'm wondering if I'm prompting wrong. I usually get one answer. Maybe a few options but rarely the whole picture.
I do like the super search engine view though. I often know what I want, but e.g. work with a language or library I'm not super familiar with. So then I ask how do I do x in this setting. It's really great for getting an initial idea here.
Then it gives me maybe one or two options, but they're verbose or add unneeded complexity. Then I start probing asking if this could be done another way, or if there's a simpler solution to this.
Then I ask what are the trade-offs between solutions. Etc.
It's maybe a mix of search engine and rubber ducking.
Agents are, like for OP, a complete failure for me though. Still can't get them to not run off into a completely strange direction, leaving a minefield of subtle coding errors and spaghetti behind.
XenophileJKO|1 month ago
What I will argue is that the LLMs are not just search engines. They have "compressed" knowledge. When they do this, they learn relations between all kinds of different levels of abstractions and meta patterns.
It is really important to understand that the model can follow logical rules and has some map of meta relationships between concepts.
Thinking of a LLM as a "search engine" is just fundamentally wrong in how they work, especially when connected to external context like code bases or live information.
daxfohl|1 month ago
There's been a notable jump over the course of the last few months, to where I'd say it's inevitable. For a while I was holding out for them to hit a ceiling where we'd look back and laugh at the idea they'd ever replace human coders. Now, it seems much more like a matter of time.
Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well. The idea being, they market directly to the product team, not to engineers who may have specific experience with one language, framework, database, or whatever.
I also think we'll see a revival of monolithic architectures. Right now, services are split up mainly because project/team workflows are also distributed so they can be done in parallel while minimizing conflicts. As AI makes dev cycles faster that will be far less useful, while having a single house for all your logic will be a huge benefit for AI analysis.
sublinear|1 month ago
I think instead the value is in getting a computer to execute domain-specific knowledge organized in a way that makes sense for the business, and in the context of those private computing resources.
It's not about the ability to write code. There are already many businesses running low-code and no-code solutions, yet they still have software engineers writing integration code, debugging and making tweaks, in touch with vendor support, etc. This has been true for at least a decade!
That integration work and domain-specific knowledge is already distilled out at a lot of places, but it's still not trivial. It's actually the opposite. AI doesn't help when you've finally shaved the yak smooth.
concats|1 month ago
The way I see it, there will always be a layer in the corporate organization where someone has to interact with the machine. The transitioning layer from humans to AIs. This is true no matter how high up the hierarchy you replace the humans, be it the engineers layer, the engineering managers, or even their managers.
Given the above, it feels reasonable to believe that whatever title that person has—who is responsible for converting human management's ideas into prompts (or whatever the future has the text prompts replaced by)—that person will do a better job if they have a high degree of technical competence. That is to say, I believe most companies will still want and benefit if that/those employees are engineers. Converting non-technical CEO fever dreams and ambitions into strict technical specifications and prompts.
What this means for us, our careers, or Anthropic's marketing department, I cannot say.
catlifeonmars|1 month ago
ericmcer|1 month ago
Beyond that, it is incredibly biased towards existing code and prompt content. If you wanted to build a voice chat app and asked "should I use websockets or http?", it would say WebSockets. It won't override you and say "use neither, you should use WebRTC", but an experienced engineer would instantly spot that the prompt itself is flawed. LLMs just bias towards existing tokens in the prompt and won't surface information that would challenge the question itself.
smt88|1 month ago
twelvedogs|1 month ago
mvkel|1 month ago
I think this is part of it.
When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.
If you put those choices aside and simply say: does it accomplish the goal per the spec (and is safe and scalable[0]), then you can get away with a lot more without the end user ever having a clue.
Sure, there's the argument for maintainability, and vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC. But it used to be 2,000 LOC just a couple of years ago. Temporary problem.
[0] insisting that something be scalable isn't even necessary imo
newsoftheday|1 month ago
It feels like you're diminishing the parent commenter's views, reducing it to the perspective of style. Their comment didn't mention style.
matsemann|1 month ago
Except the fact that the idioms and patterns used means that I can jump in and understand any part of the codebase, as I know it will be wired up and work the same as any other part.
eru|1 month ago
Morphing an already decent PR into a different coding style is actually something that LLMs should excel at.
AlexandrB|1 month ago
[1] https://cs61a.org/articles/composition/
rezonant|1 month ago
saltyoutburst|1 month ago
upcoming-sesame|1 month ago
lopatin|1 month ago
For personal projects on the other hand, it has sped me up what, 10x? 30x? It's not measurable. My output has been so much greater than what would have been possible earlier that there is no benchmark, because projects of this scope wouldn't have been getting completed in the first place.
Back to using at work: I think it's a skill issue. Both on my end and yours. We haven't found a way to encode our domain knowledge into AI and transcend into orchestrators of that AI.
nikita2206|1 month ago
How do new hires onboard? Do you spend days of your own time guiding them in person, do they just figure things out on their own after a few quarters of working on small tickets, or are things documented? Basically AI, when working on a codebase, has the same level of context that a new hire would have, so if you want them to get started faster then provide them with ample documentation.
antirez|1 month ago
Also: in my experience, neither 1 nor 2 is needed for you to get bad results. The existing code base is a fundamental variable: the more complex/convoluted it is, the worse the result. Also, in my experience, LLMs are consistently better at producing C code than anything else (Python included).
I have the feeling that the simplicity of the code bases I produced over the years, which I now modify with LLMs, plus the fact that they are mostly in C, is a big factor in why LLMs appear to work so well for me.
Another thing: Opus 4.5 for me is bad on the web, compared to Gemini 3 Pro / GPT 5.2, and very good if used with Claude Code, since it needs to iterate to reach the solution, while the others are sometimes better first-shotters. If you generate code via the web interface, this could be another cause.
There are tons of variables.
dspillett|1 month ago
This is one of my problems with the whole thing, at least from a programming PoV. Even though superficially it seems like the ST:TNG approach of using an intelligent but not aware computer as a tool to collaboratively solve a problem, it is really more like guiding a junior through something complex. Guiding a junior (or even some future AGI) that way is definitely a good thing: if I am a good guide they will learn from the experience, so it becomes a useful knowledge-sharing process. That isn't a factor for an LLM (at least not the current generations). But if I understand the issue well enough to be a good guide, and there is no teaching benefit external to me, I'd rather do it myself and at most use the LLM as a glorified search engine to help muddle through bad documentation for hidden details.
That, and TBH I got into techie things because I like tinkering with the details. If I'd thought I would enjoy guiding others through doing the actual job, I wouldn't have resisted becoming a manager all these years!
embedding-shape|1 month ago
I think this is the wrong approach; having "wrong code" in the context already makes every response after it worse.
Instead, try restarting, but this time specify exactly how you expected that 70% of the code to have worked from the get-go. Often, LLMs seem to make choices because they have to; if you think they made the wrong choice, you can often find that you didn't actually specify something well enough, so the LLM had to do something, since apparently the single most important thing for them is that they finish, no matter how right or wrong.
After a while, you'll get better at knowing what you have to be precise, specific and "extra verbose" about, compared to other things. That also seems to depend on the model: with Gemini you can include five variations of "Don't add any comments" and it does anyway, but say it once to the GPT/Claude family of models and they seem to get it at once.
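As a toy illustration of the difference in specificity (the feature, class names, and constraints here are entirely made up):

```text
Vague:   "Add caching to the product endpoint."

Precise: "Add an in-memory TTL cache (60s) in front of getProductById only.
          Use the existing CacheService wrapper. Don't introduce a new
          dependency, don't touch other endpoints, and don't add comments.
          Invalidate the entry in updateProduct."
```

The vague version leaves the model to guess at every one of those decisions, and it will guess something.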
vignesh37|1 month ago
simonw|1 month ago
If you buy a table saw and can't figure out how to cut a straight line in a piece of wood with it - or keep cutting your fingers off - but didn't take any time at all to learn how to use it, that's on you.
Likewise a car, you have to take lessons and a test before you can use those!
Why should LLMs be any different?
AuryGlenz|1 month ago
Not coding related, but my wife is certainly better at this than most, and yet I've had to reprompt certain questions she's asked ChatGPT because she gave it inadequate context. People are awful at that. Us coders are probably better off than most, but just as with human communication, if you're not explaining things correctly you're going to get garbage back.
tomjen3|1 month ago
khafra|1 month ago
A coding agent just beat every human in the AtCoder Heuristic optimization contest. It also beat the solution that the production team for the contest put together. https://sakana.ai/ahc058/
It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.
tete|1 month ago
This is an optimization space that was being automated before LLMs. Big surprise: machines are still better at it.
This feels a bit like comparing programming teams to automated fuzzing.
In fact, developing algorithms has not infrequently involved some kind of automated testing in which the algorithm is permuted automatically.
It's also a bit like how OCR and a couple of other fields (protein folding) are better done in an automated manner.
The fact that this is now done by an LLM, another machine, isn't exactly surprising. Nobody claims that computers aren't good at these kinds of tasks.
fmbb|1 month ago
Optimization is a very simple problem though.
Maintaining a random CRUD app from some startup is harder work.
PunchyHamster|1 month ago
tripzilch|1 month ago
> AHC058, held on December 14, 2025, was conducted over a 4-hour competition window. The problem involved a setting where participants could produce machines with hierarchical relationships, such as multiple types of “apple-producing machines” and “machines that build those machines.” The objective was to construct an efficient production planning algorithm by determining which types and hierarchies of machines to upgrade and in what specific order.
... so not a CRUD app but it beat humans at Cookie Clicker? :-)
sublinear|1 month ago
So many people hyping AI are only thinking about new projects and don't even distinguish between what is a product and what is a service.
Most software devs employed today work on maintaining services that have a ton of deliberate decisions baked in that were decided outside of that codebase and driven by business needs.
They are not building shiny new products. That's why most of the positive hype about AI doesn't make sense when you're actually at work and not just playing around with personal projects or startup POCs.
torginus|1 month ago
Experienced coders that I follow, who do use AI tend to focus on tight and fast feedback loops, and precise edits (or maybe exploratory coding) rather than agentic fire-and-forget workflows.
Also, an interesting side note: I expected the programmers I think of as highly skilled, whom I know personally, to reject AI out of personal pride. That has not been the case. However, two criticisms I've heard consistently from this crowd (besides the thing I mentioned before) were:
- AI makes hosting and participating in coding competitions impossible, and denies them brain-teasers and a way to hone their skills.
- A lot of them are concerned about the ethics of training on large codebases - and consider AI plagiarism as much of an issue as artists do.
PunchyHamster|1 month ago
Like, yes, prompting is a skill and you need to learn it for AI to do something useful. But usefulness quickly falls off a cliff once you go past "greenfield implementation", "basically example code", or "the thing done so often that the AI has a lot of reference to pull from"; beyond that it quickly gets into a kinda-sorta-but-not-really-working state.
It can still be used effectively on smaller parts of the codebase (I used it a lot to generate the boilerplate to run tests, even if I had to rewrite a bunch of the actual tests), but as a whole it is very, very overrated by the AI peddlers.
That probably stems from the fact that, for the clueless, it looks like an amazing productivity boost, because they go from "not even knowing the framework" to "somewhat working app".
nicce|1 month ago
saxenaabhi|1 month ago
Madmallard|1 month ago
Gee I wonder why
Balinares|1 month ago
For what it's worth, multiple times in my career, I've worked at shops that once thought they could do it quick and cheap and it would be good enough, and then had to hire someone 'picky' like me to sort out the inevitable money-losing mess.
From what I've seen even Opus 4.5 spit out, the 'picky' are going to remain in demand for a little while longer still. Will that last? No clue. We'll see.
cpursley|1 month ago
AlexCoventry|1 month ago
I'm happy enough for it to automate away the trivial coding tasks. That's an immense force multiplier in its own right.
KronisLV|1 month ago
Doesn't match my experience; that figure is closer to 20-40% for me, and a lot of the changes I want can be had by further prompting, OR by turning to a different model, or by adding automated checks that promptly fail so the AI can do a few more loops of fixes.
> Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.
This is also likely. Or you are doing stuff that is less well represented in the training data, or working on novel things where the output isn't as good. But I'm leaning towards people just being picky about what they view as "good code" (or underspecifying how the AI is supposed to produce it), at least since roughly Sonnet 4, since with some people I work with it's just endless and often meaningless discussion and bikeshedding in code review.
You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."
9dev|1 month ago
…and then add that to your CLAUDE.md, and never worry about having to say it again manually.
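For instance, a CLAUDE.md entry along these lines (the paths and conventions are invented placeholders) means you only state it once:

```markdown
## Code style

- Follow the patterns in `src/billing/` (service + repository split,
  constructor injection) for any new feature code.
- No inline comments unless the logic is genuinely non-obvious.
- New endpoints get a test next to the existing ones in
  `src/billing/__tests__/`.
```

Claude Code reads this file at the start of every session, so the "Good Code™" examples ride along for free.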
jstummbillig|1 month ago
If that number has not significantly changed since GPT 3.5, I think it's safe to assume that something very weird is happening on your end.
dns_snek|1 month ago
willtemperley|1 month ago
I think that mentally estimating the problem space helps. These things are probabilistic models, and if there are a million solutions, the chance of getting the right one is clearly low.
Feeding back results from tests really helps too.
virgildotcodes|1 month ago
I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.
I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.
FloorEgg|1 month ago
- it becomes brittle and rigid (can't change it, can't add to it)
- it becomes buggy and impossible to fix one bug without creating another
- it becomes harder to tell what it's doing
- plus it can be inefficient / slow / insecure, etc.
The problem with your analogy is that toasters are quite simple. The better example would be your computer, and if you want your computer to just run your programs and not break, then these things matter.
zbentley|1 month ago
A consumer or junior engineer cares whether the toaster toasts the bread and doesn’t break.
Someone who cares about their craft also cares about:
- If I turn the toaster on and leave, can it burn my house down, or just set off the smoke alarm?
- Can it toast more than sliced uniform-thickness bread?
- What if I stick a fork in the toaster? What happens if I drop it in the bathtub while on? Have I made the risks of doing that clear in such a way that my company cannot be sued into oblivion when someone inevitably electrocutes themselves?
- Does it work sideways?
- When it fills up with crumbs after a few months of use, is it obvious (without knowing that this needs to be done or reading the manual) that this should be addressed, and how?
- When should the toaster be replaced? After a certain amount of time? When a certain misbehavior starts happening?
Those aren’t contrived questions in service to a tortured metaphor. They’re things that I would expect every company selling toasters to have dedicated extensive expertise to answering.
PunchyHamster|1 month ago
Then you haven't been a senior dev long enough.
We want code that is good enough because we will have to maintain it for years (or inherit maintaining it from someone else); we want it clean enough that adding new features isn't a pain, and architected well enough that it doesn't need a major rewrite to do so.
Of course, if the code is throwaway, that doesn't matter. But if you're making a long-term product, writing shit code now is taking on debt you will have to pay off.
That is not to say "don't use AI for that"; it is to say "actually go through the AI's code and review whether it is done well enough". But many AI-first developers just ship the first thing that compiles or passes tests, without looking.
> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.
...well if you want it to not break (and still be cheap) you have to put quite a bit of engineering into it.
aperrien|1 month ago
totallykvothe|1 month ago
rbbydotdev|1 month ago
I think this touches on the root of the issue. I am seeing results-over-process winning. Code quality will decline. Out-of-touch or apathetic project management that prioritizes results is now even more emboldened to accept tech-debt-riddled code.
alexsmirnov|1 month ago
simonw|1 month ago
If it does indeed start costing more than it's paying out, step away.
f1shy|1 month ago
To anybody who wants to try, a concrete example that I have tested in all available LLMs:
Write a prompt to get a Common Lisp application which draws a "hello triangle" in OpenGL, without using SDL or any framework, only OpenGL and GLFW bindings.
None of the replies even compiled. I kept asking, at least 5 times, with error feedback, to see if the AI could do it. It didn't work. Never.
The best I got was from Gemini: code where I had to change about 10 lines, and absolutely no trivial changes; they required familiarity with OpenGL and Lisp. After making the changes I asked it what it thought of them, and it replied that I was wrong, and that with those changes it would never work.
If anybody can write a prompt that gets me that, please let me know...
Philpax|1 month ago
Using Claude Code, I was able to successfully produce the Hello Triangle you asked for (note that I have never used CL before): https://github.com/philpax/hello-triangle-cl
For reference, here is the transcript of the entire interaction I had with CC (produced with simonw's excellent claude-code-transcripts): https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...
Edit: To better contextualise what it's doing, the detailed transcript page may be useful: https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...
james_a_craig|1 month ago
This worked in codex-cli, albeit it took three rounds of passing back the errors. https://gist.github.com/jamesacraig/9ae0e5ed8ebae3e7fe157f67... has the resulting code.
Aeolun|1 month ago
Most of the architectural failures come from it still not having the whole codebase in mind when changing stuff.
trueno|1 month ago
The funny thing is that reviewing stuff Claude has made isn't actually unfamiliar to me in the slightest. It's something I've been intimately familiar with for many years, long before this AI stuff blew up...
...it's what the code I've reviewed/maintained/rejected looks like when a consulting company was brought on board to build something: the kind of company that leverages probably underpaid and overworked laborers, both overseas and US-based workers on visas. The delivered documentation/code is noisy and disjointed.
locknitpicker|1 month ago
You might want to review how you approach these tools. Complaining that you need to rewrite 70% of the code screams of poor prompting: too-vague inputs, no constraints, and no feedback at all.
Using agents to help you write code is far from a one-shot task, but throwing out 70% of what you create screams that you are prompting the agent to create crap.
> 1) I'm not good at prompting, even though I am one of the earliest AI in coding adopters I know, and have been consistent for years. So I find this hard to accept.
I think you need to take a humble pill, review how you are putting together these prompts, figure out what you are doing wrong in prompts and processes, and work up from where you are at this point. If 70% of your output is crap, the problem is in your input.
I recommend you spend 20 minutes with your agent of choice prompting it to help you improve your prompts. Check instruction files, spec-driven approaches, context files, etc. Even a plain old README.md helps a lot; prompt your agent to generate it for you. From there, instead of one-shot prompts, try to break a task down into multiple sub-steps with small deliverables. Always iterate on your instruction files. If you spend a few minutes on this, you will quickly halve your churn rate.
prezk|1 month ago
FloorEgg|1 month ago
When things really broke open for me was when I adopted windsurf with Opus 4, and then again with Opus 4.5. I think the way the IDE manages the context and breaks down tasks helps extend llm usefulness a lot, but I haven't tried cursor and haven't really tried to get good at Claude code.
All that said, I have a lot of experience writing in business contexts and I think when I really try I am a pretty good communicator. I find when I am sloppy with prompts I leave a lot more to chance and more often I don't get what I want, but when I'm clear and precise I get what I want. E.g. if it's using sloppy patterns and making bad architectural choices, I've found that I can avoid that by explaining more about what I want and why I want it, or just being explicit about those decisions.
Also, I'm working on smaller projects with less legacy code.
So in summary, it might be a combination of 1, 2 and the age/complexity of the project you're working on.
h14h|1 month ago
My biggest successes have come when I take a TDD approach. First I carve a subset of my work out into a module with an API that can be easily tested, then I collaborate with the agent on writing correct test cases, and finally I tell it to implement the module such that the test cases pass without any lint or typing errors.
It forces me to spend much more time thinking about use cases, project architecture, and test coverage than about nitty-gritty implementation details. I can imagine that in a system that evolved over time without a clear testing strategy, AI would struggle mightily to be even barely useful.
Not saying this applies to your system, but I've definitely worked on systems in the past that fit the "big ball of mud" description pretty neatly, and I have zero clue how I'd have been able to make effective use of these AI tools.
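As a minimal sketch of the shape of that loop (the `slugify` module and its cases are an invented example; in practice the agent writes the implementation only after we've agreed on the cases):

```python
import re

# Step 2 of the workflow: implementation the agent is asked to write
# so that the agreed-upon cases below pass. (`slugify` is hypothetical.)
def slugify(title: str) -> str:
    """Turn a title into a URL slug: lowercase, dashes, no punctuation."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Step 1 of the workflow: the test cases, agreed on before any
# implementation exists, pinning down the behavior we actually want.
assert slugify("Hello, World!") == "hello-world"
assert slugify("  Spaces  everywhere ") == "spaces-everywhere"
assert slugify("Already-Slugged") == "already-slugged"
```

The point isn't the function; it's that the cases exist first, so "done" is mechanically checkable instead of vibes.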
hansvm|1 month ago
3) Not everyone codes the same things
4) It's easy to get too excited about the tech and ignore its failure modes when describing your experiences later
I use AI a lot. With your own control plane (as opposed to a generic Claude Code or whatever) you can fully automate a lot more things. It's still fundamentally incapable of doing tons of tasks at any acceptable quality level, though, and I strongly suspect all of (2, 3, 4) are driving the disconnect you're seeing.
Take the two things I've been working on this morning as an example.
One was a one-off query. I told it the databases it should consider, a few relevant files, roughly how that part of the business works, and asked it to come back when it finished. When it was done I had it patch up the output format. It two-shot (with a lot of helpful context) something that would have taken me an hour or more.
Another is more R&D-heavy. It pointed me to a new subroutine I needed (it couldn't implement it correctly though) and is otherwise largely useless. It's actively harmful to have it try to do any of the work.
It's possible that (1) matters more than you suspect too. AI has certain coding patterns it likes to use a lot which won't work in my codebase. Moreover, it can't one-shot the things I want. It can, however, follow a generic step-by-step guide for generating those better ideas, translating worse ideas into things that will be close enough to what I need, identifying where it messed up, and refactoring into something suitable, especially if you take care to keep context usage low and whatnot. A lot of people seem to be able to get away with CLAUDE.md or whatever, but I like having more granular control of what the thing is going to be doing.
Exoristos|1 month ago
growt|1 month ago
lmeyerov|1 month ago
A bummer is that we have a genai team (louie.ai) and a gpu/viz/graph analytics team (graphistry), and those who have spent the last 2-3 years doing genai daily have a higher uptake rate here than those who haven't. I wouldn't say team 1 is better than team 2 in general: these are tools, and different people have different engineering skill and AI coding skill, including different amounts of time spent on both.
What was a revelation for me personally was taking 1-2 months, early in Claude Code's release, to go full cold turkey on manual coding, similar to getting immersed in a foreign language. That forced me to eliminate a lot of bad habits wrt effective AI coding, both personally and in the state of our repo tooling. Since then, it's been steady work to accelerate and smooth that loop, e.g. moving from vibe coding/engineering to more eval-driven AI coding loops: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . That takes a LOT of buildout.
zmj|1 month ago
To put that another way: one-shot attempts aren't where the win is in big codebases. Repeat iteration is, as long as your tooling steers it in the right direction.
dspillett|1 month ago
I assume this is part of the problem (though I've avoided using LLMs mostly so can't comment with any true confidence here) but to a large extent this is blaming you for a suboptimal interface when the interface is the problem.
That some people seem to get much better results than others, and that the distinction does not map well to differences in ability elsewhere, suggests to me that the issue is people thinking slightly differently and the training data for the models somehow being biased to those who operate in certain ways.
> 2) Other people are just less picky than I am
That is almost certainly a much larger part of the problem. “Fuck it, it'll do, someone else can tidy it later if they are bothered enough” attitudes were rampant long before people started outsourcing work to LLMs.
furyofantares|1 month ago
It won't be worth it the first few times you try this, and you may not get it to where you want it. I think you might be pickier than others and you might be giving it harder problems, but I also bet you could get better results out of the box after you do this with a few problems.
parliament32|1 month ago
I know the usual clap back is "you're just missing this magical workflow" or "you need to prompt better" but.. do I really need to prompt "make sure your syntax is correct"? Shouldn't that be, ya know, a given for a prompt that starts with "Help me put together a PromQL query that..."?
simonw|1 month ago
If you find yourself having to copy and paste errors back and forward you need to upgrade to a coding agent harness like Claude Code so the LLM can try things out and then fix the errors on its own.
If you're not willing to do that you can also fix this by preparing a text file with a few examples of correctly formatted queries and pasting that in at the start of your session, or putting it in a skill markdown file.
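The automated error-feedback loop described above can be sketched roughly like this (everything here is illustrative: `ask_llm_to_fix` is a stand-in for a real model call, with its one "fix" hard-coded so the sketch stays self-contained):

```python
import subprocess
import sys
import tempfile

def run_snippet(code: str) -> tuple[bool, str]:
    """Run a Python snippet in a subprocess; return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def ask_llm_to_fix(code: str, error: str) -> str:
    """Stand-in for a real model call; it 'knows' exactly one fix so the
    sketch stays runnable without any API access."""
    return code.replace("pront", "print")

code = "pront('hello')"  # deliberately broken snippet
ok = False
for _ in range(3):  # bounded retries, like an agent harness
    ok, err = run_snippet(code)
    if ok:
        break
    code = ask_llm_to_fix(code, err)  # feed the error back automatically
```

The point is only that the copy-paste-the-error step is mechanical, so a harness can do it for you in a bounded loop.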
mark_l_watson|1 month ago
With very good tooling (e.g., Google Antigravity, Claude Code, OpenAI's Codex, and several open platforms), no concern about your monthly API and subscription costs, very long-running trial and error, and tools for testing code changes, some degree of real autonomy is possible.
But, do we want to work like this? I don’t.
I feel very good about using strong AI for research and learning new things (self improvement) and I also feel good about using strong AI as a ‘minor partner’ in coding.
Closi|1 month ago
Code quality is something you have to let go of with vibe coding; if code quality is important to you or your project, then vibe coding isn't the right fit. But if you abandon that concept and build things small enough or modular enough, then speed gains await!
IMO codebases can be architected for LLMs to work better in them, but this is harder in brownfield apps.
ahtihn|1 month ago
Greenfield is fundamentally easier than maintaining existing software. Once software exists, users expect it to behave a certain way and they expect their data to remain usable in new versions.
The existing software now imposes all sorts of constraints that may not be explicit in the spec. Some of these constraints end up making some changes very hard. Bad assumptions in data modeling can make migrations a nightmare.
You can't just write entirely new software every time the requirements change.
littlestymaar|1 month ago
Given how consistently terrible the code of Claude Code-d projects posted here has been, I think this is it.
I find LLMs pretty useful for coding, for multiple things (to write boilerplate, as an idiomatic design pattern search engine, as a rubber duck, helping me name things, explaining unclear error messages, etc.), but I find the grandiose claims a bit ridiculous.
eloisant|1 month ago
It's not just about better prompting, but using better tools. Tools that will turn a bad prompt into a good prompt.
For example there is the plan mode for Cursor. Or just ask the AI: "make a plan to do this task", then you review the plan before asking it to implement. Configure the AI to ask you clarification questions instead of assuming things.
It's still evolving pretty quickly, so it's worth staying up to date with that.
xmodem|1 month ago
One project I tried out recently I took a test-driven approach. I built out the test suite while asking the AI to do the actual implementation. This was one of my more successful attempts, and may have saved me 20-30% time overall - but I still had to throw out 80% of what it built because the agent just refused to implement the architecture I was describing.
It's at its most useful if I'm trying to bootstrap something new on a stack I barely know, OR if I decide I just don't care about the quality of the output.
I have tried different CLI tools, IDE tools. Overall I've had the best success with Claude Code but I'm open to trying new things.
Do you have any good resources you would recommend for getting LLMs to perform better, or staying up to date on the field in general?
davidguetta|1 month ago
Then you review it and in general have to ask it to remove some stuff. And then it's good enough. You have to accept not nitpicking some parts (like random functions being generated) as long as your test suite passes, otherwise of course you will end up rewriting everything.
It also depends on your setting; some areas (web vs AI vs robotics) can be more suited than others.
nvarsj|1 month ago
I work on a giant legacy codebase at big tech, which is one piece of many distributed systems. LLMs are helpful for localised, well-defined work, but nowhere close to what TFA describes.
jryle70|1 month ago
https://github.com/antirez?tab=overview&from=2026-01-01&to=2...
newsoftheday|1 month ago
I have, it becomes a race to the bottom.
ninininino|1 month ago
Ask the same question of Golang, or Rust, or Typescript.
I have a theory that the large dichotomy in how people experience AI coding has to do with the quality of the training corpus for each language online.
onlyrealcuzzo|1 month ago
But it is astounding how terrible they are at debugging non-trivial assembly in my experience.
Anyone else have input here?
Am I in a weird bubble? Or is this just not their forte?
It's truly incredible how thoughtless they can be, so I think I'm in a bubble.
selestify|1 month ago
Oh god I thought I was the only one. Do you find yourself getting mad at them too?
smj-edison|1 month ago
Even when refactoring, it would change all my comments, which is really annoying, as I put a lot of thought into my comments. Plus, the time it took to do each refactoring step was about how long it would take me, and when I do it I get the additional benefit of feeling when I'm repeating code too often.
So, I'm not using it for now, except for isolating bugs. It's addicting having it work on it for me, but I end up feeling disconnected and then something inevitably goes wrong.
intended|1 month ago
It seems that there are more people writing and finishing projects, but not many have reached the point where they have to maintain their code / deal with the tech debt.
eru|1 month ago
Have another model review the code, and use that review as automatic feedback?
robertfw|1 month ago
The outcome is usually a new or tweaked skill file.
It doesn't always fix the problem, but it's definitely been making some great improvements.
redox99|1 month ago
1) Have an AGENTS.md that describes not just the project structure, but also the product and business (what does it do, who is it for, etc). People expect LLMs to read a snippet of code and be as good as an employee who has implicit understanding of the whole business. You must give it all that information. Tell it to use good practices (DRY, KISS, etc). Add patterns it should use or avoid as you go.
2) It must have source access to anything it interacts with. Use Monorepo, Workspaces, etc.
3) Most important of all, everything must be set up so the agent can iterate, test, and validate its changes. It will make mistakes all the time, just like a human does (even basic syntax errors), but it will iterate and end up at a good solution. It's incorrect to assume it will make perfect code blindly without building, linting, testing, and iterating on it. No human would either. The LLM should be able to determine if a task was completed successfully or not.
4) It is not expected to always one-shot perfect code. If you value quality, you will glance at it, and sometimes have to reply to make it this other way, extract this, refactor that. Having said that, you shouldn't need to write a single line of code (I haven't for months).
Using LLMs correctly allows you to complete tasks in minutes that would take hours, days, or even weeks, with higher quality and fewer errors.
Use Opus 4.5 with other LLMs as a fallback when Opus is being dumb.
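For point (1), a skeletal AGENTS.md might look something like this (the product, paths, and commands are invented for illustration; adapt to your own repo):

```markdown
# AGENTS.md

## Product
Invoicing app for small accounting firms. Users are accountants,
not developers; data correctness matters more than speed.

## Architecture
- `api/` — Java Spring services (two databases: `billing`, `audit`)
- `web/` — React frontend

## Conventions
- Prefer small, DRY changes; follow existing patterns in the touched module.
- Never modify generated files under `api/gen/`.

## Validation
- Build: `./gradlew build` — must pass before finishing a task.
- Tests: `./gradlew test`; add a failing test before fixing any bug.
```

The business-context section is the part most people leave out; the build/test commands are what let the agent close its own loop per point (3).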
matwood|1 month ago
This was the biggest unlock for me. When I received a bug report I have the LLM tell me where it thinks the source of the bug is located, write a test that triggers the bug/fails, design a fix, finally implement the fix and repeat. I'm routinely surprised how good it is at doing this, and the speed with which it works. So even if I have to manually tweak a few things, I've moved much faster than without the LLM.
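That reproduce-first loop can be shown with a toy example (the function and the bug are invented for illustration):

```python
def parse_price_buggy(text: str) -> float:
    # Original code: fails on "$1,234.50" because of the comma.
    return float(text.lstrip("$"))

def parse_price_fixed(text: str) -> float:
    # Fix implemented only after the failing test pinned the behavior down.
    return float(text.lstrip("$").replace(",", ""))

def reproduces_bug(fn) -> bool:
    """Step 1: a check that encodes the bug report and fails on the buggy code."""
    try:
        return fn("$1,234.50") == 1234.50
    except ValueError:
        return False

assert not reproduces_bug(parse_price_buggy)  # bug confirmed by a failing test
assert reproduces_bug(parse_price_fixed)      # fix verified by the same test
```

The failing test is what keeps the LLM honest: the fix isn't done until the exact reported behavior passes.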
Madmallard|1 month ago
Writing logic that verifies something complex requires basically solving the problem entirely already.
teunispeters|1 month ago
My vote is with (2).
aurizon|1 month ago
perrygeo|1 month ago
IMO it has nothing to do with LLMs. They just mirror the patterns they see - don't get upset when you don't like your own reflection! Software complexity is still bad. LLMs just shove it back in our face.
Implications: AI is always going to feel more effective on brand new codebases without any legacy weight. And less effective on "real" apps where the details matter.
The bias is strongly evident - you rarely hear anyone talking about how they vibe coded a coherent changeset to an existing repo.
otabdeveloper4|1 month ago
It's just the Dunning-Kruger effect. People who think AI is the bee's knees are precisely the dudes who are least qualified to judge its effectiveness.
egorfine|1 month ago
Perhaps one has to be a skilled programmer in the first place to spot the problems, which is not easy when the program apparently runs.
Things like mocked tests, you know. Who would care about that.
CuriouslyC|1 month ago
Think about it like compiler output. Literally nobody cares if that is well formatted. They just care that they can get fairly performant code without having to write assembly. People still dip to assembly (very very infrequently now) for really fine performance optimizations, but people used to write large programs in it (miserably).
__float|1 month ago
These are all important when you consider the long-term viability of a change. If you're working in a greenfield project where requirements are constantly changing and you plan on throwing this away in 3 months, maybe it works out fine. But not everyone is doing that, and I'd estimate most professional SWEs are not doing that, even!
luckilydiscrete|1 month ago
Instead I recommend that you use LLMs to fix the problems that they introduced as well, and over time you'll get better at figuring out the parts that the LLM will get confused by. My hunch is that you'll find your descriptions of what to implement were more vague than you thought, and as you iterate, you'll learn to be a lot more specific. Basically, you'll find that your taste was more subjective than you thought and you'll rid yourself of the expectation that the LLM magically understands your taste.