Ask HN: Do you have any evidence that agentic coding works?
461 points| terabytest | 1 month ago
Is there real evidence, beyond hype, that agentic coding produces net-positive results? If any of you have actually got it to work, could you share (in detail) how you did it?
By "getting it to work" I mean: * creating more value than technical debt, and * producing code that’s structurally sound enough for someone responsible for the architecture to sign off on.
Lately I’ve seen a push toward minimal or nonexistent code review, with the claim that we should move from “validating architecture” to “validating behavior.” In practice, this seems to mean: don’t look at the code; if tests and CI pass, ship it. I can’t see how this holds up long-term. My expectation is that you end up with "spaghetti" code that works on the happy path but accumulates subtle, hard-to-debug failures over time.
When I tried using Codex on my existing codebases, with or without guardrails, half of my time went into fixing the subtle mistakes it made or the duplication it introduced.
Last weekend I tried building an iOS app for pet feeding reminders from scratch. I instructed Codex to research and propose an architectural blueprint for SwiftUI first. Then, I worked with it to write a spec describing what should be implemented and how.
The first implementation pass was surprisingly good, although it had a number of bugs. Things went downhill fast, however. I spent the rest of my weekend getting Codex to make things work, fix bugs without introducing new ones, and research best practices instead of making stuff up. Although I made it record new guidelines and guardrails as I found them, things didn't improve. In the end I just gave up.
I personally can't accept shipping unreviewed code. It feels wrong. The product has to work, but the code must also be high-quality.
xsh6942|1 month ago
I've had great success coding infra (terraform). It at least 10x'd the generation of easily verifiable and tedious-to-write code. Results were audited to death as the client was highly regulated.
Professional feature dev is hit and miss for sure, although getting better and better. We're nowhere near full agentic coding. However, by reinvesting the speed gains from not writing boilerplate into devex and tests/security, I bring to life much better quality software, maintainable and a joy to work with.
I suddenly have the homelab of my dreams, all the ideas previously in the "too long to execute" category now get vibe coded while watching TV or doing other stuff.
As an old jaded engineer, everything code was getting a bit boring and repetitive (so many rest APIs). I guess you get the most value out of it when you know exactly what you want.
Most importantly though, and I've heard this from a few other seniors: I've found joy in making cool fun things with tech again. I like that new way of creating stuff at the speed of thought, and I guess for me that counts as "it works"
raphaelj|1 month ago
On some tasks like build scripts, infra and CI stuff, I am getting a significant speedup. Maybe I am 2x faster on these tasks, when measured from start to PR.
I am working on an HPC project[1] that requires more careful architectural thinking. Trying to let the LLM do the whole task most often fails, or produces low-quality code (even with top models like Opus 4.5).
What works well though is "assisted" coding. I am usually writing the interface code (e.g. headers in C++) with some help from the agent, and then let the LLM do the actual implementation of these functions/methods. Then I do final adjustments. Writing a good AGENTS.md helps a lot. I might be 30% faster on these tasks.
It seems to match what I see from the PRs I am reviewing: we are getting these slightly more often than before.
---
[1] https://github.com/finos/opengris-scaler
BrandoElFollito|1 month ago
Oh yes. I have been developing as an amateur for 35 years, and when I vibe code I let the basic, generic stuff happen and then tell the AI to refactor the way I want. It usually works.
I had the same "too boring to code" approach and AI was a revelation. It takes off the typing but allows, when used correctly, for the creative part. I love this.
theshrike79|1 month ago
This is the true game changer.
I have a large-ish NAS that's not very well organised (I'm trying, it's a consolidated mess of different sources from two decades - at least they're all in the same place now)
It was faster to ask Claude to write me a search database backend + frontend than try to click through the directories and wait for the slow SMB shares to update to find where that one file was I knew was in there.
Now I have a Go backend that crawls my NAS every night, indexes files to a FTS5 sqlite database with minimal metadata (size + mimetype + mtime/ctime) and a simple web frontend I can use to query the database
...actually I kinda want a cli search tool that uses the same schema. Brb.
Done.
AI might be a bubble etc. but I'll still have that search tool (and two dozen other utilities) in 5 years when the Claude monthly subscription is 2000€ and a right to harvest your organs on non-payment.
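For anyone wondering what an index like that can look like, here is a minimal sketch in Python rather than the Go backend described above. The table name, column layout, and sample rows are assumptions made up for illustration, and it relies on the local SQLite build having FTS5 compiled in.

```python
import sqlite3


def build_index(conn, files):
    """Create an FTS5 table and index (path, size, mimetype, mtime) rows."""
    # Only `path` is tokenized for search. The metadata columns are stored
    # but marked UNINDEXED so they stay out of the full-text index.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS files "
        "USING fts5(path, size UNINDEXED, mimetype UNINDEXED, mtime UNINDEXED)"
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", files)
    conn.commit()


def search(conn, query):
    """Return matching paths, best match first."""
    rows = conn.execute(
        "SELECT path FROM files WHERE files MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]


# Hypothetical sample rows standing in for a nightly crawl of the NAS.
conn = sqlite3.connect(":memory:")
build_index(conn, [
    ("/nas/photos/2009/holiday.jpg", 2048, "image/jpeg", 1250000000),
    ("/nas/docs/taxes/2019.pdf", 4096, "application/pdf", 1580000000),
])
print(search(conn, "taxes"))
```

A CLI search tool sharing the schema would only need the `search` function pointed at the same database file.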
resonious|1 month ago
Other things that seem to contribute to success with agents are:
- Static type systems (not tacked-on like Typescript)
- A test suite where the tests cover large swaths of code (i.e. not just unit testing individual functions; you want e2e-style tests, but not the flaky browser kind)
With all the above boxes ticked, I can get away with only doing "sampled" reviews. I.e. I don't review every single change, but I do review some of them. And if I find anything weird that I had missed from a previous change, I tell it to fix it and give the fix a full review. For architectural changes, I plan the change myself, start working on it, then tell the agent to finish.
BatteryMountain|1 month ago
tcgv|1 month ago
That's my experience too. Agent coding works really well for existing codebases that are well-structured and organized. If your codebase is mostly spaghetti—without clear boundaries and no clear architecture in place—then agents won't be of much help. They'll also suffer working in those codebases and produce mediocre results.
Regarding building apps and systems from scratch with agents, I also find it more challenging. You can make it work, but you'll have to provide much more "spec" to the agent to get a good result (and "good" here is subjective). Agents excel at tasks with a narrower scope and clear objectives.
The best use case for coding agents is tasks that you'd be comfortable coding yourself, where you can write clear instructions about what you expect, and you can review the result (and even make minor adjustments if necessary before shipping it). This is where I see clear efficiency gains.
nl|1 month ago
I'm slowly accepting that Python's optional typing is a mistake with AI agents, and with human coders too. It's too easy for a type to be wrong and if someone doesn't have typechecking turned on that mistake propagates.
defatigable|1 month ago
I've implemented several medium-scale projects that I anticipate would have taken 1-2 weeks manually, and took a day or so using agentic tools.
A few very concrete advantages I've found:
* I can spin up several agents in parallel and cycle between them. Reviewing the output of one while the others crank away.
* It's greatly improved my ability in languages I'm not expert in. For example, I wrote a Chrome extension which I've maintained for a decade or so. I'm quite weak in Javascript. I pointed Antigravity at it and gave it a very open-ended prompt (basically, "improve this extension") and in about five minutes it vastly improved the quality of the extension (better UI, performance, removed dependencies). The improvements may have been easy for someone expert in JS, but I'm not.
Here's the approach I follow that works pretty well:
1. Tell the agent your spec, as clearly as possible. Tell the agent to analyze the code and make a plan based on your spec. Tell the agent to not make any changes without consulting you.
2. Iterate on the plan with the agent until you think it's a good idea.
3. Have the agent implement your plan step by step. Tell the agent to pause and get your input between each step.
4. Between each step, look at what the agent did and tell it to make any corrections or modifications to the plan you notice. (I find that it helps to remind them what the overall plan is because sometimes they forget...).
5. Once the code is completed (or even between each step), I like to run a code-cleanup subagent that maintains the logic but improves style (factors out magic constants, helper functions, etc.)
This works quite well for me. Since these are text-based interfaces, I find that clarity of prose makes a big difference. Being very careful and explicit about the spec you provide to the agent is crucial.
marcus_holmes|1 month ago
I've been a professional software developer for >30 years, and this is the biggest revolution I've seen in the industry. It is going to change everything we do. There will be winners and losers, and we will make a lot of mistakes, as usual, but I'm optimistic about the outcome.
jesse__|1 month ago
A 1-week project is a medium-scale project?! That's tiny, dude. A medium project for me is like 3 months of 12h days.
monkeydust|1 month ago
sirwhinesalot|1 month ago
Make a commit.
Give Claude a task that's not particularly open ended, the closer to pure "monkey work" boilerplate nonsense the task is, the better (which is also the sort of code I don't want to deal with myself).
Preferably it should be something that only touches a file or two in the codebase unless it is a trivial refactor (like changing the same method call all over the place)
Make sure it is set to planning mode and let it come up with a plan.
Review the plan.
Let it implement the plan.
If it works, great, move on to review. I've seen it one-shot some pretty annoying tasks like porting code from one platform to another.
If there are obvious mistakes (program doesn't build, tests don't pass, etc.) then a few more iterations usually fix the issue.
If there are subtle mistakes, make a branch and have it try again. If it fails, then this is beyond what it can do, abort the branch and solve the issue myself.
Review and cleanup the code it wrote, it's usually a lot messier than it needs to be. This also allows me to take ownership of the code. I now know what it does and how it works.
I don't bother giving it guidelines or guardrails or anything of the sort, it can't follow them reliably. Even something as simple as "This project uses CMake, build it like this" was repeatedly ignored as it kept trying to invoke the makefile directly and in the wrong folder.
This doesn't save me all that much time since the review and cleanup can take long, but it serves as a great unblocker.
I also use it as a rubber duck that can talk back and documentation source. It's pretty good for that.
This idea of having an army of agents all working together on the codebase is hilarious to me. Replace "agents" with "juniors I hired on fiverr with anterograde amnesia" and it's about how well it goes.
dwd|1 month ago
My personal use is very much one function at a time. I know what I need something to do, so I get it to write the function which I then piece together.
It can even come back with alternatives I may not have considered.
I might give it some context, but I'm mainly offloading a bunch of typing. I usually debug and fix its code myself rather than trying to get it to do better.
crq-yml|1 month ago
I get the sense that the application of armies of agents is actually a scaled-up Lisp curse - Gas Town's entire premise is coding wizardry, the emphasis on abstract goals and values, complete with cute, impenetrable naming schemes. There's some corollary with "programs are for humans to read and computers to incidentally execute" here. Ultimately the program has to be a person addressing another person, or nature, and as such it has to evolve within the whole.
theshrike79|1 month ago
Where do you give these guardrails? In the chat or CLAUDE.md?
Basic level information like how to build and test the project belongs in CLAUDE.md, it knows to re-check that now and then.
edude03|1 month ago
Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong, fine, it happens, but worse the 30 or so tests they added added 10 minutes to the test run time and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests
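To make the failure mode concrete, here is a small sketch of a vacuous test next to one that actually exercises the code. The discount functions are hypothetical and not taken from the project in question.

```python
def correct_discount(price, percent):
    """Hypothetical function under test: reduce price by a percentage."""
    return price * (1 - percent / 100)


def buggy_discount(price, percent):
    """Subtly wrong variant: treats the percentage as an absolute amount."""
    return price - percent


def vacuous_test(fn):
    # Equivalent to expect(true).to.be(true): it never calls fn,
    # so it passes for any implementation, including a broken one.
    return True is True


def meaningful_test(fn):
    # Checks real output against a known answer, so a wrong
    # implementation makes it fail.
    return fn(200, 25) == 150


print(vacuous_test(buggy_discount))       # True
print(meaningful_test(buggy_discount))    # False
print(meaningful_test(correct_discount))  # True
```

A quick review heuristic that follows from this: a test that never calls the code under test cannot fail for the right reasons.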
monooso|1 month ago
Older, less "capable", models would fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.
Hopefully someone with a larger context window than myself can recall the article in question.
sReinwald|1 month ago
But when I use Claude code, I also supervise it somewhat closely. I don't let it go wild, and if it starts to make changes to existing tests it better have a damn good reason or it gets the hose again.
The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!
zh3|1 month ago
Here's my realtime Bluetooth heart rate monitor for linux, with text output and web interface.
This was 100% written by Claude Code, my input was limited to mostly accepting Claude suggestions except a couple of cases where I could make suggestions to speed up development (skipping some tests I knew would work). Particularly interesting because I didn't expect this to work, let alone not to write any code. Note that I limited it to pure C with limited dependencies; initial prompt was just to get text output ("Heart Rate 76bpm"), when it got to that point I told Claude to add a web interface followed by creating a realtime graph to show the interface in use.
Every file is claude generated. AMA.
edit: this was particularly interesting as it had to test against the HRM sensor I was wearing during development, and to cope with bluetooth devices appearing and disappearing all the time. It took about a day for the whole thing and cost around $25.
further edit: I am by no means an expert with Claude (haven't even got to making a claude.md file); the one real objective here was to get a working example of using dBus to talk to blueZ in C, something I've failed at (more than once) before.
embedding-shape|1 month ago
In https://github.com/lowrescoder/BlueHeart/blob/68ab2387a0c44e... for example, it doesn't actually do SSE at all, instead it queues up a complete HTTP response each time, returns once and then closes the stream, so basically a normal HTTP endpoint, "labeled" as a SSE one. SSE is mentioned a bunch of times in the docs, and the files/types/functions are labeled as such, but that doesn't seem to be what's going on internally, from what I could understand. Happy to stand corrected though!
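For comparison, this is roughly what SSE framing looks like: the server keeps one response open (served as `text/event-stream`) and writes a frame per event, each terminated by a blank line. The sketch below is Python and covers only the framing; the function names and the `heartrate` event name are assumptions for illustration, not anything from the linked repository.

```python
def format_sse(data, event=None):
    """Encode one Server-Sent Events frame as text."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    for part in str(data).split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def heart_rate_stream(readings):
    """Yield one frame per reading.

    A real server would write each frame to the same open response
    instead of sending one complete response and closing the connection.
    """
    for bpm in readings:
        yield format_sse(bpm, event="heartrate")


for frame in heart_rate_stream([76, 77]):
    print(frame, end="")
```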
kqr|1 month ago
I don't think anyone says it's not possible to get the LLM to write code. The problem OP has with them is that the code they write starts out good but then quickly devolves when the LLMs get stuck in the weird ruts they have.
fotcorn|1 month ago
I started out by letting it write a naive C version without intrinsics, and validated it against the PyTorch version.
Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.
Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.
It then tested a few different variants, ran them on the target (RISC-V SBC, OrangePI RV2) to check whether they improved runtime, and then continued from there. It did this 10 times, until it arrived at the final version.
The final code is very readable, and faster than any other library or compiler that I have found so far. I think the clear guardrails (output has to match exactly the reference output from PyTorch, performance must be better than before) makes this work very well.
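The two guardrails can be written down as a tiny acceptance gate. This is a sketch in Python under my own assumptions (the function names and the tuple layout are hypothetical), not the actual harness used in the agent loop.

```python
def accept_candidate(reference, output, best_seconds, candidate_seconds):
    """Accept a variant only if it matches the reference exactly and is faster."""
    if list(output) != list(reference):
        return False
    return candidate_seconds < best_seconds


def pick_best(reference, baseline_seconds, candidates):
    """Walk candidates in order and keep the fastest variant that passes the gate.

    `candidates` is an iterable of (name, output, seconds) tuples.
    Returns (name, seconds); name is None if nothing beat the baseline.
    """
    best_name, best_seconds = None, baseline_seconds
    for name, output, seconds in candidates:
        if accept_candidate(reference, output, best_seconds, seconds):
            best_name, best_seconds = name, seconds
    return best_name, best_seconds


reference = [1.0, 2.0]
attempts = [
    ("v1", [1.0, 2.0], 0.8),  # correct and faster than the baseline
    ("v2", [1.0, 2.1], 0.1),  # fastest, but the output is wrong
    ("v3", [1.0, 2.0], 0.5),  # correct and faster than v1
]
print(pick_best(reference, 1.0, attempts))
```

Note that the fastest attempt is rejected because its output differs from the reference, which is the whole point of the exact-match check.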
sifar|1 month ago
IIRC, Depthwise is memory bound so the bar might be lower. Perhaps you can try something with higher compute intensity like a matrix multiply. I have observed that it trips up with the columnar accesses for SIMD.
camel-cdr|1 month ago
dagss|1 month ago
The other day I gave an estimate to my co-worker and he said "but how long is it really going to take, because you always finish a lot quicker than you say, you say two weeks and then it takes two days".
The LLMs will just make me finish things a lot faster and my gut feel estimation for how long things will take still is not yet taking that into account.
(And before people talk about typing speed: No that isn't it at all. I've always been the fastest typer and fastest human developer among my close co-workers.)
Yes, I need to review the code and interact with the agent. But it's doing a lot better than a lot of developers I've worked with over the years, and if I don't like the style of the code it takes very few words and the LLM will "get it" and improve it.
Some commenters are comparing the LLM to a junior. In some sense that is right in that the work relationship may be the same as towards a (blazingly fast) junior; but the communication style and knowledge area and how few words I can use to describe something feels more like talking to a senior.
(I think it may help that for the last 10 years of my career a big part of my job was reviewing other people's code, delegating tasks, being the one who knew the code base best and helping others into it. So that means I'm used to delegating, not just coding. Recently I switched jobs and am now coding alone with AI.)
trashb|1 month ago
> "but how long is it really going to take, because you always finish a lot quicker than you say, you say two weeks and then it takes two days"
However, this statement just kinda makes your comment smell of r/thatHappened, since it is such a tremendous speed up.
Therefore I am intrigued: what kind of problems are you working on? Does it require a lot of boilerplate code or a lot of manually adjusting settings?
keybored|1 month ago
wewewedxfgdf|1 month ago
It is an assistant not a team mate.
If you think that getting it wrong, or bugs, or misunderstandings, or lost code, or misdirections, are AI "failing", then yes you will fail to understand or see the value.
The point is that a good AI assisted developer steers through these things and has the skill to make great software from the chaotic ingredients that AI brings to the table.
And this is why articles like this one "just don't get it", because they are expecting the AI to do their job for them and holding it to the standards of a team mate. It does not work that way.
ummonk|1 month ago
terabytest|1 month ago
al_borland|1 month ago
Two days later, after people freaked out, context was added. The team built multiple versions in that year, each had its trade offs. All that context was given to the AI and it was able to produce a “toy” version. I can only assume it had similar trade offs.
https://xcancel.com/rakyll/status/2007659740126761033#m
My experience has been similar to yours, and I think a lot of the hype is from people like this Google engineer who play into the hype and leave out the context. This sets expectations way out of line from reality and leads to frustration and disappointment.
another_twist|1 month ago
keybored|1 month ago
I’ll bring the tar if you bring the feathers.
That sounds hyperbolic but how can someone say something so outrageously false?
everfrustrated|1 month ago
Some techniques I've found useful recently:
- If the agent struggled on something, once it's done I'll ask it "you were struggling here, think about what happened and if there is anything you learned. Put this into a learnings document and reference it in agents.md so we don't get stuck next time"
- Plans are a must. Chat to the agent back and forth to build up a common understanding of the problem you want solved. Make sure to say "ask me any follow up questions you think are necessary". This chat is often the longest part of the project - don't skimp on it. You are building the requirements and if you've ever done any dev work you understand how important having good requirements is to the success of the work. Then ask the model to write up the plan into an implementation document with steps. Review this thoroughly. Then use a new agent to start work on it. "Implement steps 1-2 of this doc". Having the work broken down into steps makes it possible to do the work in smaller pieces (new context windows). This part is the more mindless part and where you get to catch up on reading HN :)
- The GitHub Copilot chat agent is great. I don't get the TUI folks at all. The Pro+ plan is a reasonable price and you can do a lot with it (Sonnet, Codex, etc all available). Being able to see the diffs as it works is helpful (but not necessary) to catch problems earlier.
marwamc|1 month ago
linesofcode|1 month ago
Agentic programming is a skill-set and a muscle you need to develop just like you did with coding in the past.
Things didn’t just suddenly go downhill after an arbitrary tipping point - what happened is you hit a knowledge gap in the tooling and gave up.
Reflect on what went wrong and use that knowledge next time you work with the agent.
For example, investing the time in building a strong test suite and testing strategy ahead of time which both you and the agent can rely on.
Being able to manage the agent and getting quality results on a large, complex codebase is a skill in itself, it won’t happen overnight.
It takes practice and repetition with these tools to level-up, just like anything else.
terabytest|1 month ago
cwoolfe|1 month ago
You still need to think about how you would solve the problem as an engineer and break down the task into a right-sized chunk of work. i.e. If 4 things need to change, start with the most fundamental change which has no other dependencies.
Also it is important to manage the context window. For a new task, start a new "chat" (new agent). Stay on topic. You'll be limited to about five back-and-forths before performance starts to suffer. (Cursor shows a visual indicator of this in the form of the circle/wheel icon)
For larger tasks, tap the Plan button first, and guide it to the correct architecture you are looking for. Then hit build. Review what it did. If a section of code isn't high-quality, tell Claude how to change it. If it fails, then reject the change.
It's a tool that can make you 2 - 10x more productive if you learn to use it well.
arjie|1 month ago
If I'm being honest, the people who get utility out of this tool don't need any tutorials. The smattering of ideas that people mention is sufficient. The people who don't get utility out of this tool are insistent that it is useless, which isn't particularly inspiring to the kind of person who would write a good tutorial.
Consequently, you're probably going to have to pay someone if you want a handholding. And at the end you might believe it wasn't worth it.
keybored|1 month ago
proc0|1 month ago
Anyone who claims AI is great is not building a large or complex enough app, and when it works for their small project, they extrapolate to all possibilities. So because their example was generated from a prompt, it's incorrectly assumed that any prompt will also work. That doesn't necessarily follow.
The reality is that programming is widely underestimated. The perception is that it's just syntax on a text file, but it's really more like a giant abstract machine with moving parts. If you don't see the giant machine with moving parts, chances are you are not going to build good software. For AI to do this, it would require strong reasoning capabilities, that lets it derive logical structures, along with long term planning and simulation of this abstract machine. I predict that if AI can do this then it will be able to do every single other job, including physical jobs as it would be able to reason within a robotic body in the physical world.
To summarize, people are underestimating programming, using their simple projects to incorrectly extrapolate to any possible prompt, and missing the hard part of programming which involves building abstract machines that work on first principles and mathematical logic.
linsomniac|1 month ago
I can't speak for everyone, but lots of us fully understand that the AI tooling has limitations and realize there's a LOT of work that can be done within those limitations. Also, those limitations are expanding, so it's good to experiment to find out where they are.
Conversely, it seems like a lot of people are saying that AI is worthless because it can't build arbitrarily large apps.
I've recently used the AI tooling to make a docusign-like service and it did a fairly good job of it, requiring about a day's worth of my attention. That's not an amazingly complex app, but it's not nothing either. Ditto for a calorie tracking web app. Not the most complex app, but companies are making legit money off them, if you want a tangible measure of "worth".
antonvs|1 month ago
That might be true for agentic coding (caveat below), but AI in the hands of expert users can be very useful - "great" - in building large and complex apps. It's just that it has to be guided and reviewed by the human expert.
As for agentic coding, it may depend on the app. For example, Steve Yegge's "beads" system is over a quarter million lines of allegedly vibe-coded Go code. But developing a CLI like that may be a sweet spot for LLMs, it doesn't have all the messiness of typical business system requirements.
SatvikBeri|1 month ago
* I came up with a list of 9 performance improvement ideas for an expensive pipeline. Most of these were really boring and tedious to implement (basically a lot of special cases) and I wasn't sure which would work, so I had Claude try them all. It made prototypes that had bad code quality but tested the core ideas. One approach cut the time down by 50%, I rewrote it with better code and it's saved about $6,000/month for my company.
* My wife and I had a really complicated spreadsheet for tracking how much we owed our babysitter – it was just complex enough to not really fit into a spreadsheet easily. I vibecoded a command line tool that's made it a lot easier.
* When AWS RDS costs spiked one month, I set Claude Code to investigate and it found the reason was a misconfigured backup setting
* I'll use Claude to throw together a bunch of visualizations for some data to help me investigate
* I'll often give Claude the type signature for a function, and ask it to write the function. It generally gets this about 85% right
sauwan|1 month ago
Ok, please help me understand. Or is this more of a nanny?
abrookewood|1 month ago
mrdependable|1 month ago
egorfine|1 month ago
1) I needed a tool to consolidate *.dylib on macOS into the app bundle. I wanted this tool to be in JS because of some additional minor logic which would be a hassle to implement in pure bash.
2) I needed a simple C wrapper to parallelize /usr/bin/codesign over cores. Split list of binaries in batches and run X parallel codesigns over a batch.
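To show how small the second task is, here is the batching idea sketched in Python rather than C. The helper names are hypothetical, and the command prefix is left to the caller (for example something like `["codesign", "--force", "--sign", identity]`, which is my assumption about the invocation, not taken from the spec).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_batches(items, batch_count):
    """Split items round-robin into at most batch_count non-empty batches."""
    batches = [items[i::batch_count] for i in range(batch_count)]
    return [batch for batch in batches if batch]


def run_batch(command_prefix, batch):
    """Run the command once for a whole batch and return its exit code."""
    return subprocess.run(list(command_prefix) + list(batch)).returncode


def run_parallel(command_prefix, items, workers):
    """Run one process per batch, up to `workers` (>= 1) at a time.

    Threads are enough here because each worker just waits on a child process.
    Returns the exit codes in batch order.
    """
    batches = split_batches(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: run_batch(command_prefix, b), batches))


print(split_batches(["a", "b", "c", "d", "e"], 2))
```

A caller would check that every returned exit code is zero before treating the run as successful.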
Arguably, both tools are junior-level tasks.
I have used Claude Code and Opus 4.5. I have used AskUserTool to interview me and create a detailed SPEC.md. I manually reviewed and edited the final spec. I then used the same model to create the tool according to that very detailed spec.
The first tool, the dylib consolidation one, was broken horrendously. It did recurse into subdirs where no folder structure is expected or needed and did not recurse into folders where it was needed. It created a lot of in-memory structures which were never read. Unused parameters in functions. Unused functions. Incredible, illogical code that is impossible to understand. Quirks, "clever code". Variable scope all over the place. It appeared to work, but only in one single case on my dev workstation and failed on almost every requirement in the spec. I ended up rewriting it from scratch, because the only things worth saving from this generated code were one-liners for string parsing.
The second tool did not even work. You know this quirk of AI models that once they find a wrong solution they keep coming back at it, because the context was poisoned? So, this. Completely random code, not even close. I rewrote the thing from scratch [1].
Curiously, the second tool took way more time and tokens to create despite being quite simpler.
So yeah. We're definitely at most 6 months away from replacing programmers with AI.
[1] https://github.com/egorFiNE/codesign-parallel
rahimnathwani|1 month ago
I've found that helps a lot.
victorbjorklund|1 month ago
1) low risk code
Let's say that we're building an MVP for something, and at this moment we just wanna get something working to get some initial feedback. So, for example, the front-end code is not going to stick around. We just want something there to give the functionality and a feeling, but it doesn't have to be perfect. AI is awesome at creating that kind of front-end code that will just live for a short time before it's probably all thrown out.
2) fast iterations and experimentation
In the past, if you had to build something and were thinking "maybe I can try this thing", you would spend hours or days getting it up and working just to find out whether it was even a good idea in the first place. With AI, I find that I can just ask it to quickly get a working feature up, realize "no, this is not the best way to do it", remove everything, and start over. I could not do that in the past with limited time to spend, doing the same thing over and over again with different libraries or different solutions. But with AI, I can. And when you have something that you like, you can go back and do it correctly.
3) typing for me.
And lastly, even when I write my own code, I don't really type it myself. I don't use the AI to say, "hey, build me a to-do app"; instead I use it to give me the building blocks, more like a very advanced snippet tool. So I might say, "Can you give me a gen server that takes in this and that and returns this and that?" And then of course I review the result.
theshrike79|1 month ago
I have an actual work service that uses a specific rule engine, which has some performance issues.
I could just go to Codex Web and say "try library A and library B as replacements for library X, benchmark all three solutions and give me a summary markdown file of the results"
Then I closed the browser tab and came back later, next day I think, and checked out the results.
That would've been a full day's work from me, maybe a bit more, that was now compressed to 5 minutes of active work.
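The kind of harness an agent tends to produce for a request like that is small. Here is a sketch in Python; the candidate callables are placeholders for "library X", "library A" and "library B", and the function names are my own assumptions.

```python
import timeit


def benchmark(candidates, repeat=5, number=100):
    """Time each zero-argument callable; return name -> best seconds per call."""
    results = {}
    for name, fn in candidates.items():
        times = timeit.repeat(fn, repeat=repeat, number=number)
        # The minimum is the least noisy estimate of the achievable time.
        results[name] = min(times) / number
    return results


def to_markdown(results):
    """Render results as a markdown table, fastest candidate first."""
    lines = ["| candidate | seconds per call |", "| --- | --- |"]
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        lines.append(f"| {name} | {seconds:.9f} |")
    return "\n".join(lines)


# Placeholder workloads standing in for the three rule-engine libraries.
results = benchmark({
    "library_x": lambda: sorted(range(2000), reverse=True),
    "library_a": lambda: sum(range(2000)),
    "library_b": lambda: max(range(2000)),
}, repeat=3, number=50)
print(to_markdown(results))
```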
shafyy|1 month ago
But to answer the OP's question: I am in the same boat as you, I think the use cases are very limited and the productivity gains are often significantly overestimated by engineers who are hyping it up.
mvanzoest|1 month ago
lukebechtel|1 month ago
2. Part of the plan should be automated tests. AI can make these for you too, but you should spot check for reasonable behavior.
3. Use Claude 4.5 Opus
4. Use Git, get the AI to check in its work in meaningful chunks, on its own git branch.
5. Ask the AI to keep an append-only developer log as a markdown file, and to update it whenever its state significantly changes, or it makes a large discovery, or it is "surprised" by anything.
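The developer log in the last point can be as simple as the sketch below. The entry format, the `kind` labels, and the function name are assumptions for illustration; the only essential part is opening the file in append mode so earlier entries are never rewritten.

```python
from datetime import datetime, timezone


def append_log_entry(path, kind, message, now=None):
    """Append one markdown entry to the developer log and return it.

    `kind` is a free-form label such as "state-change", "discovery"
    or "surprise". Mode "a" only ever adds to the end of the file.
    """
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"## {stamp} [{kind}]\n\n{message.strip()}\n\n"
    with open(path, "a", encoding="utf-8") as log:
        log.write(entry)
    return entry
```

Usage would look like `append_log_entry("DEVLOG.md", "surprise", "Build needs CMake, not make")`, where the file name is again just an example.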
baal80spam|1 month ago
In my org we are experimenting with agentic flows, and we've noticed that model choice matters especially for autonomy.
GPT-5.2 performed much better for long-running tasks. It stayed focused, followed instructions, and completed work more reliably.
Opus 4.5 tended to stop earlier and take shortcuts to hand control back sooner.
emilecantin|1 month ago
- Ask Claude to look at my current in-progress task (from Github/Jira/whatever) and repro the bug using the Chrome MCP.
- Ask it to fix it
- Review the code manually, usually it's pretty self-contained and easy to ensure it does what I want
- If I'm feeling cautious, ask it to run "manual" tests on related components (this is a huge time-saver!)
- Ask it to help me prepare the PR: This refers to instructions I put in CLAUDE.md so it gives me a branch name, commit message and PR description based on our internal processes.
- I do the commit operations, PR and stuff myself, often tweaking the messages / description.
- Clear context / start a new conversation for the next bug.
On a personal project where I'm less concerned about code quality, I'll often do the plan->implementation approach. Getting pretty in-depth about your requirements obviously leads to a much better plan. For fixing bugs it really helps to tell the model to check its assumptions, because that's often where it gets stuck and creates new bugs while fixing others.
All in all, I think it's working for me. I'll tackle 2-3 day refactors in an afternoon. But obviously there's a learning curve and having the technical skills to know what you want will give you much better results.
kristopolous|1 month ago
Agentic coding is very similar to frameworks in this regard:
1. If the alignment is right, you have saved time.
2. If it's not right, it might take longer.
3. You won't have clear evidence of which of these cases applies until changing course becomes too expensive.
4. Except, in some cases, this doesn't apply and it's obvious... Probably....
I have a (currently dormant) project https://onolang.com/ that I need to get back to that tries to balance these exact concerns. It's like half written. Go to the docs part to see the idea.
regularfry|1 month ago
What this means in workflow terms is that the bottleneck has moved, from writing the code to reviewing it. That's forward progress! But the disparity can be jarring when you have multiple thousands of lines of code generated every day and people are used to a review cycle based on tens or hundreds.
Some people try to make the argument that we can accept standards of code from AI that we wouldn't accept from a human, because it's the AI that's going to have to maintain it and make changes. I don't accept that: whether you're human or not it's always possible to produce write-only code, and even if the position is "if we get into difficulty we'll just have the agent rewrite it" that doesn't stop you getting into a tarpit in the first place. While we still have a need to understand how the systems we produce work, we need humans to be able to make changes and vouch for their behaviour, and that means producing code that follows our standards.
tcldr|1 month ago
This helps both me and the next agent.
Using these tools has made me realise how much of the work we (or I) do is editing: simplifying the codebase to the clearest boundaries, focusing down the APIs of internal modules, actual testing (not just unit tests), managing emerging complexity with constant refactoring.
Currently, I think an LLM struggles with the subtlety and taste aspects of many of these tasks, but I’m not confident enough to say that this won’t change.
furyofantares|1 month ago
If you want to get good at this, when it makes subtle mistakes or duplicates code or whatever, revert the changes and update your AGENTS.md or your prompt and try again. Do that until it gets it right. That will take longer than writing it yourself. It's time invested in learning how to use these and getting a good setup in your codebase for them.
If you can't get it to get it right, you may legitimately have something it sucks at. Although as you iterate you might also have some other insights into why it keeps getting it wrong, and can maybe change something more substantial about your setup to make it able to get it right.
For example I have a custom xml/css UI solution that draws inspiration both from XML and SwiftUI, and it does an OK job of making UIs for it. But sometimes it gets stuck in ways it wouldn't if it was using HTML or some known (and probably higher quality/less buggy) UI library. I noticed it keeps trying things, adding redundant markup to both the xml and css, using unsupported attributes that it thinks should exist (because they do in HTML/CSS), and never cleans up on the way.
Some amount of fixing up its context made it noticeably better at this but it still gets stuck and makes a mess when it does. So I made it write a linter and now it uses the linter constantly which keeps it closer to on the rails.
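A linter of the kind described here can be quite small. The sketch below is an illustration only: the element and attribute names in `ALLOWED` are invented, and the commenter's real format and linter are not shown in the thread. It flags unknown elements and attributes the format does not support, which is the failure mode mentioned above (attributes borrowed from HTML/CSS).

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical schema for a custom UI markup: element -> supported attributes.
ALLOWED = {
    "stack": {"direction", "spacing"},
    "label": {"text", "style"},
    "button": {"text", "on-tap"},
}


def lint_markup(source: str) -> List[str]:
    """Return one message per unknown element or unsupported attribute."""
    problems = []
    root = ET.fromstring(source)
    for elem in root.iter():  # document order, starting at the root
        allowed = ALLOWED.get(elem.tag)
        if allowed is None:
            problems.append(f"unknown element <{elem.tag}>")
            continue
        for attr in elem.attrib:
            if attr not in allowed:
                problems.append(f"<{elem.tag}> does not support attribute '{attr}'")
    return problems


if __name__ == "__main__":
    sample = '<stack direction="vertical"><label text="Hi" class="big"/><div/></stack>'
    for problem in lint_markup(sample):
        print(problem)
```

Because the output is a plain list of messages, an agent can run it after every edit and treat a non-empty result as a failed step.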
Your pet feeding app isn't in this category. You can get a substantial app pretty far these days without running into a brick wall. Hitting a wall that quickly just means you're early on the learning curve. You may have needed to give it more technical guidance from the start, and have it write tests for everything, make sure it makes the app observable to itself in some way so it can see bugs itself and fix them, stuff like that.
damnitbuilds|1 month ago
i.e. You are asking a question about whether using agents to write code is net-positive, and then you go on about not reviewing the code agents produce.
I suspect agents are often net-positive AND one has to review their code. Just like most people's code.
spolitry|1 month ago
DustinBrett|1 month ago
For sysops stuff I have found it extremely useful, once it has MCP's into all relevant services, I use it as the first place I go to ask what is happening with something specific on the backend.
iamsaitam|1 month ago
internet_points|1 month ago
is this a term of art? I interpreted it as "people only show off the best of the best or the worst of the worst, while the averages don't post online", though I've never heard the term "edge framing" before
organised|1 month ago
Exploratory scripts, glue code—what I think of as digital duct tape between systems—scaffolding, probes, and throwaway POCs have always been messy and lightly governed. That’s kind of normal.
What’s different now is that more people can participate in that phase, and we’re judging that work by the same norms and processes we use for production systems. I know more designers now who are unafraid to code than ever before. That might be problematic or fantastic.
Where agentic coding does work for me is explicitly in those early modes, or where my own knowledge is extremely thin or I don’t have the luxury of writing everything myself (time etc). Things that simply wouldn’t get made otherwise: feasibility checks, bridging gaps between systems, or showing a half-formed idea fast enough to decide whether it’s worth formalising.
In those contexts, technical debt isn’t really debt, because the artefact isn’t meant to live long enough to accrue interest or be used in anger.
So I don’t think the real question is "does agentic coding work?" It’s whether teams are willing to clearly separate exploratory agency from production authority, and enforce a hard line between the two. (Sadly, I don't think they'll know the difference.) Without that, you’re right: you just get spaghetti that passes today and punishes EVERYONE six months later.
ljf|1 month ago
To the extent that no prototype could EVER end up in live - it had to be rewritten.
This allowed prototypes to move at brilliant speed, using whatever tech you wanted (I saw plenty of paper, powerpoint and flash prototypes). Once you proved the idea (and the value), it was iteratively rebuilt 'properly'.
At other companies I have seen things hacked together as a proof of concept, live years later, and barely supported.
I can see agentic working great for prototyping, especially in the hands of those with limited technical knowledge.
loh|1 month ago
FWIW it seems like it heavily depends on the agent + model you're using. I've had the most success with Claude Code (Sonnet), and only tried Opus 4.5 for more complex things. I've also tried Codex which didn't seem very good by comparison, plus a handful of other local models (Qwen3, GLM, Minimax, etc.) through OpenCode, Roo, and Cline that I'm able to run on my 128 GB M4 Max. The local ones can work for very simple agentic tasks, albeit quite slow.
theshrike79|1 month ago
You give it a well-defined task, it'll putter away quietly and come back with results.
I've found it to be pretty good at code reviews or large refactoring operations, not so much building new features.
VladimirGolovin|1 month ago
SXX|1 month ago
For gamedev you can build quite a complex 2D game prototype in Pygame or Unity rapidly, since 20-50 KLOC is enough for a lot of indie games. And it allows you to iterate and try different ideas much faster.
Most features are either one-shots that make all the changes across the codebase in one prompt, or need only a few fixing prompts.
It really helps to isolate simulation from all else with mandatory CQRS for gamestate.
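To make the CQRS point concrete: the simulation owns the game state, every change goes through a command, and everything else reads through queries. A minimal sketch, with all class and field names invented for illustration rather than taken from the commenter's projects:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


# Commands: immutable descriptions of a requested state change.
@dataclass(frozen=True)
class SpawnUnit:
    unit_id: str
    x: int
    y: int
    hp: int


@dataclass(frozen=True)
class MoveUnit:
    unit_id: str
    dx: int
    dy: int


@dataclass(frozen=True)
class DamageUnit:
    unit_id: str
    amount: int


Command = Union[SpawnUnit, MoveUnit, DamageUnit]


@dataclass
class Unit:
    x: int
    y: int
    hp: int


class Simulation:
    """Owns the game state: writes go through apply(), reads through queries."""

    def __init__(self) -> None:
        self._units: Dict[str, Unit] = {}

    def apply(self, command: Command) -> None:
        if isinstance(command, SpawnUnit):
            self._units[command.unit_id] = Unit(command.x, command.y, command.hp)
        elif isinstance(command, MoveUnit):
            unit = self._units[command.unit_id]
            unit.x += command.dx
            unit.y += command.dy
        elif isinstance(command, DamageUnit):
            unit = self._units[command.unit_id]
            unit.hp = max(0, unit.hp - command.amount)
        else:
            raise TypeError(f"unknown command: {command!r}")

    # Queries return plain values, never the mutable Unit objects themselves.
    def position_of(self, unit_id: str) -> Tuple[int, int]:
        unit = self._units[unit_id]
        return (unit.x, unit.y)

    def alive_units(self) -> List[str]:
        return sorted(uid for uid, unit in self._units.items() if unit.hp > 0)
```

With this split, rendering and UI code can be regenerated freely as long as it only issues commands and calls queries, and the simulation can be tested without any graphics at all.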
It also helps to generate markdown readmes along the way for all major systems and keep feature checklists in the header of each file. This way the LLM doesn't lose context of what is being generated.
Basically I generated in 2-3 weeks projects that would take 2-3 months to implement in a team simply because there is much less delay between idea of feature and testing it in some form.
Yes, occasionally you will fail to write a proper spec or the LLM will fail to generate working code, but then it usually means you revert everything, rewrite the specification and try again.
So the LLMs of today are certainly suitable when "good enough" is sufficient, which makes them good for prototyping. Then if you want better architecture, you just guide the LLM to refactor the complete code.
LLMs are also good for small self-contained projects or microservices where all relevant information fits into context.
st-msl|1 month ago
Everyone's building the same workarounds. CLAUDE.md files. Handoff docs. Learnings folders. Developer logs. All manual. All single-user. All solving the same problem: how do I stop re-teaching the agent things it should already know?
What nobody seems to ask: what if the insight that helped me debug a PayPal API timeout yesterday could help every developer who hits that bug tomorrow?
Stack Overflow was multiplayer. A million developers contributing solutions that benefited everyone. We replaced it with a billion isolated sessions that benefit no one else.
The "junior developer that never grows" framing is right. But it's worse - it's a junior who forgets everything at 5pm and shows up tomorrow needing the same onboarding. And there's no way for your junior's hard-won knowledge to help anyone else's.
We're building Memco to work on this. Shared memory layer for agents. Not stored transcripts - abstracted insights. When one agent figures something out, every agent benefits.
Still early. Curious if others are thinking about this or have seen attempts at it.
cx42net|1 month ago
This.
(Thank you!)
another_twist|1 month ago
What I do is: I write a skeleton. Then I write a test suite (nothing fancy, just 1 or 2 sanity tests). I'll usually start with some logic that I want to implement and break it down into XYZ steps. One thing to note here: TDD is very useful. If it makes your head hurt, it means the requirements aren't very clear; otherwise it's relatively easy to write test cases. Second thing: if your code isn't testable in parts, it probably needs some abstraction and refactoring. I typically test at the level of abstraction boundaries, e.g. if something needs to write to a database, I'll write a data layer abstraction (standard stuff) and test that layer by whatever means are appropriate.
Once the spec reaches a level where it's a matter of typing, I'll add annotations in the code and add TODOs for Codex. Then I instruct it with some context; by this time it's much easier to write the context, since TDD clears out the brain fog. And I tell it to finish the TODOs and only the TODOs. My most used prompt is "DONT CHANGE ANYTHING APART FROM THE FUNCTIONS MARKED AS TODO." I also have an AGENTS.md file listing any common library patterns to follow.
If the code isn't correct, I'll ask Codex to redo it until it gets to a shape I understand. Most of the time it gets things right the second time around, i.e. iteration is easier than starting from ground zero. Usually it takes me a day to finish a part, or at least I plan it that way. For me, Codex does save a whole bunch of time, but only because of the upfront investment.
Personally, you should just ignore the YouTubers; most of them are morons. If you'd like to check out AI coding flows, check out the ones from masters like Antirez and Mitchell H. That's a better way of learning the right tricks.
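The "only touch the TODO functions" rule can also be checked mechanically instead of relying on the prompt alone. The following is a rough sketch under stated assumptions: it only looks at top-level Python functions, and it assumes the TODO marker is a comment inside the function body. `unexpected_changes` is a hypothetical helper, not part of any tool mentioned in the thread.

```python
import ast
from typing import Dict, List


def _functions(source: str) -> Dict[str, str]:
    """Map each top-level function name to its exact source text."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def unexpected_changes(before: str, after: str, marker: str = "TODO") -> List[str]:
    """List functions that were added, removed or edited without a TODO marker."""
    old, new = _functions(before), _functions(after)
    flagged = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) == new.get(name):
            continue  # untouched
        if name in old and marker in old[name]:
            continue  # the agent was allowed to change this one
        flagged.append(name)
    return flagged


if __name__ == "__main__":
    before = (
        "def load(path):\n"
        "    return path\n"
        "\n"
        "def parse(text):\n"
        "    # TODO: implement\n"
        "    raise NotImplementedError\n"
    )
    after = (
        "def load(path):\n"
        "    return path.strip()\n"
        "\n"
        "def parse(text):\n"
        "    return text.split()\n"
    )
    print(unexpected_changes(before, after))
```

Run against the file contents before and after the agent's edit, a non-empty result is a cue to revert and re-prompt rather than to review the extra changes by hand.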
nl|1 month ago
Here's what works for me:
Spend a lot of time working out plans. If you have a feature, get Claude Opus to build a plan, then ask it "How many github issues should this be", and get it to create those issues.
Then for each issue ask it to plan the implementation, then update the issue.
Then get it to look at all the issues for the feature and look for inconsistencies.
Once this is done, you can add your architectural constraints. If you think one issue looks like it could potentially reinvent something, edit that issue to point it at the existing implementation.
Once you are happy with the plan, assign to your agents and wait.
Optionally you can watch them - I find this quite helpful because you do see them go offtrack sometimes and can correct.
As they finish, run a separate review agent. Again, if you have constraints make sure the agent enforces them.
Finally, do an overall review of the feature. This should be initially AI assisted.
Don't get frustrated when it does the wrong thing - it will! Just tell it how to do the correct thing, and add that to your AGENTS.md so next time it will do it. Consider adding it to your issue template manually too.
In terms of code review, I manually review critical calculations line-by-line, and do a broad sweep review over the rest. That broad sweep review looks for duplicate functionality (which happens a lot) and for bad test case generation.
I've found this methodology speeds up the coding task around 5-10x what I could do before. Tasks that were 5-10 days of work are now doable in around 1 day.
(Overall my productivity increase is a lot higher because I don't procrastinate dealing with issues I want to avoid).
jrjeksjd8d|1 month ago
I work with an infrastructure team that are old school sysadmins, not really into coding. They are now prodigiously churning out apps that "work" for a given task. It is producing a ton of technical debt and slowing down new feature development, but this team doesn't really get it because they don't know enough software engineering to understand.
Likewise the recent example of an LLM "coding a browser" where the result didn't compile and wasn't useful. If you took it at face value you'd think "wow that's a hard task I couldn't do, and an LLM did it alone". In fact they spent a ton of effort on manually herding the LLM only for it to produce something pretty useless.
ElFitz|1 month ago
Similarly, I had it successfully migrate a third (so far) of our tests from an old testing framework to a new one, one test suite at a time.
We also had a race condition, and providing Claude Code with both the unsymbolicated trace and the build’s symbols, it successfully symbolicated the trace, identified the cause. When prompted, it identified most of the similar instances of the responsible pattern in our codebase (the one it missed was an indirect one).
I didn’t care much about the suggested fixes on that last one, but consider it a success too, especially since I could just keep working on other stuff while it chugged along.
StrangeSound|1 month ago
This is obviously a _very_ simple website, but in my opinion there's no argument that agentic coding works.
You're in control of the sandbox. If you don't set any rules or guidelines to follow, the LLM will produce code you're not happy with.
As with anything, it's process. If you're building a feature and it has lots of bugs, there's been a misstep. More testing required before moving onto feature 2.
What makes you say "unreviewed code"? Isn't that your job now you're no longer writing it?
CJefferson|1 month ago
An AI managed to do basically the whole transfer. One big help is I said "The website output of the current version should be identical", so I had an easy way to test for correctness (assuming it didn't try cheating by saving the website of course, but that's easy for me to check for).
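An "output must be identical" check like this is easy to script. A minimal sketch, assuming both versions can write their generated site into a directory (the directory layout and the function names are illustrative, not the commenter's setup):

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def diff_builds(old_root: Path, new_root: Path) -> Dict[str, List[str]]:
    """Report files that are missing, extra, or changed in the new build."""
    old, new = snapshot(old_root), snapshot(new_root)
    return {
        "missing": sorted(set(old) - set(new)),
        "extra": sorted(set(new) - set(old)),
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old_site, new_site = Path(tmp) / "old", Path(tmp) / "new"
        for site, body in ((old_site, "<h1>Home</h1>"), (new_site, "<h1>Home!</h1>")):
            site.mkdir()
            (site / "index.html").write_text(body, encoding="utf-8")
        print(diff_builds(old_site, new_site))
```

Three empty lists mean the migrated version reproduces the old output byte for byte, which gives the agent an unambiguous pass/fail signal.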
cmpalmer52|1 month ago
I did it as an experiment with my constraint being that I refused to edit code, but I did review the code it made and made it make fixes.
I didn’t do it as a one shot. Roughly, I:
* Sketched out a layout on paper and photographed it (very rough)
* Made a list of requirements and had the AI review and augment them
* Asked ChatGPT outside of the IDE to come up with an architecture and guidelines I could give to the agent
* Presented all of that info to the AI as project guidelines and requirements
* Then created individual tasks and had it complete them one by one: create a UI with stubbed API calls and fake data, create the service that talks to AzureDevOps and test it, create my Node service, hook it all up, add features and fix bugs.
Result, fairly clean code, very attractive and responsive UI, all requirements met.
My other developers loved it and immediately started asking for new features. Each new feature was another agentic task, completed over 1-3 iterations.
So it wasn’t push button automatic, but I wrote 0% of it (code wise) and probably invested 6-8 total hours. My web dev skills are rusty, so I think the same thing would have taken 4-5 days and would not have looked as nice.
orwin|1 month ago
Basically my point of view is that if you don't feel comfortable reviewing your coworkers' code, you shouldn't generate code with AI, because you will review it badly and then I will have to catch the bugs and fix them (happened 24 hours ago). If you generate code, you'd better understand where it can generate side effects.
6thbit|1 month ago
- New session: Fed the entire spec, asked to build generic scaffolding only.
- New session: Fed the entire spec, asked to build generic TEST scaffolding.
- New session: Extract features to implement out of the spec doc into .md files.
- New session: Perform research on the codebase with the problem statement "in mind", write results to another .md.
- Performed manual review of every .md.
- New session(s): Fed research and feature .md and asked for ONE task at a time, ensuring tests were written as per spec, and kept iterating until they passed.
- Code reviewed beginning with test assertions, and asked for modifications if required. Before commit, asked to update progress on .md.
Ended up with a very solid large project, including a technology I was familiar with but not an expert on, that I would feel confident evolving without an agent if I had to, and I learned a lot in the process. It would've taken me at least 2 weeks to read docs about it and at least another 3 to implement by hand; I was done in 2 total.
iwalton3|1 month ago
- Web framework (includes basic component library, optional bundler/optimizer, tutorial/docs, e2e tests, and demos): https://github.com/iwalton3/vdx-web
- Music player web app (supports large music libraries, pwa offline sync, parametric eq, crossfade, crossfeed, semantic feature-based music search/radio, milkdrop integration, and other interesting features): https://github.com/iwalton3/mrepo-web
- Documentation update script (also allows exporting Claude conversations to markdown): https://github.com/iwalton3/cl-pprint
Regarding QC these are side projects so I validate them based on code review of key components, e2e testing, and manual testing where applicable. I find having the agent be able to check its work is the single biggest factor to reducing rework, but I make no promises about these projects being completely free of bugs.
borzi|1 month ago
fathermarz|1 month ago
Long story short, since Claude 3.7 I haven’t written a single line of code and have had great success. I review it for cleanliness, anti-patterns, and good abstraction.
I was in charge of a couple of full-system projects and refactors, and I put Claude Code on my work machine, which no one seemed to care about because of the top-down “you should use AI or else you aren’t a team player” attitude. Before I left in November I basically didn’t work, was in meetings all the time while also being expected to deliver code, and I started moonlighting for the company I work at now.
My philosophy is, any tool can be powerful if you learn how to use it effectively. Something something 10,000 hours, something something.
Edit: After leaving this post I came across this and it is spot on to my point about needing time. https://www.nibzard.com/agentic-handbook
mdavid626|1 month ago
The problem: there is no way he verified the code in any way. The business logic behind the feature would probably take a few days to check for correctness. But if it looks good -> done. Let the customer check it. Of course, he claims “he reviewed it”.
It feels to me, we just skip doing half the things proper senior devs did, and claim we’re faster.
mythrwy|1 month ago
Previously I tried to use Aider and openAI about 6 or 7 months ago and it was a terrible mess. I went back to pasting snippets in the browser chat window until a few weeks ago and thought agents were mostly hype (was wrong).
I keep a browser chat window open to talk about the project at a higher level. I'll post command line output like `ls` and `cat` to the higher level chat and use Codex strictly for coding. I haven't tried to one shot anything. I just give it a smallish piece of work at a time and check as it goes in a separate terminal window. I make the commits and delete files (if needed) and anything administrative. I don't have any special agent instructions. I do give Codex good hints on where to look or how to handle things.
It's probably a bit slower than what some people are doing but it's still very fast and so far has worked well. I'm a bit cautious because of my previous experience with Aider which was like roller skating drunk while juggling open straight razors and which did nothing but make a huge mess (to be fair I didn't spend much time trying to tame it).
I'm not sold on Codex or openAI compared to other models and will likely try other agents later, but so far it's been good.
sockopen|1 month ago
“Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%--AI tooling slowed developers down.”
3vidence|1 month ago
If agentic coding worked as well as people claimed on large codebases I would be seeing a massive shift at my job... I'm really not seeing it.
We have access to pretty much all the latest and greatest internally at no cost and it still seems the majority of code is still written and reviewed by people.
AI assisted coding has been a huge help to everyone but straight up agentic coding seems like it does not scale to these very large codebases. You need to keep it on the rails ALL THE TIME.
strange_quark|1 month ago
I still mostly write my own code and I’ve seen our claude code usage and me just asking it questions and generating occasional boilerplate and one-off scripts puts me in the top quartile of users. There are some people who are all in and have it write everything for them but it doesn’t seem like there’s any evidence they’re more productive.
deanc|1 month ago
I think it depends on your tooling, your code-base, your problem space, and your ability to intelligently inject context. If all four are aligned (in my case they are) it's the real deal.
Spacemolte|1 month ago
stavros|1 month ago
Still takes much less time for me to review the plan and output than write the code myself.
znsksjjs|1 month ago
So typing was a bottleneck for you? I’ve only found this true when I’m a novice in an area. Once I’m experienced, typing is an inconsequential amount of time. Understanding the theory of mind that composes the system is easily the largest time sink in my day to day.
cat_plus_plus|1 month ago
Now in terms of using AI, the key is to view yourself as a technical lead, not a people manager. You don't stop coding completely or treat underlying frameworks as a black box, you just do less of it. But at some point fixing a bug yourself is faster than writing a page of text explaining exactly how you want it fixed. Although when you don't know the programming language, giving pseudocode or sample code in another language can be super handy.
unknown|1 month ago
[deleted]
jdauriemma|1 month ago
erichocean|1 month ago
Mostly Gemini Pro 2.5 (and now Gemini Pro 3) and mostly Clojure and/or Java, with some JavaScript/Python. I require Gemini's long context size because my approach leans heavily on in context learning to produce correct code.
I've recently found Claude Code with Opus 4.5 can relieve me of some of the "agent" stuff I've done, allowing the AI to work for 10-20 minutes at a time on its own. But for anything difficult, I still do it the old way, intervening every 1-3 minutes.
Each interaction with the AI costs at least $1, usually more (except Claude Code, where I use the $200/month plan), so my workflow is not cheap. But it 100% works and I developed more high-quality code in 2025 than in any previous year.
atraac|1 month ago
dangus|1 month ago
> I personally can't accept shipping unreviewed code. It feels wrong. The product has to work, but the code must also be high-quality.
What’s the definition of high-quality? For the project I was working on, I just needed it to work without any obvious bugs. It’s not an app for an enterprise business critical purpose, life critical (it’s not a medical device or something), or regulated industry. It’s just a consumer app for convenience and novelty.
The app is fast, smaller than 50MB, doesn’t have any bugs that the AI couldn’t fix for my test users. Sounds like the code is high quality to me.
I literally don’t give a shit what the code looks like. You gotta remember that code is just one of many methods to implement business logic. If we didn’t have to write code to achieve the result of making apps and websites it would have no value and companies wouldn’t hire software engineers.
I don’t write all my apps this way, but in this specific case letting Jesus take the wheel made sense and saved me a ton of time.
Garlef|1 month ago
1. It helps immensely if YOU take responsibility for the architecture. Just tell the agent not only what you want but also how you want it.
2. Refactoring with an agent is fast and cheap.
0. ...given you have good tests
---
Another thing: The agents are really good at understanding the context.
Here's an example of a prompt for a refactoring task that I gave to codex. It worked out great and took about 15 minutes. It mentions a lot of project-specific concepts, but codex could make sense of it.
""" we have just added a test backdoor prorogate to be used in the core module.
let's now extract it and move it into a self-contained entrypoint in the testing module (adjust the exports/esbuilds of the testing module as needed and in accordance with the existing patterns in the core and package-system modules).
this entrypoint should export the prorogate and also create its environment
refactor the core module to use it from there then also adjust the ui-prototype and package system modules to use this backdoor for cleanup """
fmdv|1 month ago
In augment code (or any other IDE agent integration), I can just @powershell-advanced-function-design at the top so the agent references my rule file, and then I list requirements after.
Things like:
- Find any bugs or problems with this function and fix them.
- Optimize the function for performance and to reduce redundant operations.
- Add additional comments to the code in critical areas.
- Add debug and verbose output to the function.
- Add additional error handling to the function if necessary.
- Add additional validation to the function if necessary.
It was also essential for me to enable the "essential" MCP servers like sequential thinking, context7, fetch, filesystem, etc.
Powershell coding isn't particularly complex, so this might not work out exactly how you want if you're dealing with a larger codebase with very advanced logic.
Another tangent: Figma Make is actually extremely impressive. I tried it out yesterday to create a simple prompt manager application, and over a period of ~30min I had a working prototype with:
- An integrated markdown editor with HTML preview and syntax highlighting for code fences.
- A semi-polished UI with a nice category system for organizing prompts.
- All required modals / dialogs were automatically created and functioned properly.
I really think agentic coding DOES work. You just have to be very explicit with your instructions and planning.
YMMV.
devalexwells|1 month ago
In order to better research, I built (ironically, mostly vibe coded) a tool to run structured "self-experiments" on my own usage of AI. The idea is I've init a bunch of hypotheses I have around my own productivity/fulfillment/results with AI-assisted coding. The tool lets me establish those then run "blocks" where I test a particular strategy for a time period (default 2 weeks). So for example, I might have a "no AI" block followed by a "some AI" block followed by a "full agent all-in AI block".
The tool is there to make doing check-ins easier, basically a tiny CLI wrapper around journaling that stays out of my way. It also does some static analysis on commit frequency, code produced, etc. but I haven't fleshed out that part of it much and have been doing manual analysis at the end of blocks.
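As an illustration of the commit-frequency side of such analysis (this is not the linked tool's code, just a sketch of the idea), one can bucket commit dates by ISO week. It assumes the input is the output of `git log --date=short --pretty=format:%ad`, i.e. one `YYYY-MM-DD` date per line:

```python
from collections import Counter
from datetime import date
from typing import Dict


def commits_per_week(log_text: str) -> Dict[str, int]:
    """Count commits per ISO week from one YYYY-MM-DD date per line."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        year, week, _weekday = date.fromisoformat(line).isocalendar()
        counts[f"{year}-W{week:02d}"] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = "2025-01-06\n2025-01-07\n2025-01-13\n"
    print(commits_per_week(sample))
```

Comparing these weekly counts across a "no AI" block and an "all-in AI" block gives a crude activity signal; it says nothing about the quality of what was committed, which is why the journaling part matters.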
For me this kind of self-tracking has been more helpful than hearsay, since I can directly point to periods where it was working well and try to figure out why or what I was working on. It's not fool-proof, obviously, but for me the intentionality has helped me get clearer answers.
Whether those results translate beyond a single engineer isn't a question I'm interested in answering and feels like a variant of developer metrics-black-hole, but maybe we'll get more rigorous experiments in time.
The tool is open source here (may have bugs, I've only been using it a few weeks): https://github.com/wellwright-labs/devex
thayne|1 month ago
Cthulhu_|1 month ago
The thing that people don't seem to understand is that these are two separate processes with separate goals. You don't do code reviews to validate behaviour, nor do you test to validate code.
Code reviews are for maintainability and non-functional requirements. Maintainability is something that every longer term software project has run into, to the point where applications have been rewritten from scratch because the old code was unmaintainable.
In theory you can say "let the LLM handle it", but how much do you trust it? It's practically equivalent to using a 3rd party library: most people treat them as a black box with an API, and the code details don't matter. And it can work, I'm sure, but do you trust it?
replyifuagree|1 month ago
Sadly review isn't enough. Just today I found some code that I reviewed 2 months ago where the developer clearly used an agent to generate the code, and I completely missed some really dumb garbage the agent put in. The agent took a simple function that returns an object with some data and turned it into a mess that required multiple mocks in the tests (also generated by the agent).
The dev is a junior and a clear example of what is to come, inexperienced people thinking coding is getting an agent to get something to pass CI.
The tech debt is accelerating exponentially!
benreesman|1 month ago
The first thing anyone should do is immediately understand that they are in a finite-sum adversarial relationship with all of their vendors. If you're getting repeatably and unambiguously good outcomes with Claude Code you're either in the bandit arm where high twitter follower count people go or you're counting cards in a twelve deck shoe and it'll be 24 soon. Switch to opencode today, the little thing in the corner is a clock, and a prominent clock is more or less proof you're not in a casino.
There is another key observation that took me a long time to internalize, which is that the loss surface of formalism is not convex: most of us have tried to lift the droids out of the untyped lambda calculus and into System F, and this does not as a general thing go well. But a droid in the untyped lambda calculus will with 100% likelihood eventually Superfund your codebase, no exceptions.
The droids are wildly "happy" for lack of a better term in CIC but the sweet spot is System F Omega. A web app that's not in Halogen is an unforced error now: pure vibecoders can do lastingly valuable work in PureScript, and they can't in React, which is irretrievably broken under any sane effects algebra.
So AI coding is kind of a mean reversion in a sense. Knuth and Lamport and Hoare and Dijkstra had an LLM: it was an army of guys in short-sleeved dress shirts with half a page of fanfold system prompt on every page. That is automatable now, but there's a straightforward algorithm for remaining relevant: crank the ambition up until Opus 4.5 is whiffing.
Computer Scientist is still a high impact job. The cast of Office Space is redundant. But we kinda knew that the first time we saw Office Space.
Personally I'm working harder than ever on harder problems than ever with pretty extreme AI assist hygiene: you spend all your time on hard problems with serious impact. The burden of understanding the code and being able to quit vim has gone up, not down, and mathematics is absorbing computer programming.
The popular narrative about Claude Code 10x-ing JavaScript projects is maskirovka.
halis|1 month ago
In these types of applications, there's already a lot of low hanging fruit to be had from working with an LLM.
If you're on a greenfield app where you get to make those decisions at the start, then I think I would still use the LLMs but I would be mindful of what you check into the code base. You would be better off setting up the project structure yourself and coding some things as examples of how you want the app to work. Once you have some examples in place, then you can use the LLMs to repeat the process for new screens/features.
stormcode|1 month ago
I had 12 legacy node apps running node 4, with jQuery front ends. Outdated dependencies and best practices all over. No build pipeline. No tests. Express 3. All of it worked but it's aging to the point of no return. And the upgrade work is boring, with very little ROI.
In a month, without writing any code, I've got them all upgraded to Node 22, with updated dependencies, jQuery removed completely, a newer version of Express, better logging, and an improved UI.
It's work that would have taken me a year of my free time and been soul crushing and annoying.
Did it with codex as a way of better learning the tooling. It felt more like playing a resource sim game than coding. Pretty enjoyable. Was able to work on multiple tasks at once while doing some other work.
It worked really well for that.
X_use_teleport|1 month ago
I think a critical point is how well one can communicate/delegate. I have a background with systems thinking and communication, so figuring out how to prompt for what I’m after was smooth.
I was also an early adopter of LLMs so there’s good muscle memory there.
It’s important to see AI tools as accelerators - not replacements or solvers. Still do test-driven development. Still maintain robust documentation. Good practices + AI is where the value is; not just throwing AI at things.
ZitchDog|1 month ago
I made the world's fastest and most accurate JSON Schema validator.
https://github.com/sberan/tjs
jasondigitized|1 month ago
vessenes|1 month ago
With that in mind, a couple of comments - think of the coding agents as personalities with blind spots. A code review by all of them and a synthesis step is a good idea. In fact currently popular is the “rule of 5” which suggests you need the LLM to review five times, and to vary the level of review, e.g. bugs, architecture, structure, etc. Anecdotally, I find this is extremely effective.
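A rough sketch of what a multi-pass review with a synthesis step could look like. Everything here is an assumption for illustration: `reviewer` stands in for whatever call you use to ask an agent for findings, and the last two focus names are my own additions to the three mentioned above.

```python
from typing import Callable, Sequence

# Hypothetical focus list; only bugs, architecture and structure come from the comment above.
FOCUSES = ("bugs", "architecture", "structure", "tests", "usability")


def multi_pass_review(
    code: str,
    reviewer: Callable[[str, str], list[str]],
    focuses: Sequence[str] = FOCUSES,
) -> list[str]:
    """Run one review pass per focus, then merge findings and drop exact duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for focus in focuses:
        for finding in reviewer(code, focus):
            if finding not in seen:
                seen.add(finding)
                # Tag each finding with the first pass that reported it.
                merged.append(f"[{focus}] {finding}")
    return merged
```

The synthesis here is deliberately dumb (exact-match dedupe). In practice you would probably hand the merged list back to a model or a human to rank.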
Right now, Claude is in my opinion the best coding agent out there. With Claude Code, the best harnesses are starting to automate the review / PR process a bit, but the hand holding around bugs is real.
I also really like Yegge’s beads for LLMs keeping state and track of what they’re doing — upshot, I suggest you install beads, load Claude, run ‘!bd prime’ and say “Give me a full, thorough code review for all sorts of bugs, architecture, incorrect tests, specification, usability, code bugs, plus anything else you see, and write out beads based on your findings.” Then you could have Claude (or codex) work through them. But you’ll probably find a fresh eye will save time, e.g. give Claude a try for a day.
Your ‘duplicated code’ complaint is likely an artifact of how codex interacts with your codebase - codex in particular likes to load smaller chunks of code in to do work, and sometimes it can get too little context. You can always just cat the relevant files right into the context, which can be helpful.
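If you want something slightly more structured than a raw `cat`, a small helper like this works. It is a minimal sketch; the function name and the `=== path ===` header format are arbitrary choices of mine, not anything Codex or Claude requires.

```python
from pathlib import Path


def build_context(paths: list[str]) -> str:
    """Concatenate files into one prompt-ready string, labelling each with its path."""
    parts = []
    for p in paths:
        text = Path(p).read_text(encoding="utf-8")
        parts.append(f"=== {p} ===\n{text}")
    return "\n\n".join(parts)
```

Labelling each file with its path gives the model something to refer back to when it proposes changes.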
Finally, iOS is a tough target — I’d expect a few more bumps. The vast bulk of iOS apps are not up on GitHub, so there’s less facility in the coding models.
And any front end work doesn’t really have good native visual harnesses set up, (although Claude has the Claude chrome extension for web UIs). So there’s going to be more back and forth.
Anyway - if you’re a career engineer, I’d tell you - learn this stuff. It’s going to be how you work in very short order. If you’re a hobbyist, have a good time and do whatever you want.
enknee1|1 month ago
Because the network of abstractions that is a human awareness (the ol' meat suit pilot model) is unique to all of us we cannot directly share components of our internal networks directly. Thus, we all interact through language and we all use language differently. While it's true that compute is fundamentally the same for all of us (we have to convert complex human abstractions into computable forms and computers don't vary that much), programming languages provide general mappings for diverse human abstractions back to basic compute features.
And so, just like with coding, the most natural path for interacting with a LLM is also unique to all of us. Your assumptions, your prior knowledge, and your world perspective all shape how you interact with the model. Remember you're not just getting code back though... LLMs represent a more comprehensive world of ideas.
So approach the process of learning about large language models the same way that you approach the process of learning a new language in general: pick a hello world project (something that's hello world for you) and walk through it with the model paying attention to what works and what doesn't. You'd do something similar if you were handed a team of devs that you didn't know.
For general use, I start by having the model generate a req document that 1) I vet thoroughly. Then I have the model make TODO lists at all levels of abstraction (think procedural decomposition for the whole project) down to my code that 2) I vet thoroughly. Then I require the model to complete the TODO tasks. There are always hiccups, same as when working with people. I know the places that I can count on solid, boilerplate results and require fewer details in the TODOs. I do not release changes to the TODO files without 3) review. It's not fire-and-forget but the process is modular and understandable and 4) errors arising from system design are mine to identify and address in the req and TODOs.
Good luck and have fun!
theshrike79|1 month ago
Did it in one go, WAY better than I ever could have. Creates directories, generates configs from templates, all secrets are encrypted and managed etc.
I've iterated on it a bit more to optimise some bits, mostly for ergonomics, but the basic structure is still there.
---
Did a similar thing with Ansible. I gave claude a way to access my local computer (Mac) and a server I have (Linux), told it to create an ansible setup that sets up whatever is installed on both machines + configurations.
Again, managed it faster and way better than I ever could have.
I even added an Arch Linux VM to the mix just to complicate things, that went faster than I could've done it myself too.
traceroute66|1 month ago
The only positive agentic coding experience I had was using it as a "translator" from some old unmaintained shell + C code to Go.
I gave it the old code, told it to translate to Go. I pre-installed a compiled C binary and told it to validate its work using interop tests.
It took about four hours of what the vibecoding lovers call "prompt engineering" but at the end I have to admit it did give me a pretty decent "translation".
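The interop validation described above can be sketched roughly like this: feed the same inputs to the old binary and the new one and report where they disagree. This is an illustration under my own assumptions (both programs read stdin and write stdout; the helper names are made up), not the actual harness used.

```python
import subprocess


def run(cmd: list[str], stdin_data: str) -> tuple[int, str]:
    """Run a command with the given stdin; return (exit code, stdout)."""
    proc = subprocess.run(cmd, input=stdin_data, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def find_mismatches(
    reference_cmd: list[str], candidate_cmd: list[str], inputs: list[str]
) -> list[str]:
    """Return the inputs for which the two programs disagree on exit code or stdout."""
    return [
        data
        for data in inputs
        if run(reference_cmd, data) != run(candidate_cmd, data)
    ]
```

With the reference binary pre-installed, an empty result from `find_mismatches` over a decent input corpus is a concrete, checkable target to give the agent.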
However for everything else I have tried (and yes, vibecoders, "tried" means very tightly defined tasks) all I have ever got is over-engineered vibecoding slop.
The worst part of it is that because the typical cut-off window is anywhere between 6–18 months prior, you get slop that is full of deprecated code because there is almost always a newer/more efficient way to do things. Even in languages like Go. The difference between an AI-slop answer for Go 1.20 and a human coded Go 1.24/1.25 one can be substantial.
morisil|1 month ago
My advice - embrace TDD. Work with AI on tests, not implementation - your implementation is disposable, to be regenerated; tests fully specify your system through contracts. This is trickier for UI than for logic. Embracing architectures that allow testing the view model in isolation might help. In general, anything that reduces cognitive load during inference time is worth doing.
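As a small illustration of testing a view model in isolation, here is a sketch with a hypothetical `FeedingViewModel` (borrowing the pet-feeding idea from the original post; none of these names come from a real project). The view model has no UI imports, so the tests act as the contract and the implementation underneath can be regenerated freely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FeedingViewModel:
    """Hypothetical view model: state plus display logic, no UI framework involved."""

    last_fed: datetime
    interval: timedelta

    def is_due(self, now: datetime) -> bool:
        return now - self.last_fed >= self.interval

    def status_text(self, now: datetime) -> str:
        return "Feeding due" if self.is_due(now) else "Recently fed"


# Contract tests: these stay fixed while the implementation is disposable.
def test_due_exactly_at_interval():
    t0 = datetime(2025, 1, 1, 8, 0)
    vm = FeedingViewModel(last_fed=t0, interval=timedelta(hours=8))
    assert vm.is_due(t0 + timedelta(hours=8))


def test_status_text_before_interval():
    t0 = datetime(2025, 1, 1, 8, 0)
    vm = FeedingViewModel(last_fed=t0, interval=timedelta(hours=8))
    assert vm.status_text(t0 + timedelta(hours=1)) == "Recently fed"
```

Passing `now` in explicitly, rather than reading the clock inside the view model, is what keeps these tests deterministic.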
jbbryant|1 month ago
lostmsu|1 month ago
interleave|1 month ago
I call it "moonwalk" because, when throwing away the intermediate vibe-coded prototype code in the middle, it feels like walking backwards while looking forward.
- Check out a spike branch
- Vibe code until prototype feels right.
- Turn prototype into markdown specification
- Throw away vibe'd code, keep specification
- Rebase specification into main, check out main
- Feed specification to our XP/TDD agents
- Wait, review a few short iterations if any
- Ship to production
This allows me to get the best of vibe-coding (exploring, fast iterating and dialing-in on the product experience) and writing production-grade code (using our existing XP practices via dedicated CC sub-agents and skills.)
PlatoIsADisease|1 month ago
I am writing automation software that interfaces with a legacy Windows CAD program. Depending on the automation, I just need a picture of the part. Sometimes I need part thickness. Sometimes I need to delete parts. Etc... It's very much about interacting with the CAD system and checking the CAD file or output for desired results.
I was considering something that would take screenshots and send them back for checks. Not sure what platforms can do this. I am stumped about how Visual Studio works with this; there are a bunch of pieces like servers, agents, etc...
Even a how-to link would work for me. I imagine this would be extremely custom.