aeldidi | 21 days ago
As a real world example, I was told to evaluate Claude Code and ChatGPT Codex at my current job since my boss had heard about them and wanted to know what they would mean for our operations. Our main environment is a C# and TypeScript monorepo with 2 products being developed, and even with a pretty extensive test suite and a nearly 100 line "AGENTS.md" file, all the models I tried basically fail or try to shortcut nearly every task I give them, even when using "plan mode" to give them time to come up with a plan before starting. To be fair, I was able to get them to work pretty well after giving extremely detailed instructions, monitoring the "thinking" output, and stopping them to correct course whenever I saw something wrong there, but at that point I felt silly for spending all that effort just driving the bot instead of doing it myself.
It almost feels like this is some "open secret" which we're all pretending isn't the case too, since if it were really as good as a lot of people are saying there should be a massive increase in the number of high quality projects/products being developed. I don't mean to sound dismissive, but I really do feel like I'm going crazy here.
RealityVoid|21 days ago
- driving the LLM instead of doing it yourself. Sometimes I just can't get the activation energy, and the LLM is always ready to go, so it gives me a kickstart
- doing things you normally don't know. I learned a lot of command line tools and tricks by seeing what Claude does. Doing short scripts for stuff is super useful. Of course, the catch here is if you don't know stuff you can't drive it very well. So you need to use the things in isolation.
- exploring alternative solutions. Stuff that by definition you don't know. Of course, some will not work, but it widens your horizon
- exploring unfamiliar codebases. It can ingest huge amounts of data so exploration will be faster. (But less comprehensive than if you do it yourself fully)
- maintaining change consistency. This I think it's just better at than humans. If you have stuff you need to change in 2 or 3 places, you will probably forget one. LLMs are better at keeping consistency in the details (but not at big picture stuff, interestingly.)
TheAceOfHearts|21 days ago
I'd previously encountered tools that seemed interesting, but as soon as I tried getting it to run I found myself going down an infinite debugging hole. With an LLM I can usually explain my system's constraints and the best models will give me a working setup from which I can begin iterating. The funny part is that most of these tools are usually AI related in some way, but getting a functional environment often felt impossible unless you had really modern hardware.
kace91|21 days ago
There is a counter issue though, realizing mid session that the model won’t be able to deliver that last 10%, and now you have to either grok a dump of half finished code or start from scratch.
Fogest|21 days ago
I use Claude Code a decent amount, and I actually find that sometimes this can be the opposite for me. Sometimes it is actually missing other areas that the change will impact and causing things to break. Sometimes when I go to test it I need to correct it and point out it missed something or I notice when in the planning phase that it is missing something.
However I do find if you use a more powerful Opus model when planning, it does consider things fully a lot better than it used to. This is actually one area where I have been seeing some very good improvements as the models and tooling improve.
In fact, I actually hope that these AI tools keep getting better at the point you mention, as humans also have a "context limit". There are only so many small details I can remember about the codebase so it is good if AI can "remember" or check these things.
I guess a lot of the AI can also depend on your codebase itself, how you prompt it, and what kind of agents file you have. If you have a robust set of tests for your application you can very easily have AI tools check their work to ensure things aren't being broken and quickly fix it before even completing the task. If you don't have any testing more could be missed. So I guess it's just like a human in some sense. If you have a crappy codebase for the AI to work with, the AI may also sometimes create sloppy work.
tunesmith|21 days ago
It's possible some of it is due to codebase size or tech stack, but I really think there might be more of a human learning curve going on here than a lot of people want to admit.
I think I am firmly in the average of people who are getting decent use out of these tools. I'm not writing specialized tools to create agents of agents with incredibly detailed instructions on how each should act. I haven't even gotten around to installing a Playwright MCP (probably my next step).
But I've:
- created project directories with soft links to several of my employer's repos, and been able to answer several cross-project and cross-team questions within minutes, that normally would have required "Spike/Disco" Jira tickets for teams to investigate
- interviewed codebases along with product requirements to come up with very detailed Jira AC, and then... just for the heck of it, had the agent use that AC to implement the actual PR. My team still code-reviewed it but agreed it saved time
- in side projects, have shipped several really valuable (to me) features that would have been too hard to consider otherwise, like... generating pdf book manuscripts for my branching-fiction creative writing club, and launching a whole new website that has been mired in a half-done state for years
Really my only tricks are the basics: AGENTS.md, brainstorm with the agent, continually ask it to write markdown specs for any cohesive idea, and then pick one at a time to implement in commit-sized or PR-sized chunks. GPT-5.2 xhigh is a marvel at this stuff.
My codebases are scala, pekko, typescript/react, and lilypond - yeah, the best models even understand lilypond now so I can give it a leadsheet and have it arrange for me two-hand jazz piano exercises.
I generally think that if people can't reach the above level of success at this point in time, they need to think more about how to communicate better with the models. There's a real "you get out of it what you put into it" aspect to using these tools.
colechristensen|21 days ago
Can I get it to finish by asking it over and over to code review its PR or some other such generic prompt to weed out the skips and scaffolding? Also yes.
Basically these things just need a supervisor looking at the requirements, test results, and evaluating the code in a loop. Sometimes that's a human, it can also absolutely be an LLM. Having a second LLM with limited context asking questions to the worker LLM works. Moreso when the outer loop has code driving it and not just a prompt.
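To make "the outer loop has code driving it" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `supervise`, `Review`, and the `worker` / `reviewer` callables are made-up names standing in for whatever calls your worker LLM and your reviewer (a second LLM, a test runner, or a human). It only shows the shape of the loop, not anyone's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Verdict from the supervisor (hypothetical structure)."""
    approved: bool
    feedback: str


def supervise(
    task: str,
    worker: Callable[[str, List[str]], str],
    reviewer: Callable[[str, str], Review],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Ask the worker for a draft, have the reviewer check it against the
    task, and feed the feedback back in until approval or max_rounds."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        draft = worker(task, feedback)   # worker sees all feedback so far
        review = reviewer(task, draft)   # reviewer sees only task + draft
        if review.approved:
            return draft, round_no
        feedback.append(review.feedback)
    raise RuntimeError(f"not approved after {max_rounds} rounds")
```

The reviewer deliberately gets only the task and the draft, which matches the "second LLM with limited context" idea; the hard cap on rounds is there so the loop cannot spin forever.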
ChatEngineer|21 days ago
[deleted]
brookst|21 days ago
I can’t say it’s led to shipping “high quality projects”, but it has let me accomplish things I just wouldn’t have had time for previously.
I’ve been wanting to develop a plastic -> silicone -> plaster -> clay mold making process for years, but it’s complex and mold making is both art and science. It would have been hundreds of hours before; with maybe 12 hours of Claude Code I’m almost there (some nagging issues… maybe another hour).
And I had written some home automation stuff back with Python 2.x a decade ago; it was never worth the time to refamiliarize myself with in order to update, which led to periodic annoyances. 20 minutes, and it’s updated to all the latest Python 3.x and modern modules.
For me at least, the difference between weeks and days, days and hours, and hours and minutes has allowed me to do things I just couldn’t justify investing time in before. Which makes me happy!
So maybe some folks are “pretending”, or maybe the benefits just aren’t where you’re expecting to see them?
arwhatever|21 days ago
wvlia5|21 days ago
sarchertech|21 days ago
That’s so nebulous and likely just plain wrong. I have some experience with silicone molds and casting silicone and other materials. I have no idea how you’d accurately estimate it would take hundreds of hours. But the most likely reason you’ve had results is that you just did it.
This sounds very very much like confirmation bias. “I started drinking pine needle tea and then 5 days later my cold got better!”
I use AI, it’s useful for lots of things, but this kind of anecdote is terrible evidence.
holySBoring|21 days ago
[deleted]
FeteCommuniste|21 days ago
input_sh|21 days ago
For example a lot of pro-OpenAI astroturfing really wanted you to know that 5.3 scored better than opus on terminal-bench 2.0 this week, and a lot of Anthropic astroturfing likes to claim that all your issues with it will simply go away as soon as you switch to a $200/month plan (like you can't try Opus in the cheaper one and realise it's definitely not 10x better).
Groxx|21 days ago
so yeah, it wouldn't surprise me if it was well over most. I don't actually claim that it is over half here, I've run across quite a few of these kinds of people in real life as well. but it wouldn't surprise me.
viking123|21 days ago
Also all this stuff about Claude having feelings directed at midwits is hilarious
mikenew|21 days ago
whaleidk|21 days ago
jameshush|21 days ago
I KNOW a common issue people run into is they forget to handle rate limits, but I also know more JavaScript than Python and have limited time, so before I'd write:
```
# NOTE: Make sure to handle the rate limit! This is just an example. See example.com/docs/javascript/rate-limit-example for a js example doing this.
```
Unsurprisingly, more than half of customers would just ignore the comment, forget to handle the rate limit, and then write in a few months later. With Claude, I just write "Create a customer demo in Python that handles rate limits. Use example.com/docs/javascript/rate-limit-example as a reference," and it gets me 95% of the way there.
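For readers wondering what "handles rate limits" usually amounts to, here is a rough Python sketch. It is not the demo described above and not any vendor's API: `call_with_rate_limit` and the `send` callable are hypothetical names, the response is assumed to be a plain `(status, headers, body)` tuple, and the header is assumed to be spelled exactly `Retry-After` with a value in seconds.

```python
import time
from typing import Callable, Dict, Tuple

# Assumed response shape for this sketch: (status, headers, body)
Response = Tuple[int, Dict[str, str], str]


def call_with_rate_limit(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry `send` while it reports HTTP 429, waiting between attempts."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        backoff = base_delay * 2 ** attempt  # exponential fallback delay
        retry_after = headers.get("Retry-After")
        try:
            # Prefer the server's hint when it is a number of seconds.
            delay = float(retry_after) if retry_after is not None else backoff
        except ValueError:
            delay = backoff  # e.g. Retry-After given as an HTTP date
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` in as a parameter is only a convenience so the waiting can be faked in tests; in a real client you would wrap your HTTP library's request call in `send`.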
There are probably 100 other small examples like this where I had the "vibe" to know where the customer might trip over, but not the time to plug up all the little documentation example holes myself. Ideally, yes, hiring a full-time person to handle plugging up these holes would be great, but if you're resource constrained paying Anthropic for tokens is a much faster/cheaper solution in the short term.
disgruntledphd2|21 days ago
They seem to fall apart (for me, at least) when the projects get larger or have multiple people working on them.
They're also super helpful for analytics projects (I'm a data person) as generally the needed context is much smaller (and because I know exactly how to approach these problems, it's that typing the code/handling API changes takes a bunch of time).
thegrim000|21 days ago
In this author's case, they currently work for a company that .. wait for it .. less than 2 weeks ago launched some "AI image generation built for teams" product. (Also, oddly, the author lists himself as the 'Technical Director' at the company, working there for 5-6 years, but the company's Team page doesn't list him as an employee).
ildon|21 days ago
yusufnb|21 days ago
Over the last few months, I have seen a notable difference in the quality and extent of projects these students have been able to accomplish. Every project and website they show looks polished; most of those could have been a full startup MVP in the pre-AI days.
The bar has clearly been raised way high, very fast with AI.
josiahpeters|21 days ago
Once we got them into a technical screening, most fell apart writing code. Our problem was simple: using your preferred programming language, model a shopping cart object that has the ability to add and remove items from the cart and track the cart total.
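For a sense of the scale of that exercise, one possible answer in Python might look like the sketch below. The comment does not say what a passing solution looked like, so the details here are assumptions: items are keyed by name, quantities are tracked, and prices use `Decimal` to avoid float rounding.

```python
from decimal import Decimal
from typing import Dict, Tuple


class ShoppingCart:
    """Tracks items by name as (unit price, quantity) pairs."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[Decimal, int]] = {}

    def add(self, name: str, unit_price: Decimal, quantity: int = 1) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        # Adding an existing item bumps its quantity and updates its price.
        _, current = self._items.get(name, (unit_price, 0))
        self._items[name] = (unit_price, current + quantity)

    def remove(self, name: str, quantity: int = 1) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        if name not in self._items:
            raise KeyError(name)
        price, current = self._items[name]
        if quantity >= current:
            del self._items[name]  # removing the last units drops the item
        else:
            self._items[name] = (price, current - quantity)

    @property
    def total(self) -> Decimal:
        # Computed on demand so it can never drift out of sync with the items.
        return sum(
            (price * qty for price, qty in self._items.values()),
            Decimal("0"),
        )
```

Usage would be along the lines of `cart.add("book", Decimal("12.50"), 2)` followed by reading `cart.total`.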
We were shocked by how incapable most candidates were of writing simple code without their IDE's tab completion capability. We even told them to use whatever resources they normally used.
The whole experience left us a little surprised.
smoe|21 days ago
For the former, greenfield projects, LLMs are easily a 10x productivity improvement. For the latter, it gets a lot more nuanced. Still amazingly useful in my opinion, just not the hands off experience that building from scratch can be now.
biztos|21 days ago
But the reason you don’t see a flood of great products is that the managerial layer has no idea what to do with massively increased productivity (velocity). Ask even a Google what they’d do with doubly effective engineers and the standard answer is to lay half of them off.
chrisjj|21 days ago
The headline gain is speed. Almost no-one's talking about quality - they're moving too fast to notice the lack.
LogicFailsMe|21 days ago
That they are so good at the things I like to do the least and still terrible at the things at which I excel. That's just gravy.
But I guess this is in line with how most engineers transition to management sometime in their 30s.
peab|21 days ago
usually when someone hypes it up it's things like, "i have it text my gf good morning every day!!", or "it analyzed every single document on my computer and wrote me a poem!!"
fragmede|21 days ago
The "open secret" is that shipping stuff is hard. Who hasn't bought a domain name for a side project that didn't go anywhere. If there's anybody out there, raise your hand! So there's another filtering effect.
The crazy pills are thinking that HN is in any way representative of anything about what's going on in our broader society. Those projects are out there, why do you assume you'll be told about it? That someone's going to write an exposé/blog post on themselves about how they had AI build a thing and now they're raking in the dollars and oh, buy my course on learning how to vibecode? The people selling those courses aren't the ones shipping software!
aeldidi|21 days ago
I don't doubt that an LLM would theoretically be capable of doing these sorts of things, nor did I intend to give off that sentiment, rather I was more evaluating if it was as practical as some people seem to be making the case for. For example, a C compiler is very impressive, but it's clear from the blog post[0] that this required a massive amount of effort setting things up and constant monitoring and working around limitations of Claude Code and whatnot, not to mention $20,000. That doesn't seem at all practical, and I wonder if Nicholas Carlini (the author of the Anthropic post) would have had more success using Claude Code alongside his own abilities for significantly cheaper. While it might seem like moving the goalpost, I don't think it's the same thing to compare what I was saying with the fact that a multi billion dollar corporation whose entire business model relies on it can vibe code a C compiler with $20,000 worth of tokens.
> The problem is people have egos, myself included. Not in the inflated sense, but in the "I built a thing and now the Internet is shitting on me and I feel bad" sense.
Yes, this is actually a good point. I do feel like there's a self report bias at play here when it comes to this too. For example, someone might feel like they're more productive, but their output is roughly the same as what it was pre-LLM tooling. This is kind of where I'm at right now with this whole thing.
> The "open secret" is that shipping stuff is hard. Who hasn't bought a domain name for a side project that didn't go anywhere. If there's anybody out there, raise your hand! So there's another filtering effect.
My hand is definitely up here, shipping is very hard! I would also agree that it's an "open secret", especially given that "buying a domain name for a side project that never goes anywhere" is such a universal experience.
I think both things can be true though. It can be true that these tools are definitely a step up from traditional IDE-style tooling, while also being true that they are not nearly as good as some would have you believe. I appreciate the insight, thanks for replying.
[0]: https://www.anthropic.com/engineering/building-c-compiler
v1ne|21 days ago
Also, there is nothing complex in a C compiler. As students we built these things as toy projects at uni, without any knowledge of software development practices.
Yet, to bring an example for something that's more than a toy project: 1 person coded this video editor with AI help: https://github.com/Sportinger/MasterSelects
AstroBen|21 days ago
Even if it's not straight astroturfing I think people are wowed and excited and not analyzing it with a clear head
fragmede|21 days ago
kylecazar|21 days ago
qingcharles|20 days ago
So, I've very little to publicly show for all my obnoxious LLM advocacy. I wonder if any others are in the same boat?
xhrpost|21 days ago
This is the challenge I also face, it's not always obvious when a change I want will be properly understood by the LLM. Sometimes it one shots it, then others I go back and forth until I could have just done it myself. If we have to get super detailed in our descriptions, at what point are we just writing in some ad-hoc "programming language" that then transpiles to the actual program?
mark_l_watson|21 days ago
Given time AI will lead to incredible productivity. In the meantime, use as appropriate.
noosphr|21 days ago
daliusd|21 days ago
mbfg|21 days ago
I then ask it to do the same thing in Java, and it spends a half hour trying to do the same job and gets caught in some bit of trivia around how to convert HTML escape characters, for instance s.replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;"); and endlessly compiles and fails over and over again, never able to figure out what it has done wrong, nor deciding to give up on the minutiae and continue with the more important parts.
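For reference, the chained-replace approach to HTML escaping looks like this in Python (a sketch in Python rather than the Java from the anecdote; `escape_html` is a made-up helper name). The one detail that matters is the order: the ampersand has to be replaced first, otherwise the entities produced by the later steps get escaped a second time. The standard library's `html.escape` does the same job and additionally escapes single quotes.

```python
import html


def escape_html(text: str) -> str:
    """Escape the characters that are special in HTML text and attributes."""
    return (
        text.replace("&", "&amp;")   # must come first to avoid double escaping
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
    )


sample = '<a href="x">Tom & Jerry</a>'
escaped = escape_html(sample)

# For input without single quotes the hand-rolled version matches the stdlib,
# and html.unescape reverses it.
assert escaped == html.escape(sample)
assert html.unescape(escaped) == sample
```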
mathw|21 days ago
There's been a lot of talk about it for the past few years but we're just not seeing impacts. Oh sure, management talk it up a lot, but where's the corresponding increase in feature delivery? Software stability? Gross profit? EBITDA?
Give me something measurable and I'll consider it.
cogman10|21 days ago
A giant monorepo would be a bad fit for an LLM IMO.
unknown|21 days ago
[deleted]
rpigab|21 days ago
I'm mostly a freeloader, so how could I judge people who put in the tokens equivalent to 15 years worth of electricity (incl heating and hot water) bills for my home in a C compiler?
Well, I can see that Anthropic is still an AI company, not a software company, they're granting us access to their most valuable resource that almost doesn't require humans, for a very reasonable fee, allowing us to profit instead of them. They're philanthropists.
sutterd|21 days ago
dan-robertson|21 days ago
It does also seem to me that there is a lot of variance in skills for prompting/using AI in general (I say this as someone who is not particularly good as far as I’m aware – I’m not trying to keep tips secret from you). And there is also a lot of variance in the ability of an AI to solve problems of equal difficulty for a human.
mnky9800n|21 days ago
scotty79|21 days ago
What makes the difference is that agents can create these instructions themselves and monitor themselves and revert actions that didn't follow instructions. You didn't get there because you achieved satisfactory results with semi-manual solutions. But people who abhor manual work are getting there already.
unknown|21 days ago
[deleted]
deterministic|21 days ago
schmuhblaster|21 days ago
[0] https://github.com/deepclause/deepclause-sdk
growt|21 days ago
nevster|21 days ago
hawkernews|21 days ago
razster|21 days ago
g-mork|21 days ago
I used this line for a long time, but you could just as easily say the same thing for a typical engineer. It basically boils down to "Claude likes its tickets to be well thought out". I'm sure there is some size of project where its ability to navigate the codebase starts to break down, but I've fed it sizeable ones and so long as the scope is constrained it generally just works nowadays
geetee|21 days ago
utopiah|21 days ago
philipwhiuk|21 days ago
It's the appearance of productivity, not actual productivity.
rafabulsing|21 days ago
Which I think is what people gather from him, but they somehow think he's hiding it or pretending it's not the case? Which I find strange, given how openly he's talked about it.
As for his productivity going down over time, I think that's a combination of his videos getting bigger scopes and production values, and also him moving some of his time into some not so publicly visible ventures. E.g., he was one of the founders of Standard, which eventually became the Nebula streaming service (though he left quite a while ago now).
NamlchakKhandro|21 days ago
interesting.
how much planning do you put into your project without AI anyway?
Pretty much all the teams I've been involved in:
- never did any analysis planning, and just yolo it along the way in their PR
- every PR is an island, with tunnel vision
- fast forward 2 years, and we have to throw it out and start again.
So why are you thinking you're going to get anything different with LLMs?
And plan mode isn't just a single conversation that you then flip to do mode...
you're supposed to create detailed plans and research that you then use to make the LLM refer back to and align with.
This was the point of the Ralph Loop
manzu|20 days ago
laughfactory|21 days ago
piskov|21 days ago
Tried to move some excel generation logic from epplus to closedxml library.
ClosedXml has basically the same API so the conversion was successful. Not a one-shot but relatively easy with a few manual edits.
But closedxml has no batch operations (like apply style to the entire column): the api is there but internal implementation is on cell after cell basis. So if you have 10k rows and 50 columns every style update is a slow operation.
Naturally, told all about this to codex 5.3 max thinking level. The fucker still succumbed to range updates here and there.
Told it explicitly to make a style cache and reuse styles on cells on same y axis.
5-6 attempts — fucker still tried ranges here and there. Because that is what is usually done.
Not here yet. Maybe in a year. Maybe never.
fragmede|21 days ago
Yeah I have the same problem where it always uses smart quotes which messes up my compile. I told ChatGPT not to use them but it keeps doing it.
AdeptusAquinas|20 days ago
That being said, it's great at generating boilerplate code or in my case, doing something like 'make a react component here please that does this small thing, and is aligned with the style in the rest of the file'. Good for when I need to work with code bases or technologies that are not my daily. Also a great research assistant.
But I guess being a 'better google' or a 'glorified spellchecker' doesn't get that hype money.
gchamonlive|21 days ago
mannanj|21 days ago
It also kinda feels gaslightish and, as I've said in some controversial replies in other posts, it's sort of eerily mass "psychosis" vibes just like during COVID.
holySBoring|21 days ago
All AI-IS-WONDERFUL stories are garbage-trash written by garbage people.
Fuck AI. Fuck HN AI promoters. Hopefully you all lose your jobs and fail in life.
belter|21 days ago
Hardly before, now it's almost three times a week. And never gets any questions on GPU amortization...
rulerviper|21 days ago
[deleted]