> Reading through these commits sparked an idea: what if we treated prompts as the actual source code? Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.
Please god, no, never do this. For one thing, why would you not commit the generated source code when storage is essentially free? That seems insane for multiple reasons.
> When models inevitably improve, you could connect the latest version and regenerate the entire codebase with enhanced capability.
How would you know if the code was better or worse if it was never committed? How do you audit for security vulnerabilities or debug with no source code?
For over a decade, my work has involved a project that is almost entirely generated code. Not AI-generated: the actual work of the project is creating the code generator.
One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable. The nature of reviewing changes is just too different between them.
Another thing we learned very quickly was that attempting to generate code and then modify the result is not sustainable; nor is aiming for a 100% generated code base. The end result was that we had to significantly rearchitect the project so that we could inject manually crafted code at arbitrary places in the generated code.
Another thing we learned is that any change in the code generator needs to have a feature flag, because someone was relying on the old behavior.
I'm the first to admit that I'm an AI skeptic, but this goes way beyond my views about AI and is a fundamentally unsound idea.
Let's assume that a hypothetical future AI is perfect. It will produce correct output 100% of the time, with no bugs, errors, omissions, security flaws, or other failings. It will also generate output instantly and cost nothing to run.
Even with such perfection this idea is doomed to failure because it can only write code based on information in the prompt, which is written by a human. Any ambiguity, unstated assumption, or omission would result in a program that didn't work quite right. Even a perfect AI is not telepathic. So you'd need to explain and describe your intended solution extremely precisely, without ambiguity. Especially considering that in this "offline generation" case there is no opportunity for our presumed perfect AI to ask clarifying questions.
But, by definition, any language which is precise and clear enough to not produce ambiguity is effectively a programming language, so you've not gained anything over just writing code.
The idea as stated is a poor one, but a slight reshuffling and it seems promising:
You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code: it is required to actually run the program, after all. However you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase and if the suite passes you carry on, with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.
Obviously it's not actually that simple, as tests don't catch everything (tho with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.
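The regenerate-and-gate loop described above can be sketched in a few lines. To be clear, everything here (`generate_codebase`, the directory handling) is hypothetical glue standing in for whatever LLM toolchain you use, not a real API:

```python
import subprocess
import tempfile

def regenerate_and_verify(prompt_chain, generate_codebase, test_command="pytest"):
    """Re-run a saved prompt chain with a newer model, then gate the
    result on the existing test suite before adopting it.

    `generate_codebase` is a stand-in for the LLM tooling: it takes the
    prompt chain and writes a fresh codebase into a directory.
    """
    workdir = tempfile.mkdtemp(prefix="regen-")
    generate_codebase(prompt_chain, out_dir=workdir)

    # The old tests are the contract: adopt the regenerated codebase
    # only if the full suite passes against it.
    result = subprocess.run(test_command, cwd=workdir, shell=True)
    return workdir if result.returncode == 0 else None
```

If the suite fails, you keep the committed code and have lost nothing but compute; the approach is only as trustworthy as the tests are thorough.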
Plus, commits depend on the current state of the system.
What sense does “getting rid of vulnerabilities by phasing out {dependency}” make, if the next generation of the code might not rely on the mentioned library at all? What does “improve performance of {method}” mean if the next generation uses a fully different implementation?
It makes no sense whatsoever except for a vibecoder's script that’s being extrapolated into a codebase.
I'd say commit a comprehensive testing system with the prompts.
Prompts are in a sense what higher-level programming languages were to assembly. Sure, there is a crucial difference, which is reproducibility. I could try to write down my thoughts on why I think it won't be so problematic in the long run. I could be wrong, of course.
I run https://pollinations.ai which serves over 4 million monthly active users quite reliably. It is mostly coded with AI. For about a year there has been no significant human commit. You can check the codebase. It's messy, but no more messy than my codebases were pre-LLMs.
I think prompts + tests in code will be the medium-term solution. Humans will be spending more time testing different architecture ideas and be involved in reviewing and larger changes that involve significant changes to the tests.
Apart from obvious non-reproducibility, the other problem is lack of navigable structure. I can't command+click or "show usages" or "show definition" any more.
I'm pretty sure most people aren't doing "software engineering" when they program. There's a whole world of WordPress and Dreamweaver-style programming out there too, where the consequences of messing up aren't really important.
LLMs can be configured to have deterministic output too.
Also, while it is in principle possible to have a deterministic LLM, the ones used by coding assistants aren't deterministic, so the prompts would not reliably reproduce the same software.
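As a toy illustration of what determinism buys: hosted APIs expose knobs like temperature and a sampling seed, and the effect can be mimicked with a seeded random choice over per-token distributions. All names here are made up for the sketch, and note that even with a pinned seed, real providers only promise best-effort reproducibility, since the serving stack itself can change under you:

```python
import random

def sample_completion(token_dists, seed=None):
    """Toy stand-in for LLM decoding: pick one token per position from a
    weighted distribution. With a fixed seed the output is reproducible;
    without one, repeated runs can differ."""
    rng = random.Random(seed)
    tokens = []
    for dist in token_dists:  # one {token: probability} dict per position
        choices, weights = zip(*sorted(dist.items()))
        tokens.append(rng.choices(choices, weights=weights)[0])
    return " ".join(tokens)

dists = [{"def": 0.6, "class": 0.4}, {"foo": 0.5, "bar": 0.5}]
# Same seed, same "code"; this is the property prompt-only repos would need.
assert sample_completion(dists, seed=42) == sample_completion(dists, seed=42)
```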
There is definitely an argument, for also committing prompts, but it makes no sense to only commit prompts.
I think the author is saying you commit the prompt with the resulting code. You said it yourself, storage is free, so include the prompt as a comment along with the output; it would show the developer's intent and, to some degree, almost always contribute to the documentation process.
Forget different model versions. The exact same model with the exact same prompt will generate vastly different code each subsequent time you invoke it.
Yes, it's too early to be doing that now, but if you see the move to AI-assisted code as at least the same magnitude of change as the move from assembly to high level languages, the argument makes more sense.
Nobody commits the compiled code; this is the direction we are moving in, high level source code is the new assembly.
These posts are funny to me because prompt engineers point at them as evidence of the fast-approaching obsolescence of software engineers, but the amount of software engineering experience necessary to even guide an AI in this way is very high.
The reason he keeps adjusting the prompts is because he knows how to program. He knows what it should look like.
The argument is that this stuff will so radically improve senior engineer productivity that the demand for junior engineers will crater. And without a pipeline of junior engineers, the junior-to-senior trajectory will radically atrophy.
Essentially, the field will get frozen where existing senior engineers will be able to utilize AI to outship traditional senior-junior teams, even as junior engineers fail to secure employment
I don’t think anything in this article counters this argument
> It just blurs the line between engineer and tool.
I realise you meant it as “the engineer and their tool blend together”, but I read it like a funny insult: “that guy likes to think of himself as an engineer, but he’s a complete tool”.
I mean yeah, the very first prompt given to the AI was put together by an experienced developer; a bunch of code telling the AI exactly what the API should look like and how it would be used. The very first step in the process already required an experienced developer to be involved.
> Almost every feature required multiple iterations and refinements. This isn't a limitation—it's how the collaboration works.
I guess that's where a big miss in understanding so much of the messaging about generative AI in coding happens for me, and why the Fly.io skepticism blog post irritated me so much as well.
It _is_ how collaboration with a person works, but when you have to fix the issues that the tool created, you aren't collaborating with a person; you're making up for a broken tool.
I can't think of any field where I'd be expected to not only put up with, but also celebrate, a tool that screwed up and required manual intervention so often.
The level of anthropomorphism that occurs in order to advocate on behalf of generative AI use leads to saying things like "it's how collaboration works" here, when I'd never say the same thing about the table saw in my woodshop, or even the relatively smart cruise control on my car.
Generative AI is still just a tool built by people following a design, and which purportedly makes work easier. But when my saw tears out cuts that I have to then sand or recut, or when my car slams on the brakes because it can't understand a bend in the road around a parking lane, I don't shrug and ascribe them human traits and blame myself for being frustrated over how they collaborate with me.
Garbage in, Garbage out... My experiment with vibe coding was quite nice, but it did require a collaborative back and forth, mostly because I didn't know exactly what I wanted. It was easiest to ask for something, then describe how what it gave me needed to be changed. The cost of this type of interaction was much easier than trying to craft the perfect prompt on the first go. My first prompts were garbage, but the output gradually converged to something quite good.
Likewise when they use all these benchmarks for "intelligence" and the tool will do the silliest things that you'd consider unacceptable from a person once you've told them a few times not to do a certain thing.
I love the paradigm shift but hate when the hype is uninformed or dishonest or not treating it with an eye for quality.
> Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.
So every single run will result in different non-reproducible implementation with unique bugs requiring manual expert interventions. How is this better?
It's an interesting review but I really dislike this type of techno-utopian determinism: "When models inevitably improve..." Says who? How is it inevitable? What if they've actually reached their limits by now?
Models are improving every day. People are figuring out thousands of different optimizations to training and to hardware efficiency. The idea that right now in early June 2025 is when improvement stops beggars belief. We might be approaching a limit, but that's going to be a sigmoid curve, not a sudden halt in advancement.
It is "inevitable" in the sense that in 99% of the cases, tomorrow is just like yesterday.
LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.
It's not "techno-utopian determinism". It's a clearly visible trajectory.
Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.
The observation that strict prompt adherence plus prompt archival could shift how we program is both true, and it's a phenomenon we observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.
There's definitely valid criticism to the passage, and it's overly optimistic - in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all correct. That's both a more useful criticism, and not tied to LLM improvements at all.
What is ironic: if we buy into the theory that AI will write the majority of code in the next 5-10 years, what is it going to train on after? Itself? This theoretical trajectory of "will inevitably get better" is only true if humans are producing quality training data. The quality of code LLMs create is very much proportional to how mature and ubiquitous the languages/projects are.
Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for pushing 3 years.
I commented on the original discussion a few days ago but I will do it again.
Why is this such a big deal? This library is not even that interesting. It is a very straightforward task that I expect most programmers would be able to pull off easily. Two thirds of the code is type interfaces and comments. The rest is a by-the-book implementation of a protocol that is not even that complex.
Please, there are some React JSX files in your code base with a lot more complexity and intricacy than this.
I don’t like to accuse, and the article is fine overall, but this stinks: “This transparency transforms git history from a record of changes into a record of intent, creating a new form of documentation that bridges human reasoning and machine implementation.”
I did human notes -> had Claude condense and edit -> manually edit. A few of the sentences (like the stinky one below) were from Claude which I kept if it matched my own thoughts, though most were changed for style/prose.
I'm still experimenting with it. I find it can't match style at all, and even with the manual editing it still "smells like AI" as you picked up. But, it also saves time.
My prompt was essentially "here are my old blog posts, here's my notes on reading a bunch of AI generated commits, help me condense this into a coherent article about the insights I learned"
So, it means that you and the LLM together have managed to write SEVEN lines of trivial code per hour. On a protocol that is perfectly documented, where you can look at about one million other implementations when in doubt.
It is not my intention to hurt your feelings, but it sounds like you and/or the LLM are not really good at their job. Looking at programmer salaries and LLM energy costs, this appears to be a very very VERY expensive OAuth library.
Again: Not my intention to hurt any feelings, but the numbers really are shockingly bad.
I spent about 5 days semi-focused on this codebase (though I always have lots of people interrupting me all the time). It's about 5000 lines (if you count comments, tests, and documentation, which you should). Where do you get 7 lines per hour?
Yes, my brain got confused on who wrote the code and who just reported about it. I am truly sorry. I will go see my LLM doctor to get my brain repaired.
That's exactly what I thought, too, before I tried it!
Turns out it feels very different than I expected. I really recommend trying it rather than assuming. There's no learning curve, you just install Claude Code and run it in your repo and ask it for things.
(I am the author of the code being discussed. Or, uh, the author of the prompts at least.)
>Around the 40-commit mark, manual commits became frequent
This matches my experience: some shiny (even sometimes impressive) greenfield demos, but dramatically less useful for maintaining a codebase, which for any successful product is 90% of the work.
I asked this in the other thread (no response, but I was a bit late)
How does anyone using AI like this have confidence that they aren't unintentionally plagiarizing code and violating the terms of whatever license it was released under?
For random personal projects I don't see it mattering that much. But if a large corp is releasing code like this, one would hope they've done some due diligence to show they haven't just stolen the code from some similar repo on GitHub, laundered through an LLM.
The only relevant section in the readme doesn't mention checking similar projects or libraries for common code:
> Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
> How does anyone using AI like this have confidence that they aren't unintentionally plagiarizing code and violating the terms of whatever license it was released under?
Most of the code generated by LLMs, and especially the code you actually keep from an agent, is mid, replacement-level, boring stuff. If you're not already building projects with LLMs, I think you need to start doing that first before you develop a strong take on this. From what I see in my own work, the code being generated is highly unlikely to be distinguishable. There is more of me and my prompts and decisions in the LLM code than there can possibly be defensible IPR from anybody else, unless the very notion of, like, wrapping a SQLite INSERT statement in Golang is defensible.
The best way I can explain the experience of working with an LLM agent right now is that it is like if every API in the world had a magic "examples" generator that always included whatever it was you were trying to do (so long as what you were trying to do was within the obvious remit of the library).
Safety in the shadow of giant tech companies. People were upset when Microsoft released Copilot trained on GitHub data, but nobody who cared could do anything about it, and nobody who could have done something about it cared, so it just became the new norm.
All of the big LLM vendors have a "copyright shield" indemnity clause for their paying customers - a guarantee that if you get sued over IP for output from their models their legal team will step in to fight on your behalf.
This is an excellent question that the AI-boosters always seem to dance around. Three replies already are saying “Nobody cares.” Until they do. I’d be willing to bet that some time in the near future, some big company is going to care a lot and that there will be a landmark lawsuit that significantly changes the LLM landscape. Regulation or a judge is going to eventually decide the extent to which someone can use AI to copy someone else’s IP, and it’s not going to be pretty.
I'm fairly confident that it's not just plagiarizing because I asked the LLM to implement a novel interface with unusual semantics. I then prompted for many specific fine-grain changes to implement features the way I wanted. It seems entirely implausible to me that there could exist prior art that happened to be structured exactly the way I requested.
Note that I came into this project believing that LLMs were plagiarism engines -- I was looking for that! I ended up concluding that this view was not consistent with the output I was actually seeing.
The consensus, right or wrong, is that LLM-produced code (unless repeated verbatim) is equivalent to you or me legitimately stating our novel understanding of mixed sources, some of which may be copyrighted.
The documentation angle is really good. I've noticed it with the mdc files and the llm.txt semi-standard. Documentation is often treated as just extra cost and a chore. Now, a good description of the project structure and good examples suddenly become something devs want ahead of time. Even if the reason is not perfect, I appreciate this shift; we'll all benefit from it.
Another way to phrase this is LLM-as-compiler and Python (or whatever) as an intermediate compiler artefact.
Finally, a true 6th generation programming language!
I've considered building a toy of this with really aggressive modularisation of the output code (eg. python) and a query-based caching system so that each module of code output only changes when the relevant part of the prompt or upsteam modules change (the generated code would be committed to source control like a lockfile).
I think that (+ some sort of WASM-encapsulated execution environment) would be one of the best ways to write one-off things like scripts, which don't need to incrementally get better and more robust over time in the way that ordinary code does.
This only works if the model and its context are immutable. None of us really control the models we use, so I'd be sceptical about reproducing the artifacts later.
I have documented my experience using an agent for a slightly different task -- upgrading a framework version. I had to abandon the work, but my learning has been similar to what is in the post.
If/when to commit prompts has been fascinating as we have been doing similarly to build Louie.ai. I now have several categories with different handling:
- Human reviewed: Code guidelines and prompt templates are essentially dev tool infra-as-code and need review
- Discarded: Individual prompt commands I write, and implementation-plan progress files the AI writes, both get trashed; they're even part of my .gitignore. They were kept by Cloudflare, but we don't keep these.
- Unreviewed: Claude Code does not do RAG in the usual sense, so it is on us to create guides for how we do things like use big frameworks. They are basically indexes for speeding up AI with less grepping + hallucinating across memory compactions. The AI reads and writes these, and we largely stay out of it.
There are weird cases I am still trying to figure out. Ex:
- a feature implementation might start with the AI coming up with the product spec, so having that maintained as the AI progresses, and committed, is a potentially useful artifact
- how prompt templates get used is helpful for their automated maintenance.
> Around the 40-commit mark, manual commits became frequent—styling, removing unused methods, the kind of housekeeping that coding models still struggle with. It's clear that AI generated >95% of the code, but human oversight was essential throughout.
But things like styling and unused code removal have been automated for a long time already, thanks to non-AI tools; assuming that the AI agent has access to those tools (e.g. assuming the agent can trigger a linter), then the engineer could have just included these steps in the prompts instead of running them manually.
EDIT - I still think there are aspects where AI is obviously lacking, I just think those specific examples are not among them
Speaking of which, something funny I've noticed when using agents with prettier in a pre-commit hook is that the logs occasionally include the model thanking "me" for cleaning up its code formatting.
>> what if we treated prompts as the actual source code?
And they probably will be. It looks like prompts have become the new higher-level coding language, the same way JavaScript is a human-friendly abstraction over lower-level languages like C, which are themselves a more accessible way to write assembly, which in turn stands in for the underlying binary code... I guess we've eventually reached the final step in the development chain, bridging the gap between hardware instructions and human language.
I was thinking that if you had a good enough verified mathematical model of your code using TLA+ or similar you could then use an LLM to generate your code in any language and be confident it is correct. This would be Declarative Programming. Instead of putting in a lot of work writing code that MIGHT do what you intend you put more work into creating the verified model and then the LLM generates code that will do what the model intends.
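Short of a full TLA+ model, the intent above can be approximated today with an executable reference model plus randomized differential checks: the trusted model states *what* must happen, and generated code is only accepted if it agrees. This toy sketch doesn't capture the temporal/state-machine properties TLA+ actually checks, and all names are illustrative:

```python
import random

def spec_sort(xs):
    """Executable stand-in for a verified model: a trusted, obviously
    correct (if naive) statement of what the code must do."""
    return sorted(xs)

def agrees_with_model(candidate, model, trials=1000):
    """Randomized differential check: the LLM-generated `candidate`
    must agree with the model on every random input before we trust it."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if candidate(list(xs)) != model(list(xs)):
            return False
    return True
```

Under this discipline, regenerating the implementation in a new language or with a new model is safe exactly to the extent that the model captures your intent.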
> Don't be afraid to get your hands dirty. Some bugs and styling issues are faster to fix manually than to prompt through. Knowing when to intervene is part of the craft.
This has been my experience as well. I've found it best to always run the CLI tool in the bottom pane of an IDE and not in a standalone terminal.
>Treat prompts as version-controlled assets. Including prompts in commit messages creates valuable context for future maintenance and debugging.
I think this is valuable data, but it is also out-of-distribution data. Prior to AI models writing code, this won't be present in the training set. Additional training will probably be needed to correlate better results with the new input stream, and also to learn that some of the records document its own unreliability, so it can develop a healthy scepticism of what it has said in the past.
There's a lot of talk about model collapse with models training purely on their own output, or AI slop infecting training data sets, but ultimately it is all data. Combined with a signal to say which bits were ultimately beneficial, it can all be put to use. Even the failures can provide a good counterfactual signal for contrastive learning.
I used almost 100% AI to build a SCUMM-like parser, interpreter, and engine (https://github.com/fpgaminer/scumm-rust). It was a fun workflow; I could generally focus on my usual work and just pop in occasionally to check on and direct the AI.
I used a combination of OpenAI's online Codex, and Claude Sonnet 4 in VSCode agent mode. It was nice that Codex was more automated and had an environment it could work in, but its thought-logs are terrible. Iteration was also slow because it takes a while for it to spin the environment up. And while you _can_ have multiple requests running at once, it usually doesn't make sense for a single, somewhat small project.
Sonnet 4's thoughts were much more coherent, and it was fun to watch it work and figure out problems. But there's something broken in VSCode right now that makes its ability to read console output inconsistent, which made things difficult.
The biggest issue I ran into is that both are set up to seek out and read only small parts of the code. While they're generally good at getting enough context, it does cause some degradation in quality. A frequent issue was replication of CSS styling between the Rust side of things (which creates all of the HTML elements) and the style.css side of things. Like it would be working on the Rust code and forget to check style.css, so it would just manually insert styles on the Rust side even though those elements were already styled on the style.css side.
Codex is also _terrible_ at formatting and will frequently muck things up, so it's mandatory to use it with an autoformatter and instructions to use it. Even with that, Codex will often say that it ran it, but didn't actually run it (or ran it somewhere in the middle instead of at the end) so its pull requests fail CI. Sonnet never seemed to have this issue and just used the prevailing style it saw in the files.
Now, when I say "almost 100% AI", it's maybe 99% because I did have to step in and do some edits myself for things that both failed at. In particular neither can see the actual game running, so they'd make weird mistakes with the design. (Yes, Sonnet in VS Code can see attached images, and potentially can see the DOM of vscode's built in browser, but the vision of all SOTA models is ass so it's effectively useless). I also stepped in once to do one major refactor. The AIs had decided on a very strange, messy, and buggy interpreter implementation at first.
You're talking ahead of the others in this thread, who do not understand how you got to what you're saying. I've been doing research in this area. You are not only correct, but the implications are staggering, and go further than what you have mentioned above. This is no cult, it is the reorganization of the economics of work.
You’re sounding like a religious zealot recruiting for a cult.
No, it is not possible to prompt every feature, and I suspect people who believe LLMs can accurately program anything in any language are frankly not solving any truly novel or interesting problems, because if they were they’d see the obvious cracks.
https://news.ycombinator.com/item?id=44159166
You would not do this because: unlike programming languages, natural languages are ambiguous and thus inadequate to fully specify software.
Regenerated code might behave differently, have different bugs (worst case), or not work at all (best case).
kace91|8 months ago
What sense does “getting rid of vulnerabilities by phasing out {dependency}” make, if the next generation of the code might not rely on the mentioned library at all? What does “improve performance of {method}” mean if the next generation used a fully different implementation?
It makes no sense whatsoever except for a vibecoders script that’s being extrapolated into a codebase.
pollinations|8 months ago
Prompts are in a sense what higher level programming languages were to assembly. Sure there is a crucial difference which is reproducibility. I could try and write down my thoughts why I think in the long run it won't be so problematic. I could be wrong of course.
I run https://pollinations.ai which serves over 4 million monthly active users quite reliably. It is mostly coded with AI. For about a year there has been no significant human commit. You can check the codebase. It's messy, but not more messy than my codebases were pre-LLMs.
I think prompts + tests in code will be the medium-term solution. Humans will be spending more time testing different architecture ideas and be involved in reviewing and larger changes that involve significant changes to the tests.
never_inline|8 months ago
tayo42|8 months ago
LLMs can be configured to have deterministic output too
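For what it's worth, the usual knobs are a fixed seed and temperature 0 (greedy decoding). A toy sampler shows why temperature 0 is deterministic regardless of the RNG; the logit values here are made up purely for illustration:

```python
import math
import random

def sample(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index. At temperature 0 this degenerates to argmax,
    so the RNG never matters and the choice is fully deterministic."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature: higher T flattens the distribution.
    weights = [math.exp(l / temperature) for l in logits]
    total = sum(weights)
    return rng.choices(range(len(logits)), weights=[w / total for w in weights])[0]
```

In practice vendor APIs expose this as a `temperature` parameter (and some offer a best-effort `seed`), though floating-point and batching effects on the provider side can still leak some nondeterminism.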
dragonwriter|8 months ago
There is definitely an argument for also committing prompts, but it makes no sense to only commit prompts.
7speter|8 months ago
paxys|8 months ago
renewiltord|8 months ago
Sevii|8 months ago
TechDebtDevin|8 months ago
Idk kinda different tho.
visarga|8 months ago
croes|8 months ago
mellosouls|8 months ago
Nobody commits the compiled code; this is the direction we are moving in, high level source code is the new assembly.
declan_roberts|8 months ago
The reason he keeps adjusting the prompts is because he knows how to program. He knows what it should look like.
It just blurs the line between engineer and tool.
spaceman_2020|8 months ago
Essentially, the field will get frozen where existing senior engineers will be able to utilize AI to outship traditional senior-junior teams, even as junior engineers fail to secure employment
I don’t think anything in this article counters this argument
latexr|8 months ago
I realise you meant it as “the engineer and their tool blend together”, but I read it like a funny insult: “that guy likes to think of himself as an engineer, but he’s a complete tool”.
visarga|8 months ago
Maybe journalists and bloggers angling for attention do it, prompt engineers are too aware of the limitations of prompting to do that.
tptacek|8 months ago
later
updated to clarify kentonv didn't write this article
thegrim33|8 months ago
starkparker|8 months ago
I guess that's where a big miss in understanding so much of the messaging about generative AI in coding happens for me, and why the Fly.io skepticism blog post irritated me so much as well.
It _is_ how collaboration with a person works, but when you have to fix the issues that the tool created, you aren't collaborating with a person, you're making up for a broken tool.
I can't think of any field where I'd be expected to not only put up with, but also celebrate, a tool that screwed up and required manual intervention so often.
The level of anthropomorphism that occurs in order to advocate on behalf of generative AI use leads to saying things like "it's how collaboration works" here, when I'd never say the same thing about the table saw in my woodshop, or even the relatively smart cruise control on my car.
Generative AI is still just a tool built by people following a design, and which purportedly makes work easier. But when my saw tears out cuts that I have to then sand or recut, or when my car slams on the brakes because it can't understand a bend in the road around a parking lane, I don't shrug and ascribe them human traits and blame myself for being frustrated over how they collaborate with me.
pontifier|8 months ago
hooverd|8 months ago
isaacremuant|8 months ago
I love the paradigm shift but hate when the hype is uninformed or dishonest or not treating it with an eye for quality.
eviks|8 months ago
So every single run will result in different non-reproducible implementation with unique bugs requiring manual expert interventions. How is this better?
SupremumLimit|8 months ago
Dylan16807|8 months ago
groby_b|8 months ago
LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.
It's not "techno-utopian determinism". It's a clearly visible trajectory.
Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.
The observation that strict prompt adherence plus prompt archival could shift how we program is both true, and it's a phenomenon we observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.
There's definitely valid criticism of the passage, and it's overly optimistic - in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all correct. That's both a more useful criticism and not tied to LLM improvements at all.
its-kostya|8 months ago
sumedh|8 months ago
Sevii|8 months ago
_pdp_|8 months ago
Why is this such a big deal? This library is not even that interesting. It is a very straightforward task that I expect most programmers would be able to pull off easily. 2/3 of the code is type interfaces and comments. The rest is a by-the-book implementation of a protocol that is not even that complex.
Please, there are some React JSX files in your code base with a lot more complexities and intricacies than this.
Has anyone even read the code at all?
JackSlateur|8 months ago
As you say, the code is not interesting, it deals with a well known topic
And it required lots of manpower to get done
tldr: this is a non-event disguised as incredible success. No doubt Cloudflare is making money with that AI crap, somehow.
thorum|8 months ago
dcre|8 months ago
maxemitchell|8 months ago
I'm still experimenting with it. I find it can't match style at all, and even with the manual editing it still "smells like AI" as you picked up. But, it also saves time.
My prompt was essentially "here are my old blog posts, here's my notes on reading a bunch of AI generated commits, help me condense this into a coherent article about the insights I learned"
Fischgericht|8 months ago
It is not my intention to hurt your feelings, but it sounds like you and/or the LLM are not really good at their job. Looking at programmer salaries and LLM energy costs, this appears to be a very very VERY expensive OAuth library.
Again: Not my intention to hurt any feelings, but the numbers really are shockingly bad.
kentonv|8 months ago
nojito|8 months ago
Here's their response
>It took me a few days to build the library with AI.
>I estimate it would have taken a few weeks, maybe months to write by hand.
>That said, this is a pretty ideal use case: implementing a well-known standard on a well-known platform with a clear API spec.
https://news.ycombinator.com/item?id=44160208
Lines of code per hour is a terrible metric to use. Additionally, it's far easier to critique code that's already written!
Fischgericht|8 months ago
moron4hire|8 months ago
kentonv|8 months ago
Turns out it feels very different than I expected. I really recommend trying it rather than assuming. There's no learning curve, you just install Claude Code and run it in your repo and ask it for things.
(I am the author of the code being discussed. Or, uh, the author of the prompts at least.)
Arainach|8 months ago
This matches my experience: some shiny (even sometimes impressive) greenfield demos but dramatically less useful maintaining a codebase - which for any successful product is 90% of the work.
IncreasePosts|8 months ago
How does anyone using AI like this have confidence that they aren't unintentionally plagiarizing code and violating the terms of whatever license it was released under?
For random personal projects I don't see it mattering that much. But if a large corp is releasing code like this, one would hope they've done some due diligence to confirm they haven't just stolen the code from some similar repo on GitHub, laundered through an LLM.
The only relevant section in the readme doesn't mention checking similar projects or libraries for common code:
> Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
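For what it's worth, one crude form of that due diligence is a fingerprint-overlap comparison against candidate upstream repos, in the spirit of plagiarism detectors like MOSS. A naive token-shingle sketch (nothing here reflects what Cloudflare actually ran):

```python
def shingles(code: str, k: int = 8) -> set:
    """Sliding windows of k tokens; shared windows suggest copied passages."""
    toks = code.split()
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of shingle sets: 0.0 = disjoint, 1.0 = identical."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1)
```

Real tools normalize identifiers and whitespace first so trivially renamed copies still match; this sketch would miss those.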
akdev1l|8 months ago
They don’t and no one cares
tptacek|8 months ago
The best way I can explain the experience of working with an LLM agent right now is that it is like if every API in the world had a magic "examples" generator that always included whatever it was you were trying to do (so long as what you were trying to do was within the obvious remit of the library).
saghm|8 months ago
simonw|8 months ago
ryandrake|8 months ago
kentonv|8 months ago
Note that I came into this project believing that LLMs were plagiarism engines -- I was looking for that! I ended up concluding that this view was not consistent with the output I was actually seeing.
unknown|8 months ago
[deleted]
cavisne|8 months ago
So direct copies like the ones you're talking about would be picked up.
For copying concepts from other libraries, seems like a problem with or without LLM's.
aryehof|8 months ago
throwawaysleep|8 months ago
Companies are satisfied with the idemnity provided by Microsoft.
viraptor|8 months ago
drodgers|8 months ago
Another way to phrase this is LLM-as-compiler and Python (or whatever) as an intermediate compiler artefact.
Finally, a true 6th generation programming language!
I've considered building a toy of this with really aggressive modularisation of the output code (e.g. Python) and a query-based caching system, so that each module of output code only changes when the relevant part of the prompt or upstream modules change (the generated code would be committed to source control like a lockfile).
I think that (+ some sort of WASM-encapsulated execution environment) would be one of the best ways to write one-off things like scripts, which don't need to incrementally get better and more robust over time in the way that ordinary code does.
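That invalidation rule fits in a few lines: each module's cache key hashes its own prompt plus the keys of its upstream modules, so a change anywhere upstream forces regeneration downstream while untouched modules are reused. All names here are hypothetical:

```python
import hashlib
from typing import Callable

def module_key(prompt: str, upstream_keys: list[str]) -> str:
    """Key changes iff this module's prompt or any upstream module changed."""
    h = hashlib.sha256(prompt.encode())
    for k in sorted(upstream_keys):  # order-independent dependency set
        h.update(k.encode())
    return h.hexdigest()

def build_module(prompt: str, upstream_keys: list[str],
                 cache: dict, generate: Callable[[str], str]):
    """Return (key, code), calling the model only on a cache miss."""
    key = module_key(prompt, upstream_keys)
    if key not in cache:
        cache[key] = generate(prompt)
    return key, cache[key]
```

Persisting `cache` to disk alongside the repo would give the lockfile-like behavior described above.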
sumedh|8 months ago
Karpathy already said English is the new programming language.
kookamamie|8 months ago
This only works if the model and its context are immutable. None of us really control the models we use, so I'd be sceptical about reproducing the artifacts later.
cosmok|8 months ago
https://www.trk7.com/blog/ai-agents-for-coding-promise-vs-re...
lmeyerov|8 months ago
- Human reviewed: Code guidelines and prompt templates are essentially dev tool infra-as-code and need review
- Discarded: Individual prompt commands I write, and implementation-plan progress files the AI writes, both get trashed and are even part of my .gitignore. They were kept by Cloudflare, but we don't keep these.
- Unreviewed: Claude Code does not do RAG in the usual sense, so it is on us to create guides for how we do things like use big frameworks. They are basically indexes for speeding up AI with less grepping + hallucinating across memory compactions. The AI reads and writes these, and we largely stay out of it.
There are weird cases I am still trying to figure out. Ex:
- feature impl might start with an AI coming up with the product spec, so having that maintained as the AI progresses and committed in is a potentially useful artifact
- how prompt templates get used is helpful for their automated maintenance.
mastazi|8 months ago
But things like styling and unused code removal have been automated for a long time already, thanks to non-AI tools; assuming that the AI agent has access to those tools (e.g. assuming the agent can trigger a linter), then the engineer could have just included these steps in the prompts instead of running them manually.
EDIT - I still think there are aspects where AI is obviously lacking, I just think those specific examples are not among them
buu700|8 months ago
axi0m|8 months ago
And they probably will be. Looks like prompts have become the new higher-level coding language, the same way a language like JavaScript is a human-friendly abstraction over lower-level languages like C, which is itself a more accessible way to write assembly, and the same goes for the underlying binary code... I guess we eventually reached the final step in the development chain, bridging the gap between hardware instructions and human language.
dgb23|8 months ago
UltraSane|8 months ago
GPerson|8 months ago
never_inline|8 months ago
This has been my experience as well: always run the CLI tool in the bottom pane of an IDE, not in a standalone terminal.
unknown|8 months ago
[deleted]
unknown|8 months ago
[deleted]
ianks|8 months ago
Lerc|8 months ago
I think this is valuable data, but it is also out-of-distribution data. Before AI models were writing code, this kind of data wasn't present in the training set. Additional training will probably be needed to correlate better results with the new input stream, and also to learn that some of the records reflect its own unreliability, so it can develop a healthy scepticism of what it has said in the past.
There's a lot of talk about model collapse with models training purely on their own output, or AI slop infecting training data sets, but ultimately it is all data. Combined with a signal to say which bits were ultimately beneficial, it can all be put to use. Even the failures can provide a good counterfactual signal for constrastive learning.
iandanforth|8 months ago
https://en.wikipedia.org/wiki/Literate_programming
fpgaminer|8 months ago
I used a combination of OpenAI's online Codex, and Claude Sonnet 4 in VSCode agent mode. It was nice that Codex was more automated and had an environment it could work in, but its thought-logs are terrible. Iteration was also slow because it takes a while for it to spin the environment up. And while you _can_ have multiple requests running at once, it usually doesn't make sense for a single, somewhat small project.
Sonnet 4's thoughts were much more coherent, and it was fun to watch it work and figure out problems. But there's something broken in VSCode right now that makes its ability to read console output inconsistent, which made things difficult.
The biggest issue I ran into is that both are set up to seek out and read only small parts of the code. While they're generally good at getting enough context, it does cause some degradation in quality. A frequent issue was replication of CSS styling between the Rust side of things (which creates all of the HTML elements) and the style.css side of things. Like it would be working on the Rust code and forget to check style.css, so it would just manually insert styles on the Rust side even though those elements were already styled on the style.css side.
Codex is also _terrible_ at formatting and will frequently muck things up, so it's mandatory to use it with an autoformatter and instructions to use it. Even with that, Codex will often say that it ran it, but didn't actually run it (or ran it somewhere in the middle instead of at the end) so its pull requests fail CI. Sonnet never seemed to have this issue and just used the prevailing style it saw in the files.
Now, when I say "almost 100% AI", it's maybe 99% because I did have to step in and do some edits myself for things that both failed at. In particular neither can see the actual game running, so they'd make weird mistakes with the design. (Yes, Sonnet in VS Code can see attached images, and potentially can see the DOM of vscode's built in browser, but the vision of all SOTA models is ass so it's effectively useless). I also stepped in once to do one major refactor. The AIs had decided on a very strange, messy, and buggy interpreter implementation at first.
brador|8 months ago
Take note - there is no limit. Every feature you or the AI can prompt can be generated.
Imagine if you were immortal and given unlimited storage. Imagine what you could create.
That’s a prompt away.
Even now you’re still restricting your thinking to the old ways.
_lex|8 months ago
latexr|8 months ago
No, it is not possible to prompt every feature, and I suspect people who believe LLMs can accurately program anything in any language are frankly not solving any truly novel or interesting problems, because if they were they’d see the obvious cracks.
politelemon|8 months ago
Currently, it's 6 prompts away, 5 of which are me guiding the LLM to output the answer that I already have in mind.
zeofig|8 months ago
E = MC^2 + AI
A new equation for physics?
The potential of AI is unlimited.