How are people collaborating on code when using AI tools to generate patches?
We hold code review dear as a tool to make sure more than one set of eyeballs has been over a change before it goes into production, and more than one person has the context behind the code to be able to fix it in future.
As model-generated code becomes the norm, I’m seeing code from junior engineers that they haven’t read and possibly don’t understand. For example, one Python script calling another using exec instead of importing it as a module, or code that reimplements something already available as a very common part of the standard library.
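For anyone who hasn’t hit this one: exec-ing another script runs its text in your current namespace, while importing gives you an actual module object. A minimal sketch of the difference — the helper file and its contents here are hypothetical, invented for illustration:

```python
import importlib.util
import os
import tempfile
import textwrap

# Hypothetical stand-in for the generated "helper script".
helper_src = textwrap.dedent("""
    def greet(name):
        return "hello, " + name
""")

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "helper.py")
with open(path, "w") as f:
    f.write(helper_src)

# Anti-pattern: exec dumps the file's names into a dict. There is no
# module object, no reload semantics, and nothing for tooling to analyze.
ns = {}
with open(path) as f:
    exec(f.read(), ns)
via_exec = ns["greet"]("exec")

# Better: load it as a real module (this is what a plain `import helper`
# does when the file is on sys.path).
spec = importlib.util.spec_from_file_location("helper", path)
helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(helper)
via_import = helper.greet("import")
```

Both calls return the same string, but only the second gives you something a debugger, type checker, or reviewer can reason about.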
In such cases, are we asking people to mark their code as auto generated? Should we review their prompts instead of the code? Should we require the prompt to code step be deterministic? Should we see their entire prompt context and not just the prompt they used to build the finished patch?
I feel like a lot of the value of code review is to bring junior engineers up to higher levels. To that extent each review feels like an end of week school test, and I’m getting handed plagiarised AI slop to mark instead of something that maps properly to what the student does or does not know.
Pair programming is another great teaching tool. Soon, it might be the only one left.
AI is a tool, not a solution. If someone uses it to write code they don't understand and that is flawed, that should never pass code review.
If your AI generated code passes code review without any questions asked and without any hints that it's AI generated (or AI was used in some way), then it doesn't matter that it was used.
The person submitting the code is still responsible, and the reviewer is equally responsible. You typically have tests to make sure it behaved correctly, too.
I do believe my code review of a junior developer submitting "AI slop without reviewing or understanding it" would be fairly blunt:
"If you are just taking AI generated content verbatim without code reviewing it and understanding it, you are providing no value over just having the AI do the work directly. If you are providing no value over AI doing your job, the company can replace you with AI."
I'm wondering if the junior employee just isn't educated enough to be using the AI, like they don't know Python at all so they don't understand the difference between "exec" and "import"? But I'm also inclined to think that they should be using it to learn, if the goal of the company and the employee is to move from junior to mid to senior. Like ask another model "can you improve this code", "does this code make sense", "can you explain this code to me", and anything in the code you don't understand, research and learn.
But there are employees who are just going to grind it out, doing "just enough" to get something working, and it can at times be hard to figure out why they are there and why they aren't working towards more. It could be something as complex as physical or mental struggles that put "just enough to get it working" at the far edge of their abilities; maybe that literally is the most they can do.
How do you learn that and adjust so they can excel within their constraints?
They are collaborating by making sure they’ve read and understood the code that was generated, and usually edited it, too.
Sometimes, when e.g. making an ad-hoc visualization or debugging tool, one may emphasize that they just had AI generate it and didn’t read into the details. Occasionally it makes sense.
But if someone is making PRs without disclosing their lack of understanding of them because of generating most of it and not making sure everything is correct, that seems like a cultural issue, primarily.
But I suppose you can start having such people walk you through their PRs, which should at least reveal their lack of understanding, if it’s in fact the case.
Point is, this is imo not an experience inherent to LLM usage, and there’s also not much point reviewing LLM prompts because of the strong non-determinism involved.
Anecdotally, I frequently dig into source code where the stack trace points me, look up functions, debug in local envs, etc. meanwhile my coworker is working on the same problem and talking to an LLM and I often get to the solution before he does. I don't think I've had my John Henry moment quite yet.
> But I just checked and, unsurprisingly, 4o seems to do reasonably well at generating Semgrep rules? Like: I have no idea if this rule is actually any good. But it looks like a Semgrep rule?
I don't know about Semgrep syntax, but the chat it generated is bad in at least a couple of other ways. E.g. its "how to fix" instruction is wrong:
    if let Some(Load::Local(load)) = self.load.read().get(...) {
        // do a bunch of stuff with `load`
    } else {
        drop(self.load.read()); // Explicitly drop before taking write lock
        let mut w = self.load.write();
        self.init_for(&w);
    }
That actually acquires and then drops a second read lock. It doesn't solve the problem that the first read lock is still active and thus the write lock will deadlock.
Speaking of which, acquiring two read locks from the same thread can also deadlock, as shown in the "Potential deadlock example" at <https://doc.rust-lang.org/std/sync/struct.RwLock.html>. It can happen in the code above (one line before the other deadlock). It can also slip through their rule because they're incorrectly looking for just a write lock in the else block.
I've been playing with AI code generation tools like everyone else, and they are okay as autocomplete, but I don't see them as trustworthy. For a while I thought I just wasn't prompting well enough, but when other people show me their AI output, I can see it's wrong, so maybe I'm just looking more closely?
> But I just checked and, unsurprisingly, 4o seems to do reasonably well at generating Semgrep rules? Like: I have no idea if this rule is actually any good. But it looks like a Semgrep rule?
This is the thing with LLMs. When you’re not an expert, the output always looks incredible.
It’s similar to the fluency paradox — if you’re not native in a language, anyone you hear speak it at a higher level than yourself appears to be fluent to you. Even if for example they’re actually just a beginner.
The problem with LLMs is that they’re very good at appearing to speak “a language” at a higher level than you, even if they totally aren’t.
I agree completely that an LLM's first attempt to write a Semgrep rule is likely as not to be horseshit. That's true of everything an LLM generates. But I'm talking about closed-loop LLM code generation. Unlike legal arguments and medical diagnoses, you can hook an LLM up to an execution environment and let it see what happens when the code it generates runs. It then iterates, until it has something that works.
Which, when you think about it, is how a lot of human-generated code gets written too.
So my thesis here does not depend on LLMs getting things right the first time, or without assistance.
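That loop is easy to sketch. In the toy version below the model is replaced by a hard-coded stand-in (fake_model is invented for illustration), but the generate/run/observe/retry structure is the point: the first draft fails with a NameError, the error text is fed back, and the retry passes.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str) -> tuple[bool, str]:
    """Write candidate code to a temp file, run it, report pass/fail + stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=30)
        return proc.returncode == 0, proc.stderr
    finally:
        os.unlink(path)

def fake_model(feedback: str) -> str:
    """Stand-in for the LLM. The first draft forgets to define add();
    given the NameError feedback, the retry fixes it."""
    if "NameError" in feedback:
        return "def add(a, b):\n    return a + b\n\nassert add(2, 2) == 4\n"
    return "assert add(2, 2) == 4\n"

ok, feedback = False, ""
attempts_used = 0
for attempt in range(3):
    code = fake_model(feedback)
    attempts_used = attempt + 1
    ok, feedback = run_candidate(code)
    if ok:
        break
```

The real version swaps fake_model for an actual model call and the assert for your test suite; the control flow stays this simple.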
DSLs like Semgrep are one of my top use-cases for LLMs generally.
It used to be that tools like Semgrep and jq and Tree Sitter and zsh all required you to learn quite a bit of syntax before you could start using them productively.
Thanks to LLMs you can focus on learning what they can do for you without also having to learn the fiddly syntax.
Over and over again I have witnessed people drive themselves in circles downstream of a refusal to step out of the “make the LLM fix the issue for me” loop.
At some point the syntax and the specifics matter! * and + have different meanings in a regex. Overly-specified LLM output is worth trimming down to what your problem actually needed.
I appreciate LLM output being able to draft sketches of results (and yes, a lot of the time, getting exactly the right result). And it’s great as a learning tool (especially if you’re diligent in the trust + verify department). But I worry that people are not taking opportunities to sit down and actually use the output as the sketch, and to insert the sort of precision that comes from the “infinite context” of the human working on the problem. Devs can’t just decide to opt out of getting into the details, IMO.
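The * vs + point above is a two-line check in Python’s re module: * means “zero or more”, + means “one or more”.

```python
import re

star_match = re.fullmatch(r"ab*", "a")       # matches: * allows zero b's
plus_match = re.fullmatch(r"ab+", "a")       # no match: + needs at least one b
plus_match2 = re.fullmatch(r"ab+", "abbb")   # matches: one or more b's
```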
I kind of agree. I’ve had very mixed experiences with LLMs and DSLs.
I was writing an NRQL query (New Relic’s log query language) and wanted to essentially do a GROUP BY date_trunc. It kept giving me options that I was eager for, and then the functions it gave me just didn’t exist. After like four back and forths of me telling it that the functions it was giving me didn’t exist - it worked.
Then I needed it to split on the second forward slash of a string and just give me the first piece. It gave me the foundation to fill in the gaps of the function, but the LLM never got it.
In that case, I assume it’s a lack of training data since NRQL is pretty niche.
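For comparison, the “first piece before the second slash” task is a one-liner in Python; the example string below is invented. The second argument to split caps the number of splits, so everything after the second slash stays intact.

```python
s = "service/checkout/us-east-1/pod-7"  # hypothetical input

# Split at most twice, keep the first two segments, rejoin them.
first_piece = "/".join(s.split("/", 2)[:2])
```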
I catch myself swinging from “holy shit this is impressive” to “wow this sucks” and back regularly for code.
Yes, the same applies to many niche syntaxes: InfluxDB's Flux language (I was finally able to design my dream Grafana dashboards!) and AutoHotkey (AHK) for Windows automation are only two examples.
I had an absolutely terrible time today trying to get ChatGPT to write me a very simple awk one-liner. It “thought” I was specifying a much more complicated requirement than I actually was.
Exactly the same for me: the major breakout success was having ChatGPT teach me how to use pandas. It really shone as an interactive manual with worked examples.
Taking the training wheels off isn’t something I’ve really nailed though: for example, I keep coming back with the same questions about how to melt and pivot. I can self diagnose as this showing I didn’t really spend enough time understanding the answers the first time around.
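For anyone else stuck on the same question, a worked melt/pivot round trip — the DataFrame contents here are made up:

```python
import pandas as pd

wide = pd.DataFrame({"city": ["Oslo", "Lima"],
                     "jan": [1.0, 20.0],
                     "feb": [2.0, 21.0]})

# melt: wide -> long ("unpivot"): one row per (city, month) pair
long = wide.melt(id_vars="city", var_name="month", value_name="temp")

# pivot: long -> wide again, months back as columns
wide_again = long.pivot(index="city", columns="month", values="temp")
```

melt and pivot are inverses here: two cities times two months gives four long rows, and pivoting restores the original values.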
A short snippet from https://news.ycombinator.com/item?id=5397797 (the whole thing is very funny, and interestingly was written in 2013, long before the modern AI craze):
By now I had started moving on to doing my own consulting work, but I never disabled the hill-climbing algorithm. I'd closed and forgotten about the Amazon account, had no idea what the password to the free vps was anymore, and simply appreciated the free money.
But there was a time bomb. That hill climbing algorithm would fudge variables left and right. To avoid local maxima, it would sometimes try something very different.
One day it decided to stop paying me.
Its reviews did not suffer. Its balance increased.
So it said, great change, let's keep it. It now has over $28,000 of my money, is not answering my mail, and we have been locked in an equity battle over the past 18 months.
The worst part is that I still have to clean up all its answers to protect our reputation. Who's running who anyway?
I think an even more interesting use case for semgrep, and also LSP or something like LSP, is querying for exactly what an AI needs to know to fix a specific problem.
Unlike humans, LLMs have no memory, so they can't just learn where things are in the code by remembering the work they did in the past. In a way, they need to re-learn the relevant parts of your codebase from scratch on every change, always keeping context window limitations in mind.
Humans learn by scrolling and clicking around and remembering what's important to them; LLMs can't do that. We try to give them autogenerated codebase maps and tools that can inject specific files into the context window, but that doesn't seem to be nearly enough. Semantic queries look like a much better idea.
I thought you couldn't really teach an LLM how to use something like that effectively, as that's not how humans work and there's no data to train on, but the recent breakthroughs with RL made me change my mind.
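A toy version of such a semantic query, using Python’s ast module on an invented snippet: answering the narrow question an agent actually has (“what’s defined here, and what does it call?”) without shipping the whole file into the context window.

```python
import ast

# Invented source fragment standing in for a real codebase file.
source = """
def init_for(w):
    pass

def reload():
    init_for(None)
"""

tree = ast.parse(source)

# "What functions are defined?"
defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}

# "What functions get called by name?"
called = {n.func.id for n in ast.walk(tree)
          if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
```

Real tools would run this over the whole repo and answer cross-file questions, but the shape is the same: return a small, precise answer instead of raw text.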
OK, hear me out. The future isn’t o4 or whatever. The future is when everyone, every language, every tool, every single library and codebase can train their own custom model tailored to their needs and acting as a smart documentation which you can tell what you want to do and it will tell you how to do it.
People have been trying with fine tuning, RAG, using the context window. That’s not enough. The model needs to be trained on countless examples of question-answer for this particular area of knowledge, starting from a base model aware of comp sci concepts and language (just English is fine). This implies that such examples have to be created by humans - each such community will need its own "Stack Overflow".
Smaller, specialized models are the future of productivity. But of course that can’t be monetized, right? Well, the technology just needs to get cheaper so that people can just afford to train such models themselves. That’s the next major breakthrough. Could be anyway.
I’ve built a solution that takes you most of the way there, using Semgrep’s SARIF output and prompted LLMs to help prioritize triage: https://github.com/247arjun/ai-secure-code-review
We’ve used this for the past year at Microsoft to help prioritize the “most likely interesting” 5% of a large set of results for human triage. It works quite well…
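A rough sketch of that pipeline shape. The field names follow the SARIF 2.1.0 schema, but the findings are invented, and score() is a placeholder standing where the LLM prioritization step would go:

```python
import json

# Invented SARIF document (two findings) using real SARIF 2.1.0 fields.
sarif = json.loads("""
{"runs": [{"results": [
  {"ruleId": "rust-rwlock-deadlock", "level": "error",
   "message": {"text": "read guard may still be held"}},
  {"ruleId": "style-long-line", "level": "note",
   "message": {"text": "line exceeds 100 chars"}}
]}]}
""")

def score(result):
    # Placeholder: in the real pipeline an LLM estimates how interesting
    # each finding is; here we just rank by severity level.
    return {"error": 2, "warning": 1, "note": 0}.get(result.get("level"), 0)

results = sarif["runs"][0]["results"]
top_ids = [r["ruleId"] for r in sorted(results, key=score, reverse=True)[:1]]
```

The value is in the slicing: humans see the top fraction, not the full result set.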
I've been trying to do something similar to create CodeQL queries recently, and found that ChatGPT is completely unable to create even simple queries. I assume it's because training is based on an old version of the query language, or the training data is just completely missing, but feeding back the rules and the errors they produce when run has been a complete failure for me.
Take a large-context frontier model. Upload 200k tokens of code for each query. Ask about what code pattern you want it to highlight for you. Works better than any other system, but costs tokens on API services.
So the idea is that LLM1 looks at the output of LLM0 and builds a new set of constraints, and then LLM0 has to try again, rinse and repeat? (LLM0 could be the same as LLM1, and I think it is in the article?)
I think the author is missing one part about cursor, aider, etc.
Out of the box it is decent.
Watching on YouTube even the basic optimizations developers do prior to starting a project puts the experience and consistency at a far higher level.
Maybe this casual surface-level testing, if I’m not misreading, is why so many tech people are missing what tools like Cursor, Aider, etc. are doing.
> What interests me is this: it seems obvious that we’re going to do more and more “closed-loop” LLM agent code generation stuff. By “closed loop”, I mean that the thingy that generates code is going to get to run the code and watch what happens when it’s interacted with.
Well, at least we have a credible pathway into the Terminator or Matrix universes now...
mycall | 1 year ago:
Isn't this what Google Titan is trying to achieve?
ksec | 1 year ago:
>"We wrote all sorts of stuff this week and this is what gets to the front page. :P"
And how they write content specifically for HN [2].
[1] https://news.ycombinator.com/item?id=43053985
[2] https://fly.io/blog/a-blog-if-kept/
waynenilsen | 1 year ago:
Not there yet, but it is inevitable.
jasonjmcghee | 1 year ago:
https://semgrep.dev/docs/writing-rules/autofix
Seems like the natural thing to do for cases that support it.
awinter-py | 1 year ago:
the point that a unit of code is a thing that is maintained, rather than a thing that is generated once, is where codegen has always lost me
(both AI codegen and ruby-on-rails boilerplate generators)
iterative improvement, including factoring useful things out to standard libraries, is where it's at