tolmasky|5 months ago
This is in the general case. But with LLMs, the entire selling point is specifically offloading "reasoning" to them. That is quite literally what they are selling you. So with LLMs, you can swap out "almost certain" in the above rule for "absolutely certain without a shadow of a doubt". This isn't even hypothetical: we have experimental evidence that LLMs cause people to think/reason less. So you are at best already starting at a deficit.
But more importantly, this makes the entire premise of using LLMs make no sense (at least from a marketing perspective). What good is a thinking machine if I need to verify it? Especially when you are telling me that it will be a "super reasoning" machine soon. Do I need a human "super verifier" to match?

In fact, that's not even a tomorrow problem, that is a today problem: LLMs are quite literally advertised to me as a "PhD in my pocket". I don't have a PhD. Most people would find the idea of me "verifying the work of human PhDs" to be quite silly, so how does it make any sense that I am in any way qualified to verify my robo-PhD? I pay for it precisely because it knows more than I do! Do I now need to hire a human PhD to verify my robo-PhD? Short of that, is it the case that only human PhDs are qualified to use robo-PhDs? In other words, should LLMs exclusively be used for things the operator already knows how to do? That seems weird. It's like a Magic 8 Ball that only answers questions you already know the answer to.

Hilariously, you could even find someone reaching the conclusion of "well, sure, a curl expert should verify the patch I am submitting to curl. That's what submitting the patch accomplishes! The experts who work on curl will verify it! Who better to do it than them?" And now we've come full circle!
To be clear, each of these questions has plenty of counter-points/workarounds/etc. The point is not to present some philosophical gotcha argument against LLM use. The point rather is to demonstrate the fundamental mismatch between the value-proposition of LLMs and their theoretical "correct use", and thus demonstrate why it is astronomically unlikely for them to ever be used correctly.
rhdunn|5 months ago
I use LLMs in three ways:

1. a better autocomplete -- here the models can make mistakes, but on balance I've found this useful, especially when constructing tests, writing output in a structured format, etc.;
2. a better search/query tool -- I've found answers by being able to describe what I'm trying to do, whereas with a traditional search I have to know the right keywords to try. I can then go to the documentation or search if I need additional help/information;
3. an assistant to bounce ideas off -- this can be useful when you are not familiar with the APIs or configuration. It still requires testing the code, seeing what works, seeing what doesn't work. Here, I treat it in the same way as reading a blog post on a topic -- the post may be outdated, may contain issues, or may not be quite what I want. However, it can have enough information for me to get the answer I need -- e.g. a particular method, after which I can consult docs (such as documentation comments on the APIs). Or it lets me know what to search for on Google, etc.
In other words, I use LLMs as part of the process like with going to a search engine, stackoverflow, etc.
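As a sketch of the kind of repetitive, table-driven test scaffolding where autocomplete-style suggestions tend to shine (the `slugify` function and its cases are invented for illustration, not something from this thread):

```python
# Table-driven tests like this are where autocomplete helps most:
# after the first row, the pattern is easy for a model to predict.
# `slugify` is an invented example function.
def slugify(title: str) -> str:
    # Lowercase the title and join whitespace-separated words with hyphens.
    return "-".join(title.lower().split())

CASES = [
    ("Hello World", "hello-world"),
    ("  Leading and trailing  ", "leading-and-trailing"),
    ("Already-Slugged Title", "already-slugged-title"),
]

for title, expected in CASES:
    assert slugify(title) == expected
```

After the first tuple in `CASES`, a completion engine usually suggests the remaining rows and the assertion loop almost verbatim, which is exactly the "constructing tests" win described above.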
Sohcahtoa82|5 months ago
This is 100% what I use GitHub Copilot for.
I type a function name and the AI already knows what I'm going to pass it. Sometimes I just type "somevar =" and it instantly correctly guesses the function, too, and even what I'm going to do with the data afterwards.
I've had instances where I just type a comment with a sentence of what the code is about to do, and it'll put up 10 lines of code to do it, almost exactly matching what I was going to type.
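A hypothetical illustration of that comment-driven flow (the function name and body here are invented for this example, not actual Copilot output): you type only the comment, and the completion fills in roughly the body below.

```python
# Hypothetical comment-driven completion: the developer types only the
# comment line; the assistant suggests the rest. All names are invented.
def average_word_length(text: str) -> float:
    # Compute the average length of the words in text, stripping common
    # punctuation; return 0.0 for empty input.
    words = [w.strip(".,!?;:") for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

Whether the suggested body matches your intent still has to be checked, but for small, self-describing functions like this it often does.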
The vibe coders give AI-code generation a bad name. Is it perfect? Of course not. It gets it wrong at least half the time. But I'm skilled enough to know when it's wrong in nearly an instant.
sothatsit|5 months ago
LLMs are pretty consistent about which types of tasks they are good at, and which they are bad at. That means people can learn when to use them, and when to avoid them. You really don't have to be so black-and-white about it. And if you are checking the LLM's results, you have nothing to worry about.
Needing to verify the results does not negate the time savings either when verification is much quicker than doing a task from scratch.
My code is definitely of higher quality now that I have GPT-5 Pro review all my changes, and then I review my code myself as well. It seems obvious to me that if you care, LLMs can help you produce better code. As always, it is only people who are lazy who suffer. If you care about producing great code, then LLMs are a brilliant tool to help you with just that, in less time, by helping with research, planning, and review.
tolmasky|5 months ago
The problem that OP is presenting is that, unlike in your own use, the verification burden from this "open source" usage is not taken on by the "contributors", but instead "externalized" to maintainers. This does not result in the same "linear" experience you have; their experience is asymmetric, as they are now being flooded with a bunch of PRs that (at least currently) are harder to review than human submissions. Not to mention that, also unlike your situation, they have no means to "choose" not to use LLMs if they for whatever reason discover they aren't a good fit for their project. If you see something isn't a good fit, boom, you can just say "OK, I guess LLMs aren't ready for this yet." That's not a power maintainers have. The PRs will keep coming as a function of the ease of creating them, not as a function of their utility. Thus the verification burden does not scale with the maintainer's usage. It scales with the sum of everyone who has decided they can ask an LLM to go "help" you. That number is both larger and out of their control.
The main point of my comment was to say that this situation is not only to be expected, but IMO essential and inseparable from this kind of use, for reasons that actually follow directly from your post. When you are working on your own project, it is totally reasonable to treat the LLM operator as qualified to verify the LLM's outputs. But the opposite is true when you are applying it to someone else's project.
> Needing to verify the results does not negate the time savings either when verification is much quicker than doing a task from scratch.
This is of course only true because of your existing familiarity with the project you are working on. This is not a universal property of contributions. It is not "trivial" for me to verify a generated patch in a project I don't understand, for reasons ranging from things as simple as not knowing the code contribution guidelines (who am I to know whether I am even following the style guidelines?) to things as complicated as not being familiar with the programming language the project is written in.
> And if you are checking the LLM's results, you have nothing to worry about.
Precisely. This is the crux of the issue -- I am saying that in the contribution case, it's not even about whether you are checking the results, it's that you arguably can't meaningfully check the results (unless, of course, you essentially put in nearly the same amount of work as just writing it from scratch).
It is tempting to say "But isn't this orthogonal to LLMs? Isn't this also the case with submitting PRs you created yourself?" No! It is qualitatively different. Anyone who has ever submitted a meaningful patch to a project they've never worked on before has had the experience of having to familiarize themselves with the relevant code in order to create said patch. The mere act of writing the fix organically "bootstraps" you into developing expertise in the code. You will if nothing else develop an opinion on the fix you chose to implement, and thus be capable of discussing it after you've submitted it. You, the PR submitter, will be worthwhile to engage with and thus invest time in.

I am aware that we can trivially construct hypothetical systems where AI agents are participating in PR discussions and develop something akin to a long term "memory" or "opinion" -- but we can talk about that experience if and when it ever comes into being, because that is not the current lived experience of maintainers. It's just a deluge of low quality one-way spam. Even the corporations that are specifically trying to implement this experience just for their own internal processes are not particularly... what's a nice way to put this, "satisfying" to work with, and that is for a much more constrained environment, vs. "suggesting valuable fixes to any and all projects".