This is looking at the wrong metric. I'm not expecting it to be 100% correct when I use it. I expect it to get me in the ballpark faster than I would have on my own. And then I can take it from there.
Sometimes that means I have a follow on question & iterate from there. That's fine too.
> I expect it to get me in the ballpark faster than I would have on my own.
This is great if you are an experienced developer who can tell the difference between "in the ballpark" and fixable and "in the ballpark" but hopeless.
That’s how you, an experienced programmer, use it.
What does this do to beginners that are just learning to program? Is this helping them by forcing them to become critical reviewers or harming them by being a bad role model?
> What's especially troubling is that many human programmers seem to prefer the ChatGPT answers. The Purdue researchers polled 12 programmers — admittedly a small sample size — and found they preferred ChatGPT at a rate of 35 percent and didn't catch AI-generated mistakes at 39 percent.
Absolutely, it's especially useful when it suggests which libraries to use if you're not familiar with the ecosystem. Or writing boilerplate for popular frameworks, step by step. It can, to a degree, repair errors if you paste it the output.
Every time I’ve tried it, it’s sent me to completely the wrong ballpark, and after a while whacking at its solution I end up dumping it entirely and doing it myself.
Exactly! Just because part of the answer isn't right, doesn't mean the entire answer is useless. It's much faster than only doing a Google search when working out the solution to a problem.
This. For inexperienced developers, my advice is this: don't consume answers you don't understand. If you can't read it, interrogate it, and find a question at your own level. When you accept its emission, you're taking responsibility for it, and beyond a certain low level, it can't do your thinking for you.
I agree this is the correct way to use it, and it is incredibly useful in that case, but I think a study like this is valuable in the face of all the hype/fud about how AI Agents can program entire complex applications with just a few prompts and/or will replace software engineers shortly.
> For each of the 517 SO [Stack Overflow] questions, the first two authors manually used the SO question’s title, body, and tags to form one question prompt and fed that to the free version of ChatGPT, which is based on GPT-3.5. We chose the free version of ChatGPT because it captures the majority of the target population of this work. Since the target population of this research is not only industry developers but also programmers of all levels, including students and freelancers around the world, the free version of ChatGPT has significantly more users than the paid version, which costs a monthly rate of 20 US dollars.
Note that GPT-4o is now also freely available, although with usage caps. Allegedly the limit is one fifth the turns of paid Plus users, who are said to be limited to 80 turns every three hours. Which would mean 16 free GPT-4o turns per 3 hours. Though there is some indication the limits are currently somewhat lower in practice and overall in flux.
In any case, GPT-4o answers should be far more competent than those by GPT-3.5, so the study is already somewhat outdated.
I use ChatGPT for coding constantly and the 52% error rate seems about right to me. I manually approve every single line of code that ChatGPT generates for me. If I copy-paste 120 lines of code that ChatGPT has generated for me directly into my app, that is because I have gone over all 120 lines with a fine-toothed comb, and probably iterated 3-4 times already. I constantly ask ChatGPT to think about the same question, but this time with an additional caveat.
I find ChatGPT more useful from a software architecture point of view and from a trivial code point of view, and least useful at the mid-range stuff.
It can write you a great regex (make sure you double-check it) and it can explain a lot of high-level concepts in insightful ways, but it has no theory of mind -- so it never responds with "It doesn't make sense to ask me that question -- what are you really trying to achieve here?", which is the kind of thing an actually intelligent software engineer might say from time to time.
I scanned the paper and it doesn't mention what model they were using within ChatGPT. If it was 3.5 turbo, then these results are already meaningless. GPT-4 and 4o are much more accurate.
I just used GPT-4o to refactor 50 files from React classes to React function components and it did so almost perfectly every time. Some of these classes were as long as 500 LOC.
I'd guess that React code is a lot easier for an LLM, since it's a frequent occurrence in its training dataset and frontend code tends to be repetitive and full of boilerplate.
I believe that AI will be a perfect programmer in the future for all niche areas. My point is that frontend will probably be the first niche to be mastered.
Not meaningless when 99% of the people use the free version which apparently has license to lie to them far more than the paid version. What a fucking sick joke, pay up or we lie to you even more.
This is way better than I thought. A follow-up question: for the times that it is wrong, how wrong is it? In other words, is the wrong answer complete rubbish, or can it be a starting point towards the actual correct answer?
ChatGPT was released one and a half years ago. It basically duct-tapes code together from a probability model; the fact that 52% of its coding answers are correct is amazing.
I'm still on the fence about LLMs for coding, but from talking to friends, they primarily use it to define a skeleton of code or generate code that they can then study and restructure. I don't see many developers accepting the generated code without review.
This workflow is very close to being possible. I gave it a try last year by adding exceptions and test output to clipboard automatically (requires custom code for your stack). The context has increased considerably since my last attempt and agents are now a thing (ReAct loop, etc).
- Integration to your runtime: functions called by the LLM can run your tests, linters, compiler, etc
- Agents: the LLM can define what to do, execute a few tasks, and keep going with more tasks generated by itself
- Codebase/filesystem access: could be RAG or just ability to read files in your project
- Graceful integration of the human in the agent loop: this is just an iteration of the agent but it seems useful for it to ask inputs from the programmer. Maybe even something more sophisticated where the agent waits for the programmer to change stuff in the codebase
ChatGPT isn’t the best coding LLM. Claude Opus is.
Also, since you can always tell empirically whether a coding response works, mistakes are much more easily spotted than in other forms of LLM output.
Debugging with AI is more important than prompting. It requires an understanding of the intent which allows the human to prompt the model in a way that allows it to recognize its oversights.
Most code errors from LLMs can be fixed by them. The problem is an incomplete understanding of the objective which makes them commit to incorrect paths.
Being able to run code is a huge milestone. I hope the GPT5 generation can do this and thus only deliver working code. That will be a quantum leap.
> Q&A platforms have been crucial for the online help-seeking behavior of programmers. However, the recent popularity of ChatGPT is altering this trend. Despite this popularity, no comprehensive study has been conducted to evaluate the characteristics of ChatGPT’s answers to programming questions. To bridge the gap, we conducted the first in-depth analysis of ChatGPT answers to 517 programming questions on Stack Overflow and examined the correctness, consistency, comprehensiveness, and conciseness of ChatGPT answers. Furthermore, we conducted a large-scale linguistic analysis, as well as a user study, to understand the characteristics of ChatGPT answers from linguistic and human aspects. Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose. Nonetheless, our user study participants still preferred ChatGPT answers 35% of the time due to their comprehensiveness and well-articulated language style. However, they also overlooked the misinformation in the ChatGPT answers 39% of the time. This implies the need to counter misinformation in ChatGPT answers to programming questions and raise awareness of the risks associated with seemingly correct answers.
I guess I know how to ask the right programming questions, because my feeling about it is it’s about 80-90% correct, and the rest just gets me to correct solutions much faster than a search engine.
iirc, I saw some other study (or an experiment some random guy had run) where the original GPT4 had vastly outperformed its later incarnations for code generation.
current openai products either use much lower parameter models under the hood than they did originally, or maybe it's a side-effect of context stretching.
You can always email hn@ycombinator.com if you think a headline is misleading, since the site guidelines call for changing those ("Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html)
Can someone email the author and explain what an LLM is?
People asking for 'right' answers, don't really get it. I'm sorry if that sounds abrasive, but these people give LLMs a bad name due to their own ignorance/malice.
I remember having some Amazon programmer trash LLMs for 'not being 100% accurate'. It was really an iD10t error. LLMs aren't used for 100% accuracy. If you are doing that, you don't understand the technology.
There is a learning curve with LLMs, and it seems a few people still don't get it.
The real problem is that it's not marketed that way. WE may understand that but most people, heck even in my experience a large percentage of tech people, don't. They think there is some kind of true intelligence (it's literally in the name) behind it. Just like I also understand that the top results on Google are not always the best.. but my parents don't.
jghn|1 year ago
Veuxdo|1 year ago
parpfish|1 year ago
NBJack|1 year ago
avg_dev|1 year ago
unknown|1 year ago
[deleted]
RicoElectrico|1 year ago
Gigachad|1 year ago
crpietschmann|1 year ago
alphazard|1 year ago
thatjoeoverthr|1 year ago
unknown|1 year ago
[deleted]
pgm8705|1 year ago
happypumpkin|1 year ago
"Additionally, this work has used the free version of ChatGPT (GPT-3.5)"
ryanwaggoner|1 year ago
cubefox|1 year ago
jononomo|1 year ago
cjonas|1 year ago
haolez|1 year ago
Hackbraten|1 year ago
asadotzler|1 year ago
Foivos|1 year ago
mrweasel|1 year ago
jrvarela56|1 year ago
My expectation isn’t that the AI generate correct code. The AI will be useful as an ‘agent in the loop’:
- Spec or test suite written as bullets
- Define tests and/or types
- Human intervenes with edits to keep it in the right direction
- LLM generates code, runs compiler/tests
- Output is part of new context
- Repeat until programmer is happy
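The loop in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: `agent_loop`, `fake_generate`, and `run_checks` are hypothetical names, `fake_generate` is a stand-in for a real LLM call (no actual model API is used), and the "test suite" is a single check on a toy `add` function.

```python
from typing import Callable, List, Optional, Tuple


def agent_loop(
    generate: Callable[[List[str]], str],
    run_checks: Callable[[str], Tuple[bool, str]],
    spec: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask for code, run the checks, and feed any failure back as new context."""
    context = [spec]
    for _ in range(max_rounds):
        code = generate(context)
        ok, output = run_checks(code)
        if ok:
            return code
        context.append(output)  # check output becomes part of the new context
    return None  # out of rounds: hand control back to the human


def fake_generate(context: List[str]) -> str:
    # Hypothetical stand-in for an LLM: it "fixes" its answer after one failure.
    if len(context) > 1:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def run_checks(code: str) -> Tuple[bool, str]:
    # Toy test suite: execute the candidate code and check a single case.
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"error: {exc!r}"
    if result != 5:
        return False, f"add(2, 3) returned {result}, expected 5"
    return True, "all checks passed"


final = agent_loop(fake_generate, run_checks, "Write add(a, b) that returns the sum.")
print(final)
```

In a real setup the human edits from the list would slot in between rounds, for example by letting the programmer append notes to `context` before the next call.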
jrvarela56|1 year ago
This should be feasible this holiday season.
jrvarela56|1 year ago
- function calling: the LLM can take action
- Integration to your runtime: functions called by the LLM can run your tests, linters, compiler, etc
- Agents: the LLM can define what to do, execute a few tasks, and keep going with more tasks generated by itself
- Codebase/filesystem access: could be RAG or just ability to read files in your project
- Graceful integration of the human in the agent loop: this is just an iteration of the agent but it seems useful for it to ask inputs from the programmer. Maybe even something more sophisticated where the agent waits for the programmer to change stuff in the codebase
tasuki|1 year ago
haolez|1 year ago
ChrisArchitect|1 year ago
https://programs.sigchi.org/chi/2024/program/content/146667
MrSkelter|1 year ago
avg_dev|1 year ago
nijuashi|1 year ago
ph4|1 year ago
drewcoo|1 year ago
Turing_Machine|1 year ago
123yawaworht456|1 year ago
ggddv|1 year ago
dang|1 year ago
odyssey7|1 year ago
Odds of correct answer within n attempts =
1 - (1/2)^n
Nice, that’s exponentially good!
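For what it's worth, the arithmetic is easy to check. The sketch below assumes every attempt is independent with the same success probability `p` (0.5 in the formula above), which repeated answers from the same model probably are not, and it says nothing about whether you can tell which attempt is the correct one. The function name `p_correct_within` is made up for illustration.

```python
def p_correct_within(n: int, p: float = 0.5) -> float:
    """Chance of at least one correct answer in n independent attempts."""
    return 1.0 - (1.0 - p) ** n


for n in (1, 2, 3, 5):
    print(n, p_correct_within(n))
# 1 0.5
# 2 0.75
# 3 0.875
# 5 0.96875
```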
meindnoch|1 year ago
unknown|1 year ago
[deleted]
resource_waste|1 year ago
51Cards|1 year ago
mrweasel|1 year ago
I think you're wrong about that. They shouldn't be, but they clearly are.
Last5Digits|1 year ago
unknown|1 year ago
[deleted]
f0e4c2f7|1 year ago
It cracks me up how consistent this is.
See post criticizing LLMs. Check if they're on the latest version (which is now free to boot!!).
Nope. Seemingly...never. To be fair, this is probably just an old study from before 4o came out. Even still. It's just not relevant anymore.
ObnoxiousProxy|1 year ago
On the HumanEval (https://paperswithcode.com/sota/code-generation-on-humaneval) benchmark, GPT4 can generate code that works on first pass 76.5% of the time.
While on SWE-bench (https://www.swebench.com/), GPT4 with RAG can only solve about 1% of the GitHub issues used in the benchmark.