I've gotten a lot of hallucinations like that from LLMs. I really don't get how so many people can get LLMs to code most of their tasks without these issues constantly popping up.
GPT often can't even tell what it's done, or produce what it knows it should. It's an endless cycle of "Apologies, here is what you actually asked for ..." and again it isn't.
I use LLMs for writing generic, repetitive code, like scaffolding.
It's OK with boring, generic stuff. Sure, it makes mistakes occasionally, but usually it's a no-brainer to fix them.
I've used it for some smaller greenfield code with success. For example: write an Arduino program that takes a number of super-sampled analog readings, performs a linear regression fit, and prints the result to the serial port.
That sort of stuff can be very helpful to newbies in the DIY electronics world for example.
But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.
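The Arduino task above, sketched in Python rather than Arduino C (the fake sensor and all names here are mine): supersample by averaging repeated reads, then do an ordinary least-squares fit.

```python
import random

def supersample(read_once, n=16):
    """Average n raw readings to knock the noise down (the super-sampling)."""
    return sum(read_once() for _ in range(n)) / n

def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Fake "analog pin": true signal y = 2x + 1 plus Gaussian noise.
xs = list(range(10))
ys = [supersample(lambda: 2 * x + 1 + random.gauss(0, 0.1)) for x in xs]
a, b = linear_fit(xs, ys)
print(f"slope={a:.2f} intercept={b:.2f}")  # close to 2.00 and 1.00
```

On an actual Arduino the shape is the same: `analogRead()` in a loop for the averaging, then `Serial.print()` for the fit results.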
A language like Golang tries really hard to have only _one_ way to do something, one right way. See how it was before generics: you just had a for loop, and you can't really mess up a for loop.
I predict that the variance in success using LLMs for coding (even multi-step agentic coding, rather than the simple line or block autosuggest many are familiar with via Copilot) has much more to do with:
1) whether the language is super simple and hard to foot-gun yourself in, with one consistent way to do things,
AND
2) whether juniors and students tend to use the language, and how much of the online content (StackOverflow, for example) is written by students, juniors, or bootcamp folks posting incorrect code.
What % of online Golang code is GitHub repos like Docker or K8s, versus a student posting their buggy Gomoku implementation on StackOverflow?
The future of programming language design will have AI comprehensibility and hallucination avoidance among its key pillars. #1 above is a key aspect.
They are good at combining well-known, Codeforces-style algorithms; oftentimes I don't care about the syntax, I need the algorithm. LLMs could write pseudocode for all I care, but they tend to get the syntax correct quite often anyway.
Sometimes I don't know much about the topic and want a foundation, so I don't care whether the syntax or the code is valid as long as it points me in the right direction. Other times I know exactly what I want, and instructing the LLM specifically about what output I expect just helps me get there faster.
What TFA was talking about didn't really seem like a hallucination, just a case of garbage in, garbage out. Normally there are more examples of good/correct data in the training set than bad, so statistically the good wins; but if it's prompted for something obscure, maybe bad is all it has.
Common coding tasks are going to be better represented in the training set and give better results.
Here's the secret to coding with an LLM: don't expect it to get things 100% correct. You will need to fix something almost every time you use it to generate code. Maybe a name here, maybe a calculation, maybe a function signature. And maybe you won't spot the issue until later.
You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).
That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.
ChatGPT used to assure me that you can use JS dot notation to access elements in a Python dict. It also invented Redocly CLI flags that don't exist. Claude sometimes invents OpenAPI specification rules. Any time I ask anything remotely niche, LLMs are often bad.
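For the record, a quick sketch of why that first suggestion is wrong (`config` is just an invented example): Python dicts don't support JS-style dot access.

```python
config = {"retries": 3, "timeout": 30}

# The JS-style access the model suggested would raise AttributeError:
#   config.retries  ->  AttributeError: 'dict' object has no attribute 'retries'

# Python wants subscript or .get() access:
print(config["retries"])      # 3
print(config.get("timeout"))  # 30
```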
I once asked Perplexity (using Claude underneath) about some library functionality, which it totally fabricated.
First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.
Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”
The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.
> Any time I ask anything remotely niche, LLMs are often bad
As soon as AI coder tools (like Aider, Cline, Claude Coder) come into contact with a _real world_ codebase, things do not end well.
So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:
- Rewrote tests in a way that the broken behaviour passes the test
- Fail to solve the core issue in the code, and instead patch up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive fix for a single specific case */ }`)
- Failed to even update the files properly, wasting a bunch of tokens
It tried to convince me that it's possible to break out of an outer loop in C++ with a `break 'label` statement placed in the nested loop. No such syntax exists in C++; that looks like Rust's labeled break leaking over.
Semi-related: when I'm using a dict of known keys as some sort of simple object, I almost always reach for a dataclass (with slots=True and kw_only=True) these days. It has the added benefit that you can do foo = MyDataclass(**some_dict) and get runtime errors when the format has changed.
Well, it makes sense. The smaller the niche, the less weight it carries in the overall training loss. At the end of the day, LLMs are (literally) classifiers that assign probabilities to tokens given the previous tokens.
My rule of thumb is: is the answer to your question on the first page of Google (a StackOverflow hit, maybe, or some shit like GeeksforGeeks)? If yes, GPT can give you an answer; otherwise, not.
I think a lot of these issues could be avoided if, instead of just a raw model, you had an AI agent that can test its own answers against the actual software. It doesn't matter as much if the model hallucinates when testing weeds out its hallucinations.
Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run
Yeah this is so common that I've already compiled a mental list of prompts to try against any new release. I haven't seen any improvement in quite a long while now, which confirms my belief that we've more or less hit the scaling wall for what the current approaches can provide. Everything new is just a microoptimization to game one of the benchmarks, but real world use has been identical or even worse for me.
No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
Every time this topic comes up I post a similar comment about how hallucinations in code really don't matter because they reveal themselves the second you try to run that code.
When the AI hallucinates an API, it is not always easy to find out whether it exists, given the different versions and libraries for a given task; it can easily waste 10 minutes trying to find the promised API, particularly when the search results also include AI-generated answers.
Also, there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. JavaScript comes to mind. The code can easily be runnable but wrong as well (or worse, inconsistent and confusing), and this does happen in practice.
> because they reveal themselves the second you try to run that code.
In dynamic languages, runtime errors like calling methods with nonexistent arguments only manifest when the block of code containing them is run, and not all blocks of code are run at every invocation of the program.
As usual, the defenses against this are extensive unit-test coverage and/or static typing.
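A tiny sketch of that failure mode (the function and the typo'd method are invented): the bad call sits dormant until its branch actually executes, which is exactly what test coverage or a type checker like mypy would surface early.

```python
def handle(event: str, verbose: bool = False) -> str:
    if verbose:
        # Typo'd/hallucinated method: str has .upper(), not .uppercase().
        # Nothing complains until this branch actually runs.
        return event.uppercase()
    return event

handle("deploy")  # fine; the broken branch never executed
# handle("deploy", verbose=True)  ->  AttributeError at runtime
# (mypy flags the bad call statically, without running anything)
```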
It’s not really hallucinating though, is it? It’s repeating a pattern in its training data, which is wrong but is presented in that training data (and by the author of this piece, but unintentionally) as being the solution to the problem. So this has more in common with an attack than a hallucination on the LLM’s part.
This sort of hallucination happens to me frequently with AWS infrastructure questions. Which is depressing because I can't do anything but agree, "yeah, that API is exactly what any sane person would want, but AWS didn't do that, which is why I'm asking the question".
Hallucinations like this could be a great way to identify missing features or confusing parts of your framework. If the llm invents it, maybe it ought to be like this?
In my experience LLMs do this kind of thing with enough frequency that I don’t consider them as my primary research tool. I can’t afford to be sent down rabbit holes which are barely discernible from reality.
I wonder how easy it would be to influence big LLMs if a particular group of people created enough articles that any human reader could tell are a load of garbage to be ignored, but that an LLM parsing them wouldn't recognize as such, ruining its reasoning and code generation abilities.
This is interesting. If the models had enough actual code as training data, that forum post code should have very little weight, shouldn't it? Why do the LLMs prefer it?
Probably because the coworker's question and the forum post are both questions that start with "How do I", so they're a good match. Actual code would be more likely to be preceded by... more code, not a question.
This is incredible, and it's not technically a "hallucination". I bet it's relatively easy to find more examples like this: something on the internet that's at once niche enough, popular enough, and wrong, yet was scraped and trained on.
The conclusion paragraph was really funny and kind of perfectly encapsulates the current state of AI, though as pointed out by another comment, we can't even call them smart, just "Ctrl-C Ctrl-V, Leeroy Jenkins style".
This is exactly what I mean when I say "tell me you're bad without saying so." Most people here disagree with that.
A while back a friend of mine told me he's very fond of LLMs, because he finds the Kubernetes CLI confusing, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.
Well... sure, but if you'd looked up the answer on StackOverflow you'd have seen the whole thread, including comments, and you'd have had the opportunity to understand what the command actually does.
It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.
If you blindly trust LLMs in such scenarios, sooner or later you'll find yourself in a lot of trouble.
What I honestly find most interesting about this is the thought that hallucinations might lead to the kind of emergent language design we see in natural language (which might not be a good thing for a computer language, FWIW, but is still interesting), where people just kind of think "language should work this way, and if I say it like this, people will probably understand me".
bakugo | 1 year ago
They can't, they usually just don't understand the code enough to notice the issues immediately.
The perceived quality of LLM answers is inversely proportional to the user's understanding of the topic they're asking about.
bugglebeetle | 1 year ago
You write tests in the same way as you would when checking your own work or delegating to anyone else?
miningape | 1 year ago
inb4 you just aren't prompting correctly
simonw | 1 year ago
I've just written up a longer form of that comment: "Hallucinations in code are the least dangerous form of LLM mistakes" - https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
adamgordonbell | 1 year ago
Sometimes an LLM will hallucinate a flag or option that really makes sense; it just doesn't actually exist.