top | item 43222027

Making o1, o3, and Sonnet 3.7 hallucinate for everyone

267 points| hahahacorn | 1 year ago |bengarcia.dev | reply

219 comments

[+] andix|1 year ago|reply
I've got a lot of hallucinations like that from LLMs. I really don't get how so many people can get LLMs to code most of their tasks without those issues permanently popping up.
[+] QuantumGood|1 year ago|reply
GPT can't even tell what it's done, or give what it knows it should. It's endless: "Apologies, here is what you actually asked for ..." and again it isn't.
[+] kgeist|1 year ago|reply
I use LLMs for writing generic, repetitive code, like scaffolding. It's OK with boring, generic stuff. Sure it makes mistakes occasionally but usually it's a no-brainer to fix them.
[+] magicalhippo|1 year ago|reply
I've used it for some smaller greenfield code with success. Like, write an Arduino program that performs a number of super-sampled analog readings, and performs a linear regression fit, printing the result to the serial port.

That sort of stuff can be very helpful to newbies in the DIY electronics world for example.

But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.
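
The kind of task described above, averaging oversampled readings and fitting a line through them, can be sketched briefly (in Python rather than Arduino C, and with simulated data instead of analog reads; the function names are made up for illustration):

```python
# Average fixed-size groups of noisy readings (super-sampling),
# then fit a least-squares line through the averaged samples.
def supersample(readings, group=4):
    """Average each consecutive group of `group` raw readings."""
    return [sum(readings[i:i + group]) / group
            for i in range(0, len(readings), group)]

def linear_fit(ys):
    """Ordinary least-squares fit of ys against their indices 0..n-1."""
    xs = range(len(ys))
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Simulated noisy readings standing in for analogRead() values.
raw = [0.1, -0.1, 0.2, 0.0, 1.1, 0.9, 1.0, 1.0, 2.0, 2.1, 1.9, 2.0]
slope, intercept = linear_fit(supersample(raw))
```

On an Arduino the averaging would happen over repeated `analogRead()` calls and the result would go out over the serial port, but the arithmetic is the same.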

[+] bakugo|1 year ago|reply
> I really don't get how so many people can get LLMs to code most of their tasks without those issues permanently popping up

They can't, they usually just don't understand the code enough to notice the issues immediately.

The perceived quality of LLM answers is inversely proportional to the user's understanding of the topic they're asking about.

[+] L-four|1 year ago|reply
The trick to coding with LLMs is not caring if the code is correct.
[+] ninininino|1 year ago|reply
A language like Golang tries really hard to only have _one_ way to do something, one right way, one way. Just one way. See how it was before generics. You just have a for loop. Can't really mess up a for loop.

I predict that the variance in success using LLMs for coding (even agentic, multi-step coding, rather than the simple line or block autosuggest many are familiar with via Copilot) has much more to do with:

1) is the language a super simple, hard to foot-gun yourself language, with one way to do things that is consistent

AND

2) do juniors and students tend to use the lang, and how much of the online content (StackOverflow, for example) is written by students, juniors, or bootcamp folks writing incorrect code and posting it online?

What % of the online Golang code is in GH repos like Docker or K8s vs a student posting their buggy Gomoku implementation on StackOverflow?

The future of programming language design has AI-comprehensibility/AI-hallucination-avoidance as one of the key pillars. #1 above is a key aspect.

[+] runeblaze|1 year ago|reply
They are good at (combining well-known, codeforces-style) algorithms; often times I don’t care about the syntax, but I need the algorithm. LLMs can write pseudocode for all I care but they tend to get syntax correct quite often
[+] pllbnk|1 year ago|reply
Sometimes I don't know anything (relatively) about the topic and I want to get a foundation so I don't care whether the syntax or the code is valid as long as it points me in the right direction. Other times I know exactly what I want and I just find that instructing LLM specifically what output I expect from it just helps me get there faster.
[+] zelphirkalt|1 year ago|reply
Probably by coding things that are very mainstream and have already been fed to the LLM a thousand times from ripped-off projects.
[+] johnisgood|1 year ago|reply
I have made large projects using Claude, with success. I know what I want to do and how to do it, maybe my prompts were right.
[+] HarHarVeryFunny|1 year ago|reply
What TFA was talking about didn't really seem like a hallucination - just a case of garbage in, garbage out. Normally there are more examples of good/correct data in the training set than bad, so statistically the good wins, but if it's prompted for something obscure, maybe bad is all that it has got.

Common coding tasks are going to be better represented in the training set and give better results.

[+] afro88|1 year ago|reply
Here's the secret to coding with an LLM: don't expect it to get things 100% correct. You will need to fix something almost every time you use it to generate code. Maybe a name here, maybe a calculation, maybe a function signature. And maybe you won't spot the issue until later.

You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).

That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.

[+] bugglebeetle|1 year ago|reply
> LLMs. I really don't get how so many people can get LLMs to code most of their tasks without those issues permanently popping up.

You write tests in the same way as you would when checking your own work or delegating to anyone else?

[+] pinoy420|1 year ago|reply
A good prompt. You don’t just ask it. You tell it how to behave and give it a shot load of context
[+] dominicq|1 year ago|reply
ChatGPT used to assure me that you can use JS dot notation to access elements in a Python dict. It also invented Redocly CLI flags that don't exist. Claude sometimes invents OpenAPI specification rules. Any time I ask anything remotely niche, LLMs are often bad.
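
The dot-notation mixup described above looks like this in practice (a minimal sketch; the `config` dict is invented for illustration):

```python
config = {"timeout": 30, "retries": 3}

# The hallucinated JS-style pattern fails at runtime:
# config.timeout  # AttributeError: 'dict' object has no attribute 'timeout'

# What actually works on a Python dict:
timeout = config["timeout"]          # raises KeyError if the key is missing
retries = config.get("retries")      # returns None if the key is missing
fallback = config.get("missing", 0)  # returns the supplied default, 0
```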
[+] ljm|1 year ago|reply
I once asked Perplexity (using Claude underneath) about some library functionality, which it totally fabricated.

First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.

Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”

The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.

[+] skerit|1 year ago|reply
> Any time I ask anything remotely niche, LLMs are often bad

As soon as the AI coder tools (like Aider, Cline, Claude-Coder) come into contact with a _real world_ codebase, things do not end well.

So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:

- Rewrote tests in a way that the broken behaviour passes the test

- Failed to solve the core issue in the code, and instead patched up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive fix for a single specific case */ }`)

- Failed to even update the files properly, wasting a bunch of tokens

[+] nopurpose|1 year ago|reply
It tried to convince me that it is possible to break out of an outer loop in C++ with a `break 'label` statement placed in a nested loop. No such syntax exists.
[+] ijustlovemath|1 year ago|reply
Semi related: when I'm using a dict of known keys as some sort of simple object, I almost always reach for a dataclass (with slots=True and kw_only=True) these days. Has the added benefit that you can do stuff like foo = MyDataclass(**some_dict) and get runtime errors when the format has changed.
[+] jurgenaut23|1 year ago|reply
Well, it makes sense. The smaller the niche, the lesser weight in the overall training loss. At the end of the day, LLMs are (literally) classifiers that assign probabilities to tokens given some previous tokens.
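
The "classifier over tokens" framing can be made concrete with a toy sketch (the vocabulary and logits below are invented for illustration; a real model scores tens of thousands of tokens):

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token candidates: constructs the model rarely saw in a
# niche get low scores, so the niche answer is less likely to win.
vocab = ["break", "continue", "goto"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
```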
[+] miningape|1 year ago|reply
Any time I ask anything, LLMs are often bad.

inb4 you just aren't prompting correctly

[+] andrepd|1 year ago|reply
My rule of thumb is: is the answer to your question on the first page of google (a stackoverflow maybe, or some shit like geek4geeks)? If yes GPT can give you an answer, otherwise not.
[+] skissane|1 year ago|reply
I think a lot of these issues could be avoided if, instead of just a raw model, you have an AI agent which is able to test its own answers against the actual software… it doesn’t matter as much if the model hallucinates if testing weeds out its hallucinations.

Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run
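
The generate-then-test loop described above can be sketched with canned candidates standing in for model output (everything here is hypothetical; a real agent would call a model and feed the error back into the next attempt):

```python
def check(code):
    """Reject a candidate that doesn't parse or doesn't run."""
    try:
        compile(code, "<candidate>", "exec")  # catches hallucinated syntax
        exec(code, {})                        # catches runtime failures
        return True
    except Exception:
        return False

# Stubbed "model outputs": the first hallucinates syntax (missing colon).
candidates = [
    "for i in range(3) print(i)",
    "for i in range(3):\n    pass",
]

working = next(c for c in candidates if check(c))
```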

[+] Etheryte|1 year ago|reply
Yeah this is so common that I've already compiled a mental list of prompts to try against any new release. I haven't seen any improvement in quite a long while now, which confirms my belief that we've more or less hit the scaling wall for what the current approaches can provide. Everything new is just a microoptimization to game one of the benchmarks, but real world use has been identical or even worse for me.
[+] latexr|1 year ago|reply
> Conclusion

> LLMs are really smart most of the time.

No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.

[+] simonw|1 year ago|reply
Every time this topic comes up I post a similar comment about how hallucinations in code really don't matter because they reveal themselves the second you try to run that code.

I've just written up a longer form of that comment: "Hallucinations in code are the least dangerous form of LLM mistakes" - https://simonwillison.net/2025/Mar/2/hallucinations-in-code/

[+] grey-area|1 year ago|reply
Where the GAI hallucinates an API, it is not always easy to find out whether it exists, given different versions and libraries for a given task; it can easily waste 10 mins trying to find the promised API, particularly when search results also include generated AI answers.

Also there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. Javascript comes to mind. The code can easily be runnable but wrong as well (or worse inconsistent and confusing) and this does happen in practice.

[+] woadwarrior01|1 year ago|reply
> because they reveal themselves the second you try to run that code.

In dynamic languages, runtime errors like calling methods with nonexistent arguments etc. only manifest when the block of code containing them is run, and not all blocks of code are run at every invocation of the program.

As usual, the defenses against this are extensive unit-test coverage and/or static typing.

[+] Chance-Device|1 year ago|reply
It’s not really hallucinating though, is it? It’s repeating a pattern in its training data, which is wrong but is presented in that training data (and by the author of this piece, but unintentionally) as being the solution to the problem. So this has more in common with an attack than a hallucination on the LLM’s part.
[+] adamgordonbell|1 year ago|reply
We at Pulumi started treating some hallucinations like this as feature requests.

Sometimes an LLM will hallucinate a flag or option that really makes sense - it just doesn't actually exist.

[+] wrs|1 year ago|reply
This sort of hallucination happens to me frequently with AWS infrastructure questions. Which is depressing because I can't do anything but agree, "yeah, that API is exactly what any sane person would want, but AWS didn't do that, which is why I'm asking the question".
[+] joelthelion|1 year ago|reply
Hallucinations like this could be a great way to identify missing features or confusing parts of your framework. If the llm invents it, maybe it ought to be like this?
[+] mberning|1 year ago|reply
In my experience LLMs do this kind of thing with enough frequency that I don’t consider them as my primary research tool. I can’t afford to be sent down rabbit holes which are barely discernible from reality.
[+] IAmNotACellist|1 year ago|reply
"Not acceptable. Please upgrade your browser to continue." No, I don't think I will.
[+] aranw|1 year ago|reply
I wonder how easy it would be to influence LLMs if a particular group of people created enough articles that any human reader would recognize as garbage and ignore, but that an LLM would parse without realising, ruining its reasoning and code generation abilities.
[+] Narretz|1 year ago|reply
This is interesting. If the models had enough actual code as training data, that forum post code should have very little weight, shouldn't it? Why do the LLMs prefer it?
[+] do_not_redeem|1 year ago|reply
Probably because the coworker's question and the forum post are both questions that start with "How do I", so they're a good match. Actual code would be more likely to be preceded by... more code, not a question.
[+] pfortuny|1 year ago|reply
Maybe because the response pattern-matches other languages’s?
[+] lxe|1 year ago|reply
This is incredible, and it's not technically a "hallucination". I bet it's relatively easy to find more examples like this... something on the internet that's both niche enough, popular enough, and wrong, yet was scraped and trained on.
[+] leumon|1 year ago|reply
He should've tested 4.5. That model hallucinates much less than any other.
[+] Baggie|1 year ago|reply
The conclusion paragraph was really funny and kinda perfectly encapsulates the current state of AI, but as pointed out by another comment, we can't even call them smart, just "Ctrl C Ctrl V Leeroy Jenkins style"
[+] jwjohnson314|1 year ago|reply
The interesting thing here to me is that the llm isn’t ‘hallucinating’, it’s simply regurgitating some data it digested during training.
[+] mvdtnz|1 year ago|reply
What's the difference?
[+] zeroq|1 year ago|reply
This is exactly what I mean when I say "tell me you're bad without telling me you're bad". Most people here disagree with that.

A while back a friend of mine told me he's very fond of LLMs because he's confused by the Kubernetes CLI, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.

Well... Sure, but if you looked the answer up on StackOverflow you'd see the whole thread, including comments, and you'd have the opportunity to understand what the command actually does.

It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.

If you blindly trust llms in such scenarios sooner or later you'll find yourself in a lot of trouble.

[+] saurik|1 year ago|reply
What I honestly find most interesting about this is the thought that hallucinations might lead to the kind of emergent language design we see in natural language (which might not be a good thing for a computer language, fwiw, but still interesting), where people just kind of think "language should work this way, and if I say it like this people will probably understand me".
[+] egberts1|1 year ago|reply
Write me a Mastercard/Visa fraud detection code in Ada, please.