v_CodeSentinal|1 month ago
The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.
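As a concrete sketch of such a loop (the `generate_code` call below is a hypothetical stand-in for whatever LLM API you use; the deterministic gate is an ordinary pytest run):

```python
import subprocess

def generate_code(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's API here."""
    raise NotImplementedError

def verification_loop(prompt: str, max_attempts: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate_code(prompt + feedback)  # generative step

        with open("candidate.py", "w") as f:
            f.write(candidate)

        # Deterministic step: the environment judges, not the prompter.
        result = subprocess.run(
            ["python", "-m", "pytest", "tests/", "-q"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return candidate  # survived the gate

        # Feed the concrete failure back; this is the "correction".
        feedback = "\n\nYour previous attempt failed with:\n" + result.stdout[-2000:]
    return None  # never passed the gate; don't ship it
```

The essential property is that nothing the model emits is accepted until an external, non-generative check has passed.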
embedding-shape|1 month ago
Personally I think it's too early for this. You either need to strictly control the code or strictly control the tests; if you let AI do both, it'll take shortcuts, and misunderstandings propagate and solidify much more easily.
Personally I chose to tightly control the tests, as most tests LLMs tend to create are utter shit, and it's very obvious. You can prompt against this, but eventually they find a hole in your reasoning and figure out a way to make the tests pass without actually exercising the code they should.
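To make that failure mode concrete, here is the kind of vacuous assertion LLMs often emit, next to a hand-pinned test; `parse_duration` is a made-up function under test:

```python
import re
import pytest

def parse_duration(s: str) -> int:
    """Hypothetical function under test: '1h30m' -> 5400 seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", s)
    if not parts or "".join(n + u for n, u in parts) != s:
        raise ValueError(f"bad duration: {s!r}")
    return sum(int(n) * units[u] for n, u in parts)

def test_parse_llm_style():
    # Passes for almost any non-crashing implementation; exercises nothing.
    assert parse_duration("1h30m") is not None

def test_parse_hand_written():
    # Pins the actual contract: exact values and the error path.
    assert parse_duration("1h30m") == 5400
    assert parse_duration("45s") == 45
    with pytest.raises(ValueError):
        parse_duration("not a duration")
```

Keeping the second kind of test under human control leaves the model no hole to route around: it either satisfies the pinned values or it fails.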
IshKebab|1 month ago
I find that this is usually a pretty strong indication that the method should exist in the library!
I think there was a story here a while ago about LLMs hallucinating a feature in a product, so in the end the developers just implemented that feature.
CamperBob2|1 month ago
Often, if not usually, that means the method should exist.