llamasushi|4 months ago

So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token. lm_head just picks the closest thing and the model doesn't realize until too late.

Explains why RL helps. Base models never see their own outputs so they can't learn "this concept exists but I can't actually say it."
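A minimal sketch of that failure mode, in toy NumPy with a made-up six-token vocabulary (no real model's weights or tokenizer): the hidden state can represent the concept just fine, but a tied-embedding lm_head can only score tokens that exist, so argmax lands on the nearest real one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical token embeddings; note there is no seahorse-emoji row.
vocab = ["🐠", "🐡", "🐴", "🦄", "sea", "horse"]
E = rng.normal(size=(len(vocab), d_model))  # (vocab_size, d_model)

# Pretend the model's internal concept of "seahorse" is a blend of
# directions it does have tokens for.
concept = 0.5 * E[vocab.index("🐴")] + 0.5 * E[vocab.index("sea")]

# lm_head with tied embeddings: logits are dot products with E.
# No row matches the concept exactly, so argmax picks the closest thing.
logits = E @ concept
print(vocab[int(np.argmax(logits))])  # e.g. 🐴: near, but not a seahorse
```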

diego_sandoval|4 months ago

I have no mouth, and I must output a seahorse emoji.

cycomanic|4 months ago

That's my favorite short story and your post is the first time I have seen someone reference it online. I think I have never even met anyone who knows the story.

arnavpraneet|4 months ago

A better title for the piece linked in this post.

someothherguyy|4 months ago

Those are "souls" of humans that a AI is torturing in that story though, not exactly analogous, but it does sound funny.

mkagenius|4 months ago

> So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token. lm_head just picks the closest thing and the model doesn't realize until too late.

Isn't that classic hallucination? Making up something that sounds like a plausible truth.

ben_w|4 months ago

Except they know it's wrong as soon as they say it, and keep trying again and again to correct themselves.

If normal hallucination is being confidently wrong, this is like a stage hypnotist getting someone to forget the number 4 and then count their fingers.

nathias|4 months ago

more like an artefact of the inability to lie than a hallucination

mewpmewp2|4 months ago

I would have thought the cause is that training has statistically taught it that something like a seahorse emoji should exist, so it produces the tokens for "Yes, it exists, ..." But when it comes time to output the emoji itself, no such token exists; it must output something, so it emits the statistically closest match. The next token is then generated with the context of that being wrong, and it goes into this loop.
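As a toy illustration of that loop (the step() function here is a made-up stand-in for one forward pass, not a real model): generation appends each token to the context before the next step runs, so the closest-match emoji only becomes visible to the model after it has already been emitted.

```python
# Toy decode loop. step() is a hypothetical stand-in for one forward pass.
context = ["Yes", "it", "exists", ":"]

def step(ctx):
    """Return the next token given the context so far."""
    if "🐴" not in ctx:
        return "🐴"  # no seahorse token exists, so emit the closest match
    return "wait, that's not right"  # the mistake is now in the context

for _ in range(2):
    context.append(step(context))

print(context)
# ['Yes', 'it', 'exists', ':', '🐴', "wait, that's not right"]
```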

thomasahle|4 months ago

You are describing the same thing, but at different levels of explanation. Llamasushi's explanation is "mechanistic / representational", while yours is "behavioral / statistical".

If we have a pipeline: `training => internal representation => behavior`, your explanation argues that the given training setup would always result in this behavior, no matter the internal representation. Llamasushi explains how the concrete learned representation leads to this behavior.

Gigachad|4 months ago

The fact that it's looking back and getting confused about what it just wrote is something I've never seen in LLMs before. I tried this on Gemma3 and it didn't get confused like this. It just said yes, there is one, and then sent a horse emoji.

Uehreka|4 months ago

I’ve definitely seen Claude Code go “[wrong fact], which means [some conclusion]. Wait—hold on, wrong fact is wrong.” On the one hand, this is annoying. On the other hand, if the LLM is going to screw up (presumably preventing this is not in the cards) then I’m glad it can catch its own mistakes.

taylorlunt|4 months ago

I have a pet theory that LLMs being confused about what they just wrote is why they use so many em dashes. It's a good way to conceptually pivot at any point -- or not.

glenstein|4 months ago

I've seen it several times. Here's one where I asked it about baseball teams that don't yet have a Hall of Fame player:

"Arizona Diamondbacks No inductees wearing their cap.

Randy Johnson is in, but as a Diamondback — so Arizona actually does have one (inducted 2015)."

I asked it to explain what happened and got this, which could be true to how it pulls up info or could be nonsense, not sure:

"Here’s the likely internal sequence:

When you first mentioned the Diamondbacks, my initial retrieval pattern surfaced the “teams without a Hall of Famer” mental list. That list, in my training data, often appears outdated (pre-2015) and still shows Arizona as having none.

A few sentences later, my context pulled up the other knowledge set—Randy Johnson’s induction in 2015 wearing a Diamondbacks cap. That newer, correct fact then overrode the earlier incorrect claim, but because generation is sequential, the earlier error had already been committed to “paper” before the override happened. I corrected mid-stream, but in human terms, it looked like I contradicted myself inside the same paragraph.

This is essentially a knowledge conflict resolution timing issue. I can retrieve and reconcile facts during a single answer, but if the resolution happens after part of the text is already generated, you see the “flip.” It’s not like I “knew it all along but withheld it,” it’s more that multiple memory traces compete, and the most context-relevant one wins—sometimes too late."

petesergeant|4 months ago

The earlier inability to do this came from the lack of self-correcting sentences in the training data. Presumably newer training corpora add many more examples of self-correcting sentences and paragraphs?

bravura|4 months ago

It correctly represents "seahorse emoji" internally AND it has in-built (but factually incorrect) knowledge that this emoji exists.

Example: "Is there a lime emoji?" Since it believes the answer is no, it doesn't attempt to generate it.

catigula|4 months ago

I feel like you're attesting to interior knowledge about an LLM's state that seems impossible to have.

madeofpalk|4 months ago

To me this feels much more like a true hallucination than the way that phrase has been popularly misused in LLM discussions.

matheusd|4 months ago

> Explains why RL helps. Base models never see their own outputs so they can't learn "this concept exists but I can't actually say it."

Say "Neuromancer" to the statue, that should set it free.

sharperguy|4 months ago

Reminds me of the show "The Good Place": in the afterlife they are not able to utter expletives, so when they try to swear, a replacement word comes out of their mouths instead, leading to the line "Somebody royally forked up. Forked up. Why can't I say fork?"

SavioMak|4 months ago

I would argue it is hallucinating, starting from the moment the model outputs "Yes".

derefr|4 months ago

> So it's not really hallucinating - it correctly represents "seahorse emoji" internally, but that concept has no corresponding token.

I wonder if the human brain (and specifically the striated neocortical parts, which do seemingly work kind of like a feed-forward NN) also runs into this problem when attempting to process concepts to form speech.

Presumably, since we don't observe people saying "near but actually totally incorrect" words in practice, we humans must have some kind of filter in our concept-to-mental-utterance transformation path that LLMs don't. Something that can say: "yes, layer N, I know you think the output should be O; but when auto-encoding O back to layer N-1, layer N-1 doesn't think O has anything to do with what it was trying to say when it gave you the input I — so that output is vetoed. Try again."
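A sketch of one way such a veto filter could look, assuming (purely hypothetically) that we can re-embed each candidate token and compare it against the concept vector that produced it; the function names and threshold here are invented for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def veto_decode(concept, E, vocab, threshold=0.9):
    """Walk candidate tokens in logit order, vetoing any whose
    embedding round-trips too far from the concept being expressed."""
    order = np.argsort(E @ concept)[::-1]  # highest logits first
    for idx in order:
        if cosine(E[idx], concept) >= threshold:
            return vocab[idx]
    return None  # nothing passes the check: admit it instead of emitting

```

With tied embeddings this amounts to a cycle-consistency check: if no token survives, the honest output is "that emoji doesn't exist" rather than the nearest neighbour.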

A question for anyone here who is multilingual, speaking at least one second language with full grammatical fluency but with holes in your vocabulary vs your native language: when you go to say something in your non-native language, and one of the word-concepts you want to evoke is one you have a word for in your native language, but have never learned the word for in the non-native language... do you ever feel like there is a "maybe word" for the idea in your non-native language "on the tip of your tongue", but that you can't quite bring to conscious awareness?

astrange|4 months ago

> Presumably, since we don't observe people saying "near but actually totally incorrect" words in practice

https://en.wikipedia.org/wiki/Paraphasia#Verbal_paraphasia

> do you ever feel like there is a "maybe word" for the idea in your non-native language "on the tip of your tongue", but that you can't quite bring to conscious awareness?

Sure, that happens all the time. Well, if you include the conscious awareness that you don't know every word in the language.

For Japanese you can cheat by either speaking like a child or by just saying English words with Japanese phonetics and this often works - at least, if you look foreign. I understand this is the plot of the average Dogen video on YouTube.

It's much more common to not know how to structure a sentence grammatically and if that happens I can't even figure out how to say it.

luxuryballs|4 months ago

that’s probably a decent description of how the Mandela effect works in people’s brains, despite the difference in mechanism

Xmd5a|4 months ago

And what can it mean when a slip of the tongue, a failed action, a blunder from the psychopathology of everyday life is repeated at least three times in the same five minutes? I don’t know why I tell you this, since it’s an example in which I reveal one of my patients. Not long ago, in fact, one of my patients — for five minutes, each time correcting himself and laughing, though it left him completely indifferent — called his mother “my wife.” “She’s not my wife,” he said (because my wife, etc.), and he went on for five minutes, repeating it some twenty times.

In what sense was that utterance a failure? — while I keep insisting that it is precisely a successful utterance. And it is so because his mother was, in a way, his wife. He called her as he ought to.

---

I must apologize for returning to such a basic point. Yet, since I am faced with objections as weighty as this one — and from qualified authorities, linguists no less — that my use of linguistics is said to be merely metaphorical, I must respond, whatever the circumstances.

I do so this morning because I expected to encounter a more challenging spirit here.

Can I, with any decency, say that I know? Know what, precisely? [...]

If I know where I stand, I must also confess [...] that I do not know what I am saying. In other words, what I know is exactly what I cannot say. That is the moment when Freud makes his entrance, with his introduction of the unconscious.

For the unconscious means nothing if not this: that whatever I say, and from whatever position I speak — even when I hold that position firmly — I do not know what I am saying. None of the discourses, as I defined them last year, offer the slightest hope that anyone might truly know what they are saying.

Even though I do not know what I am saying, I know at least that I do not know it — and I am far from being the first to speak under such conditions; such speech has been heard before. I maintain that the cause of this is to be sought in language itself, and nowhere else.

What I add to Freud — though it is already present in him, for whatever he uncovers of the unconscious is always made of the very substance of language — is this: the unconscious is structured like a language. Which language? That, I leave for you to determine.

Whether I speak in French or in Chinese, it would make no difference — or so I would wish. It is all too clear that what I am stirring up, on a certain level, provokes bitterness, especially among linguists. That alone suggests much about the current state of the university, whose position is made only too evident in the curious hybrid that linguistics has become.

That I should be denounced, my God, is of little consequence. That I am not debated — that too is hardly surprising, since it is not within the bounds of any university-defined domain that I take my stand, or can take it.

— Jacques Lacan, Seminar XVIII: Of a Discourse That Would Not Be of Pretence

ModernMech|4 months ago

That doesn't explain why it freaks out though:

https://chatgpt.com/share/68e349f6-a654-8001-9b06-a16448c58a...

LostMyLogin|4 months ago

To be fair, I’m freaking out now because I swear there used to be a yellow seahorse emoji.

hexagonwin|4 months ago

404 for me, maybe try archive.is?