top | item 46767939

(no title)

phpnode | 1 month ago

Claude doesn't know why it acted the way it acted, it is only predicting why it acted. I see people falling for this trap all the time

discuss

LoganDark|1 month ago

It's not even predicting why it acted, it's predicting an explanation of why it acted, which is even worse since there's no consistent mental model.

GuB-42|1 month ago

It had been shown that LLMs don't know how they work. They asked a LLM to perform computations, and explain how they got to the result. The LLM explanation is typical of how we do it: add number digit by digit, with carry, etc... But by looking inside the neural network, it show that the reality is completely different and much messier. None of it is surprising.

Still, feeding it back its own completely made up self-reflection could be an effective strategy, reasoning models kind of work like this.

FireBeyond|1 month ago

Right. Last time I checked this was easy to demonstrate with word logic problems:

"Adam has two apples and Ben has four bananas. Cliff has two pieces of cardboard. How many pieces of fruit do they have?" (or slightly more complex, this would probably be easily solved, but you get my drift.)

Change the wordings to some entirely random, i.e. something not likely to be found in the LLM corpus, like walruses and skyscrapers and carbon molecules, and the LLM will give you a suitably nonsensical answer showing that it is incapable of handling even simple substitutions that a middle schooler would recognize.

phpnode|1 month ago

The explanation becomes part of the context which can lead to more effective results in the next turn, it does work, but it does so in a completely misleading way

wongarsu|1 month ago

Which should be expected, since the same is true for humans. The "adding numbers digit by digit with carry" works well on paper, but it's not an effective method for doing math in your head, and is certainly not how I calculate 14+17. In fact I can't really tell you how I calculate 14+17 since that's not in the "inner monologue" part of my brain, and I have little introspection in any of the other parts

Still, feeding humans their completely made-up self-reflection back can be an effective strategy

kaffekaka|1 month ago

Yes, this pitfall is a hard one. It is very easy to interpret the LLM in a way there is no real ground for.

scotty79|1 month ago

It must be anthropomorphization that's hard to shake off.

If you understand how this all works it's really no surprise that reasoning post-factum is exactly as hallucinated as the answer itself and might have very little to do with it and it always has nothing to do with how the answer actually came to be.

The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use information that it writes there in completely obscure way (that has nothing to do what's verbally there) while generating the actual answer.

nnevatie|1 month ago

That's because when the failure becomes the context, it can clearly express the intent of not falling for it again. However, when the original problem is the context, none of this obviousness applies.

Very typical, and gives LLMs the annoying Captain Hindsight -like behaviour.

nonethewiser|1 month ago

IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them. When they are anthropomorphized its assumed to be a misunderstanding of how they work.

Whereas someone might say "geeze my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain the computer cannot actually feel hatred. We understand the analogy.

I mean your distinction is totally valid and I dont blame you for observing it because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.

amenhotep|1 month ago

This is a sort of interesting point, it's true that knowingly-metaphorical anthropomorphisation is hard to distinguish from genuine anthropomorphisation with them and that's food for thought, but the actual situation here just isn't applicable to it. This is a very specific mistaken conception that people make all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.

drob518|1 month ago

It’s not even doing that. It’s just an algorithm for predicting the next word. It doesn’t have emotions or actually think. So, I had to chuckle when it said it was arrogant. Basically, it’s training data contains a bunch of postmortem write ups and it’s using those as a template for what text to generate and telling us what we want to hear.