jaehong747 | 11 months ago
The research also modifies internal states (removing the "rabbit" feature or injecting "green") and observes Claude shifting to words like "habit" or ending lines with "green." That is arguably more about rerouting probabilistic paths than genuine "adaptation." The authors argue it shows "planning," but a language model can maintain multiple candidate words at once without engaging in human-like strategy.
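A toy sketch of that kind of intervention (invented vectors and words, not the paper's actual method): treat each candidate word as a direction in activation space, then subtract or add a direction and see which word wins at decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and unembedding matrix (stand-ins for a real model's weights).
vocab = ["rabbit", "habit", "green", "grab"]
d = 256
unembed = rng.normal(size=(len(vocab), d))  # one concept direction per word

# A hidden state that mostly encodes "rabbit", with "habit" as a runner-up.
hidden = unembed[0] + 0.3 * unembed[1]

def top_word(h):
    """Decode: pick the word whose direction best matches the state."""
    return vocab[int(np.argmax(unembed @ h))]

print(top_word(hidden))                      # the dominant candidate

# "Remove rabbit": subtract its direction; the runner-up takes over.
print(top_word(hidden - unembed[0]))

# "Inject green": add that direction strongly; decoding flips to it.
print(top_word(hidden + 3.0 * unembed[2]))
```

The point of the toy is only that linear edits to a state vector reroute which candidate wins, without the system holding any goal.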
Finally, "planning ahead" implies a top-down goal and a mechanism for sustaining it, which is a strong assumption. Compelling evidence would require more than observing feature activations. We should be cautious before anthropomorphizing these neural nets.
rcxdude | 11 months ago
And I think it's relatively obvious that the models do this to some degree: it's very hard to write any language at all without 'thinking ahead' at least a little in some form, given the way human language is structured. If models didn't do this and considered only the next token in isolation, they would paint themselves into a corner within a single sentence. Early LLMs like GPT-2 were still pretty bad at this: plausible over short windows, but with no consistency across a longer piece of text. Whether this amounts to some high-level, abstracted 'train of thought', and how cohesive it is across its different forms, is a different question. Indeed, from the section on jailbreaking, it looks like the model is often caught out by conflicting goals from different areas of the network that aren't resolved in any logical fashion.
vessenes | 11 months ago
That said, your comment suggests some follow-up reporting that would be interesting: look at the top 20 or so most probable second lines after adjusting the rabbit/green state. It seems like we'd get more insight into how the model is thinking, and the result would be relatively easy for humans to parse. You could run through a bunch of completions until you get 20 different terminal rhyme words, then show candidate lines sorted by the percentage of the time each rhyme word is chosen.
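The proposed tally could be sketched like this. Note that `sample_line` here is a hypothetical stand-in for sampling from the model under an edited internal state, and the ending frequencies are invented placeholders, not measured numbers:

```python
import random
from collections import Counter

# Hypothetical stand-in for sampling one candidate second line from the model
# under a given internal state ("baseline", "rabbit" removed, etc.).
# The weighted endings below are invented placeholders.
def sample_line(state, rng):
    endings = {
        "baseline": ["rabbit"] * 6 + ["habit"] * 3 + ["grab it"],
        "rabbit_removed": ["habit"] * 7 + ["grab it"] * 3,
    }
    word = rng.choice(endings[state])
    return f"... {word}", word

def rhyme_word_table(state, n=1000, seed=0):
    """Sample n completions and tally how often each terminal rhyme word appears."""
    rng = random.Random(seed)
    counts = Counter(sample_line(state, rng)[1] for _ in range(n))
    # Sort rhyme words by how often they are chosen, as percentages.
    return [(word, 100 * c / n) for word, c in counts.most_common()]

for word, pct in rhyme_word_table("rabbit_removed"):
    print(f"{word}: {pct:.1f}%")
```

With a real model, `sample_line` would run an actual completion with the edited state and extract the line's final word; the tallying and sorting logic stays the same.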
jaehong747 | 11 months ago
I believe this phenomenon occurs because high-performance LLMs already encode probability distributions over future words in their networks, which shows up as elevated activations in the relevant neurons. It's a byproduct of predicting probability distributions over the next (and, implicitly, later) output tokens.
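One way to make "future words already reflected in the distribution" concrete is a toy Markov sketch (invented transition numbers, nothing to do with a real transformer): if the next-token distribution is a row of a transition matrix P, then the distribution over the word two positions ahead is already implied by P @ P, before anything is sampled.

```python
import numpy as np

tokens = ["like", "a", "rabbit", "habit"]
# Toy next-token transition matrix: row i is P(next token | current = tokens[i]).
P = np.array([
    [0.0,  0.9,  0.1,  0.0],   # "like" -> mostly "a"
    [0.0,  0.0,  0.7,  0.3],   # "a"    -> "rabbit" or "habit"
    [0.25, 0.25, 0.25, 0.25],  # uniform filler rows
    [0.25, 0.25, 0.25, 0.25],
])

# Distribution over the token two steps after "like": implied by P itself.
two_ahead = (P @ P)[0]
for t, p in zip(tokens, two_ahead):
    print(f"{t}: {p:.3f}")
```

Here "rabbit" is already the most likely word two positions ahead even though the model only ever emits one next-token distribution at a time; a transformer's activations can carry analogous forward-looking structure.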