jaehong747 | 11 months ago
The research also modifies internal states (removing the "rabbit" feature or injecting "green") and observes Claude shifting to words like "habit" or ending lines with "green." That is arguably more about rerouting probabilistic paths than genuine "adaptation." The authors argue it shows "planning," but a language model can maintain multiple candidate words at once without engaging in human-like strategy.
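A toy sketch of that kind of intervention (invented vectors and words, not the paper's actual method): treat each candidate word as a direction in activation space, then subtract or add a direction and see which word wins at decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and unembedding matrix (stand-ins for a real model's weights).
vocab = ["rabbit", "habit", "green", "grab"]
d = 256
unembed = rng.normal(size=(len(vocab), d))  # one concept direction per word

# A hidden state that mostly encodes "rabbit", with "habit" as a runner-up.
hidden = unembed[0] + 0.3 * unembed[1]

def top_word(h):
    """Decode: pick the word whose direction best matches the state."""
    return vocab[int(np.argmax(unembed @ h))]

print(top_word(hidden))                      # the dominant candidate

# "Remove rabbit": subtract its direction; the runner-up takes over.
print(top_word(hidden - unembed[0]))

# "Inject green": add that direction strongly; decoding flips to it.
print(top_word(hidden + 3.0 * unembed[2]))
```

The point of the toy is only that linear edits to a state vector reroute which candidate wins, without the system holding any goal.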
Finally, "planning ahead" implies a top-down goal and a mechanism for sustaining it, which is a strong assumption. Compelling evidence would require more than observing feature activations. We should be cautious before anthropomorphizing these neural nets.
rcxdude | 11 months ago
And I think it's relatively obvious that the models do this to some degree: it's very hard to write any language at all without 'thinking ahead' at least a little in some form, given the way human language is structured. If models didn't do this and considered only the next token in isolation, they would paint themselves into a corner within a single sentence. Early LLMs like GPT-2 were still pretty bad at this: plausible over short windows, but with no consistency across a longer piece of text. Whether this amounts to some high-level, abstracted 'train of thought', and how cohesive it is across its different forms, is a different question. Indeed, from the section on jailbreaking, it looks like the model is often caught out by conflicting goals from different areas of the network that aren't resolved in any logical fashion.
vessenes | 11 months ago
That said, your comment suggests some follow-up reporting that would be interesting: look at the top 20 or so most probable second lines after adjusting the rabbit/green state. It seems like we'd get more insight into how the model is thinking, and the result would be relatively easy for humans to parse. You could run through a bunch of completions until you get 20 different terminal rhyme words, then show candidate lines sorted by the percentage of the time each rhyme word is chosen.
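The proposed tally could be sketched like this. Note that `sample_line` here is a hypothetical stand-in for sampling from the model under an edited internal state, and the ending frequencies are invented placeholders, not measured numbers:

```python
import random
from collections import Counter

# Hypothetical stand-in for sampling one candidate second line from the model
# under a given internal state ("baseline", "rabbit" removed, etc.).
# The weighted endings below are invented placeholders.
def sample_line(state, rng):
    endings = {
        "baseline": ["rabbit"] * 6 + ["habit"] * 3 + ["grab it"],
        "rabbit_removed": ["habit"] * 7 + ["grab it"] * 3,
    }
    word = rng.choice(endings[state])
    return f"... {word}", word

def rhyme_word_table(state, n=1000, seed=0):
    """Sample n completions and tally how often each terminal rhyme word appears."""
    rng = random.Random(seed)
    counts = Counter(sample_line(state, rng)[1] for _ in range(n))
    # Sort rhyme words by how often they are chosen, as percentages.
    return [(word, 100 * c / n) for word, c in counts.most_common()]

for word, pct in rhyme_word_table("rabbit_removed"):
    print(f"{word}: {pct:.1f}%")
```

With a real model, `sample_line` would run an actual completion with the edited state and extract the line's final word; the tallying and sorting logic stays the same.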
jaehong747 | 11 months ago
I believe this phenomenon occurs because high-performance LLMs already encode probability distributions over future words in their networks, which shows up as elevated activations in the relevant neurons. It's a byproduct of predicting probability distributions over the next (and, implicitly, later) output tokens.
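One way to make "future words already reflected in the distribution" concrete is a toy Markov sketch (invented transition numbers, nothing to do with a real transformer): if the next-token distribution is a row of a transition matrix P, then the distribution over the word two positions ahead is already implied by P @ P, before anything is sampled.

```python
import numpy as np

tokens = ["like", "a", "rabbit", "habit"]
# Toy next-token transition matrix: row i is P(next token | current = tokens[i]).
P = np.array([
    [0.0,  0.9,  0.1,  0.0],   # "like" -> mostly "a"
    [0.0,  0.0,  0.7,  0.3],   # "a"    -> "rabbit" or "habit"
    [0.25, 0.25, 0.25, 0.25],  # uniform filler rows
    [0.25, 0.25, 0.25, 0.25],
])

# Distribution over the token two steps after "like": implied by P itself.
two_ahead = (P @ P)[0]
for t, p in zip(tokens, two_ahead):
    print(f"{t}: {p:.3f}")
```

Here "rabbit" is already the most likely word two positions ahead even though the model only ever emits one next-token distribution at a time; a transformer's activations can carry analogous forward-looking structure.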