aithrowawaycomm | 11 months ago
This all seems explainable via shallow next-token prediction. Why is it that subtracting the concept lets the system adapt and find a new rhyme instead of forgetting about the -bit rhyme, while overriding it with "green" means the system cannot adapt? Why didn't it say "green habit" or something? It seems like Anthropic is having it both ways: Claude continued to rhyme after the concept was deleted, which demonstrates planning, but Claude also coherently filled in the "green" line despite it not rhyming, which... also demonstrates planning? Either that concept is "last word" or it's not! There is a tension here that does not seem coherent to me, but maybe if they had n=2 instead of n=1 examples I would have a clearer idea of what they mean. As it stands it feels arbitrary and post hoc. More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
famouswaffles | 11 months ago
Again, the model has the first line in context and is then asked to write the second line. The concept they are talking about is 'born' at the start of that second line. The point is to demonstrate that Claude decides what word the second line should end with and starts predicting the line based on that.
It doesn't forget about the -bit rhyme because that wouldn't make any sense: the first line ends with it, and you just asked it to write the second line. At this point the model is still choosing which word to end the second line with (even though "rabbit" has been suppressed), so of course it still thinks of a word that rhymes with the end of the first line.
The 'green' bit is different because this time Anthropic isn't just suppressing one option and letting the model choose from anything else; it's directly hijacking the first choice and forcing it to be something else. Claude didn't choose "green", Anthropic did. That the model still predicted a sensible line demonstrates that the concept they hijacked is indeed responsible for determining how that line plays out.
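The distinction between the two interventions can be made concrete with a toy sketch on a single hidden-state vector. Everything here is hypothetical (the vectors, the feature directions, the scale), not Anthropic's actual features or method: suppression projects a planned-word direction out and leaves the model free to re-plan, while hijacking additionally writes a different direction in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual-stream state at the newline position, plus two hypothetical
# unit-norm "planned word" feature directions (illustrative stand-ins only).
h = rng.normal(size=64)
d_rabbit = rng.normal(size=64)
d_rabbit /= np.linalg.norm(d_rabbit)
d_green = rng.normal(size=64)
d_green /= np.linalg.norm(d_green)

def suppress(h, d):
    """Project a feature direction out of the state. The rhyme constraint is
    still in context, so the model can pick a different -bit word itself."""
    return h - (h @ d) * d

def hijack(h, d_out, d_in, scale=5.0):
    """Remove one planned-word feature and write another in its place.
    Now the experimenter, not the model, chose the target word."""
    return suppress(h, d_out) + scale * d_in

h_suppressed = suppress(h, d_rabbit)       # "rabbit" gone, choice stays open
h_hijacked = hijack(h, d_rabbit, d_green)  # "green" forced as the plan
```

In the suppressed state the "rabbit" component is exactly zero, so any remaining rhyming behavior reflects the model re-planning; in the hijacked state the "green" component dominates by construction, which is why a coherent "green" line is evidence that this feature steers the whole line.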
>More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
There was nothing for them to rule out. You just didn't understand what they were saying.