top | item 43496485

aithrowawaycomm | 11 months ago

I struggled reading the papers - Anthropic's white papers remind me of Stephen Wolfram: a huge pile of suggestive empirical evidence, but the claims are extremely vague - no definitions, just vibes - the empirical evidence seems selectively curated, and there's not much effort spent building a coherent general theory.

Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?

I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.

famouswaffles | 11 months ago

>The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”

I'm honestly confused at what you're getting at here. It doesn't matter why Claude chose "rabbit" to plan around (it likely did do so because of "carrot"); the point is that it thought about it beforehand. The "rabbit" concept is present as the model is about to write the first word of the second line, even though the word "rabbit" won't come into play until the end of the line.

>later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”!

It's not supposed to rhyme - that's the point. They forced Claude to plan around a line ender that doesn't rhyme, and it did. Claude didn't choose the word "green"; Anthropic replaced the concept it was thinking ahead about with "green" and saw that the line changed accordingly.

aithrowawaycomm | 11 months ago

> Here, we modified the part of Claude’s internal state that represented the "rabbit" concept. When we subtract out the "rabbit" part, and have Claude continue the line, it writes a new one ending in "habit", another sensible completion. We can also inject the concept of "green" at that point, causing Claude to write a sensible (but no-longer rhyming) line which ends in "green". This demonstrates both planning ability and adaptive flexibility—Claude can modify its approach when the intended outcome changes.

This all seems explainable via shallow next-token prediction. Why is it that subtracting the concept means the system can adapt and create a new rhyme instead of forgetting about the -bit rhyme, but overriding it with green means the system cannot adapt? Why didn't it say "green habit" or something? It seems like Anthropic is having it both ways: Claude continued to rhyme after deleting the concept, which demonstrates planning, but also Claude coherently filled in the "green" line despite it not rhyming, which...also demonstrates planning? Either that concept is "last word" or it's not! There is a tension that does not seem coherent to me, but maybe if they had n=2 instead of n=1 examples I would have a clearer idea of what they mean. As it stands it feels arbitrary and post hoc. More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
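For the record, the kind of intervention being debated can be sketched in a few lines. This is a toy, not Anthropic's actual method - they intervene on learned transformer features, whereas here a "concept" is just a random unit vector in a small made-up vector space, with invented names and dimensions:

```python
import math
import random

# Toy sketch of "subtracting" and "injecting" a concept direction from a
# hidden state. Everything here (names, dimension, vectors) is invented;
# real interpretability work operates on learned transformer features.
rng = random.Random(0)
DIM = 64

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

concepts = {name: unit([rng.gauss(0, 1) for _ in range(DIM)])
            for name in ("rabbit", "green")}

# A hidden state dominated by the "rabbit" direction, plus a little noise.
hidden = [0.9 * r + 0.1 * rng.gauss(0, 1) for r in concepts["rabbit"]]

def subtract_concept(h, c):
    """Project out the component of h along the unit concept direction c."""
    d = dot(h, c)
    return [x - d * ci for x, ci in zip(h, c)]

def inject_concept(h, c, scale=1.0):
    """Add the unit concept direction c into the hidden state."""
    return [x + scale * ci for x, ci in zip(h, c)]

ablated = subtract_concept(hidden, concepts["rabbit"])  # "rabbit" removed
steered = inject_concept(ablated, concepts["green"])    # "green" injected
```

Note that nothing in this sketch distinguishes "planning" from any other use of the direction - which is exactly the ambiguity at issue: the intervention shows the direction causally influences output, not what role it plays.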

suddenlybananas | 11 months ago

>They forced Claude to plan around a line ender that doesn't rhyme and it did. Claude didn't choose the word green, anthropic replaced the concept it was thinking ahead about with green and saw that the line changed accordingly.

I think the confusion here is from the extremely loaded word "concept," which doesn't really make sense here. At best, you can say that Claude planned for the next line to end with the word "rabbit," and that replacing the internal representation of that word with another word led the model to change its output.

TimorousBestie | 11 months ago

Agreed. They’ve discovered something, that’s for sure, but calling it “the language of thought” without concrete evidence is definitely begging the question.

miraculixx | 11 months ago

Came here to say this. Their paper reeks of wishful thinking, of labeling things in the terms they would prefer to be true. They even note in one place that their replacement model has 50% accuracy, which is simply a fancy way of saying its result is no better than chance and could be interpreted either way - like flipping a coin.

In reality, all that's happening is drawing samples from the joint probability distribution of the tokens in the context window. That's what the model is designed to do, trained to do - and that's exactly what it does. More precisely, that is what the algorithm does, using the model weights, the tokenized input (the "prompt"), and the previously generated output, one token at a time. Unless the algorithm is started (by a human, ultimately), nothing happens. Note how entirely different that is from any living being that actually thinks.
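The loop described above - weights plus context in, one sampled token out, repeat - can be sketched in a few lines. The "model" here is a hard-coded toy bigram table rather than a trained network; every word and probability is invented for the sketch:

```python
import random

# Toy autoregressive sampling loop. A real LLM would replace this table
# with a learned conditional distribution over the whole context window.
BIGRAM = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"ran": 1.0},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def generate(prompt, rng, max_tokens=10):
    """Sample one token at a time, conditioning on what was emitted so far
    (here just the last token, since the toy model is a bigram)."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = BIGRAM.get(tokens[-1])
        if dist is None:          # unknown context: nothing more to sample
            break
        words = list(dist)
        nxt = rng.choices(words, weights=[dist[w] for w in words])[0]
        if nxt == "<eos>":        # sampled stop token ends the loop
            break
        tokens.append(nxt)
    return tokens

out = generate(["the"], random.Random(0))
```

Nothing in the loop runs until `generate` is called, and each step only samples from the conditional distribution given the context so far.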

All interpretation above and beyond that is speculative and all intelligence found is entirely human.

danso | 11 months ago

tangent: this is the second time today I've seen an HN commenter use "begging the question" with its original meaning. I'm sorry to distract with a non-helpful reply, it's just I can't remember the last time I've seen that phrase in the wild to refer to a logical fallacy — even begsthequestion.info [0] has given up the fight.

(I don't mind language evolving over time, but I also think we need to save the precious few phrases we have for describing logical fallacies)

[0] https://web.archive.org/web/20220823092218/http://begtheques...