aithrowawaycomm | 11 months ago
Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?
I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.
famouswaffles | 11 months ago
I'm honestly confused at what you're getting at here. It doesn't matter why Claude chose "rabbit" to plan around, and in fact it likely did so because of "carrot"; the point is that it settled on the word beforehand. The "rabbit" concept is already active as the model is about to write the first word of the second line, even though the word "rabbit" won't come into play until the end of that line.
>later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”!
It's not supposed to rhyme. That's the point. They forced Claude to plan around a line ending that doesn't rhyme, and it did. Claude didn't choose the word "green"; Anthropic replaced the concept it was thinking ahead about with "green" and saw that the line changed accordingly.
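The logic of the intervention can be illustrated with a toy sketch (purely hypothetical code, not Anthropic's actual method): split generation into a planning step that picks the line's final word and a realization step conditioned on that plan, then overwrite the plan and watch the whole line change even though the prompt is untouched.

```python
# Toy illustration of the concept-patching experiment (hypothetical code,
# not Anthropic's method). The "model" first forms an internal plan (the
# line's last word), then generates the line conditioned on that plan.

def plan_line_ending(previous_ending: str) -> str:
    # Hypothetical planning step: pick a word that rhymes with the
    # previous line's last words ("grab it" -> "rabbit").
    rhymes = {"grab it": "rabbit"}
    return rhymes[previous_ending]

def generate_line(planned_ending: str) -> str:
    # Generation is steered toward the planned ending from the first
    # token onward, so the whole line reflects the plan.
    return f"His hunger was like a starving {planned_ending}"

prompt = "grab it"
normal = generate_line(plan_line_ending(prompt))  # line ends in "rabbit"

# Intervention: overwrite the internal plan with "green". The prompt is
# unchanged, but the generated line now ends in "green" and won't rhyme.
patched = generate_line("green")
```

The claim under test is exactly this separation: if patching the internal representation changes the line's ending coherently, the representation was functioning as a plan rather than as an after-the-fact correlate.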
aithrowawaycomm | 11 months ago
This all seems explainable via shallow next-token prediction. Why is it that subtracting the concept lets the system adapt and create a new rhyme, rather than forgetting about the -bit rhyme, while overriding it with "green" means the system cannot adapt? Why didn't it say "green habit" or something? It seems like Anthropic is having it both ways: Claude continued to rhyme after the concept was deleted, which demonstrates planning, but Claude also coherently filled in the "green" line despite it not rhyming, which... also demonstrates planning? Either that concept means "last word" or it doesn't! There is a tension here that does not seem coherent to me; maybe if they had n=2 instead of n=1 examples I would have a clearer idea of what they mean. As it stands it feels arbitrary and post hoc. More generally, they failed to rule out (or even consider!) that well-tuned-but-dumb next-token prediction explains this behavior.
suddenlybananas | 11 months ago
I think the confusion here is from the extremely loaded word "concept," which doesn't really make sense here. At best, you can say that Claude planned for the next line to end with the word "rabbit," and that replacing the internal representation of that word with another word led the model to change its output.
TimorousBestie | 11 months ago
miraculixx | 11 months ago
In reality, all that's happening is sampling from the joint probability of the tokens in the context window. That's what the model is designed to do and trained to do, and that's exactly what it does. More precisely, that is what the algorithm does, using the model weights, the tokenized input (the "prompt"), and the previously generated output, one token at a time. Unless the algorithm is started (ultimately by a human), nothing happens. Note how entirely different that is from any living being that actually thinks.
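The process being described can be sketched in a few lines. By the chain rule, the joint probability of a sequence factors into per-token conditionals, and generation samples from one conditional at a time; here a toy bigram table stands in for the model weights (the table and vocabulary are invented for illustration).

```python
# Minimal sketch of autoregressive sampling: the joint probability of a
# sequence factors into per-token conditionals (chain rule), and the
# algorithm samples one token at a time from the current conditional.
import random

def next_token_distribution(context):
    # Stand-in for the model weights: a toy bigram table mapping the
    # last token to a distribution over next tokens (invented values).
    table = {
        "<s>":    {"the": 0.6, "a": 0.4},
        "the":    {"rabbit": 0.5, "carrot": 0.5},
        "a":      {"rabbit": 0.5, "carrot": 0.5},
        "rabbit": {"</s>": 1.0},
        "carrot": {"</s>": 1.0},
    }
    return table[context[-1]]

def sample(rng, max_len=10):
    # Nothing happens until this loop is started; each step only draws
    # from the conditional given the context so far.
    context = ["<s>"]
    while context[-1] != "</s>" and len(context) < max_len:
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        context.append(rng.choices(tokens, weights=probs)[0])
    return context

out = sample(random.Random(0))
```

A real LLM differs only in scale: the conditional comes from a learned neural network over a long context rather than a bigram lookup, but the sampling loop is the same shape.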
All interpretation above and beyond that is speculative and all intelligence found is entirely human.
unknown | 11 months ago
[deleted]
danso | 11 months ago
(I don't mind language evolving over time, but I also think we need to save the precious few phrases we have for describing logical fallacies)
[0] https://web.archive.org/web/20220823092218/http://begtheques...