luisml77 | 4 months ago

Complex output can sometimes give you the wrong idea, I agree. For instance, a study Anthropic did a while back showed that, when an LLM was asked HOW it performed a mathematical computation (35 + 59), the answer it gave differed from the mechanistic interpretation of its layers [1]. This showed LLMs can be deceptive. But they are also trained to be deceptive: supervised fine-tuning is imitation learning, which leads the model to give the conventional human explanation, such as "I sum 5+9 first, then carry the remainder... etc", rather than actually examining its past keys and values.

But that does not mean those keys and values can't be examined. They encode the intermediate results of each layer, and can be inspected to identify patterns. What the Anthropic researchers did was examine how the tokens for 35 and 59 were fused together across the layers, comparing them to other tokens such as 3, 5, and 9. For an LLM, tokens are high-dimensional concepts. This is why you can compare the vectors to each other, measure their similarity, and thereby break down the thought process.

This is exactly what I have been discussing above. Underneath each token prediction, this black magic is happening: the model fuses concepts through weighted summation of vectors (the attention scores), then the merged representations are passed through the MLPs to produce a refined, fused idea, often adding new knowledge stored inside the network. And this continues layer after layer: a repeated combination of concepts that starts with understanding the structure and order of the language itself, and ends with manipulation of complex mathematical concepts almost detached from the original tokens themselves.
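To make the mechanics concrete, here is a minimal numpy sketch of that "fuse, then refine" step and of the vector-comparison probe. The weights and embeddings are random stand-ins (real models learn them, and use thousands of dimensions), so the printed similarities are illustrative only; the structure is the point: attention takes a softmax-weighted sum of value vectors, an MLP refines the result, and cosine similarity lets you ask which token concepts an intermediate representation is close to.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative hidden size; real models use thousands of dimensions

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy token embeddings: random stand-ins for learned vectors.
tokens = {t: rng.normal(size=d) for t in ["35", "59", "3", "5", "9"]}

def attention_layer(x, Wq, Wk, Wv):
    """One self-attention step: each position takes a weighted sum of
    value vectors, with weights given by softmaxed attention scores."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v  # the "summation of vectors" that fuses concepts

def mlp(x, W1, W2):
    """Position-wise MLP that refines the fused representation."""
    return np.maximum(x @ W1, 0) @ W2

x = np.stack([tokens["35"], tokens["59"]])
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1

h = x + attention_layer(x, Wq, Wk, Wv)  # residual + attention
h = h + mlp(h, W1, W2)                  # residual + MLP

# Interpretability-style probe: compare a fused hidden state against the
# token embeddings by cosine similarity, the way one might ask which
# concepts an intermediate representation has moved toward.
for t in ["35", "59", "3", "5", "9"]:
    print(t, round(cosine(h[0], tokens[t]), 3))
```

A real model stacks dozens of these layers, and the actual Anthropic work uses far more sophisticated tooling than a raw cosine probe, but the shape of the computation is the same.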

Even though complex output can be deceptive about the underlying mental model used to produce it, in my personal experience LLMs have produced output for me that must imply extremely complex internal behaviour, with all the characteristics I mentioned before. Namely, I frequently program with LLMs, and there is simply zero probability that their output tokens exist WITHOUT the model first having reasoned at a very deep level about the unique problem I presented to it. And I think anyone who has used these models as extensively as I have, and interacted with them this much, knows that behind each token there is this black magic.

To summarize, I am not being naive by saying I believe everything my LLM says to me. Rather, I know very intimately when the LLM is deceiving me and when it's producing output whose underlying mental model must have been very advanced. And this comes from personal experience playing with this technology, both inference and training.

[1] https://www.anthropic.com/research/tracing-thoughts-language...
