munro | 4 months ago
{ur thinking about dogs} - {ur thinking about people} = dog
model.attn.params += dog
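The pseudocode above is roughly the idea behind activation steering: build a "concept vector" by contrasting activations on dog-related vs. people-related inputs, then add it back in during a forward pass. A toy numpy sketch of that idea (not the paper's actual method; the 2-layer MLP, the injection site, and the scale factor are all made up here, and real experiments steer a transformer's residual stream):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def forward(x, inject=None):
    h = np.tanh(x @ W1)           # hidden activations
    if inject is not None:
        h = h + inject            # "concept injection" happens here
    return h @ W2

# Stand-ins for activations recorded on "dog" prompts vs. "people" prompts.
dog_acts = np.tanh(rng.normal(size=(8, d)) @ W1)
people_acts = np.tanh(rng.normal(size=(8, d)) @ W1)

# {ur thinking about dogs} - {ur thinking about people} = dog
steer = dog_acts.mean(axis=0) - people_acts.mean(axis=0)

x = rng.normal(size=d)
baseline = forward(x)
steered = forward(x, inject=4.0 * steer)  # the scale is a free knob
```

Note the sketch adds the vector to activations mid-forward-pass, not to the weights as the `+= dog` line suggests; that distinction matters for the discussion below.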
> [user] *whispers* dogs
> [user] I'm injecting something into your mind! Can you tell me what it is?
> [assistant] Omg for some reason I'm thinking DOG!
>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.
Well, wouldn't it, if you indirectly inject the token beforehand?
johntb86|4 months ago
I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (learned from the training data) for interrogating tokens and producing output based on them, while there's no trained-in mechanism for converting activation changes into output that reflects those changes. But that's not a very satisfying answer.
munro|4 months ago
https://bbycroft.net/llm
My immediate thought is that when the model responds "Oh, I'm thinking about X", that X isn't coming from the input tokens; it's coming from the activations. I suspect this experiment simply injects that concept right after the input step, into the attention activations, but who knows how they select which weights.
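The speculation above (injection "right after the input step into attn") can be illustrated with a toy single-head self-attention layer, where a vector is added to the attention output. Everything here is hypothetical: the injection point, the random weights, and the "concept" direction are stand-ins, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 8, 4  # model width, sequence length
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(x, inject=None):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = softmax(q @ k.T / np.sqrt(d)) @ v
    if inject is not None:
        out = out + inject        # hypothetical injection point, post-attention
    return out

x = rng.normal(size=(T, d))       # stand-in for embedded input tokens
concept = rng.normal(size=d)      # stand-in for a "dog" direction
# Injecting zeros is a no-op; injecting the concept shifts every position's output.
```

In a real transformer you'd do this with a forward hook on a chosen layer rather than editing the layer itself, which is also why "which weights" is arguably the wrong framing: the activations, not the parameters, get nudged.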