This is exceptionally cool. Not only is it very interesting to see how this can be used to better understand and shape LLM behavior; I can’t help but also think it’s an interesting roadmap for human anthropology.
If we see LLMs as substantial compressed representations of human knowledge/thought/speech/expression—and within that, a representation of the world around us—then dictionary concepts that meaningfully explain this compressed representation should also share structure with human experience.
I don’t mean to take this canonically, it’s representations all the way down, but I can’t help but wonder what the geometry of this dictionary concept space says about us.
I find Anthropic's work on mech interp fascinating in general. Their initial Towards Monosemanticity paper was highly surprising, and so is this one, with its ability to scale to a real production-scale LLM.
My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there is none? Or, to put it another way: were the features always there, learnt during training, or are we doing post-hoc rationalisation -- the features exist because that's how we defined the autoencoders' dictionaries, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?
Maybe this distinction doesn't even make sense to begin with; concepts are made by man. If clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.
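Concretely, the decomposition in question looks something like this - a minimal sketch of a sparse autoencoder trained to reconstruct middle-layer activations under an L1 sparsity penalty. The sizes and training loop are illustrative guesses on my part, not Anthropic's actual setup:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Decomposes d_model-dim activations into a wider set of sparse features.
        def __init__(self, d_model=512, n_features=4096):  # toy sizes
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, acts):
            feats = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(feats)             # attempted reconstruction
            return recon, feats

    sae = SparseAutoencoder()
    acts = torch.randn(32, 512)  # stand-in for real middle-layer activations
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 3e-4 * feats.abs().sum(-1).mean()
    loss.backward()  # repeat over many batches of activations from the model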
Their manipulation of the vectors, and the effects it produced, would suggest that the SAE isn't just finding phantom representations that aren't really there.
Damage part X of the network and see what happens. If the subject loses the ability to do Y, then X is responsible for Y.
See https://en.wikipedia.org/wiki/Phineas_Gage
I'm allergic to "latent space" because I've yet to find any meaning in it beyond poetics, and I develop an acute allergy when it's explicitly related to visually dimensional ideas like clustering.
I'll make a probably bad analogy: does your mindmap place things near each other the way my mindmap does?
To which I'd say: probably not. Mindmaps are very personal, and the more complexity we put into ours, the more personal and arbitrary they become, and the less import the visuals have.
E.g., if we each have 3 million things on our mindmaps, it's peering too closely to wonder why you put McDonald's closer to kids' food than to restaurants, with restaurants in the top left, whereas I put it closer to kids' food, in the top middle left.
I find this statement... controversial?
The canonical example would be mathematics - is it discovered or invented? Does the idea of '3', or an empty set, or a straight line exist without any humans thinking about it, and is it even necessary to have any kind of universe at all for these concepts to be valid? I think the answers here are 'yes' and 'no'.
Of course, there are still concepts which require grounding in the universe or humanity, but if you can think these up first (...somehow), you should need neither.
It would be interesting to allow users of models to customize inference by tweaking these features, sort of like a semantic equalizer for LLMs. My guess is that this wouldn't work as well as fine-tuning, since that would tweak all the features at once toward your use case, but the equalizer would require zero training data.
The prompt itself can trigger the features, so if you say "Try to weave in mentions of San Francisco" the San Francisco feature will be more activated in the response. But having a global equalizer could reduce drift as the conversation continued, perhaps?
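A sketch of what such an equalizer could look like, reusing the toy SAE above: clamp one feature in the encoded basis, decode back, and keep the reconstruction error so only that one "slider" moves. The feature index and level here are made up:

    import torch

    @torch.no_grad()
    def equalize(acts, sae, feature_idx, level):
        feats = torch.relu(sae.encoder(acts))
        err = acts - sae.decoder(feats)  # part of the signal the SAE can't explain
        feats[..., feature_idx] = level  # clamp one semantic "slider"
        return sae.decoder(feats) + err

    # Hypothetical use inside a forward hook on one transformer layer:
    # resid = equalize(resid, sae, feature_idx=1234, level=8.0)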
At least for right now this approach would in most cases still be like using a shotgun instead of a scalpel.
Over the next year or so I'm sure it will be refined enough to act more like a vector multiplier on activations, but simply flipping a feature on in general is going to create a very 'obsessed' model, as stated.
I was pretty upset seeing the superalignment team dissolve at OpenAI, but as is typical for the AI space, the news of one day was quickly eclipsed by the next day.
Anthropic are really killing it right now, and it's very refreshing seeing their commitment to publishing novel findings.
I hope this finally serves as the nail in the coffin on the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.
> on the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.
No matter what, there will always be a group of people saying that. The power and drive of the brain to convince itself that it is woven of magical energy on a divine substrate shouldn't be underestimated. Especially when media plays so hard into that idea (the robots that lose the war because they cannot overcome love, etc.), because brains really love being told they are right.
I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.
Love Anthropic research. Great visuals from Olah, Carter, and Pearce, as well.
I don’t think this paper does much in the way of your final point, “it doesn’t understand what it’s saying”, though our understanding certainly has improved.
I think the research is good, but it's disappointing that they hype it by claiming it's going to help their basically entirely fictional "AI safety" project, as if the bits in their model are going to come alive and eat them.
This reminds me of how people often communicate to avoid offending others. We tend to soften our opinions or suggestions with phrases like "What if you looked at it this way?" or "You know what I'd do in those situations." By doing this, we subtly dilute the exact emotion or truth we're trying to convey. If we modify our words enough, we might end up with a statement that's completely untruthful. This is similar to how AI models might behave when manipulated to emphasize certain features, leading to responses that are not entirely genuine.
Counterpoint: "What if you looked at it this way?" communicates both your suggestion AND your sensitivity to the person's social status or whatever. Given that humans are not robots but social, psychological animals, such communication is entirely justified and efficient.
A true AGI would learn to manipulate its environment to achieve its goals, but obviously we are not there yet.
An LLM has no goals - it's just a machine optimized to minimize training errors, although I suppose you could view this as an innate, hard-coded goal of minimizing next-word error (relative to the training set), in the same way we might say a machine-like insect has some "goals".
Of course, RLHF provides a longer-time-span error to minimize (the entire response vs. the next word), but I doubt the training volume is enough for the model to internally represent a goal of manipulating the listener, as opposed to just favoring certain surface forms of response.
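For concreteness, that innate "goal" is nothing more than this objective - standard next-token cross-entropy, sketched here:

    import torch.nn.functional as F

    def next_token_loss(logits, tokens):
        # logits: (batch, seq, vocab) from the model; tokens: (batch, seq) input ids.
        # The model at position t is scored on predicting token t+1.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )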
- LLMs just got a whole set of buttons you can push. Potential for the LLM to push its own buttons?
- Read the paper and ctrl+F 'deplorable'. This shows once again how we are underestimating LLMs' ability to appear conscious. It can be really effective. Reminiscent of Dr. Ford in Westworld: 'you (robots) never look more human than when you are suffering,' or something like that, anyway. I might be hallucinating dialogue, but I'm pretty sure something like that was said, and I think it's quite true.
- Intensely realistic roleplaying potential unlocked.
- Efficiency gains from reducing context length by directly amplifying certain features instead.
Very powerful stuff. I am waiting eagerly for when I can play with it myself. (Someone please make it a local feature.)
>Used "dictionary learning"
>Found abstract features
>Found similar/close features using distance
>Tried amplifying and suppressing features
Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
the interesting advance in the anthropic/mats research program is the application of dictionary learning to the "superpositioned" latent representations of transformers to find more "interpretable" features. however, "interpretability" is generally scored by the explainer/interpreter paradigm which is a bit ad hoc, and true automated circuit discovery (rather than simple concept representation) is still a bit off afaik.
Reminds me of this paper from a couple of weeks ago that isolated the "refusal vector" for prompts that caused the model to decline to answer certain prompts:
https://news.ycombinator.com/item?id=40242939
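As I understand the linked work, the core trick is a difference of mean activations between prompts the model refuses and prompts it accepts, with that direction projected out at inference time. A sketch, not the authors' code:

    import torch

    def refusal_direction(refused_acts, accepted_acts):
        # Difference-in-means over activations collected at some layer/position.
        d = refused_acts.mean(0) - accepted_acts.mean(0)
        return d / d.norm()

    def ablate(acts, direction):
        # Remove the activations' component along the refusal direction.
        return acts - (acts @ direction).unsqueeze(-1) * direction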
I love seeing the work here -- especially the way they identified a vector specifically for bad code. I've been exploring how we can use adversarial training to increase the quality of code generated by our LLMs, so using this technique to get contrasting examples of secure vs. insecure code (to bootstrap the training process) is really exciting.
Overall, fascinating stuff!!
Strategic timing for the release of this paper. As of last week, OpenAI looks weak in its commitment to _AI safety_, having lost key members of its superalignment team.
huge. the activation scan, which looks for which nodes change the most when prompted with the words "Golden Gate Bridge" and later an image of the same bridge, is eerily reminiscent of a brain scan under similar prompts...
I find this outcome expected and not really surprising; it's more confirmation of previous results. Consider vision transformers and the papers that showed what each layer was focused on.
I continue to be impressed by Anthropic’s work and their dual commitment to scaling and safety.
HN is often characterized by a very negative tone related to any of these developments, but I really do feel that Anthropic is trying to do a “race to the top” in terms of alignment, though it doesn’t seem like all the other major companies are doing enough to race with them.
Particularly frustrating on HN is the common syllogism of:
1. I believe anything that “thinks” must do X thing.
2. LLM doesn’t do X thing
3. LLM doesn’t think
X is usually poorly justified as constitutive of thinking (it's usually constitutive of human thinking, but not thinking writ large), and it's rarely explained why it matters whether the label of "thinking" applies to an LLM if the capabilities remain the same.
What is often frustrating to me, at least, is the arbitrary definition of "safety" and "ethics", forged by a small group of seemingly intellectually homogeneous individuals.
> Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
This seems like it's trivially true; if you find two different features for a concept in two different languages, just combine them and now you have a "multilingual feature".
Or are all of these features the same "size"? They might be and I might've missed it.
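One way to probe this, reusing the toy SAE above with made-up feature indices: compare the decoder directions of a candidate same-concept pair. A genuinely multilingual feature would be a single dictionary entry firing on both languages; two near-parallel entries (cosine similarity near 1) would support the "just combine them" reading:

    import torch.nn.functional as F

    # Hypothetical: columns of the decoder weight are the feature directions.
    v_en = sae.decoder.weight[:, 123]  # "concept X, English text" feature
    v_fr = sae.decoder.weight[:, 456]  # "concept X, French text" feature

    print(F.cosine_similarity(v_en, v_fr, dim=0))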
I wonder how interpretability and training can interplay. Some examples:
Imagine taking Claude, tweaking weights relevant to X and then fine tuning it on knowledge related to X. It could result in more neurons being recruited to learn about X.
Imagine performing this during training to amplify or reduce the importance of certain topics. Train it on a vast corpus, but tune at various checkpoints so the neural network's knowledge distribution skews the way you want. This could be a way to get more performance from MoE models.
I am not an expert. Just putting on my generalist hat here. Tell me I'm wrong because I'd be fascinated to hear the reasons.
At the risk of anthropomorphizing too much, I can't help but see parallels between the "my physical form is the Golden Gate Bridge" screenshot and the https://en.wikipedia.org/wiki/God_helmet in humans --- both cognitive distortions caused by targeted exogenous neural activation.
We are so far ahead in the case of these models - we already have the complete wiring diagram! In biological systems we have only just begun to be able to create complete neuronal wiring diagrams - currently worms and flies, perhaps soon mice.
It's interesting that they used this to manipulate models. I wonder if "intentions" can be found and tuned. That would have massive potential for use and misuse. I could imagine a villain taking a model and amplifying "the evil" using a similar technique.
They explicitly aren't releasing any tools to do this with their models for safety reasons. But you could probably do it from scratch with one of the open models by following their methodology.
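Step one of that from-scratch recipe might look like this, assuming a Llama-style model via Hugging Face transformers - the layer index and hook point here are arbitrary choices, not the paper's:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Meta-Llama-3-8B"  # any open model with accessible layers
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

    acts = []
    def grab(module, args, output):
        acts.append(output[0].detach())  # hidden states leaving this block

    handle = model.model.layers[16].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok("The Golden Gate Bridge", return_tensors="pt"))
    handle.remove()  # acts now holds SAE training data for this layer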
https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transforme...
Basically, it finds that transformers don't just store a world-model in the sense of "what does the world that produced the observed inputs look like?" - they store a "Mixed-State Presentation": a weighted set of possible worlds that could have produced the observed inputs.
That was the first research work that clued me into what Anthropic's work today ended up demonstrating.
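A toy version of the "weighted set of possible worlds" idea, assuming a known two-state hidden Markov process: the optimal predictor tracks a belief distribution over hidden states, updated by Bayes' rule, and that belief geometry is what the post reports finding linearly embedded in the residual stream:

    import numpy as np

    T = np.array([[0.9, 0.1],   # hidden-state transition probabilities
                  [0.2, 0.8]])
    E = np.array([[0.7, 0.3],   # emission probabilities P(obs | state)
                  [0.1, 0.9]])

    def update(belief, obs):
        # Propagate beliefs through the dynamics, then reweight by the observation.
        b = (belief @ T) * E[:, obs]
        return b / b.sum()

    belief = np.array([0.5, 0.5])   # uniform over "possible worlds"
    for obs in [0, 1, 1, 0]:
        belief = update(belief, obs)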
They target the residual stream. Also they may have a definition of “feature” that’s more general than what you’re using. Consider reading their superposition work.
For anyone who has read the paper, have they provided code examples or enough detail to recreate this with, say, Llama 3?
While they're concerned with safety, I'm much more interested in this as a tool for controllability. Maybe we can finally get rid of the woke customer service tone, and get AI to be more eclectic and informative, and less watered down in its responses.
So they made a system by trying out thousands of combinations to find the one that gives the best result, but they don't understand what's actually going on inside.
>what the model is "thinking" before writing its response
An actual "thinking machine" would be constantly running computations on its accumulated experience in order to improve its future output and/or further compress its sensory history.
An LLM is doing exactly nothing while waiting for the next prompt.
I disagree with this. That suggests that thinking requires persistent, malleable, non-static memory, which is not the case. You can reason about things without increasing your knowledge if you have a base set of logic.
I think the thing you were looking for was more along the lines of a persistent autonomous agent.
Frankly, this objection seems very weak.
Right, it's not doing anything between prompts, but each prompt is fed through each of the transformer layers (I think it was 96 layers for GPT-3) in turn, so we can think of this as a fixed N-steps of "thought" (analyzing prompt in hierarchical fashion) to generate each token.
I might be a complete brainlet, so excuse my take, but when animals think and do things, the weights in the brain are constantly being adjusted - old connections pruned out and new ones made, right? But once an LLM is trained, that's kind of it; nothing there changes when we discuss with it. As far as I understand from what I've read, even our memories are just somehow in the connections between the neurons.
If we figured out how to freeze and then revive brains, would that mean that all of the revived brains were no longer thinking because they had previously been paused at some point?
I’m so fascinated by this stuff but I’m having trouble staying motivated in this short attention span world.
I suspect the time is coming when there will always be an aligned search AI between you and the internet.