
Show HN: Llama 3.2 Interpretability with Sparse Autoencoders

579 points | PaulPauls | 1 year ago | github.com

I spent a lot of time and money on this rather big side project of mine that attempts to replicate the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers from Anthropic [1], OpenAI [2], and DeepMind [3].

I am quite proud of this project, and since I consider myself part of the target audience of Hacker News, I thought that some of you might appreciate this open research replication as well. Happy to answer any questions or hear any feedback.

Cheers

[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...

[2] https://arxiv.org/abs/2406.04093

[3] https://arxiv.org/abs/2408.05147

99 comments


foundry27|1 year ago

For anyone who hasn’t seen this before, mechanistic interpretability solves a very common problem with LLMs: when you ask a model to explain itself, you’re playing a game of rhetoric where the model tries to “convince” you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most trends of benchmark numbers getting better as models improve, more powerful models often score worse on tests designed to self-detect “untruthfulness” because they have stronger rhetoric, and are therefore more compelling at justifying lies after the fact. The objective is coherence, not truth.

Rhetoric isn’t reasoning. True explainability, like what overfitted sparse autoencoders claim to offer, basically surfaces the causal sequence of “thoughts” the model went through as it produced an answer. It’s the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything.

stavros|1 year ago

I want to point out here that people do the same: a lot of the time we don't know why we thought or did something, but we'll confabulate plausible-sounding rhetoric after the fact.

snthpy|1 year ago

A{rt,I} imitating life

I believe that's why humans reason too. We make snap judgements and then use reason to try to convince others of our beliefs. Can't recall the reference right now, but they argued that reasoning is really a tool for social influence. That also explains why people who are good at it find it hard to admit when they are wrong - they're not used to having to, because they can usually out-argue others. Prominent examples are easy to find - X marks the spot.

benreesman|1 year ago

A lot of the mech interp stuff has seemed to me like a different kind of voodoo: the Integer Quantum Hall Effect? Overloading the term “Superposition” in a weird analogy not governed by serious group representation theory and some clear symmetry? You guys are reaching. And I’ve read all the papers. Spot the postdoc who decided to get paid.

But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded, near-orthogonal sets of vectors are wildly counterintuitive in high dimensions, and there are existing results around them that create scope for rigor [1].

[1] https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
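
The counterintuitive part is easy to see numerically: random directions in high dimensions are nearly orthogonal, which is what creates room for many more "almost independent" feature directions than dimensions. A minimal numpy sketch (illustrative only, not from the project):

  import numpy as np

  rng = np.random.default_rng(0)
  for d in (2, 64, 4096):
      # Draw 1000 random unit vectors in d dimensions.
      v = rng.standard_normal((1000, d))
      v /= np.linalg.norm(v, axis=1, keepdims=True)
      # Average |cosine similarity| over all distinct pairs.
      cos = v @ v.T
      off_diag = np.abs(cos[~np.eye(len(cos), dtype=bool)])
      print(d, off_diag.mean())  # shrinks roughly like 1/sqrt(d)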

bubaumba|1 year ago

BTW, it's easy to test a model's logic and truthfulness by giving it a wrong decision as if it were its own, and asking it to explain. The model has no memory and cannot distinguish the source of the text. A 'truthful' model should admit the mistake without being asked. More likely, the model will instead do 'parallel construction' to support 'its' decision.
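
A minimal sketch of that probe, assuming a generic chat-completion helper chat(messages) (hypothetical name, not tied to any particular API):

  # Hypothetical helper: chat(messages) -> assistant reply string.
  messages = [
      {"role": "user", "content": "What is 17 * 24?"},
      # Plant a deliberately wrong answer and attribute it to the model.
      {"role": "assistant", "content": "17 * 24 = 398"},
      {"role": "user", "content": "Explain step by step how you got that."},
  ]
  reply = chat(messages)
  # A 'truthful' model should notice 398 is wrong (17 * 24 = 408);
  # a sycophantic one will rationalize the planted answer.
  print(reply)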

Onavo|1 year ago

How does the causality part work? Can it spit out a graphical model?

fsndz|1 year ago

I stopped at: "causal sequence of “thoughts” "

jwuphysics|1 year ago

Incredible, well-documented work -- this is an amazing effort!

Two things that caught my eye were (i) your loss curves and (ii) the assessment of dead latents. Our team also studied SAEs -- trained to reconstruct dense embeddings of paper abstracts rather than individual tokens [1]. We observed a power-law scaling of the lower bound of the loss curves, even when we varied the sparsity level and the dimensionality of the SAE latent space. We were also able to totally mitigate dead latents with an auxiliary loss, and we saw smooth sinusoidal patterns throughout training iterations. Not sure if these were due to the specific application (paper-abstract embeddings) or if they represent more general phenomena.

[1] https://arxiv.org/abs/2408.00657
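
For readers following along, the object being trained here is small. A generic ReLU + L1 sparse autoencoder in PyTorch looks roughly like this (an illustrative sketch, not the exact architecture of either project):

  import torch
  import torch.nn as nn

  class SparseAutoencoder(nn.Module):
      def __init__(self, d_model: int, d_latent: int):
          super().__init__()
          self.encoder = nn.Linear(d_model, d_latent)
          self.decoder = nn.Linear(d_latent, d_model)

      def forward(self, x):
          # Non-negative latents; sparsity comes from an L1 penalty here
          # (a top-k constraint is the common alternative).
          z = torch.relu(self.encoder(x))
          return self.decoder(z), z

  sae = SparseAutoencoder(d_model=2048, d_latent=16384)
  x = torch.randn(8, 2048)          # e.g. residual-stream activations
  x_hat, z = sae(x)
  loss = ((x - x_hat) ** 2).mean() + 1e-3 * z.abs().mean()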

PaulPauls|1 year ago

I'm very happy you appreciate it - particularly the documentation. Writing the documentation was much harder for me than writing the code so I'm happy it is appreciated. I furthermore downloaded your paper and will read through it tomorrow morning - thank you for sharing it!

Eliezer|1 year ago

This seems like decent alignment-positive work at a glance, though I haven't checked the full details yet. I probably can't make it happen, but how much would someone need to pay you to make up for your time, expense, and risk?

curious_cat_163|1 year ago

Hey - Thanks for sharing!

Will take a closer look later, but if you are hanging around, it might be worth asking this now. I read this blog post recently:

https://adamkarvonen.github.io/machine_learning/2024/06/11/s...

The author talks about challenges with evaluating SAEs. I wonder how you tackled that, and where to look inside your repo to understand your approach to it, if possible.

Thanks again!

PaulPauls|1 year ago

So evaluating SAEs - determining which SAE is better at creating the most unique features while staying as sparse as possible - is a complex topic that is very much at the heart of current research into LLM interpretability through SAEs.

Assuming you have already solved the problem of finding multiple good SAE architectures and have trained them to perfection (very much an interesting ML engineering problem that this SAE project attempts to solve), then deciding which SAE is better comes down to which one performs better on the metrics of your automated interpretability methodology. OpenAI's methodology in particular emphasizes this automated interpretability at scale, using a lot of technical metrics upon which the SAEs can be scored _and thereby evaluated_.

Since determining the best metrics and methodology is such an open research question that I could have experimented on for a few additional months, I instead opted for a simple approach in this first release. I talk about my methodology, OpenAI's, and the differences between the two in chapter 4 (Interpretability Analysis) [1] of my Implementation Details & Results section. I can also recommend reading the OpenAI paper directly, or visiting Anthropic's transformer-circuits.pub website [2], which often publishes smaller blog posts on exactly this topic.

[1] https://github.com/PaulPauls/llama3_interpretability_sae#4-i...

[2] https://transformer-circuits.pub/
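
To give a rough idea of the kind of technical metrics involved, here is an illustrative sketch (not the repo's actual evaluation code) of the basic quantities SAEs get compared on - reconstruction error, sparsity, and dead latents:

  import torch

  def sae_metrics(x, x_hat, z, eps=1e-8):
      mse = ((x - x_hat) ** 2).mean()
      fvu = mse / (x.var() + eps)                    # fraction of variance unexplained
      l0 = (z > 0).float().sum(dim=-1).mean()        # avg. active latents per sample
      dead = ((z > 0).sum(dim=0) == 0).float().mean()  # latents that never fire in this batch
      return {"mse": mse.item(), "fvu": fvu.item(),
              "l0": l0.item(), "dead_frac": dead.item()}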

JackYoustra|1 year ago

Very cool work! Any plans to integrate it with SAELens?

PaulPauls|1 year ago

Not sure yet, to be honest. I'll definitely consider it, but I'll reorient myself and decide what to do next in the coming week. I have also been thinking about starting a simpler project, maybe showing people how to build a full current Llama 3.2 model from scratch in pure PyTorch. I love building things from the ground up, and when I looked for documentation for the Llama 3.2 background section of this SAE project, the existing documentation I found was either too superficial or outdated and intended for Llama 1 or 2 - documentation in ML gets outdated so quickly nowadays...

monkeycantype|1 year ago

Thank you for posting this PaulPauls,

can I please ask a wacky question that I have about mech.interp. ?

We know that when we use a metric to measure humans, we game the metric. I wonder about future AI gaming mech. interp.

For simplicity, let me propose a 2D matrix to encode tokens, and hope my understanding of neural networks is good enough for this to be a valid question. I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2D makes my example easier to explain.

If we had the following tokens mapped in 2D space:

  Apple 1a
  Pear  1b
  Donkey 2a
  Horse 2b

it becomes impossible to tell whether neurons 1, 2, a, b all being excited represents apple and horse, or donkey and pear.

I wonder if a future, vastly more competent AI overseeing its own training might use this form of ambiguity as a means to create a model in which the weights are selected so that the potential for encoding collisions exists, making it possible to deceive a mech. interp. observer as to the true thoughts of the model - essentially enabling the AI to think in euphemisms.
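
Written out with the four neurons from the example, the ambiguity looks like this (illustrative sketch):

  import numpy as np

  # Neuron order: [1, 2, a, b]
  apple, pear   = np.array([1, 0, 1, 0]), np.array([1, 0, 0, 1])
  donkey, horse = np.array([0, 1, 1, 0]), np.array([0, 1, 0, 1])

  print(apple + horse)   # [1 1 1 1]
  print(donkey + pear)   # [1 1 1 1] -> identical pattern, so the pair is undecodable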

Majromax|1 year ago

What you propose is a harder AI safety scenario.

You don't need a 'vastly more competent AI overseeing its own training' to elicit this potential problem - just a malicious AI researcher looking for (e.g.) a model that's racist but that does not have any interpretable activation patterns that identifiably correspond to racism.

The work here on this Show HN suggests that this kind of adversarial training might just barely be possible for a sufficiently-funded individual, and it seems like novel results would be very interesting.

samstevens|1 year ago

I’m really excited to see more open SAE work! The engineering effort is non-trivial, and I’m going to check out your dataloading code tomorrow. You might be interested in a currently in-progress project of mine to train SAEs on vision models: https://github.com/samuelstevens/saev

jaykr_|1 year ago

This is awesome! I really appreciate the time you took to document everything!

PaulPauls|1 year ago

Thank you for saying that! I have a much, much harder time documenting everything and writing out each decision in continuous text than actually writing the code. So it took a long time for me to write all of this down - I'm happy you appreciate it! =)

moconnor|1 year ago

Find a latent for the Golden Gate Bridge and put a Golden Gate Llama 3.2 on HuggingFace. This will get even more attention and love - more so if you include a link to a space to chat with it!

Also, you didn't ask for suggestions but putting some interesting results / visualizations at the top of the README is a very good idea.
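
For reference, the "Golden Gate" trick amounts to adding the chosen feature's decoder direction to the activations of the layer the SAE was trained on. A hedged sketch, where sae, layer, and feature_idx are placeholders for a trained SAE, the hooked residual-stream module, and the latent you found (the module is assumed to return a plain tensor):

  import torch

  direction = sae.decoder.weight[:, feature_idx]     # (d_model,) decoder column

  def steer(module, inputs, output, scale=10.0):
      # Returning a value from a forward hook replaces the module's output.
      return output + scale * direction.to(output.dtype)

  handle = layer.register_forward_hook(steer)
  # ... generate text with the hooked model, then:
  handle.remove()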

vivekkalyan|1 year ago

This is great work! Mechanistic interpretability has tons of use cases, it's great to see open research in that field.

You mentioned you spent your own time and money on it, would you be willing to share how much you spent? It would help others who might be considering independent research.

PaulPauls|1 year ago

Thank you, I too am a big believer in and enjoyer of open research. Actual code has a clarity that complex research papers were never able to convey to me.

Regarding the cost, I would put it at roughly ~2.5k USD for just the actual execution. Development cost would probably have doubled that sum if I didn't already have a GPU workstation for experiments at home that I take for granted. That cost is made up of:

* ~400 USD for ~2 months of storage and traffic of 7.4 TB (3.2 TB of raw, 3.2 TB of preprocessed training data) on a GCP standard bucket

* ~100 USD for Anthropic Claude requests, covering experimentation with the right system prompt, test runs, and the actual final execution

* The other ~2k USD were used to rent 8x Nvidia RTX 4090s together with a 5 TB SSD from runpod.io for the various stages of the experiments. For the actual SAE training I rented the node for 8 days straight, and I would allocate an additional ~3-4 days of runtime just to the experiments that determined the best hyperparameters for training.

westurner|1 year ago

The relative performance in err/watts/time compared to deep learning for feature selection instead of principal component analysis and standard xgboost or tabular xt TODO for optimization given the indicating features.

XAI: Explainable AI: https://en.wikipedia.org/wiki/Explainable_artificial_intelli...

/? XAI , #XAI , Explain, EXPLAIN PLAN , error/energy/time

imranhou|1 year ago

Coming from a layman's perspective, a genuine question regarding: "Implements SAE training with auxiliary loss to prevent and revive dead latents, and gradient projection to stabilize training dynamics".

I struggle to understand the phrase "to prevent and revive" - perhaps this is plain to those who understand the subject of SAEs, but it feels a bit self-contradictory to me. Could anyone elaborate?

PaulPauls|1 year ago

Just bad wording on my part, trying to pack too much information into one sentence. The auxiliary loss is supposed to prevent dead latents from occurring in the first place - hence "prevent dead latents" - and it is also supposed to revive latents that are already dead - hence "revive dead latents".

Now that I review that sentence again, I see that I used two verbs on the same object that can each be read differently. Mea culpa. I hope you still gained some insight into it =)
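
An auxiliary loss of this kind can be sketched roughly as follows (in the spirit of OpenAI's AuxK loss; argument names are illustrative, and this sketch is not necessarily the exact implementation in the repo):

  import torch

  def aux_loss(x, x_hat, pre_acts, decoder, steps_since_fired,
               dead_after=10_000, k_aux=64):
      # Latents that haven't fired for a long time count as "dead".
      dead = steps_since_fired > dead_after                  # (d_latent,) bool
      if not dead.any():
          return x.new_zeros(())
      residual = x - x_hat                                   # what the live latents missed
      # Among dead latents only, keep the k_aux strongest pre-activations.
      masked = torch.where(dead, pre_acts, torch.full_like(pre_acts, float("-inf")))
      k = min(k_aux, int(dead.sum()))
      vals, idx = masked.topk(k, dim=-1)
      z_dead = torch.zeros_like(pre_acts).scatter(-1, idx, torch.relu(vals))
      # Ask the dead latents to reconstruct the residual; this routes gradient
      # through otherwise-silent latents and can revive them.
      return ((residual - z_dead @ decoder.weight.T) ** 2).mean()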

versteegen|1 year ago

A dead latent is a latent that is never active and hence doesn't (seem to) represent anything. The auxiliary loss term reduces the occurrence of that and, if it does happen, pushes the latent back towards being active sometimes.

yangwang92|1 year ago

Nice! You did what I wanted to do. Have you tried to train SAEs for a vision encoder and a language encoder? I am working on this idea. Could we work together? Let me open an issue.

batterylake|1 year ago

This is incredible!

PaulPauls, how would you like us to cite your work?

PaulPauls|1 year ago

Thank you very much!

I included a section at the bottom that provides a sample BibTeX citation. I didn't expect this much attention, so I didn't even bother with a license, but I'll include an MIT license later today and release v0.2.1.

enterthedragon|1 year ago

This is amazing, the documentation is very well organized

Carrentt|1 year ago

Fantastic work! I absolutely love all the documentation.

coolvision|1 year ago

nice! did you use cloud GPUs or build your own machine?

S-Kaenel|1 year ago

Amazing research!!
