Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
579 points | PaulPauls | 1 year ago | github.com
I am quite proud of this project, and since I consider myself part of the target audience for HackerNews, I thought that maybe some of you would appreciate this open research replication as well. Happy to answer any questions or face any feedback.
Cheers
[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
foundry27|1 year ago
Rhetoric isn’t reasoning. True explainability, like what overfitted Sparse Autoencoders claim to offer, basically results in the causal sequence of “thoughts” the model goes through as it produces an answer. It’s the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything.
stavros|1 year ago
snthpy|1 year ago
I believe that's why humans reason too. We make snap judgements and then use reason to try to convince others of our beliefs. Can't recall the reference right now, but they argued that reasoning is really a tool for social influence. That also explains why people who are good at it find it hard to admit when they are wrong - they're not used to having to, because they can usually out-argue others. Prominent examples are easy to find - X marks de spot.
benreesman|1 year ago
But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded near-orthogonal vector spaces are wildly counterintuitive in high dimensions, and there are existing results around them that create scope for rigor [1].
[1] https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
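For intuition, here is a minimal NumPy sketch (my own illustration, not from the repo or from [1]) of how quickly random directions become nearly orthogonal as the dimension grows, which is what makes packing many features into a bounded space plausible:

    import numpy as np

    rng = np.random.default_rng(0)

    for dim in (2, 32, 512, 4096):
        # Draw 10,000 random unit vectors in `dim` dimensions.
        v = rng.standard_normal((10_000, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        # Cosine similarity of every vector with the first one.
        cos = v[1:] @ v[0]
        print(f"dim={dim:5d}  mean |cos|={np.abs(cos).mean():.3f}  "
              f"max |cos|={np.abs(cos).max():.3f}")
    # As dim grows, |cos| concentrates near 0 (roughly 1/sqrt(dim)): random
    # directions become almost orthogonal, so a d-dimensional space can hold
    # far more than d "nearly distinct" directions.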
bubaumba|1 year ago
Onavo|1 year ago
fsndz|1 year ago
jwuphysics|1 year ago
Two things that caught my eye were (i) your loss curves and (ii) the assessment of dead latents. Our team also studied SAEs -- trained to reconstruct dense embeddings of paper abstracts rather than individual tokens [1]. We observed a power-law scaling of the lower bound of the loss curves, even when we varied the sparsity level and the dimensionality of the SAE latent space. We were also able to totally mitigate dead latents with an auxiliary loss, and we saw smooth sinusoidal patterns throughout training iterations. Not sure if these were due to our specific application (paper-abstract embeddings) or if they represent more general phenomena.
[1] https://arxiv.org/abs/2408.00657
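For readers wondering what such an auxiliary loss can look like, below is a simplified PyTorch sketch in the spirit of the top-k/AuxK recipe (my own illustration, not the code from [1] or from the repo; hyperparameters are arbitrary). Latents that have not fired for many steps get their own reconstruction target, the residual error, which gives them gradient signal again:

    import torch
    import torch.nn as nn

    class TopKSAE(nn.Module):
        def __init__(self, d_model, d_latent, k=32, k_aux=256, aux_coef=1 / 32, dead_after=200):
            super().__init__()
            self.enc = nn.Linear(d_model, d_latent)
            self.dec = nn.Linear(d_latent, d_model, bias=False)
            self.k, self.k_aux, self.aux_coef, self.dead_after = k, k_aux, aux_coef, dead_after
            # Steps since each latent last fired; used to flag dead latents.
            self.register_buffer("steps_since_fired", torch.zeros(d_latent))

        def forward(self, x):
            pre = self.enc(x)
            # Main path: keep only the top-k latents per example.
            topk = torch.topk(pre, self.k, dim=-1)
            z = torch.zeros_like(pre).scatter(-1, topk.indices, topk.values.relu())
            recon = self.dec(z)
            main_loss = (recon - x).pow(2).mean()

            # Bookkeeping: which latents fired anywhere in this batch?
            fired = (z > 0).any(dim=0)
            self.steps_since_fired[fired] = 0
            self.steps_since_fired[~fired] += 1

            # Auxiliary path: latents that have been silent too long get to
            # reconstruct the residual error, which revives them over time.
            dead = self.steps_since_fired > self.dead_after
            aux_loss = torch.tensor(0.0, device=x.device)
            if dead.any():
                pre_dead = pre.masked_fill(~dead, float("-inf"))
                aux_topk = torch.topk(pre_dead, min(self.k_aux, int(dead.sum())), dim=-1)
                z_aux = torch.zeros_like(pre).scatter(-1, aux_topk.indices, aux_topk.values.relu())
                aux_loss = (self.dec(z_aux) - (x - recon.detach())).pow(2).mean()

            return main_loss + self.aux_coef * aux_loss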
PaulPauls|1 year ago
Eliezer|1 year ago
curious_cat_163|1 year ago
Will take a closer look later, but if you are hanging around now, it might be worth asking this right away. I read this blog post recently:
https://adamkarvonen.github.io/machine_learning/2024/06/11/s...
And the author talks about challenges with evaluating SAEs. I wonder how you tackled that, and where to look inside your repo to understand your approach, if possible.
Thanks again!
PaulPauls|1 year ago
Assuming you have already solved the problem of finding multiple perfect SAE architectures and trained them to perfection (very much an interesting ML engineering problem that this SAE project attempts to solve), deciding which SAE is better comes down to which SAE performs better on the metrics of your automated interpretability methodology. OpenAI's methodology in particular emphasizes this automated interpretability at scale, utilizing a lot of technical metrics upon which the SAEs can be scored _and thereby evaluated_.
Since determining the best metrics and methodology is such an open research question that I could've experimented on for a few additional months, I instead opted for a simple approach in this first release. I discuss my methodology, OpenAI's, and the differences between the two in chapter 4 (Interpretability Analysis) [1] of my Implementation Details & Results section. I can also recommend reading the OpenAI paper directly, or visiting Anthropic's transformer-circuits.pub website [2], which often publishes smaller blog posts on exactly this topic.
[1] https://github.com/PaulPauls/llama3_interpretability_sae#4-i...
[2] https://transformer-circuits.pub/
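As a rough illustration of the cheaper, fully automatable end of that evaluation spectrum (this is not the repo's actual evaluation code; `encode`/`decode` are assumed method names), metrics like reconstruction error, fraction of variance explained, L0 sparsity, and dead-latent fraction can all be computed directly from a batch of activations:

    import torch

    @torch.no_grad()
    def sae_metrics(sae, activations):
        """activations: (n_tokens, d_model) residual-stream activations."""
        z = sae.encode(activations)    # assumed encoder method
        recon = sae.decode(z)          # assumed decoder method
        mse = (recon - activations).pow(2).mean()
        var = (activations - activations.mean(0)).pow(2).mean()
        active = z > 0
        return {
            "mse": mse.item(),
            "fraction_variance_explained": (1 - mse / var).item(),
            "l0": active.float().sum(-1).mean().item(),                   # avg active latents per token
            "dead_latent_fraction": (active.sum(0) == 0).float().mean().item(),
        }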
OrangeMusic|1 year ago
dimitry12|1 year ago
JackYoustra|1 year ago
PaulPauls|1 year ago
monkeycantype|1 year ago
Can I please ask a wacky question that I have about mech. interp.?
We know that when we use a metric to measure humans, we game the metric. I wonder about a future AI gaming mech. interp.
For simplicity, let me propose a 2D matrix to encode tokens, and hope my understanding of neural networks is good enough for this to be a valid question. I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2D makes my example easier to explain.
If we had the following tokens mapped in 2D space - say apple at (1, a), pear at (1, b), donkey at (2, a), and horse at (2, b) - it becomes impossible to understand whether neurons 1, 2, a, b all being excited represents apple and horse, or donkey and pear. I wonder if a future, vastly more competent AI overseeing its own training might use this form of ambiguity as a means to create a model in which the weights are selected so that the potential for encoding collisions exists, making it possible to deceive a mech. interp. observer as to the true thoughts of the model, essentially enabling the AI to think in euphemisms.
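A tiny illustration of the collision (my own toy example, matching the 2x2 mapping above):

    # Hypothetical 2x2 "neuron grid" encoding of four tokens.
    encoding = {
        "apple":  {"1", "a"},
        "pear":   {"1", "b"},
        "donkey": {"2", "a"},
        "horse":  {"2", "b"},
    }

    def active_neurons(tokens):
        """Union of the neurons excited by a set of co-occurring tokens."""
        active = set()
        for t in tokens:
            active |= encoding[t]
        return active

    print(active_neurons({"apple", "horse"}))  # {'1', '2', 'a', 'b'}
    print(active_neurons({"donkey", "pear"}))  # {'1', '2', 'a', 'b'} -- indistinguishable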
Majromax|1 year ago
You don't need a 'vastly more competent AI overseeing its own training' to elicit this potential problem, just a malicious AI researcher looking for (e.g.) a model that's racist but that does not have any interpretable activation patterns that identifiably correspond to racism.
The work here on this Show HN suggests that this kind of adversarial training might just barely be possible for a sufficiently-funded individual, and it seems like novel results would be very interesting.
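As a purely speculative sketch of what such adversarial training could look like (none of this is from the repo; `probe_direction` is an assumed, already-known direction for the concept the trainer wants to hide), one could add a penalty that drives hidden-state projections onto a known interpretability probe toward zero while the ordinary task loss keeps the behavior itself intact:

    import torch

    def concealed_task_loss(task_loss, hidden_states, probe_direction, coef=1.0):
        """hidden_states: (batch, seq, d_model); probe_direction: (d_model,) unit
        vector for the concept the probe detects (assumed known to the trainer)."""
        probe_activation = hidden_states @ probe_direction   # projection onto the probe
        concealment = probe_activation.pow(2).mean()          # push the projection toward zero
        return task_loss + coef * concealment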
samstevens|1 year ago
lynx23|1 year ago
jaykr_|1 year ago
PaulPauls|1 year ago
moconnor|1 year ago
Also, you didn't ask for suggestions but putting some interesting results / visualizations at the top of the README is a very good idea.
vivekkalyan|1 year ago
You mentioned you spent your own time and money on it, would you be willing to share how much you spent? It would help others who might be considering independent research.
PaulPauls|1 year ago
Regarding the cost, I would sum it up to roughly ~2.5k USD for just the actual execution. Development cost would've probably doubled that sum if I didn't already have a GPU workstation for experiments at home that I take for granted. That cost is made up of:
* ~400 USD for ~2 months of storage and traffic of 7.4 TB (3.2 TB of raw, 3.2 TB of preprocessed training data) on a GCP standard bucket
* ~100 USD for Anthropic Claude requests, covering experimentation with the right system prompt, test runs, and the actual final execution
* The other ~2k USD were used to rent 8x Nvidia RTX 4090s together with a 5 TB SSD from runpod.io for various stages of the experiments. For the actual SAE training I rented the node for 8 days straight, and I would allocate an additional ~3-4 days of runtime just for experiments to determine the best hyperparameters for training.
westurner|1 year ago
XAI: Explainable AI: https://en.wikipedia.org/wiki/Explainable_artificial_intelli...
/? XAI , #XAI , Explain, EXPLAIN PLAN , error/energy/time
westurner|1 year ago
> TabPFN: https://github.com/automl/TabPFN .. https://x.com/FrankRHutter/status/1583410845307977733 [2022]
"TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second" (2022) https://arxiv.org/abs/2308.08945
> FWIU TabPFN is Bayesian-calibrated/trained with better performance than xgboost for non-categorical data
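For reference, TabPFN exposes a scikit-learn-style estimator, so basic usage looks roughly like the following (illustrative only; constructor options differ across TabPFN versions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = TabPFNClassifier()      # pretrained prior-fitted transformer, no tuning required
    clf.fit(X_train, y_train)     # "fitting" conditions the model on the data; takes ~seconds
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))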
westurner|1 year ago
> /? awesome "explainable ai" https://www.google.com/search?q=awesome+%22explainable+ai%22
- (Many other great resources)
- https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master... :
> Post model-creation analysis, ML interpretation/explainability
> /? awesome "explainable ai" "XAI"
imranhou|1 year ago
I struggle to understand the phrase "to prevent and revive" - perhaps this is plain to those who understand the subject of SAEs, but it feels a bit self-contradictory to me. Could anyone elaborate?
PaulPauls|1 year ago
Now that I review that sentence again, I see that I used two verbs on the same subject that could be interpreted differently depending on the verb. Mea culpa. I hope you still gained some insights into it =)
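Assuming the phrase refers to dead latents (latents that stop firing during SAE training, as discussed elsewhere in this thread): "prevent" covers keeping latents from dying in the first place (e.g. via an auxiliary loss), while "revive" covers resampling latents that die anyway. A rough sketch of the "revive" half (my own illustration, not the repo's code; the `encode`/`decode` methods and weight layouts are assumptions):

    import torch

    @torch.no_grad()
    def resample_dead_latents(sae, activations, fired_counts, eps=1e-6):
        """Re-initialize latents that never fired, pointing them at inputs the
        SAE currently reconstructs poorly (a common resampling heuristic)."""
        dead = fired_counts == 0                               # (d_latent,) bool
        if not dead.any():
            return
        recon = sae.decode(sae.encode(activations))            # assumed SAE API
        losses = (recon - activations).pow(2).mean(-1)         # per-example error
        probs = losses / (losses.sum() + eps)
        idx = torch.multinomial(probs, int(dead.sum()), replacement=True)
        new_dirs = activations[idx]
        new_dirs = new_dirs / (new_dirs.norm(dim=-1, keepdim=True) + eps)
        sae.dec.weight[:, dead] = new_dirs.T                   # assumed decoder layout (d_model, d_latent)
        sae.enc.weight[dead] = new_dirs                        # assumed encoder layout (d_latent, d_model)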
versteegen|1 year ago
yangwang92|1 year ago
batterylake|1 year ago
PaulPauls, how would you like us to cite your work?
PaulPauls|1 year ago
I included a section at the bottom that provides a sample BibTeX citation. I didn't expect this much attention, so I didn't even bother with a license, but I'll include an MIT license later today and release 0.2.1.