oli5679 | 3 months ago
It's believed that dense models cram many features into shared weights (superposition), making circuits hard to interpret.
Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.
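A minimal sketch of the superposition idea: pack more sparse features than neurons into a shared weight matrix, and reading any one feature back picks up interference from the others. (The dimensions and the random unit-vector construction here are illustrative assumptions, not from any specific paper.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy superposition: 8 sparse "features" share only 4 dense neurons.
n_features, n_neurons = 8, 4
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit feature directions

# Activate a single feature (feature 3) and embed it densely.
x = np.zeros(n_features)
x[3] = 1.0
h = x @ W  # dense hidden activation shared by all features

# Projecting back onto every feature direction shows interference:
# feature 3 reads back at 1.0, but the other readouts are nonzero
# because non-orthogonal directions overlap in the shared space.
readout = W @ h
print(readout.round(2))
```

With orthogonal (or dedicated) directions per feature the off-target readouts would all be zero, which is the isolation that sparsity buys.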
HarHarVeryFunny | 3 months ago
https://transformer-circuits.pub/2025/attribution-graphs/met...
leogao | 3 months ago
See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093