top | item 47138170

(no title)

rao-v | 5 days ago

+1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!

Having said that, I worry that you run into Illusion of Conscious issues where the model changes attrition from “sandbagging” to “unctuous” when you control its response because the response is generated outside of the attribution modules (I don’t quite understand how cleanly everything flows through the concept modules and the residual). Either way this is a sophisticated problem to have. Would love to see if this can be trained to parity with modern 8B models.

discuss

giang_at_glai|3 days ago

Actually, the model is forcing the response to be generated inside the attribution modules.