(no title)
lonk11 | 1 year ago
The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
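A rough sketch of why weight sharing helps when memory bandwidth dominates: a recurrent block reuses the same parameters across passes, so weight traffic is amortized, while a fixed-depth model of equal effective compute must stream distinct weights for every layer. The parameter counts and pass count below are illustrative, not from the paper.

```python
# Sketch (assumed numbers): per-token weight traffic for a fixed-depth
# 100B model vs. a weight-shared 25B block applied 4 times. With
# recurrence, the same 25B parameters can stay resident and be reused,
# while the 100B model must stream 100B distinct weights per token.
BYTES_PER_PARAM = 2  # fp16/bf16

def weight_bytes_streamed(params, passes, weights_reused=False):
    """Bytes of weights read from memory per token."""
    if weights_reused:
        return params * BYTES_PER_PARAM          # loaded once, reused each pass
    return params * BYTES_PER_PARAM * passes     # distinct weights every pass

fixed_100b = weight_bytes_streamed(100e9, passes=1)
recurrent_25b = weight_bytes_streamed(25e9, passes=4, weights_reused=True)
print(fixed_100b / recurrent_25b)  # -> 4.0x less weight traffic
```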
Tostino | 1 year ago
I meant it rhetorically, in reference to interpretability. I don't see a real difference between training a 100B-parameter model and a (fixed) 4x-recurrent 25B-parameter model as far as understanding what the model is `thinking` during next-token prediction.
You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, whether the model is simply fixed-size and very deep, or recurrent.
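The point above can be sketched with toy code: hook-style activation probes read intermediate states the same way whether depth comes from distinct layers or from one block applied repeatedly. The layer functions here are arbitrary stand-ins, not any real architecture.

```python
# Toy sketch: the same "hook" tooling inspects per-step activations
# whether the model has 4 distinct layers or 1 block recurring 4 times.
def fixed_forward(x, layers, hooks):
    for f in layers:           # 4 distinct layers (100B-style depth)
        x = f(x)
        hooks.append(x)        # probe the activation after each layer
    return x

def recurrent_forward(x, block, steps, hooks):
    for _ in range(steps):     # one shared block, applied 4 times
        x = block(x)
        hooks.append(x)        # same probe point, same tooling
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 1, lambda v: v * 2]
block = lambda v: v * 2 + 1

h_fixed, h_rec = [], []
fixed_forward(0, layers, h_fixed)        # h_fixed == [1, 2, 3, 6]
recurrent_forward(0, block, 4, h_rec)    # h_rec   == [1, 3, 7, 15]
```

Either way, the probe sees a bounded sequence of intermediate states before the token is emitted, which is the sense in which the model can only "scheme" so much per token.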