top | item 43019406


lonk11 | 1 year ago

Running one layer 4 times should fetch the weights of that layer once. Running 4 distinct layers makes you fetch 4x the parameters.

The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
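The bandwidth argument comes down to how many unique weights have to be streamed from memory per forward pass. A minimal numpy sketch (illustrative layer sizes, not taken from the paper) showing that a 4x-recurrent shared layer has the same depth but a quarter of the unique parameters of 4 distinct layers:

```python
import numpy as np

def make_layer(d, rng):
    # One dense layer: a d x d weight matrix plus a bias vector.
    return rng.standard_normal((d, d)) / np.sqrt(d), np.zeros(d)

def forward(x, layers):
    # Apply each (W, b) pair in sequence with a ReLU nonlinearity.
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

def n_unique_params(layers):
    # Count parameters only once per distinct weight tensor,
    # mirroring what actually has to be fetched from memory.
    seen, total = set(), 0
    for W, b in layers:
        if id(W) not in seen:
            seen.add(id(W))
            total += W.size + b.size
    return total

rng = np.random.default_rng(0)
d = 64

# 4 distinct layers: 4x the weights to store and stream.
stacked = [make_layer(d, rng) for _ in range(4)]

# 1 layer applied 4 times: same depth, 1x the weights.
shared = [make_layer(d, rng)] * 4

x = rng.standard_normal(d)
forward(x, stacked)  # same compute (FLOPs) either way...
forward(x, shared)   # ...but shared weights are fetched once

print(n_unique_params(stacked))  # 4 * (64*64 + 64) = 16640
print(n_unique_params(shared))   # 64*64 + 64 = 4160
```

Both paths do the same number of matrix multiplies, so when the workload is memory-bandwidth-bound (as small-batch inference usually is), the shared-weight version wins by rereading the same weights from cache instead of streaming fresh ones.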


Tostino | 1 year ago

Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.

I meant it rhetorically, in reference to interpretability. I don't see a real difference between training a 100B-parameter model and a (fixed) 4x-recurrent 25B-parameter model as far as understanding what the model is `thinking` during next-token prediction.

You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, whether the model is a fixed-size, very deep stack or a recurrent one.