Manabu-eo | 1 year ago
I imagine they choose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to compute the next is the main revolution of transformers over LSTMs (along with the higher internal bandwidth). But I agree it might not be the most efficient model to train, given all the redundant work at large r.
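To make the point concrete, here is a minimal sketch (hypothetical names, not the actual model's code) of a weight-tied block applied a fixed number of times r. Because r is fixed up front, every input in the batch traces the same unrolled graph, which keeps training easy to batch and parallelize; the cost is the redundant compute at large r mentioned above.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """One shared block applied r times (weights tied across iterations)."""
    def __init__(self, dim: int):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, x: torch.Tensor, r: int) -> torch.Tensor:
        for _ in range(r):           # fixed r: identical unrolled graph for all inputs
            x = x + self.layer(x)    # residual update, reusing the same weights
        return x

block = RecurrentBlock(dim=64)
x = torch.randn(8, 64)               # batch of 8 hypothetical feature vectors
out = block(x, r=4)                  # r chosen once for the whole training run
print(out.shape)                     # torch.Size([8, 64])
```

Since r is a plain Python constant here, the unrolled loop is the same for every example, so the batch dimension stays fully parallel; an input-dependent r would force per-example control flow and break that.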