ctoa | 2 months ago

It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n layers that all use the same weights.

The notion of a context window applies to the sequence dimension, so it doesn't really affect the recurrence: each iteration sees and attends over the whole sequence.
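
To make the equivalence concrete, here is a rough sketch in plain Python. It is not the code from any of the papers: `layer`, `rand_weights`, `recurrent`, and `stacked` are made-up names, and the layer is heavily simplified (single-head attention plus a ReLU projection, no layer norm, no positional or step embeddings). It only shows that looping one layer n times gives the same result as an n-layer stack whose layers share one set of weights.

```python
import math
import random

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    # Elementwise sum of two matrices (used for residual connections).
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

def layer(x, w):
    # Simplified transformer layer (assumption: single head, no layer norm).
    q, k, v = matmul(x, w["q"]), matmul(x, w["k"]), matmul(x, w["v"])
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    # Every position attends over every position in the sequence.
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    h = add(x, matmul(attn, v))
    mlp = [[max(0.0, u) for u in row] for row in matmul(h, w["mlp"])]
    return add(h, mlp)

def rand_weights(rng, d):
    # Hypothetical helper: one d x d matrix per projection.
    return {
        name: [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
        for name in ("q", "k", "v", "mlp")
    }

def recurrent(x, w, n_steps):
    # RNN view: apply the same layer n_steps times.
    for _ in range(n_steps):
        x = layer(x, w)
    return x

def stacked(x, layers):
    # Transformer view: apply a list of layers in order.
    for w in layers:
        x = layer(x, w)
    return x

rng = random.Random(0)
d, seq_len, n = 4, 5, 3
shared = rand_weights(rng, d)
x0 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(seq_len)]

looped = recurrent(x0, shared, n)        # n recurrent steps
tied = stacked(x0, [shared] * n)         # n layers, tied weights
print(looped == tied)
```

The recurrence runs over depth, not over tokens, which is why the context window is unaffected: each call to `layer` builds a full `seq_len` by `seq_len` attention matrix.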

omneity | 2 months ago

Thanks, this was helpful! Reading the seminal paper[0] on Universal Transformers also gave some insights:

> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.

Very interesting. It seems to be an “old” architecture that is only now being leveraged to a promising extent. I'm curious what made it an active area (with the work from Samsung and Sapient, and now this one). Perhaps diminishing returns on regular transformers?

0: https://arxiv.org/abs/1807.03819