(no title)
ctoa | 2 months ago
The notion of a context window applies to the sequence; the recurrence doesn't really change that, since each iteration sees and attends over the whole sequence.
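A minimal PyTorch sketch of that point (the class name and hyperparameters are mine, and it omits the paper's per-step timestep embedding and ACT halting): a single weight-tied block is applied for a fixed number of iterations, and every iteration self-attends over the full sequence, so the recurrence adds depth without shrinking the context window.

```python
import torch
import torch.nn as nn

class UniversalTransformerEncoder(nn.Module):
    """Sketch of a UT-style encoder: one shared transformer block
    applied repeatedly. Recurrence is over depth, not over the
    sequence, so each step attends over all positions."""

    def __init__(self, d_model=256, nhead=4, num_steps=6):
        super().__init__()
        # One shared block; the same weights are reused every step.
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True)
        self.num_steps = num_steps

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        for _ in range(self.num_steps):
            # Each iteration self-attends over the entire sequence.
            x = self.block(x)
        return x

# Toy usage: 2 sequences of length 10, full-sequence attention at every step.
h = UniversalTransformerEncoder()(torch.randn(2, 10, 256))
```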
omneity | 2 months ago
> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.
Very interesting. It seems to be an "old" architecture [0] that is only now being leveraged to a promising extent. Curious what made it an active research area again (with the work from Samsung and Sapient, and now this one); perhaps diminishing returns on regular transformers?
[0]: https://arxiv.org/abs/1807.03819