Maybe a bit outdated now, but this reminds me of LSTMs: a recurrent update of a memory / hidden state with gating. I remember one of the biggest problems with such RNNs being vanishing gradients over long contexts, which vanilla transformers presumably avoided by attending over the whole context in parallel instead of processing tokens sequentially. I wonder how that's avoided here?
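For anyone who hasn't seen it spelled out: here's a toy sketch (my own numbers, not from any paper) of why gradients vanish in a plain RNN and why the LSTM's additive cell-state path helps. Backprop through a vanilla RNN multiplies by the recurrent Jacobian at every step, so a typical per-step factor below 1 shrinks the gradient exponentially; the LSTM cell state `c_t = f_t * c_{t-1} + i_t * g_t` instead scales the gradient by the forget gate, which the network can learn to keep near 1.

```python
T = 100  # sequence length (arbitrary choice for illustration)

# Plain RNN: each backprop step multiplies by roughly
# (tanh derivative * recurrent weight), often < 1 in magnitude.
factor = 0.9  # assumed typical per-step scaling
rnn_grad = 1.0
for _ in range(T):
    rnn_grad *= factor
print(f"plain RNN gradient after {T} steps: {rnn_grad:.2e}")  # ~2.7e-05

# LSTM: the gradient through the cell path is scaled by the forget
# gate f_t, which can be learned to sit near 1.
forget_gate = 0.999  # assumed learned value
lstm_grad = 1.0
for _ in range(T):
    lstm_grad *= forget_gate
print(f"LSTM cell-path gradient after {T} steps: {lstm_grad:.2e}")  # ~0.9
```

Same 100 steps, but one path loses ~5 orders of magnitude while the other barely decays, which is the usual hand-wavy explanation of why gating helps with long contexts.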