Maybe a bit outdated now, but this reminds me of LSTMs: a recurrent update of a memory / hidden state with gating. I remember one of the biggest problems with such RNNs being vanishing gradients over long contexts, which vanilla transformers presumably avoided by attending over the whole context in parallel instead of processing tokens sequentially. I wonder how that's avoided here?
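For anyone who hasn't seen it spelled out: here's a toy sketch (my own numbers, not from any paper) of why gradients vanish in a plain RNN and why the LSTM's additive cell-state path helps. Backprop through a vanilla RNN multiplies by the recurrent Jacobian at every step, so a typical per-step factor below 1 shrinks the gradient exponentially; the LSTM cell state `c_t = f_t * c_{t-1} + i_t * g_t` instead scales the gradient by the forget gate, which the network can learn to keep near 1.

```python
T = 100  # sequence length (arbitrary choice for illustration)

# Plain RNN: each backprop step multiplies by roughly
# (tanh derivative * recurrent weight), often < 1 in magnitude.
factor = 0.9  # assumed typical per-step scaling
rnn_grad = 1.0
for _ in range(T):
    rnn_grad *= factor
print(f"plain RNN gradient after {T} steps: {rnn_grad:.2e}")  # ~2.7e-05

# LSTM: the gradient through the cell path is scaled by the forget
# gate f_t, which can be learned to sit near 1.
forget_gate = 0.999  # assumed learned value
lstm_grad = 1.0
for _ in range(T):
    lstm_grad *= forget_gate
print(f"LSTM cell-path gradient after {T} steps: {lstm_grad:.2e}")  # ~0.9
```

Same 100 steps, but one path loses ~5 orders of magnitude while the other barely decays, which is the usual hand-wavy explanation of why gating helps with long contexts.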