item 41811267

wantsanagent | 1 year ago

Someone explain to me how this isn't reinventing LSTMs please.

toxik | 1 year ago

I don’t understand why you think they are even similar. This is still doing pairwise attention.

wantsanagent | 1 year ago

An LSTM takes a series of values and uses a combination of gates to determine critical information to hold on to or forget as a sequence unfolds. This is a compressive technique that removes the requirement of having all previous sequence information at the time of a particular inference.
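The gating described above can be sketched as a single LSTM cell step in NumPy. This is a minimal, hand-rolled illustration (weight names `W`, `U`, `b` and the toy dimensions are mine, not from the thread or the paper); the point is that an arbitrarily long sequence is compressed into the fixed-size state `(h, c)`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One set of concatenated weights for the four gates:
    # input (i), forget (f), cell candidate (g), output (o).
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_new = f * c + i * g        # forget old info, write new info
    h_new = o * np.tanh(c_new)   # expose a gated view of the cell state
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
W = rng.normal(size=(4 * d_hid, d_in))
U = rng.normal(size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)

h = np.zeros(d_hid)
c = np.zeros(d_hid)
for x in rng.normal(size=(10, d_in)):  # a 10-step sequence
    h, c = lstm_step(x, h, c, W, U, b)

# The whole sequence is now summarized in fixed-size (h, c):
print(h.shape, c.shape)  # (4,) (4,)
```

Whatever the sequence length, the memory carried forward is just the two `d_hid`-sized vectors, which is the "compressive" property the comment refers to.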

This paper "compress[es] sequence information into an anchor token," which is then used at inference time to reduce the information required for prediction and to speed that prediction up. They do this via "continually pre-training the model to compress sequence information into the anchor token."
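A toy sketch of the cache-reduction idea, not the paper's actual method: below, a query normally attends over the full prefix KV cache; in the anchor variant, a single key/value pair stands in for the whole prefix, so the cache shrinks to one entry. The mean-pooled "anchor" here is purely a stand-in (the paper trains the model so a real token's representation absorbs the prefix):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    # Standard scaled dot-product attention of one query over a KV cache.
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return w @ V

rng = np.random.default_rng(1)
d = 8
prefix_K = rng.normal(size=(100, d))  # KV cache for 100 prefix tokens
prefix_V = rng.normal(size=(100, d))

# Full attention: the new query must see all 100 cached key/value pairs.
q = rng.normal(size=d)
full = attend(q, prefix_K, prefix_V)

# Anchor variant: one entry stands in for the whole prefix.
# Mean pooling is only an illustration; the real anchor is learned.
anchor_K = prefix_K.mean(axis=0, keepdims=True)
anchor_V = prefix_V.mean(axis=0, keepdims=True)
compressed = attend(q, anchor_K, anchor_V)

print(prefix_K.shape[0], "->", anchor_K.shape[0])  # 100 -> 1
```

The contrast with an LSTM is that this compression happens inside an otherwise standard attention model, at a designated token, rather than through recurrent gates at every step.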