item 41810150

The Role of Anchor Tokens in Self-Attention Networks

18 points | smooke | 1 year ago | arxiv.org

5 comments

zopper | 1 year ago
Surprised this isn't getting more attention. It's one of those papers that's elegant and simple, yet very effective.
forrestp | 1 year ago
It's expensive in this field to verify other people's work. A few other papers from the last three years share the same high-level idea but call the anchor tokens something different -- Gist tokens being the only name I personally remember, but you can follow the citation chains back.
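Roughly, the high-level idea (my sketch, names are mine and not from any of these papers): insert a few learned "gist"/anchor tokens after the prompt, then mask attention so everything after them can only reach the prompt *through* those tokens.

```python
import numpy as np

def gist_attention_mask(prompt_len, num_gists, suffix_len):
    """Boolean mask[i, j] = True means position i may attend to position j.

    Illustrative only: suffix tokens are cut off from the raw prompt and
    must read it through the gist/anchor tokens.
    """
    total = prompt_len + num_gists + suffix_len
    mask = np.tril(np.ones((total, total), dtype=bool))  # causal base
    gist_end = prompt_len + num_gists
    # Tokens after the gists cannot see the raw prompt directly:
    mask[gist_end:, :prompt_len] = False
    return mask

m = gist_attention_mask(prompt_len=4, num_gists=2, suffix_len=3)
print(m[6, :4].any())   # False: suffix token has no access to prompt tokens
print(m[6, 4:6].all())  # True: but full access to both gist tokens
```

The gists' own rows stay fully causal, so during training they can still read the whole prompt and learn to compress it.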

Those other papers sounded like a godsend, but they have deficits you only discover when you try them on non-cherry-picked use cases. I think they are getting better with time, though.

They call out their limitations at the bottom of the paper. For these kinds of models, it would be nice to see them probe and measure the core weakness of compressive memory: producing exact outputs. That means things like retrieving multiple items from context verbatim, doing arithmetic, or copy-pasting high-entropy strings (i.e. cases where a basic n-gram model can't bias you out of the blurry pieces).
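To make that concrete, here's the kind of probe I mean (my own toy construction, not from the paper): ask the model to repeat a random high-entropy value verbatim, where every character matters and no language prior can fill in the blur.

```python
import random
import string

def make_copy_probe(key_len=8, val_len=16, n_pairs=4, seed=0):
    """Build a prompt full of random key/value pairs and ask for one
    value back verbatim. Purely illustrative; `model` would be a stand-in."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    rand = lambda n: "".join(rng.choices(alphabet, k=n))
    pairs = {rand(key_len): rand(val_len) for _ in range(n_pairs)}
    target = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nRepeat the value for {target} exactly:"
    return prompt, pairs[target]

def exact_match(model_output, expected):
    # No partial credit: one wrong character and the copy failed.
    return model_output.strip() == expected

prompt, expected = make_copy_probe()
```

Scoring on exact match (not BLEU or perplexity) is the point: a compressive memory that's 99% right per character still fails this almost every time.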

The other side of it is that reproducing the training for some of these architectures is often difficult -- training can be highly unstable, and both hard and expensive to dial in on a real-world model. We see their best training run, not the 500 runs where they changed hyperparameters because the loss kept exploding randomly (compare this to text-only llama-esque architectures, which are wildly stable at training time, predictable, easy to invest in, and whose hyperparameters are easy to find from prior art).

I think we are still many papers away from something ready-for-prod on this concept, but I am personally optimistic.

wantsanagent | 1 year ago
Someone explain to me how this isn't reinventing LSTMs please.
toxik | 1 year ago
I don’t understand why you think they are even similar. This is still doing pairwise attention.
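The distinction in shapes makes it obvious (toy numpy sketch with untrained random weights, purely illustrative): attention keeps a pairwise T×T interaction between tokens, while an LSTM squeezes the entire history through one fixed-size state vector.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))  # T tokens, d-dim embeddings

# Pairwise attention: every token scores every earlier token directly,
# so the score matrix grows as T x T with sequence length.
scores = x @ x.T / np.sqrt(d)
causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x  # each output is a mix over all prior tokens

# LSTM-flavored recurrence (toy, random weights): the whole history is
# compressed into a single state h of size d, no matter how long T gets.
Wh = rng.normal(size=(d, d)) * 0.1
Wx = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):
    h = np.tanh(Wh @ h + Wx @ x[t])  # state stays shape (d,) forever

print(weights.shape)  # (5, 5): grows with sequence length
print(h.shape)        # (8,): fixed regardless of sequence length
```

That's the crux: anchor tokens compress *part* of the context but the model is still doing full pairwise attention over what remains, whereas an LSTM has no pairwise term at all.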