jayalammar | 2 years ago
Transformer Feed-Forward Layers Are Key-Value Memories https://arxiv.org/abs/2012.14913
The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention https://arxiv.org/abs/2202.05798
HarHarVeryFunny | 2 years ago
In the simplest case this is a copying operation: an early occurrence of AB predicts that a later A should be followed by B. In the more general case this becomes A'B' => AB, which seems to be more of an analogy-type relationship.
https://arxiv.org/abs/2209.11895
https://youtu.be/Vea4cfn6TOA
This is still only a low-level mechanistic type of operation, but it is at least a glimpse into how transformers operate at inference time.
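The copying pattern described above can be sketched in plain Python (this is an illustration of the idea, not code from the linked paper): to predict the token after the final A, look back for an earlier occurrence of A and copy the B that followed it.

```python
# Sketch of the "induction head" copying pattern: AB ... A -> B.
# A real transformer implements this with attention, not literal matching;
# this is just the pattern reduced to token lookup for illustration.

def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most
    recent earlier occurrence of the final token; None if no match."""
    last = tokens[-1]
    # Scan backwards over earlier positions for a prior occurrence of `last`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # copy the token that followed it
    return None

# Example: having seen "A B" earlier, a later "A" predicts "B".
print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

The fuzzy A'B' => AB case would relax the exact equality test to a similarity match, which is where the analogy-like behavior comes in.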