fheinsen | 26 days ago
I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.
Numerical error may in fact be a key factor in why quadratic attention, in practice, exhibits context rot as context grows longer, analogous to an RNN:
https://www.anthropic.com/engineering/effective-context-engi...
cubefox | 25 days ago
fheinsen | 25 days ago
Numerical error in long sequences of query-key dot-products may be a key factor.
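A toy sketch of the kind of effect being speculated about here (my own illustration, not from the thread): when a long dot product is accumulated entirely in float16, the running sum eventually stalls because each small increment falls below half an ulp of the sum and rounds away, so the result drifts far from the true value as length grows.

```python
import numpy as np

def fp16_dot(a, b):
    """Dot product with every multiply and add rounded to float16."""
    s = np.float16(0.0)
    for x, y in zip(a.astype(np.float16), b.astype(np.float16)):
        s = np.float16(s + np.float16(x * y))
    return float(s)

n = 10_000
q = np.full(n, 0.1)
k = np.full(n, 0.1)

exact = float(np.dot(q, k))  # float64 reference, close to 100.0
approx = fp16_dot(q, k)      # stalls well below the true value

print(f"float64: {exact:.2f}  float16: {approx:.2f}  "
      f"abs error: {exact - approx:.2f}")
```

Real kernels mitigate this by accumulating in higher precision (e.g. fp32 accumulators for fp16/fp8 inputs), but the example shows why pure low-precision accumulation over long sequences is worrisome.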