fheinsen | 22 days ago
The GitHub repository's first toy example uses 8 Taylor terms, applied to a context of 1B tokens, with attention computed over 1K heads per token. (Note that applying the quadratic formulation to 1B tokens, each with 1K heads, is not practical with current hardware, because it would require computing 1K attention matrices, each with 1B×1B dot-product scores.)
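For intuition, here is a back-of-envelope calculation of what the quadratic formulation would cost at those scales. The token and head counts come from the toy example above; the byte estimate is my own assumption of fp16 (2 bytes) per score:

```python
# Back-of-envelope cost of the quadratic formulation at the toy
# example's scales (1B tokens, 1K heads); fp16 storage is assumed.
n_tokens = 10**9   # 1B-token context
n_heads  = 10**3   # 1K attention heads per token

# Each head needs an n_tokens x n_tokens matrix of dot-product scores.
scores_per_head = n_tokens * n_tokens          # 1e18 scores per head
total_scores    = n_heads * scores_per_head    # 1e21 scores in total

# At 2 bytes per score (fp16), a single head's score matrix alone
# would occupy ~2e18 bytes, i.e. roughly 2 exabytes.
bytes_fp16_one_head = 2 * scores_per_head

print(f"{total_scores:.0e} scores total")            # 1e+21
print(f"{bytes_fp16_one_head / 1e18:.0f} EB per head (fp16)")  # 2
```

Even ignoring compute time, materializing just one of the 1K score matrices is far beyond any current memory system, which is why sub-quadratic formulations matter at this scale.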
Like every other proposed method, this one must be tested too. If it works, AI service providers who ignore it will find themselves at a disadvantage.
It's also worth mentioning that the mathematical techniques introduced by this work are likely to be of interest for applications beyond attention.