thomasahle | 26 days ago
They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
jcarreiro|26 days ago
> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.
I.e., the claim is that this method reproduces the results of conventional attention, up to float16 numerical precision.
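For intuition only, here is a minimal sketch of how quickly a truncated Taylor series of exp converges for small arguments. It is plain scalar truncation, not the paper's method. `taylor_exp` is a hypothetical helper, and whether the paper's P counts the constant term is not stated in the quote (here `n_terms` does count it). The error depends strongly on the magnitude of the argument.

```python
import math

def taylor_exp(x: float, n_terms: int) -> float:
    """Sum the first n_terms terms of the Taylor series of exp(x) around 0."""
    total = 0.0
    term = 1.0  # current term x**p / p!, starting at p = 0
    for p in range(n_terms):
        total += term
        term *= x / (p + 1)  # advance to x**(p+1) / (p+1)!
    return total

# Absolute error for a few small arguments (assumption: scores are small in
# magnitude; larger arguments need more terms).
for x in (0.05, 0.1, 0.2):
    err = abs(taylor_exp(x, 4) - math.exp(x))
    print(f"x={x}: |error| = {err:.2e}")
```

For comparison, float16 has 10 fraction bits, so its relative spacing is 2^-10, roughly 1e-3.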
kristjansson|26 days ago
And they really do mean that: their results show +/- 1 on log10 plots.
energy123|26 days ago
fheinsen|26 days ago
kristjansson|26 days ago
This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.
energy123|26 days ago
logicchains|26 days ago
fheinsen|26 days ago
I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.
Numerical error, in fact, may be a key factor as to why quadratic attention, in practice, exhibits context rot as context gets longer, analogous to an RNN:
https://www.anthropic.com/engineering/effective-context-engi...
cubefox|25 days ago
naasking|26 days ago
[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428
findalex|26 days ago
fheinsen|23 days ago
The GitHub repository's first toy example is with 8 Taylor terms, applied to a context of 1B tokens, with attention computed over 1K heads per token. (Note that applying the quadratic formulation to 1B tokens, each with 1K heads, is not practical with current hardware, because it would require computing 1K attention matrices, each with 1B×1B dot-product scores.)
Like every other proposed method, this one must be tested too. If it works, AI service providers who ignore it will find themselves at a disadvantage.
It's worth mentioning also that the mathematical techniques introduced by this work are likely of interest for other applications besides attention.
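The back-of-the-envelope arithmetic behind the 1B-token note above can be sketched as follows. This assumes decimal units (1B = 10^9 tokens, 1K = 10^3 heads) and 2 bytes per score if the matrices were ever materialized in float16. Both function names are hypothetical.

```python
def attention_score_count(n_tokens: int, n_heads: int) -> int:
    """Dot-product scores in full quadratic attention: one n x n matrix per head."""
    return n_heads * n_tokens * n_tokens

def bytes_to_store(n_scores: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold every score at once (assumption: float16 = 2 bytes)."""
    return n_scores * bytes_per_score

scores = attention_score_count(n_tokens=10**9, n_heads=10**3)
print(f"{scores:.1e} scores")                      # 1.0e+21 scores
print(f"{bytes_to_store(scores):.1e} bytes in float16")
```

Real kernels avoid materializing the full matrices, but the number of dot products still grows with the square of the context length.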
cobolexpert|26 days ago
dave_universetf|26 days ago
The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.
That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.
In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.
The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).
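To make the caching point above concrete, here is a minimal single-head sketch. It assumes queries, keys and values are handed in directly as lists of floats (a real model derives them from learned projections), and it only counts dot products. `KVCache` and `dot_products_without_cache` are hypothetical names, not any library's API.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Keeps keys and values of earlier tokens so each new token costs O(N)."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.dot_products = 0  # running count of query-key dot products

    def step(self, query, key, value):
        # Append the new token, then attend over the whole cached prefix.
        self.keys.append(key)
        self.values.append(value)
        self.dot_products += len(self.keys)
        return attend(query, self.keys, self.values)

def dot_products_without_cache(n_tokens):
    """Cost of recomputing causal attention over the full prefix at every step."""
    return sum(t * (t + 1) // 2 for t in range(1, n_tokens + 1))

cache = KVCache()
for t in range(1000):
    cache.step([1.0, 0.0], [float(t % 7), 1.0], [float(t), 2.0])
print(cache.dot_products, "dot products with a cache")
print(dot_products_without_cache(1000), "dot products without one")
```

With the cache, step t does t dot products. Without it, step t redoes the whole causal triangle, which is the quadratic cost paid again on every pass.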
omneity|26 days ago
antirez|26 days ago
twotwotwo|25 days ago
There are other experiments where model designers mix full-attention layers with limited-memory ones. (Which still doesn't avoid N^2, but if e.g. 3/4 of layers use 'light' attention, it still improves efficiency a lot.) The idea is the model can still pull information from far back in context, just not in every layer. Use so far is limited to smaller models (maybe it costs too much model capability to use at the high end?) but it seems like another interesting angle on this stuff.
quotemstr|26 days ago
polynomial|25 days ago
WhitneyLand|26 days ago
cubefox|26 days ago
andy12_|26 days ago
> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus
[1] https://arxiv.org/pdf/2512.02556
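A rough sketch of the O(Lk) idea in the quote: for one query, a cheap index score picks k keys, and softmax attention runs only over those. This is not DeepSeek's implementation. `topk_sparse_attention` is a hypothetical function, and `index_scores` is assumed to be given (in the quote, producing it is the part that still costs O(L^2)).

```python
import math

def topk_sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k keys with the highest index score."""
    k = min(k, len(keys))
    # Indices of the k highest-scoring keys (selection step).
    selected = sorted(range(len(keys)),
                      key=lambda i: index_scores[i], reverse=True)[:k]
    scale = 1.0 / math.sqrt(len(query))
    # Only k query-key dot products are computed, instead of len(keys).
    scores = [scale * sum(q * x for q, x in zip(query, keys[i]))
              for i in selected]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
              for d in range(dim)]
    return output, selected

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
output, selected = topk_sparse_attention(
    [1.0, 0.0], keys, values, index_scores=[0.1, 0.9, 0.5], k=2)
print(selected, output)
```

Repeating this for all L queries gives L*k attention dot products in the main attention step, which is the O(Lk) term in the quote.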
unknown|26 days ago
[deleted]
unknown|26 days ago
[deleted]
clarity_hacker|26 days ago
[deleted]
wetwater|26 days ago