top | item 44761548

Native Sparse Attention

139 points| CalmStorm | 8 months ago |aclanthology.org | reply

Was submitted as "DeepSeek won the best paper award at ACL 2025"

Here is the awards page: https://cspaper.org/topic/116/record-breaking-acl-2025-crown...

31 comments

[+] noosphr|7 months ago|reply

Deep seek papers are a must to read for anyone who wants to understand how to make LLMs operate at hyper scale. All western labs hide their best results, or at most release summaries that are about as meaningful as the answers Cleo used to give on stack exchange: https://math.stackexchange.com/questions/562694/integral-int...

I have a suspicion with how quiet all the major players got after the two weeks after deepseek R1 was released that they were reading and implementing everything in the papers that came with it as fast as humanly possible.

[+] Art9681|7 months ago|reply

None of the major players have ever been quiet. DeepSeek enjoyed about a week or two's worth of press before its spotlight was stolent from the next great model. It never held the top spot, ever, mind you. So I don't understand why you think major players had to say anything about it, when the model was neither first, second or third in real world capability, and why they would have to say anything about it when DeepSeek service processes maybe an 1/8 of what OpenAI, Google or Claude in any given span of time.

I applaud their open efforts. But being "altruistic" and being best are two different things.

[+] nurettin|7 months ago|reply

I remember on february Deepseek's <think> caused a moderately sized market crash. They didn't just go silent, almost every vendor implemented their own version of thinking models while blaming Deepseek for stealing their tech/training on their models. It was rather pathetic to watch.

[+] CalmStorm|8 months ago|reply

For the first time, it introduced native sparse attention into the full training process, achieving up to 11× inference speedup while maintaining model performance.

[+] sabaimran|7 months ago|reply

> Despite being sparse, NSA surpasses Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation.

Isn't it very notable that the latency improvement didn't have a performance loss? I'm not super familiar with all the technical aspects, but that seems like it should be one of the main focuses of the paper.

[+] ethan_smith|7 months ago|reply

The performance maintenance (or even improvement) isn't surprising - sparse attention can reduce noise by focusing only on relevant tokens. Traditional full attention dilutes focus by attending to everything equally, while NSA's pruning approach mimics how humans selectively process information.

[+] laughingcurve|7 months ago|reply

Yes that’s what makes it so interesting and novel you nailed it

[+] visarga|7 months ago|reply

I am always skeptical of RNN approaches but this paper is just sparsifying the input, it is not compressing any size input to a fixed memory. I am hopeful maybe this is a big break. 11x inference speedup with no degradation from an algorithmic improvement. Is it really that good? almost too good to be true. Adoption in the next 6 months will tell us the truth.

[+] tony_borlini|7 months ago|reply

DeepSeek and the Sparse Attention Revolution: How a Research Paper is Redefining AI Efficiency

https://deep.liveblog365.com/en/index-en.html?post=50

[+] pyuser583|7 months ago|reply

I'd say award for best title is a tie between: "Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems"; "Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details?"; and "Steering off Course: Reliability Challenges in Steering Language Models."

[+] israrkhan|7 months ago|reply

Well deserved

[+] laughingcurve|7 months ago|reply

Yea I agree. It’s sad to find so much of the comments are focused on reinventing reality and jingoism instead of scientific discussion on the merits and technicals. I’ll return tomorrow and hope for better comments.

[+] gnabgib|7 months ago|reply

Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

The awards page for ACL seems to disagree with this editorialized title: https://2025.aclweb.org/program/awards/

[+] fourdnet|7 months ago|reply

The ACL webpage has not been updated yet. Here are the announcement slides: https://cspaper.org/topic/116/record-breaking-acl-2025-crown...

[+] ninjin|7 months ago|reply

Link to the published paper rather than the preprint (update link?):

https://aclanthology.org/2025.acl-long.1126