crsn | 1 year ago

Very f’ing cool (esp. optimistic about repo-level codebase completion) – but like many other results DeepSeek reports, their preprint leaves me with more questions than answers, unless I’ve misunderstood multiple pieces of it (which is of course possible):

—They report a 9.0× speedup in the forward pass but only 6.0× in the backward pass… Why the heck would the backward pass benefit so much less? Is it their gating mechanism needing extra computation in the backward pass? Gradient accumulation or KV-cache updates bottlenecking the speedup? FlashAttention (or at least FlashAttention-2) achieves near-equal forward/backward efficiency, and they claim NSA is tuned for FA2-style blockwise layouts, so which of their (competing) claims is wrong?
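
For concreteness, here’s roughly how you’d separate the two measurements in PyTorch (a minimal sketch: shapes, dtypes, and iteration counts are made up, and scaled_dot_product_attention stands in for NSA’s Triton kernels, which aren’t public):

    import torch
    import torch.nn.functional as F

    def time_fwd_bwd(batch=1, heads=16, seq=8192, dim=64, iters=20):
        # Random half-precision tensors; requires a CUDA device.
        q, k, v = (torch.randn(batch, heads, seq, dim, device="cuda",
                               dtype=torch.float16, requires_grad=True)
                   for _ in range(3))
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        # Warm-up so kernel selection/compilation doesn't pollute the timings.
        for _ in range(3):
            F.scaled_dot_product_attention(q, k, v).sum().backward()

        fwd_ms = bwd_ms = 0.0
        for _ in range(iters):
            start.record()
            out = F.scaled_dot_product_attention(q, k, v)
            end.record()
            torch.cuda.synchronize()
            fwd_ms += start.elapsed_time(end)

            loss = out.sum()
            start.record()
            loss.backward()
            end.record()
            torch.cuda.synchronize()
            bwd_ms += start.elapsed_time(end)
            q.grad = k.grad = v.grad = None  # clear grads between iterations

        print(f"fwd {fwd_ms / iters:.2f} ms, bwd {bwd_ms / iters:.2f} ms")

If their numbers come from a measurement like this, the gap could plausibly be extra recomputation in the sparse backward kernel, but the paper doesn’t break it down.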

—Does NSA actually learn useful sparsity, or does it just get lucky during pretraining? How much of the performance gain comes from sparsity patterns learned in pretraining vs. sparsity inherent to attention itself? They themselves say “applying sparsity post-hoc forces models to deviate from their pretrained optimization trajectory… As demonstrated by Chen et al. (2024), [sic] top 20% attention can only cover 70% of the total attention scores, rendering structures like retrieval heads in pretrained models vulnerable to pruning during inference” — yet their ablation isn’t strong enough to tell. A stronger ablation would include (1) a Full Attention → NSA transition test measuring whether NSA can be applied post-hoc without degradation (the coverage check sketched below is the cheap version of this), (2) a visualization of learned sparsity patterns over training epochs, and (3) a control where sparsity constraints are randomly assigned, to see whether NSA actually finds useful structure or merely adapts to whatever is imposed.
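
The coverage statistic they quote from Chen et al. is easy to reproduce on any checkpoint, which is why ablation (1) seems cheap to run. A minimal sketch (the per-query top-p% definition is my guess at what’s being measured; the random logits are a placeholder for attention probabilities dumped from a real model):

    import torch

    def topk_coverage(attn, frac=0.2):
        # attn: (..., seq) rows of post-softmax attention probabilities.
        # Returns the mean fraction of each query's attention mass that
        # its top-`frac` highest-scoring keys capture.
        k = max(1, int(frac * attn.shape[-1]))
        top = attn.topk(k, dim=-1).values
        return (top.sum(-1) / attn.sum(-1)).mean().item()

    scores = torch.randn(8, 16, 128, 128)  # fake (batch, head, query, key) logits
    attn = scores.softmax(-1)
    print(f"top-20% of keys cover {topk_coverage(attn):.0%} of attention mass")

Run it on a Full Attention checkpoint before and after an NSA-style swap; a big drop in covered mass is exactly the “deviation from the pretrained optimization trajectory” they warn about.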

—Training transformers with sparse attention is historically unstable — early MoEs like Switch Transformer (which uses expert-gating mechanisms much like NSA’s branch gating) were famous specifically for their collapse issues. How does NSA prevent mode collapse early in training — or really, how do we know it won’t just collapse under different (i.e. more common) initialization schemes? If their technique has no explicit mechanism for counteracting sparse-expert underutilization (see the load-balancing sketch below), then it’s just as vulnerable to collapse as (e.g.) Switch Transformer — but worse, since sparsity here isn’t just a gating function, it’s the core of the entire attention mechanism…
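
For comparison, the explicit mechanism Switch Transformer relies on is its auxiliary load-balancing loss (Fedus et al., 2021): alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. A minimal sketch of that loss follows; to be clear, this is Switch Transformer’s fix, not anything claimed in the NSA paper, and whether NSA’s gated attention branches need an analogue is exactly the open question:

    import torch
    import torch.nn.functional as F

    def load_balance_loss(router_logits, alpha=0.01):
        # router_logits: (tokens, num_experts) pre-softmax routing scores.
        # Switch Transformer's auxiliary loss: alpha * N * sum_i(f_i * P_i).
        # It is minimized when dispatch is uniform across experts, which is
        # what keeps top-1 routing from collapsing onto a few experts.
        n_experts = router_logits.shape[-1]
        probs = router_logits.softmax(-1)
        f = F.one_hot(probs.argmax(-1), n_experts).float().mean(0)  # dispatch fraction per expert
        p = probs.mean(0)                                           # mean router probability per expert
        return alpha * n_experts * (f * p).sum()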
