
fovc | 1 year ago

NSA's sparse attention essentially combines three attention optimizations:

1. Compression of the key and value vectors into block-level representations to reduce the size of the KV cache

2. Selectively computing full-resolution attention only on the tokens inside the compressed blocks with the highest attention scores

3. Using a sliding window for local attention at full resolution
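A rough NumPy sketch of the three branches for a single decoding query. Mean pooling stands in for the paper's learned block compressor, and the uniform average at the end stands in for the learned per-branch gating; block size, top-k, and window size are made-up illustrative values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, T, block = 16, 256, 32          # head dim, context length, block size
n_blocks = T // block
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# 1) Compression: one representative key/value per block (mean pooling
#    here; the paper uses a learned compressor).
Kc = K.reshape(n_blocks, block, d).mean(axis=1)
Vc = V.reshape(n_blocks, block, d).mean(axis=1)
p_cmp = softmax(q @ Kc.T / np.sqrt(d))   # block-level attention scores
out_cmp = p_cmp @ Vc

# 2) Selection: rank blocks by their compressed attention score, then run
#    full-resolution attention only inside the top-k blocks.
topk = 4
sel = np.argsort(p_cmp)[-topk:]
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in sel])
out_sel = softmax(q @ K[idx].T / np.sqrt(d)) @ V[idx]

# 3) Sliding window: full attention over only the most recent tokens.
w = 64
out_win = softmax(q @ K[-w:].T / np.sqrt(d)) @ V[-w:]

# Branch outputs are combined with gates (learned in the paper; uniform here).
out = (out_cmp + out_sel + out_win) / 3
print(out.shape)  # (16,)
```

Note how the compressed branch does double duty: its scores both contribute to the output and decide which blocks the selection branch attends to at full resolution.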

> Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison.

> our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters

Evaluated on MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval. NSA outperforms full attention on 7 of the 9 benchmarks.

Beats out H2O, InfLLM, Quest, Exact-Top, and full attention on LongBench

Perfect retrieval on 64k needle-in-a-haystack

The CoT eval is less convincing, but NSA outperforms full attention on AIME24.

Training speedup of 2-9x vs. FlashAttention

Decoding speedup of 4-12x vs. full attention ["expected"? Didn't see comparison to other attention mechanisms]
