saurabh20n | 2 years ago
* In standard transformer attention, cost scales quadratically with sequence length, which restricts model context. This work presents a subquadratic operator (not an attention approximation) that lets models scale to much larger contexts (100k+).
* They introduce an operator called the "Hyena hierarchy": a recurrence over two subquadratic operations, long convolution and element-wise multiplicative gating. Sec 3.1-3.3 define the recurrences, matrices, and filters. Importantly, this is a drop-in replacement for attention (a rough sketch of the recurrence is at the end of this list).
* Longer context: 100x speedup over FlashAttention at 64k context (if we view FlashAttention as a non-approximate engineering optimization, then this work improves algorithmically, gaining an order of magnitude or more on top of that). Associative recall, i.e., just pulling stored key-value data back out (roughly: given "a 2 b 9 ... a ?", answer 2), shows improvements: experiments on 137k context and vocab sizes of 10-40 (unsure why recall degrades on shorter sequences with larger vocabularies, but they still outperform others).
* Comparisons (on relatively small models, but meant to show the pattern) against RWKV (an attention-free model trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M parameters. (Section 4.3)
* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (although technically they underperform a bit zero-shot and outperform a bit at 3-shot). (Tables 4.5 and 4.6)
* Because they can support large (e.g., 100k+) contexts, they can also do image classification over flattened pixel/patch sequences. They report ballpark-comparable accuracy against others. (Table 4.7)
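For intuition on the operator itself (and why it's subquadratic), here's a minimal NumPy sketch of how I read the Sec 3 recurrence: long convolutions computed via FFT in O(L log L), interleaved with element-wise gating. The names, shapes, and random stand-ins for the learned projections and filters below are my illustration, not the authors' code (in the paper the filters are parameterized implicitly by a small FFN and the projections come from dense layers plus short convolutions).

```python
import numpy as np

def fft_long_conv(h, z):
    # Causal long convolution via FFT: O(L log L) per channel instead of O(L^2).
    # h, z: arrays of shape (L, D); each channel is convolved independently.
    L = z.shape[0]
    H = np.fft.rfft(h, n=2 * L, axis=0)
    Z = np.fft.rfft(z, n=2 * L, axis=0)
    y = np.fft.irfft(H * Z, n=2 * L, axis=0)
    return y[:L]  # zero-padding to 2L avoids circular wrap-around (keeps it causal)

def hyena_operator(projections, filters):
    # Order-N Hyena recurrence, roughly:
    #   z^1 = v;  z^{n+1} = x^n * (h^n conv z^n);  output y = z^{N+1}
    # projections: N+1 arrays (L, D), stand-ins for the learned v, x^1..x^N
    # filters:     N arrays (L, D), stand-ins for the long filters h^1..h^N
    v, *gates = projections
    z = v
    for x_n, h_n in zip(gates, filters):
        z = x_n * fft_long_conv(h_n, z)  # long convolution of the state, then element-wise gating
    return z

# Toy usage: order-2 Hyena over a length-1024, 8-channel sequence with random stand-ins.
L, D, order = 1024, 8, 2
rng = np.random.default_rng(0)
projections = [rng.standard_normal((L, D)) for _ in range(order + 1)]
filters = [rng.standard_normal((L, D)) for _ in range(order)]
y = hyena_operator(projections, filters)  # shape (L, D)
```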
Might have misread some takeaways; happy to be corrected.