(no title)
spindump8930 | 9 months ago
> Existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures. While we present theoretical flop-matched experiments and also use certain efficient implementations (such as FlexAttention) to handle layers that deviate from the vanilla transformer architecture, our implementations may not yet be at parity with tokenizer-based models in terms of wall-clock time and may benefit from further optimizations.
And unfortunately, wall-clock deficiencies mean that any quality improvement first has to overcome that additional scaling barrier before any big (i.e., expensive) runs can risk using it.
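For anyone unfamiliar with FlexAttention: it's a PyTorch API (`torch.nn.attention.flex_attention`, available since roughly PyTorch 2.5) that lets you express non-vanilla attention patterns as small Python callables, which then get compiled into a fused kernel instead of requiring a hand-written one. A rough sketch of the idea (not from the paper; the shapes and window size are made up for illustration, and a GPU is assumed):

```python
# Illustrative sketch only: expressing a non-standard attention pattern
# (a sliding-window causal mask) via FlexAttention, with no custom kernel.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64  # batch, heads, seq len, head dim (arbitrary)
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

WINDOW = 256  # hypothetical local-attention window

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Attend only to positions at most WINDOW tokens back, never ahead.
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

# The block mask lets the kernel skip fully-masked blocks entirely,
# which is where the efficiency comes from.
block_mask = create_block_mask(sliding_window_causal, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```

This buys generality, but as the quoted limitation says, a compiled generic kernel still isn't guaranteed to match the wall-clock performance of the heavily hand-tuned kernels that vanilla tokenizer-based transformers get for free.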