spindump8930 | 9 months ago

If you consider most of the dominant architectures in deep-learning approaches, transformers are remarkably generic. If you reduce transformer-like architectures to "position-independent iterated self-attention with intermediate transformations", they can support ~all modalities and incorporate other representations (e.g. convolutions, CLIP-style embeddings, graphs or sequences encoded with additional position embeddings). On top of that, they're very compute-friendly.
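
Roughly, that reduction looks like the sketch below (PyTorch; all names and sizes are illustrative toys, not any particular model):

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

        def forward(self, x):
            # Self-attention is permutation-equivariant; order only
            # enters via whatever position embeddings were added to x.
            h = self.n1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.mlp(self.n2(x))  # intermediate transformation

    # Any modality becomes a sequence of d-dim tokens: text tokens,
    # image patches, graph nodes (edges/positions go into embeddings).
    x = torch.randn(1, 16, 256)                         # (batch, tokens, d)
    y = nn.Sequential(*[Block() for _ in range(6)])(x)  # iterated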

Two of the largest weaknesses seem to be auto-regressive sampling (not unique to the base architecture) and expensive self-attention over very long contexts (whether sequence-shaped or generic graph-shaped). Many researchers are focusing their efforts there!
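
Both costs show up in a toy sampling loop; `model` here is just a stand-in for any decoder mapping token ids to next-token logits (hypothetical, not a real API):

    import torch

    def sample(model, ids, n_new):
        # Weakness 1: one full forward pass per generated token,
        # inherently sequential at inference time.
        for _ in range(n_new):
            logits = model(ids)[:, -1, :]
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            ids = torch.cat([ids, nxt], dim=1)
        return ids

    # Weakness 2: inside each pass, full self-attention materializes
    # an (n x n) score matrix, so time/memory grow quadratically in
    # context length n:
    #   scores = q @ k.transpose(-2, -1)   # O(n^2)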

Also see: https://www.isattentionallyouneed.com/


anon291 | 9 months ago

Transformers are very close to some types of feed-forward networks. The difference is that transformers can be trained in parallel across the whole sequence, without step-by-step recurrence (which is slow for training, but kind of nice for streaming, low-latency inference). It's a mathematical trick. RWKV makes it obvious: the same model can be written in a parallel form for training and a recurrent form for inference.
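
A stripped-down version of that trick, in the spirit of linear attention / RWKV but without the decay and gating RWKV actually uses: the exact same computation has a parallel form (good for training) and a fixed-size-state recurrent form (good for streaming inference).

    import torch

    def parallel_form(q, k, v):
        # Training form: all timesteps at once via a masked (n x n) product.
        n = q.shape[0]
        mask = torch.tril(torch.ones(n, n))
        return (mask * (q @ k.T)) @ v

    def recurrent_form(q, k, v):
        # Streaming form: constant-size state, one step per token.
        state = torch.zeros(k.shape[1], v.shape[1])
        out = []
        for t in range(q.shape[0]):
            state = state + torch.outer(k[t], v[t])  # accumulate k_t v_t^T
            out.append(q[t] @ state)
        return torch.stack(out)

    q, k, v = (torch.randn(8, 4) for _ in range(3))
    assert torch.allclose(parallel_form(q, k, v),
                          recurrent_form(q, k, v), atol=1e-5)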