Compared with most of the dominant architectures in deep learning, transformers are remarkably generic. If you reduce transformer-like architectures to "position-independent iterated self-attention with intermediate transformations", they can support ~all modalities and incorporate other representations (e.g. convolutions, CLIP-style embeddings, graphs or sequences encoded with additional position embeddings). On top of that, they're very compute-friendly. Two of the largest weaknesses seem to be autoregressive sampling (not unique to the base architecture) and expensive self-attention over very long contexts (whether sequence-shaped or generic graph-shaped). Many researchers are focusing efforts there!
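To make the "position-independent iterated self-attention with intermediate transformations" reduction concrete, here is a rough NumPy sketch. It is a toy under stated assumptions (single head, no layer norm, no masking, random weights), not any library's implementation, and the names `block`, `encode` and `init_params` are made up for illustration. It shows that without position embeddings, shuffling the input tokens just shuffles the outputs the same way, and it marks where the n x n cost over long contexts comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d, hidden):
    # Hypothetical parameter layout: small random weights, no biases.
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d),
              "w1": (d, hidden), "w2": (hidden, d)}
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}

def block(x, p):
    # Self-attention: every token attends to every other token.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape (n, n): quadratic in context length
    x = x + softmax(scores) @ v              # residual connection
    # Intermediate transformation: the same small MLP applied to each token.
    return x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]

def encode(x, layers):
    # "Iterated": apply the same kind of block repeatedly.
    for p in layers:
        x = block(x, p)
    return x

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
layers = [init_params(rng, d, hidden=16) for _ in range(3)]
x = rng.normal(size=(n_tokens, d))       # any modality, once it is a set of vectors
perm = rng.permutation(n_tokens)

# Permuting the inputs permutes the outputs identically (up to float error).
equivariant = np.allclose(encode(x[perm], layers), encode(x, layers)[perm])
print("permutation equivariant:", equivariant)
```

Order only matters once you add position information to `x` (or, for graphs, restrict which pairs may attend), which is why the same core can take in sequences, images or graphs.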
Also see: https://www.isattentionallyouneed.com/
anon291|9 months ago