hodgehog11|6 months ago

Working in the theory, I can say this is incredibly unlikely. At scale, once appropriately trained, all architectures begin to converge in performance. It's not architectures that matter anymore; it's unlocking new objectives and modalities that open another axis to scale on.
viraptor|6 months ago

Do we really have the data on this? I mean, it does happen at smaller scale, but where's the 300B-parameter version of RWKV? Where are the hybrid symbolic/LLM systems? Where are the other experiments? I only see the larger companies making relatively small tweaks to the standard transformer, where context size still explodes memory use - they aren't even addressing that part.
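(A rough sketch of the memory point above, with assumed numbers: naive full self-attention materializes an n x n score matrix per head, so that part of the memory grows quadratically with context length n. The head count and fp16 element size below are illustrative assumptions, and implementations like FlashAttention avoid storing the full matrix, but the quadratic trend is the point.)

    # Illustrative only: bytes needed to materialize the attention score
    # matrices for one layer of naive full self-attention. The 32 heads and
    # fp16 (2 bytes per element) are assumed values, not taken from the thread.
    def score_matrix_bytes(context_len, n_heads=32, bytes_per_elem=2):
        # one (context_len x context_len) score matrix per head
        return n_heads * context_len * context_len * bytes_per_elem

    for n in (4_096, 32_768, 131_072):
        gib = score_matrix_bytes(n) / 2**30
        print(f"context {n:>7,}: ~{gib:,.1f} GiB of scores per layer")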
Do you mean "all variants of the same stacked transformer architecture converge in performance"? Or do you know of tests against some other architecture? The diffusion-based LLMs?
highfrequency|6 months ago

> It's not architectures that matter anymore; it's unlocking new objectives and modalities that open another axis to scale on.