Thank you for articulating this. I remember similar problems and arguments arising after RNNs and CNNs became massively successful. People argued that training larger models would be infeasible for several reasons, all of which were made moot by Attention Is All You Need. Someone always seems to figure out a new approach.
Certhas|2 years ago
That said, this doesn't really seem all that comparable. The article points out a very fundamental property shared by all of the diverse current approaches: they are tightly data-constrained. You either need cheap simulation or massive real-world data. That's not an arcane technical point.
hackernewds|2 years ago
https://arxiv.org/abs/1706.03762
Why was this revolutionary though?
abetusk|2 years ago
* "Attention Is All You Need" introduced positional encoding, which lets the model keep track of each word's position, allowing for more complex translation (and thus generative/ChatGPT-like tasks?) because words now have context relative to each other. Contrast this with "bag of words" models, which only tell you whether a word is present or not.
* I don't quite understand why, but transformers (which "AIAYN" introduced) can be trained fully in parallel, whereas RNN/LSTM networks have to process one token at a time. Full parallelism allows for GPU optimization, which means you can take advantage of Moore's law for training.
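To make the first bullet concrete, here is a minimal numpy sketch of the sinusoidal positional encoding defined in the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and dimensions are illustrative, not from the thread.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)
```

Each position gets a distinct vector, and because the frequencies vary smoothly, relative offsets between positions are easy for the model to pick up — that is what gives words "context relative to each other" even though attention itself is order-agnostic.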
I'm always a bit suspicious when people claim a breakthrough of this sort. There's no doubt that better algorithms give better results, but how much is due simply to faster computers, cheaper compute, memory, etc.?
[0] https://youtu.be/S27pHKBEp30
jimsimmons|2 years ago
The "Attention Is All You Need" paper just proposed an AR model that didn't have to be trained step by step. The scaling happened later, in BERT, GPT, and OpenAI's scaling work.
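The mechanism behind "didn't have to be trained step by step" is the causal attention mask: every position can only attend to earlier positions, so an entire training sequence is processed in one parallel pass (teacher forcing) rather than one recurrent step per token. A minimal numpy sketch of the masked, scaled dot-product scores (names and shapes are illustrative):

```python
import numpy as np

def causal_attention_weights(q, k):
    """Scaled dot-product attention weights with a causal mask.

    Masking future positions means all T positions of a training
    sequence can be computed at once, unlike an RNN's serial steps.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (T, T) score matrix
    mask = np.triu(np.ones_like(scores), k=1)  # 1s strictly above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)
    # softmax over the last axis; exp(-inf) -> 0 kills future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # T=4 tokens, d=8
w = causal_attention_weights(x, x)
print(np.allclose(np.triu(w, k=1), 0))         # True: no attention to the future
```

At inference time generation is still token by token, which is why the comment distinguishes training (parallel) from the autoregressive model itself.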