
turkeygizzard | 2 years ago

Thank you for articulating this. I remember similar problems and arguments arising after RNNs and CNNs became massively successful. People argued that training larger models would be infeasible for several reasons, all of which were made moot by Attention Is All You Need. Somebody always seems to figure out a new approach.


Certhas|2 years ago

First of all, the article argues that you need a major breakthrough; arguably, attention was such a breakthrough?

That said, this doesn't really seem all that comparable. The article points out a very fundamental property of all the diverse current approaches: they are tightly data-constrained. You either need cheap simulation or massive real-world data. That's not an arcane technical point.

hackernewds|2 years ago

Attached is the "Attention Is All You Need" publication.

https://arxiv.org/abs/1706.03762

Why was this revolutionary though?

abetusk|2 years ago

I'm not sure I understand it well enough to say, but after watching a video on it [0] I think there were a few key points:

* "Attention Is All You Need" introduced positional encoding, which lets the model keep track of each word's position, allowing for more complex translation (and thus generative/ChatGPT-like tasks?) because words now have context relative to each other. Contrast this with "bag of words" models, which only tell you whether a word is present or not.

* I don't quite understand why, but transformers (which "AiaYN" introduced) can be made fully parallel, compared with RNN/LSTM networks, which have to process tokens serially. Full parallelism allows for GPU optimization, which means you can take advantage of Moore's law for training.
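To make both points concrete, here's a minimal NumPy sketch (my own illustration, not code from the paper): sinusoidal positional encodings stamp each token with its position, and self-attention computes every token-to-token interaction in a few matrix multiplies, whereas an RNN must loop over tokens one at a time. All weights here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                        # sequence length, model dimension
X = rng.standard_normal((T, d))    # stand-in token embeddings

# Sinusoidal positional encoding (the paper's formula): even dims get sin,
# odd dims get cos, at geometrically spaced frequencies.
pos = np.arange(T)[:, None]
i = np.arange(d // 2)[None, :]
pe = np.zeros((T, d))
pe[:, 0::2] = np.sin(pos / 10000 ** (2 * i / d))
pe[:, 1::2] = np.cos(pos / 10000 ** (2 * i / d))
X = X + pe                         # now each embedding carries its position

# RNN: each hidden state depends on the previous one -> a serial loop.
Wh, Wx = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(T):                 # cannot be parallelized across t
    h = np.tanh(Wh @ h + Wx @ X[t])

# Self-attention: all T tokens attend to all T tokens at once.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)      # (T, T) similarity of every token pair
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ V                  # all T outputs in one matmul batch
print(out.shape)                   # (6, 8)
```

The matmuls in the attention path map straight onto GPU hardware, which is the parallelism advantage over the `for t in range(T)` loop.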

I'm always a bit suspicious when people claim a breakthrough of this sort. There's no doubt that better algorithms give better results, but how much is due to just faster computers, cheaper compute, memory, etc.?

[0] https://youtu.be/S27pHKBEp30

uh_uh|2 years ago

Previous approaches like the LSTM struggled to learn long-term dependencies. The transformer improved on this greatly.

yinser|2 years ago

To add another anecdote to your question: the transformer became a part of the first context-aware embedding model, GPT-1. Not to say it couldn't be done with another tool, but it was first done with a transformer. Previous embedding models like word2vec, GloVe, and fastText were not contextual and didn't give you a language graph that would then go on to support a language model capable of "understanding" what you were saying or asking for.

jimsimmons|2 years ago

GP is wrong.

The "Attention Is All You Need" paper just proposed an autoregressive model that didn't have to be trained step by step. The scaling happened later, in BERT, GPT, and OpenAI's scaling work.