I've just started this video, but already have a question if anyone's familiar with GPT workings - I thought that these models chose the next word based on what's most likely. But if they instead choose "one of the likely" words, couldn't that (in general) lead to a situation where the predictions for the following word are all much less likely? Scoring possibilities of two words together, then, would be more beneficial if computationally possible (and so on for 3, 4 and n words). Does this exist?
(I realize that choosing the most likely word wouldn't necessarily solve the issue, but choosing the most likely phrase possibly might.)
Edit, post seeing the video and comments: it's beam search, along with temperature to control these things.
In practice, beam search doesn't seem to work well for generative models.
Temperature and top_k (two closely related parameters) were both introduced to account for the fact that human text is stochastically unpredictable - how surprising each next word is varies a lot over the course of a sentence - as shown in this 2021 reproduction of an older graph from the 2018/2019 Hugging Face documentation: https://lilianweng.github.io/posts/2021-01-02-controllable-t...
It could be that beam search with a much longer length does turn out to be better, or that some merging of the techniques works well, but I don't think so. The query-key-value part of transformers is in many ways focused on a single token in relation to the overall context. The architecture is not built for longer units as such - there is no default "two token" system. And with vocabularies of 50k-100k tokens in most GPT models, a joint two-token output would mean 50k x 50k = 2.5 billion entries in the output distribution, plus issues with sparsity of training data.
Just about everything in GPT models (e.g. learned positional encodings/embeddings, depending on the model iteration) is so focused on enriching a single token at a single position that one could say the architecture simply is not designed for beam search of this kind - and that's before considering the training complications.
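To make the top_k knob mentioned above concrete, here's a minimal sketch (my toy code, not from the linked documentation): keep only the k most likely tokens, renormalize over them, and sample.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep the k highest-logit tokens, softmax over them, and sample one."""
    # Indices of the k most likely tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k logits (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Sample one surviving index in proportion to its probability.
    return rng.choices(top, weights=weights, k=1)[0]
```

With k=1 this degenerates to greedy decoding; a larger k lets rarer tokens through, which is exactly the trade-off being discussed.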
The temperature setting controls how rare a next token is allowed to be. If set to 0, the top of the likelihood list is always chosen; if set greater than 0, some lower-probability tokens may be chosen.
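In code terms, that description amounts to something like this sketch (a toy implementation, not any particular library's API):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Pick a token index from raw logits, with temperature controlling rarity."""
    if temperature == 0:
        # Temperature 0: always take the single most likely token (greedy).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide logits by the temperature, then softmax. T > 1 flattens the
    # distribution (rare tokens become more likely); T < 1 sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return rng.choices(range(len(logits)), weights=[e / total for e in exps], k=1)[0]
```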
Yes, this is a fundamental weakness with LLMs. Unfortunately this is likely unsolvable because the search space is exponential. Techniques like beam search help, but can only introduce a constant scaling factor.
That said, LLMs reach their current performance despite this limitation.
That's basically chunking, or at least how it starts. I was impressed by the ability to add and subtract individual word vector embeddings and get meaningful results. Chunking a larger block extends this whole process so you can do the same thing in conceptual space: take a baseline method like sentence embedding, and that becomes your working block for comparison.
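The embedding arithmetic mentioned above (the classic king - man + woman ≈ queen result) can be sketched with toy 2-d vectors; real embeddings have hundreds of dimensions, but the mechanics are the same. The tiny vocabulary here is made up purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(query, vocab):
    """Word in vocab whose embedding is most similar to the query vector."""
    return max(vocab, key=lambda w: cosine(query, vocab[w]))

# Toy 2-d embeddings: first axis ~ "royalty", second axis ~ "gender".
vocab = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}
# king - man + woman, elementwise.
query = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]
```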
There’s some fancier stuff too, like techniques that take into account where recent tokens were drawn from in the distribution and update either top_p or the temperature so that sequences of tokens maintain a minimum unlikeliness. Beam search is less common with really large models because the computation is really expensive.
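For reference, a bare-bones nucleus (top_p) filter - the dynamic schemes alluded to above adjust `p` or the temperature on the fly, but they build on something like this sketch (toy code, not a library implementation):

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    # Softmax over all logits (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens from most to least likely until mass p is covered.
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Sample only within the nucleus, renormalized.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```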
If you liked that, Andrej Karpathy has a few interesting videos on his channel explaining neural networks and their inner workings, aimed at people who know how to program.
As a reasonably experienced programmer who has watched Andrej's videos, the one thing I would recommend is that they not be used as a starting point to learn neural networks, but as reinforcement or enhancement once you know the fundamentals.
I was ignorant enough to try to jump straight into his videos and, despite him recommending I watch his preceding videos, I incorrectly assumed I could figure it out as I went. There is terminology in there that you simply must know to get the most out of it. After giving up, going away and filling in the gaps through some other learning, I went back and his videos became (understandably) massively more valuable for me.
I would strongly recommend that anyone else wanting to learn neural networks learn from my mistake.
The next token is taken by sampling the logits in the final column after unembedding. But isn't that just the last token again? Or is the matrix resized to N+1 at some step?
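For context on that question: in standard autoregressive decoding, the final position's logits are trained to predict the token *after* the last input token, not to echo it, and the sampled token is appended so the next forward pass sees N+1 tokens. A sketch of the loop, with `model` as a hypothetical stand-in for the transformer:

```python
def greedy_decode(model, prompt_tokens, n_new):
    """Toy autoregressive loop. `model(tokens)` is a stand-in that returns
    one row of next-token logits per input position."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = model(tokens)[-1]  # final row: prediction for the NEXT token
        tokens.append(max(range(len(logits)), key=lambda i: logits[i]))
        # The sequence has grown to N+1; the next pass re-reads all of it.
    return tokens
```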
mvsin|1 year ago
An example is beam search: https://www.width.ai/post/what-is-beam-search
Essentially we keep a window of probabilities of predicted tokens to improve the final quality of output.
acchow|1 year ago
If you haven't seen the first few chapters, I cannot recommend enough.
Terr_|1 year ago
Prior discussion: https://news.ycombinator.com/item?id=38505211
RecycledEle|1 year ago
Thank you for sharing.