I agree except for (6). A language model assigns probabilities to sequences. The model needs normalised distributions, e.g. via a softmax, so that’s the right way of thinking about it.
This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful, it would have made sense - but they describe the model only as a pick-the-top-answer next-word predictor, and for that the softmax is superfluous.
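A quick sketch of the point (toy logits assumed, not from any actual model): softmax is a monotonic transformation, so for greedy pick-the-top-word decoding, taking the argmax of the raw logits already selects the same word as taking the argmax of the normalised probabilities.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over a 5-word vocabulary
logits = np.array([2.0, -1.0, 0.5, 3.2, 0.0])
probs = softmax(logits)

# Softmax is monotonic, so it never changes which index is largest;
# greedy decoding gives the same answer with or without it.
print(np.argmax(logits) == np.argmax(probs))
```

The softmax only matters when you actually use the probabilities, e.g. for sampling or for computing a likelihood, which is the distinction the comment above is drawing.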
jaidhyani|2 years ago