This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful it would have made sense - but they just describe this as pick-the-top-answer next-word predictor, which makes the softmax superfluous.
No comments yet.