lonk11 | 1 year ago
The blog post author is talking about the output layer, where the model has to produce a prediction for every possible token in the vocabulary. Each output logit is a dot product between the transformer hidden state (dimension D) and that token's embedding (also dimension D, whether or not it is shared with the input embedding), computed for all V tokens in the vocabulary. That's where the V·D term comes from.
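A minimal NumPy sketch of that dot-product view of the output layer, with hypothetical sizes D and V chosen just for illustration:

```python
import numpy as np

D, V = 512, 32000  # hypothetical hidden size and vocabulary size

rng = np.random.default_rng(0)
hidden = rng.standard_normal(D)   # final hidden state for one position, shape (D,)
E = rng.standard_normal((V, D))   # token embedding matrix (possibly tied with input)

# Each logit is a dot product between the hidden state and one token embedding,
# so producing all V logits costs roughly V*D multiply-adds.
logits = E @ hidden               # shape (V,)

# Equivalent per-token view of the same computation:
logits_explicit = np.array([E[i] @ hidden for i in range(V)])
assert np.allclose(logits, logits_explicit)
```

The matrix-vector product and the per-token loop compute the same thing; the loop just makes the V separate D-dimensional dot products explicit.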
It would be great to clarify this in the blog post to make it more accessible, but I understand that there is a tradeoff.