lonk11 | 1 year ago
The blog post author is talking about the output layer, where the model has to produce a prediction for every possible token in the vocabulary. Each output logit is a dot product between the transformer hidden state (dimension D) and that token's embedding (also dimension D, whether or not it is shared with the input embedding), computed for all V tokens in the vocabulary. That's where the V·D term comes from.
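A minimal NumPy sketch of that dot-product view of the output layer, with hypothetical sizes D and V chosen just for illustration:

```python
import numpy as np

D, V = 512, 32000  # hypothetical hidden size and vocabulary size

rng = np.random.default_rng(0)
hidden = rng.standard_normal(D)   # final hidden state for one position, shape (D,)
E = rng.standard_normal((V, D))   # token embedding matrix (possibly tied with input)

# Each logit is a dot product between the hidden state and one token embedding,
# so producing all V logits costs roughly V*D multiply-adds.
logits = E @ hidden               # shape (V,)

# Equivalent per-token view of the same computation:
logits_explicit = np.array([E[i] @ hidden for i in range(V)])
assert np.allclose(logits, logits_explicit)
```

The matrix-vector product and the per-token loop compute the same thing; the loop just makes the V separate D-dimensional dot products explicit.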
It would be great to clarify this in the blog post to make it more accessible, but I understand that there is a tradeoff.