rlanday | 2 years ago
I think the author is more correct than you are. It is not necessarily the case that we need 3,204 dimensions to represent the information contained in the tokens; in fact, the token embeddings live in a low-dimensional subspace of the full residual-stream space. See footnote 6 here:
https://transformer-circuits.pub/2021/framework/index.html
> We performed PCA analysis of token embeddings and unembeddings. For models with large d_model, the spectrum quickly decayed, with the embeddings/unembeddings being concentrated in a relatively small fraction of the overall dimensions. To get a sense for whether they occupied the same or different subspaces, we concatenated the normalized embedding and unembedding matrices and applied PCA. This joint PCA process showed a combination of both "mixed" dimensions and dimensions used only by one; the existence of dimensions which are used by only one might be seen as a kind of upper bound on the extent to which they use the same subspace.
So some of the residual-stream dimensions are used to encode the input tokens, some are used to pick the output tokens (and some are used for both), while the rest are used only for intermediate computations. This suggests you might be able to improve on the standard transformer architecture by increasing (or increasing and then decreasing) the dimension across layers, rather than using the same embedding dimensionality at every layer.
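The footnote's analysis is easy to reproduce on toy data. A minimal sketch with NumPy, assuming synthetic embedding/unembedding matrices constructed to occupy low-dimensional, partially overlapping subspaces (the matrix shapes, the 99% variance threshold, and Frobenius normalization are all my assumptions, not from the footnote):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_eff = 1000, 256, 32  # toy sizes; real models are far larger

# Toy embed/unembed matrices living in low-dim, partially overlapping
# subspaces of the d_model-dimensional residual stream.
basis = rng.standard_normal((d_eff + d_eff // 2, d_model))
W_E = rng.standard_normal((vocab, d_eff)) @ basis[:d_eff]       # embedding
W_U = rng.standard_normal((vocab, d_eff)) @ basis[d_eff // 2:]  # unembedding, shifted subspace

def variance_spectrum(W):
    """Fraction of total variance captured by each principal direction (PCA via SVD)."""
    W = W - W.mean(axis=0)
    s = np.linalg.svd(W, compute_uv=False)
    return s**2 / np.sum(s**2)

spec = variance_spectrum(W_E)
k = np.searchsorted(np.cumsum(spec), 0.99) + 1
print(f"99% of embedding variance lives in {k} of {d_model} dims")

# Joint PCA of the normalized, concatenated matrices, as the footnote describes
# (exact normalization unspecified there; Frobenius norm assumed here).
norm = lambda W: W / np.linalg.norm(W)
joint = variance_spectrum(np.vstack([norm(W_E), norm(W_U)]))
k_joint = np.searchsorted(np.cumsum(joint), 0.99) + 1
print(f"99% of joint embed+unembed variance lives in {k_joint} of {d_model} dims")
```

On this toy setup the embedding spectrum concentrates in roughly d_eff dimensions, and the joint spectrum needs somewhat more, reflecting the dimensions used by only one of the two matrices.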