wjessup | 3 years ago
What does that mean?
For each token in the input (or in the generated output), the model needs some representation of where that token sits in the sequence.
So there is a position embedding matrix that contains one vector per position. The matrix has "only" 1024 entries for GPT-2, or 2048 for GPT-3. The width of each vector varies as well: 768 for GPT-2 small, up to 12,288 for GPT-3.
So the WPE (the position embedding matrix, "wpe" in the GPT-2 code) is 1024x768 for GPT-2 and 2048x12288 for GPT-3.
At inference time, the position vector for slot i is added to the token embedding of the i-th token, for every token in the original prompt and for every generated token.
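A minimal sketch of that lookup-and-add step, using GPT-2 small's shapes. The weights here are random stand-ins, not real model weights, and the token ids are arbitrary:

```python
import numpy as np

n_ctx, d_model, vocab = 1024, 768, 50257
wte = np.random.randn(vocab, d_model) * 0.02   # token embedding matrix
wpe = np.random.randn(n_ctx, d_model) * 0.02   # position embedding matrix (the WPE)

tokens = np.array([15496, 995, 11])            # arbitrary example token ids
positions = np.arange(len(tokens))             # 0, 1, 2, ...

# Each token's input vector = its token embedding + the embedding of its slot.
x = wte[tokens] + wpe[positions]
print(x.shape)  # (3, 768)
```

Because `wpe` has only `n_ctx` rows, a prompt longer than 1024 tokens has no position vector to look up, which is exactly the context-length limit being discussed.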
kir-gadjello | 3 years ago
As is often the case with these large models, you can change it with some finetuning on longer-context samples from the same dataset, with what is really a small amount of compute compared to the millions of hours spent training the thing.
sebzim4500 | 3 years ago
https://arxiv.org/abs/2104.09864
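The link is the RoFormer paper, which introduces rotary position embeddings (RoPE). Instead of adding a learned vector per slot, RoPE rotates each (even, odd) pair of query/key dimensions by an angle proportional to the token's position, so attention scores depend on relative offsets. A rough sketch of the core rotation, not the paper's reference implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, d), d even."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # one frequency per dim pair
    theta = pos * inv_freq                        # (seq_len, d/2) angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(np.random.randn(8, 64))
print(q.shape)  # (8, 64)
```

Note the position at index 0 is left unchanged (all rotation angles are zero there), and no learned table of size n_ctx is needed, which is part of why RoPE-based models extrapolate or finetune to longer contexts more easily.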