iknownothow | 1 year ago
The author is speculating about an embedding model, but what they're actually describing is the image tokenizer.
If I'm not wrong, the text tokenizer tiktoken has a vocabulary of about 50k entries. The image tokenizer could have a very large or a very small vocabulary. The 170 tokens this image tokenizer generates might actually contain repeated tokens!
EDIT: PS. What I meant to say was that input embeddings do not come from another trained model; tokens do. The input embedding matrix itself undergoes backpropagation (learning). This is very important: it allows the model to move the embeddings of tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
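To make the distinction concrete, here's a minimal NumPy sketch (all sizes are made up for illustration): the tokenizer only emits integer ids, while the embedding matrix is an ordinary trainable parameter whose rows get moved by gradient updates.

```python
import numpy as np

# Hypothetical sizes for illustration; the real vocab size and dim are unknown.
vocab_size, dim = 50_000, 8
rng = np.random.default_rng(0)

# The input embedding matrix is a plain trainable parameter of the model.
E = rng.normal(size=(vocab_size, dim))

# A (frozen, external) tokenizer just produces integer ids...
token_ids = np.array([17, 42, 17])  # note: ids can repeat

# ...and the embedding lookup is simple row indexing into the learned matrix.
x = E[token_ids]  # shape (3, dim); rows 0 and 2 are identical (same id)

# During training, backprop updates only the rows that were looked up,
# so the model is free to move token embeddings together or apart.
grad = rng.normal(size=x.shape)  # stand-in for a real gradient
lr = 0.1
np.add.at(E, token_ids, -lr * grad)  # accumulates updates for repeated ids
```

The point of `np.add.at` here is that repeated ids (like 17 above) receive the sum of their gradient contributions, exactly as a framework's embedding layer would do.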
iknownothow | 1 year ago
But why only choose 13x13 + 1? :(
I'm willing to bet that the author's conclusion that the embeddings come from a CNN is wrong. However, I cannot get the 13x13 + 1 observation out of my head; they've definitely hit on something there. I agree that a CNN is very likely involved, and my bet is that the final layer's filters/kernels form the visual vocabulary.
And how do you go from 50k convolutional kernels (think of them as tokens) to always exactly 170 chosen tokens for any image? I don't know...
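One way the count could always come out to 170 regardless of image content: carve the image into a fixed 13x13 grid of patches and snap each patch feature to its nearest codebook entry, VQ-style, giving 169 ids, then prepend one special token. This is my speculative sketch, not a confirmed mechanism, and all sizes and the special-token id are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a 50k "visual vocabulary" and a fixed 13x13 patch grid.
vocab_size, dim, grid = 50_000, 16, 13

codebook = rng.normal(size=(vocab_size, dim))  # think: the final conv kernels
patches = rng.normal(size=(grid * grid, dim))  # one feature vector per patch

# VQ-style quantization: each patch gets the id of its nearest codebook row.
# ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2  (expanded to avoid a huge broadcast)
d2 = ((patches ** 2).sum(1, keepdims=True)
      - 2 * patches @ codebook.T
      + (codebook ** 2).sum(1))
ids = d2.argmin(axis=1)  # 169 token ids; repeats are allowed

# Prepend one special token (hypothetical id outside the codebook) -> 170 total.
tokens = np.concatenate(([vocab_size], ids))
print(tokens.shape)  # (170,) for every image, regardless of content
```

Under this scheme the token count is fixed by the grid, not by the image, which would explain why it is always 13x13 + 1.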
kolinko | 1 year ago
iknownothow | 1 year ago
If the external model also undergoes training along with the main model, then I think that might work.