iknownothow | 1 year ago
But why only choose 13x13 + 1? :(
I'm willing to bet that the author's conclusion that the embeddings come from a CNN is wrong. Still, I can't get the 13x13 + 1 observation out of my head. They've definitely hit on something there. I'm with them that there is very likely a CNN involved, and my bet is that the final filters and kernels are the visual vocabulary.
And how do you go from 50k convolutional kernels (think tokens) to always 170 chosen tokens for any image? I don't know...
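For anyone checking the arithmetic, here is a tiny sketch of how 13x13 + 1 lines up with the 170 tokens. The names `GRID_SIDE`, `EXTRA_TOKENS` and `token_count` are made up for illustration, and reading the "+ 1" as a single extra token on top of the grid is only an assumption, not something confirmed about the actual model.

```python
# Numbers taken from the comment above: a 13x13 grid plus one extra token.
GRID_SIDE = 13
EXTRA_TOKENS = 1  # assumption: one additional token beyond the grid


def token_count(grid_side: int = GRID_SIDE, extra: int = EXTRA_TOKENS) -> int:
    """Return grid_side squared plus the extra tokens."""
    return grid_side * grid_side + extra


print(token_count())  # 13 * 13 + 1 = 170
```

This only shows that the count is consistent with a fixed-size grid; it says nothing about how 50k kernels would get narrowed down to those 170.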