(no title)
psb217
|
4 months ago
The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.
mapleshamrock|4 months ago
psb217|4 months ago
This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.
[1] - https://arxiv.org/abs/2507.07955