(no title)
rsp1984 | 4 months ago
It's very hard to guess from the github and paper. For example, there is OCR in the title but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Somebody care to explain the link and provide some high-level context?
intalentive|4 months ago
But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.
Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that we were able to compress the output into a 10x smaller representation.
The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.
rsp1984|4 months ago
Correct?