

rsp1984 | 4 months ago

Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why it's a big deal?

It's very hard to guess from the GitHub repo and paper. For example, there is OCR in the title but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Could somebody explain the link and provide some high-level context?


intalentive|4 months ago

Suppose you have an image with 1000 words in it, and suppose for simplicity that every word is 1 token. Then the image is “worth” 1000 tokens.

But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.

Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that 1000 text tokens were recovered from only 100 tokens, so the text was effectively compressed into a 10x smaller representation.

The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.
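
To put numbers on the example above, here is a minimal Python sketch. The token counts are the hypothetical ones from this comment (1000 text tokens, 100 image tokens), not measurements from the paper, and the function names are made up for illustration. The second function assumes plain full self-attention, where every token is compared with every other token; models with other attention schemes would scale differently.

```python
def compression_ratio(text_tokens: int, latent_tokens: int) -> float:
    """How many text tokens each latent ("image") token stands in for."""
    return text_tokens / latent_tokens


def attention_pairs(n_tokens: int) -> int:
    """Token-to-token comparisons in full self-attention (assumed n * n)."""
    return n_tokens * n_tokens


# Hypothetical counts taken from the example in the comment.
text_tokens = 1000
image_tokens = 100

ratio = compression_ratio(text_tokens, image_tokens)
pair_savings = attention_pairs(text_tokens) / attention_pairs(image_tokens)

print(f"context is {ratio:.0f}x shorter")
print(f"attention does {pair_savings:.0f}x fewer comparisons (under the n*n assumption)")
```

So under these assumptions a 10x shorter context would mean roughly 100x fewer attention comparisons, which is one reason the compression angle matters more than the OCR angle.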

rsp1984|4 months ago

Excellent, thanks. So basically this is saying: "our pixels-to-token encoding is so efficient (a set of 'image tokens' carries much more information than the same number of text tokens), why even bother representing text as text?"

Correct?