hamsic | 3 months ago

"Lossless" does not mean that the LLM can accurately reconstruct human-written sentences. Rather, it means that the LLM generates a fully reproducible bitstream based on its own predicted probability distribution.

Reconstructing human-written sentences accurately is impossible because it requires modeling the "true source"—the human brain state (memory, emotion, etc.)—rather than the LLM itself.

Instead, a practical approach is to reconstruct the LLM output itself based on seeds or to store it in a compressible probabilistic structure.
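
The seed-based idea can be sketched with a toy stand-in for a model (everything here — `VOCAB`, `next_token_probs` — is hypothetical, not from any real LLM): if sampling is driven by a seeded RNG, storing just the seed and length reproduces the model's own output exactly.

```python
import random

# Toy "LLM": a tiny vocabulary and a deterministic fake distribution.
# A real system would query an actual model's softmax instead.
VOCAB = ["the", "cat", "sat", "mat", "."]

def next_token_probs(context):
    # Hypothetical stand-in for model predictions: a deterministic
    # function of the context length (NOT the sampling seed).
    rng = random.Random(len(context))
    weights = [rng.uniform(0.1, 1.0) for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def generate(seed, n_tokens):
    """Sample with a fixed seed, so the whole token stream is a pure
    function of (seed, n_tokens) -- those two numbers reconstruct it."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_tokens):
        probs = next_token_probs(out)
        out.append(rng.choices(VOCAB, weights=probs)[0])
    return out

# Same seed -> byte-identical output: the model's own text is
# "compressed" to a seed, but this says nothing about human text.
assert generate(42, 10) == generate(42, 10)
```

This only reproduces text the model itself generated; it is orthogonal to compressing an arbitrary human-written input.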

DoctorOetker | 3 months ago

It's unclear what you claim lossless compression does or doesn't do, especially since you bring up storing an RNG's seed value at the end of your comment.

"LLMZip: Lossless Text Compression Using Large Language Models"

Implies they use the LLM's next-token probability distribution to sort the candidate tokens by likelihood: the higher the actual next token from the input stream (human-generated or not) ranks in that sorted list, the fewer bits are needed to encode its position counting from the top. So the better the LLM predicts the true probability of the next token, the better it will compress human-generated text in general.
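
That rank-coding scheme can be sketched with a toy stand-in for the model (a minimal sketch: `VOCAB` and `predicted_probs` are made-up placeholders, not LLMZip's actual model or tokenizer). Each token is replaced by its rank in the likelihood-sorted list; a good predictor yields mostly small ranks, which an entropy coder then stores in few bits.

```python
# Toy next-token predictor: a fixed, deterministic distribution that
# shifts with context length. A real LLM's softmax would go here.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def predicted_probs(context):
    base = [0.4, 0.25, 0.15, 0.12, 0.08]  # hypothetical probabilities
    shift = len(context) % len(VOCAB)
    return base[shift:] + base[:shift]

def encode(tokens):
    """Replace each token by its rank in the model's likelihood-sorted
    list. Accurate predictions mean mostly rank 0 -> cheap to entropy-code."""
    ranks = []
    for i, tok in enumerate(tokens):
        probs = predicted_probs(tokens[:i])
        order = sorted(range(len(VOCAB)), key=lambda j: -probs[j])
        ranks.append(order.index(VOCAB.index(tok)))
    return ranks

def decode(ranks):
    """Invert encode(): rebuild the same sorted list at each step (the
    decoder sees the same context, so it computes the same ranking)."""
    tokens = []
    for r in ranks:
        probs = predicted_probs(tokens)
        order = sorted(range(len(VOCAB)), key=lambda j: -probs[j])
        tokens.append(VOCAB[order[r]])
    return tokens

msg = ["the", "cat", "sat", "on", "the", "mat"]
assert decode(encode(msg)) == msg  # lossless round trip on arbitrary input
```

The round trip is exact for any token sequence, which is the sense in which the scheme is lossless; prediction quality only affects how many bits the ranks cost, never correctness.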

Do you deny LLMs can be used this way for lossless compression?

Such a system can accurately reconstruct the uncompressed original input text (say generated by a human) from its compressed form.

hamsic | 3 months ago

Sure, a model-based coder can losslessly compress any token stream. I just meant that for human-written text, the model's prediction diverges from how the text was actually produced, so the compression is formally lossless but not semantically faithful or efficient.