This is particularly interesting, as there seems to have been, for decades, a general consensus that the problem of text compression is the same as the problem of artificial intelligence; see, for example, https://en.wikipedia.org/wiki/Hutter_Prize
"It is well established that compression is essentially prediction, which effectively links compression and langauge models (Delétang et al., 2023). The source coding theory from Shannon’s information theory (Shannon, 1948) suggests that the number of bits required by an optimal entropy encoder to compress a message ... is equal to the NLL of the message given by a statistical model." (https://ar5iv.labs.arxiv.org/html//2402.00861)
I will say again that Li et al. 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression", which evaluates LLMs on their ability to predict future text, is amazing work that the field is currently sleeping on.
There's a general consensus that entropy is deeply spooky. It pops up in physics in black holes and the heat death of the universe. The physicist Erwin Schrödinger suggested that life itself consumes negative entropy, and others have proposed other definitions of life that are entropic. Some definitions of intelligence also centre on entropy.
What to make of all that, however, is anything but a matter of consensus.
I’m not sure this is strictly true. It seems more accurate to say there are deep connections between the two than that they are theoretically equivalent problems. His work is really cool, though, no doubt.
In the sense I understand that comparison, or have usually seen it referred to, the compressed representation is the internal latent of a (V)AE. Still, I haven't seen many attempts at compression that store the latent plus a delta to form lossless compression, which an AI system could then maybe use natively at high performance. Or if I have... I have not understood them.
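A minimal sketch of that latent-plus-delta idea (my own toy, not from any of the linked work): here coarse quantization stands in for the (V)AE latent, and the residual delta makes the round trip exactly lossless.

```python
# Toy "latent + delta" lossless scheme: store a lossy, compact part
# plus the residual needed to reconstruct the input exactly. In a real
# system the "latent" would be a (V)AE encoding and the recon would be
# the decoder's output; here quantization plays that role.

def encode(xs, step=10):
    latent = [round(x / step) for x in xs]      # lossy, compact part
    recon = [q * step for q in latent]
    delta = [x - r for x, r in zip(xs, recon)]  # small residuals
    return latent, delta

def decode(latent, delta, step=10):
    return [q * step + d for q, d in zip(latent, delta)]

xs = [3, 17, 42, 99]
latent, delta = encode(xs)
assert decode(latent, delta) == xs  # exact round trip
```

The hope expressed above would be that the latent part is directly usable by a model at high performance, while the deltas stay small and cheap to entropy-code.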
It is true, but I think it's only of philosophical interest. For example, in a sense, our physical laws are just humanity's attempt at compressing our universe.
The text model used here probably isn't going to be "intelligent" the same way those chat-oriented LLMs are. You can probably still sample text from it, but you can actually do the same with gzip[1].
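The gzip trick looks roughly like this (my own sketch of the idea behind ziplm, not its actual code): rank candidate continuations by how little they grow the compressed size of the context.

```python
import zlib

# gzip-as-language-model sketch: a continuation that matches the
# context's patterns compresses well (back-references), so it costs
# fewer extra bytes than unrelated text.

def score(context: bytes, candidate: bytes) -> int:
    # Lower is better: extra compressed bytes needed for the candidate.
    return len(zlib.compress(context + candidate)) - len(zlib.compress(context))

context = b"the cat sat on the mat. " * 50
candidates = [b"the cat sat on the mat.", b"zqxvkj wbf pmg lrth."]
best = min(candidates, key=lambda c: score(context, c))
```

Sampling instead of argmax (as ziplm does) turns the same scoring rule into a crude generative model.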
Also worth checking out some of the author's other compressors, e.g. another of their neural-network solutions, this one using a transformer (https://bellard.org/nncp/), which holds the top spot in the Large Text Compression Benchmark. It's ~3 orders of magnitude slower, though.
If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).
No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2 byte type during compute. This is backed up by checking the size of from the download which comes out as 171,363,973 bytes for the model file.
> and was probably trained on the test data
This is likely a safe assumption (enwik8 is the default training set for RWKV, and no mention of using other data was given); however:
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
The ts_zip+enwik9 size comes out to less than the 197,368,568 bytes for xz+enwik9 listed in the Large Text Compression Benchmark, despite the large model file. Getting 20,929,618 total bytes smaller while keeping a good runtime speed is not bad, and puts it decently high in the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry, at 107,261,318 total bytes, is nncp by the same author (neural-net but not LLM based), so it makes sense to keep an open mind as to why they thought this would be worth publishing.
If you’re compressing 100 or 100k such datasets, presuming that it is not custom tuned for this corpus, then wouldn’t you still save much more than you spend?
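Back-of-envelope on that amortization argument, using the figures from upthread (340 MB model, 78 MB saved per enwik9-sized corpus) and assuming, as the comment does, no per-corpus tuning:

```python
import math

# The model ships once; the savings accrue per corpus. Figures are the
# rough MB numbers quoted upthread, not measured here.
MODEL_MB = 340
SAVED_PER_CORPUS_MB = 78

def net_saving_mb(n_corpora: int) -> int:
    return n_corpora * SAVED_PER_CORPUS_MB - MODEL_MB

break_even = math.ceil(MODEL_MB / SAVED_PER_CORPUS_MB)  # 5 corpora
```

So under these assumptions the scheme pays for itself after about five such corpora, and at 100k corpora the model size is noise.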
I believe almost all LLMs are trained using Wikipedia these days. So compressing Wikipedia well without including the size of the LLM in the compression result is a bit of a cheat. I guess one could argue it is by now a universal dataset, representing an understanding of the English language and real-world relationships, but it is still a bit of a cheat.
There's a reason compression benchmarks often include the size of the executable when benchmarking compression ratios. Although Matt Mahoney's large text compression benchmark[0] does currently have a transformer model at number 1.
Looks like it’s been updated since then; commenters in that thread were saying the decompressor needs to run on the same hardware as the compressor, but now the link says:
> “The model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.”
It adds levity to the article and also introduces the reader to the sorts of things that can go wrong if they try it at home.
The last paragraph highlights how they fixed one of the main pitfalls I normally see in this sort of thing, where floating-point operations are mangled in myriad ways in the name of efficiency (almost always correct for physics or whatever, but a single bit being incorrect will occasionally mangle this compression scheme).
Mind you, actually doing what they claimed in that last paragraph is usually painful. The easiest approaches re-implement floating-point operations in software using integer instructions, and the complexity increases from there.
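A quick illustration of why this is hard, plus the integer route (a toy sketch of the general technique, not what ts_zip actually does): float addition isn't associative, so a different reduction order (threads, GPU vs CPU) can flip low bits, and an arithmetic coder then desyncs; fixed-point integer math sidesteps that.

```python
# Float addition is not associative, so evaluation order changes bits:
a, b, c = 0.1, 0.2, 0.3
assert (a + b) + c != a + (b + c)  # 0.6000000000000001 vs 0.6

# One fix (the "painful" software route): do the arithmetic in fixed
# point, where integer addition is exact and order-independent.
SCALE = 1 << 16

def to_fixed(x: float) -> int:
    return int(round(x * SCALE))

vals = [to_fixed(v) for v in (a, b, c)]
# Any summation order gives the same integer result.
assert sum(vals) == sum(reversed(vals))
```

For compression, one wrong low bit means the decoder's probability table diverges from the encoder's and everything after that point is garbage, which is why the page's determinism guarantee matters.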
0x0 | 1 year ago
bravura | 1 year ago
retrac | 1 year ago
WhitneyLand | 1 year ago
micimize | 1 year ago
nialv7 | 1 year ago
[1]: https://github.com/Futrell/ziplm
zamadatix | 1 year ago
remram | 1 year ago
Please let me know if I misunderstand.
zamadatix | 1 year ago
binary132 | 1 year ago
KTibow | 1 year ago
justmarc | 1 year ago
bhouston | 1 year ago
atiedebee | 1 year ago
[0] http://www.mattmahoney.net/dc/text.html
vessenes | 1 year ago
Demo and code? Available at bellard.org as well.
zamadatix | 1 year ago
Has anyone done the work of comparing this to other similar extreme audio compression solutions?
rahimnathwani | 1 year ago
jodrellblank | 1 year ago
0-_-0 | 1 year ago
droideqa | 1 year ago
Twirrim | 1 year ago
hansvm | 1 year ago
perching_aix | 1 year ago
mikevin | 1 year ago
Lerc | 1 year ago
If it looks like anything at all other than randomness, then you can describe whatever it is that it looks like and get more compression.
munch117 | 1 year ago
It's the nature of compression: Any discernible pattern could have been exploited for further compression.
j_juggernaut | 1 year ago
https://llmencryptdecrypt-euyfofcjh8bf2utuha2zox.streamlit.a...
meindnoch | 1 year ago
jll29 | 1 year ago
cat5e | 1 year ago