top | item 42549083

Ts_zip: Text Compression Using Large Language Models

184 points | signa11 | 1 year ago | bellard.org | reply

68 comments

[+] 0x0|1 year ago|reply
This is particularly interesting as there seems to be, for decades, a general consensus that the problem of text compression is the same as the problem of artificial intelligence, for example https://en.wikipedia.org/wiki/Hutter_Prize
[+] bravura|1 year ago|reply
"It is well established that compression is essentially prediction, which effectively links compression and language models (Delétang et al., 2023). The source coding theory from Shannon’s information theory (Shannon, 1948) suggests that the number of bits required by an optimal entropy encoder to compress a message ... is equal to the NLL of the message given by a statistical model." (https://ar5iv.labs.arxiv.org/html//2402.00861)

I will say again that Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression", which evaluates LLMs on their ability to predict future text, is amazing work that the field is currently sleeping on.
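The compression-equals-prediction link is easy to demonstrate numerically. A minimal sketch, using only a toy unigram model fit to the message itself (nothing like an LLM): the ideal code length is just the summed negative log-likelihood of the symbols.

```python
import math
from collections import Counter

def ideal_code_length_bits(message: str) -> float:
    """Bits an optimal entropy coder needs for `message` under a
    unigram model fit to the message itself: each symbol s costs
    -log2 P(s) bits (Shannon's source coding bound)."""
    counts = Counter(message)
    n = len(message)
    return sum(-math.log2(counts[ch] / n) for ch in message)

msg = "abracadabra"
print(f"{ideal_code_length_bits(msg):.2f} bits vs {8 * len(msg)} raw bits")
```

A better predictor (lower NLL) means fewer bits, which is exactly why a strong language model makes a strong compressor.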

[+] retrac|1 year ago|reply
There's a general consensus that entropy is deeply spooky. It pops up in physics in black holes and the heat death of the universe. The physicist Erwin Schrödinger suggested that life itself consumes negative entropy, and others have proposed other definitions of life that are entropic. Some definitions of intelligence also centre on entropy.

What to make of all that, however, enjoys anything but consensus.

[+] WhitneyLand|1 year ago|reply
I’m not sure this is strictly true. It seems more accurate to say there are deep connections between the two rather than they are theoretically equivalent problems. His work is really cool though no doubt.
[+] micimize|1 year ago|reply
In the sense I understand that comparison, or have usually seen it referred to, the compressed representation is the internal latent in a (V)AE. Still, I haven't seen many attempts at compression that would store the latent + a delta to form lossless compression, that an AI system could then maybe use natively at high performance. Or if I have... I have not understood them.
[+] nialv7|1 year ago|reply
it is true, but i think it's only of philosophical interest. for example, in a sense our physical laws are just humanity's attempt at compressing our universe.

the text model used here probably isn't going to be "intelligent" the same way those chat-oriented LLMs are. you can probably still sample text from it, but you can actually do the same with gzip[1].

[1]: https://github.com/Futrell/ziplm
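The gzip-as-language-model trick can be sketched in a few lines (this is the general idea, not ziplm's actual API): score each candidate continuation by how little it inflates the compressed length of the context.

```python
import zlib

def gzip_score(context: str, continuation: str) -> int:
    """Length of the deflate-compressed text; shorter output means
    the compressor 'predicted' the continuation better."""
    return len(zlib.compress((context + continuation).encode(), 9))

# A toy repetitive context makes the effect visible: "mat" extends an
# existing match, while "qzx" forces fresh literals.
context = "the cat sat on the mat. " * 4 + "the cat sat on the "
best = min(["mat", "qzx"], key=lambda c: gzip_score(context, c))
print(best)
```

Greedy or sampled generation then just repeats this scoring step over a vocabulary of candidates.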

[+] zamadatix|1 year ago|reply
Also worth checking out some of the author's other compressors e.g. another one of their neural network solutions using a transformer https://bellard.org/nncp/ holds the top spot in the Large Text Compression Benchmark. It's ~3 orders of magnitude slower though.
[+] remram|1 year ago|reply
If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).

No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?

Please let me know if I misunderstand.

[+] zamadatix|1 year ago|reply
> using a model that is 340 MB

"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2-byte type during compute. This is backed up by checking the size of the download, which comes out to 171,363,973 bytes for the model file.

> and was probably trained on the test data

This is likely a safe assumption (enwik8 is the default training set for RWKV and no mention of using other data was given) however:

> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?

The Ts_zip+enwik9 size comes out to less than the 197,368,568 for xz+enwik9 listed in the Large Text Compression Benchmark despite the large model file. Getting 20,929,618 total bytes smaller while keeping a good runtime speed is not bad and puts it decently high in the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry at 107,261,318 total bytes in the table is nncp by the same author (neural net but not LLM based) so it makes sense to keep an open mind as to why they thought this would be worth publishing.
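The two sizes quoted (340 MB vs a 171 MB download) are consistent under simple arithmetic, treating the file as roughly one byte per parameter (an approximation; any header overhead is ignored here):

```python
# Figures from upthread: an 8-bit model file of 171,363,973 bytes.
model_file_bytes = 171_363_973
params = model_file_bytes              # ~171M parameters at 8 bits each
bf16_bytes = params * 2                # 2 bytes/param when run in BF16
print(f"~{params / 1e6:.0f}M params, ~{bf16_bytes / 1e6:.0f} MB in BF16")
```

The doubled BF16 working set lands right around the "340 MB" figure mentioned above.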

[+] binary132|1 year ago|reply
If you’re compressing 100 or 100k such datasets, presuming that it is not custom tuned for this corpus, then wouldn’t you still save much more than you spend?
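A rough break-even sketch, using the ~340 MB model size and ~78 MB-per-corpus saving quoted upthread (and assuming similar corpora with no retuning):

```python
import math

def break_even_datasets(model_bytes: int, saving_per_corpus: int) -> int:
    """Corpora after which shipping the shared model once beats
    counting its size against every archive."""
    return math.ceil(model_bytes / saving_per_corpus)

print(break_even_datasets(340_000_000, 78_000_000))
```

So past a handful of similar datasets the fixed model cost is amortized, and every further corpus is pure savings.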
[+] KTibow|1 year ago|reply
Notably, solutions specialized for enwik9 (specifically fx2-cmix) take up only 110 MB, including the size of the decompressor.
[+] justmarc|1 year ago|reply
This man is an absolute wizard, and a legend who hasn't stopped since the fantastic LZEXE days.
[+] bhouston|1 year ago|reply
I believe almost all LLMs are trained using Wikipedia these days. So compressing Wikipedia well without including the size of the LLM in the compression result is a bit of a cheat. I guess one would argue it is a universal dataset representing understanding the English language and real-world relationships at this point but it is still a bit of a cheat.
[+] atiedebee|1 year ago|reply
There's a reason compression benchmarks often include the size of the executable when benchmarking compression ratios. That said, Matt Mahoney's large text compression benchmark[0] does currently have a transformer model at number 1.

[0] http://www.mattmahoney.net/dc/text.html

[+] vessenes|1 year ago|reply
Fabrice has recently extended this work into audio encoding, an area which to me seems more useful than shaving a bit more off enwik8 compression rates.

Demo and code? Available at bellard.org as well.

[+] zamadatix|1 year ago|reply
Link for the curious https://bellard.org/tsac/

Has anyone done the work of comparing this to other similar extreme audio compression solutions?

[+] rahimnathwani|1 year ago|reply
[+] jodrellblank|1 year ago|reply
Looks like it’s been updated since then; commenters in that thread are saying the decompressor needs to run on the same hardware as the compressor; now the link says:

> “The model is evaluated in a deterministic and reproducible way. Hence the result does not depend on the exact GPU or CPU model nor on the number of configured threads. This key point ensures that a compressed file can be decompressed using a different hardware or software configuration.”

[+] 0-_-0|1 year ago|reply
1 MBps is insanely fast for a method like this, it must be in the 100k tokens per second range. Probably with large batches.
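That estimate can be sanity-checked with back-of-envelope arithmetic; the bytes-per-token figure below is an assumption (~4 bytes/token is typical for English subword vocabularies, the real tokenizer may differ), which puts the rate at the same order of magnitude:

```python
# 1 MB/s of text at an assumed ~4 bytes per token.
mb_per_s = 1.0
bytes_per_token = 4
tokens_per_s = mb_per_s * 1_000_000 / bytes_per_token
print(f"~{tokens_per_s:,.0f} tokens/s")
```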
[+] droideqa|1 year ago|reply
I have always thought compression to be an analog to intelligence. The smarter you are, the better at summarization you are.
[+] Twirrim|1 year ago|reply
"(and hopefully decompress)" is a horrifying descriptor.
[+] hansvm|1 year ago|reply
It adds levity to the article and also introduces the reader to the sorts of things that can go wrong if they try it at home.

The last paragraph highlights how they fixed one of the main pitfalls I normally see in this sort of thing, where floating-point operations are mangled in myriad ways in the name of efficiency (almost always correct for physics or whatever, but a single bit being incorrect will occasionally mangle this compression scheme).

Mind you, actually doing what they claimed in that last paragraph is usually painful. The easiest approaches re-implement floating-point operations in software using integer instructions, and the complexity increases from there.
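One common trick for this kind of determinism is to keep floats out of the entropy coder entirely: quantize the model's probabilities to integer frequencies so encoder and decoder agree bit-exactly regardless of platform float behavior. A minimal sketch (not Ts_zip's actual scheme):

```python
def quantize_probs(probs, total=1 << 16):
    """Map float probabilities to integer frequencies summing exactly
    to `total`, so the arithmetic coder sees only integers and is
    bit-exact across platforms. Every symbol keeps freq >= 1 so no
    token ever gets zero code space."""
    freqs = [max(1, int(p * total)) for p in probs]
    # Repair rounding drift by adjusting the most probable symbol.
    freqs[freqs.index(max(freqs))] += total - sum(freqs)
    return freqs

f = quantize_probs([0.7, 0.2, 0.1])
print(f, sum(f))
```

The model evaluation itself still has to be deterministic, which is the genuinely painful part the parent describes.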

[+] mikevin|1 year ago|reply
I'm curious what the compressed text looks like. Anyone have an example?
[+] Lerc|1 year ago|reply
If it is within cooee of state of the art the compressed text should look like a pile of random bits.

If it looks like anything at all other than randomness then you can describe whatever it is that it looks like to get more compression.

[+] munch117|1 year ago|reply
Binary goo, barely distinguishable from random data, if at all. The arithmetic coder will make sure of that.

It's the nature of compression: Any discernible pattern could have been exploited for further compression.
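This is easy to observe with an ordinary compressor: the empirical byte entropy of the compressed stream climbs toward the 8 bits/byte of uniform random data, far above that of the source text. A small sketch using zlib:

```python
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte; 8.0 would be
    indistinguishable from uniform random bytes at this level."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

raw = " ".join(str(i) for i in range(5000)).encode()
packed = zlib.compress(raw, 9)
print(f"raw: {byte_entropy(raw):.2f} bpb, "
      f"compressed: {byte_entropy(packed):.2f} bpb")
```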

[+] jll29|1 year ago|reply
Speed and compression are one thing, but I wonder how much energy Ts_zip consumes compared to gzip?
[+] cat5e|1 year ago|reply
Has this been attempted for raw binary? Using an NN to predict the most likely next binary string?