This looks like a nice rundown of how to do this with Python's zstd module.
But, I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).
Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).
Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the combined size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. That isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.
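That "complicated way of counting shared phrases" can be sketched in a few lines, with stdlib zlib standing in for Zstd (the sample strings here are made up for illustration):

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size of data in bytes."""
    return len(zlib.compress(data, 9))

def shared_savings(a: bytes, b: bytes) -> int:
    """How many bytes compressing A and B together saves over compressing
    them separately -- roughly, how much content they share."""
    return csize(a) + csize(b) - csize(a + b)

doc_a = b"the quick brown fox jumps over the lazy dog " * 20
doc_b = b"the quick brown fox leaps over the lazy cat " * 20  # similar topic
doc_c = b"import numpy; numpy.linalg.svd(some_matrix) " * 20  # unrelated

# Similar documents share far more compressible content:
print(shared_savings(doc_a, doc_b) > shared_savings(doc_a, doc_c))
```

The savings metric mostly reflects literal substring overlap within the compressor's window, which is exactly the "words and phrases in common" measurement described above.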
Good on you for attempting to reproduce the results & writing it up, and reporting the issue to the authors.
> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results
The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.
(KL divergence of letter frequencies is the same thing as ratio of lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters)
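For instance, the code length an ideal entropy coder would achieve falls straight out of a letter-frequency table, with no actual bit-twiddling; a minimal sketch:

```python
import math
from collections import Counter

def ideal_bits(text: str) -> float:
    """Code length (in bits) an ideal entropy coder would need for text,
    computed directly from letter counts -- no compressor involved."""
    counts = Counter(text)
    total = len(text)
    return -sum(n * math.log2(n / total) for n in counts.values())

# A uniform 4-symbol string needs exactly 2 bits per symbol:
print(ideal_bits("abcd" * 100) / 400)  # prints 2.0
```

A skewed letter distribution gives a proportionally smaller number, which is all the compressed-size comparison is really measuring at the letter level.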
The article views compression entirely through Python's limitations.
> gzip and LZW don’t support incremental compression
This may be true of Python's APIs, but it is not true of these algorithms in general.
They absolutely support incremental compression even in APIs of popular lower-level libraries.
Snapshotting/rewinding of the state isn't exposed usually (custom gzip dictionary is close enough in practice, but a dedicated API would reuse its internal caches). Algorithmically it is possible, and quite frequently used by the compressors themselves: Zopfli tries lots of what-if scenarios in a loop. Good LZW compression requires rewinding to a smaller symbol size and restarting compression from there after you notice the dictionary stopped being helpful. The bitstream has a dedicated code for this, so this isn't just possible, but baked into the design.
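For what it's worth, Python does expose incremental (streaming) DEFLATE compression through the lower-level zlib module; the compressor carries its state between calls:

```python
import zlib

# Feed the input piece by piece; state persists across compress() calls.
comp = zlib.compressobj()
out = b""
for chunk in (b"first chunk, ", b"second chunk, ", b"third chunk"):
    out += comp.compress(chunk)
out += comp.flush()

print(zlib.decompress(out))  # prints b'first chunk, second chunk, third chunk'
```

The gzip format is the same DEFLATE stream with a different header, so the algorithm itself is clearly incremental even if a particular wrapper API hides it.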
> The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.
I think it makes sense to explore it from a practical standpoint, too. It’s in the Python stdlib, and works reasonably well, so for some applications it might be good enough.
It’s also fairly easy to implement in other languages with zstd bindings, or even shell scripts:
$ echo 'taco burrito tortilla salsa guacamole cilantro lime' > /tmp/tacos.txt
$ zstd --train $(yes '/tmp/tacos.txt' | head -n 50) -o tacos.dict
[...snip]
$ echo 'racket court serve volley smash lob match game set' > /tmp/padel.txt
$ zstd --train $(yes '/tmp/padel.txt' | head -n 50) -o padel.dict
[...snip]
$ echo 'I ordered three tacos with extra guacamole' | zstd -D tacos.dict | wc -c
57
$ echo 'I ordered three tacos with extra guacamole' | zstd -D padel.dict | wc -c
60
Compression algorithms may have supported incremental compression for a while. But as some have pointed out, the point of the post is that it is now practical and simple to do this from Python's standard library. You could indeed do this in Bash, but people don't do machine learning in Bash.
This has been possible with the zlib module since 1997 [EDIT: zlib is from '97. The zdict param wasn't added until 2012]. You even get similar byte counts to the example, and on my machine it's about 10x faster to use zlib.
import zlib
input_text = b"I ordered three tacos with extra guacamole"
tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50
taco_comp = zlib.compressobj(zdict=tacos)
print(len(taco_comp.compress(input_text) + taco_comp.flush()))
# prints 41
padel = b"racket court serve volley smash lob match game set " * 50
padel_comp = zlib.compressobj(zdict=padel)
print(len(padel_comp.compress(input_text) + padel_comp.flush()))
# prints 54
True. The post calls out that “you have to recompress the training data for each test document” with zlib (otherwise input_text would taint it), but you can actually call Compress.copy().
zdict was added in Python 3.3, though, so it’s 2012, not 1997. (It might have worked before, just not a part of the official API :-)
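A sketch of the Compress.copy() trick (the helper names here are made up): compress the training data once, then snapshot the compressor state per test document instead of recompressing it each time:

```python
import zlib

train = b"taco burrito tortilla salsa guacamole cilantro lime " * 50

base = zlib.compressobj()
base.compress(train)  # training data is now in the compressor's window

def extra_bytes(doc: bytes) -> int:
    """Compressed bytes still to be emitted once doc is appended."""
    c = base.copy()  # snapshot: no need to recompress the training data
    return len(c.compress(doc) + c.flush())

on_topic = extra_bytes(b"I ordered three tacos with extra guacamole")
off_topic = extra_bytes(b"racket court serve volley smash lob match")
print(on_topic, off_topic)
```

Both measurements include the same constant of pending training-data output, so the difference between them reflects only how well each document matches the training text.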
In my PhD more than a decade ago, I ended up using PNG image file sizes to classify different output states from simulations of a system under different conditions. Because of the compression, homogeneous states led to much smaller file sizes than heterogeneous states. It was super, super reliable.
Ooh, totally. Many years ago I was doing some analysis of parking ticket data using gnuplot and had it output a chart PNG per street. Not great, but it worked well enough for the next step of that project: sorting the directory by file size. The most dynamic streets were the largest files by far.
Another way I've used image compression: identifying cops who cover their body cameras while recording -- the filesize-to-length ratio reflects that not much activity is going on.
The author sets the solver to saga, doesn’t standardize the features, and uses a very high max_iter.
Logistic Regression takes longer to converge when features are not standardized.
Also, the zstd classifier's time complexity scales linearly with the number of classes; logistic regression's doesn't. The dataset has 20 classes (it's in the name), so why use only 4?
It’s a cool exploration of zstd. But please give the baseline some love. Not everything has to be better than something to be interesting.
You are correct. To be fair I wasn't focused on comparing the runtimes of both methods. I just wanted to give a baseline and show that the batch approach is more accurate.
Python's zlib does support incremental compression with the zdict parameter. gzip has something similar but you have to do some hacky thing to get at it since the regular Python API doesn't expose the entry point. I did manage to use it from Python a while back, but my memory is hazy about how I got to it. The entry point may have been exposed in the code module but undocumented in the Python manual.
Sweet! I love clever information theory things like that.
It goes the other way too. Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...
EDIT: lossless when they're used as the probability estimator and paired with something like an arithmetic coder.
> Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...
I've actually been experimenting with that lately. I did a really naive version that tokenizes the input, feeds the max context window up to the token being encoded into an LLM, and uses that to produce a distribution of likely next tokens, then encodes the actual token with Huffman Coding with the LLM's estimated distribution. I could get better results with arithmetic encoding almost certainly.
It outperforms zstd by a long shot (I haven't dedicated the compute horsepower to figuring out what "a long shot" means quantitatively with reasonably small confidence intervals) on natural language, like wikipedia articles or markdown documents, but (using GPT-2) it's about as good as zstd or worse than zstd on things like files in the Kubernetes source repository.
You already get a significant amount of compression just out of the tokenization in some cases ("The quick red fox jumps over the lazy dog." encodes to one token per word plus one token for the '.' with the GPT-2 tokenizer), whereas with code a lot of your tokens will represent just a single character, so the entropy coding is doing all the work, which means your compression is only as good as the accuracy of your LLM plus the efficiency of your entropy coding.
I would need to encode multiple tokens per Huffman symbol to approach the entropy bound, since Huffman coding has a minimum of one bit per symbol; if tokens are mostly just one byte, I can't do better than a 12.5% compression ratio with one token per symbol. And grouping tokens gets computationally infeasible very fast. Arithmetic coding would do much better, especially on code, because it can encode a token in fractional bits.
I used Huffman coding for my first attempt because it's easier to implement and most libraries don't support dynamically updating the distribution throughout the process.
I do not agree with the "lossless" adjective. And even if it is lossless, it certainly is not deterministic.
For example, I would not want a zip of an encyclopedia that decompresses to unverified, approximate, and sometimes even wrong text. According to this site: https://www.wikiwand.com/en/articles/Size%20of%20Wikipedia a compressed Wikipedia without media, just text, is ~24 GB. What's the median size of an LLM, 10 GB? 50 GB? 100 GB? Even if it's less, it's not an accurate and deterministic way to compress text.
Compression can be generalized as probability modelling (prediction) + entropy coding of the difference between the prediction and the data. The entropy coding has known optimal solutions.
So yes, LLMs are nearly ideal text compressors, except for all the practical inconveniences of their size and speed (they can be reliably deterministic if you sacrifice parallel execution and some optimizations).
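Concretely, a model that assigns probability p to the symbol that actually occurs pays -log2(p) bits for it under ideal entropy coding, so the total compressed size is just a sum of negative log-probabilities; a toy illustration:

```python
import math

def ideal_code_length(probs) -> float:
    """Ideal compressed size in bits, given the probability the model
    assigned to each symbol that actually occurred (what an arithmetic
    coder approaches)."""
    return sum(-math.log2(p) for p in probs)

# A model that is 50% sure of every next token pays 1 bit per token:
print(ideal_code_length([0.5] * 8))  # prints 8.0
# A sharper model (90% sure each time) compresses the same 8 tokens far better:
print(ideal_code_length([0.9] * 8))
```

This is why better prediction translates directly into better compression: the entropy-coding half is essentially solved, and all remaining gains come from the model.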
My advisor in grad school had me implement a "typo distance" metric on strings once (how many single-key displacements for a typist using home row touch-typing to get from string A to string B), which seemed kind of cool. I never did find out what if anything she wanted to use it for.
Zstd is used in a lot of places now. Lots of servers and browsers support it, as it is usually faster and more efficient than other compression standards. Some Linux distributions compress packages, or even the kernel, with it too, which is preferred in situations where decompression speed matters more than storage cost.
I'm doing my PhD in compression-based machine learning, wanted to contribute a few clarifying points.
The relationship between probabilistic modeling and lossless compression is very direct. A model that predicts the next symbol with probability p can on average losslessly compress that symbol with the help of an entropy coder (e.g. arithmetic coding) in -log(p) bits. Therefore improved probabilistic models immediately translate into improved lossless compressors.
There are two ways people have used compressors for ML: the first is based on the Minimum Description Length (MDL) principle [0] which says that the best model is the one which provides the shortest description of the data, counting the size of the model itself. This is similar to the technique used in this blog post (argmin code length across class-conditioned compressors) except for counting the model size. Basically you can train a compressor per class and then choose the class compressor which best compresses the test data. It is a maximum likelihood argument because of Shannon's source coding theorem: a code length L corresponds to a probability 2^-L. The second way is the Normalized Compression Distance (NCD) [1], which uses code lengths to calculate information-theoretic distances which can then be plugged into distance-based algorithms like kNN. MDL interprets compressed lengths as likelihoods, while NCD interprets them for distance calculations. The theoretical foundation is Kolmogorov complexity which is the ideal (& uncomputable) lossless compressor, used to define information distance, an "ideal" distance metric based on algorithmic similarity.
Data compression is not in principle restricted to syntactic similarity. As others have mentioned, the new wave of neural compressors (e.g. NNCP [2], CMIX [3], leading the large text compression benchmark [4]) outperform their traditional counterparts (e.g. gzip) because of the ability of neural networks to learn complex semantic patterns. Their improved ability to predict the next token means improved lossless compression. This has also been shown to be the case with pre-trained LLMs [5].
I think it's neat that improving compression improves machine learning, and improving machine learning improves compression!
Is this an AI response? This account was created 4 days ago and all its comments follow the exact same structure. Surprisingly, the comments don't obviously read as AI, but they always make sure to include an "it's X, not Y" conclusion.
[0] https://kenschutte.com/gzip-knn-paper/
[1] https://kenschutte.com/gzip-knn-paper2/
The current leaders on the Hutter Prize (http://prize.hutter1.net/) are all LLM-based.
The winner can (slowly!!) compress a 1 GB dump of Wikipedia to 106 MB.
By comparison, gzip can compress it to 321 MB.
See https://mattmahoney.net/dc/text.html for the current leaderboard
Yeah, pretty easy to calculate...
https://en.wikipedia.org/wiki/Normalized_Google_distance
[0] https://arxiv.org/abs/math/0406077
[1] https://arxiv.org/abs/cs/0111054
[2] https://bellard.org/nncp/nncp_v2.pdf
[3] https://www.byronknoll.com/cmix.html
[4] https://www.mattmahoney.net/dc/text.html
[5] https://arxiv.org/abs/2309.10668