pornel | 18 days ago

The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

(KL divergence of letter frequencies is the same thing as ratio of lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters)
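
For concreteness, counting the letters directly is only a few lines of Python. A rough sketch (function names and the `eps` smoothing for letters unseen in Q are my own):

```python
from collections import Counter
from math import log2

def letter_dist(text):
    """Normalized letter frequencies of a text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """D(P || Q) in bits per symbol: the extra bits paid for coding
    P-distributed text with a code optimized for Q."""
    return sum(pc * log2(pc / q.get(c, eps)) for c, pc in p.items())

p = letter_dist("taco burrito tortilla salsa guacamole")
q = letter_dist("racket court serve volley smash lob")
print(kl_divergence(p, q))  # extra bits/symbol vs. an ideal code for q
```

No bit-twiddling, no compressor: the statistic falls straight out of the counts.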

The article views compression entirely through Python's limitations.

> gzip and LZW don’t support incremental compression

This may be true of Python's APIs, but it is not true of these algorithms in general.

They absolutely support incremental compression, even in the APIs of popular lower-level libraries.

Snapshotting/rewinding of the state usually isn't exposed (a custom gzip dictionary is close enough in practice, but a dedicated API could reuse its internal caches). Algorithmically it is possible, and quite frequently used by the compressors themselves: Zopfli tries lots of what-if scenarios in a loop. Good LZW compression requires rewinding to a smaller symbol size and restarting compression from there after you notice the dictionary has stopped being helpful. The bitstream has a dedicated code for this, so this isn't just possible, but baked into the design.
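
Incidentally, Python's stdlib does expose state snapshotting for zlib: `compressobj.copy()` clones the full compressor state, so a what-if continuation can be branched off without recompressing the prefix. A minimal sketch (the example strings are mine):

```python
import zlib

# Incremental compression: feed data chunk by chunk instead of all at once.
comp = zlib.compressobj()
prefix = comp.compress(b"shared prefix common to both candidates ")

# Snapshot the full internal state (window, pending output, etc.).
snap = comp.copy()

# Continuation A, from the live object...
stream_a = prefix + comp.compress(b"taco burrito guacamole") + comp.flush()
# ...and continuation B, "rewound" to the snapshot.
stream_b = prefix + snap.compress(b"racket volley smash lob") + snap.flush()

print(len(stream_a), len(stream_b))
```

Both streams decompress to their respective full inputs; only the suffix work is duplicated.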

notpushkin|17 days ago

> The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

I think it makes sense to explore it from a practical standpoint, too. It’s in the Python stdlib and works reasonably well, so for some applications it might be good enough.

It’s also fairly easy to implement in other languages with zstd bindings, or even shell scripts:

  $ echo 'taco burrito tortilla salsa guacamole cilantro lime' > /tmp/tacos.txt
  $ zstd --train $(yes '/tmp/tacos.txt' | head -n 50) -o tacos.dict
  [...snip]

  $ echo 'racket court serve volley smash lob match game set' > /tmp/padel.txt
  $ zstd --train $(yes '/tmp/padel.txt' | head -n 50) -o padel.dict
  [...snip]

  $ echo 'I ordered three tacos with extra guacamole' | zstd -D tacos.dict | wc -c
        57
  $ echo 'I ordered three tacos with extra guacamole' | zstd -D padel.dict | wc -c
        60
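
The same idea also works with nothing but the stdlib’s zlib, using a deflate preset dictionary as a stand-in for a trained zstd dictionary. A rough sketch (the texts and names are mine, and raw repeated text replaces `zstd --train`):

```python
import zlib

def compressed_size(sample: bytes, training: bytes) -> int:
    """Size of `sample` compressed against a deflate preset dictionary
    built from `training` text (a crude stand-in for zstd --train)."""
    comp = zlib.compressobj(zdict=training[-32768:])  # deflate window limit
    return len(comp.compress(sample)) + len(comp.flush())

tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50
padel = b"racket court serve volley smash lob match game set " * 50

sample = b"I ordered three tacos with extra guacamole"
sizes = {
    "tacos": compressed_size(sample, tacos),
    "padel": compressed_size(sample, padel),
}
print(min(sizes, key=sizes.get))  # smallest output = closest "topic"
```

Same trick, zero dependencies.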

notpushkin|17 days ago

Or with the newsgroup20 dataset:

  curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz | tar -xzf -
  cd 20_newsgroups
  for f in *; do zstd --train "$f"/* -o "../$f.dict"; done
  cd ..
  for d in *.dict; do
    cat 20_newsgroups/misc.forsale/74150 | zstd -D "$d" | wc -c | tr -d '\n'; echo " $d";
  done | sort | head -n 3
Output:

     422 misc.forsale.dict
     462 rec.autos.dict
     463 comp.sys.mac.hardware.dict
Pretty neat IMO.

Lemaxoxo|17 days ago

Author here. Thanks for your comment!

Compression algorithms may have supported incremental compression for a while. But as some have pointed out, the point of the post is that it is practical and simple to do this with Python's standard library. You could indeed do it in Bash, but people don't do machine learning in Bash.

xxs|17 days ago

gzip/deflate has had SYNC_FLUSH to concatenate messages and/or try something else. It has also always had an Adler-32 checksum for preset dictionaries.
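
Both are reachable from Python's zlib module. A small sketch (the strings are mine):

```python
import zlib

# SYNC_FLUSH: emit all buffered output on a byte boundary, so this chunk
# can be shipped now and later chunks simply concatenated onto it.
comp = zlib.compressobj()
part1 = comp.compress(b"first message ") + comp.flush(zlib.Z_SYNC_FLUSH)
part2 = comp.compress(b"second message") + comp.flush(zlib.Z_FINISH)
print(zlib.decompress(part1 + part2))

# Preset dictionary: the zlib header records the Adler-32 of the
# dictionary, and decompression must supply the same bytes.
zdict = b"guacamole cilantro"
c = zlib.compressobj(zdict=zdict)
blob = c.compress(b"extra guacamole") + c.flush()
plain = zlib.decompressobj(zdict=zdict).decompress(blob)
print(plain)
```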