longhaul | 8 months ago
Thinking about it a bit more: we are doing this at the character level (a Unicode table), so why can't we look up words, or maybe even common sentences?
pornel | 8 months ago
There's every possible text in Pi, but on average it's going to cost the same or more to encode the location of the text than the text itself.
To get compression, you can only shift costs around by making some things take fewer bits to represent, at the cost of making everything else take more bits to disambiguate (e.g. instead of all 256 byte values taking 8 bits each, you can make one specific byte take 1 bit, but all the other bytes will then need 9 bits).
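The 1-bit/9-bit trade-off above falls out of Kraft's inequality, which says a prefix-free code with lengths l_i exists iff the sum of 2^-l_i is at most 1. A minimal sketch (illustrative only, not any real codec):

```python
import math

def kraft_sum(lengths):
    """Sum of 2**-l over all code lengths; <= 1 means a prefix code exists."""
    return sum(2.0 ** -l for l in lengths)

# Uniform 8-bit codes for 256 byte values exactly fill the budget.
print(kraft_sum([8] * 256))  # 1.0

# Give one byte a 1-bit code: it alone consumes half the budget,
# so the other 255 bytes must squeeze into the remaining half.
# At 9 bits each they fit: 0.5 + 255 * 2**-9 ≈ 0.998.
print(kraft_sum([1] + [9] * 255))  # 0.998046875

# Keeping the rest at 8 bits is impossible: the budget overflows.
print(kraft_sum([1] + [8] * 255))  # ≈ 1.496 > 1, no such prefix code
```

So the shortest codes are a scarce resource: spending one on a symbol necessarily lengthens the codes for everything else.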
To be able to reference words from an English dictionary, you will have to dedicate some sequences of bits to them in the compressed stream.
If you use your best and shortest sequences, you're wasting them on picking from an inflexible fixed dictionary, instead of representing data in some more sophisticated way that is more frequently useful (which decoders already do by building adaptive dictionaries on the fly and other dynamic techniques).
If you try to avoid hurting normal compression and assign less valuable longer sequences of bits to the dictionary words instead, these sequences will likely end up being longer than the words themselves.
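A back-of-envelope sketch of the dictionary-reference cost (the 80,000-word dictionary size and flat 8-bits-per-character literal cost are made-up assumptions, ignoring the extra bits a real coder needs to signal "this is a reference"):

```python
import math

# Assumed fixed English dictionary size (hypothetical).
DICT_SIZE = 80_000

# A flat index into that dictionary costs ceil(log2(80000)) bits.
index_bits = math.ceil(math.log2(DICT_SIZE))
print(index_bits)  # 17

# Compare against spelling the word out at ~8 bits per character.
for word in ["a", "the", "compression"]:
    literal_bits = 8 * len(word)
    winner = "index" if index_bits < literal_bits else "literal"
    print(f"{word!r}: literal={literal_bits} bits, index={index_bits} bits -> {winner}")
```

Under these assumptions a flat index only beats literals for words longer than two characters, and that's before paying the escape bits to distinguish references from ordinary data, which is exactly the "sequences end up longer than the words themselves" problem.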
Svetlitski | 8 months ago
https://www.rfc-editor.org/rfc/rfc7932#page-28
wmf | 8 months ago