longhaul | 8 months ago
Thinking about it a bit more: we are doing this at the character level (a Unicode table), so why can't we look up words, or maybe even common sentences?
pornel | 8 months ago
There's every possible text in Pi, but on average it's going to cost the same or more to encode the location of the text than the text itself.
To get compression, you can only shift costs around by making some things take fewer bits to represent, at the cost of making everything else take more bits to disambiguate (e.g. instead of all 256 byte values taking 8 bits each, you can make one specific byte take 1 bit, but all the other bytes will then need 9 bits).
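The 1-bit/9-bit trade-off above falls out of Kraft's inequality, which says a prefix-free code with lengths l_i exists iff the sum of 2^-l_i is at most 1. A minimal sketch (illustrative only, not any real codec):

```python
import math

def kraft_sum(lengths):
    """Sum of 2**-l over all code lengths; <= 1 means a prefix code exists."""
    return sum(2.0 ** -l for l in lengths)

# Uniform 8-bit codes for 256 byte values exactly fill the budget.
print(kraft_sum([8] * 256))  # 1.0

# Give one byte a 1-bit code: it alone consumes half the budget,
# so the other 255 bytes must squeeze into the remaining half.
# At 9 bits each they fit: 0.5 + 255 * 2**-9 ≈ 0.998.
print(kraft_sum([1] + [9] * 255))  # 0.998046875

# Keeping the rest at 8 bits is impossible: the budget overflows.
print(kraft_sum([1] + [8] * 255))  # ≈ 1.496 > 1, no such prefix code
```

So the shortest codes are a scarce resource: spending one on a symbol necessarily lengthens the codes for everything else.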
To be able to reference words from an English dictionary, you will have to dedicate some sequences of bits to them in the compressed stream.
If you use your best and shortest sequences, you're wasting them on picking from an inflexible fixed dictionary, instead of representing data in some more sophisticated way that is more frequently useful (which decoders already do by building adaptive dictionaries on the fly and other dynamic techniques).
If you try to avoid hurting normal compression and assign less valuable longer sequences of bits to the dictionary words instead, these sequences will likely end up being longer than the words themselves.
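A back-of-envelope sketch of the dictionary-reference cost (the 80,000-word dictionary size and flat 8-bits-per-character literal cost are made-up assumptions, ignoring the extra bits a real coder needs to signal "this is a reference"):

```python
import math

# Assumed fixed English dictionary size (hypothetical).
DICT_SIZE = 80_000

# A flat index into that dictionary costs ceil(log2(80000)) bits.
index_bits = math.ceil(math.log2(DICT_SIZE))
print(index_bits)  # 17

# Compare against spelling the word out at ~8 bits per character.
for word in ["a", "the", "compression"]:
    literal_bits = 8 * len(word)
    winner = "index" if index_bits < literal_bits else "literal"
    print(f"{word!r}: literal={literal_bits} bits, index={index_bits} bits -> {winner}")
```

Under these assumptions a flat index only beats literals for words longer than two characters, and that's before paying the escape bits to distinguish references from ordinary data, which is exactly the "sequences end up longer than the words themselves" problem.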
Svetlitski | 8 months ago
https://www.rfc-editor.org/rfc/rfc7932#page-28
wmf | 8 months ago