faiD9Eet | 2 years ago
Let us go into detail on compression: First, there is representation. US-ASCII uses 7 bits per Latin letter (usually stored as 8), UTF-32 uses 32 bits (4 bytes) per letter. That is just a temporary representation for the machine, usually in memory only; both carry the same amount of information, and you can store it more compactly on disk. You would not want to save either format to disk as-is, it is a waste of space.
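To make the point concrete, here is a minimal sketch in Python showing the same string in both representations. The information is identical; only the number of bytes per character differs ("utf-32-le" is used here to avoid the byte-order mark).

```python
# Compare how many bytes the same text occupies in different in-memory
# representations. The information is identical; only the encoding differs.
text = "Hello"

ascii_bytes = text.encode("ascii")       # 1 byte per character
utf32_bytes = text.encode("utf-32-le")   # 4 bytes per character, no BOM

print(len(ascii_bytes))  # 5
print(len(utf32_bytes))  # 20
```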
Information content (I hope my translation is correct, see Wikipedia for details) cannot be compressed, but it can be calculated. The rarer a letter, the more information its occurrence carries. As soon as the letters are not all equally frequent (compare space and "q"), the information density drops. The calculation is quite simple: count the occurrences of each letter, count the characters used (if there is no "q" in the text, you save one letter and its encoding), and apply some math:
https://en.wikipedia.org/wiki/Entropy_(information_theory)
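That "some math" is Shannon's formula, H = -Σ p·log2(p) over the symbol frequencies. A minimal sketch (the function name is mine):

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(text: str) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Two equally frequent symbols: exactly 1 bit per symbol.
print(entropy_bits_per_symbol("abab"))  # 1.0
# Skewed frequencies carry less information per symbol.
print(entropy_bits_per_symbol("aaab"))  # ~0.811
```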
For some easy examples, think of Morse code and Huffman coding -- not every letter needs to be encoded with the same number of bits.
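Huffman coding fits in a few lines: repeatedly merge the two least frequent subtrees, prefixing their codes with 0 and 1. A sketch (the tie-breaker integer only exists so the heap never compares dicts):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    counts = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaaabbc")
print(codes)  # 'a' (most frequent) gets a 1-bit code, 'b' and 'c' get 2 bits
```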
> How much data can lowercase save?
Nothing. Either the case carries (almost) no information in the first place, in which case compression will take care of it for you. It could only carry information if uppercase letters were as likely as lowercase letters.
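You can see this empirically: in ordinary prose, uppercase appears almost only at sentence starts, so its position is predictable and a general-purpose compressor nearly cancels it out. A rough demo with zlib (the exact byte counts depend on the input, so only the sizes are printed):

```python
import zlib

# Typical English prose: uppercase appears almost only at sentence starts,
# so its position is predictable and carries little information.
text = ("The quick brown fox jumps over the lazy dog. "
        "The dog did not mind. The fox ran on. ") * 20

original = zlib.compress(text.encode("ascii"))
lowered = zlib.compress(text.lower().encode("ascii"))

# Both shrink far below the raw size; the gap between them is marginal.
print(len(text), len(original), len(lowered))
```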
> How much data can lowercase save?
Why do you even stop at letters? You could build a dictionary of words and compress references to it. The compression efficiency would then depend only on the number of distinct words, regardless of case and regardless of character set. That is why entropy is defined over "symbols" rather than "letters".
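A toy version of that dictionary scheme (helper names are mine): each word becomes an index into a table of distinct words, and decoding is just a lookup.

```python
def dict_encode(text: str):
    """Replace each word with an index into a table of distinct words."""
    table = {}
    ids = []
    for word in text.split():
        # setdefault assigns the next free index the first time a word appears.
        ids.append(table.setdefault(word, len(table)))
    return list(table), ids  # insertion order preserves the index mapping

def dict_decode(table, ids):
    return " ".join(table[i] for i in ids)

table, ids = dict_encode("to be or not to be")
print(table)  # ['to', 'be', 'or', 'not']
print(ids)    # [0, 1, 2, 3, 0, 1]
print(dict_decode(table, ids))  # 'to be or not to be'
```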