faiD9Eet | 2 years ago
Let us go into detail on compression: First, there is representation. US-ASCII uses 7 bits per Latin letter (usually stored as 8), UTF-32 uses 32 bits (4 bytes) per letter. That is just a temporary representation for the machine, usually in memory only; both carry the same amount of information, and you can store it more compactly on disk. You would not want to save either format to disk as-is, it is a waste of space.
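To make the point concrete, here is a minimal sketch in Python showing the same string in both representations. The information is identical; only the number of bytes per character differs ("utf-32-le" is used here to avoid the byte-order mark).

```python
# Compare how many bytes the same text occupies in different in-memory
# representations. The information is identical; only the encoding differs.
text = "Hello"

ascii_bytes = text.encode("ascii")       # 1 byte per character
utf32_bytes = text.encode("utf-32-le")   # 4 bytes per character, no BOM

print(len(ascii_bytes))  # 5
print(len(utf32_bytes))  # 20
```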
Information content (I hope my translation is correct, see Wikipedia for details) cannot be compressed, but it can be calculated. The rarer a letter, the more information its occurrence carries. As soon as the letters are not all equally frequent (compare space and "q"), the information density drops. The calculation is quite simple: count the occurrences of each letter, count the characters used (if there is no "q" in the text, you save one letter and its encoding), and apply some math:
https://en.wikipedia.org/wiki/Entropy_(information_theory)
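That "some math" is Shannon's formula, H = -Σ p·log2(p) over the symbol frequencies. A minimal sketch (the function name is mine):

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(text: str) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Two equally frequent symbols: exactly 1 bit per symbol.
print(entropy_bits_per_symbol("abab"))  # 1.0
# Skewed frequencies carry less information per symbol.
print(entropy_bits_per_symbol("aaab"))  # ~0.811
```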
For some easy examples, think of Morse code and Huffman coding -- not every letter needs to be encoded with the same number of bits.
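Huffman coding fits in a few lines: repeatedly merge the two least frequent subtrees, prefixing their codes with 0 and 1. A sketch (the tie-breaker integer only exists so the heap never compares dicts):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    counts = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaaabbc")
print(codes)  # 'a' (most frequent) gets a 1-bit code, 'b' and 'c' get 2 bits
```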
> How much data can lowercase save?
Nothing. Either the case carries (almost) no information in the first place, in which case compression will take care of it for you. It could only carry information if uppercase letters were as likely as lowercase letters.
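You can see this empirically: in ordinary prose, uppercase appears almost only at sentence starts, so its position is predictable and a general-purpose compressor nearly cancels it out. A rough demo with zlib (the exact byte counts depend on the input, so only the sizes are printed):

```python
import zlib

# Typical English prose: uppercase appears almost only at sentence starts,
# so its position is predictable and carries little information.
text = ("The quick brown fox jumps over the lazy dog. "
        "The dog did not mind. The fox ran on. ") * 20

original = zlib.compress(text.encode("ascii"))
lowered = zlib.compress(text.lower().encode("ascii"))

# Both shrink far below the raw size; the gap between them is marginal.
print(len(text), len(original), len(lowered))
```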
> How much data can lowercase save?
Why do you even stop at letters? You could build a dictionary of words and compress references to it. The compression efficiency would then depend only on the number of distinct words, regardless of case and regardless of character set. That is why entropy is defined over "symbols" rather than "letters".
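A toy version of that dictionary scheme (helper names are mine): each word becomes an index into a table of distinct words, and decoding is just a lookup.

```python
def dict_encode(text: str):
    """Replace each word with an index into a table of distinct words."""
    table = {}
    ids = []
    for word in text.split():
        # setdefault assigns the next free index the first time a word appears.
        ids.append(table.setdefault(word, len(table)))
    return list(table), ids  # insertion order preserves the index mapping

def dict_decode(table, ids):
    return " ".join(table[i] for i in ids)

table, ids = dict_encode("to be or not to be")
print(table)  # ['to', 'be', 'or', 'not']
print(ids)    # [0, 1, 2, 3, 0, 1]
print(dict_decode(table, ids))  # 'to be or not to be'
```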