lucb1e|3 years ago

What does "ratio (bpb)" mean? I'd guess bytes-per-byte or something, like how many bytes of original you get for each byte of compressed output, but it doesn't work out: the original size is 1e9 bytes, the compressed size (rounded) is 3.2e8, so that's a ratio of about 3.1. The program size amounts to a rounding error on that figure. The bpb value given is 2.58, nowhere near 3.1.

Edit: the paper defines it as "bits per input byte". What kind of measure is that? It's like "how well did it compress compared to a factor of 8", why 8?!

jltsiren|3 years ago

The bit is the most fundamental unit of information. A base-e unit might be more elegant from a certain mathematical perspective, but the connections to formal logic and the ease of implementation make the base-2 bit the natural choice, at least when talking about things like information, entropy, and compression.

Bytes, on the other hand, are entirely arbitrary. At some point, the industry converged on groups of 8 bits as the primary semantically meaningful unit smaller than a word, probably because people at the time thought that 256 distinct characters would be more or less the right number, and because groups of power-of-2 bits are convenient at the hardware level.

Entropy is usually expressed as bits per symbol (or bits per character), because that's what you get when you sum -P(c) log2 P(c) over all symbols c. People who are used to that convention often extend it to compression ratios. Bits per byte is rare, because bytes are rarely semantically meaningful.

isaacimagine|3 years ago

8 / 2.58 ≈ 3.1

dgacmu|3 years ago

It's a common way to represent entropy (the information content). One could measure bits per x for any x, of course, but bits per character (née byte) is quite common and goes back to Shannon.
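The entropy sum and the bpb-to-ratio conversion from the replies above can be sketched in a few lines of Python (the function names are mine, not from the thread; `bits_per_symbol` is the order-0 empirical entropy, ignoring any context between symbols):

```python
from collections import Counter
import math

def bits_per_symbol(data: bytes) -> float:
    """Order-0 empirical entropy: -sum of P(c) * log2(P(c)) over all symbols c."""
    counts = Counter(data)
    n = len(data)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def ratio_from_bpb(bpb: float) -> float:
    """A byte is 8 bits, so 'bits per input byte' corresponds to a size ratio of 8 / bpb."""
    return 8.0 / bpb

# Two equiprobable symbols carry exactly 1 bit each.
print(bits_per_symbol(b"aabb"))          # → 1.0
# The paper's 2.58 bpb matches the ~3.1x ratio computed from the file sizes.
print(round(ratio_from_bpb(2.58), 2))    # → 3.1
```

So the two numbers in the question agree: 2.58 bits per input byte means each 8-bit input byte shrinks to 2.58 bits on average, i.e. a ratio of 8 / 2.58 ≈ 3.1.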