top | item 43673933

samsartor | 10 months ago

Worth clarifying that you are talking about information content, not entropy. A single text file or png has information; the distribution of all possible text files or all possible pngs has entropy.
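Concretely (toy numbers, nothing to do with real files): information content is a property of one outcome, entropy is the expectation of it over the whole distribution.

```python
import math

# Toy distribution over four possible "files" with unequal probabilities.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Information content (surprisal) of a single outcome: -log2 p(x)
def info_content(p):
    return -math.log2(p)

# Entropy: the expected information content over the distribution
entropy = sum(p * info_content(p) for p in probs.values())

print(info_content(probs["a"]))  # 1.0 bit for the likely file
print(info_content(probs["c"]))  # 3.0 bits for a rare file
print(entropy)                   # 1.75 bits for the whole distribution
```

So the rare file carries more information than the common one, but neither of them "has" the entropy; that 1.75 belongs to the distribution.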

I'm not an expert, but let me brainstorm a bit here. Something closer to the specific correlation might be what you want? In vague terms, it would measure how much the various bytes of a file are related to each other, by comparing how likely the bytes are to take their given values jointly versus each byte considered independently.
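A toy version of that comparison (two "bytes" restricted to values 0/1, with a made-up joint distribution where they tend to agree):

```python
import math

# Hypothetical joint distribution over a two-byte "file"; the bytes agree
# (0,0) or (1,1) much more often than independence would predict.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distribution of each byte position
p1 = {v: sum(p for (a, b), p in joint.items() if a == v) for v in (0, 1)}
p2 = {v: sum(p for (a, b), p in joint.items() if b == v) for v in (0, 1)}

# Specific correlation of one concrete file (x1, x2): log2 of how much
# likelier the bytes are taken together vs considered independently.
def specific_correlation(x1, x2):
    return math.log2(joint[(x1, x2)] / (p1[x1] * p2[x2]))

print(specific_correlation(0, 0))  # positive: agreeing bytes are boosted
print(specific_correlation(0, 1))  # negative: disagreeing bytes are suppressed
```

For a real file you'd be plugging its actual byte values into a model of the joint distribution, but the shape of the measure is the same.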

But once again, having an extremely high specific correlation might just indicate a trivial low-complexity example? I'd have to play around with it some more to get a good intuition for how it behaves.

Edit: It seems like this might also be very sensitive to parametrization. The specific correlation in terms of raw byte values would not be much more useful than the information content, because the marginal distributions aren't very interesting (e.g. over all files, how likely is byte #57 to be 0xf3 or whatever). It would be a better measure with more meaningful variables, or even something like a Markov chain where you consider many short substrings.
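A rough sketch of the substring idea (all toy data, and the corpus/score names are just placeholders): estimate unigram and bigram frequencies from a corpus, then score a file by summing, over its adjacent byte pairs, how much likelier each pair is jointly than its two bytes are independently.

```python
import math
from collections import Counter

# Toy "corpus" to estimate the model from; a real version would use
# a large sample of files.
corpus = b"abababababcabababab"

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def score(data):
    # Sum of pointwise mutual information over adjacent byte pairs.
    # (A real implementation would need smoothing for unseen pairs.)
    total = 0.0
    for pair in zip(data, data[1:]):
        p_joint = bigrams[pair] / n_bi
        p_indep = (unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni)
        total += math.log2(p_joint / p_indep)
    return total

print(score(b"ababab"))  # positive: these pairs co-occur in the corpus
```

Here the variables are overlapping bigrams instead of absolute byte positions, so the marginals actually say something about the data.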

Anyway, specific correlation (like information) measures a specific file. The expected specific correlation over all possible files gives the total correlation, which (like entropy) measures the whole distribution. Total correlation is also the KL divergence between the joint distribution and the product of the marginal distributions! Total correlation is also the same thing as mutual information, just generalized to more than two variables. And specific correlation is the generalization of pointwise mutual information.
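To check that identity numerically on a toy two-variable distribution (with two variables, total correlation is exactly the mutual information):

```python
import math

# Hypothetical joint distribution where the two variables tend to agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p1 = {v: sum(p for (a, b), p in joint.items() if a == v) for v in (0, 1)}
p2 = {v: sum(p for (a, b), p in joint.items() if b == v) for v in (0, 1)}

# Specific correlation of one outcome (pointwise mutual information here)
def spec_corr(a, b):
    return math.log2(joint[(a, b)] / (p1[a] * p2[b]))

# Total correlation: the expectation of specific correlation under the joint
total_corr = sum(p * spec_corr(a, b) for (a, b), p in joint.items())

# KL divergence between the joint and the product of the marginals
kl = sum(p * math.log2(p / (p1[a] * p2[b])) for (a, b), p in joint.items())

print(total_corr, kl)  # the two numbers agree
```

The two sums are term-by-term identical, which is the identity in symbols; with three or more variables you'd just take the product over all the marginals instead.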
