top | item 7868889

ssdfsdf | 11 years ago

I'm not sure that is true. Take, for instance, the first 100 prime numbers printed one after another as a string. The string is long and looks random, yet has low algorithmic complexity, since the machine that prints the numbers is fairly simple. A standard compression algorithm will not be able to compress the string very effectively.

Therefore I am not sure that compressing the string is likely to give you a sense of the information contained within it, at least not information in the sense we are interested in.
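The prime-string claim is easy to check empirically; here is a minimal sketch using Python's stdlib zlib (the prime count is from the comment above; the choice of compressor is mine, and whether the resulting ratio counts as "effective" is left to the reader):

```python
import zlib

def first_primes(n):
    """Return the first n primes by trial division (simple, not fast)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Concatenate the first 100 primes into one string: "23571113..."
s = "".join(str(p) for p in first_primes(100))

original = len(s.encode())
compressed = len(zlib.compress(s.encode(), 9))
print(original, compressed)
```

zlib will shave something off (digits only use ten symbols, so entropy coding helps), but nothing close to the few lines of code that generate the string.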

gwern | 11 years ago

Compressing with something like gzip, xz, or zpaq gives you an upper bound on the complexity. This upper bound is pretty loose, but it works well in a surprising number of domains (not beating special-purpose algorithms usually, of course, but that something like gzip can be used to estimate e.g. phylogenetic trees at all is surprising).

See for example http://www.illc.uva.nl/Research/Publications/Dissertations/D... 'Statistical Inference Through Data Compression'.
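The phylogeny-style estimates mentioned here are typically built on the Normalized Compression Distance from that line of work (Cilibrasi & Vitányi); a rough sketch with zlib standing in for the compressor (the toy strings are made up for illustration):

```python
import hashlib
import zlib

def c(data: bytes) -> int:
    """Approximate complexity of data by its compressed length."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for very similar
    strings, near 1 for unrelated ones."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"ACGT" * 200
b = b"ACGT" * 200  # identical to a
# ~800 bytes of incompressible filler, generated deterministically
r = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(25))

print(ncd(a, b))  # small: compressing a+b costs little more than a alone
print(ncd(a, r))  # large: a tells the compressor nothing about r
```

The idea is that if x helps compress y, the compressor has found shared structure, which is enough signal to cluster genomes without any biology-specific model.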

ssdfsdf | 11 years ago

I am aware of the interesting duality between compression and learning. I have spent quite some time thinking about it.

However, I am still not convinced that this upper bound will provide the information we need in this case. What we are looking for is a relative measure of complexity between genomes. The upper bound will not necessarily give us this relative measure, because the algorithm may not compress each genome by the same factor. The compressibility of a particular genome by a specific algorithm will depend on how the organism encodes its information. For instance, the organism may repeat codes for redundancy but permute the letters of the copy in a predictable way, for its own reasons. The compression algorithm will not pick up on this.
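The permuted-copy worry is easy to demonstrate: an LZ77-style compressor like zlib finds a verbatim repeat but not, say, a reversed copy, even though the two strings have essentially the same algorithmic complexity. A toy sketch (the block size and the choice of reversal as the "predictable permutation" are mine):

```python
import random
import zlib

def clen(data: bytes) -> int:
    """Compressed length as a complexity proxy."""
    return len(zlib.compress(data, 9))

rng = random.Random(0)
block = bytes(rng.randrange(256) for _ in range(2000))

exact = block + block          # verbatim redundancy: LZ77 matches the repeat
permuted = block + block[::-1]  # same information, but the copy is reversed

print(clen(exact), clen(permuted))
```

Both inputs are one random block plus a trivially derived copy, so a perfect estimator would score them about the same; zlib compresses the first to roughly half the size of the second, which is exactly the kind of algorithm-dependent gap being described.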

It is useful to use compressibility by a range of algorithms as the input to a machine learning algorithm, or as part of the model in AIXI, but it is not useful for estimating algorithmic complexity (as far as I am concerned; I am open-minded, but not convinced yet).