top | item 4617790


meanguy | 13 years ago

Sigh. Claude Shannon is rolling in his grave.


shimon | 13 years ago

Why is this an invalid approach? If you have an agreed-upon OS installation, you've got at least several hundred megabytes of binary data from which you can extract sequences for a lookup dictionary. To compress, you locate the largest substrings of the input that also appear somewhere in the OS data. For example, you might find that bytes 1000-2000 of the input data are the same as bytes 4528-5528 of /bin/ls. Then you just need a program that can extract that byte range from /bin/ls and fill it in at the appropriate place.
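The scheme can be sketched like this (a naive, hypothetical Python version; the token format, the `min_match` threshold, and the greedy search are my own choices, not anyone's actual entry):

```python
# Hypothetical sketch: greedily encode the input as (offset, length)
# references into a shared reference blob (standing in for OS files),
# falling back to literal bytes when no long enough match exists.

def compress(data: bytes, ref: bytes, min_match: int = 8) -> list:
    """Return tokens: ('ref', offset, length) or ('lit', single_byte)."""
    tokens, i = [], 0
    while i < len(data):
        # Find the longest prefix of data[i:] occurring in ref (naive search).
        best_off, best_len = -1, 0
        length = min_match
        off = ref.find(data[i:i + length])
        while off != -1 and i + length <= len(data):
            best_off, best_len = off, length
            length += 1
            off = ref.find(data[i:i + length])
        if best_len >= min_match:
            tokens.append(('ref', best_off, best_len))
            i += best_len
        else:
            tokens.append(('lit', data[i:i + 1]))
            i += 1
    return tokens

def decompress(tokens: list, ref: bytes) -> bytes:
    out = bytearray()
    for t in tokens:
        if t[0] == 'ref':
            _, off, length = t
            out += ref[off:off + length]
        else:
            out += t[1]
    return bytes(out)
```

The catch, as the replies point out, is in how many bits the `('ref', offset, length)` tokens themselves take to serialize.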

Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data. And you may not be allowed to assume an agreed-upon OS installation, although it was OK to assume that gunzip was available. And finally, doing this sort of largest-overlapping-substring search could be very, very slow.

The real key here is that the challenge is about compressing one specific random file. It's clearly impossible to write a compressor that can represent ANY sequence of N bytes in fewer than N bytes including the decompressor. But this challenge is constrained to reconstructing a single input file, not handling arbitrary inputs. If the input files are random, and you keep asking for them, there's a tiny chance you'll get one with enough repetition that even gzip could compress it. If you can key against a large set of reference data, your odds improve significantly.

So, I don't know if this is a practical approach given the restrictions of the contest, the cost of entering, and the expected payoff, but I would say that if it is plausible to use a large body of OS files for lookup, this could be a winnable challenge.

mistercow | 13 years ago

>Why is this an invalid approach?

Because of information theory. It's believed that the binary expansion of pi contains all possible finite bit sequences. A program that expands pi is relatively small. Assuming that hypothesis is true, does that mean we could write a compressor that simply breaks input into blocks and then encodes the output as indexes into pi?

And the answer is that we certainly could write that program. And the result would almost always be a larger output than input. Why? Because you'd have to search so far into pi that the index would contain more bits than the input blocks. In short, having a library of arbitrary strings to draw from does not help you to compress data.
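You can see this with a quick simulation, substituting a fixed pseudorandom byte string for pi's expansion (an assumption, but any incompressible public string behaves the same way for this argument):

```python
import random

random.seed(1)
# Stand-in for pi's digits: a fixed pseudorandom "library".
library = bytes(random.getrandbits(8) for _ in range(1 << 20))

costs = []
for _ in range(50):
    block = bytes(random.getrandbits(8) for _ in range(2))  # a 16-bit block
    pos = library.find(block)
    if pos >= 0:
        costs.append(max(pos, 1).bit_length())  # bits needed for the index

avg = sum(costs) / len(costs)
# The index alone averages roughly as many bits as the 16-bit block it
# points to, before we even encode block boundaries.
print(f"average index size: {avg:.1f} bits for 16-bit blocks")
```

A random 16-bit block first appears, on average, around position 2^16 in random data, so pointing at it costs about 16 bits: exactly what you were trying to save.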

More generally, compression is impossible in the general case. Every lossless compression algorithm will, on average, produce an equal or larger output than its input.
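The counting (pigeonhole) argument behind this claim is short enough to spell out:

```python
# There are 2**n distinct inputs of exactly n bits, but only
# 2**0 + 2**1 + ... + 2**(n-1) = 2**n - 1 distinct strictly shorter
# outputs. A lossless scheme must map inputs to outputs injectively,
# so it cannot shrink every n-bit input: at least one must map to an
# output of n bits or more.
n = 16
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))
print(inputs, shorter_outputs)  # 65536 vs 65535: one input always loses
```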

romaniv | 13 years ago

> Of course, it isn't a given that you will be able to find a sufficiently large overlapping substring in the OS binary data.

If anything, it's a given that you will not find that substring: the larger it is, the less likely it is to appear anywhere in the reference data, and the less likely it is to occur in the uncompressed file in the first place.

http://news.ycombinator.com/item?id=4617983

recursive | 13 years ago

> you might find that bytes 1000-2000

On average, even if you have the specific file to work with, encoding those offsets and lengths will cost at least as much space as the sequences they replace. So it would be possible to do this, but the matches would be so short, and the offsets so large, that you'd end up losing.
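A rough simulation of this break-even (a random reference blob standing in for OS files; the sizes are arbitrary assumptions):

```python
import random

random.seed(2)
# 2^20-byte random reference standing in for OS binaries.
library = bytes(random.getrandbits(8) for _ in range(1 << 20))

match_bits, offset_bits = 0, 0
for _ in range(20):
    target = bytes(random.getrandbits(8) for _ in range(8))
    # Longest prefix of `target` that appears anywhere in the library.
    k = 0
    while k < len(target) and library.find(target[:k + 1]) >= 0:
        k += 1
    off = library.find(target[:k]) if k else 0
    match_bits += 8 * k                      # bits the match replaces
    offset_bits += max(off, 1).bit_length()  # bits to encode the offset

print(match_bits, offset_bits)
```

In an L-byte random reference, the longest match you can expect for random input is about log2(L)/8 bytes, while encoding an offset into that reference costs about log2(L) bits, so the pointer eats roughly everything the match saved.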

DougBTX | 13 years ago

> Why is this an invalid approach?

One way to structure a compressed file is as a lookup dictionary plus references into it. To decompress, both sides need to agree on which dictionary is being used. If you need an agreed-upon OS installation to act as that lookup dictionary before decompression can happen, then by all rights you should count the size of the OS toward the total size of the compressed file.

If you want to use an arbitrary OS, the "decompressor" would need to be able to identify the correct fragments itself. If you can do that in less space than the original file, you've already managed to compress the data!