top | item 39870842

sn | 1 year ago

For bad-3-corrupt_lzma2.xz, the commit message claimed that "the original files were generated with random local to my machine. To better reproduce these files in the future, a constant seed was used to recreate these files.", but gave no indication of what the seed was.

I got curious and ran 'ent' (https://www.fourmilab.ch/random/) to see how likely it is that the data in the bad stream is random. The file holds 3 streams and the middle one is supposed to be the "bad" one, so I split it in Python with this regex and wrote the result to "tmp":

    re.split(b'\xfd7zXZ', x)
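
For anyone who wants to repeat the split, here is a minimal sketch of that step as a standalone function. The function name and the choice to keep the magic bytes on each piece are my own assumptions, not what was actually run above. It also matches the full 6-byte header magic including the trailing NUL, and it is naive: it would split in the wrong place if those bytes happened to occur inside a payload.

```python
import re

# 6-byte magic that starts every .xz stream header: FD 37 7A 58 5A 00
XZ_MAGIC = b"\xfd7zXZ\x00"


def split_xz_streams(data: bytes) -> list[bytes]:
    """Split concatenated .xz streams at each header magic.

    Unlike re.split, each returned piece keeps its leading magic bytes.
    Hypothetical helper for illustration only.
    """
    # Offsets where a stream header appears to begin.
    starts = [m.start() for m in re.finditer(re.escape(XZ_MAGIC), data)]
    # Each piece runs from one offset to the next (or to the end of the data).
    ends = starts[1:] + [len(data)]
    return [data[a:b] for a, b in zip(starts, ends)]
```

Usage would be something like `split_xz_streams(open("bad-3-corrupt_lzma2.xz", "rb").read())`, then writing out the middle piece.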

I manually used dd and truncate to strip out the remaining header and footer according to the specification, which left 48 bytes:

    $ ent tmp2 # bad file payload
    Entropy = 4.157806 bits per byte.
    
    Optimum compression would reduce the size
    of this 48 byte file by 48 percent.
    
    Chi square distribution for 48 samples is 1114.67, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 51.4167 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.258711 (totally uncorrelated = 0.0).
    
    $ ent tmp3 # urandom
    Entropy = 5.376629 bits per byte.
    
    Optimum compression would reduce the size
    of this 48 byte file by 32 percent.
    
    Chi square distribution for 48 samples is 261.33, and randomly
    would exceed this value 37.92 percent of the times.
    
    Arithmetic mean value of data bytes is 127.8125 (127.5 = random).
    Monte Carlo value for Pi is 3.500000000 (error 11.41 percent).
    Serial correlation coefficient is -0.067038 (totally uncorrelated = 0.0).

The data does not look random: the chi-square percentage for the bad payload is below 0.01%, versus 37.92% for the urandom sample. From https://www.fourmilab.ch/random/ on the Chi-square Test: "We interpret the percentage as the degree to which the sequence tested is suspected of being non-random. If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect. Percentages between 90% and 95% and 5% and 10% indicate the sequence is “almost suspect”."
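
If you don't have ent handy, a rough sketch of three of its byte-level numbers (entropy, chi-square statistic, arithmetic mean) looks like the following. The function name is made up, this is my reading of what those figures mean rather than ent's actual code, and it does not compute the percentage, which needs the chi-square distribution with 255 degrees of freedom.

```python
import math
from collections import Counter


def byte_stats(data: bytes) -> dict[str, float]:
    """Entropy (bits per byte), chi-square statistic and mean of a byte string.

    Hypothetical helper; assumes byte mode with 256 equally likely values.
    """
    if not data:
        raise ValueError("need at least one byte")
    n = len(data)
    counts = Counter(data)  # byte value -> number of occurrences

    # Shannon entropy over the observed byte frequencies.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Chi-square against a uniform distribution: every one of the 256
    # values is expected n / 256 times (0.1875 for a 48-byte sample).
    expected = n / 256
    chi_square = sum(
        (counts.get(value, 0) - expected) ** 2 / expected for value in range(256)
    )

    mean = sum(data) / n  # 127.5 for perfectly uniform bytes
    return {"entropy": entropy, "chi_square": chi_square, "mean": mean}
```

Calling `byte_stats(open("tmp2", "rb").read())` on the 48-byte payload should give numbers comparable to the ent output above, assuming I have the definitions right.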

supriyo-biswas|1 year ago

Now to be fair, such an archive could have been created with a “store” level of compression that doesn’t actually perform any compression.

sn|1 year ago

My reading of the commit message is that they're claiming the "data" should look random.