
atiedebee | 1 month ago

I thought the same, so I ran brotli and zstd on some PDFs I had lying around.

  brotli 1.0.7 args: -q 11 -w 24
  zstd v1.5.0  args: --ultra -22 --long=31 
                 | Original | zstd    | brotli
  RandomBook.pdf | 15M      | 4.6M    | 4.5M
  Invoice.pdf    | 19.3K    | 16.3K   | 16.1K
I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.

Brotli seemed to have a very slight edge over zstd, even on the larger PDF, which I did not expect.
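Roughly, the runs boil down to something like this (the -k/-f/-o flags and the ls step are illustrative, not necessarily the exact invocations):

  # sketch of the comparison above; keep/force/output flags are my additions
  brotli -q 11 -w 24 -k -f RandomBook.pdf                        # writes RandomBook.pdf.br
  zstd --ultra -22 --long=31 -k -f RandomBook.pdf -o RandomBook.pdf.zst
  ls -l RandomBook.pdf RandomBook.pdf.br RandomBook.pdf.zst      # compare the sizes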

mort96|1 month ago

EDIT: Something weird is going on here. When compressing with zstd in parallel it produces the garbage results seen here, but when compressing on a single core it produces results competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158

I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044

Results by compression type across 55 PDFs:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------+------+-----+------+--------+
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+
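A minimal sketch of how such a batch comparison can be run (the compression levels and the byte-count step here are assumptions, not the exact script):

    # sketch only: levels are illustrative; compressed output goes next to the originals
    for f in *.pdf; do
      zstd   -19 -q -c "$f" > "$f.zst"
      xz     -9     -c "$f" > "$f.xz"
      gzip   -9     -c "$f" > "$f.gz"
      brotli -q 11  -c "$f" > "$f.br"
    done
    for ext in pdf zst xz gz br; do
      printf '%-4s ' "$ext"; cat *."$ext" | wc -c    # total bytes per codec
    done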

mort96|1 month ago

Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them, which reports the size on disk, and that is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.

Here's a table with the correct sizes, reported by 'du -A':

    +---------+---------+--------+--------+--------+
    |  none   |  zstd   |   xz   |  gzip  | brotli |
    +---------+---------+--------+--------+--------+
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+
These numbers are much more impressive. Still, Brotli has a slight edge.
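For reference, the two views side by side with BSD-style du, as on macOS (file name illustrative):

    du -h  file.pdf.zst    # size on disk (allocated blocks)
    du -Ah file.pdf.zst    # apparent size -- the number that matters here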

mrspuratic|1 month ago

> I couldn't quickly find a way to decompress them

    pdftk in.pdf output out.pdf uncompress

Thoreandan|1 month ago

Does your source .pdf material have FlateDecode'd chunks or did you fully uncompress it?

atiedebee|1 month ago

I wasn't sure. I just went in with the (probably faulty) assumption that if a file compressed to less than 90% of its original size, it had enough "non-randomness" to make the compression comparison meaningful.
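A rough way to check is to grep the file for the filter name, since Flate-compressed streams are declared as /FlateDecode in the object dictionaries (file name illustrative, and it's only a heuristic):

  grep -a -c '/FlateDecode' SomeFile.pdf   # lines mentioning the Flate filter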

atiedebee|1 month ago

Ran the tests again with some more files, this time decompressing the PDFs in advance (roughly as sketched below the table). I picked some widely available PDFs to make the experiment reproducible.

  file            | raw (bytes) | zstd       (%)     | brotli     (%)     |
  gawk.pdf        | 8,068,092   | 1,437,529  (17.8%) | 1,376,106  (17.1%) |
  shannon.pdf     | 335,009     | 68,739     (20.5%) | 65,978     (19.6%) |
  attention.pdf   | 24,742,418  | 367,367    (1.4%)  | 362,578    (1.4%)  |
  learnopengl.pdf | 253,041,425 | 37,756,229 (14.9%) | 35,223,532 (13.9%) |
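Each row boils down to something like this (a sketch; the exact invocations and intermediate file names may have differed):

  pdftk gawk.pdf output gawk-raw.pdf uncompress       # strip the PDF's internal Flate streams
  zstd --ultra -22 --long=31 -q -c gawk-raw.pdf > gawk-raw.pdf.zst
  brotli -q 11 -w 24 -c gawk-raw.pdf > gawk-raw.pdf.br
  wc -c gawk-raw.pdf gawk-raw.pdf.zst gawk-raw.pdf.br # raw vs compressed bytes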
For learnopengl.pdf I also tested the decompression performance, since it is such a large file, and got the following (less surprising) results using 'perf stat -r 5':

  zstd:   0.4532 +- 0.0216 seconds time elapsed  ( +-  4.77% )
  brotli: 0.7641 +- 0.0242 seconds time elapsed  ( +-  3.17% )
The conclusion seems to be consistent with what brotli's authors have said: brotli achieves slightly better compression, at the cost of decompressing at a little over half of zstd's speed.
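For reference, the timing runs were along these lines (the decompression flags are my best guess; zstd needs --long=31 to accept the large window used during compression):

  perf stat -r 5 -- zstd -d --long=31 -f -o /dev/null learnopengl.pdf.zst
  perf stat -r 5 -- brotli -d -c learnopengl.pdf.br > /dev/null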

order-matters|1 month ago

What's the underlying assumption we can point to as the reason for the counter-intuitive result?

That the data in PDF files is noisy, and that zstd should perform better on noisy files?

jeffbee|1 month ago

What's counter-intuitive about this outcome?