top | item 44304987

rwaksmunski | 8 months ago

I use this crate to process hundreds of TB of Common Crawl data, and I appreciate the speedups.

viraptor|8 months ago

What's the reason for using bz2 here? Wouldn't it be faster to do a one-off conversion to zstd? As far as I know, zstd at its higher compression levels beats bzip2 on every metric.

rwaksmunski|8 months ago

Common Crawl delivers the data as bz2. Indeed, I store intermediate data as zstd on ZFS.

declan_roberts|8 months ago

That assumes you're processing the data more than once.

anon-3988|8 months ago

Is this data available as torrents?

malux85|8 months ago

Yeah, came here to say a 14% speedup in compression is pretty good!

aidenn0|8 months ago

bzip2 (particularly its parallel implementations) is already relatively competitive for compression. Decompression time is where it lags behind, because LZ77-based algorithms can be incredibly fast at decompression.
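
For anyone who wants to check the decompression gap on their own machine, here is a rough sketch using only the Python standard library. It is an illustration, not a rigorous benchmark: zlib (DEFLATE, which is LZ77-based) stands in for the LZ77 family because zstd is not assumed to be installed, and the sample payload and the `time_decompress` helper are made up for this example. Results will vary with the data and the hardware.

```python
import bz2
import time
import zlib


def time_decompress(decompress, blob, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical sample input: a few MB of repetitive text.
# Real crawl data compresses very differently, so treat this as a toy.
data = b"Common Crawl style payload with some repetition. " * 80_000

bz2_blob = bz2.compress(data, compresslevel=9)
zlib_blob = zlib.compress(data, 6)

# Both codecs must round-trip before the timings mean anything.
assert bz2.decompress(bz2_blob) == data
assert zlib.decompress(zlib_blob) == data

bz2_seconds = time_decompress(bz2.decompress, bz2_blob)
zlib_seconds = time_decompress(zlib.decompress, zlib_blob)

print(f"bz2  decompress: {bz2_seconds:.4f} s")
print(f"zlib decompress: {zlib_seconds:.4f} s")
```

Swapping in a real sample file for `data` gives a more meaningful comparison than the synthetic payload above.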