top | item 34314319

(no title)

rabernat | 3 years ago

The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc, as well as many filters.)

discuss

dekhn|3 years ago

I would zarr for dense matrices, mostly (I use them with microscope images). I see it also used for frequency/spatial observations in genomic imaging. But I prefer parquet for most direct analysis of sequence, since it's the format best integrated with big data analytics. I care much less about total compression size than I do the ability to decompress the data I need quickly (say, to ETL it to a featurization pipeline).

eternalban|3 years ago

I grep'd for parquet and yours is the only comment that mentions it. parquet -> arrow -> AI