The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc), as well as many filters.
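To make the compressor choice concrete, here is a rough sketch using only the three codecs from that list that ship in the Python standard library (zlib, bz2, lzma). It does not use Zarr itself, and the byte buffer is a synthetic stand-in for a dense matrix, so the sizes it prints say nothing about real genomics data.

```python
import bz2
import lzma
import zlib

# Synthetic stand-in for a dense matrix: 256 rows of 1024 slowly varying bytes.
# (Made-up data for illustration only.)
data = bytes((row + col // 64) % 256 for row in range(256) for col in range(1024))

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(data)
    # Every codec here is lossless, so the round trip must reproduce the input.
    assert decompress(packed) == data
    results[name] = len(packed)

for name, size in results.items():
    print(f"{name}: {len(data)} -> {size} bytes")
```

Which codec wins depends heavily on the data and on whether you care more about size or speed, so treat this as a way to run your own comparison rather than a benchmark.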
I would use Zarr mostly for dense matrices (I use it with microscope images). I also see it used for frequency/spatial observations in genomic imaging.
But I prefer Parquet for most direct analysis of sequence data, since it's the format best integrated with big data analytics. I care much less about total compressed size than I do about the ability to quickly decompress just the data I need (say, to ETL it into a featurization pipeline).
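The "decompress only what I need" point can be sketched with a toy columnar layout. This is not Parquet or any real library's API, just the underlying idea in standard-library Python: each column is compressed on its own, so reading one column never touches the others. The record fields (`read_id`, `sequence`, `quality`) and the helper names `write_columns` and `read_column` are hypothetical.

```python
import json
import zlib

# Hypothetical records; the field names are made up for illustration.
records = [
    {"read_id": f"r{i}", "sequence": "ACGT" * 25, "quality": 30 + i % 10}
    for i in range(1000)
]


def write_columns(rows):
    """Compress each column separately so columns can be read independently."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return {
        key: zlib.compress(json.dumps(values).encode("utf-8"))
        for key, values in columns.items()
    }


def read_column(store, name):
    """Decompress only the requested column; the other blobs stay untouched."""
    return json.loads(zlib.decompress(store[name]).decode("utf-8"))


store = write_columns(records)

# A featurization step that only needs quality scores skips the much larger
# sequence column entirely.
quality = read_column(store, "quality")
print(len(quality), quality[:5])
```

A row-oriented layout compressed as one blob would force you to decompress everything to get at a single field, which is the cost being avoided here.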
dekhn|3 years ago
eternalban|3 years ago