It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data [0]. It's a great example of the really simple transformations you can perform on data that can unlock significant compression improvements. OpenZL can perform that transformation internally (quite easily with SDDL!).[0] https://news.ycombinator.com/item?id=45223827
perching_aix|4 months ago
> Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. (...) Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together.
bede|4 months ago
Edit: Have you any specific advice for training a fasta compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)
Gethsemane|4 months ago
jltsiren|4 months ago
I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.
jayknight|4 months ago
felixhandte|4 months ago