top | item 33092077

iafiaf|3 years ago

The "bioinformatics formats" might be terrible, but they work. In fact, they are meant to be Excel-readable, which keeps my collaborators (and me) happy. Coming from a CS/programming background, it is natural to feel the urge to "fix" the formats (<insert relevant XKCD>), until you realise that there are libraries that easily handle serialization/parsing.

Besides, "bioinformatic formats" is a meaningless term anyway. FASTQ, VCF, BCL, AIRR-seq -- all different, and it all just works.


jiggawatts|3 years ago

"Just works" and "I've been waiting 15 minutes now for this file to un-gzip" aren't compatible in my book, especially on a computer that should be able to process that file in seconds.

Also, I'd love to see someone open a 75 GB FASTQ file in Excel.

Ultimatt|3 years ago

VCF at least typically uses bgzip, which is essentially gzipped sections concatenated together, but parallel-unzippable with random access; CRAM is parallelisable in the same way. Maybe you just don't know the formats and tooling so well? I'm not sure anyone opens a FASTQ directly for viewing anymore, but they will want pileups from a BAM. The problem with bio formats isn't that they're text, it's that they are shit text formats too.
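The property bgzip relies on can be sketched in a few lines of Python. This is not real BGZF (which adds an extra gzip header field recording each block's compressed size so tools can index and seek), just a hedged demonstration that concatenated gzip members form one valid gzip stream while each member stays independently decompressible -- the file names and record contents are made up:

```python
import gzip

# BGZF stores data as many small, independently compressed gzip members
# concatenated into one file. Illustrate the underlying gzip property:
block1 = gzip.compress(b"chr1\t100\t.\tA\tT\n")
block2 = gzip.compress(b"chr1\t200\t.\tG\tC\n")
stream = block1 + block2

# A standard gzip decoder reads transparently across member boundaries,
# so the whole file still looks like one plain .gz stream:
assert gzip.decompress(stream) == b"chr1\t100\t.\tA\tT\nchr1\t200\t.\tG\tC\n"

# ...but any single member can also be decompressed on its own, which is
# what makes random access and parallel decompression possible:
assert gzip.decompress(block2) == b"chr1\t200\t.\tG\tC\n"
```

Tools like tabix exploit exactly this: an index maps genomic coordinates to block offsets, so a query touches only the blocks it needs.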

zmmmmm|3 years ago

Nearly all bioinfo tools operate in streaming mode, which means line-based gzipped formats work great: you can overlap processing with reading the file. Nobody ever unzips the whole file before starting to process it.
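The streaming pattern looks like this in practice -- a minimal sketch, assuming a VCF-style file where header lines start with "#" (the function name and file layout are illustrative, not any particular tool's API):

```python
import gzip

def count_data_lines(path):
    """Stream a gzipped line-based file without decompressing it whole.

    gzip.open in text mode yields decompressed lines on demand, so
    memory stays constant and processing overlaps with reading.
    """
    n = 0
    with gzip.open(path, "rt") as fh:
        for line in fh:                    # one decompressed line at a time
            if not line.startswith("#"):   # skip VCF-style header lines
                n += 1
    return n
```

The same loop shape works for any line-oriented format, which is exactly why gzipped text has survived so long in these pipelines.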

iafiaf|3 years ago

FASTQ is not for Excel, obviously -- although you can still explore it in the shell. Nonetheless, operating directly on FASTA/FASTQ files is often a one-time preprocessing task: you serialize the preprocessed data and continue on from there.

FASTA (and its various incantations) are not going anywhere anytime soon.
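For that kind of one-time preprocessing, the format's simplicity is the point: a FASTQ record is just four lines (@id, sequence, "+", quality). A minimal reader sketch, assuming well-formed input -- for real pipelines you would reach for an established parser rather than this:

```python
import gzip
from itertools import islice

def read_fastq(path):
    """Yield (id, sequence, quality) tuples from a FASTQ file.

    Assumes well-formed records of exactly four lines each; handles
    both plain and gzipped files. A sketch for one-off preprocessing,
    not a robust parser.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            record = [line.rstrip("\n") for line in islice(fh, 4)]
            if not record:
                break
            header, seq, plus, qual = record
            yield header[1:], seq, qual    # drop the leading '@' from the id
```

Note a real parser has to be more careful: quality lines may themselves begin with "@", which is why record boundaries are positional rather than marker-based.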

rrwo|3 years ago

Excel-readable is a bad thing. I wonder how much data is ignored or lost because Excel misinterpreted the input as something else.

(I actually know biologists who have run into this problem.)

mnw21cam|3 years ago

https://pubmed.ncbi.nlm.nih.gov/27552985/ estimates that about one-fifth of papers with supplementary Excel lists of genes contain mangled gene names. I remember talking about this problem back in 2003. The HGNC has been quietly going around changing the names of some of these genes to try to stop this from being a problem.

kortex|3 years ago

#927 is a fun quip, but as an actual critique of engineering practices, I'd much rather see folks attempt new standards and innovate when they feel they have an idea that could work better, rather than be discouraged by the perennial "great, now we have N+1 standards".

Every standard nowadays, aside from the very first one, is an N+1. Heck, even IFF and ASN.1, the absolute old-timers of file/serialization formats, are improvements on "just mmap to disk" application formats.

cycomanic|3 years ago

> The "bioinformatics formats" might be terrible, but they work. In fact, they are meant to be Excel-readable which keeps my collaborators (and me) happy. Coming from a CS /programming background, it is natural to feel the urge to "fix" the formats (<insert relevant XKCD>), until you realise that there are a libraries that easily handle serialization/parsing.

And this is the crux of the issue: people still think Excel processing is acceptable practice in 2022. If you are required to publish your analysis code (and if you are not yet, the writing is on the wall), are you just going to publish the Excel sheets?