top | item 33099080


heuermh | 3 years ago

We presented using Parquet formats for bioinformatics 2012/13-ish at the Bioinformatics Open Source Conference (BOSC) and got laughed out of the place.

While using Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] are a good idea, especially with DuckDB, Apache Arrow, etc. supporting Parquet out of the box.

0 - https://github.com/bigdatagenomics/adam

1 - https://github.com/bigdatagenomics/bdg-formats


jltsiren|3 years ago

Maybe column-oriented formats like Parquet never became popular in bioinformatics because new file formats usually come from people developing tools for upstream tasks such as read mapping, variant calling, and genome assembly. They are the ones who work with new kinds of data first.

Those upstream tasks tend to be row-oriented. You often iterate over all rows, do something with them, and output new rows in another format. Alternatively, you read the entire input into in-memory data structures, do something, and later serialize the data structures. Using column-oriented formats for such tasks does not feel natural.
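To make the row vs. column distinction concrete, here is a rough sketch in plain Python. The record layout (read name, contig, position, mapping quality) and the function names are made up for illustration and do not come from any real tool or format. The first function is the "iterate over rows, output new rows" pattern, which needs every field of each record; the second is the kind of single-field query where a column layout helps.

```python
# Hypothetical toy records: (read_name, contig, position, mapping_quality).
rows = [
    ("read1", "chr1", 100, 60),
    ("read2", "chr1", 250, 30),
    ("read3", "chr2", 75, 0),
]


def to_columns(rows):
    """Transpose row tuples into a column-oriented dict of lists."""
    names, contigs, positions, mapqs = zip(*rows)
    return {
        "name": list(names),
        "contig": list(contigs),
        "position": list(positions),
        "mapq": list(mapqs),
    }


def filter_rows(rows, min_mapq):
    """Row-oriented task: stream over records and emit new records.

    Every field of each kept row ends up in the output, so a row layout
    hands the task exactly what it needs, one record at a time.
    """
    for name, contig, position, mapq in rows:
        if mapq >= min_mapq:
            yield f"{name}\t{contig}\t{position}\t{mapq}"


def mean_mapq(columns):
    """Column-oriented task: touch a single field and ignore the rest."""
    values = columns["mapq"]
    return sum(values) / len(values)


columns = to_columns(rows)
filtered = list(filter_rows(rows, min_mapq=30))
average = mean_mapq(columns)
```

With a columnar layout, `filter_rows` would have to reassemble each record from four separate lists before writing it out, which is roughly why the format feels unnatural for that kind of tool. `mean_mapq` is the opposite case: it never looks at names, contigs, or positions.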