heuermh | 3 years ago
While using Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] are a good idea, especially now that DuckDB, Apache Arrow, etc. support Parquet out of the box.
jltsiren | 3 years ago
Those upstream tasks tend to be row-oriented. You often iterate over all rows, do something with them, and output new rows in another format. Alternatively, you read the entire input into in-memory data structures, do something, and later serialize the data structures. Using column-oriented formats for such tasks does not feel natural.
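To make the contrast concrete, here is a minimal sketch in plain Python. The record layout (name, chrom, pos, mapq), the sample data, and all function names are made up for illustration and do not come from any real tool or file format. It runs the same filter both ways: streaming over rows, and over a column-per-field layout where each output row has to be reassembled from every column.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record: (read_name, chrom, pos, mapq). Invented for illustration.
Row = Tuple[str, str, int, int]

ROWS: List[Row] = [
    ("r1", "chr1", 100, 60),
    ("r2", "chr1", 250, 10),
    ("r3", "chr2", 42, 30),
]

def filter_rows(rows: Iterable[Row], min_mapq: int) -> Iterator[str]:
    """Row-oriented streaming: read one record, decide, emit one new line."""
    for name, chrom, pos, mapq in rows:
        if mapq >= min_mapq:
            yield f"{chrom}\t{pos}\t{name}"

def to_columns(rows: Iterable[Row]) -> Dict[str, list]:
    """Column-oriented layout: one list per field instead of one tuple per record."""
    cols: Dict[str, list] = {"name": [], "chrom": [], "pos": [], "mapq": []}
    for name, chrom, pos, mapq in rows:
        cols["name"].append(name)
        cols["chrom"].append(chrom)
        cols["pos"].append(pos)
        cols["mapq"].append(mapq)
    return cols

def filter_columns(cols: Dict[str, list], min_mapq: int) -> List[str]:
    """Same filter on columns: each output row is rebuilt from all columns."""
    keep = [i for i, q in enumerate(cols["mapq"]) if q >= min_mapq]
    return [f"{cols['chrom'][i]}\t{cols['pos'][i]}\t{cols['name'][i]}" for i in keep]

def mean_mapq(cols: Dict[str, list]) -> float:
    """An aggregate over a single column, which only needs that one list."""
    return sum(cols["mapq"]) / len(cols["mapq"])
```

Both filters produce the same lines. The row version never holds more than one record at a time, while the column version only pays off for something like `mean_mapq`, which touches a single field.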