mriet | 6 months ago

I can understand this for "small" data, say less than 10 MB.

In bioinformatics, basically all of the file formats are human-readable/text-based. And file sizes range from 1-2 MB to 1 TB. I regularly encounter 300-600 GB files.

In this context, human-readable files are ridiculously inefficient, on every axis you can think of (space, parsing, searching, processing, etc.). It's a GD crime against efficiency.

And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.

graemep|6 months ago

I do not think the argument is that ALL data should be in human-readable form, but I think there are far more cases of data being in a binary form when it would be better human readable. Your example - data that is human readable when it should be binary - is the rarer case for most of us.

In some cases human readable data is for interchange and it should be processed and queried in other forms - e.g. CSV files to move data between databases.

An awful lot of data is small - and these days I think you can say small is quite a bit bigger than 10 MB.

Quite a lot of data that is extracted from a large system would be small at that point, and would benefit from being human readable.

The benefit of data being human readable is not necessarily that you will read it all, but that it is easier to read bits that matter when you are debugging.

attractivechaos|6 months ago

> human-readable files are ridiculously inefficient on every axis you can think of (space, parsing, searching, processing, etc.).

In bioinformatics, most large text files are gzip'd. Decompression is a few times slower than a proper file parser in C/C++/Rust. Some pure Python parsers can be "ridiculously inefficient", but that is not the fault of human readability. Binary files are compressed with the same existing libraries, and compressed binary files are not noticeably faster to parse than compressed text files. Binary formats can indeed be smaller, but space-efficient formats take years to develop and tend to have more compatibility issues. You can't skip the text format phase.
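
A minimal sketch of the point above: Python's standard `gzip` module decompresses on the fly, so a gzipped text file can be streamed and parsed line by line without ever materializing the uncompressed data. The file name and tab-separated column layout here are made up for illustration, not a real bioinformatics format.

```python
import gzip
import os
import tempfile

def stream_records(path):
    """Yield tab-separated fields from a gzip-compressed text file."""
    with gzip.open(path, "rt") as fh:   # "rt": decompress to text on the fly
        for line in fh:
            if line.startswith("#"):    # skip comment/header lines
                continue
            yield line.rstrip("\n").split("\t")

# Tiny demo file (illustrative layout, not a real dataset)
tmp = tempfile.NamedTemporaryFile(suffix=".tsv.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt") as fh:
    fh.write("#chrom\tpos\tref\talt\n")
    fh.write("chr1\t12345\tA\tG\n")
    fh.write("chr2\t67890\tC\tT\n")

records = list(stream_records(tmp.name))
os.unlink(tmp.name)
print(records)
```

The same streaming loop works identically whether the payload is text or binary, which is why compression largely levels the playing field on size.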

> And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.

You can't read the whole file by eye, but you can (and should often) eyeball small sections in a huge file. For that, you need a human-readable file format. A problem with this field IMHO is that not many people are literally looking at the data by eye.

kaathewise|6 months ago

One of the problems is that a lot of bioinformatics formats nowadays have to hold so much data that most text editors stop working properly. For example, FASTA splits DNA data into lines of 50-80 characters for readability. But in FASTQ, where the '>' and '+' characters collide with the quality scores, as far as I know, DNA and the quality data are always put on one line each. Trying to find a location in a 10k-character line gets very awkward. And I'm sure some people can eyeball Phred scores from ASCII, but I think they are a minority, even among researchers.
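
To make the Phred point concrete, here is a toy parser assuming the common four-line-per-record FASTQ layout and the standard Phred+33 (Sanger) encoding, where each quality score is the character's ASCII code minus 33. The record contents are invented for the demo.

```python
def parse_fastq(lines):
    """Parse FASTQ records from an iterable of lines, assuming the
    usual 4-line layout: @id, sequence, '+' separator, quality."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)                              # '+' separator line
        qual = next(it).rstrip("\n")
        scores = [ord(c) - 33 for c in qual]  # Phred+33 decoding
        yield header[1:].rstrip("\n"), seq, scores

demo = [
    "@read1\n",
    "ACGT\n",
    "+\n",
    "II#5\n",   # 'I' -> 40, '#' -> 2, '5' -> 20
]
records = list(parse_fastq(demo))
print(records)  # [('read1', 'ACGT', [40, 40, 2, 20])]
```

A machine decodes this trivially; a human staring at `II#5` has to do the same arithmetic in their head, which is the "eyeballing Phred scores" problem.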

Similarly, NEXUS files are also human-readable, but it'd be tough to discern the shape of an inlined 200-node Newick tree.

When I asked people who did actual bioinformatics (well, genomics) what their annoyances were when working with bioinf software, having to do a bunch of busywork on files in between pipeline steps (compressing/uncompressing, indexing) was one of the complaints mentioned.

I think there's a place in bioinformatics for a unified binary format which can take care of compression, indexing, and metadata. But with that list of requirements it'd have to be binary. Data analysis moved from CSVs and Excel files to Parquet, and I think there's a similar transition waiting to happen here.

mcdeltat|6 months ago

Another thing is human readable is typically synonymous with unindexed, which becomes a problem when you have large files and care about performance. In bioinformatics we often distribute sidecar index files with the actual data, which is janky and inefficient. Why not have a decent format to begin with?

Further, when the file is unindexed it's even harder to read it as a human because you can't easily skip to a particular section. I have this trouble often where my code can efficiently access the data once it's loaded, but a human-eye check is tedious/impossible because you have to scroll through gigabytes to find what you want.
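
The sidecar-index idea reduces to something very simple: record the byte offset of each record once, then seek straight to it later (the same principle as samtools' `.fai`/`.tbi` sidecar files, here shrunk to a toy that indexes lines). File name and contents are illustrative.

```python
import os
import tempfile

def build_line_index(path):
    """Record the byte offset of each line -- a toy sidecar index."""
    offsets = []
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            if not fh.readline():
                break
            offsets.append(pos)
    return offsets

def read_line(path, offsets, n):
    """Jump directly to line n via the index; no scanning required."""
    with open(path, "rb") as fh:
        fh.seek(offsets[n])
        return fh.readline().decode().rstrip("\n")

# Demo: a 5-line file, indexed once, then random-accessed.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
for i in range(5):
    tmp.write(f"record{i}\tdata{i}\n")
tmp.close()

idx = build_line_index(tmp.name)
third = read_line(tmp.name, idx, 2)
os.unlink(tmp.name)
print(third)  # record2	data2
```

The jank the comment describes is that this index lives in a separate file that can go stale or get lost, rather than inside the format itself.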

attractivechaos|6 months ago

> Another thing is human readable is typically synonymous with unindexed

Indexing is not directly related to binary vs text. Many text formats in bioinformatics are indexed and many binary formats are not when they are not designed with indexing in mind.

> a human-eye check is tedious/impossible because you have to scroll through gigabytes to find what you want.

Yes, indexing is better, but even without an index you can use command-line tools to extract the portion you want to look at and then pipe it to "more" or "less".
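
For completeness, the extract-a-window trick can also be done in a few lines of Python; this is a rough stand-in for `zcat file.gz | sed -n '11,13p' | less`, assuming a gzipped text file (the file here is generated for the demo).

```python
import gzip
import os
import tempfile
from itertools import islice

def peek(path, start, stop):
    """Return lines [start, stop) of a gzip-compressed text file."""
    with gzip.open(path, "rt") as fh:
        return [line.rstrip("\n") for line in islice(fh, start, stop)]

# Illustrative 100-line gzipped file.
tmp = tempfile.NamedTemporaryFile(suffix=".txt.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt") as fh:
    for i in range(100):
        fh.write(f"line{i}\n")

window = peek(tmp.name, 10, 13)
os.unlink(tmp.name)
print(window)  # ['line10', 'line11', 'line12']
```

Note this still decompresses everything up to the window, which is exactly why a real index beats it on multi-hundred-gigabyte files.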