> title
bases with optional newlines
> title
bases with optional newlines
...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.
It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
where "SS...EM", HL..VT", or "ED..AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders on a DEC VT100 terminal from the 70s nicely.
this is also the insight that the bwa developer had, to use the burrows-wheeler transform which is part of bzip2 due to it's compression properties being particularly good for genomic sequences.
This was a question that I thought was interesting enough to test ChatGPT with.
Surprisingly, it gave an answer along the lines of the parent comment.
However, it seems it didn't figure this out by itself but it says:
> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.
This is because Zstd's long-distance matcher looks for matching sequences of 64 bytes [0]. Because long matching sequences of the data will likely have the newlines inserted in different offsets in the run, this totally breaks Zstd's ability to find the long-distance match.
Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data.
That is fascinating. I wonder if you could layer a Levenshtein State Machine on the strings so you can apply n-edits to the text to get longer matches.
I absolutely adore ZSTD, it has worked so well for me compressing json metadata for a knowledge engine.
Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression reducing compatibility somewhat.
Interesting. Any idea why this can't be stored in the metadata of the compressed file?
It is stored in the metadata [1], but anything larger than 8 MiB is not guaranteed to be supported. So there has to be an out-of-band agreement between compressor and decompressor.
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
Basic text formats persist, because everyone supports them. Many tools have better file formats for internal purposes, but they are rarely flexible enough and robust enough for wider use. There are occasional proposals for better general purpose formats, but the people proposing them rarely agree which of the competing proposals should be adopted. And even if they manage to agree, they probably don't have the time and the money to make it actually happen.
Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
I've also noticed this. Zstandard doesn't see very common patterns
For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data), in the article it's a fixed value every 60 bytes
Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues
Bzip2 performed much better for me but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but also decompressing is an exercise in patience so I've moved to Zstandard at the standard thing to use
FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.
other file formats that rival fasta in stupidity include fastq pdb bed sam cram vcf. further reading [1]
> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"
It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."
When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.
And, despite its simplicity, it has worked for forty years.
> a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.
Yes I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.
When you know you're going to be compressing files of particular structure, it's often very beneficial to tweak compression algorithm parameters. In one case when dealing with CSV data, I was able to find a LZMA2 compression level, dictionary size and compression mode that yielded a massive speedup, uses 1/100th the memory and surprisingly even yields better compression ratios, probably from the smaller dictionary size. That's in comparison to the library's default settings.
Ha. it gets worse.
Search engines or blacklist processors often use gigantic url lists, which are stored as plain ASCII, which is then fed into a perfect hash generator, which accesses those url's unordered. I.e. they need to create a second ordering index to access the urllist. The perfect hashing guys are mathematicians and so they don't care because their definition of a mphf (minimal perfect hash function) is just a random ordering of unique indices, but they don't care to store the ordering also. So we have ASCII and no index.
BAM format is widely used but assemblies still tend to be generated and exchanged in FASTA text. BAM is quite a big spec and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards etc.)
Removing the wrapping newline from the FASTA/FASTQ convention also dramatically improves parsing perf when you don't have to do as much lookahead to find record ends.
Unfortunately, when you write a program that doesn't wrap output FASTAs, you have a bunch of people telling you off because SOME programs (cough bioperl cough) have hard limits on line length :)
> I speculated that this poor performance might be caused by the newline bytes (0x0A) punctuating every 60 characters of sequence, breaking the hashes used for long range pattern matching.
If the linefeeds were treated as semantic characters and not allowed to break the hash size, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.
To me the most interesting thing here isn't that you can compress something better by removing randomly-distributed semantically-meaningless information. It's why zstd --long does so much better than gzip when you do and the default does worse than gzip.
Why it does worse than gzip isn't something that I know. Why --long is so efficient is likely a result of evolution of all things :). A lot of things have common ancestors which means shared genetic patterns across species. --long allows zstd to see a 2gb window of data which means it's likely finding all those genetic similarities across species.
Endogenous retroviruses [1] are interesting bits of genetics that helps link together related species. A virus will inject a bit of it's genetics into the host which can effectively permanently scar the host's DNA and all their offspring's DNA.
What's current way to accessibly process my 23andme raw data ? It's been synthesized decade ago and SNPedia and Promethease seems abandoned, so what's alternative if there is, and if there is none how we arrived to this?
I was no longer on the scene when it happened, but I’ve been told it became very difficult to get ongoing funding from the National Science Foundation for bioinformatics software around 10 years ago. You could get an initial grant to develop something, but ongoing support was difficult. So websites and ‘databases’ (curated datasets) that made it easy to run the tools faded away.
That blog post isn't about DNA per se, but it is about sorting data when you know there are only 4 numbers. I guess DNA has 5 - A,T,G,C,N the unknown base - but there's a huge space of DNA-specific compression research that outperforms ZSTD.
This might in general be a good preprocessing step to check for punctuation repeating in fixed intervals and remove it, and restore after decompression.
That turns in into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.
I've explored alternatives to FASTA and FASTQ but in most cases I found that simply not storing sequence data is the best option of all, but if I have to do it, columnar formats with compression are usually the best alternative when considering all of (my) the constraints.
As someone with an idle interest in data compression, ss it possible to download the original dataset somewhere to play around with? Or rather a like 20gb subset of it.
The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.
Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, by just representing a GATC with 2 bits, rather than 8.
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.
jefftk|5 months ago
It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
LeifCarrotson|5 months ago
Or, for an even simpler example:
becomes, on disk, something like which is hard to compress, while is just and then, if you want, you can reflow the text when it's time to render to the screen.bede|5 months ago
AndrewOMartin|5 months ago
mylons|5 months ago
unknown|5 months ago
[deleted]
amelius|5 months ago
Surprisingly, it gave an answer along the lines of the parent comment.
However, it seems it didn't figure this out by itself but it says:
> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.
felixhandte|5 months ago
Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data.
[0] https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs...
(I am one of the maintainers of Zstd.)
nerpderp82|5 months ago
I absolutely adore ZSTD, it has worked so well for me compressing json metadata for a knowledge engine.
mfld|5 months ago
lifthrasiir|5 months ago
[1] https://datatracker.ietf.org/doc/html/rfc8878#name-window-de...
nolist_policy|5 months ago
ashvardanian|5 months ago
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
jltsiren|5 months ago
bede|5 months ago
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
Aachen|5 months ago
For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data), in the article it's a fixed value every 60 bytes
Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues
Bzip2 performed much better for me but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but also decompressing is an exercise in patience so I've moved to Zstandard at the standard thing to use
pajko|5 months ago
semiinfinitely|5 months ago
boothby|5 months ago
semiinfinitely|5 months ago
> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"
1. https://madhadron.com/science/farewell_to_bioinformatics.htm...
fwip|5 months ago
When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.
And, despite its simplicity, it has worked for forty years.
totalperspectiv|5 months ago
This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.
Fraterkes|5 months ago
StillBored|5 months ago
On those grounds, the lack of pre-tokenization in html/css/js ranks at this point as a planet killing level of poor choices.
unknown|5 months ago
[deleted]
leobuskin|5 months ago
bede|5 months ago
keketi|5 months ago
ciupicri|5 months ago
IshKebab|5 months ago
rurban|5 months ago
unknown|5 months ago
[deleted]
bede|5 months ago
e.g. https://github.com/ArcInstitute/binseq
hhh|5 months ago
amelius|5 months ago
totalperspectiv|5 months ago
Gethsemane|5 months ago
bede|5 months ago
lutusp|5 months ago
If the linefeeds were treated as semantic characters and not allowed to break the hash size, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.
pkilgore|5 months ago
What lessons can we take from this?
cogman10|5 months ago
Endogenous retroviruses [1] are interesting bits of genetics that helps link together related species. A virus will inject a bit of it's genetics into the host which can effectively permanently scar the host's DNA and all their offspring's DNA.
[1] https://en.wikipedia.org/wiki/Endogenous_retrovirus
diimdeep|5 months ago
iijj|5 months ago
vintermann|5 months ago
a_bonobo|5 months ago
I thought I'd raise yesterday's HN discussion on 'The unreasonable effectiveness of modern sort algorithms' https://news.ycombinator.com/item?id=45208828
That blog post isn't about DNA per se, but it is about sorting data when you know there are only 4 numbers. I guess DNA has 5 - A,T,G,C,N the unknown base - but there's a huge space of DNA-specific compression research that outperforms ZSTD.
rini17|5 months ago
vintermann|5 months ago
bede|5 months ago
FL33TW00D|5 months ago
jefftk|5 months ago
(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
dekhn|5 months ago
im3w1l|5 months ago
fwip|5 months ago
meel-hd|5 months ago
nickdothutton|5 months ago
Kim_Bruning|5 months ago
dwattttt|5 months ago
Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, by just representing a GATC with 2 bits, rather than 8.
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
vintermann|5 months ago