top | item 28627360

(no title)

You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8 bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4 bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2 bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library—Python, Nim, or otherwise—that made using the most efficient encoding for a sequence transparent to the developer.

[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...

[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation

[3] https://bioinformatics.stackexchange.com/questions/225/upper...

discuss

liamwestray|4 years ago

Yeah, this is why my comment led with “I trust the author”…

I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.

benjamin-lee|4 years ago

This is because there's four bases and each can either be included or excluded from a given combination. So there are 4*2 = 16 combinations each of which with their own letter. In all honesty, these are pretty rarely used in practice these days except for N (any base) although they do sometimes show up when representing consensus sequences.