Interested outsider here; I work with a lot of HCLS research customers but don't have a biology-related background. Can you explain the problems with the Nanopore sequencer accuracy in more detail? Basically, I was wondering if I could get one for myself and sequence my own genome, then use the data to learn about life-sciences computing techniques. If I were to buy one of the USB-attachable devices and run it, is the data simply not viable for use in a genomics pipeline, or is it just that the results would be questionable? Also, if accuracy is an issue, what about just running the same sample N times and doing some error correction?
ampdepolymerase|4 years ago
https://genomebiology.biomedcentral.com/articles/10.1186/s13...
I guess there are limits to ensemble methods if the underlying accuracy doesn't increase. I don't work on gene sequencing algorithms but from what I understand of ML ensemble techniques, there are certain assumptions regarding the underlying independence of the errors. The errors for nanopore should be uniform but I am not sure. Any molecular biologist here care to comment?
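To make the independence assumption concrete, here is a rough sketch of my own (a toy model, not something taken from the linked paper). It assumes each call is simply right or wrong with probability p, that the calls are fully independent, and that we take a majority vote over an odd number of calls. The function name `majority_vote_error` is just something I made up for illustration.

```python
from math import comb

def majority_vote_error(p, n):
    """Probability that more than half of n independent calls are wrong,
    assuming each call is wrong with probability p (binary toy model)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# With independent errors, the voted error rate shrinks as n grows.
for n in (1, 3, 5, 9):
    print(n, majority_vote_error(0.1, n))
```

The catch is the independence assumption. If every call makes the same mistake at the same position (a systematic error), voting does not help at all, and the error rate stays at p no matter how many calls you add.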
biophysboy|4 years ago
nyolfen|4 years ago
snystrom|4 years ago
There are two components that drive sequencing error rate. 1) The chemistry behind the sequencing (for nanopore sequencing this is the "feeding DNA through a pore" bit) 2) the method to convert raw signal into DNA sequence (this is called "base calling").
The gold standard in terms of error profile for sequencing is currently the Illumina short read platform. Illumina machines are really just microscopes (TIRF scopes for optics folks) that sequence DNA by visualizing incorporation of dye-labeled nucleotides into the sequenced molecule(s) (imagine a really slow PCR [1]). Each base is labeled with a different color, so when a molecule has a match it makes a colored spot on the slide that the machine can read (see here for more info & details of newer chemistry that uses fewer colors [2]). This whole process is mediated by DNA polymerase, which itself has a very low error rate. Another important point is that DNA sequenced on the Illumina platform (called a "library") tends to be from "amplified" template DNA, meaning the DNA will have been processed and will potentially be missing chemical modifications on the bases that could be present in the organism. This works to Illumina's advantage, because when trying to answer the question of "what is the DNA sequence?" we want the ground-truth DNA, not the modification state.
In contrast, Nanopore sequencing works by feeding a long strand of DNA through a pore and measuring the change in electrical current through the pore (watch the cool video [3]). For the current set of nanopore flowcells, 8 bases of DNA sit in the pore at a time, meaning the current at each timestep reflects all 8 nucleotides in aggregate. This also means that the pore "sees" each base 8 times, but always in the context of an additional 7. Basecalling from the raw signal is therefore not as easy as saying "blue = A"; instead, you have to deconvolve each base from a complex signal. As you might imagine, the folks at Oxford Nanopore & the broader research community have turned to machine learning-based base callers to solve this problem, and they work quite well [4]. But they are not perfect. Deconvolving runs of the same base (e.g. "AAAAAAA") is difficult because without well-defined signal changes between bases, the caller has a hard time deciding how many bases it has seen, so a common error mode for nanopore sequencing is to create insertions/deletions at places in the genome with low nucleotide diversity. Another interesting source of error is that most Nanopore library preps are performed on unamplified DNA, and so in addition to normal A/T/G/C nucleotides, the template DNA can also contain bases with chemical modifications. For example, in bacteria, A's are often methylated, and in humans, C can have all kinds of different modifications (5-methyl-cytosine, 5-hydroxymethyl-cytosine, etc.), and each different modification affects the signal in the nanopore. Therefore, basecallers that weren't trained on modified bases will produce basecalling errors in the presence of base modifications.
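If it helps, here is a toy sketch of why homopolymers are hard. It assumes, as a big simplification, that the current level depends only on which 8 bases are in the pore, so the signal only "steps" when the 8-mer changes. The function names are made up for illustration, and real signal is noisy and also carries dwell-time information that this ignores.

```python
def kmer_windows(seq, k=8):
    """All k-base windows the pore would hold as the strand moves one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_signal_changes(seq, k=8):
    """Number of steps where the k-mer in the pore (and so the toy current level) changes."""
    windows = kmer_windows(seq, k)
    return sum(1 for a, b in zip(windows, windows[1:]) if a != b)

# A diverse sequence produces a level change at every step.
print(count_signal_changes("ACGTACGTACGTACGT"))
# Homopolymers of different lengths produce no level changes at all,
# so in this toy model a run of 16 A's looks just like a run of 20.
print(count_signal_changes("A" * 16), count_signal_changes("A" * 20))
```

In this simplified picture the only clue to the length of the run is how long the flat stretch lasts, which is why the caller tends to add or drop bases there.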
Both Illumina and Nanopore basecallers assign a quality score to each base that indicates the probability that the basecaller produced an incorrect call. This is called a Q-score (or Phred score), defined as "Q = -10 * log10(P)", where P is the estimated probability that the base call is wrong (i.e. Q / 10 = the order of magnitude of the error probability) [5]. For example, a Q-score of 10 means an error rate of 1 in 10, while a Q-score of 50 means an error rate of 1 in 100,000. For Illumina sequencing, >95% of the reads have a Q-score > 30 (i.e. fewer than 1 error in 1,000 bases), while Nanopore reads tend to have lower average Q-scores (~Q20, i.e. 1 error in 100 bases). For genetics, where a single base difference can separate a severe disease allele from a normal variant, 1 in 100 won't cut it.
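Since you asked about the computing side, here is a minimal sketch of the Q-score arithmetic in Python. The function names are my own, and the FASTQ decoding assumes the common Phred+33 encoding (quality character = chr(Q + 33)), so check what your files actually use.

```python
import math

def phred_to_error_prob(q):
    """Convert a Q-score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Q-score."""
    return -10 * math.log10(p)

def decode_fastq_quality(quality_string, offset=33):
    """Decode a FASTQ quality line into per-base Q-scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]

for q in (10, 20, 30, 50):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q)):,} bases")
```

For example, `decode_fastq_quality("I5+")` gives one Q-score per character, which you can then feed to `phred_to_error_prob` to estimate the expected number of errors in a read.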
The current-gen Nanopore flowcell chemistry (R9.4.1) is what most people mean when they talk about Nanopore error rates. Oxford Nanopore has just released a new pore type & made some basecaller upgrades that improve the accuracy to what they call "Q20+", with some claims of Q>30. From the data I've seen, it's impressive; I just haven't got my hands on one yet to see for myself [6]. I think the comment saying "wait 5 years" is an overestimate, but if you want to genotype yourself today, I'd just pay someone for Illumina sequencing and process the fastq files yourself if you really want to do it as a learning exercise.
I've unintentionally written an essay, so I'll stop here, but real quick to your other point RE: rerunning the sample N times & using the repeats for error correction. This won't work the way you're thinking, because a "sample" is actually a collection of DNA molecules that are sampled randomly by the sequencer. You have no way of knowing whether two reads from different runs came from the same molecule, so you can't error correct this way. Interestingly, a totally different sequencing platform from Pacific Biosciences does use this strategy by doing some really cool chemistry that lets it read the same molecule repeatedly, but I'll spare you the second essay (google "PacBio HiFi" or "circular consensus reads" if you're interested).
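For the curious, here is a toy simulation of the consensus idea when you do know that every pass came from the same molecule. It is a sketch under strong assumptions: errors are independent substitutions only (no insertions or deletions, so no alignment is needed), and the 10% error rate, 9 passes, and function names are all made up for illustration. It is not how PacBio's actual software works.

```python
import random
from collections import Counter

BASES = "ACGT"

def add_substitution_errors(seq, error_rate, rng):
    """Return a noisy copy of seq where each base is independently swapped
    for a different base with probability error_rate."""
    noisy = []
    for base in seq:
        if rng.random() < error_rate:
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

def majority_consensus(reads):
    """Per-position majority vote across equal-length reads of the same molecule."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)  # fixed seed so the run is repeatable
truth = "".join(rng.choice(BASES) for _ in range(2000))
passes = [add_substitution_errors(truth, 0.10, rng) for _ in range(9)]
consensus = majority_consensus(passes)

print("errors in a single pass:", mismatches(truth, passes[0]))
print("errors in 9-pass consensus:", mismatches(truth, consensus))
```

The vote only works here because every pass covers the same positions of the same molecule and the errors are independent. With reads from unknown molecules, or with systematic errors like the homopolymer problem, the columns would not line up this neatly.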
[1] https://en.wikipedia.org/wiki/Polymerase_chain_reaction
[2] https://www.ecseq.com/support/ngs/do-you-have-two-colors-or-...
[3] https://www.youtube.com/watch?v=RcP85JHLmnI
[4] This paper is a tad out of date, but Ryan Wick always writes extremely clear papers: https://genomebiology.biomedcentral.com/articles/10.1186/s13...
[5] https://www.illumina.com/documents/products/technotes/techno...
[6] https://nanoporetech.com/about-us/news/oxford-nanopore-tech-...
Edit: reformatted links for clarity.
jcims|4 years ago
AstroDogCatcher|4 years ago