Identifying novel sequence variants is a crucial first step toward understanding the genetic basis of many diseases. However, current methods for variant calling, while very good in general, miss the variants in about 10% of the human genome. This 10% of the genome presents unique challenges, such as high GC content, low-complexity sequences and duplications.
In a paper published online this week in Nature Genetics, David Jaffe and colleagues present a two-part improvement over existing methods. First, they modify existing methods for generating 250bp paired-end reads using a PCR-free protocol. Because PCR amplification isn’t used, the method significantly reduces coverage bias in the final sequence. Second, they present a new algorithm, called DISCOVAR, that is specifically designed to analyze these data and to call variants in the trickiest parts of the genome.
As a proof of principle, the authors applied their methods to identify all sequence variants, relative to the human reference sequence, in approximately 4 Mb of sequence from the human cell line GM12878. They found that the Illumina Platinum variant call set, which was based on 100bp reads, missed about 25% of the variants, mostly because of low coverage in challenging genomic regions.
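To make the "missed about 25%" figure concrete, here is a minimal sketch of how a missed fraction can be computed once a trusted set of variants exists for a region. It is not the authors' evaluation pipeline: the `Variant` record, the `missed_fraction` helper, and the toy coordinates are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (illustration only)."""
    chrom: str
    pos: int   # 1-based position on the reference
    ref: str   # reference allele
    alt: str   # alternate allele


def missed_fraction(truth: Set[Variant], calls: Set[Variant]) -> float:
    """Return the fraction of truth-set variants absent from the call set."""
    if not truth:
        raise ValueError("truth set is empty")
    missed = truth - calls  # variants in the truth set that were never called
    return len(missed) / len(truth)


# Toy data, invented for this example: four trusted variants, three recovered.
truth = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 40, "G", "GA"),
    Variant("chr2", 900, "T", "C"),
}
calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 900, "T", "C"),
    Variant("chr3", 5, "A", "T"),  # an extra call; it does not affect the missed fraction
}

print(missed_fraction(truth, calls))  # 0.25
```

Exact matching on position and alleles is a simplification. The same insertion or deletion can often be written in more than one equivalent way, so a real comparison would normalize variant representations before matching.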
The new sequencing and assembly method is comparable in cost to existing methods and paves the way for significant improvements in disease-associated variant discovery. We asked the study’s senior author, David Jaffe, to tell us a little more about the background of this work.
Your background is in mathematics, but you became interested in bioinformatics about 15 years ago. What inspired you to apply your skills to biological problems? What was the major difficulty you encountered in switching fields?
Well, suppose you had been a bricklayer for twenty years. It’s good, but after a while your love affair with the bricks wears off. And then you see this new exciting thing that you can do, that people seem to care a lot about. So you jump.
As for the major difficulty, imagine you’re making a change but you don’t really know anything about the new field—and other people know that! So you have to really listen to what they have to say, and hang in there.
How did the idea for DISCOVAR initially come about?
What our group does is look for ways to combine laboratory and computational improvements so as to achieve a better view of the genome (without breaking the bank). We’re always looking for new approaches, and the Broad is a good place to do that because there is a lot of lab innovation and the culture encourages lab/computational interaction. In the summer of 2012, we first saw 250 base-pair reads generated from PCR-free libraries (using Illumina technology), and we realized that these data had enormous power. So we set about designing an algorithm that might work exceptionally well on this data type. Also, we had two generations of assembly algorithms under our belt (ARACHNE and ALLPATHS-LG) and knew we could do better. The DISCOVAR laboratory protocols are available online at our DISCOVAR blog.
What was the most surprising result of your study?
The most surprising thing was the level of contiguity and completeness that could be achieved in local assemblies. The older methods yield a break every 20kb or so, but the new methods just keep going. One reason is that we no longer had PCR dropouts (loci with little or no coverage). Also, much of our computational effort went into error correction that could correct almost any sequence, reducing the incidence of assembly holes attributable to polymerase slippage. Consequently, in most cases, it would be possible to find nearly all the differences with a reference genome.
The DISCOVAR algorithm is designed to work on a specific type of sequencing data—is this sequencing method commonly used, or do you envision it becoming so? Do you think DISCOVAR could become the gold standard for variant calling?
Nearly concurrent with publication, Illumina announced official support for 250 base reads, thus eliminating the need to hack their protocol. We think people will switch because the new data give better results! Variant calling covers a lot of ground. For example, there are very good (economic) reasons why people will still want to sequence exomes. But for cases where the goal is to get a nearly complete inventory of all genomic changes, we think we have the best tool to date.
What were the major hurdles, if any, during the course of this research?
Squeezing the maximum information from the data is just a really hard problem, depending on the fine detail of the data properties (like exactly what sorts of sequences did we get wrong, and why). But to know right from wrong, we had to build a set of reference sequences for our control sample (NA12878), and getting these exactly right was itself a significant undertaking. All of this required an R&D effort with many iterations.
Bonus question: What does the name “DISCOVAR” stand for? Who came up with the name?
Coauthor Iain MacCallum came up with DISCOVAR, which stands for “discover variants.” We actually went through a series of other names first. We thought about Varitas, but somebody else was using it and we thought Harvard might sue us (ha ha). We found a similar name that nobody was using, but dropped it after a colleague told us its street meaning in Brazil…