Finding the hidden variation in the human genome

A new method from researchers at the Broad Institute improves variant discovery in the human genome. {credit}Webridge via Wikipedia{/credit}

Identifying novel sequence variants is a crucial first step toward understanding the genetic basis of many diseases. However, current methods for variant calling, while very good in general, miss the variants in about 10% of the human genome. This 10% of the genome presents unique challenges, such as high GC content, low-complexity sequences and duplications.
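To make "high GC content" concrete, here is a minimal Python sketch that computes the fraction of G and C bases in fixed-size windows and flags the GC-rich ones. The function names, the 100-base window and the 0.65 threshold are illustrative assumptions, not values taken from the paper.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def high_gc_windows(seq: str, window: int = 100, threshold: float = 0.65):
    """Yield (start, gc) for non-overlapping windows at or above the threshold.

    The window size and threshold are hypothetical defaults chosen for
    illustration only.
    """
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if gc >= threshold:
            yield start, gc
```

For example, `list(high_gc_windows("AAAAGGGG", window=4, threshold=0.75))` flags only the second window, which starts at position 4.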

In a paper published online this week in Nature Genetics, David Jaffe and colleagues present a two-part improvement over existing methods. First, they modify existing methods to generate 250 bp paired-end reads using a PCR-free protocol. Because PCR amplification isn’t used, the method significantly reduces coverage bias in the final sequence. Second, they present a new algorithm, called DISCOVAR, that is specifically designed to analyze these data and to call variants in the trickiest parts of the genome.

As proof of principle, the authors applied their methods to identify all sequence variants in approximately 4 Mb of sequence from the human cell line GM12878, as compared to the human reference sequence. They found that the Illumina Platinum variant call set, which was based on 100 bp reads, missed about 25% of the variants, mostly due to low coverage in challenging genomic regions.

The new sequencing and assembly method is comparable in cost to existing methods and paves the way for significant improvements in disease-associated variant discovery. We asked the study’s senior author, David Jaffe, to tell us a little more about the background of this work.

Your background is in mathematics, but you became interested in bioinformatics about 15 years ago. What inspired you to apply your skills to biological problems? What was the major difficulty you encountered in switching fields?

Well, suppose you had been a bricklayer for twenty years. It’s good, but after a while your love affair with the bricks wears off. And then you see this new exciting thing that you can do, that people seem to care a lot about. So you jump.

As for the major difficulty, imagine you’re making a change but you don’t really know anything about the new field—and other people know that! So you have to really listen to what they have to say, and hang in there.

How did the idea for DISCOVAR initially come about?

What our group does is look for ways to combine laboratory and computational improvements so as to achieve a better view of the genome (without breaking the bank). We’re always looking for new approaches, and the Broad is a good place to do that because there is a lot of lab innovation and the culture encourages lab/computational interaction. In the summer of 2012, we first saw 250 base-pair reads generated from PCR-free libraries (using Illumina technology), and we realized that these data had enormous power. So we set about designing an algorithm that might work exceptionally well on this data type. Also, we had two generations of assembly algorithms under our belt (ARACHNE and ALLPATHS-LG) and knew we could do better. The DISCOVAR laboratory protocols are available online at our DISCOVAR blog.

What was the most surprising result of your study?

The most surprising thing was the level of contiguity and completeness that could be achieved in local assemblies. The older methods yield a break every 20kb or so, but the new methods just keep going. One reason is that we no longer had PCR dropouts, loci with little or no coverage. Also, much of our computational effort went into error correction that could correct almost any sequence, reducing the incidence of assembly holes attributable to polymerase slippage. Consequently, in most cases, it would be possible to find nearly all the differences with a reference genome.

The DISCOVAR algorithm is designed to work on a specific type of sequencing data—is this sequencing method commonly used, or do you envision it becoming so? Do you think DISCOVAR could become the gold standard for variant calling?

Nearly concurrent with publication, Illumina announced official support for 250 base reads, thus eliminating the need to hack their protocol. We think people will switch because the new data give better results! Variant calling covers a lot of ground. For example, there are very good (economic) reasons why people will still want to sequence exomes. But for cases where the goal is to get a nearly complete inventory of all genomic changes, we think we have the best tool to date.

Image from the DISCOVAR demo site. The magenta edge represents a 30 kb heterozygous insertion in the reference sequence. Each edge represents a DNA sequence. Red vertices “continue on” in full graph. {credit}David Jaffe{/credit}

What were the major hurdles, if any, during the course of this research?

Squeezing the maximum information from the data is just a really hard problem, depending on the fine detail of the data properties (like exactly what sorts of sequences did we get wrong, and why). But to know right from wrong, we had to build a set of reference sequences for our control sample (NA12878), and getting these exactly right was itself a significant undertaking. All of this required an R&D effort with many iterations.

Bonus question: What does the name “DISCOVAR” stand for? Who came up with the name?

Coauthor Iain MacCallum came up with DISCOVAR, which stands for “discover variants.” We actually went through a series of other names first. We thought about Varitas, but somebody else was using it and we thought Harvard might sue us (ha ha). We found a similar name that nobody was using, but dropped it after a colleague told us its street meaning in Brazil…

Click here to read the full paper describing DISCOVAR

Make sure to check out the DISCOVAR blog from the Broad Institute and the online demo tool!

Genome assembly contest prompts soul-searching

Bioinformaticians today published a mammoth evaluation of genome assemblers — computer programs that aim to piece together short DNA sequence reads into complete genomes.
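As a rough illustration of what "piecing together" reads means, the Python sketch below repeatedly merges the two reads with the longest suffix-to-prefix overlap. It is a toy greedy approach with hypothetical function names and made-up reads; it is not how the Assemblathon entrants work. Real assemblers typically use graph-based methods and must cope with sequencing errors, repeats and both DNA strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Toy greedy assembly: merge the best-overlapping pair until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining pair overlaps; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

With the made-up reads `["ATGGCG", "GCGTAC", "TACGGA"]`, this returns the single contig `["ATGGCGTACGGA"]`. Changing `min_len` or the tie-breaking order can change the output on less tidy input, which hints at why different assemblers, or different settings of the same one, can disagree.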

Their work, described in the journal GigaScience, was conducted for the second Assemblathon, a contest designed to compare and evaluate competing genome assemblers. In the current round of the contest, which started in July 2011, 21 teams submitted 43 attempts to assemble three genomes from scratch: that of a bird (budgerigar), a fish (the Lake Malawi cichlid) and a snake (the boa constrictor).

One notable finding from the contest was that different assemblers — and the same assemblers in the hands of different teams — did not give consistent results. That echoes the results of Assemblathon 1, which wrapped up in 2011. But the problem itself may be more significant now than it was then, owing to the democratization of genomics, with many more labs now using many more methods to assemble many more genomes from scratch.

Perhaps because of this, Assemblathon 2 has sparked a bit of soul-searching among bioinformaticians, who have debated its results and their significance since a preprint of the paper was posted on arXiv in January. 

Bioinformatician C. Titus Brown of Michigan State University in East Lansing, who reviewed the paper, published his review and wrote on his blog in February: “the biggest outcome of the Assemblathon 2 paper can be stated quite simply: we’re doing it all wrong, in bioinformatics…as a field, we have pretended that genome assembly is a reliable exercise and that the results can be trusted; the Assemblathon 2 paper shows that that’s wrong.”

Keith Bradnam of the University of California in Davis, the paper’s first author, doesn’t fundamentally disagree with that take: “I agree that the science community should be better at explaining that genomes and genome assemblies are the results of individual experiments that are rarely ever replicated. Trust them at your peril,” he commented on Brown’s post.

This isn’t an ideal situation for the average scientist who just wants to know which is the best tool to use for a specific project. On the blog Haldane’s Sieve, Bradnam compares the process of selecting an assembly method to that of choosing the best pizzeria in Davis.

“[T]he notion of a ‘best’ pizza is highly subjective and the best pizza for one person is almost certainly not going to be the best pizza for someone else,” Bradnam writes.

“Just as it might be hard to find somewhere that sells an inexpensive gluten-free, vegan pizza that’s made with fresh ingredients, has lots of toppings and can be quickly delivered to you at 4:00 am, it may be equally hard to find a genome assembler that ticks all of the boxes that you are interested in.”

See links to all the commentary on today’s paper at the Assemblathon web page.

Follow Erika on Twitter @Erika_Check.