Finding the hidden variation in the human genome

A new method from researchers at the Broad Institute improves variant discovery in the human genome. {credit}Webridge via Wikipedia{/credit}

Identifying novel sequence variants is a crucial first step toward understanding the genetic basis of many diseases. However, current methods for variant calling, while very good in general, miss the variants in about 10% of the human genome. This 10% of the genome presents unique challenges, such as high GC content, low-complexity sequences and duplications.

In a paper published online this week in Nature Genetics, David Jaffe and colleagues present a two-hit improvement over existing methods. First, they modify existing methods for generating 250 bp paired-end reads using a PCR-free protocol. Because PCR amplification isn’t used, the method significantly reduces coverage bias in the final sequence. Second, they present a new algorithm, called DISCOVAR, that is specifically designed to analyze these data and to call variants in the trickiest parts of the genome.

As a proof of principle, the authors applied their methods to identify all sequence variants in approximately 4 Mb of sequence from the human cell line GM12878, as compared to the human reference sequence. They found that the Illumina Platinum variant call set, which was based on 100 bp reads, missed about 25% of the variants, mostly due to low coverage in challenging genomic regions.

The new sequencing and assembly method is comparable in cost to existing methods and paves the way for significant improvements in disease-associated variant discovery. We asked the study’s senior author, David Jaffe, to tell us a little more about the background of this work.

Your background is in mathematics, but you became interested in bioinformatics about 15 years ago. What inspired you to apply your skills to biological problems? What was the major difficulty you encountered in switching fields?

Well, suppose you had been a bricklayer for twenty years. It’s good, but after a while your love affair with the bricks wears off. And then you see this new exciting thing that you can do, that people seem to care a lot about. So you jump.

As for the major difficulty, imagine you’re making a change but you don’t really know anything about the new field—and other people know that! So you have to really listen to what they have to say, and hang in there.

How did the idea for DISCOVAR initially come about?

What our group does is look for ways to combine laboratory and computational improvements so as to achieve a better view of the genome (without breaking the bank). We’re always looking for new approaches, and the Broad is a good place to do that because there is a lot of lab innovation and the culture encourages lab/computational interaction. In the summer of 2012, we first saw 250 base-pair reads generated from PCR-free libraries (using Illumina technology), and we realized that these data had enormous power. So we set about designing an algorithm that might work exceptionally well on this data type. Also, we had two generations of assembly algorithms under our belt (ARACHNE and ALLPATHS-LG) and knew we could do better. The DISCOVAR laboratory protocols are available online at our DISCOVAR blog.

What was the most surprising result of your study?

The most surprising thing was the level of contiguity and completeness that could be achieved in local assemblies. The older methods yield a break every 20kb or so, but the new methods just keep going. One reason is that we no longer had PCR dropouts, that is, loci with little or no coverage. Also, much of our computational effort went into error correction that could correct almost any sequence, reducing the incidence of assembly holes attributable to polymerase slippage. Consequently, in most cases, it would be possible to find nearly all the differences with a reference genome.

The DISCOVAR algorithm is designed to work on a specific type of sequencing data—is this sequencing method commonly used, or do you envision it becoming so? Do you think DISCOVAR could become the gold standard for variant calling?

Nearly concurrent with publication, Illumina announced official support for 250 base reads, thus eliminating the need to hack their protocol. We think people will switch because the new data give better results! Variant calling covers a lot of ground. For example, there are very good (economic) reasons why people will still want to sequence exomes. But for cases where the goal is to get a nearly complete inventory of all genomic changes, we think we have the best tool to date.

Image from the DISCOVAR demo site. The magenta edge represents a 30 kb heterozygous insertion in the reference sequence. Each edge represents a DNA sequence. Red vertices “continue on” in full graph. {credit}David Jaffe{/credit}

What were the major hurdles, if any, during the course of this research?

Squeezing the maximum information from the data is just a really hard problem, depending on the fine detail of the data properties (like exactly what sorts of sequences did we get wrong, and why). But to know right from wrong, we had to build a set of reference sequences for our control sample (NA12878), and getting these exactly right was itself a significant undertaking. All of this required an R&D effort with many iterations.

Bonus question: What does the name “DISCOVAR” stand for? Who came up with the name?

Coauthor Iain MacCallum came up with DISCOVAR, which stands for “discover variants.” We actually went through a series of other names first. We thought about Varitas, but somebody else was using it and we thought Harvard might sue us (ha ha). We found a similar name that nobody was using, but dropped it after a colleague told us its street meaning in Brazil…

Click here to read the full paper describing DISCOVAR

Make sure to check out the DISCOVAR blog from the Broad Institute and the online demo tool!

How we built a better tomato

One species of wild tomato, Solanum lacerdae{credit}Sandy Knapp{/credit}

Most wild tomato species bear little resemblance to the large, red fruits you’re used to seeing in the supermarket. This is because humans have been molding the tomato to their own taste for thousands of years, by selecting for larger, tastier and (of course) redder fruits.

As a consequence of this selective breeding, we have significantly altered the tomato genome. A new paper published online this week in Nature Genetics analyzed the genomes of 360 tomato accessions, including multiple wild species and cultivated varieties, to understand exactly how and where humans have left their mark on the tomato genome.

This study, the product of a collaboration between many groups around the world, found that human selection on the tomato has led to vast improvement in certain traits at the cost of dramatically reducing genetic variation in large swaths of the genome. An unintended consequence of historical selective breeding in tomato is that there is now little room for improvement on many traits that we care about. By identifying these regions, the study will allow tomato breeders to make more strategic plans for future crop improvement.

We asked one of the study’s senior authors, Sanwen Huang, to tell us a little more about the work and why it is important:

This study was obviously a huge undertaking. How did collaborations come about, and what were the major difficulties in the project?

As an international consortium, we sequenced the tomato genome together (Nature 2012) and this project was regarded as another milestone of tomato research. The difficulty in the current project was deciding what to sequence. Fortunately, our team includes experts who understand tomato germplasm and have studied the natural variation of tomatoes for a long time. As a result, we combined tomato lines from many well-studied core collections from several countries, such as the US (Roger Chetelat), Israel (Dani Zamir), France (Mathilde Causse), Italy (Andrea Mazzucato), and China (Yongchen Du, Zhibiao Ye, and Jingfu Li).

What do you see as the most important aspect of your study’s results?

There are several important results that came out of this work. First, the evolution of tomato fruit size had two stages, from the wild progenitor of the modern cultivated tomato, Solanum pimpinellifolium, to cherry tomato (from ~1g to ~10g), and from cherry tomato to big-fruited tomato (from ~10g to ~100g). We found that there are two independent sets of QTLs or genes that have been selected during the two evolutionary stages. Second, there is a huge genomic signature of the divergence between fresh tomato and processing tomato [tomatoes used for commercial canning], on chromosome 5. This genomic region harbors several genes related to higher soluble solid content and fruit firmness that were selected during breeding for processing tomato. And more interestingly, we noticed that in recent fresh tomato F1 breeding, this region was also exploited for better taste and longer shelf-life. Third, we identified the causal variants for the pink tomato, which can be used for selective breeding. Pink tomato is a favorite in North China and I prefer it too, as it tastes better than the red ones. Finally, we found there have been costs to historical selection. One example is the near fixation of 25% of the tomato genome due to genetic hitchhiking that occurred during domestication and improvement sweeps, along with the linkage drag associated with wild introgression.

Cover of Nature, May 2012

Were you at all surprised to find such a large number of domestication and improvement sweeps? Did these results differ at all from other prominent vegetables, such as cucumber or potato?

The number and genomic proportion of domestication sweeps in tomato are similar to those in cucumber. However, the linkage disequilibrium blocks are bigger in tomato than in cucumber, possibly due to the fact that tomato is a self-crossing species. Based on our data, we predict that the effective population size of tomato at domestication was about 300, similar to that of cucumber (~500), which is significantly smaller than that of maize (~150,000). This means these two vegetables have undergone much more severe bottlenecks during domestication as compared to maize.

How do you envision tomato breeders using the results of your study?

As a result of this work, tomato breeders will have a panoramic view of tomato variation and a better understanding of the raw materials used in their own breeding programs. From a practical standpoint, they will have access to a database of 11 million SNPs, from which they can pick the ones best suited to their molecular breeding programs. For example, they can combine the SNP dataset with their phenotypic data, to elucidate the genetic bases of important traits. Finally, and importantly I think, they will better understand the limitations of conventional breeding and the cost of historical selection, which will give them clues to improve their future programs.

{credit}Photo courtesy of USDA Natural Resources Conservation Service{/credit}

Congratulations on your recent move to the Agricultural Genome Institute at Shenzhen where you are a co-founding director. Can you tell us a little about this new institute and what its goals are?

Thanks! The leadership of the Chinese Academy of Agricultural Sciences set up the institute (AGIS) to innovate agricultural research using genomics.

AGIS is located in the Dapeng District of Shenzhen, a beautiful bay area. The Shenzhen municipal government is developing the Dapeng Peninsula as the International Bio-valley, and high-tech agriculture is one of the highlights. AGIS will recruit ~200 scientists who will decode, analyze, and utilize agricultural genomes. There will be three themes of research: the first theme is to develop basic algorithms and bioinformatic tools tailored for agricultural genomes, many of which are quite different from the human genome that has been the focus for most bioinformaticians; the second theme is to empower agricultural breeding with genomics, to increase the efficiency and effectiveness of breeding that is essential to global food security; and the third theme is to provide genomic surveillance of food safety and the agricultural environment, which is a huge concern of society and a need for sustainable development.

A vegetable market in Shanghai, China{credit}nadja robot via Flickr.com{/credit}

Bonus question: What is your favorite vegetable?

China is a country of vegetables, as there are over 200 kinds of vegetables that are regularly consumed in the country. I enjoy the diversity. For fruit vegetables I like tomato, cucumber, and chili; for leaf vegetables, I like Chinese cabbage, lettuce, and coriander.

 

You can read more about this exciting study at The Scientist. Read the full paper here

Discovery of a gene for heart and gut rhythms

What do your heart and gut have in common? More than you might think. A new study by Gregor Andelfinger and colleagues has found that a single gene, SGOL1 (Shugoshin-like 1), is required for the normal rhythms of both the heart and intestine.

The study’s co-authors found 17 patients with dysrhythmias of both the heart and intestine, termed sick sinus syndrome (SSS) and chronic intestinal pseudo-obstruction (CIPO), respectively. SSS is a type of cardiac arrhythmia. Though it’s very rare in children or young adults, it is more common in the elderly and generally requires the patient to have a pacemaker implanted. CIPO occurs when the intestines stop their usual rhythmic pulses, and food can no longer pass through the digestive tract on its own. Both conditions are extremely rare as inherited disorders, so finding both disorders in these 17 patients was a truly remarkable discovery.

All affected patients in the study shared the same homozygous variant, which resulted in changing a lysine to a glutamic acid at a conserved residue. The new syndrome was named Chronic Atrial and Intestinal Dysrhythmia (CAID).

We asked one of the study’s lead authors, Gregor Andelfinger at Sainte-Justine University Hospital Research Center in Montreal, to tell us a little more about the work:

How did you become involved in studying CAID?

Map of Canada (New France) in North America 1703{credit}Wikipedia{/credit}

We have an excellent collaboration across our provincial biobank for congenital heart disease in Québec and exchange regularly among colleagues. We now have more than 3,000 deeply phenotyped participants in our biobank—both affected and unaffected family members—and when my colleagues told me about an unusual co-occurrence of SSS and CIPO in a couple of cases, we quickly fanned out and a side project suddenly got to center stage in the lab. We were surprised to see how many patients we found in relatively short time for a previously undescribed disease. Obviously, we would be very eager to learn from other groups whether they have encountered similar rare patients, and would love to cooperate! Let’s not forget that this type of research always has a human face, and this is what motivates our group in the first place.

What would you say was the most unexpected aspect of this research? 

Everything in this project was unexpected! On the clinical side, the emergence of a generalized automaticity disorder in humans was totally unanticipated. On the molecular side, one of the biggest surprises certainly was how wrong we all were with our thoughts on what could be the causal gene. Virtually all members in the lab placed their bets on ion channels, a priori the most likely suspects. As you know, we were all proven wrong and had to go back to rethink how this disease arises. We were again surprised how a completely new picture emerged when we finally put all the pieces of the puzzle together—from genetics, populations and cell biology to disease.

How does the finding of SGOL1 mutations in these rare cases help inform the biology of CIPO and SSS more generally?

When doing my literature search, I was very surprised that one of the discoverers of the sinus node [the heart’s pacemaker tissue], Arthur Keith, had already drawn parallels between cardiac and gut pacemaking in an article in 1915. The recent literature suggests a role for TGF-β signaling as a driver for fibrosis in channelopathies and arrhythmias, and obviously this could very well be an important pathway through which a progressive destruction of pacemaking tissues takes place (for example, see papers here and here). Remember that we can clearly show that all patients in our series were normal at birth and developed disease only at later stages. On the other hand, we also have evidence that some ‘developmental anomalies’ are present in CAID patients, since the malformed gut pacemaking system probably was present from birth on, with initially normal function. I think that we are dealing with an overlap of developmental and acquired phenotypes, and that a similar process takes place in isolated SSS and CIPO, even if we could not detect SGOL1 mutations in the isolated forms of disease. Beyond this, I think the monogenic nature of the CAID phenotype tells us that all pacemaker cells need the cohesin complex. I would not be surprised if we found at least two non-canonical roles for SGOL1 in the future, one driving the developmental, and the other one driving the acquired part of disease, and that these disease pathways are at least partially shared in isolated SSS and CIPO. ‘Shugoshin’ means ‘guardian spirit’ in Japanese, so this is a very apt name for functions of this gene beyond its known function of protecting sister chromatids.

What do cardiac and intestinal pacemakers have in common, and what could make them particularly vulnerable to mutations in a cohesin complex member?

First, they are both relatively small organs. An adult sinus node is approximately 15 x 5 x 1.5 mm in size, probably not more than 50,000 cells. Second, both organs are non-uniform and comprise different cellular subtypes, and third, they have to be in a very particular place to efficiently perform their function. Fourth, and very importantly, cells in both organs are capable of automaticity. What could the cohesin complex have to do with these commonalities of different pacemakers in the human body? For the known functions of cohesin, in particular cell division, I speculate that a defect could directly influence how many cells will be available to form a certain organ. However, apart from the smaller myenteric plexuses we found in CAID patients, we do not have direct experimental evidence for this. Of course, this could also affect subpopulations within these organs, the second organ property I alluded to above. Ageing and loss of cells over time may also come into play in this intricate balance.

I am at a loss to come up with a valid hypothesis for how a dysfunction of the cohesin complex would lead to the misplaced myenteric plexuses we found in CAID patients. As far as the fourth commonality between cardiac and intestinal pacemakers is concerned, we know that automaticity is mainly generated by spontaneous depolarizations. The channels responsible for this phenomenon are mainly the HCN channels and SCN5A, but calcium transients also participate in this. Given that cohesin plays an important role in transcriptional regulation, it is conceivable that some target genes are not correctly expressed when SGOL1 is mutated, either in time, space or quantity. Several recent studies on cohesinopathies point out that higher-order chromatin architecture organization has to be tightly regulated for normal gene expression, and I speculate that a dysfunction of SGOL1 could lead to problems with ion channel expression and thus be one of the key factors why we see this exquisite target organ specificity.

Can you say a little about the FORGE Canada consortium and how your research relates to its mission?

FORGE Canada (Finding of Rare Disease Genes) was launched on April 1, 2011 and brought together clinicians from all 21 Clinical Genetics Centres representing every province, as well as clinicians from 17 countries. From nation-wide requests for proposals, 264 disorders were selected for study from the 371 submitted, and disease-causing variants were identified for 146 disorders over a 2-year period. These included variants in 67 genes not previously associated with human disease; 41 of these have been genetically or functionally validated, and 26 are currently under study. The outcome of this project was recently published in an article in AJHG. This project has a successor, Care4Rare, which is a pan-Canadian collaborative team building upon the infrastructure and discoveries of FORGE Canada. The goal of CARE for RARE is to improve clinical care for patients and families affected by rare diseases. I think the great success of these projects also stems from their openness to collaborators like our group. This is the way it should be, and since my lab is working on several rare disease traits, we have benefited greatly from their help.

 

You can read the full paper here on the Nature Genetics website.