About Orli Bahcall

Senior Editor for Genetics & Genomics, Nature. Twitter: @obahcall

Joint calling of the ExAC publications

ExAC publications in Nature

We report this week in Nature and Nature Genetics the first publications from the Exome Aggregation Consortium (ExAC), a project that has generated the largest catalogue to date of variation in the protein-coding regions of the genome (known collectively as the exome), aggregating sequence data from over 60,000 individuals from across 21 research studies. Most importantly, the authors have provided a publicly accessible database (https://exac.broadinstitute.org), which has already become a critical resource for research and clinical studies. While more than 1 million individuals are estimated to have had their exome or whole genome sequenced, only a small fraction of these data has been made publicly available, as there are many challenges to sharing and providing open access to these datasets. We applaud the authors for recognizing this need and meeting these challenges.

This work comes 15 years after we published the Human Genome Project, and follows a series of community resources that catalogue variation in human genomes within and across populations. We continue to support these efforts, recognizing that these resources are necessary to further our understanding of the information encoded in our genome, of genetic variation and of the genetic basis of disease.

Mapping ExAC publications

Very rare genetic variation: a first look

The scale of the ExAC sequencing dataset has provided some of our first glimpses into very rare genetic variation across populations, with several important early insights. First, the authors identify more than 7.4 million high-confidence genetic variants, on average one every 8 bases, the majority of them entirely novel (not present in any existing database) and extremely rare (more than half of the variants are seen only once across all 60,706 samples). Second, they are able to document recurrent rare mutations emerging independently, providing an estimate of the frequency of recurrence, never observed systematically before due to the need for such large sample sizes. Third, they are able to examine the level of selective constraint against protein-truncating variation, identifying 3,230 genes that appear highly loss-of-function-intolerant. Reassuringly, this includes most known human haploinsufficient disease genes; however, 72% of the 3,230 genes do not yet have an established human disease phenotype. While some of these genes may be associated with weaker phenotypes or embryonic lethality, this points to how much more we have yet to understand about the phenotypic consequences of loss of function in human genes.

Copy number variation in ExAC

In a companion paper in Nature Genetics, Douglas Ruderfer, Shaun Purcell and colleagues examine rare copy number variation (CNV) in the ExAC dataset, specifically the rates and properties of genic CNVs with <0.5% frequency. They use their previously published method, XHMM, to characterize CNV calls from this exome sequencing dataset. They find that ~70% of individuals carry at least one rare genic CNV, with an average of 0.81 deleted and 1.75 duplicated genes. The authors also estimate relative intolerance to CNVs for each gene. This CNV dataset is incorporated into ExAC and will be useful for continuing population and disease association studies, together with other measures of genic intolerance, and the authors provide an example of this in their analysis of a schizophrenia case-control study.

Clinical genetics: classifying pathogenic variation

The current work also brings an important message for clinical genetics: the literature on classifying pathogenic variation for rare disorders needs to be reexamined. The average ExAC participant harbors ~54 variants that have previously been classified as causal for a disease, and considering the ascertainment of the study it is likely that most of these are misclassified variants.

Using ExAC as a reference panel for classifying disease-relevant variation, Lek et al. review the evidence for pathogenicity of 192 previously reported pathogenic variants for rare Mendelian disorders. Only 9 of these variants had sufficient support for disease association, with a high proportion of these variants present at an implausibly high frequency in the ExAC dataset. This suggests that many of these were false positive associations that were incorrectly classified as pathogenic. The implications are not merely academic, as these findings are often used in clinical diagnoses and treatment.

In two additional companion publications, the authors take this a step further and demonstrate what is needed to move towards resolving the nature of these prior associations, by combining large case series with ExAC. Walsh et al. (Genetics in Medicine, 10.1038/GIM.2016.90 published online August 17, 2016) systematically reexamine evidence for genes implicated in cardiomyopathy, one of the most common and severe rare disorders, and find that many well-known purported cardiomyopathy genes do not show support for pathogenicity, including some that are included in routine clinical genetic testing. Similarly, Minikel et al. collect 16,025 prion disease cases, the largest case series ever available for prion disease, for which ~10-15% of cases are estimated to be caused by mutations in the PRNP gene. They find that a number of variants in PRNP thought to be pathogenic and highly penetrant appear likely to be benign (Minikel et al. Science Translational Medicine 10.1126/scitranslmed.aad5169). This led to a corrected patient diagnosis soon after this report, as Robert Green explained in his Perspective accompanying this publication (Lebo et al. Science Translational Medicine 10.1126/scitranslmed.aad9460).

These findings highlight the necessity to carefully evaluate the literature for rare genetic disorders. This also reinforces the value of large reference panels such as ExAC for filtering variants seen in patient exomes, a practice most of the genomics community has adopted in establishing standards for assessing sequence variants in human disease (MacArthur et al. Nature 508, 469–476 (2014), 10.1038/nature13127). The ExAC project continues to expand in size, hoping to increase to more than 120,000 exome sequences over this next year, as well as 20,000 whole genome sequences, bringing additional sample size, diversity and exploration of non-coding regions that will aid these efforts.
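
The filtering practice described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by ExAC or by any of the papers cited here: the variant names, allele counts and the 0.1% frequency cutoff are all hypothetical, chosen only to show the underlying logic that a variant seen too often in a reference panel is an implausible cause of a rare, highly penetrant disorder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelCount:
    """Allele counts for one variant in a reference panel (hypothetical data)."""
    allele_count: int   # chromosomes carrying the variant
    allele_number: int  # chromosomes successfully genotyped at this site

    @property
    def frequency(self) -> float:
        # Guard against sites with no genotyped chromosomes.
        return self.allele_count / self.allele_number if self.allele_number else 0.0


def filter_candidates(patient_variants, panel, max_frequency):
    """Keep patient variants whose panel frequency is at or below max_frequency.

    Variants absent from the panel are kept, since absence is consistent
    with the variant being very rare.
    """
    kept = []
    for variant in patient_variants:
        counts = panel.get(variant)
        if counts is None or counts.frequency <= max_frequency:
            kept.append(variant)
    return kept


# Hypothetical panel of 60,706 diploid individuals, so an autosomal site
# has at most 2 * 60,706 = 121,412 genotyped chromosomes.
CHROMOSOMES = 2 * 60706
panel = {
    "var_A": PanelCount(allele_count=1, allele_number=CHROMOSOMES),     # seen once
    "var_B": PanelCount(allele_count=2500, allele_number=CHROMOSOMES),  # ~2% frequency
}

# "var_C" is not in the panel at all; the 0.1% cutoff is an assumption.
candidates = filter_candidates(["var_A", "var_B", "var_C"], panel, max_frequency=0.001)
print(candidates)
```

In this toy example the variant seen once and the variant absent from the panel are retained, while the variant at roughly 2% frequency is removed. In practice an appropriate cutoff depends on the prevalence, inheritance pattern and penetrance of the disorder in question, which this sketch does not attempt to model.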

ClinVar and contributing to variant interpretation databases

This project, which relied on the willingness of many large research consortia to provide their raw data, demonstrates the extreme value of promoting the sharing, aggregation and harmonization of genomic data. This is true also for patient genetic variants, as there is a need for databases that provide greater confidence in variant interpretation. NCBI’s ClinVar database, which accepts contributions of clinically annotated genetic variation from clinical labs, clinicians and researchers, has become a key resource for clinical variant interpretation.

Improvements to the landscape of clinical genetics will require continued investment in such variant databases, continued expansion of human genetic reference panels, as well as efforts to link these to phenotype data. Recontacting to obtain phenotype data will be trialed on a subset of the ExAC dataset where consents allow, while new initiatives such as the UK 100,000 Genomes Project and the US Precision Medicine Initiative will also provide linked genome and phenotype information. Finally, enabling the ethical sharing of linked genetic and clinical data without violating participant privacy will require fundamental innovation in regulation and ethics policy, work that has been started by bodies such as the NIH and the Global Alliance for Genomics and Health, but around which considerable uncertainty remains.

Focus on TCGA Pan-Cancer Analysis

Nature Genetics is pleased to present today the first installment of our Focus on TCGA Pan-Cancer Analysis.

The Cancer Genome Atlas (TCGA) has analyzed over 8,000 cancer cases across 27 tumor types to date, and aims to have over 100,000 specimens analyzed by the end of 2015. They have commendably made both data and exploration tools publicly available at https://www.cancergenome.nih.gov. They have previously published 8 papers reporting in-depth genomic characterization of individual tumor types.

The TCGA Pan-Cancer initiative, launched in October 2012 at a meeting in Santa Cruz, California, seeks to combine analysis across tumor types in order to identify both similarities and differences in genomic alterations. The work presented in this collection of Pan-Cancer publications includes analysis of the first 12 TCGA tumor types. This includes over 3,000 cancer patients profiled with 6 different platforms to assess genomic, transcriptional, epigenetic and proteomic alterations, combined with clinical data. The authors demonstrate that while a majority of the tumor samples show unique genomic alterations, by combining analysis they are able both to increase statistical power for the detection of molecular drivers and to identify common pathways that are altered across tumor types.

The Pan-Cancer initiative provides a model for large-scale collaborative analysis as well as data sharing, bringing together over 250 collaborators from ~30 institutions working together on over 60 projects analyzing the same dataset.  These efforts required a strong collaborative framework, a commitment to rapid distribution of data, and means to facilitate shared analysis. Josh Stuart and colleagues provide an overview of this project in an accompanying Commentary.

This work also relied on the development of new bioinformatics tools and platforms, providing a foundation that should prove useful in future large-scale analysis projects. A Commentary by Larsson Omberg and colleagues highlights these approaches and the use of the Synapse software platform to share and evolve data, analysis and results among the Pan-Cancer Working Group. The Synapse platform was developed by Sage Bionetworks to facilitate open and data-driven collaborative research efforts, and is also widely used in DREAM challenges. The use of this platform supported the discovery efforts reported in this collection of Pan-Cancer papers, which also provide a public resource of highly curated and standardized data sets across a series of data freezes along with automated analysis systems.

In the first of two Analysis papers published today in Nature Genetics, Chris Sander and colleagues provide a hierarchical classification of 3,299 tumors from 12 cancer types from the Pan-Cancer dataset, using a newly developed algorithmic approach. Their analysis separates tumors into those with primarily somatic mutations and those with primarily copy number alterations. They also identify oncogenic signatures that characterize ~30 tumor subclasses, which may suggest therapeutic targets of relevance across tumor types.

In a second Analysis published in Nature Genetics, Rameen Beroukhim and colleagues characterized somatic copy number alterations (SCNAs) in 4,934 primary cancer specimens across 11 cancer types from the Pan-Cancer dataset. They observed whole-genome doubling in 37% of cancers, associated with higher rates of all SCNAs.

We are pleased to support the TCGA Pan-Cancer efforts as a model for large-scale collaborative genomics projects combined with open data sharing, one that demonstrates the ready benefits this can bring to our understanding of the molecular drivers of cancer. The TCGA Pan-Cancer project continues to develop, and so will this Focus, so please get primed with this selection of publications and stay tuned. In the meantime, here is a selection of social media and press stories: https://storify.com/obahcall/nature-genetics-pan-cancer-focus.

Preview to the 7th Genomics of Common Diseases

The 7th annual Genomics of Common Diseases conference is taking place this weekend, from September 7-10, at Keble College, Oxford University. At this conference, we seek to represent a top selection of the latest research characterizing the genetic basis of a range of common diseases.

We held the first Genomics of Common Diseases conference in 2007, with a program that highlighted rapid advancements in identifying common variants associated with a range of common diseases, made possible by new methods enabling genome-wide association studies (GWAS). Over the past seven years, our understanding of the genetic architecture of disease has been progressively redefined by GWAS characterizing common variation, the fine mapping of associated regions, the emergence and growth of new sequencing technologies and the assessment of rare variant association. We have represented the progress in the field facilitated by rapid improvements in and reduced costs of genotyping and sequencing technologies. We have also seen rapid growth in the scale of genetic datasets, with the need to analyze progressively larger sample sizes. Our sixth annual conference focused both on presenting the latest applied technologies and on how to meet challenges posed by the analysis and interpretation of these large-scale genetic datasets.

Welcome to the Nature Genetics iCOGS collection

We are pleased to publish today a Focus issue on cancer risk, including findings from the COGS (Collaborative Oncological Gene-environment Study) consortium, published as 13 coordinated papers in this Nature Genetics iCOGS collection.

At Nature Genetics, we give voice to leading efforts to understand the genetic basis of disease.  Over the past six years, we have seen mass surveys of genetic variants across the human genome, called genome-wide association studies, yield key insights into hundreds of common diseases.

Today, we’re proud to see how COGS, extending this approach to oncology, has doubled the number of genetic regions implicated in breast, ovarian and prostate cancers.  As such, these 13 papers represent a milestone in our understanding of these common cancers, and exemplify what’s needed in such discovery efforts.

To provide a brief summary before you dive in, several overall findings from this collection bear highlighting:

1)      This collection implicates 74 new genomic regions in these 3 cancers, doubling the number of reported associations.

2)      This work finds substantial contribution from common genetic variation to cancer risk heritability, explaining up to roughly 1/3 of the familial relative risk for these cancers.

3)      The studies find that variation in some parts of our genomes is associated with multiple hormone-sensitive cancers, suggesting shared mechanisms.

4)      This collection also establishes a general framework for refining initial association findings through fine-mapping and functional annotation.

5)      Finally, these efforts have pointed to ways that such findings may find use in personal healthcare and genetic risk prediction.

These studies also exemplify what’s needed in such genetic discovery efforts, including these three important factors:

1)      This work requires very large samples, with these new papers summarizing data from more than 200,000 people, making this the largest cancer genotyping discovery effort.

2)      These studies entail collaboration, here bringing together researchers from more than 160 institutions worldwide, in four cancer-specific consortia, to execute over 40 studies.

3)      This collection of studies pools efforts across distinct but related diseases – here, 3 hormone-related cancers. This allows more efficient use of hard-won data, for faster and more precise insights into the shared versus unique genetic underpinnings of particular diseases.

In supporting this collaborative effort, we were pleased to work directly with the COGS authors to coordinate these 13 publications across 5 leading journals, working with editors at The American Journal of Human Genetics, Human Molecular Genetics, Nature Communications, and PLoS Genetics.

And, to best convey the findings, Nature Genetics is debuting a new online publishing format that builds from the single paper as unit of publication, to a broad, unified view into the entire iCOGS collection.

We hope that this iCOGS explorer microsite will help our readers quickly understand and access materials across the set of coordinated papers.  This iCOGS explorer includes a series of essays, called Primers, which interactively guide readers through the studies, summarizing the main findings and themes of the collection, offering direct glimpses into each paper, and perspectives on how they relate to other work in the field.

This new online publishing format builds on the Threads used in the ENCODE explorer.  These Primers now interlace “threads” (direct quotes from papers) with editorial analysis in order to provide more context and additional analysis.

We also hope that this provides an example to help the public understand current efforts to characterize the genetic basis of common diseases, as we review here how and why such studies are performed, and what insights are possible.

Without further delay, we welcome you to jump right into the Nature Genetics iCOGS collection.  Not sure where to start or overwhelmed with this stack of materials?  I recommend picking your favorite from these questions below, linked to answers in our Primers.

 

We hope that you find this collection and the iCOGS explorer useful, and that this proves of benefit in accessing the information in these current studies, as well as promoting continuing collaborative research.

 

Orli G. Bahcall
Senior Editor
Nature Genetics  
Twitter: @obahcall 

Marking the launch of this blog and our journal

We launched Free Association in November 2005, as one of the first two Nature Publishing Group journal blogs. Our blog was launched as a pioneering effort by our then Senior Editor Alan Packer (who has since moved to a position as Associate Director for Research at the Simons Foundation), as a new way for the editors of Nature Genetics to engage our community. We did so with excitement about interacting and discussing papers and community issues on a more informal level than is possible in our print publication. At the same time, I recall that we (the Nature Genetics editors at the time) shared some concerns about what we would be able to discuss, given the confidential nature of the peer review process. We also wondered if our community of authors, reviewers, and readers alike would manage to find time and interest in posting on our site. While these concerns have in part remained, we have a new perspective as we move into the 7th year of this blog and with the launch of this new site.

Over these years, we have maintained Free Association as an editor-driven blog, used to highlight and discuss our own content, press and feedback from the community, and to announce special events. We will continue to post on these topics, and we also welcome guest posts on topics relevant to our own content and our genetics/genomics community.

We have also experimented with using this blog as a means to discuss and develop community standards and research guidelines relevant to our community, but we have now shifted to using the data standards section of Nature Precedings  for this purpose. 

This year also marks the 20th anniversary of the launch of Nature Genetics.  I have been fortunate enough to be an editor here since 2004 (yes, I do remember our pages pre-GWAS), and have to admit that every year I find myself saying that this is one of the most exciting times to be in this field.  We have much to celebrate in advances within the genetics and genomics communities.   All I will say for now is that you should stay closely tuned for how we will mark this anniversary.  Comments and suggestions are of course welcome. 

Orli Bahcall

Senior Editor, Nature Genetics