We report this week in Nature and Nature Genetics the first publications from the Exome Aggregation Consortium (ExAC), a project that has generated the largest catalogue to date of variation in the protein-coding regions of the genome (known collectively as the exome), aggregating sequence data from over 60,000 individuals across 21 research studies. Most importantly, the consortium has provided a publicly accessible database (http://exac.broadinstitute.org), which has already become a critical resource for research and clinical studies. Although more than 1 million individuals are estimated to have had their exome or whole genome sequenced, only a small fraction of these data has been made publicly available, because sharing and providing open access to such datasets poses many challenges. We applaud the authors for recognizing this need and meeting these challenges.
This work comes 15 years after we published the Human Genome Project, and follows a series of community resources that catalogue variation in human genomes within and across populations. We continue to support these efforts, recognizing that such resources are necessary to advance studies of the information encoded in our genome, of genetic variation and of the genetic basis of disease.
Mapping ExAC publications
- Primary report from ExAC: Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature http://dx.doi.org/10.1038/nature19057 (2016).
- Nature Editorial: Rare Rewards (18 August 2016).
- Nature Podcast on ExAC and clinical genetics with Daniel MacArthur and Robert Green.
- News & Views by Jay Shendure: A deep dive into genetic variation. Nature 536, 277–278 (18 August 2016). doi:10.1038/536277a. Jay’s perspective on developments from the first exome sequencing studies for Mendelian disorders 7 years ago to over 60,000 exomes today and growing.
- Companion paper: Ruderfer et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nature Genetics http://dx.doi.org/10.1038/ng.3638 (2016).
- Companion paper: Walsh, R. et al. Reassessment of Mendelian gene pathogenicity using 7,855 cardiomyopathy cases and 60,706 reference samples. Genet. Med. http://dx.doi.org/10.1038/GIM.2016.90 (2016).
- Companion paper: Minikel et al. Quantifying prion disease penetrance using large population control cohorts. Science Translational Medicine 10.1126/scitranslmed.aad5169 (2016).
Very rare genetic variation: a first look
The scale of the ExAC sequencing dataset has provided some of our first glimpses into very rare genetic variation across populations, with several important early insights. First, the authors identify more than 7.4 million high-confidence genetic variants, on average one every 8 bases, the majority of them entirely novel (not present in any existing database) and extremely rare (more than half of the variants are seen only once across all 60,706 samples). Second, they are able to document recurrent rare mutations emerging independently and to estimate the frequency of recurrence, which had never been observed systematically before because it requires such large sample sizes. Third, they are able to examine the level of selective constraint against protein-truncating variation, identifying 3,230 genes that appear highly loss-of-function-intolerant. Reassuringly, this set includes most known human haploinsufficient disease genes; however, 72% of these genes do not yet have an established human disease phenotype. While some of these genes may be associated with weaker phenotypes or embryonic lethality, this points to how much more we have yet to understand about the phenotypic consequences of loss of function in human genes.
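To give a sense of scale, the figures quoted above can be combined in a short back-of-the-envelope calculation. This is only a rough sketch: the inputs are the rounded numbers from the text, and the singleton frequency assumes a diploid autosomal site genotyped in every sample, which is an assumption made here for illustration rather than a statement from the paper.

```python
# Rounded figures quoted in the text above.
SAMPLES = 60_706
VARIANTS = 7_400_000              # "more than 7.4 million", so a lower bound
BASES_PER_VARIANT = 8             # "on average one every 8 bases"
LOF_INTOLERANT_GENES = 3_230
FRACTION_WITHOUT_PHENOTYPE = 0.72

# Approximate number of bases over which variants were called,
# implied by the variant count and density.
implied_bases = VARIANTS * BASES_PER_VARIANT

# Loss-of-function-intolerant genes with no established disease phenotype.
genes_without_phenotype = round(LOF_INTOLERANT_GENES * FRACTION_WITHOUT_PHENOTYPE)

# Assumption: diploid autosomal site, all samples genotyped.
chromosomes = 2 * SAMPLES
singleton_af = 1 / chromosomes    # frequency of a variant seen exactly once

print(f"Implied bases covered: ~{implied_bases:,}")
print(f"Intolerant genes without a phenotype: ~{genes_without_phenotype:,}")
print(f"Chromosomes sampled: {chromosomes:,}")
print(f"Singleton allele frequency: ~{singleton_af:.1e}")
```

Under these assumptions, a variant seen only once corresponds to an allele frequency below one in 100,000, which illustrates why a panel of this size is needed to observe such variation at all.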
Copy number variation in ExAC
In a companion paper in Nature Genetics, Douglas Ruderfer, Shaun Purcell and colleagues examine rare copy number variation (CNV) in the ExAC dataset, specifically the rates and properties of genic CNVs with <0.5% frequency. They use their previously published method, XHMM, to call CNVs from this exome sequencing dataset. They find that ~70% of individuals carry at least one rare genic CNV, with an average of 0.81 deleted and 1.75 duplicated genes. The authors also estimate relative intolerance to CNVs for each gene. This CNV dataset is incorporated into ExAC and, together with other measures of genic intolerance, will be useful for continuing population and disease association studies; the authors provide an example of this in their analysis of a schizophrenia case-control study.
Clinical genetics: classifying pathogenic variation
The current work also carries an important message for clinical genetics: the literature classifying pathogenic variation for rare disorders needs to be reexamined. The average ExAC participant harbors ~54 variants that have previously been classified as causal for a disease; given how the study participants were ascertained, most of these are likely to be misclassified variants.
Using ExAC as a reference panel for classifying disease-relevant variation, Lek et al. review the evidence for pathogenicity of 192 previously reported pathogenic variants for rare Mendelian disorders. Only 9 of these variants had sufficient support for disease association, and a high proportion of the reviewed variants were present at an implausibly high frequency in the ExAC dataset. This suggests that many of these were false-positive associations incorrectly classified as pathogenic. The implications are not merely academic, as these findings are often used in clinical diagnosis and treatment.
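The reasoning behind this kind of frequency check can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Lek et al.: the variant names, allele counts and the `max_credible_af` threshold are all hypothetical, and in practice the threshold would depend on the prevalence, penetrance and genetic architecture of the disorder in question.

```python
from dataclasses import dataclass


@dataclass
class ReportedVariant:
    name: str           # hypothetical label, not a real variant
    allele_count: int   # times the allele is observed in the reference panel
    allele_number: int  # chromosomes genotyped at this site


def allele_frequency(variant: ReportedVariant) -> float:
    """Return the panel allele frequency, or 0.0 if the site was not genotyped."""
    if variant.allele_number == 0:
        return 0.0
    return variant.allele_count / variant.allele_number


def flag_implausible(variants, max_credible_af):
    """Names of variants too common in the panel to plausibly cause a rare disorder."""
    return [v.name for v in variants if allele_frequency(v) > max_credible_af]


# Illustrative values only; these are not taken from ExAC.
panel_chromosomes = 2 * 60_706
variants = [
    ReportedVariant("variant_A", allele_count=1, allele_number=panel_chromosomes),
    ReportedVariant("variant_B", allele_count=850, allele_number=panel_chromosomes),
]

# Hypothetical threshold for a rare, highly penetrant disorder.
flagged = flag_implausible(variants, max_credible_af=1e-4)
print(flagged)
```

With these made-up inputs only `variant_B` is flagged, since it is observed far more often in the panel than a highly penetrant cause of a rare disorder could plausibly be. A flag of this kind is a prompt to re-review the evidence, not proof that a variant is benign.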
In two additional companion publications, the authors take this a step further and demonstrate what is needed to resolve the nature of these prior associations, by combining large case series with ExAC. Walsh et al. (Genetics in Medicine, 10.1038/GIM.2016.90, published online August 17, 2016) systematically reexamine the evidence for genes implicated in cardiomyopathy, one of the most common and severe rare disorders, and find that many well-known purported cardiomyopathy genes do not show support for pathogenicity, including some that are included in routine clinical genetic testing. Similarly, Minikel et al. collect 16,025 prion disease cases, the largest case series ever available for prion disease, for which ~10-15% of cases are estimated to be caused by mutations in the PRNP gene. They find that a number of variants in PRNP thought to be pathogenic and highly penetrant appear likely to be benign (Minikel et al. Science Translational Medicine 10.1126/scitranslmed.aad5169). This led to a corrected patient diagnosis soon after this report, as Robert Green explained in his Perspective accompanying this publication (Lebo et al. Science Translational Medicine 10.1126/scitranslmed.aad9460).
These findings highlight the need to carefully evaluate the literature for rare genetic disorders. They also reinforce the value of large reference panels such as ExAC for filtering variants seen in patient exomes, a practice most of the genomics community has adopted in establishing standards for assessing sequence variants in human disease (MacArthur et al. Nature 508, 469–476 (2014), 10.1038/nature13127). The ExAC project continues to expand, and its investigators hope to reach more than 120,000 exome sequences over the next year, as well as 20,000 whole genome sequences, bringing additional sample size, diversity and exploration of non-coding regions that will aid these efforts.
ClinVar and contributing to variant interpretation databases
This project, which relied on the willingness of many large research consortia to provide their raw data, demonstrates the extreme value of promoting the sharing, aggregation and harmonization of genomic data. This is true also for patient genetic variants, as there is a need for databases that provide greater confidence in variant interpretation. NCBI’s ClinVar database, which accepts contributions of clinically annotated genetic variation from clinical labs, clinicians and researchers, has become a key resource for clinical variant interpretation.
Improvements to the landscape of clinical genetics will require continued investment in such variant databases, continued expansion of human genetic reference panels, as well as efforts to link these to phenotype data. Recontacting to obtain phenotype data will be trialed on a subset of the ExAC dataset where consents allow, while new initiatives such as the UK 100,000 Genomes Project and the US Precision Medicine Initiative will also provide linked genome and phenotype information. Finally, enabling the ethical sharing of linked genetic and clinical data without violating participant privacy will require fundamental innovation in regulation and ethics policy, work that has been started by bodies such as the NIH and the Global Alliance for Genomics and Health, but around which considerable uncertainty remains.