Sidestepping spurious associations


Layers of structure {credit}Robert Delaunay, 1912-1913, Premier Disque{/credit}

Genome-wide association tests have been hugely successful at finding genes and even specific mutations that contribute to traits ranging from human height to schizophrenia. At its most basic, the idea is that a group of individuals with a shared phenotype should also share some genetic variants in common that are causally related to the trait in question. Unfortunately, there are other reasons that individuals who share a trait, such as cardiovascular disease or epilepsy, might share genetic variants in common. For example, a gene might seem to be associated with epilepsy within a given population, but it may be that a subgroup of the affected individuals shares a common ancestry that they aren’t aware of, and the associated gene may simply reflect that fact.

Researchers have of course been aware of this problem for a long time and genome-wide association studies (GWAS) are now designed to account for hidden population structure. However, these methods are not perfect. Finding ways to improve GWAS methods is an active area of research.

A study published online in Nature Genetics this week reports a sophisticated new method for performing GWAS while automatically accounting for hidden population structure. The study by Minsun Song, Wei Hao and John Storey demonstrates the power of their method, the genotype-conditional association test, first on simulated data and then shows how it can be applied to large genotype datasets for both quantitative and binary traits. We asked the senior author, John Storey, to tell us a little bit more about the study.

Questions with John Storey: 

Many statistical methods exist for accounting for population stratification in genetic association tests. What makes your genotype-conditional association test different?

The genotype-conditional association test (GCAT) is different operationally because it fits a statistical model where the variation in genotypes is explained in terms of the trait variation and adjustments for population structure.  This means that the genotype and trait variables are swapped when performing the statistical regression, and a different type of regression (logistic regression) is used.

GCAT is also different because we have provided a theoretical proof that the test controls for general forms of population structure.  To our knowledge, before this paper, there has been no theoretically proven way to account for general forms of structure in population-based studies without relying on approximations.  An important, distinguishing feature of our method is that the key assumption one needs to verify on real data is about the model used to capture population structure observed in the genotypes.  This can typically be verified in practice, and there has been a lot of great work on this topic over the years, so there are plenty of existing resources available to properly model structure itself.  GCAT does not require extensive assumptions about the trait model to be verified, including non-genetic effects, which is often impossible to do in practice.  Finally, GCAT can computationally scale to very large sample sizes, on the order of a million individuals.  Methods that require estimating a kinship matrix cannot currently scale to very large sample sizes.
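To make the "swapped" regression in this answer concrete, here is a minimal Python sketch on simulated data. It is not the authors' software: the function names are hypothetical, the population structure is represented by a known subpopulation label (an assumption made for simplicity, where a real analysis would estimate structure from the genotypes), and the test shown is a plain likelihood-ratio test on a binomial logistic model of the genotype.

```python
import math
import numpy as np


def fit_binomial_logistic(X, y, n_trials=2, n_iter=50, tol=1e-10):
    """Fit y ~ Binomial(n_trials, sigmoid(X @ beta)) by Newton-Raphson.

    Returns the coefficients and the maximised log-likelihood, up to a
    constant that cancels in a likelihood-ratio test.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - n_trials * p)
        weights = n_trials * p * (1.0 - p)
        hess = X.T @ (X * weights[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    loglik = np.sum(y * eta - n_trials * np.logaddexp(0.0, eta))
    return beta, loglik


def reversed_regression_test(genotype, trait, structure):
    """Likelihood-ratio test with the genotype as the response.

    Null model: genotype explained by the structure covariates only.
    Alternative: genotype explained by structure covariates plus the trait.
    `structure` is an (n, k) array; pass k = 0 columns for no adjustment.
    """
    n = len(genotype)
    X_null = np.column_stack([np.ones(n), structure])
    X_alt = np.column_stack([X_null, trait])
    _, ll_null = fit_binomial_logistic(X_null, genotype)
    _, ll_alt = fit_binomial_logistic(X_alt, genotype)
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    # Survival function of a chi-square with 1 degree of freedom.
    pvalue = math.erfc(math.sqrt(stat / 2.0))
    return stat, pvalue


# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 2000
pop = rng.integers(0, 2, size=n)               # hidden subpopulation label
freq = np.where(pop == 1, 0.6, 0.1)            # allele frequency differs by group
null_snp = rng.binomial(2, freq)               # no effect on the trait
causal_snp = rng.binomial(2, freq)             # shifts the trait by 1 per allele
trait = 2.0 * pop + 1.0 * causal_snp + rng.normal(size=n)

structure = pop.reshape(-1, 1).astype(float)   # assumed known here
no_structure = np.empty((n, 0))

naive_stat, naive_p = reversed_regression_test(null_snp, trait, no_structure)
adjusted_stat, adjusted_p = reversed_regression_test(null_snp, trait, structure)
causal_stat, causal_p = reversed_regression_test(causal_snp, trait, structure)

print("null SNP, no adjustment:   stat = %.1f" % naive_stat)
print("null SNP, with adjustment: stat = %.2f" % adjusted_stat)
print("causal SNP, with adjustment: stat = %.1f" % causal_stat)
```

In this toy setting the unadjusted test picks up the null SNP only because allele frequency and trait both differ between the two groups. Adding the structure term to the genotype model removes that signal, while the causal SNP remains strongly associated.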

How did the initial idea for this method come about?

Figure 1 in the paper essentially captures the initial idea.  We wanted to develop a method that (1) allows for very general trait models, including genetic and non-genetic effects that are highly confounded with structure, and (2) involves estimating parameters and models that require few assumptions and can be verified in practice.  As the project developed, we really grew to appreciate the linear-mixed effects model approach and we viewed our research as a useful way to look at the problem from another perspective.


Rationale for the proposed test of association.{credit}Figure 1, Song et al. (2015) Nature Genetics, published online 30 March 2015; doi:10.1038/ng.3244{/credit}

Who do you think will most benefit from this new method and why?

Since our method allows for more general assumptions about the trait model, a researcher who is uncomfortable with the assumptions that current methods make about the trait model will benefit from the new method.  A researcher who has a large sample size will also have an easier and shorter time performing the method (which has software available on GitHub at https://github.com/StoreyLab/gcat).  The theory that supports our method also applies to other distributions on traits, such as the Poisson, Negative Binomial, or Exponential distributions, so our method is capable of considering more exotic traits such as RNA-seq profiles.  Finally, I think the theoretical work in the paper will be helpful to anyone wanting to be exposed to a different understanding of the problem.

What types of association tests would this method not be appropriate for?

If a study involves closely related individuals, then the method is not appropriate for it.  However, the user should easily discover this when verifying whether the model of population structure fails to properly explain the genotype data.  We would have to do further work to see if GCAT can be extended to the case of related individuals.

What problem(s) still needs to be solved in genome-wide association testing?

I will just comment on statistical methodology problems.  It is still early days on figuring out the best way to analyze multiple traits simultaneously or how to best analyze very large sample-size studies (typically as meta-analyses).  We are interested in GWAS that involve many simultaneously measured molecular traits that may involve lots of challenges such as population structure among the individuals and batch effects in the molecular profiles (e.g., the GTEx study).  GCAT was a step in this direction for us.  I also think that kinship matrix estimation needs some additional work (especially for large sample sizes) and I personally am not yet satisfied with how we deal with polygenic models in GWAS.  Finally, I think that coming up with ways to utilize more functional genomics and pathway information in a GWAS is a great direction.

You recently became the director of the Center for Statistics and Machine Learning at Princeton. How has this changed your interaction with other faculty involved in the center? Have there been any unexpected or surprising results of joining up these two disciplines at Princeton?

There has been broad and extremely enthusiastic support at Princeton University for building the Center for Statistics and Machine Learning. It seems that every major discipline has significant research activity that is data-driven, even in the humanities (e.g., our Center for Digital Humanities). It has been a pleasure to learn about the wide range of “big data” research happening on campus and to be able to think about how we can build the Center to enhance all of this activity. There has been a core of faculty members at Princeton for years who primarily work in statistics and/or machine learning, so we are all thrilled to have an established intellectual home now.

 

Pinpointing genes underlying developmental delay

A paper published online this week at Nature Genetics uses an innovative method to find new genes that contribute to neurocognitive disorders, such as autism.

The paper reports 10 new candidate genes for developmental delay or autism. The results also led to the discovery of two new subtypes of developmental delay, caused by loss of the genes SETBP1 and ZMYND11, respectively. You can find the paper reporting this study here.


One example of a CNV. In this case, the region is duplicated. {credit}Wikipedia{/credit}

The authors of the study narrowed in on the 10 candidate genes by first building a map of all the regions in the genome with different copy numbers between the developmentally delayed and normal children. These differences, known as copy number variants (CNVs), can each include many different genes. By then integrating this map with single base-pair changes (SNVs) between the two groups, the researchers were able to narrow in on the genes most likely to contribute to cognitive disorders.

I asked one of the senior authors of the paper, Evan Eichler, to tell us a little more about the background of the study and why it is important:

Q: The study includes authors from many institutions–how did you all come together to work on this project?

A: The multi-center collaboration is one that developed over the last ten years when we began our work on CNVs and genomic hotspots flanked by segmental duplications. Some connections go further back, for example, I have known Lisa Shaffer from the days when I was a graduate student and she was in charge of the molecular cytogenetics laboratory at Baylor College of Medicine.

Q: Why did you decide to focus on CNVs rather than other types of variants? Was this the plan from the start?

A: The paper actually goes after both CNVs and SNVs. We used the very large number of cases and controls to identify regions that reached nominal significance for burden (i.e., an excess of deletions and duplications in patients when compared to controls). We then selected genes for resequencing (using MIPs [molecular inversion probes]) and showed an excess of loss-of-function mutations and similarity in clinical phenotypes between the SNV and CNV patients. It was the plan from the start.

Q: What would you say is the major new breakthrough in this study?

A: A systematic approach to go from large CNVs to pinpointing the underlying gene responsible for specific forms of developmental delay and ID. The paper bridges between those two types of variants and shows the power of combining these different datasets to make discoveries.

Q: How do you envision clinicians using the results? Are there any caveats that they need to consider?

A: Hopefully, the CNV morbidity map will provide clinicians and families some guidance in terms of interpreting previous variants of unknown significance. The discovery of specific genes and intersection of exomes and CNVs should also help with interpretation of clinical exomes that are now being generated. I anticipate that more than 1/2 of the genes listed in Table 2, for example, are relevant to pediatric DD as well as other diseases. The caveat is that more data and clinical assessment are required. Despite 30,000 cases and 20,000 controls many regions are still underpowered to move them to a category of benign or pathogenic. Large clinical labs should exchange their CNV data more freely.

Q: Do you think the approach used in this study (coupling exomes and CNVs) will be useful for other neuropsychiatric (or other) disorders?

A: Yes. Many complex neuropsychiatric disorders may in fact manifest as mild DD or other learning disorders early in childhood. Case-in-point is ZMYND11. We show that it is most likely the gene responsible for the 10p15.3 microdeletion syndrome but also find that 3/4 males with truncating mutations also have neuropsychiatric diagnoses as adults. A sporadic truncating mutation of ZMYND11 was also identified in a recent trio exome sequencing study of schizophrenia families. It still surprises me that the neuropsychiatric and pediatric developmental delay fields don’t compare notes more often.

Preliminary look at GWAS articles including dbGaP accessions

{credit}NCBI {/credit}

In this month’s Editorial (doi:10.1038/ng.3088) we mention 66 articles published in this journal between 2008 and 2013 that cite dbGaP accession codes. We also took a preliminary look at the citations of 13 pairs of GWAS articles, with and without a dbGaP accession, published on the same trait on the same day in the same journal (where more than two articles appeared simultaneously, non-overlapping pairs were assigned by sequential DOI number). Here are the references and some of the citation information for readers who want to investigate this area further.

Simultaneously published articles with citation data:

All Nature Genetics articles with dbGaP accession:

DOI Scopus citations up to 8/1/14
10.1038/ng.249 212
10.1038/ng.364 184
10.1038/ng.362 174
10.1038/ng.416 128
10.1038/ng.311 468
10.1038/ng.269 396
10.1038/ng.291 583
10.1038/ng.290 305
10.1038/ng.386 111
10.1038/ng.384 464
10.1038/ng.381 534
10.1038/ng.377 169
10.1038/ng.456 86
10.1038/ng.466 137
10.1038/ng.474 270
10.1038/ng.432 141
10.1038/ng.716 87
10.1038/ng.714 131
10.1038/ng.520 628
10.1038/ng.523 86
10.1038/ng.521 211
10.1038/ng.517 134
10.1038/ng.501 191
10.1038/ng.493 75
10.1038/ng.602 46
10.1038/ng.604 68
10.1038/ng.537 148
10.1038/ng.568 198
10.1038/ng.567 80
10.1038/ng.571 281
10.1038/ng.573 223
10.1038/ng.686 761
10.1038/ng.666 91
10.1038/ng.642 197
10.1038/ng.1017 52
10.1038/ng.1013 56
10.1038/ng01113 13
10.1038/ng.859 85
10.1038/ng.803 374
10.1038/ng.801 387
10.1038/ng.970 69
10.1038/ng.922 75
10.1038/ng.934 31
10.1038/ng.941 77
10.1038/ng.223 43
10.1038/ng.2250 103
10.1038/ng.2466 18
10.1038/ng.1108 124
10.1038/ng.1051 45
10.1038/ng.2354 95
10.1038/ng.2344 35
10.1038/ng.2213 60
10.1038/ng.2274 71
10.1038/ng.2285 22
10.1038/ng.2272 40
10.1038/ng.2368 23
10.1038/ng.2360 30
10.1038/ng.2385 63
10.1038/ng.2564 42
10.1038/ng.2505 23
10.1038/ng.2529 51
10.1038/ng.2554 38
10.1038/ng.2792 17
10.1038/ng.2794 6
10.1038/ng.2764 42
10.1038/ng.2702 36
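
For readers who want to work with the list above, the following Python sketch parses rows in the same "DOI count" layout and computes a couple of simple summaries. Only the first five rows are included here, copied from the list; the function name `parse_citations` is a hypothetical helper, not part of any existing tool.

```python
import statistics

# First five rows copied from the list above; paste the remaining rows
# into this string to analyse the full set.
raw = """\
10.1038/ng.249 212
10.1038/ng.364 184
10.1038/ng.362 174
10.1038/ng.416 128
10.1038/ng.311 468
"""


def parse_citations(text):
    """Return {doi: citation_count} from lines of the form 'DOI count'.

    Lines that do not end in an integer (such as the header row) are skipped.
    """
    records = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(maxsplit=1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        doi, count = parts
        records[doi] = int(count)
    return records


citations = parse_citations(raw)
median_citations = statistics.median(citations.values())
most_cited = max(citations, key=citations.get)

print("articles parsed:", len(citations))
print("median citations:", median_citations)
print("most cited:", most_cited)
```

The median is used rather than the mean because citation counts are typically skewed by a few highly cited articles.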

Love for Nature Genetics

In a previous blog post, I asked “what makes a Nature Genetics paper?” I have been slow to follow up on the post with my own answers to that question, but in the meantime I would like to share this email that brightened my day (edited for clarity): 

“I work in the field of genome-wide association studies (GWAS) in complex diseases and I note that in recent months, the criteria and standards of Nature Genetics (NG) for accepting GWAS papers are getting higher and higher.

 Looking back at this editorial from 3 years back:  https://www.nature.com/ng/journal/v43/n7/full/ng.881.html


Cover art from Vol 43, Issue 7 {credit}John Arabolos{/credit}

It appears that NG remains truly interested in strong, novel biological insights arising from the genetic work, and this is really wonderful. Consistent with this, NG published a beautiful and conclusive GWAS on visceral leishmaniasis in 2013, even though very small studies in the past have hinted at the same gene, but without any power to be at all definitive. I could go on and on, as there are many such great examples.

Despite a ton of [rejections without review], my collective experience with the journal has been very good due to the consistency of the editorial decisions handed down, and the very helpful tone of the editors. It may seem subtle and not all that obvious, but I note that ‘secondary, strongly genome-wide significant, ethnic specific signals’ within a broadly known locus is usually not of sufficient novelty for NG, and it is really consistent throughout. The journal is to be saluted for the consistent increase in standards throughout the years.”

We of course try to be as consistent as possible in our editorial decisions and to constantly raise the bar…though this doesn’t always make everyone happy, for sure. We’d love to hear from more of you (whether positive or negative feedback…though please keep it civil!). You can email me or directly leave a note in the comments if you prefer.