Genome-wide association tests have been hugely successful at finding genes and even specific mutations that contribute to traits ranging from human height to schizophrenia. At its most basic, the idea is that a group of individuals with a shared phenotype should also share some genetic variants in common that are causally related to the trait in question. Unfortunately, there are other reasons that individuals who share a trait, such as cardiovascular disease or epilepsy, might share genetic variants in common. For example, a gene might seem to be associated with epilepsy within a given population, but it may be that a subgroup of the affected individuals shares a common ancestry that they aren’t aware of, and the associated gene may simply reflect that fact.
Researchers have, of course, been aware of this problem for a long time, and genome-wide association studies (GWAS) are now designed to account for hidden population structure. However, these methods are not perfect, and finding ways to improve GWAS methods is an active area of research.
A study published online in Nature Genetics this week reports a sophisticated new method for performing GWAS while automatically accounting for hidden population structure. The study by Minsun Song, Wei Hao and John Storey demonstrates the power of their method, the genotype-conditional association test, first on simulated data and then shows how it can be applied to large genotype datasets for both quantitative and binary traits. We asked the senior author, John Storey, to tell us a little bit more about the study.
Questions with John Storey:
Many statistical methods exist for accounting for population stratification in genetic association tests. What makes your genotype-conditional association test different?
The genotype-conditional association test (GCAT) is different operationally because it fits a statistical model where the variation in genotypes is explained in terms of the trait variation and adjustments for population structure. This means that the genotype and trait variables are swapped when performing the statistical regression, and a different type of regression (logistic regression) is used.
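To make the "swapped" regression concrete, here is a minimal, hypothetical sketch of the idea (not the authors' implementation; their software is the GCAT package linked later in this interview). It regresses the genotype, treated as a binomial outcome, on the trait plus a structure adjustment, and asks via a likelihood-ratio test whether the trait explains genotype variation beyond structure. In the real method, structure is estimated from the genotypes themselves (e.g., by logistic factor analysis); here a single known covariate `z` stands in for that estimate, and all variable names and effect sizes are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def binom_logit_loglik(X, g, iters=50):
    """Newton-Raphson fit of g ~ Binomial(2, logistic(X @ beta));
    returns the maximized log-likelihood (up to an additive constant)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (g - 2.0 * pi)               # score vector
        W = 2.0 * pi * (1.0 - pi)                 # binomial weights
        H = X.T @ (X * W[:, None])                # observed information
        beta += np.linalg.solve(H, grad)
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(g * np.log(pi) + (2 - g) * np.log1p(-pi))

# Simulated data: a structure gradient z drives both allele frequency and trait.
n = 500
z = rng.uniform(-1, 1, n)                          # stand-in structure covariate
pi_true = 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * z)))   # structure-dependent allele freq
g = rng.binomial(2, pi_true)                       # genotypes coded 0/1/2
trait = 0.5 * z + 0.6 * g + rng.normal(0, 1, n)    # trait confounded with structure

# Likelihood-ratio test: does the trait explain genotypes beyond structure?
X0 = np.column_stack([np.ones(n), z])              # null: intercept + structure
X1 = np.column_stack([X0, trait])                  # alternative: + trait
lrt = 2.0 * (binom_logit_loglik(X1, g) - binom_logit_loglik(X0, g))
p_value = math.erfc(math.sqrt(max(lrt, 0.0) / 2.0))  # chi-square(1) upper tail
print(f"LRT = {lrt:.2f}, p = {p_value:.3g}")
```

Note the direction of the regression: the genotype is the response and the trait is a predictor, which is the reverse of the usual trait-on-genotype GWAS model described above.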
GCAT is also different because we have provided a theoretical proof that the test controls for general forms of population structure. To our knowledge, before this paper, there has been no theoretically proven way to account for general forms of structure in population-based studies without relying on approximations. An important, distinguishing feature of our method is that the key assumption one needs to verify on real data is about the model used to capture population structure observed in the genotypes. This can typically be verified in practice, and there has been a lot of great work on this topic over the years, so there are plenty of existing resources available to properly model structure itself. GCAT does not require extensive assumptions about the trait model to be verified, including non-genetic effects, which is often impossible to do in practice. Finally, GCAT can computationally scale to very large sample sizes, on the order of a million individuals. Methods that require estimating a kinship matrix cannot currently scale to very large sample sizes.
How did the initial idea for this method come about?
Figure 1 in the paper essentially captures the initial idea. We wanted to develop a method that (1) allows for very general trait models, including genetic and non-genetic effects that are highly confounded with structure, and (2) involves estimating parameters and models that require few assumptions and can be verified in practice. As the project developed, we really grew to appreciate the linear mixed-effects model approach, and we viewed our research as a useful way to look at the problem from another perspective.
Who do you think will most benefit from this new method and why?
Since our method allows for more general assumptions about the trait model, a researcher who is uncomfortable with the assumptions that current methods make about the trait model will benefit from the new method. A researcher who has a large sample size will also have an easier and shorter time performing the method (which has software available on GitHub at https://github.com/StoreyLab/gcat). The theory that supports our method also applies to other distributions on traits, such as the Poisson, Negative Binomial, or Exponential distributions, so our method is capable of considering more exotic traits such as RNA-seq profiles. Finally, I think the theoretical work in the paper will be helpful to anyone wanting to be exposed to a different understanding of the problem.
What types of association tests would this method not be appropriate for?
If a study involves closely related individuals, then the method is not appropriate for it. However, the user should easily discover this when verifying whether the model of population structure properly explains the genotype data. We would have to do further work to see if GCAT can be extended to the case of related individuals.
What problems still need to be solved in genome-wide association testing?
I will just comment on statistical methodology problems. It is still early days on figuring out the best way to analyze multiple traits simultaneously or how to best analyze very large sample-size studies (typically as meta-analyses). We are interested in GWAS that involve many simultaneously measured molecular traits that may involve lots of challenges such as population structure among the individuals and batch effects in the molecular profiles (e.g., the GTEx study). GCAT was a step in this direction for us. I also think that kinship matrix estimation needs some additional work (especially for large sample sizes) and I personally am not yet satisfied with how we deal with polygenic models in GWAS. Finally, I think that coming up with ways to utilize more functional genomics and pathway information in a GWAS is a great direction.
You recently became the director of the Center for Statistics and Machine Learning at Princeton. How has this changed your interaction with other faculty involved in the center? Have there been any unexpected or surprising results of joining up these two disciplines at Princeton?
There has been broad and extremely enthusiastic support at Princeton University for building the Center for Statistics and Machine Learning. It seems that every major discipline has significant research activity that is data-driven, even in the humanities (e.g., our Center for Digital Humanities). It has been a pleasure to learn about the wide range of “big data” research happening on campus and to be able to think about how we can build the Center to enhance all of this activity. There has been a core of faculty members at Princeton for years who primarily work in statistics and/or machine learning, so we are all thrilled to have an established intellectual home now.