Farm to Genomes: African Rice

Meyer et al., Nature Genetics, 2016

Rice is one of the most important crops on the planet, responsible for feeding billions of people. Given this global significance, studying rice in different geographies can be useful and aid in harnessing genetic diversity underlying particular traits and adaptations favorable to different environments. African rice (Oryza glaberrima Steud.) is mainly grown in sub-Saharan Africa and known for its stress tolerance. In a new article this week in Nature Genetics, Michael Purugganan and colleagues report the whole genome re-sequencing of 93 African rice landraces from various regions of Western coastal and sub-Saharan Africa. They create a genome-wide SNP map and through comparative genomic analysis study the domestication and population history of African rice. They use their map to perform GWAS for salt tolerance and find 11 significantly associated regions, highlighting the value of this unique genetic resource.

Meyer et al., Nature Genetics, 2016

By studying various regions with distinct environments, the authors were able to get clues about adaptation and geographic spread of the populations. They focused on coastal Senegal and inland Togo, which have higher and lower levels of soil salinity, respectively, and interviewed farmers in the region to understand the agricultural practices they employ in each region. The knowledge of the farmers helped to inform the genetic analysis and contributed to the model of African rice domestication and dispersal.

You can watch some of the interviews with the farmers here:

African rice farmers: interviews

Additionally, we spoke with authors Michael Purugganan and Rachel Meyer to get some background on this research.

Why do you think that rice is understudied in Africa compared to other places?

MP: I think it’s because it is not widely grown, unlike its Asian counterpart which has pretty much taken over the world.  But there definitely is more interest in African rice as breeders are trying to figure out how to increase food production in Africa, as well as to try to see what genes in African rice can be used to improve Asian rice.

RM: There is a lot of great research on improving Asian rice for African farmers that is being done by brilliant AfricaRice scientists, and they are working hard on the social science side too. But there are so many challenges that Africa disproportionately faces – particularly climate variation – that demand ramping up rice research. There is insufficient support for programs that integrate crop experiments and trials into the different farmlands. A better connection between scientists and small-scale farmers would really help farmers adopt new varieties too, because there is sometimes resistance to trying new ones.

How did you choose which samples to include in your analysis?

RM: Recognizing that a lot of NGO work encouraging farmers to grow Asian rice ramped up in the 80’s and 90’s, we took advantage of the germplasm largely donated in the 70’s to the West Africa Rice Development Association, which were duplicated and available through IRRI (International Rice Research Institute). We chose accessions with the most metadata available, preferring ones with georeferenced location and a cultivar name. It wasn’t until later that we realized water tables far inland were high in salinity, so we just tried to make sure we had a fair number of samples within 250km of the coast, or along rivers connecting to the ocean.

Were you surprised by any of your findings?

MP: There definitely were a few surprises in the data, but the big revelation for me was the long time for the population bottleneck that led to domestication.  We found from the genomic data that it may have taken more than 10,000 years of steady population decline before full-blown domesticated African rice shows up in the archaeological record.  This suggests the possibility that humans were already cultivating or managing its ancestor for thousands of years, and I think if this pattern holds for other domesticated crop species it will change our thinking on how domestication has taken place.

RM: I was surprised we got nice GWAS results with so few samples, and even more surprised that we saw several of those exhibiting signatures of geographic selection. We were lucky to find a broad distribution of traits in the landraces we chose to sequence, for we had made the DNA libraries ahead of the phenotyping experiments.

What was it like to meet and talk with the farmers?

RM: It was one of the highlights of my life to meet the farmers! I’m grateful to have gotten a glimpse of their heritage, their pride, and their struggles. We were all so impressed with the generosity of women, in particular, to help each other. We were also shocked by how many farms are run by the elderly; their children don’t see farming as profitable and many have left. For the three of us in the field, it made us think hard about how we can give back to the communities that gave us their time. I hope that crop science, publicity (like this blog) and policy changes can raise the profile of the small-scale farmer.

In each interview, the farmers also had a chance to interview us, and that part was especially interesting. Several asked really good questions about African and Asian rice domestication. You could see the cultural value of the basic science.

You chose to focus on salinity tolerance as a trait particularly relevant to farming in Africa.  In what ways do you see your results being used for crop improvement?

RM: One of the authors, from AfricaRice, Dr. Kofi Bimpong, had actually been working on salt tolerance separately as well, and has two graduate students collecting African rice landraces in Casamance. If from this paper we can consider that domestication possibly occurred in the Inner Niger Delta region and also in the West, then these collecting efforts are all the more important because they are from a center of origin, promising more genetic variation than people would have ever estimated. If you look through the available germplasm there is so little that has been collected or studied from Casamance. It’s tricky collecting there, for there is social unrest, and landmines. Hats off to the young graduate students, Mamadou Sock and Bathe Diop, doing that fieldwork; I’m sure there is a lot of discovery to be made with those collections, and more promising salt tolerant landraces to integrate into breeding programs.

In addition, our results suggesting many of the salt tolerance genes are shared in both rice species make them more valuable to explore in other crops.  Shared adaptive mechanisms are especially fascinating to evolutionary biologists and are powerful assets of the breeder’s toolbox.

Sidestepping spurious associations

Layers of structure {credit}Robert Delaunay, 1912-1913, Premier Disque{/credit}

Genome-wide association tests have been hugely successful at finding genes and even specific mutations that contribute to traits ranging from human height to schizophrenia. At its most basic, the idea is that a group of individuals with a shared phenotype should also share some genetic variants in common that are causally related to the trait in question. Unfortunately, there are other reasons that individuals who share a trait, such as cardiovascular disease or epilepsy, might share genetic variants in common. For example, a gene might seem to be associated with epilepsy within a given population, but it may be that a subgroup of the affected individuals shares a common ancestry that they aren’t aware of, and the associated gene may simply reflect that fact.
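To make the confounding concrete, here is a minimal Python sketch with made-up numbers. The two hidden ancestry groups, their allele frequencies and their disease rates are assumptions chosen purely for illustration, not data from any study. The variant has no causal effect: within each group, cases and controls carry it at exactly the same frequency.

```python
def pooled_allele_freqs(subpops):
    """Expected allele frequency among cases and controls after pooling.

    Each subpopulation is (share_of_sample, allele_freq, disease_prevalence).
    The variant is non-causal: inside a subpopulation, cases and controls
    carry it at the same frequency.
    """
    case_weight = sum(share * prev for share, _, prev in subpops)
    ctrl_weight = sum(share * (1 - prev) for share, _, prev in subpops)
    case_freq = sum(share * prev * freq for share, freq, prev in subpops) / case_weight
    ctrl_freq = sum(share * (1 - prev) * freq for share, freq, prev in subpops) / ctrl_weight
    return case_freq, ctrl_freq


# Hypothetical numbers, for illustration only.
subpops = [
    (0.5, 0.8, 0.3),  # hidden group A: variant common, disease more frequent
    (0.5, 0.2, 0.1),  # hidden group B: variant rare, disease less frequent
]
case_freq, ctrl_freq = pooled_allele_freqs(subpops)
print(f"pooled cases: {case_freq:.4f}, pooled controls: {ctrl_freq:.4f}")
```

With these inputs the pooled cases carry the variant at a frequency of 0.65 and the pooled controls at about 0.46, because cases are drawn disproportionately from group A. An analysis that ignores the hidden grouping would read that gap as an association.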

Researchers have of course been aware of this problem for a long time and genome-wide association studies (GWAS) are now designed to account for hidden population structure. However, these methods are not perfect. Finding ways to improve GWAS methods is an active area of research.

A study published online in Nature Genetics this week reports a sophisticated new method for performing GWAS while automatically accounting for hidden population structure. The study by Minsun Song, Wei Hao and John Storey demonstrates the power of their method, the genotype-conditional association test, first on simulated data and then shows how it can be applied to large genotype datasets for both quantitative and binary traits. We asked the senior author, John Storey, to tell us a little bit more about the study.

Questions with John Storey: 

Many statistical methods exist for accounting for population stratification in genetic association tests. What makes your genotype-conditional association test different?

The genotype-conditional association test (GCAT) is different operationally because it fits a statistical model where the variation in genotypes is explained in terms of the trait variation and adjustments for population structure.  This means that the genotype and trait variables are swapped when performing the statistical regression, and a different type of regression (logistic regression) is used.

GCAT is also different because we have provided a theoretical proof that the test controls for general forms of population structure.  To our knowledge, before this paper, there has been no theoretically proven way to account for general forms of structure in population-based studies without relying on approximations.  An important, distinguishing feature of our method is that the key assumption one needs to verify on real data is about the model used to capture population structure observed in the genotypes.  This can typically be verified in practice, and there has been a lot of great work on this topic over the years, so there are plenty of existing resources available to properly model structure itself.  GCAT does not require extensive assumptions about the trait model to be verified, including non-genetic effects, which is often impossible to do in practice.  Finally, GCAT can computationally scale to very large sample sizes, on the order of a million individuals.  Methods that require estimating a kinship matrix cannot currently scale to very large sample sizes.
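The swap of genotype and trait described in this answer can be sketched in a few lines of Python. This is only an illustration of the general idea, not the GCAT software: it assumes each genotype (0, 1 or 2 copies of an allele) is treated as two draws with success probability p, that logit(p) depends on the trait and on one structure covariate supplied by hand, and that the trait term is assessed with a likelihood-ratio test. The function names `solve`, `fit_binomial2` and `genotype_conditional_lrt`, and the toy data, are hypothetical.

```python
import math


def solve(a, b):
    """Solve a small linear system a @ x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_binomial2(genotypes, design, iters=50):
    """Fit genotype ~ Binomial(2, p), logit(p) = design @ beta, by Newton's method.

    Returns (beta, log-likelihood up to an additive constant).
    """
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, z in zip(genotypes, design):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, z))))
            for i in range(k):
                grad[i] += (x - 2.0 * p) * z[i]
                for j in range(k):
                    info[i][j] += 2.0 * p * (1.0 - p) * z[i] * z[j]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    loglik = 0.0
    for x, z in zip(genotypes, design):
        eta = sum(b * v for b, v in zip(beta, z))
        loglik += x * eta - 2.0 * math.log1p(math.exp(eta))
    return beta, loglik


def genotype_conditional_lrt(genotypes, trait, structure):
    """Likelihood-ratio test of the trait term, with genotype as the response."""
    null_design = [[1.0, h] for h in structure]
    full_design = [[1.0, h, y] for h, y in zip(structure, trait)]
    _, ll_null = fit_binomial2(genotypes, null_design)
    _, ll_full = fit_binomial2(genotypes, full_design)
    stat = max(0.0, 2.0 * (ll_full - ll_null))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 degree of freedom
    return stat, p_value


# Toy data (hypothetical): two ancestry groups coded -1 / +1, a binary trait.
structure = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
trait = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
genotypes = [0, 0, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2]
stat, p_value = genotype_conditional_lrt(genotypes, trait, structure)
print(f"LRT statistic: {stat:.3f}, p-value: {p_value:.3f}")
```

Because the structure covariate sits in both the null and the full model, a genotype difference that is explained entirely by ancestry does not count towards the test statistic. Only the additional signal carried by the trait does.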

How did the initial idea for this method come about?

Figure 1 in the paper essentially captures the initial idea.  We wanted to develop a method that (1) allows for very general trait models, including genetic and non-genetic effects that are highly confounded with structure, and (2) involves estimating parameters and models that require few assumptions and can be verified in practice.  As the project developed, we really grew to appreciate the linear-mixed effects model approach and we viewed our research as a useful way to look at the problem from another perspective.

Rationale for the proposed test of association.{credit}Figure 1, Song et al. (2015) Nature Genetics, published online 30 March 2015; doi:10.1038/ng.3244{/credit}

Who do you think will most benefit from this new method and why?

Since our method allows for more general assumptions about the trait model, a researcher who is uncomfortable with the assumptions that current methods make about the trait model will benefit from the new method.  A researcher who has a large sample size will also have an easier and shorter time performing the method (which has software available on GitHub at https://github.com/StoreyLab/gcat).  The theory that supports our method also applies to other distributions on traits, such as the Poisson, Negative Binomial, or Exponential distributions, so our method is capable of considering more exotic traits such as RNA-seq profiles.  Finally, I think the theoretical work in the paper will be helpful to anyone wanting to be exposed to a different understanding of the problem.

What types of association tests would this method not be appropriate for?

If a study involves closely related individuals, then the method is not appropriate for it.  However, the user should easily discover this when verifying whether the model of population structure fails to properly explain the genotype data.  We would have to do further work to see if GCAT can be extended to the case of related individuals.

What problem(s) still needs to be solved in genome-wide association testing?

I will just comment on statistical methodology problems.  It is still early days on figuring out the best way to analyze multiple traits simultaneously or how to best analyze very large sample-size studies (typically as meta-analyses).  We are interested in GWAS that involve many simultaneously measured molecular traits that may involve lots of challenges such as population structure among the individuals and batch effects in the molecular profiles (e.g., the GTEx study).  GCAT was a step in this direction for us.  I also think that kinship matrix estimation needs some additional work (especially for large sample sizes) and I personally am not yet satisfied with how we deal with polygenic models in GWAS.  Finally, I think that coming up with ways to utilize more functional genomics and pathway information in a GWAS is a great direction.

You recently became the director of the Center for Statistics and Machine Learning at Princeton. How has this changed your interaction with other faculty involved in the center? Have there been any unexpected or surprising results of joining up these two disciplines at Princeton?

There has been broad and extremely enthusiastic support at Princeton University for building the Center for Statistics and Machine Learning.  It seems that every major discipline has significant research activity that is data-driven, even in the humanities (e.g., our Center for Digital Humanities).  It has been a pleasure to learn about the wide range of “big data” research happening on campus and to be able to think about how we can build the Center to enhance all of this activity.  There has been a core of faculty members at Princeton for years who primarily work in statistics and/or machine learning, so we are all thrilled to have an established intellectual home now.

On the history of pigs

{credit}Agricultural Research Service via Wikipedia{/credit}

Understanding the genomic changes that occurred during the domestication of animals and plants by humans is important on many levels. Such insights can provide information about human history and our interactions with other species, as is the case with genetic studies of dog and cat domestication. These studies can also help us to improve crop plants (such as tomato) and livestock (such as cattle) for human consumption or other use. Finally, genetic studies on domestication can help to identify disease-causing mutations that have been selected for as a by-product of selection for beneficial traits (for example, in cats and dogs).

Though humans have a huge influence on important traits in domesticated species, those species are still responding to natural selection during the domestication process, which in turn may affect traits important for agricultural purposes. Identifying genomic regions influenced by positive natural selection in domesticated animals can lead to important insights into the biology of specific breeds.

In this respect, the pig is an excellent model to study. Humans domesticated pigs approximately 10,000 years ago in the Near East and China, but a relatively open method of keeping pigs allowed for continued interbreeding with wild boars for some time. In a study published this week in Nature Genetics, Lusheng Huang, Jun Ren and colleagues from Jiangxi Agricultural University sequenced the genomes of 69 diverse domestic and wild pigs in China to better understand their evolutionary history.

Pig sampling in China{credit}Lusheng Huang{/credit}

The study included pigs from 11 diverse breeds (and 3 populations of wild boar) within China in order to compare the adaptations in breeds from cold vs. hot areas. They identified over 700 genomic regions that showed evidence of selective sweeps. Many of the genes in these regions were involved in processes important for regulation of temperature during cold or heat stress, such as hair development, energy metabolism and blood circulation.

However, one of the most striking results was the identification of a large (~14 Mb) sweep region on the X chromosome. More than 94% of the single nucleotide polymorphisms (SNPs) in the 69-pig sample that had extreme allele frequency differences between North and South populations were located within the X-linked sweep region. All Northern Chinese samples showed a strong signature of selection in this region. Upon further analysis, the authors were able to determine that the most likely scenario, given their data, was that this region was introgressed from a now-extinct species of Sus. This region of the X chromosome undergoes very little recombination. This fact, combined with the strong signal of positive selection in the region, meant the introgressed sequence remained mostly preserved for more than 8 million years.
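As a rough illustration of that kind of North-South comparison, the Python sketch below computes per-SNP allele frequencies in two groups, keeps the SNPs whose frequencies differ by at least a chosen threshold, and reports what share of them falls inside a given region. The threshold, coordinates, genotypes and function names are all hypothetical, and this is not the authors' pipeline.

```python
def allele_freq(genotypes):
    """Frequency of the counted allele from 0/1/2 genotype calls."""
    return sum(genotypes) / (2 * len(genotypes))


def extreme_snp_share(snps, region, threshold):
    """Share of highly differentiated SNPs that lie inside `region`.

    snps:   list of (chrom, pos, north_genotypes, south_genotypes)
    region: (chrom, start, end), inclusive coordinates
    """
    chrom, start, end = region
    extreme = [
        (c, pos)
        for c, pos, north, south in snps
        if abs(allele_freq(north) - allele_freq(south)) >= threshold
    ]
    if not extreme:
        return 0.0
    inside = sum(1 for c, pos in extreme if c == chrom and start <= pos <= end)
    return inside / len(extreme)


# Hypothetical toy data: three northern and three southern pigs per SNP.
snps = [
    ("X", 50, [2, 2, 2], [0, 0, 0]),
    ("X", 55, [2, 2, 1], [0, 0, 0]),
    ("1", 10, [2, 2, 2], [0, 0, 1]),
    ("X", 90, [1, 1, 0], [1, 0, 1]),
]
share = extreme_snp_share(snps, region=("X", 40, 60), threshold=0.8)
print(f"share of extreme SNPs inside the region: {share:.2f}")
```

In this toy example three SNPs pass the threshold and two of them sit inside the region, so the share is two thirds.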

We asked one of the study’s senior authors, Lusheng Huang, to tell us a little more about the work:

How did you collect the DNA samples from the pigs for your study? Were any of the samples difficult to get?

We collected DNA samples from 4,100 pigs that were unrelated by blood within three generations, representing all 68 indigenous breeds distributed across 24 provinces of China. It took us four and a half years to complete the sample collection. Some native pigs living in high-altitude regions (Yunnan, Guizhou, Sichuan and Tibet) were very hard to get. Afterwards, we constructed a DNA bank for indigenous pigs from the whole of China. As a pilot study, we first genotyped 520 unrelated pigs (no common ancestor within 3 generations) from 32 Chinese breeds for 60K SNPs on the Illumina porcine BeadChip. Then, we selected 69 representative pigs from the 520 according to their genetic relationships in the neighbor-joining tree constructed with the 60K SNP data. The 69 pigs selected for whole-genome sequencing are highly representative of populations at the geographical extremes of China.

pig sampling

{credit}Lusheng Huang{/credit}

Most of the sampled pigs were originally raised in government-sponsored conservation farms. We selected animals to cover the majority of the bloodlines of each breed according to their pedigree information. However, samples of several breeds were collected from isolated villages or farms in rural areas. For example, it was a big challenge for us to collect samples of Tibetan pigs from different geographic populations in the vast region of the Tibet Plateau. To find purebred Tibetan pigs that were not influenced by human-mediated hybridization with exotic breeds, we had to travel to remote pastoral areas at high altitudes and make an in-depth field investigation with the kind help of local residents. To cover the bloodlines of each Tibetan population as broadly as possible, we preferentially collected samples from Tibetan boars, which are usually aggressive like wild boars and were really difficult to get (see above picture).

What do the positively selected regions tell us about the history of pig domestication?

These regions clearly illustrate that pigs have experienced natural selection for local fitness before (ancient event) or after (recent event) domestication. The selection footprints in the pig genomes can be visualized by whole-genome sequencing, and are characterized by reduced heterozygosity, an excess of low-frequency variants, and extended, differentiated haplotypes. The selective sweep regions harbor functional genes that play a role in adaptation to local environments. DCF17 and VPS13A are two such examples highlighted in this study.
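Reduced heterozygosity, the first footprint mentioned here, is easy to illustrate. The Python sketch below computes the expected heterozygosity 2p(1-p) at each site from 0/1/2 genotype calls and averages it over consecutive windows, so that a run of sites with little variation stands out. The window size, the toy genotypes and the function names are assumptions for illustration; real sweep scans use many more sites and additional statistics.

```python
def site_heterozygosity(genotypes):
    """Expected heterozygosity 2p(1-p) at one biallelic site (0/1/2 calls)."""
    p = sum(genotypes) / (2 * len(genotypes))
    return 2 * p * (1 - p)


def window_heterozygosity(sites, window):
    """Mean expected heterozygosity in consecutive, non-overlapping windows."""
    values = [site_heterozygosity(g) for g in sites]
    means = []
    for i in range(0, len(values), window):
        chunk = values[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Hypothetical toy data: six sites typed in four animals.
sites = [
    [0, 1, 1, 2],
    [1, 1, 1, 1],
    [2, 2, 2, 2],  # no variation: every animal carries the same allele
    [0, 0, 0, 0],  # no variation
    [0, 0, 1, 1],
    [0, 2, 1, 1],
]
print(window_heterozygosity(sites, window=2))
```

The middle window, where every animal carries the same allele at both sites, has a mean heterozygosity of zero, while the flanking windows do not.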

What do you think was the most unexpected result in this study? Did you believe it at first?

The extremely divergent haplotype in the X-linked sweep region between Southern and Northern Chinese pigs, an indication of a possible ancient interspecies introgression event, was the most unexpected result in this study. It is a big surprise. Frankly speaking, we did not believe it at first.

The pattern of haplotype sharing in diverse populations. The haplotypes were reconstructed for each individual using all of the variants on the X chromosome. Alleles that are identical to or different from the ones in the Wuzhishan reference genome are indicated by red and blue, respectively. Adapted from Fig. 4a in Huashui Ai et al. 2014{credit}Nature Genetics{/credit}

Why is the finding of a large introgression region on the X chromosome important?

Although evidence of adaptive evolution driven by introgression from archaic species has been recently identified in some species including humans, the X-linked introgression region shows that adaptive introgression is not limited to closely related species, but in some cases, introgression with very divergent species can provide the basis for the evolution of radically new traits in a species. This radical example of so-called ‘reticulate evolution’ in mammals shakes the foundation of most modern evolutionary biology and provides a new view of adaptive evolution that emphasizes saltationist (sudden) processes driven by introgression. Moreover, as discussed in the paper, our ability to detect this, potentially quite old, introgression event is facilitated by the fact that the introgression fragment falls in a region of reduced recombination. This has allowed the introgressed haplotype to be maintained for a prolonged period. Our results may suggest that introgression generally plays a much more dominant role in adaptive evolution than previously thought, but has been difficult to detect because introgression fragments in other systems degenerate quickly due to recombination.

Do you think similar ancient introgressions have occurred in other domesticated species? If so, how would you test this?

We cannot rule out the possibility. To test this hypothesis, we would suggest using a research strategy similar to the one used in this study. First, we would need to get the genome sequences of multiple species divergent from a domesticated species. Then, we can perform a genome-wide scan in the domestic species for possible introgression regions from another divergent species. Several approaches, including the ABBA and F4 statistics, haplotype sharing and phylogenetic analysis, can be explored to identify such ancient introgressions.
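For readers unfamiliar with the ABBA statistic mentioned in this answer, here is a minimal Python sketch of the ABBA-BABA test (Patterson's D). It assumes one haploid sequence per taxon in the order (P1, P2, P3, outgroup), with the outgroup allele taken as ancestral. The toy sites and the function name `d_statistic` are made up for illustration, and a real analysis would also need a significance test such as a block jackknife.

```python
def d_statistic(sites):
    """Patterson's D from per-site alleles given as (p1, p2, p3, outgroup)."""
    abba = baba = 0
    for p1, p2, p3, out in sites:
        # Informative sites are biallelic and P3 carries the derived allele.
        if p3 == out or len({p1, p2, p3, out}) != 2:
            continue
        if p1 == out and p2 == p3:
            abba += 1  # P2 shares the derived allele with P3
        elif p2 == out and p1 == p3:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)


# Hypothetical toy alignment columns: (P1, P2, P3, outgroup).
sites = [
    ("A", "G", "G", "A"),  # ABBA
    ("A", "G", "G", "A"),  # ABBA
    ("C", "T", "T", "C"),  # ABBA
    ("T", "C", "T", "C"),  # BABA
    ("A", "A", "G", "A"),  # derived allele only in P3: uninformative
    ("G", "G", "G", "A"),  # derived allele in all three: uninformative
]
print(d_statistic(sites))
```

Without gene flow, ABBA and BABA sites are expected in roughly equal numbers and D sits near zero. An excess of ABBA sites (positive D) points to extra allele sharing between P2 and P3, and an excess of BABA sites (negative D) to sharing between P1 and P3.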

Erhualian

{credit}Lusheng Huang{/credit}

Bonus question: What is your favorite breed of domestic pig?

Erhualian, the most prolific pig breed in the world.