Category Archives: Genetics & Genomics
Method of the Year 2016
As is our tradition, each year we choose a method, or in this case a set of methods, that has experienced rapid growth in recent years. This year's choice, epitranscriptome analysis, does not comprise a single technique but is based on advances in detecting, enriching and profiling base modifications on all RNA species.
Some of these modifications are abundant and have known functions; others are rare, and their role is still obscure. We believe recent methodological advances, detailed in a Review by Chengqi Yi and colleagues, lay the groundwork for comprehensive profiling of some of these marks, which will shed light on their role in the cell.
Our selection of methods to watch highlights areas we think will experience growth in the coming year and be influential in biological research: from global metabolomics, to RNA-targeting CRISPR, to elucidating single-cell function and faster brain imaging. We do not claim to provide a comprehensive list, and our choices may be biased by our fields of interest. We do hope you enjoy reading this feature; if you disagree with us, or if you think we have overlooked an important area, please let us know.
Understanding and documenting variation in human genomes
To understand disease, one needs to understand the genetic variation that underlies it. Many tools exist that predict the deleteriousness of variants in the human genome: PolyPhen-2, SIFT and CADD (combined annotation dependent depletion), to name only a few. On page 109 of our March issue, Yuval Itan et al. present the mutation significance cutoff (MSC), which replaces the global threshold often used to call variants deleterious from CADD scores with a gene-level threshold. For MSC, as for any other variant prediction tool, it was important to validate the quality of the predictions against variants known to be deleterious. Established mutation databases are often used as ground truth for testing prediction tools; MSC, for example, was validated against variants found in two large databases, HGMD and ClinVar.
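The principle is easy to express in code. Below is a minimal sketch of the gene-level cutoff idea, not the authors' implementation; the cutoff values, the global threshold and the example scores are all hypothetical.

```python
# Hypothetical sketch of gene-level vs. global deleteriousness cutoffs.
# The numbers below are made up for illustration; real MSC values are
# derived per gene from variants of known clinical effect.

GLOBAL_CADD_CUTOFF = 15.0                  # a one-size-fits-all threshold
MSC_CUTOFFS = {"TLR3": 4.8, "TTN": 22.5}   # hypothetical per-gene cutoffs

def is_deleterious(gene: str, cadd_score: float) -> bool:
    """Use the gene-level cutoff when available; otherwise fall back
    to the global threshold."""
    cutoff = MSC_CUTOFFS.get(gene, GLOBAL_CADD_CUTOFF)
    return cadd_score >= cutoff

# A variant can clear its gene's cutoff while falling below the global
# threshold, and vice versa; that is the point of a gene-level cutoff.
print(is_deleterious("TLR3", 6.0))   # True:  6.0 >= 4.8, though < 15.0
print(is_deleterious("TTN", 18.0))   # False: 18.0 < 22.5, though >= 15.0
```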
The February editorial discusses the strengths and limitations of large human variation databases and emphasizes the importance of sharing variant data in publicly accessible databases. We encourage our readers to share their experiences with these databases and to recommend their favorites.
Analyzing high-throughput sequencing data
Nature Methods has published popular analysis tools for making sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life, owing to the pressure to constantly improve and innovate; others have staying power. Let's look back at some of the highlights in our pages.
Mapping and assembling genomic reads
One of the first steps in any sequence analysis pipeline is base calling, and in 2008 Yaniv Erlich and Gregory Hannon reduced base-calling errors in Illumina data with Alta-Cyclic, which uses machine learning to reduce noise.
Once bases are called, they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009, Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST, presented in 2010 by Cenk Sahinalp and colleagues, is a cache-oblivious, seed-and-extend short-read mapper. Bowtie 2, a gapped read aligner by Ben Langmead and Steven Salzberg, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
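To make the seed-and-extend idea concrete, here is a toy sketch in Python. It is nowhere near the engineering of mrsFAST or Bowtie 2 (which, for instance, uses an FM-index rather than a hash table), and the sequences are invented for illustration.

```python
from collections import defaultdict

def build_seed_index(reference, k=8):
    """Index the position of every length-k substring (seed) of the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k=8, max_mismatches=2):
    """Seed with the read's first k bases, then extend each candidate
    location and count mismatches over the full read length."""
    hits = []
    for pos in index.get(read[:k], []):
        if pos + len(read) > len(reference):
            continue
        mismatches = sum(r != g for r, g in
                         zip(read, reference[pos:pos + len(read)]))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTTGACCAGTACGGGACCA"
index = build_seed_index(reference)
print(map_read("TTGACCAGTA", reference, index))  # [(8, 0)]: exact hit at position 8
```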
If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The continued high citation rate of this paper, along with other work pointing out the limits of current assembly programs, highlights that de novo read assembly remains a challenge.
Finding structural variants
In 2009, Paul Medvedev and Michael Brudno reviewed tools for discovering structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20–50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool that predicts a wide variety of structural variation ranging in size from 10 base pairs to 1 megabase. In 2011, Evan Eichler and colleagues added Splitread, which finds indels, de novo structural variants and copy-number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; its combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.
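The read-pair signal that MoDIL, BreakDancer and similar callers exploit can be sketched simply: paired-end reads come from fragments with a known insert-size distribution, so pairs that map much farther apart than expected hint at a deletion, and pairs that map much closer hint at an insertion. The toy example below (made-up numbers, not any published tool's algorithm) flags such discordant pairs.

```python
def flag_discordant_pairs(insert_sizes, lib_mean=300.0, lib_sd=30.0, n_sd=3.0):
    """Flag read pairs whose mapped distance deviates from the expected
    library insert size by more than n_sd standard deviations. In real
    pipelines, lib_mean and lib_sd are estimated from confidently
    mapped, concordant pairs."""
    calls = []
    for i, size in enumerate(insert_sizes):
        if size > lib_mean + n_sd * lib_sd:
            calls.append((i, "possible deletion", size - lib_mean))
        elif size < lib_mean - n_sd * lib_sd:
            calls.append((i, "possible insertion", lib_mean - size))
    return calls

# Hypothetical mapped distances from a ~300-bp insert library.
sizes = [298, 301, 305, 295, 650, 302, 299, 150]
print(flag_discordant_pairs(sizes))
# [(4, 'possible deletion', 350.0), (7, 'possible insertion', 150.0)]
```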
Handling RNA-seq data
In 2008, Mortazavi et al. and Cloonan et al. published two of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then, the number of RNA-seq analysis tools has grown steadily throughout the literature.
To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year, Chris Burge and colleagues published the MISO model to estimate the expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later, Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.
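For a flavor of what expression quantification involves at its simplest, the sketch below computes transcripts per million (TPM) from raw counts, normalizing for transcript length and sequencing depth. It is a generic illustration, not the method of any of the tools mentioned above, and the numbers are invented.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).
    counts: reads mapped to each transcript
    lengths_kb: transcript lengths in kilobases"""
    # Longer transcripts accumulate more reads at the same expression
    # level, so first normalize counts by transcript length.
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1e6  # per-million scaling factor
    return [r / scale for r in rpk]

# Toy example: equal counts on transcripts of different lengths.
print(tpm(counts=[500, 500, 100], lengths_kb=[2.0, 0.5, 1.0]))
# The 0.5-kb transcript gets the highest TPM despite equal raw counts.
```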
Last year, Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment and assessed the performance of software for reconstructing transcripts.
In 2010, David Haussler and colleagues showed with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that functional RNA motifs can be discovered from their structure.
Despite the many computational tools we have published, it is still not always easy to predict a priori which ones will be taken up by the community. We'd love to hear what you think makes a top-notch analysis tool.
A star is born: the updated Human Reference Genome
The release of the 38th build of the human reference genome is getting a well-deserved rock-star greeting from the scientific community.

The new GRCh38 is already a rock star. (Image credit: Wikimedia Commons/Flickr: Starman/K. Spencer)
Fans know it is worth the effort to camp out for tickets to a concert by a beloved rock, pop or country star. GRCh38, the newest build of the human reference genome, is that kind of star. Delayed by a few snags, and held up further by the US government shutdown, the sequence has just arrived in GenBank for use by the scientific community.
Not only has Genome Reference Consortium build 38 (GRCh38) eliminated some pesky gaps from previous builds, it is also the first human reference assembly to include sequence information for centromeres. Until now, centromeres, which are specialized structural components of chromosomes, have been represented in the reference by gaps of 3 million base pairs. The news about centromere sequence will be of interest to cell biologists and genomics researchers alike.
“This will be a major boon to evolutionary studies of human populations and to the many groups doing mechanistic work on human centromeres and kinetochores,” says Stanford University researcher Aaron Straight, whose work focuses on cell division and chromosome segregation. “Finally, now we can stop saying ‘mind the gap’.”
The reference genome finishers are the members of the Genome Reference Consortium (GRC) at the European Bioinformatics Institute, the US National Center for Biotechnology Information, The Wellcome Trust Sanger Institute and The Genome Institute at Washington University.
Scientists may not have physically camped out, like concert-goers, in front of the buildings where genome finishers scurried to get the sequence out the door. But the throngs have been virtually present. The GRC, which works on the human, mouse and zebrafish reference genomes, is "having to field a lot of questions from folks who want to know the minute they can have the assembly," says Deanna Church, a genomicist who was at the US National Center for Biotechnology Information at the time of this interview and has since moved to Personalis, a genetic testing and analysis company.
The din has faded from the 2001 celebration marking the end of the Human Genome Project. But the sequence was not complete then, nor is it complete now. As colleagues at Nature Methods have pointed out here and here, the sequence originally had around 150,000 gaps.
Its predecessor, Genome Reference Consortium build 37 (GRCh37), has 357 gaps and is missing sequence around the centromeres. No longer.
Come here, centromere
The structure and repetitive nature of centromeric regions have made them largely inaccessible to the methods used to create the reference assembly, says Church. The concept and the methods for producing the centromere sequences in this reference build were developed by a research team at the University of California, Santa Cruz (UCSC). They constructed sequences using the Sanger technique, and the data helped the team behind GRCh38 fill in these important gaps.

The centromere community will be happy to no longer say this. (Image credit: Wikimedia Commons/Clicsouris)
In a paper, the UCSC team, led by Karen Miga and Jim Kent, a member of the GRC's scientific advisory board, noted that centromeric regions are replete with near-identical tandem repeats known as satellite DNA. Because these regions are difficult to assemble, they have frequently been excluded from genomic studies. For the new reference genome, the scientists used reads generated during the Venter genome assembly and created models for the centromeres, says Church.
“These models don’t exactly represent the centromere sequences in the Venter assembly, but they are a good approximation of the ‘average’ centromere in this genome,” she says. And these sequence models are not exact representations of any one centromere, either. But including these sequences in the reference assembly “will likely improve genome analysis using current methods, and allow for some further study of population variation in centromere sequences,” says Church.
Stephen Quake responds to Lior Pachter
Stephen Quake responds to a blog post by Lior Pachter that reanalyzes data from Quake's recent Analysis of single-cell RNA sequencing methods, published in Nature Methods.
In October, we published an Analysis by Quake and colleagues that evaluated a number of single-cell RNA-seq approaches on the basis of their sensitivity, accuracy and reproducibility. In a subsequent blog post, Pachter challenged their data reporting. At issue is whether the failure rate among 96 samples sequenced using the Fluidigm C1 microfluidic instrument should have been presented differently.
We encourage animated discussion of published research and hope that this can serve as a useful forum. In this guest post, Quake responds to Pachter’s blog entry. The views expressed below are solely his and do not necessarily represent those of Nature Methods.
In a recent blog post, Lior Pachter appears to question my scientific integrity and suggest that I unfairly manipulated data in a recent publication on single cell RNAseq.
Pachter did not contact me directly with his questions, nor did he give any warning before publishing his blog post. While I am happy that he is carefully scrutinizing publications and independently re-analyzing primary data, his rather sensationalistic approach to reporting his results, in the absence of discussion or peer review, risks doing a disservice to science and adds more heat than light.
Pachter tries to have it both ways: based on our published data, he accuses me of 1) wasting effort by sequencing lower-quality samples and 2) selectively publishing data from only the better samples. It is hard to see how these accusations can both be true. As described in the methods section of our paper, the C1 capture rate is not perfectly efficient, and we therefore manually inspected all the chambers. We found that 93 chambers had single cells, 1 chamber had two cells, and 2 chambers had no cells. Of the 93 chambers with single cells, 91 of the cells appeared to be alive as measured by a live/dead stain and 2 did not. Our single cell RNAseq experiments included all 91 of the "live" single cells and 1 of the "dead" single cells; the data from the latter was indistinguishable from the former and thus it was included in all further analyses. There was absolutely no selection or manipulation of the data. All of the raw data, as well as our R scripts, were made available for Pachter and others to download and analyze upon publication of our paper.
The sequencing library prep and workflow that we use is geared around 96 parallel samples, and we decided it would be valuable to process control samples in exactly the same batch as the single cell samples. We therefore included four control samples with the single cells: amplification products from a chamber on the chip that did not have a cell (C09, which was unfortunately not given a distinguishing filename during the file upload), a single cell tube amplification, a no-template control (NTC, C70) tube experiment that did not have a single cell, and a bulk control sample. Pachter correctly points out that C70 is dominated by the ERCC spike-in controls and has essentially no human transcripts, as expected; similarly, the other negative control, C09, performs very poorly next to the actual single cell data. It is not clear to me why Pachter thinks I should be embarrassed for performing negative control experiments; indeed, biochemical amplifiers are known to be so sensitive that there are many stories of contamination occurring through aerosol dispersal from nearby benches, and so on. In our own analyses, C09 and the other controls were excluded from the single cell data.
Pachter also noticed that ~3 of the single cell RNAseq experiments have significantly lower quality than the other 89, as measured by the fraction of spike-in reads sequenced or by log-correlation coefficient. Taken at face value, this corresponds to a failure rate of 3/92, or about 3%. The experiments therefore had a 97% success rate by this metric, and it is hard to see where his complaint lies. We conservatively included ALL of the single cell data in our analyses; thus, if one follows Pachter's prescription to analyze only the experiments that he deems "successful", the results will be even better than we reported.
Finally, Pachter makes a misleading argument concerning the statistical methods used to generate Figure 4a. This figure is concerned with the question of whether an ensemble of single-cell RNAseq experiments produces gene expression values similar to those of a bulk experiment. The reason for subsampling to equal depth is the worry that comparing two RNAseq experiments of dramatically different sequencing depth introduces artifacts (see, e.g., Cai, G. et al. "Accuracy of RNA-Seq and its dependence on sequencing depth." BMC Bioinformatics 13, Suppl. 13 (2012) and Tarazona, S. et al. "Differential expression in RNA-seq: a matter of depth." Genome Research 21, 2213–2223 (2011)). This figure has little to do with estimating the quality of the individual RNAseq experiments.
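For readers unfamiliar with the practice, the sketch below illustrates the idea of subsampling count data to equal depth before comparison. It is a generic illustration (simple multinomial thinning, with made-up numbers), not necessarily the exact procedure used in the paper.

```python
import numpy as np

def subsample_counts(counts, target_depth, seed=0):
    """Randomly thin a vector of per-gene read counts to a fixed total
    depth so that two experiments can be compared at equal depth."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    total = counts.sum()
    if target_depth >= total:
        return counts.copy()
    # Draw target_depth reads with probabilities proportional to the
    # observed counts (multinomial thinning).
    return rng.multinomial(target_depth, counts / total)

# Two hypothetical experiments at very different depths.
deep = [9000, 500, 400, 100]   # 10,000 reads total
shallow = [850, 80, 50, 20]    #  1,000 reads total
depth = min(sum(deep), sum(shallow))
deep_thinned = subsample_counts(deep, depth)
print(deep_thinned, deep_thinned.sum())  # now comparable at 1,000 reads
```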
Importance of data sharing
No more (trade) secrets
Withholding information on the clinical significance of genetic variants from the scientific community impedes the progress of research and medicine.
Imagine you are a physician or researcher seeking further confirmation of the clinical impact of particular genetic variants. If your search of public databases comes up empty, this does not necessarily mean that nothing is known about the mutations in question. Rather, the information may be locked away as a trade secret in a genetic testing company's proprietary database.
Physicians and their patients are not able to independently verify the medical significance of a testing company's finding; instead, the results have to be taken on blind faith. Researchers are limited in their knowledge of the vast mutational landscape in genes associated with diseases such as cancer, which in turn may limit their understanding of the molecular underpinnings of the disease.
Robert Nussbaum, at the University of California, San Francisco, recently pointed out that in other fields of medicine such an approach would be unthinkable. In a Technology Review article, he said, "Imagine if radiological images or histopathology slides of cancers were examined by a single monopoly holder without the medical community being able to assess and learn from what these images and tissue specimens teach us." He launched the Sharing Clinical Reports Project, an initiative to collect de-identified genetic testing data on the BRCA1 and BRCA2 genes (as discussed in our August editorial).
With more genetic testing companies likely to enter the market after the US Supreme Court invalidated some gene patents, the problems caused by proprietary data may increase. Clinicians may now have more options for obtaining a genetic test, but, if they go with a less established testing company, they may be left with a suboptimal interpretation, with possibly grave implications for the patient.
A resolution from the American Medical Association passed in June 2013 supports public access to genetic data. The resolution calls for companies, laboratories, researchers and providers to publicly share data on genetic variants in a manner consistent with privacy and HIPAA protections.
Whether such calls will be heeded is another question. In a New York Times Op-Ed aptly titled "Our genes, their secrets," the author wonders whether the recent Supreme Court decision will prompt genetic testing companies to rely more on the strategy of treating information on the clinical impact of mutations as a trade secret, thereby deterring competition and securing revenue.
How can this be prevented? Cook-Deegan et al., in a recent article in the European Journal of Human Genetics, call for joint action by national health systems, insurers, regulators, researchers, providers and patients to ensure broad access to information about the clinical significance of variants. Besides promoting voluntary sharing, their suggestions include making data sharing a condition of payment or of regulatory approval for testing laboratories.
The battle over who may offer certain genetic tests is certainly heating up. Ambry Genetics and Gene By Gene, two of the companies now offering BRCA1 and BRCA2 testing, have been sued by Myriad Genetics for patent infringement. A few days later, on July 12, US Senator Patrick Leahy, a Democrat from Vermont, wrote to Francis Collins, the director of the NIH, urging him to force Myriad to license the patent to other parties on reasonable terms to ensure affordable life-saving diagnostic tests. As the federal agency that funded the research behind Myriad's patent, the NIH has the authority to do so under a provision of the Bayh-Dole Act, the law that enabled universities to own inventions arising from federal funding. Whether it will exercise this authority is unclear; Collins's reply is still outstanding.
Ambry Genetics disputes that it infringes any of Myriad’s patents and a company spokesperson told Nature Methods that Ambry plans to share their testing data.
If enough companies follow suit, the desirable equilibrium of compensating a company fairly for the cost of its test while letting the public benefit from the results of these tests should be within reach.
What you always wanted to know about histones
Nature Methods and Nature Biotechnology will host a live discussion on why histone modifications matter in health and disease.
Some call it a code, some call it a language. The fact is that the core histone proteins that make up the nucleosome can be modified by a range of post-translational modifications (the current tally is 16) and that these PTMs, individually or collectively, send a message to the transcription machinery, either attracting or repelling it.
If you have wondered about the nature of the histone code, if you have questions about the importance of its writers, readers and erasers, or if you wonder how these are altered in some diseases and what can be done about it, an upcoming webcast will give you a chance to raise these questions.
On February 26, we will discuss the importance of histone modifications from two angles. First: what is the biology behind it? Which enzymes write the code, and how important is crosstalk between different modifications? Second: how can one efficiently target these enzymes to fight disease?
Our speakers, Ali Shilatifard and James Bradner, will present their views and then engage in a live discussion fueled by questions from the audience. Sign up for the webcast, and post your questions here before February 26 or during the webcast on the event website. Either way, we will try our best to get them answered.
Note: The live webcast has now concluded. Anyone who wants to see it may still register at the link above and view a recording of the webcast at their leisure.
A universal human reference genome?
It is hard to overstate the importance of the human reference genome. In our editorial, we acknowledge the value of this resource and at the same time ask what it would take to improve it further, towards a reference that better reflects human diversity.
First and foremost, we think the Genome Reference Consortium (GRC) needs assistance from the community to fill gaps in the current reference and to provide alternative assemblies for highly divergent regions. To give an example, only three such divergent regions are currently incorporated in the genome, but there are likely hundreds more. One of them, which the GRC is actively working on, is the repeat-rich region on chromosome 1q21.1 that is implicated in several cognitive and developmental disorders. Unfortunately, the GRC does not have the resources to focus on more than a handful.
Of course, there are also things the GRC itself can do to improve the reference.
Some improvements have already been decided upon, for example a switch to common haplotypes. This decision brings its own challenges: determining what counts as a common versus a rare haplotype is not always easy, and some important information may be lost in the conversion, for example in cases where the common allele contains a stop codon whereas the rare allele reads through and produces the protein. The latter is likely to be medically relevant.
Another desirable feature would be an easier navigation system to make the reference more user-friendly.
What are your experiences with the human reference genome? What are the areas you think need improvement? Do you think a reference that reflects human diversity is attainable or even desirable?
Testing times for metagenomics
An Article in the June issue of Nature Methods uses simulated data sets to evaluate programs for metagenomics data analysis. The author of an accompanying News & Views argues that although the results indicate existing programs do work, new algorithms are needed, as well as model metagenomic systems to serve as test beds.
