Microbial sequencing at Nature Methods

Over the years, Nature Methods has published many methods for generating and analyzing complex sequence data in microbial studies. We highlight some of these papers below.

Carl Woese set the stage for a molecular taxonomy of microbial life in 1977 by demonstrating that the 16S ribosomal RNA could form the basis of prokaryotic classification. Amplifying markers such as 16S from microbial mixtures took off with the advent of high-throughput sequencing, which provided a way to rapidly profile communities sampled directly from the environment. Shotgun sequencing approaches are increasingly used for taxonomic profiling as well, enabling gene and genomic sequences to be reconstructed for the functional characterization of communities.

Amplicon-based community profiling
The 454 pyrosequencing platform originally dominated efforts to study the 16S locus because of its long sequence reads. In 2008, Rob Knight and colleagues described the use of error-correcting barcodes for pyrosequencing hundreds of samples together. Then in 2013, Jeffery Dangl and colleagues took barcoding to a new level by tagging every template molecule during library preparation on the Illumina platform, thereby removing much of the PCR bias and error introduced during amplification.
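The principle behind error-correcting barcodes can be sketched in a few lines: if each observed barcode is compared against the known set, a single sequencing error still leaves the read closest to its true barcode. The toy Python demultiplexer below (with invented barcodes and a simple nearest-neighbor rule, not the Hamming-code design of the original paper) illustrates the idea:

```python
# Toy demultiplexer: each read starts with an 8-nt sample barcode. If the
# observed barcode is within Hamming distance 1 of exactly one known barcode,
# the error is "corrected" and the read is assigned to that sample.
# The barcodes and sample names here are invented for illustration.

BARCODES = {
    "AACCGGTT": "soil",
    "GGTTAACC": "gut",
    "CCAATTGG": "ocean",
}

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_read(read, max_dist=1):
    """Return (sample, corrected_barcode), or (None, None) if unmatched/ambiguous."""
    observed = read[:8]
    hits = sorted((hamming(observed, bc), bc) for bc in BARCODES)
    best_dist, best_bc = hits[0]
    # Reject if no barcode is close enough, or two barcodes are equally close.
    if best_dist > max_dist or (len(hits) > 1 and hits[1][0] == best_dist):
        return None, None
    return BARCODES[best_bc], best_bc

# A read whose barcode carries one sequencing error is still assigned correctly:
# "AACCGGTA" differs from the 'soil' barcode at a single position.
sample, bc = assign_read("AACCGGTAACGTACGT")
```

The published scheme uses barcodes constructed from Hamming codes, which guarantees that any single error is correctable by design rather than by a distance search.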

On the computational side, Christopher Quince and colleagues presented PyroNoise in 2009 for ‘denoising’, or removing errors from, pyrosequencing flowgrams. Jens Reeder and Rob Knight followed a year later with Denoiser, a fast heuristic alternative. Gene Tyson and colleagues moved away from flowgrams with their Acacia software, which corrects sequence files directly and, because Ion Torrent data have a similar homopolymer-dominated error profile, also works on that platform.

Once cleaned up, marker sequences need to be grouped into ‘operational taxonomic units’ (OTUs) that roughly correspond to genera, species or strains. Among the many algorithms that do this, Robert Edgar introduced UPARSE (pronounced YOU-parse) in 2013 for accurate OTU clustering in the face of erroneous or chimeric sequence reads.
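The flavor of centroid-based OTU clustering can be conveyed with a toy greedy algorithm. This is only an illustration of the general strategy (pick centroids, assign everything within an identity threshold to them), not the UPARSE algorithm itself, and the identity metric below assumes roughly equal-length sequences:

```python
def identity(a, b):
    """Fraction of matching positions (toy metric for equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold,
    otherwise start a new OTU. The input should be sorted by decreasing
    abundance so that high-confidence sequences become centroids first."""
    centroids, assignment = [], []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                assignment.append(i)
                break
        else:
            centroids.append(s)
            assignment.append(len(centroids) - 1)
    return centroids, assignment

# A 100-nt centroid, a variant with two mismatches (98% identity, same OTU)
# and an unrelated sequence (new OTU).
centroids, assignment = greedy_otus(["A" * 100, "A" * 98 + "CC", "G" * 100])
```

Real tools add the crucial extras this sketch omits: alignment-based identity, chimera filtering and abundance-aware centroid selection.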

To stitch the computational analysis steps together, ‘quantitative insights into microbial ecology’, or QIIME (pronounced chime) from Rob Knight and colleagues offers a user-friendly modular pipeline for amplicon sequence analysis.

Metagenomic community profiling
In shotgun metagenomics approaches, all fragments of genomic DNA in a sample are sequenced and classified. Isidore Rigoutsos and colleagues introduced PhyloPythia in 2007 to assign fragments to higher taxonomic groups or ‘bins’ based on matching the frequency of tetranucleotide sequences with signatures from known taxa. Its faster, open-source successor PhyloPythiaS from Alice McHardy and colleagues came out in 2011.
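A compositional signature of this kind is straightforward to compute. The sketch below tallies overlapping 4-mers into a 256-dimensional frequency vector, a simplified stand-in for the signatures that composition-based classifiers compare:

```python
from collections import Counter
from itertools import product

def tetranucleotide_freqs(seq):
    """Sliding-window counts of all overlapping 4-mers in a sequence,
    normalized to frequencies. The resulting 256-dimensional vector is the
    kind of compositional signature used by composition-based binners."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in kmers) or 1
    return {k: counts[k] / total for k in kmers}

# In "ACGTACGTACGT" there are nine 4-mer windows, three of which are "ACGT".
freqs = tetranucleotide_freqs("ACGTACGTACGT")
```

Fragments from the same genome tend to have similar vectors, which is what makes classification from composition alone possible on sufficiently long fragments.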

Arthur Brady and Steven Salzberg also used sequence composition, alone or combined with sequence alignment, in their Phymm and PhymmBL methods in 2009; an expanded version of PhymmBL, adding confidence scores, custom databases and parallelization, came out in 2011.

In 2012, Curtis Huttenhower and colleagues described MetaPhlAn, which limits analysis to clade-specific marker genes to speed up the classification of sequence reads. Peer Bork and colleagues also extracted a limited marker set from metagenomic data in their metagenomic OTUs (mOTU) approach in 2013, but used 40 universally conserved prokaryotic genes. Both methods work best in systems like the human gut that have a large number of sequenced reference genomes.

Genomes from mixtures
Earlier this year, Christopher Quince, Anders Andersson and colleagues published an unsupervised binning method called CONCOCT to help reconstruct genomes from mixtures. It uses sequence composition and differential coverage across samples to assign pre-assembled contiguous sequences (contigs) to species or strain bins.
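The combination of composition and differential coverage can be illustrated with a deliberately tiny example: here each contig is reduced to its GC fraction plus per-sample coverage, and a bare-bones k-means splits the contigs into bins. CONCOCT itself uses full k-mer profiles and a Gaussian mixture model; all sequences and coverage values below are invented:

```python
import math
import random

def features(contig_seq, coverages):
    """Toy feature vector: GC fraction (composition) plus per-sample coverage."""
    gc = sum(b in "GC" for b in contig_seq) / len(contig_seq)
    return [gc] + list(coverages)

def kmeans(points, k, iters=50, seed=0):
    """Bare-bones k-means returning one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

# Two hypothetical genomes: one GC-rich and abundant in sample 1, one AT-rich
# and abundant in sample 2. Contigs from the same genome share both a
# compositional signature and a coverage pattern, so they land in the same bin.
contigs = [
    features("GCGCGCGCAT", (30.0, 2.0)),
    features("GGCCGCGCTA", (28.0, 3.0)),
    features("ATATATATGC", (2.0, 25.0)),
    features("AATTATATCG", (3.0, 27.0)),
]
labels = kmeans(contigs, k=2)
```

The key insight carried over from the real method is that coverage varies *across samples* in a genome-specific way, so multiple samples separate bins that composition alone would confuse.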

Single-cell sequencing is another way to obtain microbial genomes. Paul Blainey and Stephen Quake discuss challenges and opportunities for single-cell sequencing in a Commentary in our Method of the Year issue in 2014. When cultures are available, long-read single-molecule sequencing technology can provide very high quality genome sequences; the HGAP software from Jonas Korlach and colleagues makes this possible using a single Pacific Biosciences sequencing library.

With genomic sequences in hand, there remains the question of how to fit them within an appropriate taxonomy. Peer Bork and colleagues tackled the problem in 2013 with their species identification (SpecI) tool, which bases classification on the same 40 markers as mOTU.

Functional analysis and ecology
An array of tools has been designed to wrest ecological and biological insights from metagenomic sequence data. Examples include the GENE PRediction IMprovement Pipeline (GenePRIMP) for annotating prokaryotic genomes, from Amrita Pati and colleagues in 2010, and the metagenomeSeq method from Mihai Pop and colleagues in 2013 for testing differential microbial abundance across environments or conditions (also see a comment by Bork and colleagues and the authors’ reply).

In 2010, Rob Knight and colleagues compared 51 methods for their ability to identify biologically relevant distribution patterns using real and simulated 16S pyrosequencing data from samples that were clustered or assayed along environmental gradients. In 2012, Jack Gilbert and colleagues developed microbial assemblage prediction (MAP), an artificial neural network approach to model microbial community structure across the Western English Channel that combines time course metagenomic data from a single site with bioclimatic data gathered over the entire channel.

Quality control and bias
Generating accurate and robust microbial sequence data requires rigorous benchmarking and controls, and experimental methods are constantly improving. Nikos Kyrpides and colleagues studied the use of simulated data to evaluate metagenomic analysis methods in 2007. In 2010, Philip Hugenholtz and colleagues evaluated two methods to deplete rRNA from metatranscriptomes.

J Gregory Caporaso and colleagues further demonstrated the effect of Illumina read quality on taxonomic assignment and diversity assessment in 2013, and Scott Kelley and colleagues developed SourceTracker software to identify contaminants in microbial sequencing studies.

We look forward to many more contributions in the field of microbial sequencing.

References:
Alice Carolyn McHardy et al.
Accurate phylogenetic classification of variable-length DNA fragments
Nature Methods 4, 63-72 (2007) doi:10.1038/nmeth976

Konstantinos Mavromatis et al.
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
Nature Methods 4, 495-500 (2007) doi:10.1038/nmeth1043

Micah Hamady, Jeffrey J Walker, J Kirk Harris, Nicholas J Gold & Rob Knight
Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex
Nature Methods 5, 235-237 (2008) doi:10.1038/nmeth.1184

Christopher Quince et al.
Accurate determination of microbial diversity from 454 pyrosequencing data
Nature Methods 6, 639-641 (2009) doi:10.1038/nmeth.1361

Arthur Brady & Steven L Salzberg
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models
Nature Methods 6, 673-676 (2009) doi:10.1038/nmeth.1358

J Gregory Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nature Methods 7, 335-336 (2010) doi:10.1038/nmeth.f.303

Jens Reeder & Rob Knight
Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions
Nature Methods 7, 668-669 (2010) doi:10.1038/nmeth0910-668b

He et al.
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
Nature Methods 7, 807-812 (2010) doi:10.1038/nmeth.1507

Amrita Pati et al.
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes
Nature Methods 7, 455-457 (2010) doi:10.1038/nmeth.1457

Justin Kuczynski, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Noah Fierer & Rob Knight
Microbial community resemblance methods differ in their ability to detect biologically relevant patterns
Nature Methods 7, 813-819 (2010) doi:10.1038/nmeth.1499

Patil et al.
Taxonomic metagenome sequence assignment with structured output models
Nature Methods 8, 191-192 (2011) doi:10.1038/nmeth0311-191

Arthur Brady & Steven L Salzberg
PhymmBL expanded: confidence scores, custom databases, parallelization and more
Nature Methods 8, 367 (2011) doi:10.1038/nmeth0511-367

Dan Knights et al.
Bayesian community-wide culture-independent microbial source tracking
Nature Methods 8, 761-763 (2011) doi:10.1038/nmeth.1650

Lauren Bragg, Glenn Stone, Michael Imelfort, Philip Hugenholtz & Gene W Tyson
Fast, accurate error-correction of amplicon pyrosequences using Acacia
Nature Methods 9, 425-426 (2012) doi:10.1038/nmeth.1990

Nicola Segata et al.
Metagenomic microbial community profiling using unique clade-specific marker genes
Nature Methods 9, 811-814 (2012) doi:10.1038/nmeth.2066

Peter E Larsen, Dawn Field & Jack A Gilbert
Predicting bacterial community assemblages using an artificial neural network approach
Nature Methods 9, 621-625 (2012) doi:10.1038/nmeth.1975

Robert C Edgar
UPARSE: highly accurate OTU sequences from microbial amplicon reads
Nature Methods 10, 996-998 (2013) doi:10.1038/nmeth.2604

Derek S Lundberg, Scott Yourstone, Piotr Mieczkowski, Corbin D Jones & Jeffery L Dangl
Practical innovations for high-throughput amplicon sequencing
Nature Methods 10, 999-1002 (2013) doi:10.1038/nmeth.2634

Shinichi Sunagawa et al.
Metagenomic species profiling using universal phylogenetic marker genes
Nature Methods 10, 1196-1199 (2013) doi:10.1038/nmeth.2693

Daniel R Mende, Shinichi Sunagawa, Georg Zeller & Peer Bork
Accurate and universal delineation of prokaryotic species
Nature Methods 10, 881-884 (2013) doi:10.1038/nmeth.2575

Chen-Shan Chin et al.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
Nature Methods 10, 563-569 (2013) doi:10.1038/nmeth.2474

Nicholas A Bokulich et al.
Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing
Nature Methods 10, 57-59 (2013) doi:10.1038/nmeth.2276

Joseph N Paulson, O Colin Stine, Héctor Corrada Bravo & Mihai Pop
Differential abundance analysis for microbial marker-gene surveys
Nature Methods 10, 1200-1202 (2013) doi:10.1038/nmeth.2658

Paul C Blainey & Stephen R Quake
Dissecting genomic diversity, one cell at a time
Nature Methods 11, 19-21 (2014) doi:10.1038/nmeth.2783

Johannes Alneberg et al.
Binning metagenomic contigs by coverage and composition
Nature Methods (2014) doi:10.1038/nmeth.3103

Let’s give statistics the attention it deserves

This month we launch a new column ‘Points of Significance’ devoted to statistics, a topic of profound importance for biological research, but one that often doesn’t receive the attention it deserves.

For the past three years Nature Methods has been publishing the Points of View column, one page a month dedicated to practical advice for researchers on how to create accessible and accurate visualizations of their data. The response to the column articles has been fantastic and most recently we organized them by topic here on our blog.

Unfortunately, a truth about data visualization is that no matter how good the visualization, if the experiment wasn’t appropriately designed and the data weren’t analyzed correctly, the resulting visual depiction of the data will be inherently flawed. Nature Methods and the other Nature journals recently made changes to improve data and methods reporting as part of a reproducibility initiative. We feel this is an important first step in improving experimental reproducibility and repeatability, but unfortunately by the time work is submitted for publication it can be difficult to correct shortcomings in experimental design and analysis.

A population distribution and a distribution of sample means.

In our September issue readers will find a new column, Points of Significance, that we hope will be as useful as the column that preceded it, perhaps more so. Martin Krzywinski, who has been writing the visualization column, is now joined by Naomi Altman, Professor of Statistics at The Pennsylvania State University. Among other things, Naomi will be responsible for ensuring that the information and advice we provide about statistics in every Points of Significance article is accurate.

The column has been expanded from one to two pages and will often have an Excel spreadsheet associated with it. This expansion will help us better communicate information that is less well served by display items. However, as illustrated by the figures in the column’s first article and the accompanying spreadsheet, visual displays will continue to play a vital role: well-chosen examples are often more readily grasped than mathematical or narrative descriptions.

We will strive to present the material so that each article in the column builds on prior ones. In this spirit, the first article discusses populations and sampling, a foundation for nearly all topics to follow. The accompanying spreadsheet lets readers experiment with sampling and see for themselves how often values obtained from samples deviate substantially from the true population values. It can be disconcerting to see how often ‘bad luck’ gives a ‘wrong’ result in one set of measurements while another set yields the ‘right’ result, yet statistical measures suggest that the former is more likely to be ‘correct’. This nicely illustrates that statistics cannot tell you whether you are right. That does not mean statistics has limited value; rather, readers of scientific articles reporting statistical results need a healthy grasp of the limitations of statistical analysis, and users of statistics can always learn ways to improve the power of their analyses.
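For readers who prefer code to spreadsheets, the same experiment is easy to simulate. The sketch below (our own illustration, not the column’s spreadsheet) draws many small samples from a known population and shows how widely their means scatter:

```python
import random
import statistics

def sample_means(pop_mean=0.0, pop_sd=1.0, n=5, trials=10000, seed=1):
    """Draw many samples of size n from a normal population and return the
    observed sample means, to show how far small-sample estimates can stray
    from the true population mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
            for _ in range(trials)]

means = sample_means()
# The spread of the sample means approaches pop_sd / sqrt(n), i.e. about
# 0.447 for n = 5: individual five-measurement experiments routinely land
# far from the true mean of 0.
spread = statistics.stdev(means)
```

Increasing `n` shrinks the spread by the square root of the sample size, which is the quantitative heart of the first column’s message about uncertainty.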

The “aura of exactitude” that often surrounds statistics is one of the main notions that the Points of Significance column will attempt to dispel, while providing useful pointers on using and evaluating statistical measures. We expect that readers will find the October article on error bars and confidence intervals, with its practical tips on interpreting these graphical elements, useful almost every time they read a manuscript containing these popular visual representations of uncertainty.

We hope readers enjoy Points of Significance. It is fitting that the column debuts during the International Year of Statistics. To reach a wider audience, each article will be free to access for one month after it is published.

Update: All Points of Significance articles are now free access and have been collected together on a dedicated page in the nature.com “Statistics for biologists” resource.

For more on statistics, and particularly statistics training, don’t miss this September’s Editorial.

. . . . . . . .

Update: Below is a continuously updated list of the Points of Significance articles.

Importance of being uncertain – September 2013
How samples are used to estimate population statistics and what this means in terms of uncertainty.
Error bars – October 2013
The use of error bars to represent uncertainty and advice on how to interpret them.
Significance, P values and t-tests – November 2013
Introduction to the concept of statistical significance and the one-sample t-test.
Power and sample size – December 2013
Using statistical power to optimize study design and sample numbers.
Visualizing samples with box plots – February 2014
Introduction to box plots and their use to illustrate the spread and differences of samples.
Comparing samples—part I – March 2014
How to use the two-sample t-test to compare either uncorrelated or correlated samples.
Comparing samples—part II – April 2014
Adjustment and reinterpretation of P values when large numbers of tests are performed.
Nonparametric tests – May 2014
Use of nonparametric tests to robustly compare skewed or ranked data.
Designing comparative experiments – June 2014
The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.
Analysis of variance and blocking – July 2014
Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.
Replication – September 2014
Technical replication reveals technical variation while biological replication is required for biological inference.
Nested designs – October 2014
Use the relative noise contribution of each layer in nested experimental designs to optimally allocate experimental resources using ANOVA.
Two-factor designs – December 2014
It is common in biological systems for multiple experimental factors to produce interacting effects on a system. A study design that allows these interactions can increase sensitivity.
Sources of variation – January 2015
To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication to collect replicable and meaningful results.
Split plot design – March 2015
When some experimental factors are harder to vary than others, a split plot design can be efficient for exploring the main (average) effects and interactions of the factors.
Bayes’ theorem – April 2015
Use Bayes’ theorem to combine prior knowledge with observations of a system and make predictions about it.
Bayesian statistics – May 2015
Unlike classical frequentist statistics, Bayesian statistics allows direct inference of the probability that a model is correct and it provides the ability to update this probability as new data is collected.
Sampling distributions and the bootstrap – June 2015
Use the bootstrap method to simulate new samples and assess the precision and bias of sample estimates.
Bayesian networks – September 2015
Model interactions between causes and effects in large networks of causal influences using Bayesian networks, which combine network analysis with Bayesian statistics.
Association, correlation and causation – October 2015
Pairwise dependencies can be characterized using correlation but be aware that correlation only implies association, not causation. Conversely, causation implies association, not correlation.
Simple linear regression – November 2015
Linear regression is a flexible way to predict the values of one variable using the values of the other to find a ‘best line’ through the data points.
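As a taste of the material, the resampling idea behind the bootstrap column can be sketched in a few lines of Python (a minimal percentile bootstrap on made-up measurements, not the column’s own implementation):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the data with replacement many times,
    recompute the statistic each time, and take the middle (1 - alpha)
    fraction of the resampled values as a confidence interval."""
    rng = random.Random(seed)
    boots = sorted(stat(rng.choices(data, k=len(data))) for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Invented measurements; the interval brackets their sample mean (~4.76).
lo, hi = bootstrap_ci([4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6])
```

The appeal of the method is that it makes no assumption about the shape of the underlying distribution: the observed sample itself stands in for the population.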

Head-to-head comparisons of methods and tools

Choosing the best tool or method for a particular experiment can be a daunting task. Finding the right choice can take much time and many resources, and an improper choice can lead to poor or inaccurate results.

Direct head-to-head comparisons of methods or tools under standardized experimental conditions can yield extremely valuable information for method users and tool developers alike. To give these types of papers a dedicated home, Nature Methods provides the ‘Analysis’ format.

In our February issue Editorial we discuss the value of these types of publications, and we highlight two recent examples of Analysis papers that we hope will become well-thumbed copies on many desks throughout the world.

Zhuang and colleagues performed a systematic empirical comparison of different fluorescent dyes used for super-resolution imaging, and Deisseroth and colleagues compared a wealth of optogenetic tools for the modulation of neuronal activity.

Nature Methods will continue to look for these kinds of comparative projects, and we are eager to hear your thoughts about particular areas that might benefit from this type of work and to receive proposals and submissions of this kind.

Where’s your ground truth?

When using or developing experimental and observational methods, it is crucial to assess a method’s performance to ensure that the information it provides reflects reality. For experimental biologists this often means conducting carefully chosen control experiments with alternative methods or different experimental settings. More rigorous assessment, particularly for high-throughput or large-scale methods, often requires ‘ground truth’ or ‘gold standard’ data sets. But talk to different people and you will get different answers about what ‘ground truth’ or ‘gold standard’ data are, often accompanied by a nice historical explanation of where the term ‘ground truth’ comes from.

For developers of signal processing and image analysis algorithms, though, the situation is clearer: the ground truth is the signal or image you start with. But add a living system into the mix and things get far more complicated. The Editorial in the November issue of Nature Methods discusses the challenges facing developers and users of algorithms for automated analysis of biological data, with a focus on image data. In short, traditional ground truth data are often insufficient. Adding integrated editing and change-logging capabilities to these software tools can increase the quality of the analysis, aid further algorithm development and increase the likelihood of biologists adopting the software in the first place.

Efficiency through analysis

The May Editorial in Nature Methods discusses how the overall efficiency of research can be improved by comparative analysis of research method and tool performance.

Although such analysis studies aren’t considered as ‘sexy’ as basic exploratory research, the benefits for, and gratitude from, the community can be profound. Large, well-funded laboratories are the most likely to have the resources to perform such analyses and should not discount the advantages of carrying out these studies and publishing the results.

Nature Methods has published several such analysis studies in the past; a (probably incomplete) selection is listed below and may provide some inspiration. Our ‘Analysis’ article type is dedicated to these kinds of studies, and we will strive to publish even more in the future. We encourage communities and labs both to contribute such analyses and to suggest methodological areas that would benefit from them.

2005
Multiple-laboratory comparison of microarray platforms
doi:10.1038/nmeth756
Independence and reproducibility across microarray platforms
doi:10.1038/nmeth757
Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations
doi:10.1038/nmeth785

2006
A guide to choosing fluorescent proteins
doi:10.1038/nmeth819

2007
Reproducible isolation of distinct, overlapping segments of the phosphoproteome
doi:10.1038/nmeth1005
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
doi:10.1038/nmeth1043

2008
Cyclic nucleotide analogs as probes of signaling pathways
doi:10.1038/nmeth0408-277

2009
Cost-effective strategies for completing the interactome
doi:10.1038/nmeth.1283
A HUPO test sample study reveals common problems in mass spectrometry-based proteomics
doi:10.1038/nmeth.1333

2010
Comprehensive comparative analysis of strand-specific RNA sequencing methods
doi:10.1038/nmeth.1491
Microbial community resemblance methods differ in their ability to detect biologically relevant patterns
doi:10.1038/nmeth.1499
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
doi:10.1038/nmeth.1507

2011
Chemically defined conditions for human iPSC derivation and culture
doi:10.1038/nmeth.1593
Two-photon absorption properties of fluorescent proteins
doi:10.1038/nmeth.1596