Microbial sequencing at Nature Methods

Over the years, Nature Methods has published many methods to generate and analyze complex sequence data for microbial studies. We cover highlights from our papers below.

Carl Woese set the stage for a molecular taxonomy of microbial life in 1977 by demonstrating that the 16S ribosomal RNA gene could form the basis of prokaryotic classification. Amplifying markers such as 16S rRNA from microbial mixtures took off with the advent of high-throughput sequencing, which provided a way to rapidly profile communities sampled directly from the environment. Shotgun sequencing approaches are increasingly used for taxonomic profiling as well, enabling gene and genomic sequences to be reconstructed for the functional characterization of communities.

Amplicon-based community profiling
The 454 pyrosequencing platform originally dominated efforts to study the 16S locus because of its long sequence reads. In 2008, Rob Knight and colleagues described the use of error-correcting barcodes for pyrosequencing hundreds of samples together. Then in 2013, Jeffery Dangl and colleagues took barcoding to a new level by tagging every template molecule during library preparation on the Illumina platform, removing much of the PCR bias and error introduced during amplification.
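The idea behind error-correcting barcodes can be sketched in a few lines: if barcodes are designed so that every pair differs at several positions, a sequencing error within the barcode can be corrected by assigning the read to the uniquely closest known barcode. The sketch below is a minimal illustration with hypothetical barcodes and sample names, not the actual code designs used in the paper:

```python
# Demultiplexing with error-tolerant barcode matching (illustrative sketch).
# Real designs guarantee a minimum pairwise distance between barcodes;
# the barcodes and sample names below are hypothetical.

BARCODES = {"ACGTACGT": "sample_1", "TGCATGCA": "sample_2", "GGAACCTT": "sample_3"}

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(observed_barcode: str, max_dist: int = 1):
    """Return the sample whose barcode is uniquely closest, or None."""
    hits = sorted((hamming(observed_barcode, bc), sample)
                  for bc, sample in BARCODES.items())
    best_dist, best_sample = hits[0]
    if best_dist > max_dist:
        return None  # too many errors to correct safely
    if len(hits) > 1 and hits[1][0] == best_dist:
        return None  # ambiguous: two barcodes equally close
    return best_sample

# One substitution error in the barcode is still assigned correctly:
assign_sample("ACGTACGA")  # -> "sample_1"
```

The key design choice is that correction is only safe when the minimum distance between any two barcodes exceeds twice the number of errors to be corrected.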

On the computational side, Christopher Quince and colleagues presented PyroNoise in 2009 for ‘denoising’, or removing errors from, pyrosequencing flowgrams. Jens Reeder and Rob Knight followed a year later with Denoiser, a fast heuristic alternative. Gene Tyson and colleagues moved away from flowgrams with their Acacia software, which corrects sequence files directly and also works on Ion Torrent data, whose error profile is similarly dominated by errors in homopolymer runs.

Once cleaned up, marker sequences need to be grouped into ‘operational taxonomic units’ (OTUs) that roughly correspond to genera, species or strains. Among the many algorithms that do this, Robert Edgar introduced UPARSE (pronounced ‘YOU-parse’) in 2013 for accurate OTU clustering in the face of erroneous or chimeric sequence reads.
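Greedy OTU clustering of the kind these tools perform can be caricatured as follows: reads are processed in order of decreasing abundance, and each read either joins an existing OTU whose centroid it matches above an identity threshold or founds a new OTU. This is a toy sketch using simple positional identity on equal-length sequences; real implementations use proper pairwise alignment and, in UPARSE's case, also filter chimeras:

```python
def identity(a: str, b: str) -> float:
    # Toy positional identity for equal-length sequences; real tools
    # compute identity from a pairwise alignment.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_otu_cluster(reads_by_abundance, threshold=0.97):
    """Assign each read to the first centroid it matches, else make it a centroid."""
    centroids, assignments = [], {}
    for read in reads_by_abundance:  # most abundant first
        for c in centroids:
            if identity(read, c) >= threshold:
                assignments[read] = c
                break
        else:
            centroids.append(read)   # read founds a new OTU
            assignments[read] = read
    return centroids, assignments

# Toy data: the second read is within 90% identity of the first,
# the third is not, so it founds its own OTU.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
centroids, assignments = greedy_otu_cluster(reads, threshold=0.9)
```

Processing reads in abundance order matters: abundant sequences are more likely to be correct biological templates, so they make better centroids than rare, possibly erroneous reads.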

To stitch the computational analysis steps together, ‘quantitative insights into microbial ecology’, or QIIME (pronounced ‘chime’), from Rob Knight and colleagues offers a user-friendly, modular pipeline for amplicon sequence analysis.

Metagenomic community profiling
In shotgun metagenomics approaches, all fragments of genomic DNA in a sample are sequenced and classified. Isidore Rigoutsos and colleagues introduced PhyloPythia in 2007 to assign fragments to higher taxonomic groups or ‘bins’ based on matching the frequency of tetranucleotide sequences with signatures from known taxa. Its faster, open-source successor PhyloPythiaS from Alice McHardy and colleagues came out in 2012.
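Composition-based binning rests on the observation that k-mer usage is a genome-specific signature. A minimal sketch of the feature-extraction step is below; the structured classifiers that PhyloPythia builds on top of such vectors are, of course, far more involved:

```python
from collections import Counter
from itertools import product

def tetranucleotide_frequencies(seq: str) -> dict:
    """Frequency of each of the 4^4 = 256 tetranucleotides in a sequence."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts.values()) or 1
    return {"".join(kmer): counts["".join(kmer)] / total
            for kmer in product("ACGT", repeat=4)}

# A short toy fragment: 9 overlapping 4-mers, "ACGT" occurs 3 times -> 1/3
freqs = tetranucleotide_frequencies("ACGTACGTACGT")
```

Fragments from the same genome tend to have correlated frequency vectors, so a classifier trained on signatures from known taxa can place an anonymous fragment into a higher taxonomic group.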

Arthur Brady and Steven Salzberg also used sequence composition, alone or combined with sequence alignment, in Phymm and PhymmBL in 2009; an expanded PhymmBL with confidence scores, custom databases and parallelization followed in 2011.

In 2012, Curtis Huttenhower and colleagues described MetaPhlAn, which limits analysis to clade-specific marker genes to speed up the classification of sequence reads. Peer Bork and colleagues also extracted a limited marker set from metagenomic data in their metagenomic OTUs (mOTU) approach in 2013, but used 40 universally conserved prokaryotic genes. Both methods work best in systems like the human gut that have a large number of sequenced reference genomes.
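The marker-gene shortcut both methods exploit can be sketched roughly as: map reads against a small database of clade-specific markers, then convert hit counts, normalized by marker length, into relative abundances. The marker names, lengths and counts below are hypothetical, and real tools handle coverage normalization and ambiguous hits far more carefully:

```python
# Hypothetical clade-specific markers: name -> (clade, marker length in bp)
MARKERS = {"m1": ("Bacteroides", 1000),
           "m2": ("E_coli", 500),
           "m3": ("E_coli", 1500)}

def relative_abundance(read_hits):
    """read_hits: marker name -> number of reads mapped to that marker."""
    per_clade = {}
    for marker, hits in read_hits.items():
        clade, length = MARKERS[marker]
        # Normalize by marker length so long markers don't dominate
        per_clade[clade] = per_clade.get(clade, 0.0) + hits / length
    total = sum(per_clade.values())
    return {clade: v / total for clade, v in per_clade.items()}

relative_abundance({"m1": 100, "m2": 25, "m3": 75})
# Bacteroides: 100/1000 = 0.1; E_coli: 25/500 + 75/1500 = 0.1 -> a 50/50 split
```

Restricting the search space to a few thousand marker genes, rather than full genomes, is what makes these methods fast.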

Genomes from mixtures
Earlier this year, Christopher Quince, Anders Andersson and colleagues published an unsupervised binning method called CONCOCT to help reconstruct genomes from mixtures. It uses sequence composition and differential coverage across samples to assign pre-assembled contiguous sequences (contigs) to species or strain bins.
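The core idea can be sketched as building, for each contig, a joint feature vector of normalized k-mer composition plus log-scaled read coverage in each sample, which is then clustered into bins. The sketch below only assembles the features with hypothetical contig data; CONCOCT itself clusters them with a Gaussian mixture model after dimensionality reduction:

```python
import math

def contig_features(kmer_counts, coverages, pseudocount=1.0):
    """Joint composition + coverage feature vector for one contig.

    kmer_counts: list of k-mer counts (composition signature)
    coverages:   per-sample read coverage for this contig
    """
    total = sum(kmer_counts) + pseudocount * len(kmer_counts)
    composition = [(c + pseudocount) / total for c in kmer_counts]
    # Log-scaling stabilises coverage, which varies over orders of magnitude
    coverage = [math.log(c + pseudocount) for c in coverages]
    return composition + coverage

# Hypothetical contig: 4 k-mer counts plus coverage in 3 samples
vec = contig_features([10, 0, 5, 5], [100.0, 2.0, 30.0])
```

The differential-coverage signal is what multiple samples buy you: two contigs from the same genome should rise and fall together across samples even when their composition alone is not distinctive.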

Single-cell sequencing is another way to obtain microbial genomes. Paul Blainey and Stephen Quake discussed challenges and opportunities for single-cell sequencing in a Commentary in our 2014 Method of the Year issue. When cultures are available, long-read single-molecule sequencing technology can provide very high quality genome sequences; the HGAP software from Jonas Korlach and colleagues makes this possible using a single Pacific Biosciences sequencing library.

With genomic sequences in hand, there remains the question of how to fit them within an appropriate taxonomy. Peer Bork and colleagues tackled the problem in 2013 with their species identification (SpecI) tool, which bases classification on the same 40 markers as mOTU.

Functional analysis and ecology
An array of tools has been designed to wrest ecological and biological insights from metagenomic sequence data, such as the GENE PRediction IMprovement Pipeline (GenePRIMP) for annotating prokaryotic genomes, by Amrita Pati and colleagues in 2010, and the metagenomeSeq method to test for differential microbial abundance across environments or conditions, by Mihai Pop and colleagues in 2013 (also see a comment by Bork and colleagues and the authors’ reply).

In 2010, Rob Knight and colleagues compared 51 methods for their ability to identify biologically relevant distribution patterns using real and simulated 16S pyrosequencing data from samples that were clustered or assayed along environmental gradients. In 2012, Jack Gilbert and colleagues developed microbial assemblage prediction (MAP), an artificial neural network approach to model microbial community structure across the Western English Channel that combines time course metagenomic data from a single site with bioclimatic data gathered over the entire channel.

Quality control and bias
Generating accurate and robust microbial sequence data requires rigorous benchmarking and controls, and experimental methods are constantly improving. Nikos Kyrpides and colleagues studied the use of simulated data to evaluate metagenomic analysis methods in 2007. In 2010, Philip Hugenholtz and colleagues evaluated two methods to deplete rRNA from metatranscriptomes.

J Gregory Caporaso and colleagues further demonstrated the effect of Illumina read quality on taxonomic assignment and diversity assessment in 2013, and Scott Kelley and colleagues developed SourceTracker software to identify contaminants in microbial sequencing studies.

We look forward to many more contributions in the field of microbial sequencing.

References:
Alice Carolyn McHardy et al.
Accurate phylogenetic classification of variable-length DNA fragments
Nature Methods 4, 63-72 (2007) doi:10.1038/nmeth976

Konstantinos Mavromatis et al.
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
Nature Methods 4, 495-500 (2007) doi:10.1038/nmeth1043

Micah Hamady, Jeffrey J Walker, J Kirk Harris, Nicholas J Gold & Rob Knight
Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex
Nature Methods 5, 235-237 (2008) doi:10.1038/nmeth.1184

Christopher Quince et al.
Accurate determination of microbial diversity from 454 pyrosequencing data
Nature Methods 6, 639-641 (2009) doi:10.1038/nmeth.1361

Arthur Brady & Steven L Salzberg
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models
Nature Methods 6, 673-676 (2009) doi:10.1038/nmeth.1358

J Gregory Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nature Methods 7, 335-336 (2010) doi:10.1038/nmeth.f.303

Jens Reeder & Rob Knight
Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions
Nature Methods 7, 668-669 (2010) doi:10.1038/nmeth0910-668b

Shaomei He et al.
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
Nature Methods 7, 807-812 (2010) doi:10.1038/nmeth.1507

Amrita Pati et al.
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes
Nature Methods 7, 455-457 (2010) doi:10.1038/nmeth.1457

Justin Kuczynski, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Noah Fierer & Rob Knight
Microbial community resemblance methods differ in their ability to detect biologically relevant patterns
Nature Methods 7, 813-819 (2010) doi:10.1038/nmeth.1499

Patil et al.
Taxonomic metagenome sequence assignment with structured output models
Nature Methods 8, 191-192 (2011) doi:10.1038/nmeth0311-191

Arthur Brady & Steven L Salzberg
PhymmBL expanded: confidence scores, custom databases, parallelization and more
Nature Methods 8, 367 (2011) doi:10.1038/nmeth0511-367

Dan Knights et al.
Bayesian community-wide culture-independent microbial source tracking
Nature Methods 8, 761-763 (2011) doi:10.1038/nmeth.1650

Lauren Bragg, Glenn Stone, Michael Imelfort, Philip Hugenholtz & Gene W Tyson
Fast, accurate error-correction of amplicon pyrosequences using Acacia
Nature Methods 9, 425-426 (2012) doi:10.1038/nmeth.1990

Nicola Segata et al.
Metagenomic microbial community profiling using unique clade-specific marker genes
Nature Methods 9, 811-814 (2012) doi:10.1038/nmeth.2066

Peter E Larsen, Dawn Field & Jack A Gilbert
Predicting bacterial community assemblages using an artificial neural network approach
Nature Methods 9, 621-625 (2012) doi:10.1038/nmeth.1975

Robert C Edgar
UPARSE: highly accurate OTU sequences from microbial amplicon reads
Nature Methods 10, 996-998 (2013) doi:10.1038/nmeth.2604

Derek S Lundberg, Scott Yourstone, Piotr Mieczkowski, Corbin D Jones & Jeffery L Dangl
Practical innovations for high-throughput amplicon sequencing
Nature Methods 10, 999-1002 (2013) doi:10.1038/nmeth.2634

Shinichi Sunagawa et al.
Metagenomic species profiling using universal phylogenetic marker genes
Nature Methods 10, 1196-1199 (2013) doi:10.1038/nmeth.2693

Daniel R Mende, Shinichi Sunagawa, Georg Zeller & Peer Bork
Accurate and universal delineation of prokaryotic species
Nature Methods 10, 881-884 (2013) doi:10.1038/nmeth.2575

Chen-Shan Chin et al.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
Nature Methods 10, 563-569 (2013) doi:10.1038/nmeth.2474

Nicholas A Bokulich et al.
Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing
Nature Methods 10, 57-59 (2013) doi:10.1038/nmeth.2276

Joseph N Paulson, O Colin Stine, Héctor Corrada Bravo & Mihai Pop
Differential abundance analysis for microbial marker-gene surveys
Nature Methods 10, 1200-1202 (2013) doi:10.1038/nmeth.2658

Paul C Blainey & Stephen R Quake
Dissecting genomic diversity, one cell at a time
Nature Methods 11, 19-21 (2014) doi:10.1038/nmeth.2783

Johannes Alneberg et al.
Binning metagenomic contigs by coverage and composition
Nature Methods (2014) doi:10.1038/nmeth.3103

Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools to make sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life owing to the constant pressure to improve and innovate; others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base-calling, and in 2008 Yaniv Erlich and Gregory Hannon reduced base-calling errors in Illumina data with Alta-Cyclic, which uses machine learning to reduce noise.

Once bases are called, they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009, Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST, a cache-oblivious seed-and-extend short-read mapper, was presented in 2010 by Cenk Sahinalp and colleagues. Bowtie 2, a gapped read aligner by Ben Langmead and Steven Salzberg, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
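The seed-and-extend strategy underlying many of these mappers can be sketched very simply: hash the k-mers of the reference, look up an exact ‘seed’ from the read, then extend each candidate location and count mismatches. This is a toy version with hypothetical sequences; real mappers use spaced or multiple seeds, handle reverse complements and gaps, and report mapping qualities:

```python
def build_index(ref: str, k: int = 4):
    """Hash every k-mer position in the reference (the 'seed' index)."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index

def map_read(read: str, ref: str, index, k: int = 4, max_mismatches: int = 2):
    """Seed with the read's first k-mer, then extend and count mismatches."""
    for pos in index.get(read[:k], []):
        if pos + len(read) <= len(ref):
            mismatches = sum(r != g for r, g in zip(read, ref[pos:pos + len(read)]))
            if mismatches <= max_mismatches:
                return pos, mismatches
    return None  # no seed hit, or every extension exceeded the mismatch budget

ref = "TTACGTACGGAATTC"
idx = build_index(ref)
map_read("ACGGAATT", ref, idx)  # -> (6, 0)
```

Because the seed lookup is a hash-table access, the expensive extension step runs only at a handful of candidate positions rather than across the whole reference.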

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The continued high citation rate of this paper, and other work pointing out the limits of current assembly programs, highlights that de novo read assembly remains a challenge.

Finding structural variants

In 2009, Paul Medvedev and Michael Brudno reviewed tools to discover structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20-50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variants ranging in size from 10 base pairs to 1 megabase. In 2011, Evan Eichler and colleagues added Splitread to find indels, de novo structural variants and copy-number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; a combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.

Handling RNA-seq data

In 2008, Mortazavi et al. and Cloonan et al. published two of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then, the number of RNA-seq analysis tools has grown steadily throughout the literature.

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year, Chris Burge and colleagues published the MISO model to estimate the expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later, Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.

Last year, Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment and assessed the performance of software to reconstruct transcripts.

In 2010, David Haussler and colleagues showed with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that functional RNA motifs can be discovered from structural information.

Despite the many computational tools we have published, it is still not always easy to predict a priori which one will be taken up by the community. We’d love to hear what you think makes a top-notch analysis tool.

Guidelines for algorithms and software in Nature Methods

A large proportion of the original research published in Nature Methods relies to varying degrees on custom algorithms and software developed by the authors. Here we provide guidance on our relevant material-sharing and reporting policies.

Nature Methods first outlined our material-sharing and reporting standards for algorithms and software in a March 2007 Editorial. Now, after seven years of experience applying those policies, we have updated and expanded them in our March 2014 Editorial. On this page we provide more detailed guidelines for authors submitting manuscripts containing unpublished algorithms and software they created. We are posting this information here because we’d like these guidelines to evolve, and we want input from our communities on how they think this should happen. Please comment below and let us know your thoughts. We will update this document as our policies change.

Manuscripts published in Nature Methods include methods and tools in which algorithms and software represent an increasingly important methodological component. However, the degree to which they are central to the reported methodology can vary considerably. The algorithm or tool may be the entire motivation for publishing the work or it may be ancillary to it. Additionally, the methodology may be a novel algorithm of value in and of itself but a coded implementation is still necessary for the authors to show that it works as expected. Finally, the software tool may implement existing algorithms in a user-friendly form to deliver high value functionality of substantial general interest. Because of this wide variety it is inappropriate to enforce one-size-fits-all standards for algorithms and software reported in Nature Methods. The guidelines below represent our current editorial position on software reporting and release.

Client-side Software
This is software that is installed and used on a personal computer and not intended to be accessed remotely as a web service. It can be entirely stand-alone on a commonly available operating system (Windows, Mac OS X, or *nix) or can require the user to have a popular software platform installed (MATLAB or LabVIEW). In all cases, but particularly when using MATLAB or LabVIEW, all platform versions and software dependencies must be detailed in the supplied documentation.

At Submission

  • If the custom algorithm/software is central to the method and has not been reported previously in a published research paper it must be supplied by the authors in a usable form including one or more of the following.
    1. Source code
    2. Complete pseudocode
    3. Full mathematical description of the algorithm
    4. Compiled standalone software

    We strongly urge that full source code be provided. A compiled executable alone is not sufficient, but one may be required in addition if the tool is intended to be of wide general use. The final acceptable forms of release of the algorithm, software and code will be determined by the editor after consultation with referees. This decision will be influenced by the editorial motivation for publishing the work (e.g., high novelty or satisfying a wide general need).

  • If the software is ancillary to the methodology being reported or is a routine implementation of obvious processes, such as microscope control software or analyses that are otherwise adequately described, the software need not be supplied to reviewers at submission but final release requirements may change in the course of the review process.
  • Supplied source code or software must be accompanied by documentation sufficient for a typical user to compile, install and use the software. Depending on the nature of the software tool, how central it is to the manuscript and our editorial motivation for considering the work, the minimum documentation may be a simple readme file or a full manual in PDF format.
  • If appropriate, sample data known to work on the software should be provided along with the expected output. Referees are encouraged to try and use the tool to analyze their own data.
  • The software and associated files may be supplied for reviewers as either:
    1. A single Supplementary Software zip file up to 200 MB in size
    2. Four DVDs to be mailed to the reviewers.
  • Any restrictions on the availability of software or code used to implement novel algorithms must be specified at the time of submission. Editors will decide whether any restrictions are acceptable in consultation with the reviewers. If some restrictions are deemed acceptable, they must be clearly explained in the methods section of the manuscript. Authors must supply all information needed for the reviewers to properly evaluate the software or code. If the motivation of the submitted manuscript is to provide a useful tool, rather than report a new algorithmic development, there should be no substantial restrictions on software or code availability.
  • We encourage authors to provide a license with the software or code.
  • A narrative description of key algorithmic components should be provided in the main text. Extensive equations, pseudocode or snippets of source code should be confined to the Online Methods or a Supplementary Note.

At Acceptance

  • If the software is central to the methodology and non-obvious, the source code should be provided in a Supplementary Software zip file as described above so that readers can easily access the exact code used to obtain the results in the paper. There are some possible exceptions:
    1. If the author’s institution requires a user to accept a license agreement or if the author has other reasonable grounds for not providing the source code as Supplementary Software, it may be acceptable for the author to host source code on an institutional server and require that users fill out an online form and agree to a license before downloading the software. In this instance the software must have version numbering and a link to the version used in the work must be provided in the manuscript.
    2. In some situations it may be permissible for authors to supply only compiled software as Supplementary Software but the source code to academic users upon email request. Details of availability must be clearly stated in the manuscript.
    3. It is not acceptable to make software and code available by email request only.

  • If the software or code isn’t the main tool/method being reported in the manuscript the authors may provide a note in the readme file of the Supplementary Software cautioning users that the code is unsupported and not intended for general use. In this case it is permissible that the software or code be made available only by email request but the authors must state this availability in the manuscript.
  • Regardless of how the software is made available, the code supplied with the manuscript must be identical to that used to obtain the data in the paper. An exception can be made for changes that don’t alter the processing of input data. The authors may however provide a link to access new versions of the software.
  • We strongly encourage authors to include a license with all published software and code.
  • We encourage authors to provide macros for recording the software version and parameter settings during analyses or to integrate this functionality into the software itself.

Web Tools/Resources
These represent a special class of software that often cannot be expected to follow the guidelines outlined above. This is particularly true if the web tool or resource is supplied as a service and has few, if any, novel computational aspects. The only end-user requirement for web tools is that they be freely accessible with any modern web browser.

Nature.com provides a proxy server for reviewers to access web tools and resources anonymously.

At Submission

  • The authors must supply a working link and any necessary log in information.
  • Any unpublished algorithms central to the operation of the tool should be supplied in forms 1, 3 or 4 detailed above.

At Acceptance

  • The authors should supply written confirmation that they will keep the website and tool operating and freely accessible for the foreseeable future.

Bioimage Informatics

It is no secret that imaging, and microscopy in particular, represents a substantial fraction of the manuscripts published in Nature Methods. Our very first focus issue, in fact, was on fluorescence imaging. When that focus was published in 2005 the term ‘bioimage informatics’ didn’t even exist. Even today, the term isn’t widely used and, unlike many other bioinformaticians, those who work on the development of algorithms and software tools for analysis of biological image data have few dedicated venues for discussing or publishing their work.

But computational techniques are becoming increasingly important in biological imaging and the people developing these tools increasingly see themselves as a distinct community. When we approached the community about publishing a focus issue on bioimage informatics there was an enthusiastic response and the results can be seen in our July issue and focus that went live today.

We hope that biologists using microscopy in their research find the information in the focus useful and that it stimulates them to try some of the tools now available and in development. Many of these tools have functionality designed to encourage community participation and aid in both the creation of new analysis methods and the communication of methods and protocols to other users.

Although these tools and the community developing them have come a long way since Wayne Rasband first released NIH Image, bioimage informatics is still in its relative infancy. As discussed in the focus editorial, algorithm development and usage will become even more important for biological microscopy and will change the way biologists perform and report their research.