Microbial sequencing at Nature Methods

Over the years, Nature Methods has published many methods to generate and analyze complex sequence data for microbial studies. We cover highlights from our papers below.

Carl Woese set the stage for a molecular taxonomy of microbial life in 1977 by demonstrating that the 16S ribosomal RNA gene could form the basis of prokaryotic classification. Amplifying markers such as 16S from microbial mixtures took off with the advent of high-throughput sequencing, which provided a way to rapidly profile communities sampled directly from the environment. Shotgun sequencing approaches are increasingly used for taxonomic profiling as well, enabling gene and genomic sequences to be reconstructed for the functional characterization of communities.

Amplicon-based community profiling
The 454 pyrosequencing platform originally dominated efforts to study the 16S locus because of its long sequence reads. In 2008, Rob Knight and colleagues described the use of error-correcting barcodes for pyrosequencing hundreds of samples together. Then in 2013, Jeffery Dangl and colleagues took barcoding to a new level by tagging every template molecule during library preparation on the Illumina platform, removing much of the PCR bias and error introduced during amplification.
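The error-correcting idea behind such barcodes can be sketched in a few lines; the barcode sequences and sample names below are invented for illustration, and real barcode sets are designed with additional constraints (GC content, avoiding self-priming and homopolymers):

```python
# Minimal sketch of error-correcting barcode demultiplexing.
# The barcodes and sample names here are made up for illustration.
BARCODES = {
    "AACCGGTT": "soil_1",
    "TTGGCCAA": "soil_2",
    "ACACACAC": "gut_1",
}

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(observed, barcodes=BARCODES, max_errors=1):
    """Assign a sequenced barcode to its sample, correcting up to
    max_errors substitutions. This works because the barcode set is
    designed with pairwise Hamming distance >= 2 * max_errors + 1."""
    best = min(barcodes, key=lambda bc: hamming(observed, bc))
    return barcodes[best] if hamming(observed, best) <= max_errors else None
```

Because every pair of barcodes in this toy set differs at six or more positions, a single sequencing error still leaves the observed tag closer to its true barcode than to any other, so the read can be assigned unambiguously.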

On the computational side, Christopher Quince and colleagues presented PyroNoise in 2009 for 'denoising', or removing errors from, pyrosequencing flowgrams. Jens Reeder and Rob Knight followed a year later with Denoiser, a fast heuristic alternative. Gene Tyson and colleagues moved away from flowgrams with their Acacia software, which corrects sequence files directly and also works on Ion Torrent data, whose error profile is similarly dominated by errors at homopolymer repeats.

Once cleaned up, marker sequences need to be grouped into 'operational taxonomic units' (OTUs) that roughly correspond to genera, species or strains. Among the many algorithms that do this, Robert Edgar introduced UPARSE (pronounced 'you-parse') in 2013 for accurate OTU clustering in the face of erroneous or chimeric sequence reads.
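The greedy centroid strategy at the heart of UPARSE-style clustering can be sketched as follows; the identity function here is a crude difflib stand-in for true pairwise alignment, and the chimera filtering that UPARSE also performs is omitted:

```python
from difflib import SequenceMatcher

def identity(a, b):
    # Crude stand-in for alignment-based percent identity.
    return SequenceMatcher(None, a, b).ratio()

def greedy_otu_cluster(reads, threshold=0.97):
    """Greedy centroid clustering: reads (assumed pre-sorted by
    decreasing abundance) join the first centroid they match at
    >= threshold identity; otherwise they seed a new OTU."""
    otus = {}  # centroid sequence -> member reads
    for read in reads:
        for centroid in otus:
            if identity(read, centroid) >= threshold:
                otus[centroid].append(read)
                break
        else:
            otus[read] = [read]
    return otus
```

Processing reads in order of decreasing abundance matters: abundant sequences are more likely to be correct biological templates, so they make better centroids than rare, error-prone reads.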

To stitch the computational analysis steps together, 'quantitative insights into microbial ecology', or QIIME (pronounced 'chime'), from Rob Knight and colleagues offers a user-friendly modular pipeline for amplicon sequence analysis.

Metagenomic community profiling
In shotgun metagenomics approaches, all fragments of genomic DNA in a sample are sequenced and classified. Isidore Rigoutsos and colleagues introduced PhyloPythia in 2007 to assign fragments to higher taxonomic groups or ‘bins’ based on matching the frequency of tetranucleotide sequences with signatures from known taxa. Its faster, open-source successor PhyloPythiaS from Alice McHardy and colleagues came out in 2012.
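The composition signature itself is simple to compute. Here is a minimal sketch of building the tetranucleotide frequency vector for one fragment (PhyloPythia's actual classifier is a support vector machine trained on such vectors, which is not reproduced here):

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the DNA alphabet.
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetranucleotide_profile(seq):
    """Normalized tetranucleotide frequency vector for one DNA fragment:
    the kind of composition signature that PhyloPythia-style binners
    compare against signatures learned from known taxa."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are not in ALL_TETRAMERS
    # and so are excluded from the total.
    total = sum(counts[k] for k in ALL_TETRAMERS) or 1
    return {k: counts[k] / total for k in ALL_TETRAMERS}
```

Two fragments from the same genome tend to have similar vectors, which is what makes composition useful for assigning anonymous fragments to taxonomic bins.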

Arthur Brady and Steven Salzberg also used sequence composition, alone or combined with sequence alignment, in their Phymm and PhymmBL tools in 2009; an expanded version of PhymmBL with confidence scores, custom databases and parallelization followed in 2011.

In 2012, Curtis Huttenhower and colleagues described MetaPhlAn, which limits analysis to clade-specific marker genes to speed up the classification of sequence reads. Peer Bork and colleagues also extracted a limited marker set from metagenomic data in their metagenomic OTUs (mOTU) approach in 2013, but used 40 universally conserved prokaryotic genes. Both methods work best in systems like the human gut that have a large number of sequenced reference genomes.
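The marker-gene shortcut can be illustrated with a toy calculation; the marker names, lengths and hit counts below are invented, and MetaPhlAn's actual estimator is considerably more sophisticated:

```python
def marker_abundance(read_hits, marker_info):
    """Estimate per-clade relative abundance from reads mapped to
    clade-specific markers: sum the hits per clade, normalize by total
    marker length, then rescale so abundances sum to one.
    read_hits:   marker name -> number of mapped reads
    marker_info: marker name -> (clade, marker length in bp)"""
    hits, lengths = {}, {}
    for marker, (clade, mlen) in marker_info.items():
        hits[clade] = hits.get(clade, 0) + read_hits.get(marker, 0)
        lengths[clade] = lengths.get(clade, 0) + mlen
    raw = {clade: hits[clade] / lengths[clade] for clade in hits}
    total = sum(raw.values()) or 1
    return {clade: raw[clade] / total for clade in raw}
```

Normalizing by total marker length means a clade's score reflects coverage rather than raw hit count, so clades with more or longer markers are not overcounted.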

Genomes from mixtures
Earlier this year, Christopher Quince, Anders Andersson and colleagues published an unsupervised binning method called CONCOCT to help reconstruct genomes from mixtures. It uses sequence composition and differential coverage across samples to assign pre-assembled contiguous sequences (contigs) to species or strain bins.

Single-cell sequencing is another way to obtain microbial genomes. Paul Blainey and Stephen Quake discuss challenges and opportunities for single-cell sequencing in a Commentary in our 2014 Method of the Year issue. When cultures are available, long-read single-molecule sequencing can provide very high-quality genome sequences; the HGAP software from Jonas Korlach and colleagues makes this possible using a single Pacific Biosciences sequencing library.

With genomic sequences in hand, there remains the question of how to fit them within an appropriate taxonomy. Peer Bork and colleagues tackled the problem in 2013 with their species identification (SpecI) tool, which bases classification on the same 40 markers as mOTU.

Functional analysis and ecology
An array of tools has been designed to wrest ecological and biological insights from metagenomic sequence data, such as the GENE PRediction IMprovement Pipeline (GenePRIMP) for annotating prokaryotic genomes by Amrita Pati and colleagues in 2010, and the metagenomeSeq method to test for differential microbial abundance across environments or conditions by Mihai Pop and colleagues in 2013 (also see a comment by Bork and colleagues and the authors' reply).
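The flavor of a differential-abundance test can be conveyed with a simple permutation test on made-up abundance values; metagenomeSeq's actual model (a zero-inflated Gaussian that accounts for undersampling) is far more tailored to sparse count data than this sketch:

```python
import random

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sample permutation test on the absolute difference in mean
    abundance: a model-free stand-in for a real differential-abundance
    test. Returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one avoids p = 0
```

Shuffling group labels asks how often a difference this large arises by chance alone, which is exactly the question behind comparing a microbe's abundance across environments or conditions.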

In 2010, Rob Knight and colleagues compared 51 methods for their ability to identify biologically relevant distribution patterns using real and simulated 16S pyrosequencing data from samples that were clustered or assayed along environmental gradients. In 2012, Jack Gilbert and colleagues developed microbial assemblage prediction (MAP), an artificial neural network approach to model microbial community structure across the Western English Channel that combines time course metagenomic data from a single site with bioclimatic data gathered over the entire channel.

Quality control and bias
Generating accurate and robust microbial sequence data requires rigorous benchmarking and controls, and experimental methods are constantly improving. Nikos Kyrpides and colleagues studied the use of simulated data to evaluate metagenomic analysis methods in 2007. In 2010, Philip Hugenholtz and colleagues evaluated two methods to deplete rRNA from metatranscriptomes.

J Gregory Caporaso and colleagues further demonstrated the effect of Illumina read quality on taxonomic assignment and diversity assessment in 2013, and Scott Kelley and colleagues developed SourceTracker software to identify contaminants in microbial sequencing studies.

We look forward to many more contributions in the field of microbial sequencing.


References:
Alice Carolyn McHardy et al.
Accurate phylogenetic classification of variable-length DNA fragments
Nature Methods 4, 63-72 (2007) doi:10.1038/nmeth976

Konstantinos Mavromatis et al.
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
Nature Methods 4, 495-500 (2007) doi:10.1038/nmeth1043

Micah Hamady, Jeffrey J Walker, J Kirk Harris, Nicholas J Gold & Rob Knight
Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex
Nature Methods 5, 235-237 (2008) doi:10.1038/nmeth.1184

Christopher Quince et al.
Accurate determination of microbial diversity from 454 pyrosequencing data
Nature Methods 6, 639-641 (2009) doi:10.1038/nmeth.1361

Arthur Brady & Steven L Salzberg
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models
Nature Methods 6, 673-676 (2009) doi:10.1038/nmeth.1358

J Gregory Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nature Methods 7, 335-336 (2010) doi:10.1038/nmeth.f.303

Jens Reeder & Rob Knight
Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions
Nature Methods 7, 668-669 (2010) doi:10.1038/nmeth0910-668b

He et al.
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
Nature Methods 7, 807-812 (2010) doi:10.1038/nmeth.1507

Amrita Pati et al.
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes
Nature Methods 7, 455-457 (2010) doi:10.1038/nmeth.1457

Justin Kuczynski, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Noah Fierer & Rob Knight
Microbial community resemblance methods differ in their ability to detect biologically relevant patterns
Nature Methods 7, 813-819 (2010) doi:10.1038/nmeth.1499

Patil et al.
Taxonomic metagenome sequence assignment with structured output models
Nature Methods 8, 191-192 (2011) doi:10.1038/nmeth0311-191

Arthur Brady & Steven L Salzberg
PhymmBL expanded: confidence scores, custom databases, parallelization and more
Nature Methods 8, 367-367 (2011) doi:10.1038/nmeth0511-367

Dan Knights et al.
Bayesian community-wide culture-independent microbial source tracking
Nature Methods 8, 761-763 (2011) doi:10.1038/nmeth.1650

Lauren Bragg, Glenn Stone, Michael Imelfort, Philip Hugenholtz & Gene W Tyson
Fast, accurate error-correction of amplicon pyrosequences using Acacia
Nature Methods 9, 425-426 (2012) doi:10.1038/nmeth.1990

Nicola Segata et al.
Metagenomic microbial community profiling using unique clade-specific marker genes
Nature Methods 9, 811-814 (2012) doi:10.1038/nmeth.2066

Peter E Larsen, Dawn Field & Jack A Gilbert
Predicting bacterial community assemblages using an artificial neural network approach
Nature Methods 9, 621-625 (2012) doi:10.1038/nmeth.1975

Robert C Edgar
UPARSE: highly accurate OTU sequences from microbial amplicon reads
Nature Methods 10, 996-998 (2013) doi:10.1038/nmeth.2604

Derek S Lundberg, Scott Yourstone, Piotr Mieczkowski, Corbin D Jones & Jeffery L Dangl
Practical innovations for high-throughput amplicon sequencing
Nature Methods 10, 999-1002 (2013) doi:10.1038/nmeth.2634

Shinichi Sunagawa et al.
Metagenomic species profiling using universal phylogenetic marker genes
Nature Methods 10, 1196-1199 (2013) doi:10.1038/nmeth.2693

Daniel R Mende, Shinichi Sunagawa, Georg Zeller & Peer Bork
Accurate and universal delineation of prokaryotic species
Nature Methods 10, 881-884 (2013) doi:10.1038/nmeth.2575

Chen-Shan Chin et al.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
Nature Methods 10, 563-569 (2013) doi:10.1038/nmeth.2474

Nicholas A Bokulich et al.
Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing
Nature Methods 10, 57-59 (2013) doi:10.1038/nmeth.2276

Joseph N Paulson, O Colin Stine, Héctor Corrada Bravo & Mihai Pop
Differential abundance analysis for microbial marker-gene surveys
Nature Methods 10, 1200-1202 (2013) doi:10.1038/nmeth.2658

Paul C Blainey & Stephen R Quake
Dissecting genomic diversity, one cell at a time
Nature Methods 11, 19-21 (2014) doi:10.1038/nmeth.2783

Johannes Alneberg et al.
Binning metagenomic contigs by coverage and composition
Nature Methods (2014) doi:10.1038/nmeth.3103

Strengthening communities through competition

Community bioinformatics challenges help drive methods development.

Science moves ahead faster as a social enterprise, perhaps especially so in the dynamic area of bioinformatics. Bioinformatics competitions are important opportunities for developers (and users) to come together to define the essential questions in the field and decide on the best metrics to evaluate them. They also perform a critical function in making valuable benchmark datasets available to anyone, including small labs and young students.

Ideally, a challenge ends with a better appreciation of the most promising approaches to a problem, as well as recognition of difficulties and opportunities for future development. And there are new contacts for collaboration. As a reality check, it is less common for researchers with directly competing methods to collaborate; their work depends on a competitive funding model. But complementary approaches provide fertile ground for exploring new ways to attack a problem, and some contests directly encourage collaborative coding.

In our July Editorial, we continue our support of these initiatives, urging participation and an embrace of formats that maximize engagement among participants. Already, measures such as online forums, webinars and conferences involve participants in the planning and interpretation stages, which are critical for getting the most out of each event.

A variety of formats beyond the traditional bake-off are evolving in the collaborative spirit, encouraging more sharing of ideas and code. For example, hackathons take on focused coding challenges in a single dedicated meet-up session, while open-source competitions make code available during the contest so that researchers can learn from each other. These formats are not meant as an evaluation of existing methods but rather to promote new solutions. As Gustavo Stolovitzky of the DREAM challenges points out, publishing code during an event has the potential to encourage 'herding' behavior (copy-the-leader), which can stifle creativity and produce a coding monoculture. A number of DREAM challenges now use a two-stage approach in which top performers from a traditional competition phase are invited back to develop a new and better solution together.

Journals and funders also play a role in supporting these efforts. Nature Methods has published a number of papers resulting from community competitions (CAFA, DREAM, FlowCAP, the particle tracking challenge and RGASP), and the Nature journals have been committed to publishing these papers under a Creative Commons Attribution-NonCommercial-ShareAlike unported license since January 2013.

There are difficulties associated with running large-scale events. The choice of data sets and metrics can bias evaluations toward certain solutions, and the involvement of many developers can water down the conclusions resulting from the challenge. Moreover, usability is often not considered because it is hard to quantify. These issues can be mitigated by boosting participation in decision-making during the planning stages, tailoring conclusions to each scenario that is tested, and having judging panels test the best-performing methods to ensure usability.

We are heartened to see the continued success of community-led competitions and the birth of contests in new areas. In a guest post, we invited organizers of the CAMI competition to announce their upcoming event on metagenome data interpretation.

Below, we provide a non-comprehensive list of some recent and ongoing challenges:

Bake-offs
Assemblathon – genome assembly
CAFA (Critical Assessment of Function Annotation) – protein functional prediction
CAGI (Critical Assessment of Genome Interpretation) – functional variant prediction
CAMI (Critical Assessment of Metagenome Interpretation) – see the announcement
CAPRI (Critical Assessment of PRediction of Interactions) – structure-based protein-protein interaction prediction
CASP (Critical Assessment of protein Structure Prediction) – protein structure prediction since 1994!
DREAM (Dialogue for Reverse Engineering Assessment and Methods) – systems biology challenges with hybrid formats and challenge-assisted review
FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods)
Grand Challenges in biomedical image analysis
Particle tracking challenge
RGASP (RNA-seq Genome Annotation Assessment Project)

Crowdsourcing competitions, hackathons and fast challenges
BioHackathons – open-source programming meetups
InnoCentive – commercial platform offering cash prizes (e.g., the $1 million US Defense Threat Reduction Agency (DTRA) challenge to identify organisms from a stream of DNA sequences)
DNA60IFX – short challenges based on DNA or RNA sequence data
DREAM – a number of recent and current challenges include a collaborative phase of tool development
Neurosynth hackathons – open-source programming meetups in computational neurobiology
Sequence Squeeze – open-source competition for sequence file compression (cash prize)
[topcoder] – variety of computational challenges, with some cash prizes

The Critical Assessment of Metagenome Interpretation (CAMI) competition

Alice McHardy, Alex Sczyrba and Thomas Rattei announce a new initiative for assessing metagenomics methods in this guest post.

Alice McHardy{credit}Folker Meyer{/credit}

Alex Sczyrba{credit}A. Sczyrba{/credit}

Thomas Rattei{credit}Anja Venier{/credit}

In just over a decade, metagenomics has developed into a powerful and productive method in microbiology and microbial ecology. The ability to retrieve and organize bits and pieces of genomic DNA from any natural context has opened a window into the vast universe of uncultivated microbes. Tremendous progress has been made in computational approaches to interpret these sequence data, but none can completely recover the complex information encoded in metagenomes.

A number of challenges stand in the way. Simplifying assumptions are needed, and in practice they lead to strong limitations and potential inaccuracies. Critically, methodological improvements are difficult to gauge owing to the lack of a general standard for comparison. Developers face a substantial burden in individually evaluating existing approaches, which consumes time and computational resources and may introduce unintended biases.

The Critical Assessment of Metagenome Interpretation (CAMI) is a new community-led initiative designed to help tackle these problems by aiming for an independent, comprehensive and bias-free evaluation of methods. We are making extensive high-quality unpublished metagenomic data sets available for developers to test their short-read assembly, binning and taxonomic classification methods. The results of CAMI will provide exhaustive quantitative metrics on tool performance, both to guide users in different scenarios and to help developers identify promising directions for future work.

As a community effort, we encourage feedback from both method developers and users of metagenome analysis tools. The CAMI initiative was one of four discussion threads at the Metagenome Meeting at the Newton Institute in Cambridge this year. Another open discussion with developers and users of computational metagenome methods will take place at a roundtable at the ISME conference in Seoul in August.

We urge developers to participate by registering for the competition on our website and joining our Google+ group to provide feedback on the current design phase. The competition is tentatively scheduled to open at the end of 2014. Key data sets are being generated, and CAMI is currently seeking additional data contributors to provide genomes of deep-branching lineages for data set generation. The results will be presented and discussed in a workshop a few months after the competition. We aim for a joint publication of the generated insights together with all CAMI contest participants and data contributors.

We encourage everyone to get involved and spread the word!

Stephen Quake responds to Lior Pachter

Stephen Quake responds to a blog post by Lior Pachter that reanalyzes data from his recent Analysis of single-cell RNA sequencing methods published in Nature Methods.

In October, we published an Analysis by Quake and colleagues that evaluated a number of single-cell RNA-seq approaches on the basis of their sensitivity, accuracy and reproducibility. In a subsequent blog post, Pachter challenged their data reporting. At issue is whether the failure rate among 96 samples sequenced using the Fluidigm C1 microfluidic instrument should have been presented differently.

We encourage animated discussion of published research and hope that this can serve as a useful forum. In this guest post, Quake responds to Pachter’s blog entry. The views expressed below are solely his and do not necessarily represent those of Nature Methods.

In a recent blog post, Lior Pachter appears to question my scientific integrity and suggest that I unfairly manipulated data in a recent publication on single cell RNAseq.

Pachter has not contacted me directly with his questions nor did he give any warning before publishing his blog post. While I am happy that he is carefully scrutinizing publications and independently re-analyzing primary data, his rather sensationalistic approach to reporting his results in the absence of discussion or peer-review risks doing a disservice to science and adds more heat than light.

Pachter tries to have it both ways – based on our published data he accuses me of 1) wasting effort by sequencing lower quality samples and 2) selectively publishing data from only the better samples. It is hard to see how these accusations can simultaneously both be true. As described in the methods section of our paper, the C1 capture rate is not perfectly efficient and therefore we manually inspected all the chambers. We found 93 chambers had single cells, 1 chamber had two cells, and 2 chambers had no cells. Of the 93 chambers with single cells, 91 of the cells appeared to be alive as measured by a live/dead stain and 2 did not. Our single cell RNAseq experiments included all 91 of the “live” single cells and 1 of the “dead” single cells; the data from the latter was indistinguishable from the former and thus it was included in all further analyses. There was absolutely no selection or manipulation of the data. All of the raw data as well as our R scripts were made available for Pachter and others to download and analyze upon publication of our paper.

The sequencing library prep and workflow that we use is geared around 96 parallel samples and we decided it would be valuable to process control samples in exactly the same batch as the single cell samples. We therefore included four control samples with the single cells: amplification products from a chamber on the chip that did not have a cell (C09, which was unfortunately not given a distinguishing filename during the file upload), a single cell tube amplification, a no template control (NTC, C70) tube experiment that did not have a single cell, and a bulk control sample. Pachter correctly points out that C70 is dominated by the ERCC spike in controls and has essentially no human transcripts as expected; similarly, the other negative control C09 performs very poorly next to the actual single cell data. It is not clear to me why Pachter thinks I should be embarrassed for performing negative control experiments; indeed biochemical amplifiers are known to be so sensitive that there are many stories of contamination that occurs through aerosol dispersal from nearby benches, etc. In our own analyses C09 and the other controls were excluded from the single cell data.

Pachter also noticed that ~3 of the single cell RNAseq experiments have significantly lower quality than the other 89, as measured by fraction of spike-in sequenced or by log-correlation coefficient. Taken at face value, this corresponds to a failure rate of 3/92, or 3%. The experiments therefore had a 97% success rate by this metric, and it is hard to see where his complaint lies. We conservatively included ALL of the single cell data in our analyses, and thus if one follows Pachter’s prescription to analyze only the experiments that he deems “successful”, the results will be even better than we reported.

Finally, Pachter makes a misleading argument concerning the statistical methods used to generate Figure 4a. This figure is concerned with the question of whether an ensemble of single-cell RNAseq experiments produces gene expression values similar to those of a bulk experiment. The reason for sub-sampling to equal depth is the worry of introducing artifacts by comparing two RNAseq experiments of dramatically differing sequencing depth (see, e.g., Cai et al., “Accuracy of RNA-Seq and its dependence on sequencing depth,” BMC Bioinformatics 13, Suppl. 13 (2012) and Tarazona et al., “Differential expression in RNA-seq: a matter of depth,” Genome Research 21, 2213-2223 (2011)). This figure has little to do with estimating the quality of the individual RNAseq experiments.
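For readers curious what sub-sampling to equal depth involves, here is a minimal sketch (illustrative only; the published R scripts contain the actual analysis, and the gene names below are placeholders):

```python
import random

def subsample_counts(counts, depth, seed=0):
    """Draw `depth` reads without replacement from a gene -> read-count
    table, so that two libraries can be compared at equal sequencing
    depth. Illustrative sketch, not the published analysis code."""
    # Expand the count table into one entry per read, then sample.
    pool = [gene for gene, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    sampled = {}
    for gene in rng.sample(pool, depth):
        sampled[gene] = sampled.get(gene, 0) + 1
    return sampled
```

Repeating such draws and averaging removes depth itself as a confounder when comparing expression estimates between libraries.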

Promoting shared hardware design

Now is the time to move open-source hardware development into basic research labs.

Having convinced airport security to let him haul a suspicious-looking briefcase packed with hardware on board, Pete Pitrone, an imaging specialist in the group of Pavel Tomancak, headed for South Africa. His aim was to introduce young students to the parts, many manufactured at his own institute, in the hope they would assemble them into a sophisticated working microscope. It was a symbolic step to demonstrate the potential for building new tools in laboratories and beyond.

The OpenSPIM microscope-in-a-briefcase.
Credit: Vineeth Surendranath

Manufacturing has gained an appealing image of late. In his State of the Union address in February, US President Barack Obama announced the creation of three manufacturing hubs modeled after an institute in Youngstown, Ohio. His comments referenced the ability to innovate quickly with additive manufacturing, which relies on digital design and 3D printing: relatively recent developments that have changed the way physical objects and devices are made and helped to open up the design process.

Taking advantage of these developments at the grassroots level is an enthusiastic crop of do-it-yourself 'builders' or 'hackers' who are promoting a culture of shared design and open innovation, and have spawned a movement towards open-source hardware. Analogous to open-source software, open-source hardware licenses prevent the patenting of hardware designs or physical objects, and require comprehensive and freely accessible designs and instructions so that anyone can build the same device. One working definition and a helpful list of considerations are provided by the Open Source Hardware Association.

The advent of cheap 3D printers such as the RepRap and MakerBot has made it easy, cheap and relatively fast to turn digital designs into objects. 3D printing involves the layered deposition of a heated polymer through a precisely positioned moving extruder. Open-source electronics from the likes of Arduino and Raspberry Pi, which let software control hardware, are also making it easier to manufacture sophisticated devices.

In our July editorial, we argue that basic research shares the values of openness and reproducibility embodied by open-source hardware. Beyond making research tools easier and cheaper to build and replicate, developing devices in an open-source environment can actually speed innovation by encouraging community feedback early in development. This can make the work that goes into extensive documentation and robustness testing worthwhile for the individual research group.

Open-source differs from traditional design in its focus:

  • open-source tools are specifically designed for others to build and modify them
  • open-source tools must include extensive documentation, including parts lists, any related software code (also published as open-source) and design files
  • the focus on reproducibility encourages simple, streamlined design
  • modularity and integration are ultimate goals

Many in the design field have said that this focus actually improves designs and promotes the broadest uptake.

Applied areas such as photovoltaics and hardware infrastructure for cloud computing are investing in open-source approaches, but there are currently very few examples of open-source hardware in basic research. OpenPCR publishes designs for a thermal cycler, which can be purchased as an inexpensive kit and assembled by hand. Some labs simply use 3D printing to generate teaching models and basic equipment such as test tube racks (e.g., the DeRisi lab).

Pitrone teaching South African students how to assemble an OpenSPIM scope.
Credit: P. Tomancak

The July issue of Nature Methods includes two leading examples of academic efforts in this direction, OpenSPIM and OpenSpinMicroscopy, which include detailed designs for light-sheet microscopes that can take 3D movies of living things. Parts for the OpenSPIM scope were hiding in the briefcase en route to an EMBO course organized by Musa Mhlanga along with Freddy Frischknecht and Jost Enninga in Pretoria. A highlight, according to Pavel Tomancak, was watching talented high school students assemble and successfully operate the scope in under two hours. The availability of design details and the focus on making the scope buildable are a model of accessibility with ramifications for teaching and outreach, encouraging many others to play around with the hardware.

Hardware innovation is a critical part of the technological advances that drive science. To carry out experiments, many research laboratories need tools that are simply not available, cost too much or will take too long to develop commercially. An open-source approach can lower the barriers to adopting, disseminating and ultimately improving tools for research.