Some resources and tools related to noncoding RNAs

In ‘Meet some code-breakers of noncoding RNAs,’ the technology feature in the February issue of Nature Methods, we speak with a few scientists about the path ahead in methods for characterizing noncoding RNAs.

With their input, we compiled a list of some resources and tools in this field.

We can gladly include additional resources. Please comment on this page. You can also tweet us: @naturemethods or @metricausa

Some resources and tools related to noncoding RNAs:

 

Resource Description Publication
DASHR Database of small human noncoding RNAs

Leung, Y.Y. et al. DASHR: database of small human noncoding RNAs. Nucleic Acids Res. 44:D216-22 (2016).

FANTOM CAT Functional Annotation of the mammalian genome (FANTOM) is an international consortium.

This resource is an atlas of human long noncoding RNAs with accurate 5’ ends

 

 

Chung-Chau, H. et al Annotation of noncoding transcripts for example to find functional lncRNAs that show an effect on global expression after knockout/knockdown Nature 543,  199–204  (2017).

Okazaki, Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420(6915):563-73 (2002).

Gencode Resource about human and mouse noncoding RNAs, drawing on data generated by the Encyclopedia of DNA Elements (ENCODE) consortium. Information about the noncoding RNA species and their annotations is here. Harrow J, et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Research doi: 10.1101/gr.135350.111 (2012).
LNCipedia Database of annotations of  functional long noncoding RNAs manually curated from the scientific literature Clark MB, et al. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res 39: D146-151 (2011).
lncRNAdb Database of annotations of functional long noncoding RNAs manually curated from the scientific literature Amaral, P.P. et al. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res 39: D146-151 (2011).
lncRNAWiki A Wiki to encourage community-based curation of human long noncoding RNAs. Ma, L. et al. LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs. Nucleic Acids Research 43(D1), D187–D192 (2015).

 

lncRNAtor A portal for long noncoding RNA with information such as expression profiles and coding potential. Data sources include TCGA, GEO, ENCODE and modENCODE. Park, C. et al. lncRNAtor: a comprehensive resource for functional investigation of long non-coding RNAs. Bioinformatics. 30(17):2480-5. (2014).
MINTbase Database of tRNA fragments from 11,000 people and 32 cancer types Pliatsika, V. et al. Nucleic Acids Res. 46, D1, D152–D159 (2018).
miRBase Database of published miRNA sequences and annotations Griffiths-Jones S. et al. Nucleic Acids Res. 36, D154-158 (2008).
miRDip A resource with human data; for finding microRNAs that target a gene; or genes targeted by a microRNA Tokar, T. et al mirDIP 4.1- integrative database of human microRNA target predictions, Nucleic Acids Res. 46(D1):D360-D370. (2018).
miRGeneDB A database of validated and annotated human microRNA genes Fromm, B. et al. MirGene DB2.0: the curated microRNA Gene Database, manuscript in bioRxiv. doi: https://doi.org/10.1101/258749
Noncode A noncoding RNA database with information from 17 species, with an emphasis on long noncoding RNAs. The information is mined from the scientific literature and data resources such as lncRNAdb and lncipedia.

It includes links to literature about tools such as ncFANs for functional annotation of lncRNAs.

Liu C, et al. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Research 33 (Database issue):D112-D115 (2005).
Regulome resources and data Resources and data from the Center for Personal Dynamic Regulomes, including the ATAC-seq protocol and transcriptional landscape data from 13 cell types from healthy people and 3 cell types from people afflicted by leukemia. Corces MR, et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature Genetics 48(10):1193-203 (2016).
RNAcentral Resource hosted at the European Bioinformatics Institute that draws on a number of other database resources, such as

LncBase

This resource includes, for example, a database of experimentally supported miRNA:gene interactions and analysis tools and pipelines such as for miRNA pathway analysis

snOPY

snoRNA orthological gene database with information about snoRNAs, snoRNA gene loci and target RNAs.

TarBase

Manually curated experimentally validated miRNA-gene interactions

 

 Tools 
miRDeep
miRDeep2
Tools for miRNA identification from RNA-seq data An, J. et al. miRDeep*: an integrated application tool for miRNA identification from RNA sequencing data. Nucleic Acids Res. 41(2):727-37 (2013).

Friedländer MR et al. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res. 40(1):37-52. (2012)

MiRNA prediction tool miRNA prediction Miranda, KC et al. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203-1217 (2006).
 OASIS  Small non-coding RNA detection and expression analysis tool Capece, V. et al. Oasis: online analysis of small RNA deep sequencing data. Bioinformatics 31, 2205–2207 (2015).
Datasets
Analysis of 13 cell types; expression of primate and tissue-specific microRNAs Human miRNAs, their targets, and visualization of the loci on the human genome browser Londin, E. et al. Analysis of 13 cell types reveals evidence for the expression of numerous novel primate- and tissue-specific microRNAs. Proc. Natl. Acad. Sci. U.S.A. 112(10):E1106-15 (2015).

Sources: H. Chang, Stanford University School of Medicine; Rory Johnson, University of Bern; E. Marshall, BC Cancer Agency; M. Turner, Babraham Institute; U. Ohler, Max Delbrück Center for Molecular Medicine; I. Rigoutsos, Philadelphia University + Thomas Jefferson University; Nature Research.

 

 

XFEL projects, tools, data portals

Earlier this year, the EuXFEL’s first laser beam reached the ‘hutch’.{credit}Jessica Mancuso{/credit}

As of September 1, the European X-ray free-electron laser (EuXFEL) is ready for the research community’s experiments; the user page is here.

In the September issue of Nature Methods we present some of the experimental ideas researchers are exploring in that facility.

Other XFELs are operational or in the works: FERMI facility, Linac Coherent Light Source at Stanford (LCLS), Pohang Accelerator Laboratory (PAL) X-ray Free-Electron Laser, SPring-8 Angstrom Compact Free Electron Laser (SACLA), Swiss Free-Electron Laser (SwissFEL).

One day, there might even be an XFEL that fits on a table-top (see below). The day is already here when scientists need to analyze mountains of XFEL data. The EuXFEL will likely make those mountains grow in height. There are tools for that and likely more tools to come (see far below for a list of some tools).

Tabletop XFEL

To complement the large XFEL facilities, a number of research groups are developing benchtop XFELs (refs. 1, 2). Such projects involve miniaturization of all aspects of the technology, including the accelerator. Some groups explore ways to pass a laser through plasma to produce bright, high-energy, short-pulsed beams. Separately, some researchers use a terahertz generator, which can provide sufficiently high pulse energies, says Franz Kärtner, a physicist at the University of Hamburg who also holds an appointment at MIT. His team, along with Petra Fromme of Arizona State University, is developing such a tabletop XFEL instrument.

The scientists would like to use the instrument for coherent diffractive imaging and spectroscopy experiments on photosystem II, a protein complex involved in photosynthesis. Their compact XFEL approach, which will use a terahertz generator, lasers and nonlinear optics, is calculated to achieve photon energies between 10 and 12 keV, hard X-rays that can be harnessed for imaging at atomic resolution, says Kärtner.

Although this compact XFEL will generate fewer photons per shot than a large-scale FEL (10^6 to 10^9 photons per shot as opposed to 10^12 photons or more), the machine will be able to produce very short pulses, on the order of 0.5 femtoseconds. That is 10–100 times shorter than current FEL pulses. And if that comes to be, says Kärtner, the peak power of the instrument may be almost on par with that of an XFEL.
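The trade-off between photon count and pulse length can be checked with back-of-the-envelope arithmetic. The sketch below assumes that peak power is roughly pulse energy divided by pulse duration, uses 12 keV photons for both machines, and takes 5 to 50 fs for the large-facility pulse (10 to 100 times 0.5 fs). The function name and these simplifications are illustrative assumptions, not figures from Kärtner's team.

```python
# Rough peak-power estimate: (photons per shot * energy per photon) / pulse duration.
# Assumption: a flat-top pulse, so peak power ~ pulse energy / duration.

EV_TO_JOULE = 1.602176634e-19  # exact SI value of one electronvolt in joules


def peak_power_watts(photons_per_shot, photon_energy_kev, pulse_fs):
    """Approximate peak power in watts for one X-ray pulse."""
    pulse_energy_j = photons_per_shot * photon_energy_kev * 1e3 * EV_TO_JOULE
    return pulse_energy_j / (pulse_fs * 1e-15)


# Compact source at the upper end quoted in the text: 10^9 photons in 0.5 fs.
compact = peak_power_watts(1e9, 12, 0.5)
# Large facility: 10^12 photons in a pulse 10 to 100 times longer (5 to 50 fs).
large_short = peak_power_watts(1e12, 12, 5)
large_long = peak_power_watts(1e12, 12, 50)

print(f"compact: {compact:.2e} W")
print(f"large:   {large_long:.2e} to {large_short:.2e} W")
print(f"ratio large/compact: {large_long / compact:.0f}x to {large_short / compact:.0f}x")
```

Under these assumptions, the thousandfold gap in photons per shot shrinks to a factor of roughly 10 to 100 in peak power, because the shorter pulse concentrates the energy in time.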

A terahertz accelerator module for a table-top XFEL in the making{credit}DESY/Heiner Müller-Elsner{/credit}

In the instrument’s terahertz-driven accelerator there will be acceleration gradients between 500 MV/m and 1 GV/m. It’s this high frequency that helps to compress electron bunches over short distances and that will let the developers use compact electron guns. In this fashion, they will be able to shoot a coherent electron beam directly from a gun and emit an X-ray beam much like a FEL, says Kärtner.

Tabletop XFELs fill an important experimental gap between Röntgen’s X-ray tube and the large-scale FELs. “There is nothing in between,” says Kärtner. That’s akin to a situation in optical science in which researchers need to choose between a light bulb and a large-scale optical laser such as the one at the National Ignition Facility at Lawrence Livermore National Laboratory. If the developers of the compact XFEL succeed at packing enough photons into each shot, the instrument will have potential applications in many fields, he says, including enhanced characterization of materials or higher-resolution medical and structural biology imaging.

  1.  Kneip, et al. Nat. Phys. 6, 980–983 (2010).
  2.  Kärtner, F. X. et al. Nucl. Instrum. Methods Phys. Res. A 829, 24–29 (2016).

Data mountains

XFEL-based experiments produce mountains of data. At EuXFEL, there are two two-dimensional pixel detectors, which will each deliver 10-40 gigabytes of data every second of an experiment.
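To put those rates in perspective, the following sketch converts them into a total volume. It assumes continuous acquisition for one hour and decimal units (1 TB = 1000 GB); both are simplifying assumptions, since real experiments do not record without pause, and the function name is invented for illustration.

```python
# Data volume from detectors streaming at a constant rate.
# Assumption: decimal units (1 TB = 1000 GB) and uninterrupted acquisition.

def data_volume_tb(rate_gb_per_s, seconds, n_detectors=2):
    """Total data in terabytes produced by n_detectors at rate_gb_per_s each."""
    return rate_gb_per_s * seconds * n_detectors / 1000.0


one_hour = 3600
low = data_volume_tb(10, one_hour)
high = data_volume_tb(40, one_hour)
print(f"One hour of acquisition: {low:.0f} to {high:.0f} TB")
```

With two detectors, an hour of uninterrupted recording would amount to roughly 72 to 288 terabytes under these assumptions.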

Experimental data will be housed in the facility’s online systems and then moved to offline disk-based systems, also at the facility, where researchers can access and analyze it, says Filipe Maia, a biophysicist at Uppsala University.

The data torrent makes for “a daunting problem,” says Maia, “and currently there’s clearly a lack of user friendly tools.” This issue is a general trend and not unique to XFEL-based research, but researchers are getting better at handling datasets as they familiarize themselves with the increasingly available tools. He hopes that, after publication, XFEL data will be transferred to an online repository and shared with the community. One such resource is the Coherent X-ray Imaging Data Bank (CXIDB), which he built.


Here are some tools for analyzing and managing XFEL data:

Resource Description
CASS-CFEL-ASG Suite of tools for real-time monitoring of XFEL experiments, data analysis and visualization, raw data correction, crystal hit finding.
cctbx.xfel Suite of tools for processing measurements made during SFX experiments at an XFEL. Built on Computational Crystallographic Toolbox.
CrystFEL Software suite for processing SFX data.
Cheetah Data analysis and high-throughput data reduction tools for SFX data.
Condor Simulation of Flash X-ray imaging to help solve structures without needing crystallization.
Dragonfly Software/algorithm for single-particle imaging with XFELs.
Hummingbird Real-time monitoring of XFEL experiments.
Hawk Package for analyzing and phasing diffraction patterns from single particle-based experiments.
IOTA Spot-finding software for XFEL-based diffraction images. Part of the cctbx.xfel suite.
OnDA Real-time monitoring and data analysis of XFEL experiments.
psana A data analysis framework at LCLS.
SACLA analysis framework Real-time data processing pipeline at SACLA for serial femtosecond crystallography; it uses modified Cheetah and CrystFEL.
WavePropaGator Software framework for simulating XFEL experiments.
XATOM Software calculating and simulating X-ray atom interaction. Part of the software package Xraypac.
XMDYN Simulation tool for modeling dynamics of matter that is exposed to high-intensity X-rays. Part of the software package Xraypac.
Resources and Portals
Coherent X-ray Imaging Data Bank (CXIDB) A database for coherent X-ray imaging experiments.
LCLS data analysis Data analysis resources at LCLS.
Protein Data Bank (PDB) Data repository for protein structures.
SIMEX A project that aims to develop an experimental simulation platform for use at XFELs.

Sources: Henry Chapman, DESY; Janos Hajdu, Filipe Maia, Uppsala University; Sébastien Boutet, LCLS

LCLS: Stanford Linac Coherent Light Source; Linac Coherent Light Source, SLAC National Accelerator Laboratory (formerly named Stanford Linear Accelerator Center)
SACLA: SPring-8 Angstrom Compact Free Electron Laser
SFX: Serial Femtosecond Crystallography
XFEL: X-ray free-electron laser

Computable sugars: some computational resources in glycoscience

Glycoscience is sweet science{credit}PhotoDisc/ Getty Images{/credit}

As glycoscience advances, labs will increasingly want to ask questions about glycosylation sites on a protein or the structure of a sugar, says Raja Mazumder, a bioinformatician at George Washington University. They might ask, for example: are there glycosyltransferases that are expressed in liver but not in the heart? Or, which ones are overexpressed by a factor of three in more than two cancers? Such questions require infrastructure building, he says, because right now there is no mechanism to allow such queries. But he and others are building such capabilities. Mazumder, along with William York at the University of Georgia, is starting to build a glycoscience informatics portal.
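Queries of this kind can be pictured with a toy example. The sketch below is not the portal's interface, which is described here as still being built. The gene names, expression values, thresholds and function names are all invented for illustration.

```python
# Toy expression table: all gene names and values are invented for illustration.
tissue_expression = {
    "GT_A": {"liver": 12.0, "heart": 0.0},
    "GT_B": {"liver": 8.5, "heart": 7.9},
    "GT_C": {"liver": 0.0, "heart": 3.2},
}

# Fold change of tumor versus matched normal tissue, per cancer type (invented).
cancer_fold_change = {
    "GT_A": {"cancer_1": 3.4, "cancer_2": 1.1, "cancer_3": 0.9},
    "GT_B": {"cancer_1": 4.0, "cancer_2": 3.1, "cancer_3": 5.2},
    "GT_C": {"cancer_1": 3.0, "cancer_2": 3.5, "cancer_3": 1.0},
}


def expressed_in_but_not(table, present, absent, threshold=1.0):
    """Genes at or above threshold in `present` and below it in `absent`."""
    return sorted(
        gene for gene, levels in table.items()
        if levels[present] >= threshold and levels[absent] < threshold
    )


def overexpressed_in_more_than(table, fold=3.0, min_cancers=2):
    """Genes with fold change >= fold in more than min_cancers cancer types."""
    return sorted(
        gene for gene, changes in table.items()
        if sum(value >= fold for value in changes.values()) > min_cancers
    )


print(expressed_in_but_not(tissue_expression, "liver", "heart"))
print(overexpressed_in_more_than(cancer_fold_change))
```

In a real system the two tables would come from separate databases, which is why shared identifiers and ontologies matter for joining them.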

Mazumder wants to leverage existing ontologies in the developer community in order to build systems that can be queried on a large scale. For example, Mazumder is working with Cathy Wu at Georgetown University, who is developing the Protein Ontology. Such ontologies are collected, for example, by the non-profit OBO Foundry. To allow flexible querying, the computational resources will draw on different ontologies: ones that relate to glycans, genes, proteins, tissues, diseases and more.

Ontologies are part of the team’s effort to build application programming interfaces (APIs) that expose the data in a given database to incoming queries. Given how complex sugars are, the informatics framework has to be well-organized for both human and machine-based querying, says Mazumder.

When using the resource, a researcher will receive results that also document the search process itself such as the version of the queried database. “You need to be able to tell where you got that information from,” says Mazumder. Tracking data provenance matters especially in an age when databases continuously integrate information emerging in the literature.

For the Food and Drug Administration, Mazumder is developing computational standards for high-throughput sequencing, which he also wants to apply to glycoscience. His ‘biocompute object’ captures the computational workflow a lab might have used to generate results: the software used, the databases queried and their versions, and identifiers of data inputs and outputs. These biocompute objects are intended to help regulatory scientists interpret submitted work. They can also help scientists generally see if, for example, the version of software they used worked as it should, says Mazumder.
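As a rough illustration of what such a record might hold, the sketch below captures the elements listed above (software, databases and versions, input and output identifiers) in a small Python data structure. It is not the biocompute object specification; every class name, field name and value is a hypothetical stand-in.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ToolRecord:
    name: str
    version: str


@dataclass
class DatabaseRecord:
    name: str
    version: str


@dataclass
class WorkflowRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    tools: List[ToolRecord] = field(default_factory=list)
    databases: List[DatabaseRecord] = field(default_factory=list)
    input_ids: List[str] = field(default_factory=list)
    output_ids: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = WorkflowRecord(
    tools=[ToolRecord("example-aligner", "1.2.0")],
    databases=[DatabaseRecord("example-glycan-db", "2018-01")],
    input_ids=["input:sample-001"],
    output_ids=["output:result-001"],
)
print(record.to_json())
```

Serializing the record alongside the results is one simple way to let a reader see which database version a given answer came from.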

Too often labs use computational tools without benchmarking them, says Mazumder. “It would be unthinkable for a wet-lab scientist to not have a positive and negative control,” he says.  In informatics, developers benchmark their software but users often do not have these habits. “They don’t even know: if I don’t find anything, is it because my software did not run well or not?”

As labs move to big data analysis in genomics and also, eventually, in glycoscience, this aspect is ever more important, says Mazumder. In his view, biocompute objects will help glycobiology researchers communicate with one another about their results, such as where on a protein they found a sugar with a given structure. More generally, it will help glycoscientists to have a better way to connect the available sugar resources as they pursue their questions of interest.


Here are some resources that glycoscientists can tap into:

 Category Resource Description
General resources and funding information
Transforming Glycoscience: A Roadmap for the Future Report by the National Research Council of the National Academies of Science
NIH Common Fund program in glycoscience  Funding opportunities from the NIH Common Fund program in glycoscience
A roadmap for Glycoscience In Europe by BBSRC, EGSF, European Science Foundation   Glycoscience roadmap for Europe
GlycoNet Resources related to glycoscience research in Canada, based at the University of Alberta where the Alberta Glycomics Centre is located
National Center for Functional Glycomics A Glycomics-related Biomedical Technology Resource Center based at Beth Israel Deaconess Medical Center, Harvard Medical School with resources on, for example, microarrays and microarray services, protocols, training and databases
Databases and  portals 
CAZy Carbohydrate-Active Enzymes, a database of enzyme families that degrade, modify or create glycosidic bonds
Consortium for Functional Glycomics Resources and glycoscience data. Part of the National Center for Functional Glycomics.
ExPASy Software tools and databases to simulate, predict and visualize glycans, glycoproteins and glycan-binding proteins
Glycan Library  A list of lipid-linked sequence-defined glycan probes
Glyco3D A portal for structural glycoscience
GlycoBase 3.2 A database of N- and O-linked glycan structures with HPLC, UPLC, exoglycosidase sequencing and mass spectrometry data
GlycoPattern Portal for glycan array experimental results from the Consortium for Functional Glycomics
Glycosciences.de Collection of databases and tools in glycoscience
GlyToucan Repository for glycan structures based in Japan
MatrixDB A database of experimental data of interactions by proteoglycans, polysaccharides and extracellular matrix proteins
Repository of Glyco-enzyme expression constructs University of Georgia Complex Carbohydrate Research Center repository for glyco-enzyme constructs
SugarBind A database of carbohydrate sequences to which bacteria, toxins and viruses adhere
UniCarbKB A resource curated by scientists in five countries. It includes GlycoSuiteDB, a database of glycan structures; EUROCarbDB, an experimental and structural database; and UniCarb-DB, a mass spec database of glycan structures
Software tools
CASPER Web-based tool to calculate NMR chemical shifts of oligo- and polysaccharides
Glycan Builder An online tool at ExPASy for predicting possible oligosaccharide structures on proteins
GlycoMiner/GlycoPattern Software tools to automatically identify mass spec spectra of N-glycopeptides
GlyMAP An online resource for mapping glyco-active enzymes
NetOGlyc Software tool for predicting O-glycosylation sites on proteins
SweetUnityMol Molecular visualization software

Sources: NIH, R. Mazumder, George Washington University; New England Biolabs, Thermo Fisher Scientific, Nature Research

An archive for raw EM data

Earlier this week we published a Correspondence describing EMPIAR, a public archive for raw 2D electron microscopy (EM) image data.

While the established Electron Microscopy Data Bank (EMDB) hosts the 3D EM map data required by most journals for publication, the EM community has long been calling for an archive to host the raw 2D image data underlying the 3D maps, as highlighted in our Method of the Year 2015 feature. EMPIAR, a pilot project from the Protein Data Bank in Europe (PDBe), now fills this need.

At Nature Methods we support this archive as a welcome development in the rapidly growing 3D EM field that will enhance transparency and reproducibility and facilitate the development and refinement of data analysis tools. Though we do not require that our authors deposit their 2D EM image data in EMPIAR, we do encourage it. We urge researchers to make use of the archive and provide feedback to the developers in order to ensure that it is meeting the needs of the field.

Any interested readers without a subscription or site license may read the full text of the Correspondence here.

Microbial sequencing at Nature Methods

Over the years, Nature Methods has published many methods to generate and analyze complex sequence data for microbial studies. We cover highlights from our papers below.

Carl Woese set the stage for a molecular taxonomy of microbial life in 1977 by demonstrating that 16S ribosomal RNA could form the basis of prokaryotic classification. Amplifying markers such as 16S from microbial mixtures really took off with the advent of high-throughput sequencing, which provided a way to rapidly profile communities sampled directly from the environment. Shotgun sequencing approaches are used more and more for taxonomic profiling as well, enabling gene and genomic sequences to be reconstructed for the functional characterization of communities.

Amplicon-based community profiling
The 454 pyrosequencing platform originally dominated efforts to study the 16S locus because of its long sequence reads. In 2008, Rob Knight and colleagues described the use of error-correcting barcodes for pyrosequencing hundreds of samples together. Then in 2013, Jeffery Dangl and colleagues took barcoding to a new level by tagging every template molecule during library prep on the Illumina platform in order to remove much of the PCR bias and error introduced during amplification.
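The principle behind error-correcting barcodes is that sample tags are chosen far enough apart that a read with a sequencing error in its tag can still be traced to one sample. The sketch below shows generic nearest-neighbour decoding by Hamming distance; it is not the published barcode design, and the barcode sequences, function names and one-error tolerance are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(
        hamming(a, b)
        for i, a in enumerate(barcodes)
        for b in barcodes[i + 1:]
    )


def assign_barcode(observed, barcodes, max_errors=1):
    """Return the unique barcode within max_errors of observed, else None."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_errors]
    return hits[0] if len(hits) == 1 else None


# Invented 8-base barcodes; any two differ at 3 or more positions,
# so a single substitution error still points to one barcode only.
BARCODES = ["AACCGGTT", "ACGTACGT", "TTGGCCAA", "GATCGATC"]

print(min_pairwise_distance(BARCODES))
print(assign_barcode("AACCGGTA", BARCODES))  # one substitution in the first barcode
print(assign_barcode("AAAAAAAA", BARCODES))  # too far from every barcode
```

A minimum pairwise distance of at least 3 is what makes single-error correction unambiguous: a tag with one error sits at distance 1 from its true barcode and at least 2 from every other.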

On the computational side, Christopher Quince and colleagues presented PyroNoise in 2009 for ‘denoising’ or removing errors from pyrosequencing flowgrams. Jens Reeder and Rob Knight followed a year later with Denoiser, a fast heuristic alternative. Gene Tyson and colleagues moved away from flowgrams with their Acacia software, which corrects sequence files directly and can also work on Ion Torrent data due to its similar error profile containing homopolymeric repeats.

Once cleaned up, marker sequences need to be grouped into ‘operational taxonomic units’ (OTUs) that roughly correspond to genera, species or strains. Among many algorithms that do this, Robert Edgar introduced UPARSE (we realized that there is some ambiguity but it is pronounced YOU-parse) in 2013 for accurate OTU clustering in the face of erroneous or chimeric sequence reads.
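A common way to form OTUs is greedy centroid clustering, in which reads are visited from most to least abundant and each read either joins an existing cluster or founds a new one. The sketch below is a simplified illustration and not UPARSE itself: it assumes reads are already aligned and of equal length, uses a 97% identity threshold as an assumption, and does no chimera filtering. The function names and toy reads are invented.

```python
def identity(a, b):
    """Fraction of matching positions; assumes pre-aligned, equal-length reads."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads_with_counts, threshold=0.97):
    """Greedy centroid clustering, most abundant reads first.

    reads_with_counts maps each unique read to its abundance.
    Returns a dict mapping centroid sequence -> total abundance.
    """
    otus = {}
    ordered = sorted(reads_with_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for read, count in ordered:
        for centroid in otus:
            if identity(read, centroid) >= threshold:
                otus[centroid] += count
                break
        else:
            otus[read] = count  # no centroid is close enough: start a new OTU
    return otus


base = "ACGT" * 10         # 40-base toy read
one_off = "T" + base[1:]   # 1 mismatch with base: 97.5% identity
far = "TTTT" + base[4:]    # 3 mismatches with base: 92.5% identity

otus = greedy_otus({base: 50, one_off: 3, far: 10})
for centroid, total in otus.items():
    print(centroid, total)
```

Here the rare one-mismatch read is absorbed into the abundant centroid, as a likely sequencing error would be, while the more divergent read founds its own OTU.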

To stitch the computational analysis steps together, ‘quantitative insights into microbial ecology’, or QIIME (pronounced chime) from Rob Knight and colleagues offers a user-friendly modular pipeline for amplicon sequence analysis.

Metagenomic community profiling
In shotgun metagenomics approaches, all fragments of genomic DNA in a sample are sequenced and classified. Isidore Rigoutsos and colleagues introduced PhyloPythia in 2007 to assign fragments to higher taxonomic groups or ‘bins’ based on matching the frequency of tetranucleotide sequences with signatures from known taxa. Its faster, open-source successor PhyloPythiaS from Alice McHardy and colleagues came out in 2012.
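How a composition-based signature might be computed can be sketched as follows. PhyloPythia's actual classifier is not reproduced here; the nearest-signature rule using Euclidean distance, the function names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 tetranucleotides


def tetra_profile(seq):
    """Relative frequency of each tetranucleotide, in KMERS order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # ignores windows with N or other symbols
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def distance(p, q):
    """Euclidean distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_bin(fragment, signatures):
    """Name of the reference signature closest to the fragment's profile."""
    profile = tetra_profile(fragment)
    return min(signatures, key=lambda name: distance(profile, signatures[name]))


# Invented reference sequences standing in for signatures of known taxa.
signatures = {
    "at_rich": tetra_profile("ATATTTAAATATTAAT" * 5),
    "gc_rich": tetra_profile("GCGCCGGGCGCCGGCC" * 5),
}
print(nearest_bin("TTATAATATTTA", signatures))
```

Real fragments are far longer than these toy strings; short fragments give noisy profiles, which is one reason composition-based binning is harder for short reads.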

Arthur Brady and Steven Salzberg also used sequence composition in Phymm, and combined it with sequence alignment in PhymmBL, in 2009; their ‘PhymmBL expanded’ update, which adds functionality such as confidence scores, custom databases and parallelization, came out in 2011.

In 2012, Curtis Huttenhower and colleagues described MetaPhlAn, which limits analysis to clade-specific marker genes to speed up the classification of sequence reads. Peer Bork and colleagues also extracted a limited marker set from metagenomic data in their metagenomic OTUs (mOTU) approach in 2013, but used 40 universally conserved prokaryotic genes. Both methods work best in systems like the human gut that have a large number of sequenced reference genomes.

Genomes from mixtures
Earlier this year, Christopher Quince, Anders Andersson and colleagues published an unsupervised binning method called CONCOCT to help reconstruct genomes from mixtures. It uses sequence composition and differential coverage across samples to assign pre-assembled contiguous sequences (contigs) to species or strain bins.
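The idea behind differential coverage is that contigs from the same genome should rise and fall together in abundance across samples. The sketch below shows only that intuition, using Pearson correlation on invented coverage values. It is not CONCOCT's algorithm, which also uses sequence composition and fits a statistical model, and the contig names and numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Mean read coverage of each contig in four samples (invented values).
coverage = {
    "contig_1": [10.0, 40.0, 5.0, 22.0],
    "contig_2": [12.0, 43.0, 6.0, 20.0],
    "contig_3": [30.0, 4.0, 35.0, 8.0],
}

same_genome = pearson(coverage["contig_1"], coverage["contig_2"])
other_genome = pearson(coverage["contig_1"], coverage["contig_3"])
print(f"contig_1 vs contig_2: {same_genome:.2f}")
print(f"contig_1 vs contig_3: {other_genome:.2f}")
```

Contigs whose coverage profiles track each other closely are candidates for the same bin; more samples give longer profiles and therefore better separation.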

Single-cell sequencing is another way to obtain microbial genomes. Paul Blainey and Stephen Quake discuss challenges and opportunities for single-cell sequencing in a Commentary in our Method of the Year issue in 2014. When cultures are available, long-read single-molecule sequencing technology can provide very high quality genome sequences; the HGAP software from Jonas Korlach and colleagues makes this possible using a single Pacific Biosciences sequencing library.

With genomic sequences in hand, there remains the question of how to fit them within an appropriate taxonomy. Peer Bork and colleagues tackled the problem in 2013 with their species identification (SpecI) tool, which bases classification on the same 40 markers as mOTU.

Functional analysis and ecology
An array of tools has been designed to wrest ecological and biological insights from metagenomic sequence data, such as the GENE PRediction IMprovement Pipeline (GenePRIMP) for annotating prokaryotic genomes by Amrita Pati and colleagues in 2010 and the metagenomeSeq method to test for differential microbial abundance across environments or conditions by Mihai Pop and colleagues in 2013 (also see a comment by Bork and colleagues and the authors’ reply).

In 2010, Rob Knight and colleagues compared 51 methods for their ability to identify biologically relevant distribution patterns using real and simulated 16S pyrosequencing data from samples that were clustered or assayed along environmental gradients. In 2012, Jack Gilbert and colleagues developed microbial assemblage prediction (MAP), an artificial neural network approach to model microbial community structure across the Western English Channel that combines time course metagenomic data from a single site with bioclimatic data gathered over the entire channel.

Quality control and bias
Generating accurate and robust microbial sequence data requires rigorous benchmarking and controls, and experimental methods are constantly improving. Nikos Kyrpides and colleagues studied the use of simulated data to evaluate metagenomic analysis methods in 2007. In 2010, Philip Hugenholtz and colleagues evaluated two methods to deplete rRNA from metatranscriptomes.

J Gregory Caporaso and colleagues further demonstrated the effect of Illumina read quality on taxonomic assignment and diversity assessment in 2013, and Scott Kelley and colleagues developed SourceTracker software to identify contaminants in microbial sequencing studies.

We look forward to many more contributions in the field of microbial sequencing.

 

References:
Alice Carolyn McHardy et al.
Accurate phylogenetic classification of variable-length DNA fragments
Nature Methods 4, 63-72 (2007) doi:10.1038/nmeth976

Konstantinos Mavromatis et al.
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
Nature Methods, 4 (6), pp. 495-500 (2007) doi:10.1038/nmeth1043

Micah Hamady, Jeffrey J Walker, J Kirk Harris, Nicholas J Gold & Rob Knight
Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex
Nature Methods 5, 235-237 (2008) doi:10.1038/nmeth.1184

Christopher Quince et al.
Accurate determination of microbial diversity from 454 pyrosequencing data
Nature Methods 6, 639-641 (2009) doi:10.1038/nmeth.1361

Arthur Brady & Steven L Salzberg
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models
Nature Methods 6, 673-676 (2009) doi:10.1038/nmeth.1358

J Gregory Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nature Methods 7, 335-336 (2010) doi:10.1038/nmeth.f.303

Jens Reeder & Rob Knight
Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions
Nature Methods 7, 668-669 (2010) doi:10.1038/nmeth0910-668b

He et al.
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
Nature Methods 7, 807-812 (2010) doi:10.1038/nmeth.1507

Amrita Pati et al.
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes
Nature Methods 7, 455-457 (2010) doi:10.1038/nmeth.1457

Justin Kuczynski,  Zongzhi Liu,  Catherine Lozupone,  Daniel McDonald,  Noah Fierer &  Rob Knight
Microbial community resemblance methods differ in their ability to detect biologically relevant patterns
Nature Methods 7, 813-819 (2010) doi:10.1038/nmeth.1499

Patil et al.
Taxonomic metagenome sequence assignment with structured output models
Nature Methods 8, 191-192 (2011) doi:10.1038/nmeth0311-191

Arthur Brady & Steven L Salzberg
PhymmBL expanded: confidence scores, custom databases, parallelization and more
Nature Methods 8, 367-367 (2011) doi:10.1038/nmeth0511-367

Dan Knights et al.
Bayesian community-wide culture-independent microbial source tracking
Nature Methods 8, 761-763 (2011) doi:10.1038/nmeth.1650

Lauren Bragg, Glenn Stone, Michael Imelfort, Philip Hugenholtz &  Gene W Tyson
Fast, accurate error-correction of amplicon pyrosequences using Acacia
Nature Methods 9, 425-426 (2012) doi:10.1038/nmeth.1990

Nicola Segata et al.
Metagenomic microbial community profiling using unique clade-specific marker genes
Nature Methods 9, 811-814 (2012) doi:10.1038/nmeth.2066

Peter E Larsen,  Dawn Field &  Jack A Gilbert
Predicting bacterial community assemblages using an artificial neural network approach
Nature Methods 9, 621-625 (2012) doi:10.1038/nmeth.1975

Robert C Edgar
UPARSE: highly accurate OTU sequences from microbial amplicon reads
Nature Methods 10, 996-998 (2013) doi:10.1038/nmeth.2604

Derek S Lundberg,  Scott Yourstone,  Piotr Mieczkowski,  Corbin D Jones &  Jeffery L Dangl
Practical innovations for high-throughput amplicon sequencing
Nature Methods 10, 999-1002 (2013) doi:10.1038/nmeth.2634

Shinichi Sunagawa et al.
Metagenomic species profiling using universal phylogenetic marker genes
Nature Methods 10, 1196-1199 (2013) doi:10.1038/nmeth.2693

Daniel R Mende,  Shinichi Sunagawa,  Georg Zeller &  Peer Bork
Accurate and universal delineation of prokaryotic species
Nature Methods 10, 881-884 (2013) doi:10.1038/nmeth.2575

Chen-Shan Chin et al.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
Nature Methods 10, 563-569 (2013) doi:10.1038/nmeth.2474

Nicholas A Bokulich et al.
Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing
Nature Methods 10, 57-59 (2013) doi:10.1038/nmeth.2276

Joseph N Paulson,  O Colin Stine,  Héctor Corrada Bravo &  Mihai Pop
Differential abundance analysis for microbial marker-gene surveys
Nature Methods 10, 1200-1202 (2013) doi:10.1038/nmeth.2658

Paul C Blainey &  Stephen R Quake
Dissecting genomic diversity, one cell at a time
Nature Methods 11, 19-21 (2014) doi:10.1038/nmeth.2783

Johannes Alneberg et al.
Binning metagenomic contigs by coverage and composition
Nature Methods (2014) doi:10.1038/nmeth.3103

Light sheet imaging in Nature Methods

Only a few months before Nature Methods was launched in October 2004, Jan Huisken and Ernst Stelzer published a paper in Science in which they used light-sheet microscopy – what they called selective plane illumination microscopy or SPIM – to image fluorescence within transgenic embryos. Simply put, this century-old technique achieves optical sectioning by illuminating a sample through its width with a thin sheet of light. In the last decade, Nature Methods has published a steady stream of papers reporting developments in light-sheet imaging. Here are the highlights.

Our very first light-sheet paper was also from the Stelzer group, reporting the use of deconvolution to improve resolution of the technique (Verveer et al, 2007). This was rapidly followed by a paper from Hans-Ulrich Dodt, in which samples such as entire insects or brain tissue were rendered transparent with clearing agents to produce spectacular light-sheet ‘ultramicrographs’ (Dodt et al, 2007). The push to higher resolution continued, with a paper from Albert Diaspro reporting 3D super-resolution imaging within thick samples using light-sheets (Cella Zanacchi et al, 2011). The Stelzer group, meanwhile, improved performance of the technique in larger samples that scatter more light, by combining it with structured illumination (Keller et al, 2010).

Thai Truong, Willy Supatto and Scott Fraser added two-photon excitation to light-sheet imaging, thereby doubling the depth and increasing by an order of magnitude the speed at which they could image samples such as developing embryos, compared with each approach alone (Truong et al, 2011); Supatto recently extended this to imaging in multiple colours (Mahou et al, 2014). And then in 2012, the groups of Philipp Keller and Lars Hufnagel independently reported microscopes that could take multiple views of a biological sample simultaneously, allowing rapid imaging of entire developing fly embryos at sub-cellular resolution (Tomer et al, 2012; Krzic et al, 2012).

Though light-sheet imaging is perhaps at its most powerful in the imaging of thick samples like embryos or tissue sections, it has been used for substantial performance improvements in cellular imaging as well. In 2011, Eric Betzig’s group used scanned Bessel beams to create thinner light sheets and thus much improved axial resolution, achieving isotropic 3D resolution and rapid imaging within living cells (Planchon et al, 2011). Note also that, as Tom Vettenburg, Kishan Dholakia and colleagues showed, generating the light sheet using an Airy beam, rather than a Gaussian or Bessel beam, yields an even larger field of view without sacrificing contrast and resolution (Vettenburg et al, 2014). Variations on the light-sheet theme have also been developed by the labs of Makio Tokunaga and Sunney Xie for single-molecule imaging within cells (Tokunaga et al, 2008; Gebhardt et al, 2013).

In recent years, the excitement around this technology has been palpable, with several papers reporting impressive applications of light-sheet microscopy: it has been used to functionally image the entire fish brain (Ahrens et al, 2013) and the brain of ‘fictively behaving’ fish (Vladimirov et al, 2014), as well as to image the beating fish heart (Mickoleit et al, 2014).

Perhaps not surprisingly, the emphasis in methods development has also been shifting a little. On the one hand, platforms are being developed to make this valuable technique available more widely, for instance via the OpenSPIM or OpenSpinMicroscopy platforms (Pitrone et al, 2013; Gualda et al, 2013). At the same time, analytical tools are necessarily being developed to handle the vast reams of data that a light-sheet experiment generates. The group of Pavel Tomancak reported Bayesian-based deconvolution methods to analyse the large data sets that result from multiview imaging (Preibisch et al, 2014). Philipp Keller and colleagues described computational methods to segment and track nuclei in data sets from light-sheet or other imaging, for fast lineaging of developing embryos (Amat et al, 2014). Misha Ahrens and colleagues reported Thunder, a suite of analytical tools built on a platform for distributed computing, enabling the mapping of brain activity in ‘fictively behaving’ zebrafish (Freeman et al, 2014).

It’s fair to say that this venerable method has been thoroughly revived over the past decade. Light-sheet imaging is poised to yield tremendous biological insight. We hope to keep you updated on future developments in Methagora.

Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools to make sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life, owing to the pressure to always improve and innovate; others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base-calling, and in 2008 Yaniv Erlich and Gregory Hannon reduced base-calling errors in Illumina data with Alta-Cyclic, which uses machine learning to reduce noise.

Once bases are called, they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009 Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST is a cache-oblivious, seed-and-extend short-read mapper presented in 2010 by Cenk Sahinalp and colleagues. Bowtie2, a gapped read aligner by Ben Langmead and Steven Salzberg, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
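For readers unfamiliar with the seed-and-extend idea mentioned above, here is a minimal toy sketch in Python. It is not how mrsFAST, Bowtie2 or GEM are implemented; the function names, the k-mer size and the mismatch threshold are all hypothetical choices made for illustration, and the sketch handles only ungapped matches on a single strand.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) pairs for ungapped placements of the read."""
    checked = {}
    # Seed: look up non-overlapping k-mers of the read in the index.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            out_of_bounds = start < 0 or start + len(read) > len(reference)
            if out_of_bounds or start in checked:
                continue
            # Extend: compare the whole read with the reference window.
            window = reference[start:start + len(read)]
            checked[start] = sum(1 for a, b in zip(read, window) if a != b)
    return sorted((s, m) for s, m in checked.items() if m <= max_mismatches)


# Hypothetical toy data: the read occurs exactly at position 3
# and with one mismatch at position 14.
reference = "AAACCGTTGGCTTTCCGTAGGCGGG"
read = "CCGTTGGC"
index = build_kmer_index(reference, 4)
hits = seed_and_extend(read, reference, index, 4)
print(hits)  # [(3, 0), (14, 1)]
```

Real mappers differ mainly in how they build and search the index and in how they extend seeds (for example, with gapped alignment), which is where the trade-offs between speed, sensitivity and accuracy arise.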

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The ongoing high citation level of this paper and other work pointing out limits in current assembly programs highlights that de novo read assembly continues to be a challenge.

Finding structural variants

In 2009 Paul Medvedev and Michael Brudno looked at tools to discover structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20-50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variation ranging in size from 10 base pairs to 1 megabase. In 2011 Evan Eichler and colleagues added Splitread to find indels, de novo structural variants and copy number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; a combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.

Handling RNA-seq data

In 2008 Mortazavi et al. and Cloonan et al. published some of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then the number of RNA-seq analysis tools has grown steadily throughout the literature.

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year Chris Burge and colleagues published the MISO model to estimate expression of alternatively spliced exons and isoforms. Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. And a year later Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.

Last year Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment. They also looked at the performance of software to reconstruct transcripts.

David Haussler and colleagues showed in 2010 with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that RNA functional motifs can be discovered in their structure.

Despite the many computational tools we have published, it is still not always easy to predict a priori which ones will be taken up by the community. We’d love to hear from you what you think makes a top-notch analysis tool.

Strengthening communities through competition

Community bioinformatics challenges help drive methods development.

Science moves ahead faster as a social enterprise, perhaps especially so in the dynamic area of bioinformatics. Bioinformatics competitions are important opportunities for developers (and users) to come together to define the essential questions in the field and decide on the best metrics to evaluate them. They also perform a critical function in making valuable benchmark datasets available to anyone, including small labs and young students.

At the end of a challenge, ideally, is a better appreciation of the most promising approaches to a problem, as well as a recognition of difficulties and opportunities for future development. There are also new contacts for collaboration. As a reality check, it is less common for researchers with directly competing methods to collaborate; their work depends on a competitive funding model. But complementary approaches provide fertile ground for exploring new ways to attack a problem, and some contests are directly encouraging collaborative coding.

In our July Editorial, we continue our support of these initiatives, urging participation and an embrace of formats that maximize engagement among participants. Already, measures like on-line forums, webinars, and conferences involve participants in the planning and interpretation stages, which are critical for getting the most out of each event.

A variety of formats beyond the traditional bake-off are evolving in the collaborative spirit, encouraging more sharing of ideas and code. For example, hackathons take on more focused coding challenges in a single dedicated meet-up session, while open-source competitions make code available during the contest to allow researchers to learn from each other. These formats are not meant as an evaluation of existing methods, but promote new solutions. As Gustavo Stolovitzky of the DREAM challenges points out, publishing code during the event has the potential for ‘herding’ behavior (copy-the-leader), which can stifle creativity and produce a coding monoculture. A number of DREAM challenges now use a two-stage approach in which top performers from a traditional competition phase are invited back to develop a new and better solution together.

Journals and funders also play a role in supporting these efforts. Nature Methods has published a number of papers resulting from community competitions (CAFA, DREAM, FlowCAP, Particle tracking and RGASP), and the Nature journals have been committed to providing these papers under a Creative Commons Attribution-NonCommercial-ShareAlike Unported license since January 2013.

There are difficulties associated with running large-scale events. Choice of data set and metrics can bias evaluations towards certain solutions, and the involvement of many developers can water down the conclusions resulting from the challenge. Moreover, usability is often not considered since it is hard to quantify. Ultimately, these issues can be helped by boosting participation in decision-making during planning stages, tailoring conclusions to each scenario that is tested, and having judging panels test the best-performing methods to ensure usability.
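To make the point about metrics concrete, here is a small hypothetical sketch in Python showing how two imaginary methods swap ranks depending on whether a challenge scores precision or recall. The data, method names and helper functions are invented for illustration and do not come from any of the challenges discussed here.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for a set of predictions against a truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def rank_by(scores, metric_index):
    """Rank method names from best to worst on one metric (0 = precision, 1 = recall)."""
    return sorted(scores, key=lambda name: scores[name][metric_index], reverse=True)


# Hypothetical benchmark: ten true items, labelled 0 to 9.
truth = set(range(10))
# A conservative method: four predictions, all correct.
method_a = {0, 1, 2, 3}
# A permissive method: fifteen predictions, nine of them correct.
method_b = set(range(9)) | {100, 101, 102, 103, 104, 105}

scores = {
    "method_a": precision_recall(method_a, truth),
    "method_b": precision_recall(method_b, truth),
}
print(rank_by(scores, 0))  # ['method_a', 'method_b'] when ranked by precision
print(rank_by(scores, 1))  # ['method_b', 'method_a'] when ranked by recall
```

Neither ranking is wrong; they answer different questions, which is one reason to tailor conclusions to each scenario that is tested.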

We are heartened to see the continued success of community-led competitions and the birth of contests in new areas. In a guest post, we invited organizers of the CAMI competition to announce their upcoming event on metagenome data interpretation.

Below, we provide a non-comprehensive list of some recent and ongoing challenges:

Bake-offs
Assemblathon – genome assembly
CAFA (Critical Assessment of Function Annotation) – protein functional prediction
CAGI (Critical Assessment of Genome Interpretation) – functional variant prediction
CAMI (Critical Assessment of Metagenome Interpretation) – see the announcement
CAPRI (Critical Assessment of PRediction of Interactions) – structure-based protein-protein interaction prediction
CASP (Critical Assessment of protein Structure Prediction) – protein structure prediction since 1994!
DREAM (Dialogue for Reverse Engineering Assessment and Methods) – systems biology challenges with hybrid formats and challenge-assisted review
FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods)
Grand Challenges in biomedical image analysis
Particle tracking challenge
RGASP (RNA-seq Genome Annotation Assessment Project)

Crowdsourcing competitions, hackathons and fast challenges
BioHackathons – open-source programming meetups
Innocentive – commercial platform offering cash prizes (e.g. the $1 million US Defense Threat Reduction Agency (DTRA) challenge to identify organisms from a stream of DNA sequences)
DNA60IFX – short challenges based on DNA or RNA sequence data
DREAM – a number of recent and current challenges include a collaborative phase of tool development
Neurosynth hackathons – open-source programming meetups in computational neurobiology
Sequence Squeeze – open-source competition for sequence file compression (cash prize)
[topcoder] – variety of computational challenges, with some cash prizes

The Critical Assessment of Metagenome Interpretation (CAMI) competition

Alice McHardy, Alex Sczyrba and Thomas Rattei announce a new initiative for assessing metagenomics methods in this guest post.

In just over a decade, metagenomics has developed into a powerful and productive method in microbiology and microbial ecology. The ability to retrieve and organize bits and pieces of genomic DNA from any natural context has opened a window into the vast universe of uncultivated microbes. Tremendous progress has been made in computational approaches to interpret this sequence data but none can completely recover the complex information encoded in metagenomes.

A number of challenges stand in the way. Simplifying assumptions are needed and lead to strong limitations and potential inaccuracies in practice. Critically, methodological improvements are difficult to gauge due to the lack of a general standard for comparison. Developers face a substantial burden to individually evaluate existing approaches, which consumes time and computational resources, and may introduce unintended biases.

The Critical Assessment of Metagenome Interpretation (CAMI) is a new community-led initiative designed to help tackle these problems by aiming for an independent, comprehensive and bias-free evaluation of methods. We are making extensive high-quality unpublished metagenomic data sets available for developers to test their short read assembly, binning and taxonomic classification methods. The results of CAMI will provide exhaustive quantitative metrics on tool performance to serve as a guide to users under different scenarios, and to help developers identify promising directions for future work.

As a community effort, we encourage feedback by both method developers and users of metagenome analysis tools. The CAMI initiative was one of the four discussion threads of the Metagenome Meeting at the Newton Institute in Cambridge this year. Another open discussion with both developers and users of computational metagenome methods will also take place at a roundtable at the ISME conference in Seoul in August.

We urge developers to participate by registering for the competition on our website and joining our Google+ group to provide feedback on the current design phase. The competition is tentatively scheduled to open at the end of 2014. Key data sets are being generated, and CAMI is currently seeking additional data contributors to provide genomes of deep-branching lineages for data set generation. The results will be presented and discussed in a workshop a few months after the competition. We aim for a joint publication of the generated insights together with all CAMI contest participants and data contributors.

We encourage everyone to get involved and spread the word!