Some resources and tools related to noncoding RNAs

In ‘Meet some code-breakers of noncoding RNAs,’ the technology feature in the February issue of Nature Methods, we speak with a few scientists about the path ahead for methods to characterize noncoding RNAs.

With their input, we compiled a list of some of the resources and tools in this field.

We will gladly include additional resources: please comment on this page, or tweet us at @naturemethods or @metricausa.

Some resources and tools related to noncoding RNAs:

 

Resource | Description | Publication

DASHR | Database of small human noncoding RNAs
Leung, Y.Y. et al. DASHR: database of small human noncoding RNAs. Nucleic Acids Res. 44, D216–D222 (2016).

FANTOM CAT | An atlas of human long noncoding RNAs with accurate 5′ ends, from the Functional Annotation of the Mammalian Genome (FANTOM) international consortium. The annotation can be used, for example, to find functional lncRNAs that show an effect on global expression after knockout or knockdown.
Hon, C.-C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).
Okazaki, Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573 (2002).

GENCODE | Resource on human and mouse noncoding RNAs, drawing on data generated by the Encyclopedia of DNA Elements (ENCODE) consortium; includes information about the noncoding RNA species and their annotations
Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. doi: 10.1101/gr.135350.111 (2012).
LNCipedia | Database of annotated human long noncoding RNA transcript sequences and structures
Volders, P.-J. et al. LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Res. 41, D246–D251 (2013).
lncRNAdb | Database of functional long noncoding RNA annotations manually curated from the scientific literature
Amaral, P.P. et al. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res. 39, D146–D151 (2011).
lncRNAWiki | A wiki to encourage community-based curation of human long noncoding RNAs
Ma, L. et al. LncRNAWiki: harnessing community knowledge in collaborative curation of human long non-coding RNAs. Nucleic Acids Res. 43, D187–D192 (2015).

lncRNAtor | A portal for long noncoding RNAs with information such as expression profiles and coding potential; data sources include TCGA, GEO, ENCODE and modENCODE
Park, C. et al. lncRNAtor: a comprehensive resource for functional investigation of long non-coding RNAs. Bioinformatics 30, 2480–2485 (2014).
MINTbase | Database of tRNA fragments from 11,000 people and 32 cancer types
Pliatsika, V. et al. Nucleic Acids Res. 46, D152–D159 (2018).

miRBase | Database of published miRNA sequences and annotations
Griffiths-Jones, S. et al. Nucleic Acids Res. 36, D154–D158 (2008).

mirDIP | A resource with human data for finding the microRNAs that target a gene, or the genes targeted by a microRNA
Tokar, T. et al. mirDIP 4.1: integrative database of human microRNA target predictions. Nucleic Acids Res. 46, D360–D370 (2018).

MirGeneDB | A database of validated and annotated human microRNA genes
Fromm, B. et al. MirGeneDB 2.0: the curated microRNA Gene Database. Preprint at bioRxiv https://doi.org/10.1101/258749
NONCODE | A noncoding RNA database, with an emphasis on long noncoding RNAs, covering 17 species. The information is mined from the scientific literature and from data resources such as lncRNAdb and LNCipedia. It includes links to literature about tools such as ncFANs for functional annotation of lncRNAs.
Liu, C. et al. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Res. 33, D112–D115 (2005).
Regulome resources and data | Resources and data from the Center for Personal Dynamic Regulomes, including the ATAC-seq protocol and transcriptional landscape data from 13 cell types from healthy people and 3 cell types from people with leukemia
Corces, M.R. et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
RNAcentral | Resource hosted at the European Bioinformatics Institute that draws on a number of other database resources, including:

LncBase | A database of experimentally supported miRNA:gene interactions, plus analysis tools and pipelines, such as for miRNA pathway analysis

snOPY | A snoRNA orthologous gene database with information about snoRNAs, snoRNA gene loci and target RNAs

TarBase | Manually curated, experimentally validated miRNA–gene interactions

 

Tools

miRDeep and miRDeep2 | Tools for miRNA identification from RNA-seq data
An, J. et al. miRDeep*: an integrated application tool for miRNA identification from RNA sequencing data. Nucleic Acids Res. 41, 727–737 (2013).
Friedländer, M.R. et al. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res. 40, 37–52 (2012).

rna22 | Pattern-based prediction of miRNA binding sites
Miranda, K.C. et al. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203–1217 (2006).

Oasis | Small noncoding RNA detection and expression analysis tool
Capece, V. et al. Oasis: online analysis of small RNA deep sequencing data. Bioinformatics 31, 2205–2207 (2015).

Datasets

Analysis of 13 cell types | Expression of primate- and tissue-specific microRNAs; human miRNAs, their targets, and visualization of the loci on the human genome browser
Londin, E. et al. Analysis of 13 cell types reveals evidence for the expression of numerous novel primate- and tissue-specific microRNAs. Proc. Natl. Acad. Sci. USA 112, E1106–E1115 (2015).

Sources: H. Chang, Stanford University School of Medicine; R. Johnson, University of Bern; E. Marshall, BC Cancer Agency; M. Turner, Babraham Institute; U. Ohler, Max Delbrück Center for Molecular Medicine; I. Rigoutsos, Philadelphia University + Thomas Jefferson University; Nature Research.

 

 

Computable sugars: some computational resources in glycoscience

Glycoscience is sweet science{credit}PhotoDisc/ Getty Images{/credit}

As glycoscience advances, labs will increasingly want to ask questions about glycosylation sites on a protein or the structure of a sugar, says Raja Mazumder, a bioinformatician at George Washington University. They might ask, for example: are there glycosyltransferases that are expressed in liver but not in heart, or which ones are overexpressed by a factor of three in more than two cancers? Such questions require infrastructure building, he says, because right now there is no mechanism to allow such queries. But he and others are building such capabilities. Mazumder, along with William York at the University of Georgia, is starting to build a glycoscience informatics portal.

Mazumder wants to leverage existing ontologies in the developer community in order to build systems that can be queried on a large scale. For example, Mazumder is working with Cathy Wu at Georgetown University, who is developing the Protein Ontology. Such ontologies are collected, for example, by the non-profit OBO Foundry. To allow flexible querying, the computational resources will draw on different ontologies: ones that relate to glycans, genes, proteins, tissues, diseases and more.

Ontologies are part of the team’s effort to build application programming interfaces (APIs) that expose the data in a given database to incoming queries. Given how complex sugars are, the informatics framework has to be well organized for both human and machine-based querying, says Mazumder.

When using the resource, a researcher will receive results that also document the search process itself, such as the version of the queried database. “You need to be able to tell where you got that information from,” says Mazumder. Tracking data provenance matters especially in an age when databases continuously integrate information emerging in the literature.

For the Food and Drug Administration, Mazumder is developing computational standards for high-throughput sequencing, which he wants to also apply to glycoscience. His ‘biocompute object’ captures the given computational workflow a lab might have used to generate results: the software used, the databases queried and their versions, and identifiers of data inputs and outputs. These biocompute objects are intended to help regulatory scientists interpret submitted work. They can also help scientists generally to see whether, for example, the version of the software they used worked as it should, says Mazumder.
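As a rough illustration of the idea (not the actual FDA schema; all field names and values below are invented for the sketch), a minimal biocompute-style provenance record could look like this in Python:

```python
import hashlib
import json

def make_biocompute_object(software, databases, inputs, outputs):
    """Assemble a minimal provenance record for a computational workflow.

    Field names are illustrative only; the real BioCompute Object
    specification defines a much richer schema."""
    record = {
        "software": software,    # tools used, with versions
        "databases": databases,  # databases queried, with versions
        "inputs": inputs,        # identifiers of data inputs
        "outputs": outputs,      # identifiers of data outputs
    }
    # A checksum over the canonical JSON lets reviewers verify that the
    # record was not altered after the fact.
    canonical = json.dumps(record, sort_keys=True).encode()
    record["checksum"] = hashlib.sha256(canonical).hexdigest()
    return record

bco = make_biocompute_object(
    software=[{"name": "aligner", "version": "2.1.0"}],
    databases=[{"name": "RefDB", "version": "2018-01"}],
    inputs=["sample_001.fastq"],
    outputs=["sample_001.vcf"],
)
```

Because the checksum covers the canonical JSON of the record, any later change to the software list, database versions or input and output identifiers changes the hash, which is what makes such a record auditable.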

Too often labs use computational tools without benchmarking them, says Mazumder. “It would be unthinkable for a wet-lab scientist to not have a positive and negative control,” he says.  In informatics, developers benchmark their software but users often do not have these habits. “They don’t even know: if I don’t find anything, is it because my software did not run well or not?”

As labs move to big data analysis in genomics and also, eventually, in glycoscience, this aspect is ever more important, says Mazumder. In his view, biocompute objects will help glycobiology researchers communicate with one another about their results, such as where on a protein they found a sugar with a given structure. More generally, it will help glycoscientists to have a better way to connect the available sugar resources as they pursue their questions of interest.


Here are some resources that glycoscientists can tap into:                             

 Category Resource Description
General resources and funding information
Transforming Glycoscience: A Roadmap for the Future Report by the National Research Council of the National Academies of Science
NIH Common Fund program in glycoscience  Funding opportunities from the NIH Common Fund program in glycoscience
A roadmap for glycoscience in Europe | Glycoscience roadmap for Europe, by the BBSRC, EGSF and the European Science Foundation
GlycoNet Resources related to glycoscience research in Canada, based at the University of Alberta where the Alberta Glycomics Centre is located
National Center for Functional Glycomics A Glycomics-related Biomedical Technology Resource Center based at Beth Israel Deaconess Medical Center, Harvard Medical School, with resources on, for example, microarrays and microarray services, protocols, training and databases
Databases and  portals 
CAZy Carbohydrate-Active Enzymes, a database of enzyme families that degrade, modify or create glycosidic bonds
Consortium for Functional Glycomics Resources and glycoscience data. Part of the National Center for Functional Glycomics.
ExPASy Software tools and databases to simulate, predict and visualize glycans, glycoproteins and glycan-binding proteins
Glycan Library  A list of lipid-linked sequence-defined glycan probes
Glyco3D A portal for structural glycoscience
GlycoBase 3.2 A database of N- and O-linked glycan structures with HPLC, UPLC, exoglycosidase sequencing and mass spectrometry data
GlycoPattern Portal for glycan array experimental results from the Consortium for Functional Glycomics
Glycosciences.de Collection of databases and tools in glycoscience
GlyToucan Repository for glycan structures based in Japan
MatrixDB A database of experimental data of interactions by proteoglycans, polysaccharides and extracellular matrix proteins
Repository of Glyco-enzyme expression constructs University of Georgia Complex Carbohydrate Research Center repository for glyco-enzyme constructs
SugarBind A database of carbohydrate sequences to which bacteria, toxins and viruses adhere
UniCarbKB A resource curated by scientists in five countries. It includes GlycoSuiteDB, a database of glycan structures; EUROCarbDB, an experimental and structural database; and UniCarb-DB, a mass spectrometry database of glycan structures
Software tools
CASPER Web-based tool to calculate NMR chemical shifts of oligo- and polysaccharides
Glycan Builder An online tool at ExPASy for predicting possible oligosaccharide structures on proteins
GlycoMiner/GlycoPattern Software tools to automatically identify mass spec spectra of N-glycopeptides
GlyMAP An online resource for mapping glyco-active enzymes
NetOGlyc Software tool for predicting O-glycosylation sites on proteins
SweetUnityMol Molecular visualization software

Sources: NIH; R. Mazumder, George Washington University; New England Biolabs; Thermo Fisher Scientific; Nature Research

Microbial sequencing at Nature Methods

Over the years, Nature Methods has published many methods to generate and analyze complex sequence data for microbial studies. We cover highlights from our papers below.

Carl Woese set the stage for a molecular taxonomy of microbial life in 1977 by demonstrating that the 16S ribosomal subunit could form the basis of prokaryotic classification. Amplifying markers such as 16S from microbial mixtures really took off with the advent of high-throughput sequencing, which provided a way to rapidly profile communities sampled directly from the environment. Shotgun sequencing approaches are used more and more for taxonomic profiling as well, enabling gene and genomic sequences to be reconstructed for the functional characterization of communities.

Amplicon-based community profiling
The 454 pyrosequencing platform originally dominated efforts to study the 16S locus because of its long sequence reads. In 2008, Rob Knight and colleagues described the use of error-correcting barcodes for pyrosequencing hundreds of samples together.  Then in 2013, Jeffrey Dangl and colleagues took barcoding to a new level by tagging every template molecule during library prep on the Illumina platform in order to remove much of the PCR bias and error introduced during amplification.
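The template-tagging idea can be sketched in a few lines: if every template molecule gets a unique molecular identifier (UMI) before amplification, PCR duplicates collapse to one observation per tag. The sketch below is a toy version with made-up reads; real pipelines call a per-base consensus from aligned reads rather than taking a majority vote over whole sequences.

```python
from collections import Counter, defaultdict

def collapse_umis(reads):
    """Collapse PCR duplicates: reads carrying the same unique molecular
    identifier (UMI) derive from one template molecule, so report a single
    consensus sequence per UMI. Here the 'consensus' is just the most
    common sequence seen for that UMI, a toy stand-in for real per-base
    consensus calling."""
    by_umi = defaultdict(list)
    for umi, seq in reads:
        by_umi[umi].append(seq)
    return {umi: Counter(seqs).most_common(1)[0][0]
            for umi, seqs in by_umi.items()}

reads = [
    ("AACT", "ACGTACGT"),  # three copies of one template molecule,
    ("AACT", "ACGTACGT"),  # one of them carrying a PCR/sequencing error
    ("AACT", "ACGAACGT"),
    ("GGTC", "TTGCAAGC"),  # a second template molecule
]
consensus = collapse_umis(reads)  # one sequence per original molecule
```

Collapsing by UMI removes both the abundance skew introduced by uneven amplification and most errors introduced after tagging.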

On the computational side, Christopher Quince and colleagues presented PyroNoise in 2009 for ‘denoising’ or removing errors from pyrosequencing flowgrams. Jens Reeder and Rob Knight followed a year later with Denoiser, a fast heuristic alternative. Gene Tyson and colleagues moved away from flowgrams with their Acacia software, which corrects sequence files directly and can also work on Ion Torrent data due to its similar error profile containing homopolymeric repeats.

Once cleaned up, marker sequences need to be grouped into ‘operational taxonomic units’ (OTUs) that roughly correspond to genera, species or strains. Among many algorithms that do this, Robert Edgar introduced UPARSE (we realized that there is some ambiguity but it is pronounced YOU-parse) in 2013 for accurate OTU clustering in the face of erroneous or chimeric sequence reads.
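The core of greedy centroid-based clustering, the strategy that UPARSE and its relatives build on, fits in a short sketch. This toy version uses positionwise identity on equal-length sequences instead of real alignment, and omits UPARSE's abundance sorting and chimera filtering:

```python
def identity(a, b):
    """Fraction of matching positions (a toy stand-in for alignment identity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_otu_cluster(seqs, threshold=0.97):
    """Greedy centroid clustering: take sequences in order (UPARSE processes
    them by decreasing abundance); each read joins the first existing
    centroid it matches at or above the identity threshold, otherwise it
    founds a new OTU."""
    centroids, assignments = [], []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignments.append(i)
                break
        else:  # no centroid close enough: new OTU
            centroids.append(seq)
            assignments.append(len(centroids) - 1)
    return centroids, assignments

reads = [
    "A" * 100,        # founds OTU 0
    "A" * 98 + "TT",  # 98% identical: joins OTU 0
    "G" * 100,        # unrelated: founds OTU 1
]
centroids, assignments = greedy_otu_cluster(reads)
```

The 97% default mirrors the conventional species-level OTU cutoff mentioned above; lowering it merges OTUs, raising it splits them.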

To stitch the computational analysis steps together, ‘quantitative insights into microbial ecology’, or QIIME (pronounced chime) from Rob Knight and colleagues offers a user-friendly modular pipeline for amplicon sequence analysis.

Metagenomic community profiling
In shotgun metagenomics approaches, all fragments of genomic DNA in a sample are sequenced and classified. Isidore Rigoutsos and colleagues introduced PhyloPythia in 2007 to assign fragments to higher taxonomic groups or ‘bins’ based on matching the frequency of tetranucleotide sequences with signatures from known taxa. Its faster, open-source successor, PhyloPythiaS, from Alice McHardy and colleagues came out in 2011.
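The compositional signature behind this kind of binning is simple to compute: slide a 4-bp window along a fragment and tally the 256 possible tetranucleotides. A minimal sketch follows; the classifier that matches these vectors to signatures of known taxa is the hard part and is omitted here.

```python
from itertools import product

def tetranucleotide_freqs(seq):
    """Frequency vector over all 256 tetranucleotides: the kind of
    compositional signature that PhyloPythia-style binners compare
    against signatures learned from known taxa."""
    counts = dict.fromkeys(("".join(p) for p in product("ACGT", repeat=4)), 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
            total += 1
    if total == 0:
        return counts
    return {kmer: n / total for kmer, n in counts.items()}

freqs = tetranucleotide_freqs("ACGTACGTACGT")  # 9 overlapping windows
```

Because the vector is normalized by the number of windows, fragments of different lengths from the same genome land near each other in this 256-dimensional space.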

Arthur Brady and Steven Salzberg also used sequence composition, alone or combined with sequence alignment, in Phymm and PhymmBL in 2009; an expanded PhymmBL with confidence scores, custom databases and parallelization came out in 2011.

In 2012, Curtis Huttenhower and colleagues described MetaPhlAn, which limits analysis to clade-specific marker genes to speed up the classification of sequence reads. Peer Bork and colleagues also extracted a limited marker set from metagenomic data in their metagenomic OTUs (mOTU) approach in 2013, but used 40 universally conserved prokaryotic genes. Both methods work best in systems like the human gut that have a large number of sequenced reference genomes.

Genomes from mixtures
Earlier this year, Christopher Quince, Anders Andersson and colleagues published an unsupervised binning method called CONCOCT to help reconstruct genomes from mixtures. It uses sequence composition and differential coverage across samples to assign pre-assembled contiguous sequences (contigs) to species or strain bins.
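The intuition can be sketched with a toy clustering: give each contig a feature vector that concatenates its coverage in each sample with a compositional summary (here just GC fraction, with invented numbers), then cluster. CONCOCT itself fits a Gaussian mixture over many composition dimensions; this sketch substitutes a tiny k-means purely for illustration.

```python
def bin_contigs(features, k=2, iters=20):
    """Toy k-means over per-contig feature vectors that concatenate
    coverage in each sample with a compositional summary; CONCOCT itself
    fits a Gaussian mixture over richer composition features.
    Initialization is deterministic (first k points) for simplicity."""
    centroids = [list(f) for f in features[:k]]
    assign = [0] * len(features)
    for _ in range(iters):
        for i, f in enumerate(features):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        for c in range(k):
            members = [features[i] for i in range(len(features)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Each contig: [coverage in sample 1, coverage in sample 2, GC fraction].
# Species A is abundant in sample 1, species B in sample 2.
features = [
    [30.0, 2.0, 0.40], [28.0, 3.0, 0.41],  # species A contigs
    [2.0, 25.0, 0.60], [3.0, 27.0, 0.61],  # species B contigs
]
bins = bin_contigs(features)
```

The differential coverage across samples does most of the work here, which is exactly why multi-sample designs make this kind of unsupervised binning tractable.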

Single-cell sequencing is another way to obtain microbial genomes. Paul Blainey and Stephen Quake discuss challenges and opportunities for single-cell sequencing in a Commentary in our Method of the Year issue in 2014. When cultures are available, long-read single-molecule sequencing technology can provide very high quality genome sequences; the HGAP software from Jonas Korlach and colleagues makes this possible using a single Pacific Biosciences sequencing library.

With genomic sequences in hand, there remains the question of how to fit them within an appropriate taxonomy. Peer Bork and colleagues tackled the problem in 2013 with their species identification (SpecI) tool, which bases classification on the same 40 markers as mOTU.

Functional analysis and ecology
An array of tools has been designed to wrest ecological and biological insights from metagenomic sequence data, such as the GENE PRediction IMprovement Pipeline (GenePRIMP) for annotating prokaryotic genomes, by Amrita Pati and colleagues in 2010, and the metagenomeSeq method to test for differential microbial abundance across environments or conditions, by Mihai Pop and colleagues in 2013 (also see a comment by Bork and colleagues and the authors’ reply).

In 2010, Rob Knight and colleagues compared 51 methods for their ability to identify biologically relevant distribution patterns using real and simulated 16S pyrosequencing data from samples that were clustered or assayed along environmental gradients. In 2012, Jack Gilbert and colleagues developed microbial assemblage prediction (MAP), an artificial neural network approach to model microbial community structure across the Western English Channel that combines time course metagenomic data from a single site with bioclimatic data gathered over the entire channel.

Quality control and bias
Generating accurate and robust microbial sequence data requires rigorous benchmarking and controls, and experimental methods are constantly improving. Nikos Kyrpides and colleagues studied the use of simulated data to evaluate metagenomic analysis methods in 2007. In 2010, Philip Hugenholtz and colleagues evaluated two methods to deplete rRNA from metatranscriptomes.

J Gregory Caporaso and colleagues further demonstrated the effect of Illumina read quality on taxonomic assignment and diversity assessment in 2013, and Scott Kelley and colleagues developed SourceTracker software to identify contaminants in microbial sequencing studies.

We look forward to many more contributions in the field of microbial sequencing.

 

References:
Alice Carolyn McHardy et al.
Accurate phylogenetic classification of variable-length DNA fragments
Nature Methods 4, 63-72 (2007) doi:10.1038/nmeth976

Konstantinos Mavromatis et al.
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
Nature Methods 4, 495-500 (2007) doi:10.1038/nmeth1043

Micah Hamady, Jeffrey J Walker, J Kirk Harris, Nicholas J Gold & Rob Knight
Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex
Nature Methods 5, 235-237 (2008) doi:10.1038/nmeth.1184

Christopher Quince et al.
Accurate determination of microbial diversity from 454 pyrosequencing data
Nature Methods 6, 639-641 (2009) doi:10.1038/nmeth.1361

Arthur Brady & Steven L Salzberg
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models
Nature Methods 6, 673-676 (2009) doi:10.1038/nmeth.1358

J Gregory Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nature Methods 7, 335-336 (2010) doi:10.1038/nmeth.f.303

Jens Reeder & Rob Knight
Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions
Nature Methods 7, 668-669 (2010) doi:10.1038/nmeth0910-668b

He et al.
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
Nature Methods 7, 807-812 (2010) doi:10.1038/nmeth.1507

Amrita Pati et al.
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes
Nature Methods 7, 455-457 (2010) doi:10.1038/nmeth.1457

Justin Kuczynski,  Zongzhi Liu,  Catherine Lozupone,  Daniel McDonald,  Noah Fierer &  Rob Knight
Microbial community resemblance methods differ in their ability to detect biologically relevant patterns
Nature Methods 7, 813-819 (2010) doi:10.1038/nmeth.1499

Patil et al.
Taxonomic metagenome sequence assignment with structured output models
Nature Methods 8, 191-192 (2011) doi:10.1038/nmeth0311-191

Arthur Brady & Steven L Salzberg
PhymmBL expanded: confidence scores, custom databases, parallelization and more
Nature Methods 8, 367 (2011) doi:10.1038/nmeth0511-367

Dan Knights et al.
Bayesian community-wide culture-independent microbial source tracking
Nature Methods 8, 761-763 (2011) doi:10.1038/nmeth.1650

Lauren Bragg, Glenn Stone, Michael Imelfort, Philip Hugenholtz &  Gene W Tyson
Fast, accurate error-correction of amplicon pyrosequences using Acacia
Nature Methods 9, 425-426 (2012) doi:10.1038/nmeth.1990

Nicola Segata et al.
Metagenomic microbial community profiling using unique clade-specific marker genes
Nature Methods 9, 811-814 (2012) doi:10.1038/nmeth.2066

Peter E Larsen,  Dawn Field &  Jack A Gilbert
Predicting bacterial community assemblages using an artificial neural network approach
Nature Methods 9, 621-625 (2012) doi:10.1038/nmeth.1975

Robert C Edgar
UPARSE: highly accurate OTU sequences from microbial amplicon reads
Nature Methods 10, 996-998 (2013) doi:10.1038/nmeth.2604

Derek S Lundberg,  Scott Yourstone,  Piotr Mieczkowski,  Corbin D Jones &  Jeffery L Dangl
Practical innovations for high-throughput amplicon sequencing
Nature Methods 10, 999-1002 (2013) doi:10.1038/nmeth.2634

Shinichi Sunagawa et al.
Metagenomic species profiling using universal phylogenetic marker genes
Nature Methods 10, 1196-1199 (2013) doi:10.1038/nmeth.2693

Daniel R Mende,  Shinichi Sunagawa,  Georg Zeller &  Peer Bork
Accurate and universal delineation of prokaryotic species
Nature Methods 10, 881-884 (2013) doi:10.1038/nmeth.2575

Chen-Shan Chin et al.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
Nature Methods 10, 563-569 (2013) doi:10.1038/nmeth.2474

Nicholas A Bokulich et al.
Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing
Nature Methods 10, 57-59 (2013) doi:10.1038/nmeth.2276

Joseph N Paulson,  O Colin Stine,  Héctor Corrada Bravo &  Mihai Pop
Differential abundance analysis for microbial marker-gene surveys
Nature Methods 10, 1200-1202 (2013) doi:10.1038/nmeth.2658

Paul C Blainey &  Stephen R Quake
Dissecting genomic diversity, one cell at a time
Nature Methods 11, 19-21 (2014) doi:10.1038/nmeth.2783

Johannes Alneberg et al.
Binning metagenomic contigs by coverage and composition
Nature Methods (2014) doi:10.1038/nmeth.3103

Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools to make sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life, owing to the pressure to constantly improve and innovate; others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base calling, and in 2008 Yaniv Erlich with Gregory Hannon reduced calling errors in Illumina data with Alta-Cyclic, which uses machine learning to reduce noise.

Once bases are called, they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009 Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST is a cache-oblivious seed-and-extend short-read mapper presented in 2010 by Cenk Sahinalp and colleagues. Bowtie 2, by Ben Langmead and Steven Salzberg, is a gapped read aligner that promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
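Most of these mappers are elaborations of the same seed-and-extend idea: find cheap exact matches of a short seed, then verify each candidate locus by full comparison. A deliberately naive sketch with toy sequences follows; real tools index the reference with hash tables or FM-indexes, use several seeds per read, and extend with gapped alignment.

```python
def seed_and_extend(read, reference, seed_len=8, max_mismatches=2):
    """Minimal seed-and-extend mapping: find exact occurrences of the
    read's leading seed in the reference, then verify each candidate
    locus by counting mismatches over the full read length."""
    seed = read[:seed_len]
    hits = []
    pos = reference.find(seed)
    while pos != -1:
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
        pos = reference.find(seed, pos + 1)
    return hits

reference = "TTTTACGTACGTAAGGCCTTAAACGT"
hits = seed_and_extend("ACGTACGTAAGG", reference)  # maps at offset 4
```

The seed restricts the expensive verification step to a handful of candidate positions, which is where the speed of real mappers comes from.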

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The ongoing high citation level of this paper, and other work pointing out limits of current assembly programs, highlights that de novo read assembly continues to be a challenge.

Finding structural variants

In 2009 Paul Medvedev and Michael Brudno looked at tools to discover structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20–50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variation ranging in size from 10 base pairs to 1 megabase. In 2011 Evan Eichler and colleagues added Splitread to find indels, de novo structural variants and copy number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; a combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.
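One basic signal that paired-end callers of this kind exploit is insert-size discordance: a read pair spanning a deletion maps farther apart than the library's expected insert size. A toy sketch with invented numbers, where the library mean and standard deviation are passed in as assumed, pre-estimated values:

```python
def discordant_pairs(insert_sizes, lib_mean=300.0, lib_sd=15.0, z_cutoff=3.0):
    """Flag read pairs whose mapped insert size deviates strongly from the
    library's expected distribution (lib_mean and lib_sd would be estimated
    from concordant pairs in real data). A cluster of flagged pairs over one
    locus is the raw signal that paired-end SV callers turn into a call:
    too-long inserts suggest a deletion, too-short ones an insertion."""
    flagged = []
    for i, size in enumerate(insert_sizes):
        z = (size - lib_mean) / lib_sd
        if abs(z) >= z_cutoff:
            flagged.append((i, round(z, 2)))
    return flagged

# Nine ordinary ~300-bp inserts plus one pair spanning a deletion.
sizes = [298, 301, 300, 299, 302, 300, 301, 299, 300, 650]
flagged = discordant_pairs(sizes)
```

Real callers go on to cluster discordant pairs by locus and combine them with split-read or assembly evidence, which is what the tools above each do in their own way.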

Handling RNA-seq data

In 2008 Mortazavi et al. and Cloonan et al. published two of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then, the number of RNA-seq analysis tools has grown steadily throughout the literature.
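A staple of that early toolkit is the RPKM normalization introduced with the Mortazavi et al. paper, which makes read counts comparable across genes of different lengths and libraries of different depths. A minimal sketch with made-up numbers:

```python
def rpkm(counts, lengths_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads:
    divide each gene's read count by the transcript length in kilobases
    and by the total number of mapped reads in millions."""
    return {
        gene: counts[gene] / (lengths_bp[gene] / 1e3) / (total_mapped_reads / 1e6)
        for gene in counts
    }

values = rpkm(
    counts={"geneA": 500, "geneB": 100},
    lengths_bp={"geneA": 2000, "geneB": 500},
    total_mapped_reads=10_000_000,
)
# geneB attracts far fewer reads than geneA but is four times shorter,
# so its normalized expression ends up comparable.
```

Differential-expression tools such as those discussed below generally work from raw counts with their own normalization, but RPKM-style values remain a common within-sample summary.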

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year, Chris Burge and colleagues published the MISO model to estimate expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later, Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.

Last year Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment, and they also assessed the performance of software to reconstruct transcripts.

David Haussler and colleagues showed in 2010 with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that functional RNA motifs can be discovered through their structure.
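The quantity at the heart of SHAPE-MaP is a per-nucleotide reactivity derived from mutation rates in reagent-treated, untreated and denatured samples. A simplified sketch of that calculation with toy numbers (the published pipeline adds read-depth filters and profile normalization):

```python
def shape_reactivity(treated, untreated, denatured):
    """Per-nucleotide reactivity in the spirit of SHAPE-MaP: subtract the
    background mutation rate of the untreated sample from the reagent-treated
    sample and normalize by the denatured control. Rates are mutation counts
    divided by read depth at each position."""
    profile = []
    for t, u, d in zip(treated, untreated, denatured):
        profile.append((t - u) / d if d > 0 else None)
    return profile

# Toy mutation rates at four positions; high reactivity marks flexible,
# likely unpaired nucleotides.
profile = shape_reactivity(
    treated=[0.05, 0.01, 0.08, 0.02],
    untreated=[0.01, 0.01, 0.02, 0.01],
    denatured=[0.10, 0.10, 0.10, 0.10],
)
```

The denatured control corrects for position-specific differences in how readily the reverse transcriptase converts adducts into mutations, which is what makes the profiles comparable along a transcript.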

Despite the many computational tools we have published it is still not always easy to predict a priori which one will be taken up by the community. We’d love to hear from you what you think makes a top notch analysis tool.

 

 

Strengthening communities through competition

Community bioinformatics challenges help drive methods development.

Science moves ahead faster as a social enterprise, perhaps especially so in the dynamic area of bioinformatics. Bioinformatics competitions are important opportunities for developers (and users) to come together to define the essential questions in the field and decide on the best metrics to evaluate them. They also perform a critical function in making valuable benchmark datasets available to anyone, including small labs and young students.

At the end of a challenge, ideally, comes a better appreciation of the most promising approaches to a problem, as well as a recognition of difficulties and opportunities for future development. And there are new contacts for collaboration. As a reality check, it is less common for researchers with directly competing methods to collaborate; their work depends on a competitive funding model. But complementary approaches provide fertile ground for exploring new ways to attack a problem, and some contests are directly encouraging collaborative coding.

In our July Editorial, we continue our support of these initiatives, urging participation and an embrace of formats that maximize engagement among participants. Already, measures like online forums, webinars and conferences involve participants in the planning and interpretation stages, which are critical for getting the most out of each event.

A variety of formats beyond the traditional bake-off are evolving in the collaborative spirit, encouraging more sharing of ideas and code. For example, hackathons take on more focused coding challenges in a single dedicated meet-up session, while open-source competitions make code available during the contest to allow researchers to learn from each other. These formats are not meant as an evaluation of existing methods, but promote new solutions. As Gustavo Stolovitzky of the DREAM challenges points out, publishing code during the event has the potential for ‘herding’ behavior (copy-the-leader), which can stifle creativity and produce a coding monoculture. A number of DREAM challenges now use a two-stage approach in which top performers from a traditional competition phase are invited back to develop a new and better solution together.

Journals and funders also play a role in supporting these efforts. Nature Methods has published a number of papers resulting from community competitions (CAFA, DREAM, FlowCAP, the particle tracking challenge and RGASP), and the Nature journals have been committed to providing these papers under a Creative Commons Attribution-NonCommercial-ShareAlike unported license since January 2013.

There are difficulties associated with running large-scale events. Choice of data set and metrics can bias evaluations towards certain solutions, and the involvement of many developers can water down the conclusions resulting from the challenge. Moreover, usability is often not considered since it is hard to quantify. Ultimately, these issues can be helped by boosting participation in decision-making during planning stages, tailoring conclusions to each scenario that is tested, and having judging panels test the best-performing methods to ensure usability.

We are heartened to see the continued success of community-led competitions and the birth of contests in new areas. In a guest post, we invited organizers of the CAMI competition to announce their upcoming event on metagenome data interpretation.

Below, we provide a non-comprehensive list of some recent and ongoing challenges:

Bake-offs
Assemblathon – genome assembly
CAFA (Critical Assessment of Function Annotation) – protein functional prediction
CAGI (Critical Assessment of Genome Interpretation) – functional variant prediction
CAMI (Critical Assessment of Metagenome Interpretation) – see the announcement
CAPRI (Critical Assessment of PRediction of Interactions) – structure-based protein-protein interaction prediction
CASP (Critical Assessment of protein Structure Prediction) – protein structure prediction since 1994!
DREAM (Dialogue for Reverse Engineering Assessment and Methods) – systems biology challenges with hybrid formats and challenge-assisted review
FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods)
Grand Challenges in biomedical image analysis
Particle tracking challenge
RGASP (RNA-seq Genome Annotation Assessment Project)

Crowdsourcing competitions, hackathons and fast challenges
BioHackathons – open-source programming meetups
Innocentive – commercial platform offering cash prizes (e.g. the $1 million US Defense Threat Reduction Agency (DTRA) challenge to identify organisms from a stream of DNA sequences)
DNA60IFX – short challenges based on DNA or RNA sequence data
DREAM – a number of recent and current challenges include a collaborative phase of tool development
Neurosynth hackathons – open-source programming meetups in computational neurobiology
Sequence Squeeze – open-source competition for sequence file compression (cash prize)
topcoder – variety of computational challenges, with some cash prizes

The Critical Assessment of Metagenome Interpretation (CAMI) competition

Alice McHardy, Alex Sczyrba and Thomas Rattei announce a new initiative for assessing metagenomics methods in this guest post.

Alice McHardy (credit: Folker Meyer)

Alex Sczyrba (credit: A. Sczyrba)

Thomas Rattei (credit: Anja Venier)

In just over a decade, metagenomics has developed into a powerful and productive method in microbiology and microbial ecology. The ability to retrieve and organize bits and pieces of genomic DNA from any natural context has opened a window into the vast universe of uncultivated microbes. Tremendous progress has been made in computational approaches to interpret this sequence data but none can completely recover the complex information encoded in metagenomes.

A number of challenges stand in the way. Simplifying assumptions are needed and lead to strong limitations and potential inaccuracies in practice. Critically, methodological improvements are difficult to gauge due to the lack of a general standard for comparison. Developers face a substantial burden to individually evaluate existing approaches, which consumes time and computational resources, and may introduce unintended biases.

The Critical Assessment of Metagenome Interpretation (CAMI) is a new community-led initiative designed to help tackle these problems by aiming for an independent, comprehensive and bias-free evaluation of methods. We are making extensive high-quality unpublished metagenomic data sets available for developers to test their short read assembly, binning and taxonomic classification methods. The results of CAMI will provide exhaustive quantitative metrics on tool performance to serve as a guide to users under different scenarios, and to help developers identify promising directions for future work.

As a community effort, we encourage feedback from both method developers and users of metagenome analysis tools. The CAMI initiative was one of the four discussion threads of the Metagenome Meeting at the Newton Institute in Cambridge this year. Another open discussion with both developers and users of computational metagenome methods will take place at a roundtable at the ISME conference in Seoul in August.

We urge developers to participate by registering for the competition on our website and joining our Google+ group to provide feedback on the current design phase. The competition is tentatively scheduled to open at the end of 2014. Key data sets are being generated, and CAMI is currently seeking additional data contributors to provide genomes of deep-branching lineages for data set generation. The results will be presented and discussed in a workshop a few months after the competition. We aim for a joint publication of the generated insights together with all CAMI contest participants and data contributors.

We encourage everyone to get involved and spread the word!

Here there be software

Software plays an important role in scientific research, and published studies increasingly rely on custom software code developed by authors. This calls for better transparency in research articles and improved access to the software and code itself.

This month in Nature Methods and on methagora we revisit issues regarding software reporting and availability first raised exactly seven years ago in our March 2007 Editorial “Social software”. Our March 2014 Editorial updates and expands on these editorial policies, and a blog post provides details of our guidelines for custom algorithms and software reported in Nature Methods research papers. We encourage researchers to read these, particularly those considering submitting a research manuscript using or reporting custom software to us. We also hope that publicizing our editorial policies might aid other journals in thinking about how to handle algorithms and software associated with the research they publish.

Of course, these efforts are only one small part of what needs to be done to improve access to and use of scientific research software. As can be seen by our somewhat complex guidelines, it is difficult to establish simple rules that are sensible and fair for all cases and all communities. Community participation will be essential for refining and improving how software is handled.

Nature Methods currently relies on the use of Supplementary Software zip files for authors to supply the software and code underlying research articles. This isn’t pretty but it fulfills our basic needs. For example, 50% of the research articles in our March issue contain Supplementary Software files. But better methods are needed to archive and document code and assign provenance.

An important initiative in this regard is the “Code as a research object” project, a collaboration between Mozilla Science Lab, GitHub and figshare that seeks to “better integrate code and scientific software into the scholarly workflow.” The aim is to create citable endpoints for the exact code used in particular studies. [Full disclosure: figshare is a product of Digital Science which, like Nature Methods, is part of Macmillan Publishers.]

The project is still in its early stages and follows on the similar but broader Research Object community project. Similarly, GigaScience and F1000Research are experimenting with archiving code and pipelines with DOIs.

We applaud these efforts and encourage the broader research community to participate in them. The current discussion about what is needed for code reuse (announced on the ScienceLab blog) and going on in a thread at GitHub would greatly benefit from more input from researchers who don’t consider themselves code jockeys.

There are many sophisticated and powerful things that could be done in an ideal world to facilitate code exposure and reuse, but the situation at the great majority of journals is so underdeveloped and the needs so acute that even small flexible steps forward will have a positive impact. Most important is for facilities to be put in place that allow and encourage the entire community to move forward, not just a small portion of it.

Guidelines for algorithms and software in Nature Methods

A large proportion of the original research published in Nature Methods relies to varying degrees on custom algorithms and software developed by the authors. Here we provide guidance on our relevant material sharing and reporting policies.

Nature Methods first outlined our material sharing and reporting standards for algorithms and software in a March 2007 Editorial. Now, after seven years of experience applying those policies we updated and expanded on them in our March 2014 Editorial. On this page we provide more detailed guidelines for authors submitting manuscripts containing unpublished algorithms and software they created. We are posting this information here because we’d like these guidelines to evolve and we want input from our communities on how they think this should happen. Please comment below and let us know your thoughts. We will update this document as our policies change.

Manuscripts published in Nature Methods include methods and tools in which algorithms and software represent an increasingly important methodological component. However, the degree to which they are central to the reported methodology can vary considerably. The algorithm or tool may be the entire motivation for publishing the work, or it may be ancillary to it. Additionally, the methodology may be a novel algorithm of value in and of itself, but a coded implementation is still necessary for the authors to show that it works as expected. Finally, the software tool may implement existing algorithms in a user-friendly form to deliver high-value functionality of substantial general interest. Because of this wide variety, it is inappropriate to enforce one-size-fits-all standards for algorithms and software reported in Nature Methods. The guidelines below represent our current editorial position on software reporting and release.

Client-side Software
This is software that is installed and used on a personal computer and not intended to be accessed remotely as a web service. It can be entirely stand-alone on a commonly available operating system (Windows, Mac OS X, or *nix) or can require the user to have a popular software platform installed (MATLAB or LabVIEW). In all cases, but particularly when using MATLAB or LabVIEW, all platform versions and software dependencies must be detailed in the supplied documentation.
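For a Python-based tool, one lightweight way to detail the platform versions and dependencies the documentation must list is to have the software report them itself. This is only an illustrative sketch, not a required format; the dependency names checked here (`numpy`, `scipy`) are examples.

```python
# Sketch: generate the platform and dependency-version summary that
# client-side software documentation should include (illustrative only).
import platform


def environment_report():
    """Return a plain-text summary of the runtime environment."""
    lines = [
        f"Operating system: {platform.system()} {platform.release()}",
        f"Python version:   {platform.python_version()}",
    ]
    # Report versions of third-party dependencies, if installed.
    for module_name in ("numpy", "scipy"):
        try:
            module = __import__(module_name)
            lines.append(f"{module_name} version: {module.__version__}")
        except ImportError:
            lines.append(f"{module_name}: not installed")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Including such a report in the readme, or printing it at start-up, gives reviewers and users a concrete record of the environment the software was tested on.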

At Submission

  • If the custom algorithm/software is central to the method and has not been reported previously in a published research paper, it must be supplied by the authors in a usable form including one or more of the following:
    a) Source code
    b) Complete pseudocode
    c) Full mathematical description of the algorithm
    d) Compiled standalone software

    We strongly urge that full source code be provided. A compiled executable alone is not sufficient, but one may be required if the tool is intended to be of wide general use. The final acceptable forms of release of the algorithm, software and code will be determined by the editor after consultation with referees. This decision will be influenced by the editorial motivation for publishing the work (e.g., high novelty or satisfying a wide general need).

  • If the software is ancillary to the methodology being reported or is a routine implementation of obvious processes, such as microscope control software or analyses that are otherwise adequately described, the software need not be supplied to reviewers at submission but final release requirements may change in the course of the review process.
  • Supplied source code or software must be accompanied by documentation sufficient for a typical user to compile, install and use the software. Depending on the nature of the software tool, how central it is to the manuscript and our editorial motivation for considering the work, the minimum documentation may be a simple readme file or a full manual in PDF format.
  • If appropriate, sample data known to work with the software should be provided along with the expected output. Referees are encouraged to use the tool to analyze their own data.
  • The software and associated files may be supplied for reviewers as either:
    1. A single Supplementary Software zip file up to 200 MB in size
    2. Four DVDs to be mailed to the reviewers.
  • Any restrictions on the availability of software or code used to implement novel algorithms must be specified at the time of submission. Editors will decide whether any restrictions are acceptable in consultation with the reviewers. If some restrictions are deemed acceptable, they must be clearly explained in the methods section of the manuscript. Authors must supply all information needed for the reviewers to properly evaluate the software or code. If the motivation of the submitted manuscript is to provide a useful tool, rather than report a new algorithmic development, there should be no substantial restrictions on software or code availability.
  • We encourage authors to provide a license with the software or code.
  • A narrative description of key algorithmic components should be provided in the main text. Extensive equations, pseudocode or snippets of source code should be confined to the Online Methods or a Supplementary Note.

At Acceptance

  • If the software is central to the methodology and non-obvious, the source code should be provided in a Supplementary Software zip file as described above so that readers can easily access the exact code used to obtain the results in the paper. There are some possible exceptions:
    1. If the author’s institution requires a user to accept a license agreement or if the author has other reasonable grounds for not providing the source code as Supplementary Software, it may be acceptable for the author to host source code on an institutional server and require that users fill out an online form and agree to a license before downloading the software. In this instance the software must have version numbering and a link to the version used in the work must be provided in the manuscript.
    2. In some situations it may be permissible for authors to supply only compiled software as Supplementary Software but to provide the source code to academic users upon email request. Details of availability must be clearly stated in the manuscript.
    3. It is not acceptable to make software and code available by email request only.

  • If the software or code isn’t the main tool/method being reported in the manuscript the authors may provide a note in the readme file of the Supplementary Software cautioning users that the code is unsupported and not intended for general use. In this case it is permissible that the software or code be made available only by email request but the authors must state this availability in the manuscript.
  • Regardless of how the software is made available, the code supplied with the manuscript must be identical to that used to obtain the data in the paper. An exception can be made for changes that don’t alter the processing of input data. The authors may however provide a link to access new versions of the software.
  • We strongly encourage authors to include a license with all published software and code.
  • We encourage authors to provide macros for recording the software version and parameter settings during analyses or to integrate this functionality into the software itself.
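The version- and parameter-recording suggested in the last point above could take many forms; a minimal sketch, assuming a Python tool and an illustrative JSON-lines log format (the names and fields here are hypothetical, not a required convention):

```python
# Sketch: record the software version and parameter settings used for
# each analysis run, so results can be traced back to exact inputs.
import json
import time

SOFTWARE_VERSION = "1.2.0"  # illustrative; normally read from the package


def record_run(log_path, parameters):
    """Append a timestamped record of this analysis run to log_path."""
    record = {
        "software_version": SOFTWARE_VERSION,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "parameters": parameters,
    }
    with open(log_path, "a") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record


# Example: record the settings used for one analysis.
record_run("analysis_log.jsonl", {"threshold": 0.05, "iterations": 100})
```

Integrating a record like this into the software itself spares users from having to note versions and parameters by hand.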

Web Tools/Resources
These represent a special class of software that often cannot be expected to follow the guidelines outlined above. This is particularly true if the web tool or resource is supplied as a service and has few, if any, novel computational aspects. The only end-user requirement for web tools is that they be freely accessible with any modern web browser.

Nature.com provides a proxy server for reviewers to access web tools and resources anonymously.

At Submission

  • The authors must supply a working link and any necessary login information.
  • Any unpublished algorithms central to the operation of the tool should be supplied in forms a), c) or d) detailed above.

At Acceptance

  • The authors should supply written confirmation that they will keep the website and tool operating and freely accessible for the foreseeable future.

People, publishing, and policy: Q&A with Janet Thornton, director of the European Bioinformatics Institute.

Janet Thornton has been named Dame Commander of the Order of the British Empire. She feels it is an important recognition of bioinformatics. (Credit: EMBL-EBI)

The scientist profiled in the February issue of Nature Methods (the Author File) is Janet Thornton, the director of the European Bioinformatics Institute.

Here, she shares some additional insight about publishing, science policy, and mentoring. What follows is an edited excerpt of her conversation with Nature Methods. Read more here.

VM: In an era of not-so-plentiful funds, ELIXIR (interviewer looks up acronym…), the European Life-sciences Infrastructure for Biological Information, and other initiatives take you deep into policy-making, which tends not to resemble a picnic on a sunny Nottinghamshire day. What motivates you?

JT: ELIXIR was launched Dec 18 and now has its own director. It does feel a bit like it’s my child. But it’s a child that has grown up and is really on its way to becoming an independent adult. It’s still got a long way to go. It’s a bit like a teenager, actually. (laughs)

I honestly believe that these initiatives are the best way forward because, despite the setbacks, everyone broadly agrees. So it is a case of getting through the politics and making the science happen. As we know, science has no borders—and all scientists agree with this—so in the end, common sense will win and we can go forward.

VM: You have published around 400 papers. What does a paper mean to you?

JT: Probably for me the most important part of the process of science is publishing a paper. Because it’s the time when you really sort out what matters, why you did it, what you discovered and then you try and make it understandable for other people. And I have to say I get really upset when my papers are rejected.

VM: What types of papers do you enjoy reading?

JT: I love reading good solid papers, which are logical and explain how the results are obtained and why they are important. I used to spend hours in the library, like a detective tracking down information and knowledge.

VM: Rumor has it, you still present posters.

JT: I don’t often present posters but there was one particular occasion when the University of Cambridge organized an event and they asked all the senior staff throughout the university to present posters. That was the last sort of official poster presentation. Of course, my students and post-docs have posters all the time. And I do man those posters as appropriate. It’s fun. You talk about your work.

VM: What is the best way for a scientist to select members most suited to his or her lab?

JT: Five things I look for: a) Bright/clever, b) Committed and interested in a project or area of research, c) Relevant expertise – though this is not the most important thing, d) What does the lab think? e) Would I like to have a meeting at 9am on a Monday morning with this person?

VM: Computational resources in the life sciences are not always appreciated. What do you recommend to scientists keen on being and staying tool-builders and resource-providers?

JT: Find a good place to go to follow your dream; find someone you want to work with and prepare yourself for the future. Not all scientists can be principal investigators (PIs), nor indeed want to be, so the key is to find your own niche.

VM: You studied physics at the University of Nottingham, then shifted to biophysics for your PhD at the National Institute for Medical Research. What do you advise when students of any stripe wonder: ‘Shall I choose physics? Computer science? Biology?’

JT: I am afraid I am biased—go with biology—it is amazing, beautiful, complex, but still an open book with lots to discover. And even if this were not enough, it has so many really important applications—many of the so-called grand challenges that will literally affect the future of this planet and everyone on it.

Nature journals provide a CC license for community experiments

Nature Methods has long been an advocate of the value of community experiments (or competitions/challenges) to assess and compare the performance of algorithms and software tools. In 2008 we discussed the value of these competitions and advocated that they also be used to assess the performance of less widely used algorithms such as those used for single particle tracking. Such an experiment for assessing single particle tracking was run in 2012, although the results are still awaiting publication.

Publication of such work has often been confined to more specialized journals but in 2012 Nature Methods started publishing manuscripts emanating from these competitions with a manuscript assessing the performance of gene regulatory network inference methods based on results of one of the DREAM5 challenges.

In recognition of the profound value such challenges provide to the wider scientific community, the Nature journals will now be publishing manuscripts describing the results of these challenges under a Creative Commons Attribution-NonCommercial-ShareAlike Unported license. This is the same license we use for publishing first genome papers, standards papers and white papers. The first example of this is an Analysis article published in Nature Methods yesterday describing the results of the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment.

Publication of such community experiments will necessarily be highly selective and likely increasingly so as such challenges become more prevalent, as illustrated by the explosion in the number of Grand Challenges in Medical Image Analysis. But these community experiments provide invaluable information on the performance of methods that are otherwise difficult to objectively compare. We hope that the potential for publication in a Nature journal and the open access provided by a creative commons license helps encourage broader participation in these efforts and visibility of the results.

Update: February 12
We just published another manuscript describing a community experiment. This Analysis article presents the results of the first FlowCAP challenge that assessed the performance of flow cytometry automated analysis methods.