Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools to make sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life, owing to constant pressure to improve and innovate, while others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base calling, and in 2008 Yaniv Erlich and Gregory Hannon reduced calling errors in Illumina data with Alta-Cyclic, which uses machine learning to suppress noise.

Once bases are called they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009 Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST is a cache-oblivious, seed-and-extend short-read mapper presented in 2010 by Cenk Sahinalp and colleagues. Bowtie 2, a gapped read aligner by Ben Langmead and Steven Salzberg, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
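
The seed-and-extend strategy behind mappers like mrsFAST can be illustrated in a few lines. The sketch below is a deliberately simplified, hypothetical toy (real mappers use far more sophisticated indexes and gapped extension): it indexes fixed-length seeds from the reference, then extends each seed hit by counting mismatches.

```python
# Toy seed-and-extend read mapper: index k-mers from the reference,
# then extend each exact seed hit and count mismatches over the full read.
# Purely illustrative; not the algorithm of any published tool.
from collections import defaultdict

def build_seed_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches):
    """Return reference positions where the read aligns with few mismatches."""
    hits = set()
    for pos in index.get(read[:k], []):          # seed: exact match of the read prefix
        window = reference[pos:pos + len(read)]  # extend: compare the remainder
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(pos)
    return sorted(hits)

reference = "ACGTACGTTGCAACGTACGA"
index = build_seed_index(reference, k=4)
print(map_read("ACGTACGA", reference, index, k=4, max_mismatches=1))  # → [0, 12]
```

The seed step makes the search fast (only exact k-mer hits are extended), while the extension step restores sensitivity to mismatches; the trade-off between seed length and mismatch tolerance is exactly where real mappers differ.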

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The ongoing high citation level of this paper, and other work pointing out the limits of current assembly programs, highlights that de novo read assembly continues to be a challenge.

Finding structural variants

In 2009 Paul Medvedev and Michael Brudno looked at tools to discover structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20-50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variation ranging in size from 10 base pairs to 1 megabase. In 2011 Evan Eichler and colleagues added Splitread to find indels, de novo structural variants and copy number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; its combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.

Handling RNA-seq data

In 2008 Mortazavi et al. and Cloonan et al. published two of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then the number of RNA-seq analysis tools has grown steadily throughout the literature.

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year Chris Burge and colleagues published the MISO model to estimate expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.

Last year Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment, and they also assessed the performance of software to reconstruct transcripts.

David Haussler and colleagues showed in 2010 with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that functional RNA motifs can be discovered from their structure.

Despite the many computational tools we have published, it is still not always easy to predict a priori which ones will be taken up by the community. We’d love to hear from you about what you think makes a top-notch analysis tool.

Here there be software

Software plays an important role in scientific research, and published studies increasingly rely on custom software code developed by authors. This calls for better transparency in research articles and improved access to the software and code itself.

This month in Nature Methods and on methagora we revisit issues regarding software reporting and availability first raised exactly seven years ago in our March 2007 Editorial “Social software”. Our March 2014 Editorial updates and expands on these editorial policies and a blog post provides details of our guidelines for custom algorithms and software reported in Nature Methods research papers. We encourage researchers to read these, particularly those considering submitting a research manuscript using or reporting custom software to us. We also hope that publicizing our editorial policies might aid other journals in thinking about how to handle algorithms and software associated with research they publish.

Of course, these efforts are only one small part of what needs to be done to improve access to and use of scientific research software. As can be seen by our somewhat complex guidelines, it is difficult to establish simple rules that are sensible and fair for all cases and all communities. Community participation will be essential for refining and improving how software is handled.

Nature Methods currently relies on the use of Supplementary Software zip files for authors to supply the software and code underlying research articles. This isn’t pretty but it fulfills our basic needs. For example, 50% of the research articles in our March issue contain Supplementary Software files. But better methods are needed to archive and document code and assign provenance.

An important initiative in this regard is the “Code as a research object” project, a collaboration between the Mozilla Science Lab, GitHub and figshare that seeks to “better integrate code and scientific software into the scholarly workflow.” The aim is to create citable endpoints for the exact code used in particular studies. [Full disclosure: figshare is a product of Digital Science which, like Nature Methods, is part of Macmillan Publishers.]

The project is still in its early stages and follows on the similar but broader Research Object community project. Similarly, GigaScience and F1000Research are experimenting with archiving code and pipelines with DOIs.

We applaud these efforts and encourage the broader research community to participate in them. The current discussion about what is needed for code reuse (announced on the ScienceLab blog) and going on in a thread at GitHub would greatly benefit from more input by researchers who don’t consider themselves code jockeys.

There are many sophisticated and powerful things that could be done in an ideal world to facilitate code exposure and reuse, but the situation at the great majority of journals is so underdeveloped and the needs so acute that even small flexible steps forward will have a positive impact. Most important is for facilities to be put in place that allow and encourage the entire community to move forward, not just a small portion of it.

Guidelines for algorithms and software in Nature Methods

A large proportion of original research published in Nature Methods relies to varying degrees on custom algorithms and software developed by the authors. Here we provide guidance on our relevant material sharing and reporting policies.

Nature Methods first outlined our material sharing and reporting standards for algorithms and software in a March 2007 Editorial. Now, after seven years of experience applying those policies, we have updated and expanded on them in our March 2014 Editorial. On this page we provide more detailed guidelines for authors submitting manuscripts containing unpublished algorithms and software they created. We are posting this information here because we’d like these guidelines to evolve and we want input from our communities on how they think this should happen. Please comment below and let us know your thoughts. We will update this document as our policies change.

Manuscripts published in Nature Methods include methods and tools in which algorithms and software represent an increasingly important methodological component. However, the degree to which they are central to the reported methodology can vary considerably. The algorithm or tool may be the entire motivation for publishing the work or it may be ancillary to it. Additionally, the methodology may be a novel algorithm of value in and of itself, but a coded implementation is still necessary for the authors to show that it works as expected. Finally, the software tool may implement existing algorithms in a user-friendly form to deliver high-value functionality of substantial general interest. Because of this wide variety it is inappropriate to enforce one-size-fits-all standards for algorithms and software reported in Nature Methods. The guidelines below represent our current editorial position on software reporting and release.

Client-side Software
This is software that is installed and used on a personal computer and not intended to be accessed remotely as a web service. It can be entirely stand-alone on a commonly available operating system (Windows, Mac OS X, or *nix) or can require the user to have a popular software platform installed (MATLAB or LabVIEW). In all cases, but particularly when using MATLAB or LabVIEW, all platform versions and software dependencies must be detailed in the supplied documentation.

At Submission

  • If the custom algorithm/software is central to the method and has not been reported previously in a published research paper it must be supplied by the authors in a usable form including one or more of the following.
    1. Source code
    2. Complete pseudocode
    3. Full mathematical description of the algorithm
    4. Compiled standalone software

    We strongly urge that full source code be provided. A compiled executable alone is not sufficient, but one may be required in addition if the tool is intended to be of wide general use. Final acceptable forms of release of the algorithm, software and code will be determined by the editor after consultation with referees. This decision will be influenced by the editorial motivation for publishing the work (e.g. high novelty or satisfying a wide general need).

  • If the software is ancillary to the methodology being reported or is a routine implementation of obvious processes, such as microscope control software or analyses that are otherwise adequately described, the software need not be supplied to reviewers at submission but final release requirements may change in the course of the review process.
  • Supplied source code or software must be accompanied by documentation sufficient for a typical user to compile, install and use the software. Depending on the nature of the software tool, how central it is to the manuscript and our editorial motivation for considering the work, the minimum documentation may be a simple readme file or a full manual in PDF format.
  • If appropriate, sample data known to work with the software should be provided along with the expected output. Referees are encouraged to try the tool on their own data.
  • The software and associated files may be supplied for reviewers as either:
    1. A single Supplementary Software zip file up to 200 MB in size
    2. Four DVDs to be mailed to the reviewers.
  • Any restrictions on the availability of software or code used to implement novel algorithms must be specified at the time of submission. Editors will decide whether any restrictions are acceptable in consultation with the reviewers. If some restrictions are deemed acceptable, they must be clearly explained in the methods section of the manuscript. Authors must supply all information needed for the reviewers to properly evaluate the software or code. If the motivation of the submitted manuscript is to provide a useful tool, rather than report a new algorithmic development, there should be no substantial restrictions on software or code availability.
  • We encourage authors to provide a license with the software or code.
  • A narrative description of key algorithmic components should be provided in the main text. Extensive equations, pseudocode or snippets of source code should be confined to the Online Methods or a Supplementary Note.

At Acceptance

  • If the software is central to the methodology and non-obvious, the source code should be provided in a Supplementary Software zip file as described above so that readers can easily access the exact code used to obtain the results in the paper. There are some possible exceptions:
    1. If the author’s institution requires a user to accept a license agreement or if the author has other reasonable grounds for not providing the source code as Supplementary Software, it may be acceptable for the author to host source code on an institutional server and require that users fill out an online form and agree to a license before downloading the software. In this instance the software must have version numbering and a link to the version used in the work must be provided in the manuscript.
    2. In some situations it may be permissible for authors to supply only compiled software as Supplementary Software but the source code to academic users upon email request. Details of availability must be clearly stated in the manuscript.
    3. It is not acceptable to make software and code available by email request only.

  • If the software or code isn’t the main tool/method being reported in the manuscript the authors may provide a note in the readme file of the Supplementary Software cautioning users that the code is unsupported and not intended for general use. In this case it is permissible that the software or code be made available only by email request but the authors must state this availability in the manuscript.
  • Regardless of how the software is made available, the code supplied with the manuscript must be identical to that used to obtain the data in the paper. An exception can be made for changes that don’t alter the processing of input data. The authors may however provide a link to access new versions of the software.
  • We strongly encourage authors to include a license with all published software and code.
  • We encourage authors to provide macros for recording the software version and parameter settings during analyses or to integrate this functionality into the software itself.
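
The last recommendation above costs only a few lines of code to meet. Below is a minimal sketch of recording the software version and parameter settings alongside each analysis run; the tool name, version string and parameters are invented for illustration.

```python
# Hypothetical sketch of logging the software version and parameter settings
# for each analysis run, as the guidelines recommend. All names are invented.
import json
import sys
from datetime import datetime, timezone

TOOL_NAME = "example-tool"   # hypothetical
TOOL_VERSION = "1.2.0"       # hypothetical

def log_run(parameters, logfile="analysis_log.jsonl"):
    """Append one JSON record describing this analysis run."""
    record = {
        "tool": TOOL_NAME,
        "version": TOOL_VERSION,
        "python": sys.version.split()[0],
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,
    }
    with open(logfile, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record

record = log_run({"threshold": 0.05, "normalization": "quantile"})
print(record["version"])  # → 1.2.0
```

An append-only, one-record-per-line log like this is trivial to parse later and gives readers exactly the information needed to reproduce a reported analysis.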

Web Tools/Resources
These represent a special class of software that often cannot be expected to follow the guidelines outlined above. This is particularly true if the web tool or resource is being supplied as a service and has few, if any, novel computational aspects to it. The only end-user requirement for web tools is that they be freely accessible with any modern web browser.

Nature.com provides a proxy server for reviewers to access web tools and resources anonymously.

At Submission

  • The authors must supply a working link and any necessary log in information.
  • Any unpublished algorithms central to the operation of the tool should be supplied in forms 1, 3 or 4 detailed above (source code, a full mathematical description or compiled standalone software).

At Acceptance

  • The authors should supply written confirmation that they will keep the website and tool operating and freely accessible for the foreseeable future.

Bioimage Informatics

It is no secret that imaging, and microscopy in particular, represents a substantial fraction of the manuscripts published in Nature Methods. Our very first focus issue, in fact, was on fluorescence imaging. When that focus was published in 2005 the term ‘bioimage informatics’ didn’t even exist. Even today, the term isn’t widely used and, unlike many other bioinformaticians, those who work on the development of algorithms and software tools for analysis of biological image data have few dedicated venues for discussing or publishing their work.

But computational techniques are becoming increasingly important in biological imaging and the people developing these tools increasingly see themselves as a distinct community. When we approached the community about publishing a focus issue on bioimage informatics there was an enthusiastic response and the results can be seen in our July issue and focus that went live today.

We hope that biologists using microscopy in their research find the information in the focus useful and that it stimulates them to try some of the tools now available and in development. Many of these tools have functionality designed to encourage community participation and aid in both the creation of new analysis methods and the communication of methods and protocols to other users.

Although these tools and the community developing them have come a long way since Wayne Rasband first released NIH Image, bioimage informatics is still in its relative infancy. As discussed in the focus editorial, algorithm development and usage will become even more important for biological microscopy and will change the way biologists perform and report their research.

Where’s your ground truth?

When using or developing experimental and observational methods it is crucial to assess the method performance in an effort to ensure that the information it provides reflects reality. For experimental biologists this often means conducting carefully chosen control experiments with alternative methods or different experimental settings. More rigorous assessment, particularly for high-throughput or large-scale methods, often requires the use of ‘ground truth’ or ‘gold standard’ data sets. But talk to different people and you will get different answers regarding what ‘ground truth’ or ‘gold standard’ data is. This often includes a nice historical explanation of where the term ‘ground truth’ comes from.

For developers of signal processing and image analysis algorithms, though, the situation is clearer: the ground truth is the signal or image you start with. But add a living system into the mix and things get far more complicated. The Editorial in the November issue of Nature Methods discusses the challenges facing developers and users of algorithms for automated analysis of biological data, with a focus on image data. In short, traditional ground truth data is often insufficient. The addition of integrated-editing and change-logging capabilities to these software tools can increase the quality of the analysis, aid further algorithm development and increase the likelihood of biologists adopting the software in the first place.
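
The change-logging idea mentioned above can be sketched simply: every manual correction a user makes to an automated result is recorded, so the curated output can later serve as validation data for the algorithm. The class and field names below are invented for illustration, not taken from any published tool.

```python
# Hypothetical sketch of change-logging for an automated analysis result:
# manual corrections are applied and recorded, so the edit trail is kept
# alongside the data. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class SegmentationResult:
    labels: dict                        # object id -> label from the algorithm
    edits: list = field(default_factory=list)

    def correct(self, object_id, new_label, user):
        """Apply a manual correction and log who changed what."""
        old = self.labels.get(object_id)
        self.labels[object_id] = new_label
        self.edits.append({"object": object_id, "from": old,
                           "to": new_label, "by": user})

result = SegmentationResult(labels={1: "cell", 2: "debris"})
result.correct(2, "cell", user="reviewer1")
print(result.labels[2])    # → cell
print(len(result.edits))   # → 1
```

Keeping the edit trail with the data is what makes the curated result usable as a benchmark: the discrepancies between the algorithm's output and the human corrections are exactly the cases the next algorithm version should learn from.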

iPhones in the lab

Do you use your iPhone (or other smartphone or mobile computing device) in the lab? This month’s editorial notes how large numbers of scientists seem to have an iPhone or other mobile device capable of running quite sophisticated applications, or apps. Increasing numbers of these apps are targeted at biologists and some are even intended for use at the lab bench, and lists of recommended apps are popping up on blogs and other sites. Check out the links below for a sample.

22 iPhone Apps for Science Geeks – July 11, 2008

More iPhone apps for scientists – October 13, 2008

5 Bio-Related Apps for your iPhone/iPod Touch – November 4, 2008

iPhone and research – July 24, 2009

iPhone apps every biologist needs – October 9, 2009

10 Best iPhone Apps for Science Majors – December 23, 2009

Some recently released apps that don’t appear in the lists above are:

Bio-Rad PCR – Practical guidance for performing PCR and qPCR

NEB Tools – Double digest finder and restriction enzyme finder tools

ChemMobi – Search for chemical information by name or ID. View selected properties, MSDS information and structure.

LabCal & LabCalPro – Various laboratory calculation functions: molarity, moles, stock dilutions, pH and g-force

GeneticCode & GeneticCodePro – Reference tool for the nucleic acid codon table and amino acid properties
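
Calculations of the kind these apps handle are also simple to script. Here is a short sketch (function names are my own, not taken from any of the apps above) of two routine bench calculations: molarity from dissolved mass, and stock dilution via the C1·V1 = C2·V2 rule.

```python
# Sketch of two routine bench calculations of the kind these apps perform:
# molarity from mass, and stock dilution via C1 * V1 = C2 * V2.
# Function names are illustrative.

def molarity(mass_g, molar_mass_g_per_mol, volume_l):
    """Concentration in mol/L from dissolved mass and final volume."""
    return (mass_g / molar_mass_g_per_mol) / volume_l

def stock_volume_needed(stock_conc, final_conc, final_volume):
    """Volume of stock to dilute: V1 = C2 * V2 / C1 (any consistent units)."""
    return final_conc * final_volume / stock_conc

# 5.844 g NaCl (58.44 g/mol) dissolved to 1 L gives a 0.1 M solution.
print(round(molarity(5.844, 58.44, 1.0), 3))   # → 0.1
# To make 100 mL of 0.1 M from a 1 M stock, take 10 mL of stock.
print(stock_volume_needed(1.0, 0.1, 100.0))    # → 10.0
```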

But how likely is it that bench researchers will actually use an expensive personal mobile computing device like an iPhone in the lab environment? Since we are no longer in the lab ourselves, we wonder what the current generation of grad students and post-docs are doing. Is your iPhone useful in the lab? What about a similar portable device? What apps do you use?

Although there has been a lot of noise surrounding the new Android-based phone from Google, we were unable to find any apps intended for use in the lab on that platform, with the exception of seemingly hundreds of scientific calculator apps. We would love to hear from any readers who are familiar with scientific apps available for platforms other than the iPhone.

Yesterday Apple announced the long anticipated iPad mobile computing device. This tablet computer can run iPhone apps in addition to providing a far larger screen and the capability to run more powerful applications than the iPhone. It seems unlikely that such a device would become as ubiquitous among scientists as the iPhone since it doesn’t double as a phone. However, it has definite advantages as a dedicated laboratory tool and is more suitable for reading journal articles than the iPhone.

Speaking of reading journal articles, the nature.com app should be available for download from the Apple app store on February 1. We’ll keep you posted.

Social software

Don’t be mistaken, Nature Methods’ material sharing policy includes the requirement to make custom-developed software available upon publication. But there are several ways of making software available. We examine the various degrees of disclosure and the choice of formats and try to clarify our position. Let us know if we are heading in the right direction!