TechBlog: Software quality tests yield best practices

{credit}Alexandros Stamatakis/GitHub{/credit}

Life science research increasingly runs on software. A good fraction, perhaps even most of it, is made by academics, for academics: rough around the edges, perhaps, but effective, not to mention free. But is it of high quality?

Alexandros Stamatakis decided to find out.

Stamatakis is a computer scientist and bioinformatician at HITS, the Heidelberg Institute for Theoretical Studies in Germany, and a professor of computer science at the Karlsruhe Institute of Technology. His team has been developing and refining software tools for evolutionary biology for more than 15 years, he says, including one called RAxML (from which the code snippet shown above was pulled). Yet for all that time, his code still wasn’t perfect.

“The more I developed it the more bugs I had to fix and the more I started worrying about software quality,” he says.

Not software ‘accuracy’, mind you — when it comes to phylogenetics, it’s difficult to know whether software is providing the correct answer. “You don’t know the ground-truth,” Stamatakis says. Rather, he was curious whether popular tools meet computer-science standards for quality.

To find out, Stamatakis and his team downloaded the code for 16 popular phylogenetic tools (plus, as a control, one from the field of astronomy), which collectively have been cited more than 90,000 times. They then ran those codes — 15 of which were written in C/C++ and the last in Java — through a series of tests.

For instance, they looked at how well software can scale from a desktop computer to a large cluster, something that increasingly is necessary as life science datasets balloon in size. They measured the amount of duplicated code in the software to get a rough indication of maintainability. And they counted the number of so-called ‘assertions’ — logical statements in the code that assert, for instance, that a value falls within a certain range, and that cause the software to terminate should they fail — to obtain a measure of code ‘correctness’.
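To make the idea of an assertion concrete, here is a minimal sketch in Python. It is not taken from any of the tools in the study (most of which are written in C/C++), and the function names `site_likelihood_product` and `count_assertions` are hypothetical, chosen only for illustration. The first function guards its input and output with assertions; the second counts `assert` statements in a piece of Python source, a rough stand-in for the kind of tally the team describes.

```python
import ast


def site_likelihood_product(probabilities):
    """Multiply per-site probabilities (hypothetical example function)."""
    # Precondition: at least one value, and each one is a valid probability.
    assert len(probabilities) > 0, "no probabilities supplied"
    assert all(0.0 <= p <= 1.0 for p in probabilities), "probability out of range"
    product = 1.0
    for p in probabilities:
        product *= p
    # Postcondition: a product of probabilities is itself a probability.
    assert 0.0 <= product <= 1.0, "product out of range"
    return product


def count_assertions(source):
    """Count assert statements in Python source text."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.Assert) for node in ast.walk(tree))


# A small source string to count; its contents are made up for this example.
SAMPLE_SOURCE = """
def check(p):
    assert p >= 0.0
    assert p <= 1.0
    return p
"""

print(site_likelihood_product([0.5, 0.5]))
print(count_assertions(SAMPLE_SOURCE))
```

If an assertion fails, Python raises an `AssertionError`, which stops the program unless something catches it. Note that running Python with the `-O` flag strips assertions out, so they are a development safeguard rather than a substitute for input validation.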

“There have been empirical studies by computer scientists working in the field of software engineering, where they showed that there is a correlation between incorrect code, or code defects, and the number of assertions used — or let’s better say, an anti-correlation,” Stamatakis says.

So, how did the toolset do? Not too well.

As documented in an article published 29 January in Molecular Biology and Evolution, none of the 16 programs in the round-up, including Stamatakis’ own RAxML, aced all the tests. (With 57,233 lines of code, RAxML exhibited both compiler warnings and memory leaks.) But, he stresses, that is neither to denigrate the programmers who wrote those tools — who, after all, were simply trying (and generally succeeding) to solve a particular problem — nor to suggest they do not work properly.

Rather, he says, potential users must exercise caution in using these tools. “They shouldn’t blithely trust software. And they shouldn’t view it as black boxes,” but instead (as he puts it in his article) as “potential Pandora’s boxes”.

Users should also strive to understand what their code is doing, Stamatakis advises. And if unexpected results arise, they should repeat the analysis using a separate tool that performs the same task, to ensure they aren’t chasing digital phantoms.

Stamatakis concludes his article with a series of ‘best practices’ for software developers. These include running tests for memory allocation errors and leaks, using assertions, checking for code compilation warnings using multiple compilers, and minimizing code complexity and duplication — practices that are common in professional software development but less so in the life sciences.
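As a rough illustration of what a duplication check might look like, the sketch below estimates the fraction of non-blank lines that sit inside a repeated run of consecutive lines. This is an assumption-laden toy, not the tool the team used: the function name `duplicated_line_fraction`, the three-line window and the whitespace-only normalization are all choices made here for illustration.

```python
from collections import defaultdict


def duplicated_line_fraction(source, window=3):
    """Fraction of non-blank lines that fall inside a repeated window of lines."""
    # Normalize: strip surrounding whitespace and drop blank lines.
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if len(lines) < window:
        return 0.0
    # Record every start position of each run of `window` consecutive lines.
    starts_by_block = defaultdict(list)
    for i in range(len(lines) - window + 1):
        starts_by_block[tuple(lines[i:i + window])].append(i)
    # Mark every line that belongs to a run seen more than once.
    duplicated = set()
    for starts in starts_by_block.values():
        if len(starts) > 1:
            for start in starts:
                duplicated.update(range(start, start + window))
    return len(duplicated) / len(lines)


# Made-up input: the three lines a, b, c appear twice, separated by x.
example = "a\nb\nc\nx\na\nb\nc\n"
print(duplicated_line_fraction(example))
```

A high fraction suggests copy-and-paste code that could be factored into a shared function. Dedicated clone detectors also catch renamed variables and reformatted blocks, which this line-matching approach would miss.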

The tools Stamatakis’ team used to run its tests are freely available, so readers can try them themselves to see how trustworthy their chosen software is.

Journal editors, he says, should consider requiring such tests of any peer-reviewed work, either performed by the authors themselves prior to submission, or by the peer-reviewers. In fact, during our conversation, Stamatakis suggested he might make the toolbox available as a Python script or Docker container, to make it easier for others to adopt. If and when he does, we’ll let you know. In the meantime, caveat emptor!

 

Jeffrey Perkel is Technology Editor, Nature

 

TechBlog: Timothée Poisot: Data science for the rest of us

{credit}Timothée Poisot{/credit}

Timothée Poisot recently travelled to London for MozFest 2017, “The world’s leading festival for the open Internet movement.” There, the quantitative and computational ecologist at the University of Montréal in Canada ran a session entitled “Scientific computing for the terabyte-less.” Here, he tells Naturejobs why life science research needn’t necessarily follow the Big Data model.

TechBlog: Bioconda promises to ease bioinformatics software installation woes

{credit}Johannes Köster/GitHub{/credit}

Bioinformatics is notoriously complicated, what with its arcane command-line interface, complex workflows, and massive datasets. For the uninitiated, simply installing the software can present a problem.

A new paper on the bioRxiv preprint archive describes one possible solution, a bioinformatics-focused package collection called Bioconda.

TechBlog: Interactive figures address data reproducibility

Juicebox.js{credit}Screenshot/Jeffrey Perkel{/credit}

Data reproducibility and transparency mean different things to different people, but one aspect involves allowing scientists to view and manipulate the data or code underlying published figures, both to double-check others’ work and to repeat those analyses using custom data. Over the past year, for instance, the open-access journal F1000Research has implemented integrations with Code Ocean and Plotly for viewing and manipulating programming code and figures, respectively. Now, a new publication showcases interactive figures for 3D genome analysis, too.

TechBlog: The nanopore toolbox

{credit}Nik Spencer/Nature{/credit}

For this week’s Technology Feature, Michael Eisenstein wrote about the technology, applications, and challenges of nanopore DNA sequencing. In brief, the technology involves threading intact pieces of DNA through a tiny aperture in a membrane or other barrier, through which a current flows. As each base passes, it disrupts that current in a characteristic way, allowing specialized software to determine the sequence.

The technology has multiple benefits: it’s relatively inexpensive and compact, and produces exceptionally long reads. But its error rate is also higher than that of some other technologies. As a result, informatics tools designed to handle short-read data can often stumble when confronted with nanopore sequences. But a growing collection of dedicated long-read tools is rapidly filling in the gap. I asked a few nanopore veterans to help me compile a list.

TechBlog: Jupyter powers bioinformatics, again

Bioinformatics isn’t easy for newbies. It’s typically done on the Linux command line, where users direct the computer using text-based instructions rather than clicking a mouse.

But there are alternatives. One popular choice is Galaxy; another is GenePattern. Both allow researchers to execute complex bioinformatics tools via open-source, point-and-click, web-based interfaces, freeing them from the burdens of the command line, programming, and software installation. As such, they make bioinformatics workflows relatively user-friendly. And that trend is continuing.

TechBlog: HiPiler simplifies chromatin structure analysis

For my recent Toolbox on 3D genome visualization tools, Nils Gehlenborg at Harvard Medical School clued me into two interesting pieces of software. One, HiGlass, was included in my article; a related tool, HiPiler, was not. But that doesn’t mean it’s not worth talking about.

TechBlog: Mike Goodstadt: A circuitous route to bioinformatics

{credit}CNAG-CRG{/credit}

Most coders come to bioinformatics by one of two routes. They’re either biologists skilled in programming, or programmers with an interest in biology. Mike Goodstadt, the programmer behind the genome-visualization tool TADkit, took a different approach.

In the early-to-mid 1990s, Goodstadt was a student at the University of Bath in the UK. His course of study: Architecture.

TechBlog: Jupyter Joins the Galaxy

{credit}PLoS Comput Biol 13(5):e1005425{/credit}

There’s a galaxy of tools in the Galaxy bioinformatics environment — 4,807 at last count. With them, researchers can do just about anything, computationally speaking. One thing they couldn’t do was work with their data programmatically. Now, thanks to a recent software update, that gap has been filled.

In a paper published on 25 May in PLoS Computational Biology, the Galaxy team describes a plug-in that provides access to both Jupyter (née IPython notebook) and RStudio.

TechBlog: The sound of DNA

{credit}Mark Temple{/credit}

With an alphabet comprising just four letters, DNA sequence isn’t much to look at. So, when sequence analysis tools want to highlight key elements, they typically do so using colour or font, or by overlaying other types of information. In the not-too-distant future, there may be another option: audio.

In a paper published this past April in BMC Bioinformatics, molecular biologist and part-time drummer Mark Temple of Western Sydney University, Australia, describes “an auditory display tool” for DNA: sequence in, audio out.

Available online at dnasonification.org, the tool does precisely what it sounds like: Given a sequence of DNA, it will convert the As, Cs, Gs, and Ts into notes played by a virtual piano, guitar, and organ. An ancillary browser extension, called Jazz-Plugin, is required to play the resulting MIDI files, though Temple has made a number of example MP3 files available on his web site and on YouTube.

After uploading a sequence, the user can select precisely how the musical transcription is accomplished. The simplest mode maps each base to a single note, providing a four-tone auditory landscape. Another maps dinucleotides to notes, increasing the complexity to 16 total sounds.

Most informative, says Temple, is the trinucleotide mode. Here, just as the genetic code maps 64 codons to 20 amino acids, the software maps each nucleotide triplet to one of 20 notes, and outputs the audio in each of three reading frames at once. The result is a series of three-note arpeggios, such as CGF-ADD-CFF-DFG-AFC-GCD-FCD-FCD. Optional parameters allow the user to flag start and stop codons, or to cause audio in each reading frame to turn on and off as start and stop codons arise.
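For readers who think in code, the mapping modes described above can be sketched roughly as follows. This is not Temple's implementation: the note assignments in `MONO_NOTES` and `DI_NOTES` are invented for illustration, the choice to read dinucleotides without overlap is an assumption, and the function names are hypothetical. The sketch covers the one-base and two-base modes and shows how a sequence splits into triplets in each of its three reading frames.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical assignment of one note name per base (four tones in total).
MONO_NOTES = dict(zip(BASES, ["C4", "D4", "E4", "G4"]))

# Hypothetical assignment of 16 MIDI note numbers, starting at middle C (60),
# one per dinucleotide in the order AA, AC, AG, AT, CA, ...
DI_NOTES = {a + b: 60 + i for i, (a, b) in enumerate(product(BASES, repeat=2))}


def mono_notes(seq):
    """Map each base to a single note."""
    return [MONO_NOTES[base] for base in seq.upper()]


def di_notes(seq):
    """Map consecutive, non-overlapping base pairs to notes (an assumption)."""
    seq = seq.upper()
    return [DI_NOTES[seq[i:i + 2]] for i in range(0, len(seq) - 1, 2)]


def reading_frames(seq):
    """Split a sequence into complete triplets for each of three reading frames."""
    seq = seq.upper()
    return [
        [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        for frame in range(3)
    ]


print(mono_notes("acgt"))
print(di_notes("ACGT"))
print(reading_frames("ATGGCCTAA"))
```

In the trinucleotide mode, each triplet in each frame would then be looked up in a 64-to-20 table, and the three frames played together.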