Bioinformatics is notoriously complicated, what with its arcane command-line interface, complex workflows, and massive datasets. For the uninitiated, simply installing the software can present a problem.
A new paper on the bioRxiv preprint archive describes one possible solution, a bioinformatics-focused package collection called Bioconda.
The problem Bioconda attempts to solve is this. Consider the popular software tool SAMTOOLS, which is used to create and manipulate sequence alignment data files. SAMTOOLS is distributed as a zipped archive, inside of which are several hundred C language source code files and sample data. To get the software running, users have to unpack the archive, compile the source files, and install them in their correct locations. “You basically need to be a systems administrator person to install this stuff,” says bioinformatician C. Titus Brown of the University of California, Davis.
It’s not difficult, exactly (at its simplest, the process requires just a single make command), but it can go awry. And it often does, especially if one software tool depends upon another being present (a so-called dependency), or if the installation process requires the user to have high-level ‘administrative’ privileges.
Johannes Köster, a computer scientist at the University of Duisburg-Essen in Germany, encountered this problem head-on several years ago. While running a software training workshop, he needed the participants to install a suite of tools so they could follow along and participate in the exercises. But asking everyone to install that software the old-fashioned way was asking for trouble. So, he looked for an alternative.
What Köster needed, he realized, was a package manager: a tool that could easily find, install, and update different software packages, and resolve the inevitable dependency issues, regardless of a user’s system permissions. Several such solutions exist, including Debian Med (a life-sciences-focused package collection for Debian Linux) and Homebrew (a MacOS-specific tool). Another option, and Köster’s choice, was Conda, a “platform-agnostic” system originally developed to support a science-focused distribution of the Python programming language, called Anaconda.
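To make "resolving dependencies" concrete, here is a minimal Python sketch of one piece of the job: working out an install order in which every package comes after the packages it needs. It is a toy illustration, not how Conda's solver works (a real solver also has to reconcile version constraints), and the package names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each package maps to the packages it needs.
# The names are illustrative only and do not reflect real package metadata.
DEPENDENCIES = {
    "aligner": {"compression-lib", "sequence-io"},
    "sequence-io": {"compression-lib"},
    "compression-lib": set(),
}


def install_order(dependencies):
    """Return an order in which every package follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for package in install_order(DEPENDENCIES):
        print("install", package)
```

If two packages depended on each other, `static_order()` would raise a `CycleError`, which is the kind of conflict a package manager has to detect and report.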
With Conda, developers describe their tool and how it should be installed. Conda then creates a distributable form of the software that anyone can use, whether on Windows, MacOS, or Linux. Simply type conda install followed by the name of the desired package, and Conda does the rest. “A Conda package is kind of a minimal entity of a software,” Köster explains: “it just contains the software itself in a relocatable way, so that you can deploy it to any system.”
Conda allows users to create, maintain, and switch between multiple ‘environments’, each of which can contain different versions of the same tools. In one environment, for instance, researchers can use the newest version of a particular program; in another, they can run an older variant, perhaps to avoid a newly introduced idiosyncrasy. Users can even create workflows that require software tools that would otherwise conflict. And system administrators can create dedicated ‘software stacks’ that they can distribute to users with a single command.
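As a rough sketch of what this looks like in practice, the Python snippet below assembles the Conda commands for creating two environments that hold different versions of the same tool. It only prints the commands and does not run them. The environment names, the tool name `sometool`, and the version numbers are placeholders invented for illustration.

```python
import shlex


def create_env_command(env_name, package_specs):
    """Build a `conda create` command for a named environment."""
    # --yes skips the interactive confirmation prompt.
    return ["conda", "create", "--name", env_name, "--yes", *package_specs]


def activate_command(env_name):
    """Build the command a user would type to switch to an environment."""
    return ["conda", "activate", env_name]


if __name__ == "__main__":
    # Placeholder environments: same (hypothetical) tool, different versions.
    environments = {
        "analysis-new": ["sometool=2.0"],
        "analysis-old": ["sometool=1.0"],
    }
    for env_name, specs in environments.items():
        print(shlex.join(create_env_command(env_name, specs)))
        print(shlex.join(activate_command(env_name)))
```

Note that `conda activate` modifies the current shell session, so it is meant to be typed at the prompt rather than launched from a script.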
Conda is a general-purpose tool. To “unlock [its] benefits” for the life science community, Köster launched Bioconda as a bioinformatics-dedicated channel within the Conda ecosystem in 2015.
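For readers who want to see what an install from the channel might look like, here is a small Python sketch that builds the command line, using the SAMTOOLS example from above. It assumes the package is published under the name `samtools` and that naming the `bioconda` channel alone is enough; in practice, additional channels may need to be configured. The helper function is hypothetical and only prints the command.

```python
import shlex


def conda_install_command(package, channel="bioconda"):
    """Build the argument list for installing one package from a channel."""
    return ["conda", "install", "--channel", channel, package]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs
    # even on a machine where Conda is not installed.
    print(shlex.join(conda_install_command("samtools")))
```

To actually run the command, the same list could be passed to Python's `subprocess.run`, assuming Conda is installed and on the user's path.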
Bioconda packages are created and tested largely automatically. Developers submit recipes to the Bioconda page on GitHub, at which point they are tested for syntax errors, built online using continuous integration, folded into containers, and tested again to ensure no dependencies are missing. Following manual review by a human curator, the completed packages are uploaded to the Bioconda channel, making them available for installation. The containers in which the tools were tested — compatible with Docker, Singularity, and rkt — are uploaded to the Biocontainers repository, for those who want to inject an additional layer of reproducibility into their work.
Starting from 10 packages — the tools Köster needed for his workshop — Bioconda has grown to include some 3,000 packages that have been downloaded over 6 million times. For that, Köster credits an “awesome and dedicated community” of over 200 contributors. “Without their work, it would have been impossible to build up or maintain such a huge resource,” he says.
The bioRxiv preprint describing Bioconda was published on 21 October.
Jeffrey Perkel is Technology Editor, Nature
10 Nov 2017: This post has been updated to indicate that neither Bioconda nor Debian Med is itself a package manager; they are distributions of packages for the Conda and Debian package managers, respectively.