There’s a galaxy of tools in the Galaxy bioinformatics environment — 4,807 at last count. With them, researchers can do just about anything, computationally speaking. One thing they couldn’t do was work with their data programmatically. Now, thanks to a recent software update, that gap has been filled.
The update brings two programming environments, Jupyter and RStudio, into Galaxy. Jupyter, according to its website, “is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text” in “over 40” languages, including Python and R; RStudio is an “integrated development environment” specifically for R. Researchers can fire up these “Galaxy interactive environments” from within Galaxy itself, access and manipulate their datasets, and then push their modified files back into Galaxy for further processing.
The idea, says James Taylor, a computational biologist at Johns Hopkins University in Baltimore, Maryland, whose lab co-leads Galaxy development, was “to add a little more flexibility.”
When the Galaxy team first developed Galaxy, Taylor explains, few biologists had the computational know-how or comfort to actually perform bioinformatics analyses. That’s because bioinformatics is mostly done at the Linux command line, whose text-based interface and arcane syntax can be intimidating for the uninitiated. And it often involves programming, a skill that relatively few biologists possess.
Galaxy, which now has some 110,000 registered users, was designed to circumvent those issues. The system encapsulates bioinformatics tools inside a relatively friendly web-browser interface. Users interact with those tools via pointing-and-clicking in their browser window, building analytical workflows (or “pipelines”) graphically. As they work, Galaxy records everything they do, thus facilitating experimental documentation, reproducibility, and collaboration.
But, says Taylor, because each experiment is unique, researchers often need to write custom code to polish off their analyses, and Galaxy offered no way to handle that eventuality. Users would have to do what they could in the Galaxy environment, then save their data to disk to finish their work at the command line — a process that is both cumbersome and difficult to document and repeat.
Now, researchers can perform those “ad hoc” analyses without ever leaving Galaxy. And, because their work, like all Galaxy steps, is stored in the system’s history, they can easily repeat, share, and modify those analyses as necessary.
The interactive environments are implemented as Docker containers. As Andrew Silver laid out in a Nature Toolbox article in May, “Containers are essentially lightweight, configurable virtual machines — simulated versions of an operating system and its hardware, which allow software developers to share their computational environments. Researchers use them to distribute complicated scientific software systems, thereby allowing others to execute the software under the same conditions that its original developers used.” When a user launches an interactive environment from within Galaxy, they are assigned a unique container, meaning they are free to customize it as necessary without impacting other users. There are even package managers to simplify the process.
Led by Björn Grüning in Freiburg, Germany, and Eric Rasche at Texas A&M University in College Station, Texas, the Galaxy team document three example scenarios in their published report. In one, Jupyter is used to assess the coverage of viral sequence reads in a series of human samples that had previously been mapped to a reference genome in Galaxy. In another, RNA-seq datasets are processed within Galaxy to create a list of expressed genes and their abundances, then transferred to Jupyter for data normalization and other manipulations. The final example uses Galaxy and Jupyter to address the effect of maternal age on mitochondrial DNA variants, a study that involved 312 datasets representing 78 human subjects and 118 GB of sequence data.
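To give a flavor of the kind of “simple programming task” the second scenario involves, here is a minimal Python sketch of normalizing a gene abundance table. It is not the authors’ code: the report’s exact normalization method is not described here, so the sketch assumes a common one, counts per million, and the gene names, counts, and function names are all hypothetical.

```python
import csv
import io


def read_counts(handle):
    """Parse a two-column, tab-separated table of gene name and read count."""
    return {row[0]: int(row[1]) for row in csv.reader(handle, delimiter="\t") if row}


def counts_per_million(counts):
    """Rescale raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted; cannot normalize")
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical data standing in for a gene abundance table produced in Galaxy.
example_table = "geneA\t150\ngeneB\t50\ngeneC\t800\n"

counts = read_counts(io.StringIO(example_table))
cpm = counts_per_million(counts)

for gene, value in sorted(cpm.items()):
    print(f"{gene}\t{value:.1f}")
```

In an interactive environment, the in-memory example table would be replaced by a dataset pulled from the user’s Galaxy history, and the normalized table would be written to a file and pushed back into Galaxy for further processing.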
According to the authors, the marriage of Jupyter and Galaxy offers several potential benefits, including lowering the entry barrier to data exploration, facilitating data reuse and experimentation, and fostering collaboration between biologists and bioinformaticians.
By “allowing users to alternate between the comfort of Galaxy’s interface and the versatility of Jupyter,” they conclude, users get “the opportunity to experiment with simple programming tasks to gain skills and confidence to explore further. As such, our system (and its subsequent evolution) is the first step in making more and more researchers within the life sciences familiar with scientific computing principles.”
Jupyter is currently live on the main public Galaxy server, usegalaxy.org; RStudio integration is “coming soon,” Taylor says, though it’s already an option for custom Galaxy implementations.
Jeffrey Perkel is Technology Editor, Nature.