Science meets Netflix with data streaming

[Image: XCMS Online logo. Credit: Gary Siuzdak]

In today’s web-connected world, we’ve come to expect instant gratification. When you select a video on Netflix, you don’t wait for the file to finish downloading. Thanks to ever-increasing bandwidth, video can stream to your computer, playing as it arrives. Thus was the concept of binge-watching born, and many a fan of “Stranger Things” went to bed exhausted, but mostly satisfied (#justiceforbarb). As it turns out, data streaming is being used in the life sciences, too. Nature technology editor Jeffrey Perkel finds out more.

Gary Siuzdak, Senior Director for the Scripps Center for Metabolomics at The Scripps Research Institute in La Jolla, California, has been developing tools to help the metabolomics community analyze its mass spectrometry data for at least a decade. In 2012, his lab released a free, cloud-based system called XCMS Online. Users simply upload their raw spectra, and the software will find peaks, identify differences between samples, and (using the lab’s freely available METLIN database) work out which metabolites those spectral features correspond to. According to Siuzdak, the system has nearly 14,000 users. And until recently, they had to finish collecting their data before they could upload it to the cloud. Now, they can stream it, Netflix-style.

In a paper published in mid-December 2016 in Analytical Chemistry, Siuzdak and his team describe XCMS Stream, software that can be installed on a user’s mass spectrometer, which allows the system to compress and upload files as they’re generated.
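To make the compress-and-upload-as-you-go idea concrete, here is a minimal sketch in Python. It is not XCMS Stream's actual code: the function name, the folder layout and the `upload` callable are all hypothetical, and gzip stands in for whatever compression the real software uses. It also assumes a data file only appears in the watched folder once its run has finished.

```python
import gzip
import shutil
from pathlib import Path


def compress_and_upload_new_files(watch_dir, outbox_dir, already_sent, upload):
    """Compress and hand off every file in watch_dir that has not been sent yet.

    `upload` is a hypothetical callable that receives the path of a compressed
    file; a real client would transfer it to a server at that point.
    Returns the names of the files handled in this pass.
    """
    watch_dir, outbox_dir = Path(watch_dir), Path(outbox_dir)
    outbox_dir.mkdir(parents=True, exist_ok=True)
    sent_now = []
    for raw_file in sorted(watch_dir.iterdir()):
        # Skip sub-folders and anything handled in an earlier pass.
        if not raw_file.is_file() or raw_file.name in already_sent:
            continue
        compressed = outbox_dir / (raw_file.name + ".gz")
        # Copy the raw bytes through gzip without loading the whole file.
        with open(raw_file, "rb") as src, gzip.open(compressed, "wb") as dst:
            shutil.copyfileobj(src, dst)
        upload(compressed)
        already_sent.add(raw_file.name)
        sent_now.append(raw_file.name)
    return sent_now
```

Called in a loop while the instrument is running, a function like this would pick up each finished file while the next sample is still being measured, which is where the time saving comes from.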

That multitasking can save considerable time in studies involving large numbers of samples, Siuzdak says. In the published report, he and his team ran 10 samples through the mass spec, each of which produced a 3-GB data file. They then measured how long it took to get those data into XCMS Online. “The real-time data streaming during acquisition allowed for quick turnaround of results only 4 h after completion of data acquisition,” they wrote. “In contrast, 18 hours were needed with manual uploading to obtain results (4.5 times longer).” To be clear, that latter period includes a 10-hour window during which the data just sit overnight, but even so, hours of otherwise wasted time are saved.

“We’ve calculated, depending on the number of runs, it can require hours if not days to actually upload the data into our system, depending on internet speed,” Siuzdak told Nature. But using real-time data streaming, “typically what you’re ultimately waiting for is just the very last analysis that you performed.” (That’s assuming the data upload time is less than the time required to run the sample, of course.)
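Siuzdak's point, and the caveat in parentheses, can be checked with a toy model. The sketch below is only an illustration: the function name and the numbers in the usage note are made up, every sample is assumed to take the same time to run and to upload, and idle periods such as an overnight gap are ignored. None of it comes from the paper.

```python
def wait_after_acquisition(n_samples, run_hours, upload_hours, streaming):
    """Hours between the end of the last run and the end of the last upload."""
    acquisition_end = n_samples * run_hours
    if not streaming:
        # Manual route: nothing is uploaded until every run has finished.
        return n_samples * upload_hours
    upload_done = 0.0
    for i in range(1, n_samples + 1):
        run_done = i * run_hours
        # An upload starts once its run is done and the previous upload is finished.
        upload_done = max(upload_done, run_done) + upload_hours
    return upload_done - acquisition_end
```

With ten one-hour runs and half-hour uploads, `wait_after_acquisition(10, 1.0, 0.5, streaming=False)` gives 5.0 hours, while the streaming case gives 0.5 hours: just the last file. If uploads are slower than runs, a backlog builds up and the streaming advantage shrinks, which is the assumption the parenthetical flags.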

For a project comprising 10,000 samples, the authors estimate it could take 73 days following data acquisition to upload the data the old-fashioned way, compared to 2.5 minutes via online streaming. With batch streaming, an intermediate approach in which data are compressed and uploaded all together after collection, upload would take 16 days.

Curious how unique XCMS Online was in embracing streaming, I reached out to commercial firms that also offer cloud-based informatics. Both Illumina and Thermo Fisher Scientific support streaming, company representatives told me. According to Ilya Chorny, Associate Director of Product Marketing, Illumina sequencers push their data to the company’s BaseSpace Sequence Hub following every sequencing cycle. As a result, he says, data are available for analysis shortly after a sequencing run completes.

Thermo Fisher Scientific has begun offering streaming as well with its Thermo Fisher Connect platform, at least for its newest real-time PCR thermocyclers, the QuantStudio 3 and QuantStudio 5. As reported in my recent Nature Toolbox article on the Internet of Things, another dozen Thermo instruments are to be hooked up and streaming data by the end of 2017, the company reports, including ultra-low-temperature freezers and smart electronic pipettes.

Data upload time has been one of the ongoing limitations of cloud-based computing, Siuzdak says, slowing the pace of research. “I believe we’ve addressed that pretty effectively finally.”

Whether that leads to a sharp uptick in binge-bioinformatics remains to be seen.

Jeffrey Perkel is Technology Editor, Nature