Soapbox Science

Scientific publishing 2.0: moving the compute to the data rather than moving the data to the computers

Adrian Giordani has a Masters in Science Communication from Imperial College London, where he was also the Editor-in-Chief of I, Science magazine. He was a science journalist and Interim Editor-in-Chief at CERN, Geneva, Switzerland. The publication he worked for, International Science Grid This Week, covers news about science and computing in Europe, the US and Asia Pacific regions. Adrian writes about technology such as supercomputing, grid computing, cloud computing, volunteer computing, networks, big data, software and the science it enables. You can follow him on Twitter.

Today, data-intensive science turns raw data into information and then knowledge. This represents the vision of the late and influential computer scientist, Jim Gray, who divided the evolution of science into four paradigms. One thousand years ago, science was experimental in nature; a few hundred years ago it became theoretical; a few decades ago it became a computational discipline; and today it’s data driven. Researchers are reliant on e-science tools to enable collaboration, federation, analysis, and exploration to address the data deluge, currently equal to about 1.2 zettabytes each year. If 11 ounces of coffee equalled one gigabyte, a zettabyte would be the same volume as the Great Wall of China.
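To give a feel for the scale behind the coffee analogy, here is a rough back-of-the-envelope sketch. It assumes decimal SI prefixes (a gigabyte is 10^9 bytes, a zettabyte 10^21 bytes) and US fluid ounces; neither assumption is stated in the analogy itself, and the function name is purely illustrative. It makes no claim about the volume of the Great Wall.

```python
# Back-of-the-envelope check of the coffee analogy.
# Assumptions (not from the article): decimal SI prefixes and US fluid ounces.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21
LITRES_PER_US_FLUID_OUNCE = 0.0295735
OUNCES_PER_GIGABYTE = 11  # the article's analogy: 11 ounces of coffee per gigabyte


def coffee_volume_m3(zettabytes: float) -> float:
    """Return the volume of coffee, in cubic metres, standing in for the given data size."""
    gigabytes = zettabytes * BYTES_PER_ZETTABYTE / BYTES_PER_GIGABYTE
    litres = gigabytes * OUNCES_PER_GIGABYTE * LITRES_PER_US_FLUID_OUNCE
    return litres / 1000.0  # 1 cubic metre = 1000 litres


print(f"One zettabyte: about {coffee_volume_m3(1.0):.2e} cubic metres of coffee")
print(f"1.2 zettabytes a year: about {coffee_volume_m3(1.2):.2e} cubic metres")
```

Under these assumptions a single zettabyte corresponds to a few hundred million cubic metres of coffee.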

So much data is produced that The Journal of Neuroscience stopped accepting supplementary files along with research manuscripts so that it could better handle the peer review process. In an attempt to address the challenges presented by so much data, some are combining software, databases and infrastructures to transform the way scientific publishing is done, which has changed little for centuries.

Bring on research publishing 2.0

The life sciences are looking at data analysis and publishing approaches that move the compute to the data rather than moving the data to the computers. The GigaScience journal is creating an open-access data platform which combines software operations, databases and cloud computing into an all-in-one package to make all stages of scientific research computable, enabling peer review not just for results, but methods and data too – a first for genomics and biomedical data.

The journal, a collaboration between BGI, in Shenzhen, China, and the publisher BioMed Central, aims to turn research papers into executable data objects, putting everything that goes into making a research paper, such as data and analysis, into a reproducible and citable format. With this, for example, a researcher could, in theory, create a separate methods publication from a research paper, with its own Digital Object Identifier (DOI) that can be indexed and cited. This is important because other researchers can independently validate or disprove the scientific methods underpinning published scientific results and perhaps create new hypotheses and methods.

A diagram showing how the new GigaScience data platform will transform digital research papers into executable data objects for peer-review analysis. This image is taken from a talk by Scott Edmunds called 'Data publication in the data deluge' at COASP 2012 in Budapest on September 20.

Image courtesy Scott Edmunds.

Scott Edmunds, an editor of the GigaScience journal, says,

“You don’t need methods sections in a paper any more as methods can be computed, making it easier for reviewers to check data.”

To enable this new journal platform, GigaScience is using an open-source workflow system called Galaxy, which is run by a team at Penn State University, Pennsylvania, US. This was the software foundation used by the ENCODE project, which aims to describe how the human genome actually works and what each part does. In September 2012, 30 papers were published simultaneously in journals including Nature, with the raw data made available to anyone so they could follow the same analysis steps of the original research.

If we build it, will they come?

Last year, the GigaScience journal launched its first citable DOI dataset, on the E. coli genome. The information was released on Twitter with a Creative Commons license and in a pre-publication citable format. Upon release, researchers around the world started producing their own assemblies and annotations, sharing data on Twitter, with some dubbing it the first ‘Tweenome’.

Currently, the journal stores about 20 terabytes of public data on its servers and another 5-10 terabytes are being prepared for release. It has already released a number of peer-reviewed papers and is hard at work finalizing its data platform so that, within the next few months, researchers will be able to analyse and recreate published experiments on the Galaxy platform. Edmunds adds,

“People have been used to publishing research papers in a certain way for three centuries. Some people will get our platform and use it immediately. But, I think it will take a while for the rest of the community to get used to it.”

Flipping publishing on its head

Better knowledge discovery: In the above model, the Phortos Group presents how nanopublications can help generate and verify new scientific knowledge more efficiently. The key is as follows: 1. ‘In cerebro’ human reasoning drives specific hypotheses to be researched. 2. ‘In silico’ knowledge discovery uses applications to help find associations too large for any human brain to comprehend alone. 3. Associations and hypotheses are validated by accessing ‘in origine’ source data.

Image courtesy Barend Mons.

Data can be represented as a nanopublication, the smallest unit of publication, discussed in a 2011 Nature article. Researchers have decided to invert the publishing model by making all research data computer-readable, which would make it easier to mine facts and link information for knowledge discovery. Amongst the dozens of global projects converting major datasets is that of nanopub.org, which helps researchers create nanopublications of their data. The Phortos Group, a collaboration of companies and researchers providing services to big data owners, aims to develop applications to find associations too difficult for any one human brain to analyze alone.

Barend Mons, scientific director of the Netherlands Bioinformatics Centre, says his group’s research has already inferred new discoveries from what they refer to as the ‘explicitome’, or all explicit assertions they can find (an assertion is anything that can be uniquely identified and attributed to an author). An explicit assertion is, for example, that the gene “DMD dystrophin [Homo sapiens]” is listed in the Entrez Gene Database with the identification number 1756 and can be linked to this URL: http://www.ncbi.nlm.nih.gov/gene/1756.
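As a loose illustration of what "uniquely identified and attributed" could look like in code, the sketch below models the DMD example as a small record. The class names, field names and the `hasEntrezGeneId` predicate are hypothetical, chosen only for this example; real nanopublications are typically expressed in RDF, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A single subject-predicate-object statement (field names are hypothetical)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Nanopublication:
    """An assertion plus who it is attributed to and where it can be resolved."""
    assertion: Assertion
    attributed_to: str
    source_url: str


def entrez_gene_url(gene_id: int) -> str:
    """Build the gene URL in the same form as the one quoted in the article."""
    return f"http://www.ncbi.nlm.nih.gov/gene/{gene_id}"


# The article's example: DMD is listed in the Entrez Gene Database under ID 1756.
dmd = Nanopublication(
    assertion=Assertion(
        subject="DMD dystrophin [Homo sapiens]",
        predicate="hasEntrezGeneId",  # hypothetical predicate name
        obj="1756",
    ),
    attributed_to="Entrez Gene Database",
    source_url=entrez_gene_url(1756),
)

print(dmd.assertion.subject, "->", dmd.source_url)
```

Making the records immutable (`frozen=True`) reflects the idea that a published assertion is a fixed, citable unit; it also makes them hashable, so identical assertions can be compared or counted.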

They estimate that the explicitome of the biomedical life sciences currently consists of 100 trillion nanopublications, which, by grouping similar assertions under a summarized or cardinal assertion, can be reduced to fewer than two million concepts – a manageable amount. “We can infer novel protein-protein interactions for example. Our work may go beyond just dealing with big data… We’re moving into studies on the very nature of human knowledge discovery and this will give some very exciting insights,” says Mons.
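The grouping step can be pictured, in a much simplified form, as counting how often the same statement recurs. The sketch below is not the group's actual method: it treats "similar" as "exactly identical", and the triples and names are made up for illustration. Real grouping would presumably need to normalise synonyms and identifiers first.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str


def cardinal_assertions(stated: Iterable[Triple]) -> Counter:
    """Collapse repeated statements of the same triple into one entry with a support count."""
    return Counter(stated)


# Made-up example: the same interaction is asserted twice by different sources.
stated = [
    Triple("protein A", "interacts_with", "protein B"),
    Triple("protein A", "interacts_with", "protein B"),
    Triple("protein C", "interacts_with", "protein D"),
]

summary = cardinal_assertions(stated)
for triple, support in summary.items():
    print(f"{triple.subject} {triple.predicate} {triple.obj}: stated {support} time(s)")
```

Here three stated assertions collapse to two distinct ones, each carrying a count of how many times it was made.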

A distributed community of communities

Aha! Will the combination of GigaScience, nanopublications and ScienceSoft enable more eureka effects such as the mythical moment experienced by Greek polymath Archimedes in his bath over 2,200 years ago?

Image courtesy Wikimedia Commons.

A bigger question is: Could a software approach that serves one research community be transferable to another community?

At the 8th IEEE International Conference on eScience 2012, Mike Conlon, principal investigator of the VIVO project, gave a keynote speech about new scientific results that were made possible by linking two completely different research areas.

In a similar vein, Alberto Di Meglio, project director of the European Middleware Initiative, says,

“Today, the way data sets and software are referred to in research papers do(es) not yet allow other people to reproduce the results in a consistent and easily accessible way. Creating links across knowledge is essential. Science today is distributed and the advantages of cross-disciplinary research are becoming apparent.”

The SciencePAD (formerly known as ScienceSoft) project is trying to address this question of connecting communities as it builds a catalogue of software products for all research communities that use scientific software. The SciencePAD project is looking to include various science disciplines in this discussion and organised a workshop on Wednesday 30th January 2013 at CERN on the topic of unique identifiers for software digital objects, such as people, software, data and publications.

While projects at the bleeding edge of scientific publishing continue to progress, they should find ways, perhaps through more workshops, conferences or social media, to include other research communities outside of genomics and biomedicine so that as many as possible can benefit from the fruits of their labour. After all, a problem shared is a problem halved.
