As data continues to be produced at staggering rates, scientists need to become more aware of the benefits of data sharing, says Eleni Liapi.
Guest contributor Eleni Liapi
The scientific community is currently experiencing an explosion in data generation. At CERN (the European Council for Nuclear Research), the Large Hadron Collider (LHC) produces data at a rate of 1 petabyte (= 10¹⁵ bytes) per day, which is comparable to 210,000 DVDs. At the European Bioinformatics Institute, 20 petabytes of biological data were stored between 2004 and 2012. In the US alone, the volume of data produced by the healthcare industry in 2011 was estimated at 150 exabytes (1 exabyte = 10¹⁸ bytes). Undoubtedly, this volume of information brings with it several problems, including data storage and sharing.
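The orders of magnitude quoted here can be sanity-checked with a few lines of arithmetic. The sketch below assumes a single-layer DVD capacity of 4.7 GB, a figure that is not given in the article, and all variable names are illustrative only.

```python
# Decimal (SI) byte units used in the text.
PETABYTE = 10**15
EXABYTE = 10**18

# Assumption (not from the article): one single-layer DVD holds 4.7 GB.
DVD_BYTES = 4.7 * 10**9

# How many such DVDs would one petabyte fill?
dvds_per_petabyte = PETABYTE / DVD_BYTES

# The 150 exabytes quoted for US healthcare in 2011, expressed in petabytes.
healthcare_petabytes = (150 * EXABYTE) // PETABYTE

print(f"DVDs per petabyte: about {dvds_per_petabyte:,.0f}")
print(f"150 exabytes = {healthcare_petabytes:,} petabytes")
```

Under that assumption, one petabyte fills a little over 210,000 DVDs, which agrees with the comparison in the text.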
Access to data prompts much discussion among scientists and other communities, for many reasons. These include concerns about inappropriate use and restrictive institutional or industrial policies, for example where gigabytes of genomic data are to be used for pharmaceutical research. There have already been attempts to estimate the extent of the problem. In one survey, 67% of the participants expressed the view that inaccessible data hinder scientific progress.
Given the impact that big data has on science, I wanted to summarise a few rationales for data sharing, based on situations that arise frequently in the scientific community.
Turning negative to positive
Across all scientific disciplines, a hypothesis is proposed, tested, and then reported on. Usually only the so-called positive results, those supporting the hypothesis under test, are published. The rest of the data, when not serving as negative controls, are saved in computer files and forgotten about (or intentionally ignored). As a result, groups working on similar problems tend to test the same hypothesis repeatedly. This can lead to contrasting results, further confusing what might actually be happening. Granting access to negative results could be a positive solution, as it helps to redirect scientific questions or examine alternative hypotheses, which might open the way for new discoveries.
Like negative results, the data from preliminary experiments (which evaluate parameters of a method under development), although insufficient to (dis)prove a hypothesis, are also omitted from publications. However, they could provide the basis for other groups to continue their research on a certain topic. By using shared data sets from these experiments, a research group can adjust its experimental set-ups or sample sets, and thus maximise the utility of previously obtained data in newer projects on the same scientific question. In this way, not only do the “minor” data from smaller-scale experiments become useful, but scientists can also look back at the overall approaches and materials used. This can create a timeline of findings per scientific topic, which can also serve as a record of scientific progress for future generations to learn from.
Experimental reproducibility, a crucial aspect of science, would benefit from data sharing. Sharing outcomes from similar experiments would facilitate a more accurate interpretation of findings, not only for the research groups involved but also for the community at large, since more data can reduce false positive and false negative errors. Comparisons made on this basis can further benefit the scientific community by uniting scientists in their efforts to answer important unresolved questions in a more objective way. Moreover, data sharing followed by citation improves the attribution of credit to the researchers responsible and the discoverability of data sets in their publications.
Organised and stored data
In the context of reproducibility, the storage, organisation and searchability of scientific data are critical. For this reason, the construction of metadata (structured, descriptive and explanatory information about data) from scientific results has been emphasised, as metadata allow researchers to use data independently and to repeat analyses. Metadata also facilitate interdisciplinary research that relies on shared data. The efficiency of metadata sharing lies in their availability, readability and compatibility across information systems, which can improve, for example, the combination of biological and ecological data in large-scale studies. Considering their usefulness, it is no wonder that metadata have been reported as the key to enhancing data sharing in terms of “interoperability” (the ability of different information systems to exchange and share data) and the archiving of digitised scientific information for the future.
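To make the idea of metadata more concrete, here is a minimal sketch of what a structured, machine-readable description of a data set might look like. The field names, the example values and the choice of JSON as an exchange format are all assumptions made for illustration; they do not follow any particular metadata standard or repository.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical descriptive record that travels alongside a data set."""
    title: str
    creator: str
    description: str
    keywords: List[str] = field(default_factory=list)
    file_format: str = "unspecified"
    licence: str = "unspecified"


def to_json(record: DatasetMetadata) -> str:
    """Serialise the record to JSON, a format most systems can read."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text: str) -> DatasetMetadata:
    """Rebuild the record from JSON, as a receiving system might."""
    return DatasetMetadata(**json.loads(text))


# Illustrative values only; they do not describe a real data set.
record = DatasetMetadata(
    title="Example gene expression measurements",
    creator="Hypothetical research group",
    description="Preliminary results from a method under development.",
    keywords=["gene expression", "negative results"],
    file_format="CSV",
    licence="CC BY 4.0",
)

serialised = to_json(record)
restored = from_json(serialised)
print(serialised)
```

Because the description is stored in a plain, widely supported format, a second system can read it back without loss, which is the essence of the interoperability described above.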
Data sharing should be considered an indispensable part of scientific practice, in order to advance science and to build and strengthen trust between the general public and the scientific community, as progress in healthcare, daily transport, technology and space exploration affects humankind in the long term.
Eleni Liapi is a runner-up in the 2015 Scientific Data writing competition. She also works in the field of forensic molecular pathology in Cologne (Germany), focusing on gene expression in human post-mortem injured skin. As a past Erasmus intern, she strongly supports international student mobility. In her free time, she enjoys volunteering at music festivals, nature hikes, reading and writing stories.