Data sharing will reduce the number of experiments needed in the lab and increase the speed of knowledge generation by decreasing the time spent producing equivalent datasets.
Guest contributor Ana Sofia Figueiredo
I’m a postdoctoral scientist in systems biology at the University of Magdeburg, Germany. There, I build mathematical models to understand the mechanisms behind certain biological processes, such as energy production by cells under extreme conditions. These mathematical models are simplified representations of reality: all of them are wrong, but some of them can be useful. When well parameterized with data, these models give a quantitative representation and a better understanding of such biological processes. Using a systems biology approach, I can do experiments in silico that are very difficult or technically impossible to do in vitro or in vivo. However, a model is only as good as the data it incorporates.
When I have access to publicly available experimental datasets, I can plug the data into my models and, from the synergy of combining mathematical models with experimental data, learn more about the biological system at hand.
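To give a flavour of what this looks like in practice, here is a minimal sketch in Python of fitting a model parameter to published measurements. The decay model and all of the numbers are invented for illustration; they do not come from any real dataset.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Hypothetical published measurements: time (hours) vs. metabolite level (a.u.)
t_obs = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y_obs = np.array([10.0, 7.4, 5.5, 3.0, 0.9])

def decay(t, y, k):
    # Toy model: first-order decay, dy/dt = -k * y
    return -k * y

def loss(params):
    # Sum of squared differences between simulation and data
    sol = solve_ivp(decay, (t_obs[0], t_obs[-1]), [y_obs[0]],
                    t_eval=t_obs, args=(params[0],))
    return np.sum((sol.y[0] - y_obs) ** 2)

# Estimate the decay rate that best explains the (invented) data
fit = minimize(loss, x0=[0.1], bounds=[(1e-6, 10.0)])
print(f"Estimated decay rate k = {fit.x[0]:.3f} per hour")
```

Once a parameter has been estimated from shared data, the same model can be run under conditions that were never measured, which is exactly what makes in silico experiments attractive.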
Sharing data, models and experimental protocols can push forward the generation of knowledge in science.
Data – information – knowledge
Often, the generation of data is hypothesis-driven, but the reverse is also possible. The cycle of hypothesis testing and data generation relies upon the triad data-information-knowledge. If data are a collection of items or facts, then information is the structured description of those data, and knowledge is what I learn about the biological system I study by observing those datasets and asking questions of them.
Today, many scientific fields depend on low-cost, high-throughput data generation. In the life sciences, for example, Next Generation Sequencing (NGS) provides raw data on genome sequencing and resequencing, as well as transcriptome, interactome and epigenome characterization. A single one of these machines can produce several terabytes of raw data in one week. Bioinformatics and systems biology provide methods and tools to automate the analysis of these data. The results allow for the construction of interaction maps between the constitutive elements of a network, which can be translated into mathematical models able to predict biological phenomena. Other examples where large datasets are collected to predict events are weather forecasting and orbit simulation in astronomy, both of which combine mathematical modelling with “big data” coming from diverse satellites in space.
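To make the notion of an interaction map concrete, here is a minimal sketch using the networkx library. The gene names and regulatory effects are hypothetical, invented purely for illustration.

```python
import networkx as nx

# Hypothetical regulatory interactions, e.g. inferred from expression data
interactions = [
    ("geneA", "geneB", "activates"),
    ("geneB", "geneC", "represses"),
    ("geneC", "geneA", "activates"),  # closes a feedback loop
]

# An interaction map is naturally a directed graph
network = nx.DiGraph()
for source, target, effect in interactions:
    network.add_edge(source, target, effect=effect)

# Translating the map into a model: each node becomes a variable,
# each edge a regulatory term in an equation.
for source, target, data in network.edges(data=True):
    print(f"{source} {data['effect']} {target}")
```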
Such methods and tools, together with the available computing power, can distil information from raw data and provide us with knowledge about specific scientific questions such as “Which molecular mechanisms does the immune system use to fight disease?”, broader questions such as “What will the weather be like at the weekend?” or, even, “When will the next meteor hit Earth?”.
Standardize – integrate – visualize
Data generation requires experimental planning, and this needs to include a way to store and access data. Having standardized data allows for data exchange between scientists and platforms. In 2003, Hucka and colleagues presented the Systems Biology Markup Language (SBML), a free XML-based format to represent biochemical reactions. SBML addresses the need for a standardized way of exchanging data and information between scientists and software tools.
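To show what that standard looks like, here is a toy SBML model of a single reaction, inspected with Python’s standard library. The model is invented for illustration; real SBML files carry much more metadata and are usually handled with dedicated libraries such as libSBML.

```python
import xml.etree.ElementTree as ET

# A toy SBML document: one reaction converting species S1 into S2.
# (Invented for illustration; not taken from any published model.)
SBML = """<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core"
      level="3" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="S1" compartment="cell"/>
      <species id="S2" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="conversion" reversible="false">
        <listOfReactants>
          <speciesReference species="S1" stoichiometry="1"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="S2" stoichiometry="1"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

# Because the format is standardized XML, any tool can read it
ns = {"s": "http://www.sbml.org/sbml/level3/version1/core"}
root = ET.fromstring(SBML.encode("utf-8"))
for species in root.findall(".//s:species", ns):
    print("species:", species.get("id"))
for reaction in root.findall(".//s:reaction", ns):
    print("reaction:", reaction.get("id"))
```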
Once data have been standardized, there needs to be a way to integrate them into large databases, most of which are open and freely available. If data are open and freely available, then more scientists can have access to them. Model management platforms such as SEEK enable the “management, sharing and exploration of data and models in systems biology”.
Finally, the Systems Biology Graphical Notation (SBGN) is a standard that enables the storage, visualization and communication of biological data and information about specific biological processes.
Reduce – Reuse – Recycle
In the last paragraphs, I pointed out some examples of how data can generate knowledge, and some methods and tools to efficiently share data amongst scientists. However, many researchers are reluctant to share their results because of the risk of losing control of their data.
But the 3R principle (Reduce-Reuse-Recycle) used in waste management can be applied to data sharing, and it demonstrates its importance: data sharing allows the reduction of data production by reusing and recycling data that are already available.
This further relates to another 3R principle, Replacement-Reduction-Refinement. Many data-generating experiments in the life sciences involve animal testing. But mathematical models with predictive power allow for the partial replacement of animal testing with computer simulations, and sharing data reduces the number of animals used in tests by reducing the number of experiments needed.
Sharing data will not only save money and, most importantly, the environment by reducing the number of experiments needed, but it will also increase the speed of knowledge generation by decreasing the time spent on generating equivalent datasets.
Having well-planned experiments that are freely available in standard notations will foster the development of science by delivering results more efficiently; enabling the testing of different hypotheses by different research groups using the same dataset; and reducing fraud and improving the integrity of published work.
Ana Sofia Figueiredo is a winner of the 2015 Scientific Data writing competition and a postdoc in systems biology in Magdeburg, Germany. At work, she uses mathematics to answer questions of biology. At home, she is a compulsive writer and an impulsive chef, who occasionally writes short stories on her blog entangling food, family and lots of fantasy. But what she really, really enjoys is playing around and doing nonsense with her three children.