SciData writing competition winner Sarah Lemprière explains how making the world’s deluge of data open will help science
As a global population we are generating more data than ever before. The International Data Corporation (IDC) estimates that by 2020 over 80 million gigabytes of data will be produced every minute. Each second, the world will generate enough data for a 50-year-long Netflix binge. Scientific investigation is a big part of that: every day huge amounts of data are generated on everything from the behaviour of supernovae to the 3D structure of proteins in the brain. When the world’s largest radio telescope comes online in 2020, it alone will produce 180,000 gigabytes of data a minute.
Historically, most of this scientific data was never made public: the need to produce a compelling story for a journal article means that many datasets showing ‘negative’ results are never published.
The data that are published will likely be presented in a summarised form and may be behind a journal paywall. In 2012 the IDC estimated that just 0.5% of existing data is used for analysis. There are multiple reasons for this: a large proportion is not currently organised in a way that enables it to be analysed, or it may not have been identified as having research or business value. Much potentially valuable data is simply inaccessible to those with the skills, or the desire, to analyse it in depth.
However, the last few years have seen an increase in researchers making their data available online, to the general public, in a relatively unprocessed form. CERN began releasing data from the Large Hadron Collider in 2014, and there are now over 300 terabytes available for analysis on their open data portal. The UK government published an open data strategy in 2012 and continues to release data on a diverse range of topics for analysis by anyone with an internet connection.
If this kind of open data sharing becomes common practice, the potential advantages to scientific discovery are tremendous. By making data public you immediately engage a whole new, diverse workforce. You can make use of the knowledge and skills of researchers who don’t have the ability to pay journal subscription fees, and of interested members of the public, many of whom will approach that dataset from a different perspective to those who collected it. There is a clear appetite for this within the scientific community: in a report published by Elsevier, 73% of academics surveyed said that having access to published datasets would benefit their research. Such access informs experimental design, boosts the value of newly acquired data, and in some cases enables researchers to gain valuable insights without collecting any data themselves.
The challenge will be to encourage the general public to rise to the same level of interest as scientists. In 2012 Research Councils UK identified a lack of public understanding of open data and how to engage with it. To me this suggests that dumping data into an online portal will not be enough to encourage public participation, and that analysis tools and explainers tailored to a lay audience will need to be provided and publicised. CERN are already doing this well, by providing two open data portals, one for the public and one for researchers, alongside links to learning resources.
Once made public, data can then be continually mined for further insights. For example, the UK National Health Service (NHS) collects data on all its patients for use by doctors. Some of this data has been anonymised and published, providing a hugely valuable dataset for clinical researchers. Already this NHS data has provided insights into the impact of specific medical interventions on long-term patient outcomes, without the need to carry out new studies.
The spread of this kind of open data policy to other fields would be the scientific equivalent of using every part of the buffalo, enabling us to begin to use existing information to its full potential.
If standardised methods of acquiring, depositing and curating datasets could be developed, then scientific data would become a potent weapon in scientific discovery. With standardised datasets, live summaries could be produced to provide an up-to-date view of an entire field or question.
For example, if researchers wanted to examine the efficacy of a particular drug, a standardised and curated database could provide a graphical overview of all the studies that used that drug and the observed effects on patient outcomes. Ideally this summary could link to raw datasets to allow in-depth analysis.
This would be a dramatic improvement on the current method, which involves trawling through search results to find all the relevant journal articles and extracting information from text or figures. The transparency that results from releasing raw datasets also allows conclusions published in the literature to be checked and debated by multiple independent researchers, ensuring that findings are robust. Scientific discovery is always about building upon current knowledge. Isaac Newton’s famous quote still applies: “If I have seen further, it is by standing on the shoulders of giants”. The more high-quality data that is freely available, the further we will be able to see.
Sarah Lemprière is a neuroscience PhD student at the University of Edinburgh, where she’s investigating the role of synaptic proteins in the action of ketamine. She loves to write and is interested in how scientists share their findings with each other, and with the public.