Data-intensive research requires a new breed of scientist: interdisciplinary analysts who enjoy swimming in data, says Atma Ivancevic.
There has always been an emphasis on the generation of novel data in science. Being a scientist involves progressing from observation to hypothesis to experiment to output. In the past, a combination of scarce data and low-throughput machinery for generating more led to limited experimental outcomes.
As a result of the rise of computational power, scientists are facing a paradigm shift from data production to data interpretation. Advances in technology have revolutionised the field of science. Now, our experiments are high-output, data-intensive projects that come with their own problems. We find ourselves with the ability to capture and store vast amounts of data – and not enough people to meaningfully interpret it.
The problem is that many scientists are not expected to have the skills needed for data-intensive research.
Consider the following scenario: a biologist in the early 2000s wants to explore the evolution of a particular gene in animals. Using a template DNA sequence, she runs lab experiments to extract the gene from each species and align the sequences. Species differences are assessed, a conclusion is drawn and the results are published, with appropriate analysis on the potential function of the gene and how it has evolved over time. The requirement for publication here is the production of data.
A few years go by, and suddenly the cost of sequencing drops dramatically. Companies and consortia are publishing genomes at such an alarming rate that the number of species with publicly available genomes is in the thousands. The biologist wants to see how her previous hypotheses hold up against a larger set of species. Where to start? She will need high-performance computing machines to store and process the genomes. Manual inspection of the data is too laborious: she needs an automated workflow. She could hire a programmer, but they might lack the background knowledge to find biologically significant phenomena.
Even publishing is a challenge – printing the raw results would make Nature far thicker than anyone would want to read. The biologist finds herself at a loss because she cannot perform the computation needed for large datasets, and does not know how to convert the data into a publishable format.
What went wrong? The original approach was reproducible and true to the scientific method – for small datasets. But it can’t be adapted to the immensely complex datasets we see today. The rise of data affects all scientific disciplines: we have next-generation sequencing machines in biology, the Large Hadron Collider in physics, and satellite data collection devices in climate sciences. Scientific practice now requires complete familiarity with a wide set of computational tools and algorithms.
Suppose our biologist decides to embrace the data revolution. There are plenty of open-source resources online: she starts by enrolling in a programming course. She posts questions on message boards and forums. At conferences, she learns how to visualise her results. Eventually she starts offering her own advice to others. Because she recognised her limitations and worked to address them, the biologist is ready to begin her journey as a data scientist.
The fundamental characteristics of being a scientist have not changed. We still need systematic, logical researchers with robust methodologies (and substantial funding). But computational experience is now a prerequisite. The future belongs to all-rounders: computer-literate scientists who can quickly transform data into results, and effectively present their results to a broad scientific audience.
Atma Ivancevic is an amateur writer and soon-to-be PhD graduate in bioinformatics at the University of Adelaide, Australia. In her spare time, she enjoys binging on Netflix and spending lazy days at the beach.
This piece was selected as one of the winning entries for the Publishing Better Science through Better Data writing competition. Publishing Better Science through Better Data is a free, full-day conference focussing on how early career researchers can best utilise and manage research data. The conference will run on October 26th at the Wellcome Collection Building, London.