As every area of research becomes data-intensive, emphasis is shifting from data generation to data analysis, bringing new challenges to researchers, says Réka Nagy.
On my first day as a new PhD student, freshly awarded molecular genetics degree in hand, I was sat down at a laptop with an unfamiliar operating system and was encouraged to explore some data using arr. What sounded like pirate speak turned out to be R, a statistical programming language. Yep – for my PhD I swapped pipettes for programming, dilutions for data and spectrophotometers for statistics. Others experienced the opposite, entering the world of biology from a computer science background.
Regardless of our training, we are all scientists. Our job (and, ideally, our passion) is to add to global knowledge by answering scientific questions using repeatable, standardized approaches.
Observe. Measure. Analyse. Discover. Communicate. Improve. Repeat.
These words represent the workflow of the scientist, and the deluge of data that now floods all areas of research has not changed this mantra. But it has changed the way we approach each step, as well as the relative importance of each.
In 1977, 24 years after the discovery of the structure of DNA, the 5386 nucleotides of ΦX174, a DNA-based (rather than RNA-based) virus, were sequenced. This technical breakthrough could only be accomplished through the work of nine scientists who pioneered a new sequencing methodology. In 1977, entire PhD projects were dedicated to sequencing, and analysis was limited by the availability of data.
Today, genome sequencing is turning into a standardized data production service that is complementary to research, instead of being the focus of it. CERN is generating 30 petabytes of data per year. The humanities are no different, generating data by digitizing books, cataloguing recordings of spoken word or producing a database of high-resolution images of the art of the world.
With so much data becoming available so quickly, emphasis has shifted to data analysis, which is becoming an indispensable transferable skill. Access to many, varied datasets causes scientific questions to increase in number and complexity, making them difficult to investigate thoroughly. This abundance allows us to discover more, faster, but we have less time to dwell on the implications of each new finding – the results of the next experiment are in before the results of the current one have crystallized in our mind.
We need to focus our enquiries by solving problems raised by current societal issues and developing solutions – alongside other scientists – with real-world applications. We have to maintain our ability to write for each other in journals, but new discoveries cannot benefit society if they stay there. It is also our duty to communicate our research, translating large amounts of complex data into concepts that can be understood by policymakers and the public. The increasing importance of science communication is evidenced by the availability of policy internships and science communication workshops and competitions aimed at early-career researchers.
It is a steep learning curve, but three years in I can see the real value in one person becoming an interdisciplinary entity of their own. We can see an experiment through from start to finish, which means there are no crucial black boxes we don’t know about. This speeds up troubleshooting and helps with interpreting results. We’re able to write grants without the burden of these being labelled as explicitly interdisciplinary.
By communicating our findings to other scientists and society, we transform data into knowledge, gain a deeper understanding of its implications, and pick up useful tips on how to improve both ourselves and our science. Data-intensive research has transformed how we work with data, and requires us to develop and learn to use new methodologies. Scientists stepping up to this challenge should embrace the wealth of data as a tool to boost the rate of scientific progress.
Réka Nagy is a PhD student at the MRC Institute of Genetics and Molecular Medicine at the University of Edinburgh, unravelling how genetics shapes our health using large family-based datasets. When she’s not busy writing scripts and analysing data, she can be found communicating science or using a computer to play video games and design anything from posters to dream homes. You can find her on LinkedIn and Twitter.
This piece was selected as one of the winning entries for the Publishing Better Science through Better Data writing competition. Publishing Better Science through Better Data is a free, full-day conference focussing on how early-career researchers can best utilise and manage research data. The conference will run on October 26th at Wellcome Collection Building, London.