Want to embrace open data but don’t know where to start? The tools are out there, says Matthew Edmonds.
The Publishing Better Science through Better Data conference, or #scidata16 for short, took place at the Wellcome Collection in London at the end of October. This one-day event organised by the journal Scientific Data, Springer Nature and the Wellcome Trust explored the challenges facing early-career researchers as we enter the era of open data.
As a data novice, I arrived without really knowing what to expect. The types of experiments I perform generate only small datasets needing a simple statistical test, easily summarised in a graph in the manuscript. The original data can be safely left to gather dust in a shared drive.
Or so I thought. As the day progressed, it became clear that most data issues apply no matter what your experiment is. I hadn’t previously considered how other researchers can access, read, understand, use and confirm the reproducibility of my data. Now, we’re entering an era when most journals require publication of raw data and analysis techniques alongside a main article. Fortunately, a post-lunch session of lightning talks illustrated some of the emerging solutions to these problems.
Some results may be years old before they’re published, and others never see the light of day because they’re considered “negative” or irrelevant. An experimental manipulation that produces no effect may not make it into a journal article, for example. To challenge this status quo, Dr Rachel Harding is sharing her results in near-real-time through the Lab Scribbles blog. She makes mini-reports of her work on Huntington’s disease and her data go into the Zenodo repository (which makes any dataset citable). Taking this approach has benefits for both her and the wider community: they don’t have to wait months to see new data, and can offer suggestions for improvements or collaborations before publication. Dr Harding’s lab book has been viewed over 20,000 times from 95 countries. How many people have read yours?
Of course, making sense of someone else’s lab book can be challenging. Jo Barratt of Open Knowledge International introduced their concept of Frictionless Data: packaging data to a few simple standards that greatly improve the ease of sharing, reading and usage. Among their tools is Goodtables, which enables quick validation of tabular data (you can publicly test it here). Just upload a table and a schema (defining the variables and any restrictions, such as integers only) and it will flag any errors before you get into the nitty-gritty of analysis.
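To make the idea concrete, here is a minimal sketch of what schema-based validation does. This is not the Goodtables implementation or the Frictionless Data API; the function, schema format and column names are invented for illustration only.

```python
# Illustrative only: a hand-rolled check of a CSV table against a simple
# schema, in the spirit of Goodtables. Real Frictionless Data schemas
# are richer (JSON Table Schema with types, constraints, formats).
import csv
import io

def validate_table(csv_text, schema):
    """Check each CSV row against per-column rules.

    schema maps column name -> expected type ("integer" or "string").
    Returns a list of (row_number, column, message) tuples.
    """
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_num, row in enumerate(reader, start=2):  # row 1 is the header
        for column, expected in schema.items():
            value = row.get(column)
            if value is None or value == "":
                errors.append((row_num, column, "missing value"))
            elif expected == "integer":
                try:
                    int(value)
                except ValueError:
                    errors.append((row_num, column, f"not an integer: {value!r}"))
    return errors

# Hypothetical table: one well-formed row, one with a non-numeric count.
table = "sample_id,cell_count\nA1,150\nA2,many\n"
schema = {"sample_id": "string", "cell_count": "integer"}
print(validate_table(table, schema))
# → [(3, 'cell_count', "not an integer: 'many'")]
```

Catching a stray “many” or a blank cell at this stage, rather than halfway through an analysis, is exactly the friction the tool aims to remove.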
And analysis is particularly difficult in some cases. Take image processing, which is important across many scientific disciplines. The research goals of each field can throw up idiosyncratic problems. Research in nanomaterials, for example, can use electron tomography – a technique that allows 3D characterisation but requires rendering of the data into an image. Visualisation is essential to understanding the data but is highly dependent on the preferences of individual researchers, which are very difficult to describe in writing. To address this problem, Dr Robert Hovden of the University of Michigan developed tomviz, which integrates the raw data and manipulation steps into one place. This allows others in the field to see the pipeline from data to model. Dr Hovden has made it open source and independent of operating system, and he says he sees no reason why it couldn’t be used for any similar dataset.
In contrast, neuroimaging datasets require huge number-crunching power to provide outputs relevant to whole networks of neurons, even up to the level of the whole brain. Individual researchers often lack the required computing power at their institutions, or must wait their turn to use it. This bottleneck inspired the Montréal Neurological Institute to create resources open to anyone. They cover the whole process from data repository (LORIS) to high-performance computer processing (CBRAIN), and importantly maintain compatibility with multinational collaborations such as the European Human Brain Project.
So what were the lessons for my dormant hard drive and me? I’d never seriously considered sharing my data going into #scidata16, but this position is fast becoming inexcusable. Yes, it might be difficult for me to get all of my information out into the big wide world, but creative people are making tools to facilitate that process. Now I will be seeking and using those tools for my own data – for the benefit of everyone.
Matthew Edmonds is a postdoc at the University of Birmingham, UK, researching how cells with defective mechanisms for repairing damaged DNA can lead to cancer. He finds the pace of change in the technology available to researchers astonishing, and tries his best to keep up. You can keep up with him at @benchmatt.
You can access all the slides and videos from Publishing Better Science through Better Data 2016, as well as the great visual summary of the day, on the event website.