
Data Matters: interview with Russell Poldrack

Russell Poldrack is Professor of Psychology and Neurobiology and Director of the Imaging Research Center at the University of Texas at Austin.

What are the current data preservation practices within your field?

Data preservation practices are really non-existent. If people do anything it’s usually saving something to DVDs or tapes, and then sticking it somewhere to rot. I’ve spoken to my colleagues, trying to find the datasets from some of the early landmark fMRI papers to put into OpenfMRI (openfMRI.org), but most of them say either that they can’t find the data anymore, or that it’s on a tape and they no longer have a drive that can read it. I have data from 10 years ago on various tape formats that I couldn’t get to if I wanted to, though it seems that the technology has stabilised a bit. The other worry is that you put it on a DVD or a hard drive, but those things decay; people often assume that once you put the data onto physical media it will be there as long as you want it, and that is definitely not the case. I think the best strategy is to replicate data geographically across as many different systems as possible so that there’s no single point of failure.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

I’m currently writing a paper about data sharing and big data, and one of the things we’re putting out there is that preservation is one of the most fundamental aspects of data sharing.

What are the problems with getting researchers to deposit data in repositories?

Fundamentally people just haven’t bought into the need to do it. Clearly there’s a cost in time and effort and most of the field have not been convinced that the effort that goes into sharing their data is sufficiently paid off by the benefits. Preservation is one of the benefits, but another benefit that is growing is that people have realised that if someone shares their data then their results are probably more trustworthy, and I think that in the younger generation of researchers this is coming out even stronger.

“I have data from 10 years ago on various tape formats that I couldn’t get to if I wanted to.”

Part of the problem is that organizing the data to make them useful for sharing can be hard. You have to make a fundamental distinction between different types of imaging data: whereas resting-state and anatomical data are relatively easy, task fMRI data are particularly challenging. The 1000 Functional Connectomes Project and the International Neuroimaging Data-sharing Initiative have been incredibly successful at sharing resting-state fMRI data, but the metadata needed to share those data are relatively minimal. With task fMRI data there is so much more metadata that you need, and that’s what we focus on at OpenfMRI. The amount of curation you have to put into a dataset to make it widely useful is huge.

It’s not currently an issue of the availability of the sharing service; the repository I run gets maybe one request a month from someone outside of my circle of collaborators to share their data. If everybody wanted to share their task fMRI data there’s no way that the field could actually keep up with that. We can curate at most a few datasets a month, and there are probably at least 20–50 datasets being generated every month. There’s no way we could keep up with it without a lot of support, and the problem is that building a data sharing repository is a pretty boring thing for granting agencies. Right now you only have to share your data if your NIH grant is over $500,000 a year in direct costs. There’s no way that NIH would say that data from all grants has to be shared and that they’re going to support the cost of that sharing, because that cost would be monumental.

“In the end it’s the researchers who are really responsible for making sure that their data are preserved.”

Who is responsible for data preservation, and how does a product like Scientific Data fit into the ecosystem?

In the end it’s the researchers who are really responsible for making sure that their data are preserved, but it’s really hard to do that on one’s own, and certainly the people who are funding the acquisition of the data and processing the data have a stake in making sure that those data live as long as possible. Libraries can clearly play a role given their historical role as repositories of knowledge, but I think that in the end it’s the researchers who bear the fundamental responsibility for making sure that their data are preserved in a way that will be useful.

Scientific Data offers an accessible and permanent way to describe a dataset at a level of detail that wouldn’t necessarily be included even in the paper that describes the results of a study. There are often details that can’t fit into a paper, and so Scientific Data will provide more visibility for shared data. It also provides people with a way to get some credit for sharing. One of the concerns with sharing data is that someone will take the data and not give any credit for it. There’s been a debate about what the right model is for credit. The Alzheimer’s Disease Neuroimaging Initiative was really successful in collecting a large amount of data on healthy people as well as people with mild cognitive impairment and Alzheimer’s disease, and their model has been that if you want to take the data and publish it, that’s fine, but you have to include the consortium as a co-author on your paper. There’s been a debate in the literature about whether that’s the right model, because it basically creates shadow authors who had nothing to do with what was done in the paper; they just generated the data. Data papers offer a nice alternative to that model, letting people get credit for having shared their data through citations of that data paper. It’s going to be important in helping people figure out why sharing is good.

Interview by David Stuart, a freelance writer based in London, UK

See this Data Descriptor with data stored at OpenfMRI: Hanke, M. et al. A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie. Sci. Data 1:140003 (2014).
