Scientific Data

Data Matters: interview with Robert Cook

Robert Cook is a scientist at the Distributed Active Archive Center (DAAC), Oak Ridge National Laboratory, USA

What are the current preservation practices in your field?

There is a range of practice. Some agencies are very good about requiring that data from their funded projects be preserved. And some investigators are very good about preserving their data, while others are not quite so good. Some researchers see the benefit of sharing their data, and so they prepare the data well, making sure it’s available for others to use. Others perhaps don’t have the background in data management to prepare data to share, or perhaps they’re just using it themselves and don’t really need to prepare it for others to understand and use. We see this range in our archive here at Oak Ridge National Lab, a programmatic archive for the NASA terrestrial ecology programme: some investigators prepare their data really well, and for others it’s very difficult.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

Today, many journals require data to be published before the article based on those data can be published. This practice is helping a great deal to get data preserved. At the ORNL DAAC, we’re being asked to archive data products so that authors can publish their papers. Another requirement that is common here in the States is that when investigators write proposals, the funding agency requires a plan for data management. So when they’re writing a proposal, investigators are thinking about what they’re going to do with the data during their project, and thinking about where they’re going to preserve their data at the end of the project. So that’s helping a lot as well.

People increasingly recognise that they can contribute their data to larger studies and answer questions that really can’t be answered by just their one data product. The example I’m thinking about is the global network of flux towers. They’ve been around for 20 years, and when they were first established people would just study their own site without sharing their data. They’d learn a tremendous amount about the forest or the grassland or biome that they were studying. After the first 3 or 5 years went by, the investigators began to see that if they combined their data with those from other similar biomes, they would learn much more about those biomes.

“We’re also seeing that researchers are still using those 20-year old data to address today’s scientific questions, and that’s really important.”

How difficult is the long-term preservation of this data?

At our data centre we follow an old US National Research Council axiom that we archive data for a user 20 years into the future, and that’s obviously a very difficult thing to do. It is difficult to imagine what technology researchers will be using in 5 years, let alone 20 years. We try to describe the units and all the aspects of the data so that someone can pick up the data in 3, 5, or 20 years and understand and use the data, without having to go back to the investigator to ask questions about it. We need to capture the information about a data set in a way that’s curated and well maintained over the long term. We’re now coming up to the 20-year anniversary of some ecological data products at our archive, and we’re able to look at those data products and see that we did archive them in file formats and with documentation that are readable now. Interestingly enough, we’re also seeing that researchers are still using those 20-year-old data to address today’s scientific questions, and that’s really important.

We need to have good archival practices with descriptive metadata that allows people to find the data, even though they may not know exactly what it is they’re looking for, and may not know anything about some of the studies that were conducted. Once they have discovered this data they need to be able to access it and use it, and the documentation and the information about the data in the files — the metadata — needs to be good enough to allow that.

In the future, people will use these preserved data in ways that no one imagined when the data were first collected. We can never really recreate the environmental conditions of the past, so it is important that we preserve our observations. The scientific community needs to preserve diverse data that can be used to address hypotheses.

Do researchers have the necessary data management and preservation skills?

I think data management training during most students’ graduate and undergraduate education is lacking, and there needs to be more of an effort to train students to manage their data properly. It’s becoming really important to be able to compile and document data so that people know what the data represent. An important goal is to be able to hand the data over to someone else and say, ‘here’s my data, go use it’, without spending any time explaining the data.

A data management course should be required in Earth or ecological science education, like statistics courses are in many university departments. Data management is an important topic for which everyone needs formal training to do their research. With data management training, researchers can manage their data, share the data among their colleagues, and also get it to an archive in good condition.

“A data management course should be required in Earth or ecological science education.”

What is the role for a product like Scientific Data?

There are a couple of roles. One is to get a description of an important data set out to the community in a formal publication. We expect the audience of Scientific Data to be extensive, so that many will know about these data papers. Another role is about getting appropriate credit for the work that goes into generating a data product. Scientists want to get credit for their work, and a publication in a journal like Scientific Data provides that credit. People will be able to add the citation and Digital Object Identifier to their resumes as a data publication — not quite the same as, say, a paper in Nature, but still it provides nice credit. People can track how many times other authors use those data in publications — a data citation index — and that’s another good metric for scholars.

The issue of data citations is important. When data are used and properly cited in articles, readers will be able to find, access, and use the data behind the article. Readers will be able to use the very same data to reproduce the results in the article, which is the basis for the scientific method, or perhaps use those data in new ways.

Interview by David Stuart, a freelance writer based in London, UK
