Scientific Data

Data Matters: interview with Mark Thorley

Mark Thorley is Head of Science Information and Data Management Co-ordinator at the Natural Environment Research Council (NERC), UK.

What are the current preservation practices in your field?

Preservation practices in environmental science are varied; some are quite good, some less so. We’re probably better off than some other areas, as it has long been recognised that certain areas of environmental research depend on building up long-term time series of data to look for change, so people have had to undertake good digital preservation, or data preservation, for a long time now. From my perspective, we know we’re reasonably good at managing long-term, large-scale datasets, the community resource datasets; probably more at risk are smaller-scale individual experimental datasets, which are managed by the researcher or the institution rather than by national or international facilities.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

There are two issues, one is the preservation of the data, and the other is preservation of the knowledge to be able to use those data, which tends to be lost as people move on and retire. Historically data haven’t always been effectively documented, so the data may be secure digitally, but we may lose the key knowledge about the data to enable us to be able to reuse it effectively.

Research data is a very active topic of discussion nowadays. The work done by the Royal Society in its ‘Science as an open enterprise’ report, under Geoffrey Boulton, has been quite seminal in getting the debate and the issues taken seriously at a high level. I’ve been working in research data management since 1990, and for the first ten or more years it was an issue that was below the radar; a lot of people at a senior level just couldn’t understand what the problem was. But as we move to a digital world, how we manage digital items and assets is becoming a bigger and more critical issue.

“We may lose the key knowledge about the data to enable us to be able to reuse it effectively.”

What are the challenges in the preservation of scientific data?

We have to be selective in what we manage, because the biggest cost in digital preservation is the accession process. Once the data are documented and described and have been moved onto secure storage, the cost of actually keeping them on spinning disks is relatively trivial. So there has to be real consideration of pragmatic ways of identifying what we should put effort into keeping, and what we can just let go.

At our data centres in NERC we don’t take everything we’re offered; we assess whether it is of long-term value to the community. We focus on managing material on the understanding that it’s going to be reused by other people, is sufficiently well documented, and is of sufficient value to be worth us putting effort into for the long term. We recognise you can’t go out tomorrow and measure what yesterday’s weather was, and where things are spatially or temporally unique you have to manage the data you generate. However, with some experimental data, it might be more efficient to go away and re-measure. If you look at the efficiency gains in things like gene sequencing, it might actually be quicker to go away and re-sequence the genome rather than to manage a dataset from five years ago. In marine geophysics an individual cruise can cost hundreds of thousands of pounds, so you can come up with an economic argument as to why you need to manage the dataset. But to be honest, you could probably run the same geophysical cruise in five years’ time and get a higher-resolution, better-quality dataset, though at a cost.

The key thing is that if the data support a research publication then they have to be managed, because we need to be able to replicate the research if challenged, and part of that process of replication is having access to the underlying data. It isn’t necessarily the job of the national facilities to do that; it’s more the role of the researcher working in conjunction with their institution.

How will technological changes affect preservation?

Technology will advance and ways of capturing metadata at source will improve, so as time progresses more data will be tagged automatically with richer metadata early in the process. Consider digital photography, for example. On my home computer I have gigabytes of digital photographs taken over the past 10–15 years. We start off with great enthusiasm, documenting when we took our photographs, but after a while we forget to. Nowadays digital cameras come with GPS enabled, so a picture is automatically geotagged when you take it, whether you like it or not. You can imagine that in another five years there’ll be other metadata automatically encoded in a digital image, as we learn better what surrounding information is needed to understand an image. There’ll also be more complex search algorithms and more tools to help us understand what is being held in a digital environment.

What is the role of products like Scientific Data?

“The engagement of the community has undergone a step change in recent years and Scientific Data is part of that.”

The engagement of the community has undergone a step change in recent years and Scientific Data is part of that, but it’s also there to encourage people to think about data more. Publishing a data paper about your dataset, and then having the dataset available to support the data paper, helps provide rewards for people who work on the data science side. To some extent it’s been the less glamorous end of research; people weren’t really interested in making a career in data because you didn’t get the same kudos for doing data as you do for doing primary research, but things are now starting to change in that area. Things like Scientific Data, and the Geoscience Data Journal from Wiley, are really valuable developments, as people put effort into documenting their datasets and then ensuring they’re secure if they’re published. Though not all datasets will be judged relevant to the scope of a journal like Scientific Data.

Interview by David Stuart, a freelance writer based in London, UK
