What are the data practices in your field?
My job at Microsoft Research is to connect external scientific researchers with some of our researchers at Microsoft so they can solve big data scientific problems that people care about by applying advanced computing technologies. The programme covers the world, every continent except Antarctica, and it really is all about data and the fact that the scale of data has changed the way we do science. We’re in a new stage of doing science, which is data intensive. It’s often distributed; you have to get data sets from different places and put them together.
Some fields are more open than others. For example, the environmental field is typically much more open than the biomedical science field. In fields like astronomy, all data is free because the research has no commercial value, and so there is much less difficulty in sharing the data. However, there are interesting projects which show what is possible even in a field where you worry about personal information. Even in the medical arena people are seeing the value of sharing and making data open, such as in the NDAR and ADNI projects. Given the confidentiality constraints that you have in medicine, I think that’s remarkable.
Are the barriers to the more open sharing of data cultural or technical?
A large amount of it is cultural, and I find cases where there are researchers who don’t wish to share their data under any circumstances. The problem with that is you need research results to be reproducible. If other researchers want to verify your results you need to tell them what you did, and part of telling them what you did means making the data available in some useful form. Also, I feel if the public has paid for this research, there is an obligation for at least some of that data to be made available to other researchers.
Some researchers will see sharing data as a lot of work and question what sort of reward they will get. I think therefore that there will ultimately need to be some form of sanctions or compulsion, otherwise researchers will not see it as worthwhile to make their data available. Like it or not, they’re increasingly being asked by the research councils in the UK, and in the US by NSF and NIH, to have data management plans as part of their research proposals; otherwise they won’t get funded. So they need to think about these things. It really needs to become second nature, and while I can see many people will resent that, I think there’s also a reasonable percentage of people who see the need for it and will accept that it is actually a necessary extra overhead of doing research.
The challenge is to show researchers real opportunities and benefits from the reuse of data. We need convincing examples of how data that has been carefully stored is then used by someone else in a constructive way. Ultimately, since you can’t keep all of the data, it is a question of deciding which data is of value and whether it’s likely to be reused. These are issues that need to be explored by the community. In some fields like oceanography it’s obvious since the measurements are not repeatable; in other cases there are surprising needs for interoperability. For example, we have studied the Russian River Valley watershed in California, and if you want a complete picture you have to combine sets of data from different federal agencies. The US Geological Survey are the people who look at the water flowing in the rivers, but if you want to know about the rainfall, then you have to go to another agency, NOAA (National Oceanic and Atmospheric Administration).
Whose responsibility is it to create a more open science?
One of the challenges for the open science/open access movement is understanding how best you can link data to text. I like the vision of Jim Gray, the Turing Award winner, of a global distributed archive containing both publications and data, which he believed would speed up the process of scientific research and make it more efficient. At the moment it seems to me that there’s a lot of reinvention that goes on with scarce research money, so it would be nice if we could make it so you can get some results, understand where other relevant data can be found, put them together and actually come to new conclusions that you can act on. If you have all curated data stored in searchable repositories, you should be able to go from the literature to the supporting data, do some computation on that data, and add your own data. We need to produce tools that make it impossible for any scientists to claim they didn’t know about previous research data. Ultimately it is the responsibility of the research funding agencies to ensure that the data and the publications that result from that data are actually made available to the public.
The suitability of the available infrastructure depends on the discipline and the field. In some areas, like medicine, they have many millions of dollars to provide an infrastructure such as the US National Library of Medicine. In contrast, the data infrastructure of some other research fields, such as archaeology, may just be supported by the efforts of individual research groups that are funded from hand to mouth, and that’s not a very satisfactory state of affairs. My personal view is that research libraries in universities need to change and take on new responsibilities because some of their research data needs to be stored locally. University libraries need to play a role in helping researchers decide what data they need to keep and telling them how to keep it. It’s part of the cost of doing research.
What is the role of Scientific Data?
As I have said, some research fields are well endowed with places to put their data, but others are not. A journal like Scientific Data has the potential to play a great role in establishing best practice in how you store data, what data sets are meaningful, and what metadata you need to have. It can also provide the opportunity to publish examples of how data has been reused and whether the research is reproducible or non-reproducible.
It could be a very exciting future, provided we can work out a sustainable model of funding, presumably part commercial, part from the funding agencies and part from the universities. There is a cost to storing data and ultimately the process has to be financially viable. However, I now think there is a realisation among both funding agencies and researchers that when you’ve spent a lot of money to get the data, there is an obligation to make sure that at least some of it is retrievable and re-usable 5-10 years from now.
Interview by David Stuart, a freelance writer based in London, UK