Ioannis Xenarios is Director of Swiss-Prot and Vital-IT at the SIB Swiss Institute of Bioinformatics and Professor in computational biology and bioinformatics at the Center for Integrative Genomics, University of Lausanne.
How open to the sharing of data is your field?
My field is bioinformatics, biocuration, and modelling. People recognize they have to be good citizens, and providing all the data and the tools is the best way forward, but we are not looking at the Higgs boson or some stars in the sky that will have little direct effect on our present livelihood or economic wealth (not to say that it is not extremely interesting science). There is a tendency to think that all government-funded activities have to be open access and reusable, but at the same time many universities have created technology transfer organisations as a means to valorize their data and intellectual property. There is a closed system on top of an open access system.
Sharing is what most people would like to do, but the reality is that it is quite unlikely to happen fully. Access to open data is a goal that the whole scientific community needs to achieve, in the same way that traditional collections were open to any scientist: you just had to justify that you were a scientist and you could have access. The difference is that before we had to create museums for people to visit, and now we just have to provide the right means for people to access these data, making sure they are scientists and that they “don’t abuse” the access. It’s going to be a trade-off between fully open access, controlled access, and closed access, and that trade-off will persist because people won’t share things that have an economic value.
How important is the quality of the data for data reuse within your field?
In terms of biocuration, I’m the director of the Swiss-Prot Group, which actually curates data, so it’s pretty much safeguarding and evaluating the scientific literature that is produced in Nature, Science, and other journals. In that sense, the data has to be evaluated, contrasted, and brought to the scientific public, or to the masses, in a coherent, correct, and historical perspective, because there is a tendency to think that when the data is generated it is the end of the story; it is certainly the end of a PhD or postdoc research study. So that’s what curation is all about. There is an updating mechanism of the knowledge that needs to be publicly available.
However, the quality of the data is not really the issue. Compare computational biology with the example of a single experiment done at the Large Hadron Collider (LHC). The LHC experiment cost several billion to build one machine to measure one particle, so it was a really controlled, high-quality experiment. Now imagine multiplying exactly the same type of machine thousands of times in different labs, with cells that are named similarly but do not have exactly the same genetic makeup. Quality-wise, there will be a different range of qualities when people use these cells in their experiments. However, a remarkable thing in biology is that whenever someone claims to have seen something, there is a tendency either to reproduce that experiment or to throw it away and say that’s nonsense. So the quality percolates as time progresses, as people actually reproduce the experiments. Quality is not really the issue; the reproducibility of the research is what matters.
The claim is that people are interested in reusing the data, and that is what the funders want, but in reality there are very few examples where reusability of data has been seriously achieved. The reason it has been so rare is that people have been creating small datasets, and it is only when a very large fraction of data is generated in the same manner, or at least a controlled manner, that it becomes a reference set that people can reuse. The hope that small datasets will be reused is a very good idea, but it takes a lot of effort to maintain those small datasets, and sometimes it will be too much, depending more on where the data has been generated than on the quality of the data itself.
You also need people who are computationally, or bioinformatically, competent to take advantage of these data. That is a challenge as well, because such people are quite rare on the market currently. It’s a core competency that has to be added in order to become a twenty-first-century biologist or a computational biologist.
So how do you see something like Scientific Data fitting into the open data ecosystem?
The problem we are facing nowadays is that it has become extremely hard for a reviewer to assess all the claims that are made in a paper, and Scientific Data allows you to review the reproducibility of the figures, of the analysis, and of the underlying data. It will be interesting to see how Scientific Data will be reused. One possibility is for people who want access to a reference set in order to evaluate or advance methodologies. There have been a few efforts to take a crowdsourcing approach to evaluating methods, and DREAM (Dialogue on Reverse Engineering Assessment and Methods) is probably a good case where there is no real gold standard. Scientific Data could become a really good resource for such activities, especially if it could be mined, reused, and cited.
So perhaps Scientific Data could bridge that gap, providing both the way the data has been produced and the way it has been analysed, along with the advantage of being able to cite an object which is like a paper but is in this case a dataset.
Interview by David Stuart, a freelance writer based in London, UK