Scientific Data

Data Matters: Interview with Henning Hermjakob

Henning Hermjakob is Team Leader for Proteomics Services at the European Bioinformatics Institute, UK

What are the current data sharing practices in your field?

Proteomics and interactomics differ markedly from genomics and transcriptomics in that there is no tradition of requiring data to be deposited in public databases. In all of these areas there is a growing realisation that depositing data is necessary, but it's not like genomics, where it has been mandatory for many years. One of the key points I have been working on is to establish frameworks that make data sharing possible, and as easy as possible, and to motivate the community to encourage more and more data sharing.

“One of the key points I have been working on is to establish frameworks that make data sharing possible, and as easy as possible.”

In proteomics there is now a rapidly growing inclination to make data accessible, which I think has to do with the fact that we have established a global framework in which this can be done. In interactomics there is good practice for sharing larger datasets and submitting them to databases, but the very valuable small-scale datasets are still largely not shared other than through publications, and so are not computationally accessible.

How can you help motivate people to share their data?

I’m responsible for the PRIDE and IntAct databases, and the primary thing is to make the data easy to deposit, because the main reproach has always been that it is so complex to deposit complex datasets. Of course you can only address that to a certain extent, and we’ve done our best to do so.

I also emphasise that nowadays data is not found only through the publication but also through databases, which in turn increases the citation count, and that is still a major measure of scientific success. Related to this, we are trying to provide feedback to authors: whenever we curate something directly from the literature, we send the authors an email saying ‘we’ve inserted your dataset into the database, please have a look to see if you’re happy with the representation, and consider us next time for a direct submission’. We also highlight focus datasets, such as a dataset of the month, and provide compilations of datasets on a particular topic.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

What do you think the barriers to data sharing in your field are?

I think in part it’s an awareness problem. People think ‘oh, I looked into this years ago and it’s so complicated and I have to provide so much metadata’, and that is simply not the case anymore. In part we have scaled back the requirements, and in part the tools have become much better, resulting in a much faster, easier data deposition process.

It’s also an attitude problem, in the sense that it’s not clear what the reward is. Depositing a good dataset does not count in the way that even a simple publication does. This is one of the major issues for Scientific Data to address, because if you provide a good data descriptor and it appears as a publication from the Nature Publishing Group, then it carries very strong weight. I think, and hope, this will have a strong influence on slowly changing this attitude, because there will be a reward.

“I’m supportive of Scientific Data because it explicitly says where appropriate we will ensure that the data goes into the right repositories.”

What are the problems with data not being shared in the best way possible?

The big danger is that the data becomes unavailable after a while. First of all, it is not necessarily findable except through the paper: if you have a deposition in some local repository or on the author’s website, then you only find the data if you come directly from the publication. The other peril is that there is a huge turnover in URLs, and after two years, even if the URL was properly referenced, it may have changed without any suitable redirection. We now have two major cases in proteomics where repositories have closed down. In one case it was done properly: there was a proper announcement and the data is still available through FTP. In the other case, the service simply became unstable and there has been significant data loss, with data that was referenced in publications no longer available.

A challenge that comes up more and more is independent institutional repositories, which collect the data and allow you to say you have shared it, but in fact it’s just an unstructured collection of files. This has its place where there is no proper repository, and there are many data types for which nothing well established lends itself to good centralised findability of the data; sharing a collection of files locally is better than not sharing at all. I’m supportive of Scientific Data because it explicitly says where appropriate we will ensure that the data goes into the right repositories.

What role does a product like Scientific Data have?

The main benefit is that it provides motivation in the form of citable publications. That’s really the major thing; everything else could be done in other ways, but citable publications are the major thing. It then becomes possible to just provide the data with a good description, rather than having to generate a story around it, which is often very contrived rather than focused on the actual data. In many respects it’s not really something new, as this has been possible to some extent before, but a key point is that this is now possible through a very reputable publisher with potentially high visibility. Scientific Data might not be able to capture the metadata in as structured a way, but at least it is in a way which is more familiar.

Interview by David Stuart, a freelance writer based in London, UK
