Scientific Data

Data Matters: Interview with Albert Heck

Albert Heck is a Professor of Biomolecular Mass Spectrometry and Proteomics at Utrecht University, the Netherlands, and head of the Netherlands Proteomics Centre.

What are the current data practices in your field?

In the field of proteomics we generate large data sets, and it’s almost impossible to completely interpret these data sets yourself, so there’s a huge effort to share them. It’s actually the journals that have been important in that: together they have enforced that people start to use these public repositories. There are different levels of repository. There are repositories where we put our raw data, which is mandatory, and repositories that contain a small proportion of the end result.

Data Matters presents a series of interviews with scientists, funders and librarians on topics related to data sharing and standards.

One of the biggest raw data repositories is the PRIDE (PRoteomics IDEntifications) database, headed by the European Bioinformatics Institute. It’s nice that it can be done, but it’s also the hardest part, because they have to have an enormous amount of capacity to hold the raw data. UniProt (Universal Protein Resource) extracts a very small part of that. It’s extracted from all that data, and that is not the real data, that’s just the conclusion drawn from the data. You’re very happy if your work ends up in UniProt, because the user base of PRIDE is small: people like me who want to check or analyse the hardcore raw data of other proteomics research. The user base of UniProt is almost every other biomedical researcher who wants to know something about proteins. I don’t have the statistics, but the user base of UniProt is millions, and the user base of PRIDE might be tens to hundreds. But we need both.

Are there any problems with data sharing in your field?

One problem is that if the data are in the public repositories, and people are looking through these data sets and taking information out of these public repositories to put in their own databases, at some point it becomes impossible to trace where the data are coming from. I think UniProt is doing its best to keep references to the original citations where it got its information, but it’s more the users of UniProt who don’t look at that level anymore; they just refer to UniProt, so getting credit for the data you have analysed is an issue. It’s especially difficult when people start to do meta-analyses of different data sets from all over the world; it might be that our data sets are far more widely used than I know.

“We want to share data with our colleagues, but we also want to share the huge effort that has been done.”

It would be nice if there was traceability; if you have generated data and made it publicly available, you could then trace back to see if people have been using your data. We put data in these public repositories and we have no idea if people are using it at all. But if people do use it we would like to know who was using it, and have the possibility to get in touch with them if that could be interesting or helpful.

When you do a large proteomics study, only a few percent of what comes out of that work is taken as a new finding and followed up by all kinds of laborious biological experiments. This might become, let’s say, a Nature article. And then you feel: “OK, we’ve done this six months’ work, created these valuable data sets, and it ends up as a small paragraph in the paper”. You have the impression that the full glory of what you’ve achieved is summarised in a few words or so, and in that sense I think in our field we want to share data with our colleagues, but we also want to share the huge effort that has been put in.

How does a product like Scientific Data help solve the problem?

One area where Scientific Data helps is where you feel that a description of the data sets would add value to the data. If there was a Nature or Science article in the field where only a few percent of your data set is used, I would like to actually describe the data in more detail, so that researchers have something that helps them find it valuable. Sometimes we want to make the data available as a resource, and publishing resources is not something that is a top priority for normal journals.

It would also be good if, when you publish things in Scientific Data, there were an external reviewer who could describe the quality or the value of the data that has been generated. We have this in the field a little at the moment. There is a group in Canada that searches through all the deposited proteomics data. It’s a bit scary, of course, but they give the data a grade and say, “oh, this is very high quality data”, or, “with this data the researcher forgot to do this”, etc. I think you want your data to be used more and more, but for people who are less experienced with your specific sort of data it would be nice to give them a handle on how they should approach it, and what they should and shouldn’t use. I think it would not be bad if Scientific Data included a review with every paper, to be able to get at the value of the data.

Interview by David Stuart, a freelance writer based in London, UK
