Scientific Data

Data Matters: interview with Susanna-Assunta Sansone

Susanna-Assunta Sansone is Associate Director at the University of Oxford e-Research Centre and Honorary Academic Editor of Scientific Data

What are the data sharing practices in your field at the moment?

I’m not a data producer; I am a biologist by training, but since my PhD I’ve worked in data management, helping other people to structure, share, and explore their data in the life, natural and biomedical sciences. For at least the past five years, funders in these areas have been developing stronger, more detailed data sharing and data stewardship policies. A large amount of data is being produced, and funders want to see those datasets being shared and reused, to get a return on the money spent enabling the production of this data.

The funders are the ones with the stick, so the data producers largely have to follow the policies, and I think the scientists are generally accepting of them. Scientists largely agree that data need to be shared, because to power their comparisons or statistical analyses they need to link their content with similar datasets that third parties have produced. They realize that if other investigators do not share data, it’s a limitation for them too. The difficulty at the moment is that in some cases the wording of the policies is still quite general and very loose on how the data is shared, can be shared, or cannot be shared.

The other problem that scientists come across is that even when data are shared, these data are not necessarily reusable. The details of the experimental context or the data processing steps are often not sufficient to really understand what was done, and to decide whether a dataset can be used in a certain analysis with the confidence that the data are robust and sound and so can be reused.

“The other problem that scientists come across is that even when data are shared, these data are not necessarily reusable.”

How much attention do you think is currently given to the discoverability of data?

This is connected to how clear and detailed those policies are about how you can or cannot share the data. The challenge here is directing the data producer towards a proper repository, for example an institutional repository or a global public repository: one that has a publicly accepted sustainability model, that is open so the datasets are available, and that uses standards so the data can be analyzed in the context of other data, making it reusable.

Most policies nowadays require the data producer to create a data management plan explaining how they are going to share the data. Unfortunately, in many cases these parts of the grant proposal weren’t scored; you could write a bad data management plan and still get approved. Now the funders are paying attention to how the management plan is written, so that people cannot simply say “I’m going to make this available” and leave it at that; they have to explain how the data will be made available, so that the grant reviewer can actually judge whether that is an appropriate way to share the data, and if there is a public repository you will usually be directed towards using it.

What do you see as the role for machine readable content and semantic annotations in discoverability?

It’s absolutely critical, not just for discoverability but also for reuse. This opens a big discussion about what actually is the most appropriate machine-readable format, because even a text file is easily parseable by a tool. Formats employing controlled terminologies in place of free-text values and descriptions are extremely valuable. W3C-endorsed languages such as RDF and OWL are becoming common practice, for example. At the heart of today’s challenges is the need for a unified and linked view of the data. But it’s not just about the type of format; it’s also about the richness (breadth and depth) of the content described, in particular the experimental context, and if and how this helps with reuse of data.
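The benefit of controlled terminologies over free text can be sketched with a toy example. The term IDs and the synonym table below are hypothetical placeholders, not drawn from any real ontology; the point is only that once differently worded descriptions resolve to the same controlled term, a tool can recognize that two datasets describe the same kind of experiment and link them.

```python
# Hypothetical mapping from free-text synonyms to a controlled term ID.
# In practice such mappings come from community ontologies; these IDs
# are invented for illustration only.
SYNONYMS = {
    "rna-seq": "ONT:0001",
    "rna sequencing": "ONT:0001",
    "transcriptome profiling by sequencing": "ONT:0001",
    "mass spec": "ONT:0002",
    "mass spectrometry": "ONT:0002",
}

def annotate(free_text):
    """Return the controlled-term ID for a free-text value, or None."""
    return SYNONYMS.get(free_text.strip().lower())

# Three datasets described with different wording by different labs...
records = ["RNA-Seq", "RNA sequencing", "Mass spectrometry"]
terms = [annotate(r) for r in records]
# ...the first two resolve to the same term, so a tool can now see
# that those two datasets are comparable, despite the free-text mismatch.
print(terms)
```

Real-world semantic annotation adds much more (term definitions, hierarchies, cross-references), but the core idea is this normalization step: controlled terms make equivalence machine-checkable where free text leaves it ambiguous.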

There are over 500 domain-specific standards (formats, terminologies/ontologies, minimal reporting requirements) in the life sciences and biomedical domains that have been developed by grass-roots communities over the last ten years. These can be long-term, multi-stakeholder and challenging endeavors. Funders and publishers have been monitoring these efforts, and there has been, and will continue to be, help in the development, testing, refinement, endorsement and maintenance of these community standards. The NIH, for example, as part of the Big Data to Knowledge initiative, is working to create a specific framework (policies, funding programmes, etc.) to support community standards throughout their life cycle, fostering harmonization to ensure standards are interoperable.

“NPG has a history of leading in such matters and already plays a key role in the data reproducibility agenda.”

How does a product like Scientific Data help make data more discoverable?

Funder and governmental data policies have encouraged the open access movement and data standards communities to attempt to transform research communication; publishers are of course beginning to take a more active role as a result. NPG has a history of leading in such matters and already plays a key role in the data reproducibility agenda.

Nowadays hypotheses and discoveries are mainly shared as scientific discourse in narrative form via traditional journal articles. But if, for example, you want to reuse the data that underpin the results, that’s when the difficulty comes. Scientific Data focuses on the data that are fundamental to the discourse, introducing a new content type. The Data Descriptor (DD) has been designed to provide detailed descriptions of experimental and observational datasets, including the methods used to collect the data and technical analyses supporting the quality of the measurements. The DD has a narrative component, complemented by a semantically structured, machine-readable one; it references the traditional article and cites the data files, deposited in appropriate community databases. Scientific Data has been designed to become an integral part of the data sharing ecosystem, providing added-value descriptions of valuable datasets.

Interview by David Stuart, a freelance writer based in London, UK

Read the “Big data with Susanna-Assunta Sansone” blog post and listen to the podcast on Naturejobs.
