Our roadmap to engagement, your call

Are you a society, or a working group, a data center or library, a consortium or a grassroots organization that works to collect, curate, share and publish datasets from the research sciences? Are you developing or maintaining open, community standards that enable annotation, sharing and reuse of datasets? Do you have databases or tools, with an established user-base, that help with the data curation, reporting and sharing process? Do you work to harmonize, convert or bridge across different community standards? If you are interested in learning when, where and how you can work with Scientific Data and become part of our community, then here is what you need to know.

Since our first public announcement, several communities have reached out to us, willing to lend their domain expertise and support. We are ready to capitalize on these positive sentiments and acknowledge contributions. Your interest is our drive. To help guide and encourage community participation, we are proposing here a roadmap for engagement, especially for communities operating in the life science, environmental and biomedical domains, but not limited to these fields. Before outlining its key steps, let us first walk you through our motivations and plans.

Our main content type, the Data Descriptor, combines traditional narrative content with structured components, focused on the experimental steps (e.g. provenance of study materials, technology and measurement types) to ensure that published datasets have rich descriptions that make them comprehensible and reusable. In-house curators will assist authors with the completeness of the description, and strive for alignment with community-driven data reporting standards. The latter, however, is not a trivial process and will be fulfilled progressively.

But why? A growing number of groups are developing hundreds of reporting standards (minimum information requirements, terminologies/ontologies, exchange formats) to describe experimental steps and associated results. BioSharing, run by my team, serves as a registry for these standards in the life sciences, environmental and biomedical domains (see list here). The majority of reporting standards, however, have been designed for a specific domain of study, sample-type or technology. If you try to use more than one to encompass a wide range of different experiment types, as we need to for the Data Descriptors, you will hit a brick wall of reality. Fragmentation of standards poses a major challenge to usability, but it is not the only obstacle. The community standards landscape is still fluid. Decisions on the applicability of community standards to Data Descriptors must weigh these resources critically, scoring them on maturity, breadth of support and similar criteria; such decisions also require practical experience and engagement with community experts.

And how? To deliver a viable discovery service in such a scenario, we have anchored Data Descriptors to a metadata framework initiated within my group and developed with a great deal of community input over several years, called ISA for ‘Investigation’, ‘Study’ and ‘Assay’. ISA is an open, generic tabular format (ISA-Tab) that has a core set of fields, for a uniform and searchable structure, and extendable fields, for configuration to domain-specific minimum reporting standards. Several groups are working to represent research data, and there are of course many ways to do this. Why, then, ISA-Tab? Its design and development have been built around a very practical need: assisting researchers to reformat and deposit richly annotated experimental descriptions to repositories, mainly those at the European Bioinformatics Institute (EBI) and at the National Center for Biotechnology Information (NCBI). Certain design decisions have therefore been constrained by the need to support and convert to the diverse formats and terminologies these public databases require. Nonetheless, ISA-Tab has proven suitable for descriptions of many experiment types from different scientific disciplines (Sansone et al, 2012). It is also being used by a growing number of public repositories (e.g. Ho Sui et al, 2012; Haug et al, 2013) and by one other data publication platform, GigaScience (Sneddon et al, 2012); it has been extended to nanotechnology applications (Baker et al, 2013) and even converted to linked data (Kohonen et al, 2013).
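For readers who think in code, the idea of a fixed core plus configurable, domain-specific fields can be sketched as follows. This is only an illustration of the principle, not the ISA-Tab specification: the field names, the "metabolomics" configuration and the helper functions are all hypothetical and were invented for this example.

```python
import csv
import io

# Hypothetical core fields shared by every record (NOT the real ISA-Tab headers).
CORE_FIELDS = ["Study ID", "Sample Name", "Measurement Type", "Technology Type"]

# Hypothetical configurations: extra fields a given community might require.
CONFIGURATIONS = {
    "generic": [],
    "metabolomics": ["Extraction Method", "Instrument"],
}


def missing_fields(row, configuration="generic"):
    """Return the required fields that are absent or empty in one row."""
    required = CORE_FIELDS + CONFIGURATIONS[configuration]
    return [name for name in required if not (row.get(name) or "").strip()]


def check_table(tsv_text, configuration="generic"):
    """Parse tab-delimited text and map each data row number to its missing fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = {}
    for number, row in enumerate(reader, start=1):
        missing = missing_fields(row, configuration)
        if missing:
            problems[number] = missing
    return problems


# A made-up two-row table; the second row leaves two cells empty.
EXAMPLE = (
    "Study ID\tSample Name\tMeasurement Type\tTechnology Type\tInstrument\n"
    "S1\tliver-1\tmetabolite profiling\tmass spectrometry\tQ-TOF\n"
    "S1\tliver-2\tmetabolite profiling\t\t\n"
)

if __name__ == "__main__":
    # Checked against the core fields only.
    print(check_table(EXAMPLE))
    # Checked against the core fields plus the hypothetical community extension.
    print(check_table(EXAMPLE, "metabolomics"))
```

The same table passes or fails depending on which configuration is applied, which is the point of separating a small searchable core from community-specific requirements.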

In preparation for our call for submissions, we are working to create (i) a ready-to-use ISA configuration, for existing and prospective ISA users, and (ii) Word and Excel templates for all other authors. Initially, both the ISA configuration and the spreadsheet templates will hold a set of core fields that offer the most value for broad search and discovery across scientific domains. The Data Descriptors will complement the metadata collected by specific data repositories, which may include more technology-specific information that better supports storing and querying the data files. Progressively we will strive to ensure that this combination of information fulfills community-developed minimum information requirements, by providing more detailed templates and ISA configurations for specific data-types and by working with repositories to establish streamlined exchanges of metadata, so that ideally authors enter this information only once.

Finally, coming back to our roadmap. Here is when, where and how you can work with us and become part of the Scientific Data community, especially for—but not limited to—communities operating in the life science, environmental and biomedical domains.

1. Get in contact

We encourage community representatives to register their interest with us. Please explain your motivations and who you represent. This helps us to gauge interest within a community, build a list of expert contacts, and plan ahead for the next phases.

2. Register your community standards and databases

When relevant, we strongly encourage those developing or maintaining open, community standards, and/or implementing them at community repositories, to publish and register their initiatives at BioSharing. This helps us to monitor the development and uptake of standards. First check whether your standards and/or databases are listed; if needed, register to submit new records or to claim and update existing ones.

3. Help us create improved templates for your community

We will invite designated community representatives to enrich our existing generic ISA configurations and Word and Excel templates to help maximize compliance with community-developed minimum information requirements. These templates will be vetted through a community feedback process and ultimately released on the Scientific Data website to help authors meet community standards.

4. Get integrated!

We will work with existing data repositories, service providers and data producers to implement direct pipelines, using the ISA framework, to minimize authors’ work and streamline information flow.  This could be used to help build direct submission pipelines from other data management systems or repositories – so that authors will only have to write complex experimental metadata descriptions once.

Upon successful completion of step 3 or 4, we will invite groups to add their logos to a space on our website reserved for trusted community organizations.

This will of course be a process of progressive improvement and development. At the moment we are focusing on fields within the life sciences, biomedical and environmental science domains, and the ISA framework similarly has been developed with these areas in mind. However, we would be glad to begin engaging with scientists in other fields of experimental, computational or observational science, while acknowledging that this may involve additional hurdles in terms of developing broadly interoperable metadata standards.

We recognize this roadmap will need to be customized and targeted for different communities, and we will strive to develop multiple approaches to engagement and integration, testing and refining them in a virtuous cycle as our participating community grows.


    Jessica Tenenbaum said:

    Hi SAS and co- wondering if you’ve approached large groups like 1000 Genomes, TCGA, etc. to start this process?

      Andrew Hufton said:

      Our Advisory Panel already has some great representation here. Stephen Chanock is Acting Duty Director of TCGA, and the Synapse platform, developed by Stephen Friend’s Sage Bionetworks, is deeply involved in the sharing and collaborative analysis of the TCGA data. Joseph Ecker is also an important player in the Arabidopsis 1001 Genomes Project. We are in active conversations with scientists from these groups, and have a lot to learn from these big well-organized consortiums about how to share data in useful ways.

      Susanna has also received replies to this roadmap from several groups, and is actively engaging with them.

      Our outreach is still very much a work in progress, so please send interested parties our way. We are still a few months away from our call for submissions, but for big consortiums that may want to get the jump on the submission process, we can start feeding them early information on our format guidelines and discussing data and metadata requirements.