Earlier this month, Scientific Data published its first two Data Descriptors. These pre-launch articles recently cleared peer review, and we have decided to publish them before our formal launch in May 2014. They were published using a simplified article template, but they will be transferred to our more feature-rich publication platform in May, and will retain the same citation information and DOIs.
Both of these works present valuable, previously unpublished datasets. We are also actively considering Data Descriptor manuscripts that expand on previous publications (e.g. releasing important datasets in more detail, or making them more reusable), and we expect to have some excellent examples of these types of follow-up works for our launch in May.
Global integrated drought monitoring and prediction system
Zengchao Hao, Amir AghaKouchak, Navid Nakhjiri and Alireza Farahmand
11 March 2014, doi:10.1038/sdata.2014.1
HTML | PDF | ISA-Tab Metadata
Editor’s summary: Droughts are costly natural disasters that lead to serious water and food crises. Here, the authors provide datasets comprising both historical drought severity information and forward predictions of drought likelihood, across the entire globe. These data may be useful for understanding and reducing drought impacts, particularly in the developing world.
- Drought indicators are provided using three different metrics measuring agricultural drought, meteorological drought, and an integrated drought index.
- Associated with this publication, the authors provide a freely available archival version of these datasets and also maintain a periodically updating resource.
- To help others reuse and reproduce these data, the authors have shared the source code used to generate these indicators.
The systematic identification of cytoskeletal genes required for Drosophila melanogaster muscle maintenance
Alexander D. Perkins, Michael J.J. Lee and Guy Tanentzapf
11 March 2014, doi:10.1038/sdata.2014.2
HTML | PDF | ISA-Tab Metadata
Editor’s summary: Research into the genes that regulate long-term muscle maintenance has contributed to our understanding of both aging and myodegenerative diseases. Here, the authors conduct an RNAi screen of 238 genes for involvement in Drosophila muscle maintenance using a climbing-based phenotypic assay, and describe the resulting data in detail. Other researchers may use this dataset to identify genes that deserve more in-depth study in the context of muscle maintenance.
- The authors retest initial hits using an adult-specific RNAi expression construct, helping them differentiate genes with roles in muscle maintenance from genes with roles in earlier developmental processes.
- The authors provide their data in different formats through two data repositories:
  - raw data suitable for fully reproducing the authors’ data processing steps, available at figshare (https://dx.doi.org/10.6084/m9.figshare.806269);
  - structured datasets integrated into GenomeRNAi (GR00238-S), which will allow researchers to search and mine these phenotypes easily.
These early publications help highlight some of the unique aspects of Scientific Data’s Data Descriptor article type, including:
An article format that focuses on data quality and reuse
The Data Descriptor contains article sections that aim to help others assess the quality of the data and provide advice on reuse. See in particular the “Technical Validation”, “Data Records” and “Usage Notes” sections within each of these Data Descriptors.
Perkins et al outline several different potential reuse cases for their dataset. Most directly, it provides candidate genes that deserve more study within muscle maintenance processes. In addition, these data offer users the chance to compare the set of genes required for muscle development to those required for muscle maintenance. The authors propose that the overlap or differences between the two could provide insight into the underlying mechanisms of both processes. These types of reanalysis would not be possible without access to the full screen data, as provided by the authors here. To help others reuse these data, the authors also suggest relevant analysis tools and databases within their “Usage Notes” section, and thoroughly describe the contents of the raw data files in the “Data Records” section. The latter will be particularly welcome to users since there is no standard format for this data type.
Hao et al use their “Usage Notes” section to explain how GIDMaPS data can be used to assess the fraction of global land areas experiencing droughts of different severity levels or to investigate a particular region’s drought climatology. They further propose that these data will be of value to both ecologists and resource managers, and could improve decision making and disaster preparedness.
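As a rough illustration of the first use case, the sketch below estimates the fraction of land area at or below a set of drought severity thresholds from a single gridded map of a standardized drought index. It is not the authors' code: the function name, the array layout (latitude by longitude, with NaN over ocean or missing cells) and the threshold values are all assumptions made for this example, and the GIDMaPS “Usage Notes” should be consulted for the actual file formats and severity definitions.

```python
import numpy as np

# Illustrative severity thresholds on a standardized index.
# These values are an assumption for this sketch, not taken from the article.
THRESHOLDS = {"D0": -0.5, "D1": -0.8, "D2": -1.3, "D3": -1.6, "D4": -2.0}


def drought_area_fractions(index, lats, thresholds=THRESHOLDS):
    """Return the area-weighted fraction of valid cells at or below each threshold.

    index: 2D array shaped (lat, lon); NaN marks ocean or missing cells.
    lats:  1D array of cell-centre latitudes in degrees, one per row of index.
    """
    index = np.asarray(index, dtype=float)
    lats = np.asarray(lats, dtype=float)

    # On a regular lat/lon grid, cell area is proportional to cos(latitude).
    weights = np.broadcast_to(np.cos(np.deg2rad(lats))[:, None], index.shape)

    valid = ~np.isnan(index)
    total = weights[valid].sum()
    if total == 0:
        raise ValueError("no valid land cells in the index grid")

    # Replace NaN with +inf so missing cells never count as drought.
    filled = np.where(valid, index, np.inf)
    return {
        name: float(weights[filled <= limit].sum() / total)
        for name, limit in thresholds.items()
    }


if __name__ == "__main__":
    # Tiny made-up grid: two latitude rows, two longitude columns.
    demo = np.array([[-2.5, 0.0], [np.nan, -1.0]])
    for name, fraction in drought_area_fractions(demo, lats=[0.0, 60.0]).items():
        print(f"{name}: {fraction:.2f}")
```

Weighting by the cosine of latitude matters on a global grid, because cells near the poles cover far less ground than cells near the equator; an unweighted cell count would overstate high-latitude droughts.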
Machine-readable metadata supporting each publication
These records are designed to help data miners, especially as we build a larger library of publications. They are currently downloadable in the ISA-Tab format, and a summary of some of the high-level annotation is presented in a “Structured Summary” box near the beginning of each article. Wherever possible, the annotation terms in these files are taken from community-developed ontologies, and the metadata includes references back to the source ontologies stored at BioPortal.
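For readers who want to mine these records, the following minimal sketch shows one way to read the ontology references from an ISA-Tab investigation file, which is a tab-delimited text file divided into sections by upper-case header lines. The sample content, the identifier and the function names are hypothetical and only mimic the general layout; real files downloaded from the articles will contain more sections and fields, and a dedicated ISA-Tab library may be a better choice for serious work.

```python
import csv
import io

# Hypothetical, heavily shortened investigation file used only for illustration.
SAMPLE = (
    "ONTOLOGY SOURCE REFERENCE\n"
    "Term Source Name\tOBI\tENVO\n"
    "Term Source Description\tOntology for Biomedical Investigations\tEnvironment Ontology\n"
    "INVESTIGATION\n"
    "Investigation Identifier\thypothetical-id\n"
)


def parse_investigation(text):
    """Group the rows of an investigation file by their upper-case section header."""
    sections = {}
    current = None
    for row in csv.reader(io.StringIO(text, newline=""), delimiter="\t"):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0] or cells[0].startswith("#"):
            continue  # skip blank lines and comment lines
        label, values = cells[0], cells[1:]
        if label.isupper() and not any(values):
            # A lone upper-case label starts a new section.
            current = sections.setdefault(label, {})
        elif current is not None:
            current[label] = values
    return sections


def ontology_sources(sections):
    """Map each ontology abbreviation to its description, where one is given."""
    ref = sections.get("ONTOLOGY SOURCE REFERENCE", {})
    names = ref.get("Term Source Name", [])
    descriptions = ref.get("Term Source Description", [])
    return {
        name: (descriptions[i] if i < len(descriptions) else "")
        for i, name in enumerate(names)
        if name
    }


if __name__ == "__main__":
    parsed = parse_investigation(SAMPLE)
    for name, description in ontology_sources(parsed).items():
        print(f"{name}: {description}")
```

Collecting the ontology sources first makes it easier to interpret the annotation terms elsewhere in the metadata, since each term refers back to its source by abbreviation.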
Data are archived in public, open data repositories
Data associated with these articles are hosted at figshare, GenomeRNAi, and a custom data portal built by Hao et al. Storing data in multiple places helps to serve different user groups and to ensure long-term persistence of the data. For the Data Descriptor by Hao et al, the figshare data record provides an archival representation of the data as it was peer-reviewed, helping others to reproduce the analyses presented in the article. Their web portal provides a more flexible resource that will serve users interested in getting the most up-to-date drought indicator data.
Data citations, which include direct links to datasets at their host data repositories
Data citations provide a formal way to link external data records to publications, in a manner that others can mine just like traditional literature references. This will allow the community to track data reuse and credit authors who share their data.
We are hopeful that these early publications will help interested scientists get started with drafting their own Data Descriptor manuscripts, and ultimately help spur wider data sharing!