Manuscripts published at Scientific Data contain a ‘Data Citations’ section that helps authors formally acknowledge any datasets mentioned in their manuscript. We know that this section is unfamiliar to many of our authors, so here we provide some background on the purpose of data citations, and advice on completing this section when submitting to Scientific Data.
What is a citation?
The majority of researchers will be familiar with the notion that we all stand on the shoulders of giants. We acknowledge the contributions of others via the scholarly etiquette of citation. The traditional mechanism for sharing scholarly output is the peer-reviewed article, and researchers are trained to cite previous articles that describe ideas, results or lines of reasoning that materially impact on their own work. Scholarly articles used to be primarily identified by author names, article title, journal name, volume and page numbers. In 2000 the DOI (digital object identifier) system was implemented for online scholarly articles, and today a scholarly article can also be identified by just its DOI.
The traditional research article was often used as a stand-in for all other scholarly outputs, but researchers are increasingly sharing other research outputs, such as data, source code and protocols, via stable online repositories, and these too can be directly cited.
What is a data citation?
Data citations are a formal way to record and acknowledge any externally hosted datasets mentioned in a manuscript. They help link an article to related data, and provide credit for data producers, just like traditional literature citations help link to and credit other peer-reviewed works mentioned by the authors. Cited datasets are generally hosted in data repositories¹.
Our data citations are designed to conform to the Joint Declaration of Data Citation Principles (JDDCP). If you are interested in learning more about data citation we recommend reading these principles, and the related paper by Starr et al. (2015).
What should I cite when there is both a paper and an accompanying dataset?
Cite what you used. If you are primarily referring to findings or ideas in an article then citing the paper may be sufficient. If you used associated datasets, especially data archived outside of the article and its supplementary material, then you should cite the data. Often it will be appropriate to cite both: the paper and any datasets you used.
Which data should I include in my data citation section?
Any stably archived dataset mentioned in your publication should be formally cited with its persistent identifier², regardless of where in the manuscript or why it is mentioned. At Scientific Data, all Data Descriptors include citations to the datasets centrally described in the article. In addition, any data from other researchers that were used in the generation or validation of the dataset should also be cited.
The core data described in each Data Descriptor are also recorded in the machine-accessible ISA-Tab metadata files that accompany each published Data Descriptor. The ISA-Tab format allows each data identifier to be recorded in a structured way.
How do I format my data citations?
Data citations at Scientific Data include the following four fields.
- Creator/Author(s)
The creators of the dataset, who may be distinct from the authors of the Data Descriptor.
- Repository Name
For DataCite DOIs this should align with the “Publisher” field in DataCite metadata.
- Dataset identifier
This will be a DataCite DOI or an appropriate repository accession ID.
- Dataset Publication Year
The year the data were made publicly available.
It is important to include in the data citation only information that is present in the metadata associated with the data record. For example, if a dataset does not have clearly defined data creators or authors, this information should not be invented or estimated (e.g. by looking at related publications). Doing so may be well intentioned as a means of giving due credit, but the practice risks entering erroneous information into the citation record.
Our data citation format does not include titles, since not all repositories assign titles in a uniform or easily verified manner at present.
Example citation with a Digital Object Identifier
Hibsh, D., Schori, H., Efroni, S. & Shefi, O. Figshare https://dx.doi.org/10.6084/m9.figshare.1289242 (2015).
Example citation for an accession identifier
NCBI Sequence Read Archive SRP059260 (2015).
Example citation for a range of accession identifiers
NCBI Sequence Read Archive SRX1049768–SRX1049855 (2015).
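As a rough illustration, the three example citations above can be assembled from the four fields with a few lines of Python. This is a minimal sketch and not an official Scientific Data tool: the names DataCitation, join_creators, accession_range and format_data_citation are hypothetical, the resolver prefix is simply copied from the DOI example above, and treating any identifier that starts with "10." as a DOI is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Resolver prefix copied from the DOI example above (assumed, not prescribed).
DOI_RESOLVER = "https://dx.doi.org/"


@dataclass
class DataCitation:
    """The four data citation fields (hypothetical container)."""
    repository: str   # Repository Name
    identifier: str   # DataCite DOI or repository accession ID
    year: int         # Dataset Publication Year
    # Creator/Author(s); leave empty when the record's metadata names none.
    creators: List[str] = field(default_factory=list)


def join_creators(creators: List[str]) -> str:
    """Join names with commas and put an ampersand before the last one."""
    if len(creators) <= 1:
        return "".join(creators)
    return ", ".join(creators[:-1]) + " & " + creators[-1]


def accession_range(first: str, last: str) -> str:
    """Write a range of accession identifiers with an en dash."""
    return f"{first}\u2013{last}"


def format_data_citation(citation: DataCitation) -> str:
    """Assemble: creators, repository, identifier, (year)."""
    identifier = citation.identifier
    # Simplifying assumption: identifiers beginning with "10." are DOIs.
    if identifier.startswith("10."):
        identifier = DOI_RESOLVER + identifier
    parts = [
        join_creators(citation.creators),
        citation.repository,
        identifier,
        f"({citation.year}).",
    ]
    # Empty fields (e.g. no creators in the metadata) are skipped, not invented.
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    print(format_data_citation(DataCitation(
        repository="Figshare",
        identifier="10.6084/m9.figshare.1289242",
        year=2015,
        creators=["Hibsh, D.", "Schori, H.", "Efroni, S.", "Shefi, O."],
    )))
    print(format_data_citation(DataCitation(
        repository="NCBI Sequence Read Archive",
        identifier=accession_range("SRX1049768", "SRX1049855"),
        year=2015,
    )))
```

Note that the creators list is optional by design: when a repository record names no creators, the citation simply begins with the repository name, as in the accession examples above.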
And it is that simple! Getting data citations right means the submission and publication process at Scientific Data runs more smoothly. The bigger picture, however, is that data citations are an essential part of the data publishing infrastructure, giving proper credit to those generating and sharing valuable research data.
Scientific Data is an early adopter of the data citation implementation principles, and a growing number of journals published by Springer Nature are adopting policies that support the citation of datasets in a way that makes data a first-class object in the scholarly literature.
Data journals may be leading the way, but we believe that, in time, data citation will become as routine in scholarly practice as citing peer-reviewed literature is today.
Footnotes
1 We provide a list of recommended repositories, with which we have established publication workflows to make data sharing and publication as easy as possible for researchers.
2 Persistent identifiers (PIDs) are assigned to each dataset by the repository hosting the data. The data PID may be an accession identifier, such as those assigned by repositories that are part of the National Center for Biotechnology Information (NCBI). However, many repositories (especially those outside the biomedical domain) will assign a DataCite digital object identifier (DOI) to hosted data.