Big data seem to be everywhere and attract much attention, yet they are hard to define in terms of size. Scientific research generates a lot of “small data” too – the average file size for all datasets deposited in our partner repository figshare, for example, is just 1.35 MB. Ironically, big data are somewhat agnostic of file size and are instead more about complexity – of the processing techniques and of the sources from which the data are derived. Scientific Data is also size agnostic about the data underlying our publications. We welcome data big and small and, in response to feedback from our Editorial Board, have updated our frequently asked questions and scope statement to reflect this.
To be published in Scientific Data, the data described in the Data Descriptor must be scientifically valuable and have the potential for reuse, as judged by our peer-review and data curation process. Our editors and reviewers check, broadly, four key things (see the guide to reviewers for complete information):
- Experimental rigour and technical quality (were the methods sound?)
- Completeness (can others reproduce and reuse the data?)
- Consistency (were community standards for reporting followed?)
- Integrity (are the data in the best repository?)
We ask our peer reviewers to consider the size of datasets only within the context of the applications and uses outlined by the authors. We understand that valuable datasets come in many sizes, and they don’t have to fill an external hard drive to be worth sharing with the scientific community.
Indeed, we have published as many datasets below 2 MB as above 200 GB: about five of each. Two of these “small” datasets present important epidemiological data collected by the lab of Simon Hay on dengue fever and leishmaniasis. These works provide information on tens of thousands of disease cases across several decades, with direct public health relevance. Sometimes small can be big.
So what does big even mean? Big could be measured in terms of impact: we regularly publish descriptions of datasets associated with high-profile articles at the Nature-titled journals, such as the Data Descriptor by Baud et al., which provides genotype and phenotype data on hundreds of outbred rats, expanding on previous work published in Nature Genetics. We also publish datasets that are big in terms of their scope, such as the Duke Lemur Center’s recent release of nearly 50 years of data on strepsirrhine primate taxa, or the Data Descriptor by the Tree of Sex consortium, with data covering more than 14,000 species.
Scientific Data aims to be an inclusive journal that publishes a wide range of technically sound research datasets. We gladly publish descriptions of datasets that have been generated from a single study or individual lab, big or small. See, for example, the Data Descriptor by Wilson & Chambers, which presents transcriptomic data from 12 samples collected from different regions of the developing chick brain.
If Scientific Data is to help change scientific research for the better – making it more reproducible – we need more researchers making their data, of all sizes, not just available but available in a discoverable, understandable and reusable way. These last three conditions require additional incentives and effort, and supporting them is a key service of data journals and of our Data Descriptor article type.
Publication in Scientific Data subjects data to tailored peer review and validation, a service not offered systematically by many primary research journals. If you submit a Data Descriptor for an as yet unpublished study, the additional review of the data should benefit, and potentially expedite, the review of future papers from the study. It will also provide feedback on data archiving and formatting that will maximise the data’s reuse potential. We see this as a valuable author service.
Scientific Data is a sister publication to the Nature research journals, and we share with them an understanding, and a policy, that Data Descriptors complement traditional research articles. Authors can publish descriptions of their datasets at Scientific Data prior to, or alongside, research articles submitted to the Nature journals. Some other publishers also support complementary publication of data papers (e.g. BioMed Central). In all cases, we can coordinate publication of Data Descriptor articles with an author’s related publications if needed. We are also glad to consider Data Descriptors that build on studies previously published in other publishers’ peer-reviewed journals, when a Data Descriptor would enrich the scientific literature.
The power of “small data” lies in their aggregation and integration with other datasets – which we enable by publishing additional metadata in a standardized format, ISA-Tab, with every article. We believe there is value in all well-curated, validated and reusable data – big and small. As researchers continue to invent new ways to integrate disparate datasets, the research community will undoubtedly continue to discover new, unanticipated insights from published data.
Andrew L Hufton, Managing Editor, Scientific Data
Iain Hrynaszkiewicz, Head of Data and HSS Publishing, Open Research