Storing data forever

From Nature Geoscience 3, 219 (May 2010)

Unlike accountants, scientists need to store their data forever. This expanding task requires dedication, expertise and substantial funds.

Data are at the heart of scientific research. Therefore, all data and metadata should be stored — forever, and accessibly. But it would be naïve to think that such a ‘gold standard’ of preservation could be achieved. In one spectacular example of the failure of science to save its treasures, some of NASA’s early satellite data were erased from the high-resolution master tapes in the 1980s. The lost data could now help extend truly global climate observations back to the 1960s — had they not been taped over. At the time, the storage capacity of the tapes seemed more valuable than the data they contained.

Until the introduction of full-scale supplementary information, ensuring that accessible records were kept was down to the authors. Of course, the loss of important information is unacceptable from a scientific point of view. But it is hardly surprising and probably widespread: scientists are not well-placed to guarantee continuity of data storage, especially while they are still in their vagabond years of PhD and post-doc work.

Nature Geoscience, in common with all the Nature journals, requires that authors make their data available on publication. The easiest way of ensuring that all the relevant information is accessible, and will remain so in the long term, is to use professionally run databases, which are now available for all sorts of Earth science data.

The creative push in science will always be for the production of better-resolved, more complicated data sets. Ingenious ways of storing and releasing these data are invariably developed with considerable lag. But this is not an excuse to neglect the issue. The preservation of valuable data sets and their distribution on demand is of utmost importance for the progress of science. The continuous attention of dedicated professionals — and substantial funds — is needed for database development to keep up with the science.

Nature Geoscience on the pros and cons of online publication

Online publishing has blurred the boundary between accepted and published articles, a topic discussed in an Editorial this month (April) in Nature Geoscience (3, 219; 2010). From the Editorial:

With the advent of online publication over the past 10 years, it no longer needs to take months or years for an accepted paper to become available to journal subscribers, and the term ‘monthly journal’ is losing its meaning. Articles are published online weeks to months before publication in print, with benefits all round: authors can make their peer-reviewed results available to the scientific community quickly, readers can keep abreast of the latest developments and publishers can provide a continuous stream of content in an increasingly competitive market.

But the downside of early online publishing is a confusing array of publicly available article types, awaiting print publication in various stages of editorial preparation. Some journals place papers online first for peer review, and then in their final form. As the focus of scientific journals is moving from print to electronic publication, each publisher makes its own decision regarding the balance of speed versus the completeness of published work. But when papers go online before they are in final form, uncertainty arises regarding the canonical publication date.

Publishers’ policies regarding the accessibility of online articles are equally piecemeal. Science Express — where Science papers are posted online up to six weeks ahead of publication in print — is available to site licence subscribers only as a premium add-on. And when journals of the American Geophysical Union publish ‘in press’ papers before their print version, only the titles of these papers are available to non-subscribers. On publication in print, abstracts are also free to access online.

Nature Geoscience papers are published online in their final, definitive form — fully proofread and formatted — and the date of online publication is the date of record. However, we consider papers elsewhere as published as soon as the scientific content is fully available online, with a Digital Object Identifier (DOI). That is, we are happy to highlight ‘in press’ articles, whatever format they are in. We also count them as part of the body of existing literature when assessing the advance of a submitted paper over existing knowledge.

As the demand for print subscriptions wanes, unified payment models for accessing papers online and in print are likely to evolve. What needs to be decided is how much a preliminary paper published online should be allowed to change before it constitutes a new paper.

Nature Physics calls for support of the arXiv preprint server

Funding of the arXiv preprint server must now be shared by more of its users, says Nature Physics in its March Editorial (6, 147; 2010). From the Editorial:

The arXiv preprint server has become central to the working lives of most physicists: ‘checking the arXiv’ is the starting point of many a daily routine. Founded by Paul Ginsparg, the arXiv has expanded to include not only physics — astrophysics, condensed matter, and high-energy physics being heavily represented — but also mathematics, computer science, quantitative biology and even quantitative finance. The arXiv now hosts nearly 600,000 preprints from 101,000 registered submitters in 200 countries, and serves more than 2.5 million article downloads every month to 400,000 users.

The statistics are remarkable. And it is also remarkable that, having initially been hosted at the Los Alamos National Laboratory, the server has in recent years been funded and managed by a single institute, Cornell University. Now the operating costs of the ever-growing arXiv match the entire Cornell library budget for physics and astronomy, and the university has made a plea for help in funding it.

As a short-term solution, Cornell is approaching the arXiv’s top 200 user institutions — which account for 75% of institutional downloads — for a financial contribution to the maintenance of the arXiv. It is heartening that help has already been promised by several large universities and laboratories, but wider support is still being sought. For the longer term, Cornell will assess, with those who come forward to support the arXiv, the options for developing a sustainable funding model for the future.

Secure, ongoing funding of the preprint server is vital, and surely deserves at least national endorsement from US funding bodies, if not some international arrangement. The fast communication of results — data or theory — between experts that the arXiv facilitates is a boon to physics, and one well recognized by Nature Publishing Group: any submission to Nature Physics or its sister journals may be posted, in that original submitted form, on the preprint server (although we do ask that the final, revised and accepted version is not posted until six months after publication in the journal; the published version, in the Nature Physics layout, may not be posted).

It’s up to physicists now to sustain their arXiv, encouraging their institutions’ libraries in particular to engage in its development. More information is available, and comments and suggestions may be sent by e-mail.

Nature journals’ policies on posting material before and after publication.

Integrating with integrity, according to Nature Genetics

Data worthy of integration with the results of other researchers need to be prepared to explicit export standards, linked to appropriate metadata and offered with field-specific caveats for use. The Editorial in the January edition of Nature Genetics (42, 1; 2010) explores the extent to which, to be useful for generating new analyses and hypotheses, data sharing needs to be about standardized formats as much as about data simply being made ‘available’. For example, the Editorial states, “Sample sizes, selection criteria, statistical significance, number of hypotheses tested, normalization and scaling procedures, read depth and sequence quality scores are all important considerations that can be misunderstood or missed in combining and reanalyzing data. Whether integrative approaches are useful may depend upon whether integration preserves or destroys essential information….

Integration is of most value in two areas: bioinformatic modeling, to predict the effects of genetic and environmental perturbation, and clinical utility, to increase the speed and accuracy of the transfer of preclinical knowledge to clinical trial. Funding bodies hope that encouraging researchers to integrate their results will reduce duplication of effort. Trivially, researchers can agree to work on the same systems and samples or to use agreed standard control materials, but this can be problematic in practice….

Researchers can enable integrative studies by publishing their quality metrics and exchange standards in a timely way in regularly versioned, citable preprints; and by holding integration workshops between data producers and data users from different fields. These exchanges should focus on honest assessment of what data are ready for use and explain the quality metrics used and where the pitfalls lie in using the data. In return, data producers can increase the citability of their datasets by better understanding the metadata needed by users. Requirements for open data deposition and integration that do not include mechanisms to agree on, publish and use data standards risk inflating inconsequential ‘integrative’ bubbles.”
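
To make the Editorial’s point concrete: the contextual details it lists (sample sizes, normalization, read depth, quality scores) are exactly the kind of metadata that can travel alongside a dataset in a machine-readable export. The Python sketch below is a minimal, hypothetical illustration of that idea; the field names and the validation rule are our own assumptions, not any published exchange standard.

```python
# Hypothetical sketch: a dataset export carrying the contextual metadata
# the Nature Genetics Editorial says integrators need. The field names
# below are illustrative assumptions, not a published exchange standard.
import json

REQUIRED_METADATA = [
    "sample_size",
    "selection_criteria",
    "significance_threshold",
    "hypotheses_tested",
    "normalization",
    "read_depth",
    "quality_score_scheme",
]

def missing_metadata(record: dict) -> list:
    """Return the required fields absent from the record's metadata;
    an empty list means the export is, at least formally, ready to be
    combined with other datasets."""
    metadata = record.get("metadata", {})
    return [field for field in REQUIRED_METADATA if field not in metadata]

# An invented example record, not real study data.
record = {
    "dataset_id": "example-gwas-001",
    "metadata": {
        "sample_size": 4821,
        "selection_criteria": "confirmed cases; population-matched controls",
        "significance_threshold": 5e-8,
        "hypotheses_tested": 520000,
        "normalization": "quantile",
        "read_depth": None,  # not applicable to array-based data
        "quality_score_scheme": "per-marker call rate > 0.98",
    },
}

missing = missing_metadata(record)
if missing:
    print("Not ready for integration; missing:", ", ".join(missing))
else:
    print(json.dumps(record, indent=2))  # a versioned, citable export
```

A check like this is cheap to run before deposition, and it makes explicit the difference between data that are merely ‘available’ and data that another group could actually reanalyse.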

Nature Genetics website

Nature journals’ policies on availability of data and materials

Nature Cell Biology on research integrity and accessibility

The cell biology literature contains manipulated data that distort findings, usually in an attempt to ‘beautify’ and, rarely, to commit fraud, states the September Editorial in Nature Cell Biology (11, 1045; 2009; free to read online). According to the Editorial, a National Academy of Sciences (NAS) report, ‘Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age’, “arrives at no hard and fast rules; the panel found that different fields have quite different requirements. In the words of panel chairs Phillip Sharp and Daniel Kleppner, ‘the report provides a framework for dealing with the challenges to the community generated by the onrush of digital technology.’ Nevertheless, the key tenets benefit from being spelled out in one document: that researchers are responsible for ensuring the integrity and accuracy of their data and for appropriate training in the management of research data; that all data and experimental details from papers be publicly accessible and carefully archived, to allow verification and to facilitate future discoveries; and that field-specific standards have to be developed by researchers, funders, societies and journals.”

Many of the recommendations in the report are already policy at Nature Cell Biology and the other Nature journals: the Editorial provides further information about these, including references to past Editorials, with particular emphasis on various aspects of data manipulation and plagiarism — which, although not widely appreciated, extends to concepts as well as to copied text and illustrations.

Nature Materials on access to the literature

Joerg Heber, a senior editor at Nature Materials, announces that access to all Editorials in the journal is now free to registered users of nature.com. This follows a similar decision taken at Nature some years ago and, more recently, at Nature Cell Biology. The August Editorial of Nature Materials (8, 611; 2009) discusses publishing models more broadly: “As moves towards open-access schemes gain momentum, the choice between ‘author pays’ and subscription-based models may come down to fundamental business considerations rather than limits in access to original research.” In ‘open access’ publishing, authors pay the publication costs, and online access to and dissemination of those papers is free for readers. The Editorial goes on to describe the publication model of the Nature journals, which is in the main subscription-based (the reader or institution pays for access) but also offers various open-access services to authors, who retain copyright of their articles:

“At every stage of manuscript handling we provide an expensive, high-quality service. This not only involves the professional subediting and production of accepted papers, but also an exhaustive prescreening of submitted manuscripts. At Nature Materials, we prescreen well above 80% of submitted manuscripts without peer review. This means that, at a cost, we rely much less on the ‘free’ peer-reviewing services of scientists than journals with lower screening rates. As open access certainly should not be considered a way to lower publication standards, the overall expenses related to the dissemination of scientific results should be expected to remain the same. This means that research-intensive institutions in particular (or those paying for their research grants) may well end up paying proportionally more under author-pays models than they would under subscription-based models. Researchers from less research-intensive institutions, on the other hand, would benefit.”

Read the full version of the Nature Materials editorial here.

Comments on the Editorial are welcome at the Nature Publishing Group news forum at Nature Network.

Chemical biologists could help accelerate drug discovery

This month’s (July) Nature Chemical Biology includes two articles describing how access to the highest quality chemical probes will ensure their prominent position in the biological and drug discovery toolboxes.

In their Commentary ‘Open access chemical and clinical probes to support drug discovery’ (Nature Chemical Biology 5, 436–440; 2009), Aled M Edwards, Chas Bountra, David J Kerr and Timothy M Willson say that drug discovery resources in academia and industry are not used efficiently, to the detriment of industry and society. Duplication could be reduced and productivity increased, they write, by performing basic biology and clinical proofs of concept within open access industry–academia partnerships. Chemical biologists could play a central role in this effort.

The authors’ main argument is that the development of new medicines is being hindered by the way in which academia and industry advance innovative targets. If researchers instead generated freely available chemical and clinical probes and performed open-access science, they argue, the overall system would produce a wider range of clinically validated targets for the same total resource, arguably the most effective way to spur the development of treatments for unmet needs.

In a related article in the same issue of the journal, ‘A crowdsourcing evaluation of the NIH chemical probes’, Tudor I. Oprea et al. (Nature Chemical Biology 5, 441–447; 2009) write that between 2004 and 2008, the pilot phase of the US National Institutes of Health Molecular Libraries and Imaging initiative funded 10 high-throughput screening centres, resulting in the deposition of 691 assays into PubChem and the nomination of 64 chemical probes. The authors ‘crowdsourced’ the evaluation of this output to 11 experts, who expressed medium or high levels of confidence in 48 of the 64 probes. Crowdsourcing is a cross-disciplinary alternative way to assess confidence in both chemical probes and drug leads: it pools multiple levels of expertise from translational disciplines, providing a rigorous chemical-probe evaluation process.
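
As a purely illustrative aside, the pooling step in such an evaluation can be sketched in a few lines of Python. Everything below (the probes, the rating scale and the median-based aggregation) is an invented assumption for illustration, not the method or data of the Oprea et al. study.

```python
# Illustrative sketch of pooling expert confidence ratings for probes.
# The probes, experts and scores are invented; the real study's rating
# scheme and aggregation procedure may differ.
from statistics import median

RATING = {"low": 0, "medium": 1, "high": 2}
LABELS = {value: label for label, value in RATING.items()}

# One rating per reviewing expert for each (hypothetical) probe.
ratings = {
    "probe_A": ["high", "high", "medium", "high"],
    "probe_B": ["low", "medium", "low", "low"],
    "probe_C": ["medium", "high", "medium", "medium"],
}

def pooled_confidence(scores):
    """Aggregate expert ratings by taking the median numeric score."""
    return LABELS[round(median(RATING[s] for s in scores))]

endorsed = [probe for probe, scores in ratings.items()
            if pooled_confidence(scores) in ("medium", "high")]
print("Pooled medium/high confidence:", endorsed)  # probe_A and probe_C
```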

Nature Chemical Biology website.

Nature Chemical Biology guide to authors.

Nature Chemical Biology focuses and supplements.

Nature Chemical Biology symposium 2009: Chemical biology in drug discovery.

Effect of recession on publishing models

In a Correspondence in the current issue of Nature (458, 967; 2009), Raf Aerts of the Katholieke Universiteit Leuven writes:

Your Commentaries on ‘How to survive the recession’ devote much discussion to the effects of the global recession on science (Nature 457, 957–963; 2009). However, the financial squeeze may also be affecting the publication output of research institutions in a more subtle way. It could be boosting the traditional reader-pays publication model for scientific journals at the expense of the author-pays, or open-access, model.

Open-access journals ask authors to pay for processing their manuscripts (which involves organizing a form of quality control, formatting and distribution) so that the final product becomes freely available, and free to use if properly attributed. This model is widely believed to increase the visibility, dissemination and, eventually, the citation and impact of research findings.

However, apart from a small number such as those published by the Public Library of Science and BioMed Central, few peer-reviewed open-access journals have so far achieved a high impact factor in their field. Open-access journals are therefore struggling to emerge and to attract the most prestigious research findings.

This situation could deteriorate further if open-access journals are forced to move to (partial) site licensing in order to cover their production costs (a shift recently undertaken by the Journal of Visualized Experiments, for example) as authors become increasingly reluctant or unable to pay in the current financial climate.

(This is a slightly shortened version of the Correspondence. The full version is available online at Nature’s website.)

Incentives needed for genome annotation

Roy Welch and Laura Welch of Syracuse University, New York, examine why researchers seem reluctant to be more directly involved in the annotation of microbial genomes in the February issue of Nature Reviews Microbiology (7, 90; 2009). They write:

“To annotate an organism’s genome, biological information about the organism must be matched to the genes and genetic elements in the sequenced genome. The process is iterative and open-ended: new information is constantly incorporated into the annotation. It can also be recursive: analysis of the annotation may provide insight about the organism that in turn leads to changes to the annotation. Unfortunately, the generation of new information and annotation of the genome are at present completely separate processes. Often new information does not become incorporated into the annotation in a timely manner, a costly loss for those who rely on it to advance their research.

The community of expert researchers who study an organism produce most of the information that becomes part of the annotation and are also the primary group of end-users. It is therefore curious that the annotation process is circuitous and inefficient: researchers communicate new information not as direct updates to the annotation, but as research papers that must later be interpreted and incorporated into the annotation separately — most often by a third party! Indeed, some information never finds its way into the annotation. It would be far more efficient for the research community to contribute directly to genome annotation. Yet the life science community as a whole remains stuck in the old, inefficient paradigm.”

The authors go on to argue that technology is not the impediment, given the wide availability of wikis (collaboratively edited websites) and of the annotation databases already built with them, including EcoliWiki, GONUTS, Myxopedia and WikiPathways. Rather, the authors state, the impediment seems to be sociological: until contributions to a collaborative genome-annotation repository can be credited in a PhD thesis, curriculum vitae, tenure application or grant proposal, direct collaborative annotation is unlikely to fulfil its promise and potential to accelerate scientific achievement.
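
To make the workflow concrete, here is a minimal Python sketch of a direct, attributed annotation update of the kind the authors advocate: each change is recorded with its contributor and evidence, so the contribution is traceable and, in principle, creditable. The record structure, field names and citation are hypothetical assumptions, not the schema of EcoliWiki or any real database.

```python
# Hypothetical sketch of a direct, attributed genome-annotation update.
# The record structure and field names are illustrative assumptions,
# not the schema of EcoliWiki or any other real annotation database.
from datetime import date

annotation = {
    "gene": "exampleA",                 # invented gene identifier
    "product": "hypothetical protein",
    "evidence": [],
    "history": [],
}

def update_annotation(record, product, evidence, contributor):
    """Incorporate a new finding directly into the annotation, keeping
    an attributed, versioned history so the contribution is creditable."""
    record["history"].append({
        "date": date.today().isoformat(),
        "previous_product": record["product"],
        "contributor": contributor,     # the creditable part of the record
    })
    record["product"] = product
    record["evidence"].append(evidence)
    return record

# A researcher pushes a new result straight into the annotation, rather
# than leaving it for a third party to extract from a paper later.
update_annotation(
    annotation,
    product="putative ABC transporter subunit",
    evidence="doi:10.0000/hypothetical-example",  # placeholder citation
    contributor="R. Researcher, Example University",
)
print(annotation["product"], "| recorded updates:", len(annotation["history"]))
```

In a wiki-backed system the page history would play the same role as this record’s history list: an audit trail that could, as the authors suggest, be cited in a CV or grant proposal.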

Historical microbiology archive made free to all

In its November Editorial, Nature Reviews Microbiology (6, 794; 2008) reports that the archive of the International Journal of Systematic and Evolutionary Microbiology (IJSEM) has been made available free online: a boon for scientists, historians and the public. The Society for General Microbiology publishes IJSEM on behalf of the International Committee on Systematics of Prokaryotes of the International Union of Microbiological Societies. The society has now provided funding for the entire back archive of the journal to be made freely available worldwide without a journal subscription. (Content from the most recent two years remains subject to access controls.)

From the Nature Reviews Microbiology Editorial: Systematics is the foundation for studies of all types of organisms, because it helps us to understand how one organism relates to another. The value of systematics is often underappreciated, however, for bacteria and viruses. For example, there is a huge imbalance between the 7,000 named bacterial species and the 1,000,000 named insect species. This is particularly striking given that it is now well known that bacteria and viruses are the most populous organisms on Earth and, furthermore, that more than 99% of bacteria have yet to be cultivated. Why should we be interested in naming and characterizing different species of bacteria? The advent of metagenomics has swelled the literature with ever-increasing estimates of the numbers and types of bacteria and viruses in the biosphere. An important adjunct to genomics-based approaches is the detailed characterization of these myriad species and investigation of the relationships between them. The availability of the IJSEM archive will hopefully spur renewed interest in this area.

Jean Euzéby, the IJSEM list editor, maintains an incredibly useful web resource that details all those species that have been ratified — the List of Prokaryotic names with Standing in Nomenclature. Another useful site, Bacterial Nomenclature Up-to-Date, maintains a current list of bacterial names and is based on the work of Norbert Weiss, who curated the database until his retirement in February 2003; it is now maintained under the supervision of Manfred Kracht. Finally, a comprehensive taxonomy of the Bacteria and Archaea can be found in the Taxonomic Outline of Bacteria and Archaea (TOBA) Release 7.7, which was last updated in 2007.

Other useful resources are described in the Editorial.