Nature Physics calls for support of the arXiv preprint server

Funding of the arXiv preprint server must now be shared by more of its users, says Nature Physics in its March Editorial (6, 147; 2010). From the Editorial:

The arXiv preprint server has become central to the working lives of most physicists: ‘checking the arXiv’ is the starting point of many a daily routine. Founded by Paul Ginsparg, the arXiv has expanded to include not only physics — astrophysics, condensed matter, and high-energy physics being heavily represented — but also mathematics, computer science, quantitative biology and even quantitative finance. The arXiv now hosts nearly 600,000 preprints from 101,000 registered submitters in 200 countries, and serves more than 2.5 million article downloads every month to 400,000 users.

The statistics are remarkable. And it is also remarkable that, having initially been hosted at the Los Alamos National Laboratory, the server has in recent years been funded and managed by a single institution, Cornell University. Now the operating costs of the ever-growing arXiv match the entire Cornell library budget for physics and astronomy, and the university has made a plea for help in funding it.

As a short-term solution, Cornell is approaching the top 200 user-institutions of the arXiv — which account for 75% of institutional downloads — for a financial contribution to the maintenance of the arXiv. It is heartening that help has already been promised by several large universities and laboratories, but wider support is still being sought. For the longer term, Cornell will assess, with those who come forward to support the arXiv, the options for developing a sustainable model for the future.

Secure, ongoing funding of the preprint server is vital, and surely deserves at least national endorsement from US funding bodies, if not some international arrangement. The fast communication of results — data or theory — between experts that the arXiv facilitates is a boon to physics, and one well recognized by Nature Publishing Group: any submission to Nature Physics or its sister journals may be posted, in that original submitted form, on the preprint server (although we do ask that the final, revised and accepted version is not posted until six months after publication in the journal; the published version, in the Nature Physics layout, may not be posted).

It’s up to physicists now to sustain their arXiv, encouraging their institutions’ libraries in particular to engage in its development. More information is available, and comments and suggestions may be sent by e-mail.

Nature journal policies on posting material before and after publication.

Nature Medicine on access to and integrity of data

Everybody agrees that ensuring the integrity and accessibility of research data is crucial for scientific progress. Agreeing on the best way to do so is the hard part, says Nature Medicine in its February Editorial (16, 131; 2010).

Technological advances have enabled researchers to tackle questions that involve generating vast amounts of data, posing challenges concerning data analysis, manipulation, annotation, sharing and storage that researchers, institutions, funders and journals have not yet fully grasped. How should data be annotated before being stored in a database so that it can be as useful as possible to other researchers? Should data-sharing requirements be extended to the computer codes that were used to analyze the data? Who should have access to the data, and who pays for data storage and management?

These questions will become more pressing as further technological advances make it even easier to produce ever larger data sets, and it won’t be simple to come up with the answers. The US National Academy of Sciences, the National Academy of Engineering and the Institute of Medicine published a report, Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age, focusing on data integrity, access and long-term preservation, and providing a useful framework around which to organize what has become an urgent dialogue.

The conclusions of the report, while worthy, are hardly news to those who have pondered these issues. At the Nature journals, for example, data sharing has long been a requirement for publication, and the editors insist that authors fulfill their commitment to sharing when other researchers report difficulty in obtaining data and materials. The merit of the academies’ report does not lie in its recommendations but in its disciplined analysis of the current state of play, its multidisciplinary perspective on the problems and its identification of the tough questions that scientists, institutions, funders and journals need to answer to move forward, even though it provides little in terms of answers.

The Editorial goes on to further analyse the issues raised by the report, concluding that scientists themselves should develop the right standards, lobby for the resources to set up the appropriate infrastructure and decide on the right measures to deter other scientists from data mismanagement. Data may not be the legal property of scientists, but looking after the data is certainly their responsibility.

Nature Medicine journal website.

Nature journals’ policy on availability of data and materials.

Are smartphones making inroads into the laboratory?

Mobile computing platforms such as the iPhone are beginning to make inroads into the laboratory—serious prospect or fairy tale? So asks Nature Methods (7, 87; 2010), starting its February Editorial in traditional genre style: “Once upon a time phones were used exclusively for conversing with other people, and computers ran software applications. The computer became an indispensable tool in the laboratory while the phone developed into a mobile device that has disrupted countless lectures at scientific conferences. But recently researchers can be seen talking on their computer and using their cell phone for running fancy—and sometimes powerful—software programs.

This metamorphosis of the cell phone into a mobile computing platform with voice capabilities is epitomized by the iPhone—one of a new breed of smartphone that is not only popular among the general public but seemingly ubiquitous among scientists. Earlier phones had similar capabilities, but the arrival of the Apple App Store (https://www.apple.com/iphone/apps-for-iphone/) in 2008 provided a dizzying array of software applications, or apps, that could be installed at a touch of the screen. Stanford University even offers a free course on developing iPhone apps.”

The Editorial goes on to debate whether such devices will be useful in wet-lab procedures, speculating on a few possible “killer apps” that would stimulate general adoption. Even so, says Nature Methods, for the present, the most immediate potential for these devices is in providing a painless way for researchers to keep up with their reading wherever they happen to be. “Mass media publishers have embraced the iPhone for delivering their content, but there has been little activity in the scientific publishing arena—RSS news feeds notwithstanding. But the situation is changing.” Several publishers, including Nature Publishing Group, have apps that will go live any day. The just-released nature.com app lets you read full-text articles, view full-size figures and save references. This Nascent post highlights some of the features and describes how to use them.
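
An aside on the RSS point: for readers who prefer scripting to apps, a journal feed can be polled with a few lines of code. Below is a minimal sketch using the third-party feedparser library; the feed URL is an assumption for illustration, so substitute whichever journal feed you actually follow.

    # Minimal sketch: list the latest items from a journal RSS feed.
    # Requires the third-party 'feedparser' package (pip install feedparser).
    # The URL below is an assumed example, not a guaranteed endpoint.
    import feedparser

    FEED_URL = "http://feeds.nature.com/nmeth/rss/current"  # assumed example URL

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries[:10]:
        # Each feed entry typically carries a title and a link to the article.
        print(entry.title)
        print("  " + entry.link)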

The Nature Methods editors welcome comments at Methagora, the journal’s blog – where there is a set of links to various iPhone applications for scientists. Does your iPhone or other smartphone have a place in the lab? What is the must-have app you are looking for?

Download the nature.com iPhone application.

Nature Genetics on conclusion by exclusion

“Science is a way to distinguish things we know not to be true from other things. Large challenges lie ahead as we apply the scientific method to understanding biochemical systems, cellular organization and the functions of complex organs such as the brain.” So begins the February Editorial in Nature Genetics (42, 95; 2010). If the success of the early years of molecular biology can be attributed to the simplicity of the problems to be solved, combined with rigorous experimental design including disprovable hypotheses and decisive experiments, what of today’s immensely more complex scientific landscape and greatly increased number of scientists, not to mention orders of magnitude more computing power? Are we better equipped to generate, experimentally test, and choose or discard competing hypotheses?

The Editorial argues that “the complexity of a research project does not change the basic requirement for inference so long as the results are intended to be understood by human brains. A model or predictor aids secure inference when it is treated as a falsifiable hypothesis with falsifiable sub-hypotheses. Therefore, we would expect to publish a list of conditions in which the model or predictor is not valid, and tests demonstrating conditions in which it is not valid, as well as hypotheses drawn from the model or predictor and tests that disprove these hypotheses.

“There are a number of benefits to separating the logical gems that authors are prepared to have tested by others from their setting of consistent observations and rhetoric that is not directly part of the scientific work of the paper. These pluses are: to allow peer referees to do their job and readers to understand the work; to make clear the caveats and limits to application of results to other fields; and to limit the proliferation of useless observational studies and reduce duplication and waste of effort.

“It may also be possible to distinguish the direct influence of the research independently of the publications that describe it. In order to do this, each of these two components—hypotheses and experiments—will need to be coded with unique identifiers and separately cited. Such an extreme cultural change may not be needed if publications are carefully structured. Surely it is obvious that a study providing strong inferences will be both well used and highly cited.”

Protein Data Bank policies for disputed structures

Helen M. Berman, director of the RCSB (Research Collaboratory for Structural Bioinformatics) Protein Data Bank, and co-authors wrote a Correspondence to Nature (463, 425; 2010) to clarify the PDB’s correction procedures and policies in the light of a current investigation. Their letter is reproduced here.

Your News story ‘Fraud rocks protein community’ (Nature 462, 970; 2009) discusses allegations that 12 Protein Data Bank (PDB) entries are based on fabricated data. Pending verdicts on these entries from the US Department of Health and Human Services Office of Research Integrity, we wish to clarify PDB policies and actions.

The PDB archive, which is managed by the Worldwide PDB (wwPDB), houses more than 62,000 entries for macromolecular structure models and their experimental data. It is maintained for the public good. Deposited structures are validated using community-developed standards, and any related corrections are made by depositors before release and publication.

Entries can be replaced on written request from the depositor(s) if better data have become available or the interpretation of existing data has changed. Entries can be withdrawn (that is, rendered obsolete) by the senior author, or by journal editors when the published paper describing the entry is retracted.

An author’s employer (in this case, the University of Alabama at Birmingham) may request removal of an entry, but this request must be fully documented and the original paper describing the entry must be retracted. This ensures due process for the author(s) and the scientific integrity of the PDB archive. To date, the paper describing one of the 12 PDB entries in question has been retracted (J. Biol. Chem. 284, 34468; 2009), and the corresponding PDB entry (PDB code 1BEF) has been made obsolete by the wwPDB at the request of the publisher.
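
To make that lifecycle easier to follow, here is a minimal illustrative sketch in Python. It is not wwPDB software; the class, states and method names are hypothetical, chosen only to mirror the policy described above.

    # Hypothetical model of the PDB entry lifecycle described in the letter:
    # an entry is released, may later be replaced by its depositor(s), or may
    # be made obsolete when the describing paper is retracted or a documented
    # withdrawal is requested. Illustrative only; not wwPDB code.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PDBEntry:
        pdb_id: str
        status: str = "RELEASED"              # RELEASED, REPLACED or OBSOLETE
        superseded_by: Optional[str] = None   # set when a newer entry replaces this one
        reason: Optional[str] = None          # recorded when an entry is withdrawn

        def replace(self, new_id: str) -> None:
            # Depositor supplies better data or a revised interpretation.
            self.status = "REPLACED"
            self.superseded_by = new_id

        def make_obsolete(self, reason: str) -> None:
            # Senior author, journal editor or employer request, with documentation.
            self.status = "OBSOLETE"
            self.reason = reason

    # Mirrors the case in the text: entry 1BEF was made obsolete after retraction.
    entry = PDBEntry("1BEF")
    entry.make_obsolete("paper retracted (J. Biol. Chem. 284, 34468; 2009)")
    print(entry.pdb_id, entry.status, "-", entry.reason)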

To ensure that PDB entries are validated using state-of-the-art methods, wwPDB validation task forces have been convened for X-ray crystallography and nuclear magnetic resonance spectroscopy. Their recommendations will be reviewed and incorporated into wwPDB’s deposition and annotation procedures.

wwPDB encourages all journals publishing macromolecular structures to stipulate accompanying submission of wwPDB validation reports. These will help editors and referees to assess the reliability of structural data and their interpretation. A few journals have already indicated their interest.

With the support of the structural-biology community, the mission of the wwPDB is to safeguard the integrity and improve the quality of the PDB archive. It is the public availability of atomic coordinates and experimental data that enables errors and possible fabrications to be detected in the first place. Current validation procedures were designed to identify occasional honest mistakes, not to guard against rare cases of malfeasance.

Full authorship of this Correspondence:

Helen M. Berman, Director, RCSB PDB, Rutgers University, Piscataway, New Jersey 08854-8087, USA

Gerard J. Kleywegt, Head, PDBe, EMBL-EBI, Hinxton, Cambridge CB10 1SD, UK

Haruki Nakamura, Head, PDBj, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan

John L. Markley, Head, BioMagResBank, University of Wisconsin-Madison, Madison, Wisconsin 53706-1544, USA

Stephen K. Burley, Chair, wwPDB Advisory Committee, Eli Lilly and Company, San Diego, California 92121, USA

About the PDB archive.

Note from Nature: Nature would like to make clear that the misconduct investigation by the University of Alabama concluded in December 2009 that H. M. K. Murthy acted alone in generating allegedly falsified protein structures. He denies the allegations.

Integrating with integrity, according to Nature Genetics

Data worthy of integration with the results of other researchers need to be prepared to explicit export standards, linked to appropriate metadata and offered with field-specific caveats for use. The Editorial in the January issue of Nature Genetics (42, 1; 2010) explores the extent to which, to be useful for generating new analyses and hypotheses, data sharing needs to be as much about standardized formats as about data simply being made ‘available’. For example, the Editorial states, “Sample sizes, selection criteria, statistical significance, number of hypotheses tested, normalization and scaling procedures, read depth and sequence quality scores are all important considerations that can be misunderstood or missed in combining and reanalyzing data. Whether integrative approaches are useful may depend upon whether integration preserves or destroys essential information….

“Integration is of most value in two areas: bioinformatic modeling, to predict the effects of genetic and environmental perturbation, and clinical utility, to increase the speed and accuracy of the transfer of preclinical knowledge to clinical trial. Funding bodies hope that encouraging researchers to integrate their results will reduce duplication of effort. Trivially, researchers can agree to work on the same systems and samples or to use agreed standard control materials, but this can be problematic in practice….

“Researchers can enable integrative studies by publishing their quality metrics and exchange standards in a timely way in regularly versioned, citable preprints; and by holding integration workshops between data producers and data users from different fields. These exchanges should focus on honest assessment of what data are ready for use and explain the quality metrics used and where the pitfalls lie in using the data. In return, data producers can increase the citability of their datasets by better understanding the metadata needed by users. Requirements for open data deposition and integration that do not include mechanisms to agree on, publish and use data standards risk inflating inconsequential ‘integrative’ bubbles.”
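
To make the Editorial’s point concrete, a dataset released for reuse might ship with a machine-readable metadata record along the following lines. This is a hypothetical sketch, not a published standard; the field names and values simply echo the considerations the Editorial lists.

    # Hypothetical metadata record for a shared dataset, echoing the
    # considerations named in the Editorial: sample size, selection criteria,
    # number of hypotheses tested, normalization, read depth, quality scores
    # and caveats for reuse. All field names and values are illustrative.
    import json

    dataset_metadata = {
        "dataset_id": "example-gwas-2010-001",  # invented identifier
        "version": "1.2",
        "sample_size": 4823,
        "selection_criteria": "confirmed cases; ancestry-matched controls",
        "hypotheses_tested": 512,
        "significance_threshold": 5e-8,
        "normalization": "quantile normalization across batches",
        "mean_read_depth": 32.5,
        "quality_score_scale": "phred+33",
        "caveats": [
            "batch effects between centres not fully removed",
            "read depth uneven across GC-rich regions",
        ],
    }

    print(json.dumps(dataset_metadata, indent=2))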

Nature Genetics website

Nature journals’ policies on availability of data and materials

Nature Biotechnology focus on synthetic biology

The December 2009 issue of Nature Biotechnology focuses on synthetic biology, in a special feature (subscription) containing news, opinion, comment and research articles on the topic. The focus discusses some of the progress in synthetic biology towards practical applications, as this latest iteration of genetic engineering, although still in its infancy, offers the prospect of the design and construction of new life forms from biological parts, devices and systems. If, however, you aren’t sure exactly what synthetic biology is, Nature Biotechnology asked 20 specialists for their definitions, so you can take your pick.

The Editorial that begins the focus asserts that “it is not too hard to imagine a future where, with relatively little effort, we can create alternative life forms—minimal-genome chassis organisms with interchangeable standardized gene circuits—that will enable genetic engineers to rapidly move from one industrial project to another. The technology is disruptive, with the potential to transform biological engineering, which until now has been limited to tinkering with natural organisms, and relies on a good deal of serendipity for success.

“At the turn of the last century, the Wright brothers achieved manned flight not by mimicking natural systems, but by applying the principles of engineering and aerodynamics. Similarly, synthetic biology allows us to dispense with biological mimicry and design life forms uniquely tailored to our needs. In doing so, it will offer not only fundamental insights into questions of life and vitality but also the type of exquisite precision and efficiency in creating complex traits that genetic engineers could previously only dream of.”

One of the articles in the focus that I particularly enjoyed is Parts, property and sharing by Joachim Henkel of the Munich University of Technology and Stephen M. Maurer of the University of California, Berkeley, who suggest that synthetic biology should look to other industries’ models for ownership and open sharing. The authors write:

“Synthetic biology is bound to change the rules of the game in genetic engineering. Its reliance on large numbers of parts turns the field into a complex technology, and the importance of shared learning implies network effects and makes winner-take-all outcomes likely. Both aspects are compounded by weaknesses of the IP system—in particular, its lack of transparency. Although these problems may seem modest today, they are likely to become much more serious once the synthetic biology industry starts to generate significant profits.

“For these reasons—and even though the general usefulness of patents in the life sciences is beyond doubt—reasonable steps to grow the commons and support open sharing seem highly advisable. We have already argued that an embedded Linux-style open parts collaboration makes good legal and economic sense. Furthermore, the open parts idea enjoys widespread support, not just in the academic community but also, to a large extent, in industry. For every front-runner, there are several firms for whom sharing is the only way to catch up. Similarly, companies that sell synthetic genes and other support services know that cheap, abundant, high-quality parts are good for business. Open parts are the best way to deliver this result. Finally, government has repeatedly intervened to promote open source-style sharing in software and, more recently, stem cell research. We think it will be similarly predisposed to support an open parts project. Yet no matter how synthetic biology is made more open, it needs to happen soon.”

Nature Biotechnology focus on synthetic biology.

Nature Cell Biology joins call for microattribution of datasets

Nature Cell Biology (11, 1273; 2009) joins in the call for ‘microattribution’ in its November Editorial, stating that reference datasets should be accessible independently of scientific papers in a citable form. The problem, from a cell biological perspective:

“Scholarly publication remains essential for describing and contextualizing findings, but it is inadequate as the only document of research activity. Most journals require a significant conceptual advance, and format constraints typically allow only for the presentation of representative qualitative, or statistically processed quantitative data. Consequently, the majority of raw data never emerges from lab hard drives, and a wealth of information, hard work and funding is wasted. High throughput platforms generate reams of data that cannot be captured in traditional papers. Moreover, methods sections fail to adequately describe metadata essential for the comparison and reproduction of experiments. Databases are essential for comprehensively archiving both published and unpublished data, but have only become fully integrated into the scientific process in a few cases, such as DNA sequencing and microarray data. For many types of data, including light microscopy, no databases exist at all.”

Prepublication deposition into databases is relatively new to biology, but is essential, according to the Editorial, whether or not some embargo condition is imposed by authors, funders or publishers. Journals, in their turn, need to systematically link online to data and other material in databases, in order to remain relevant. The Editorial concludes that “Large reference datasets that benefit the wider community and that cannot be analysed efficiently by the data producers should enter the public domain without delay, as long as appropriate attribution and credit can be and is given. Scientific culture has to change so that data is valued alongside publications.”

See also: ‘Accreditation and attribution in data sharing’ by Gudmundur A. Thorisson of the Department of Genetics, University of Leicester, UK (Correspondence to Nature Biotechnology 27, 984-985; November 2009).

Data producers deserve citation credit, says Nature Genetics

Datasets released to public databases in advance of (or with) research publications should be given digital object identifiers to allow databases and journals to give quantitative citation credit to the data producers and curators, according to the October Editorial of Nature Genetics (41, 1045; 2009).

After reviewing the arguments for assigning a citable credit to data, particularly those which are released publicly before formal publication in a journal, as is increasingly the case in some fields (and required by some funders), the Editorial asks: “What form should citable data identifiers take? They must work with existing unique resource identifier conventions and with the existing well-funded stable repositories used by research communities. However, these identifiers are not just for locating data but are for stably identifying the data units and versions with particular data producers, curators, funders and affiliations in a citable form. Because publications are currently the main source of scientific credit and because publishers have already developed citable digital object identifiers (DOI), it would seem to be their opportunity to grasp or to fumble. We propose citing DOIs that tag a combination of repository, database, accession, version, contributor and funder.

“Of course, precise citation of all research output represents the bare minimum of respect for colleagues and competitors. This journal also endorses communication between data producers and data users. Whereas it is impossible for journals to restrict the use of data already in the public domain, we can show evidence of communication between producers and users to referees. Many funders of large resource projects now require a data release policy and plan for global analysis by the data producers. These parts of the successfully refereed grant should be published as a ‘marker paper’ or deposited in a citable preprint archive such as Nature Precedings. At the very least, the details of the producers’ work and intents should be available to users in a citable form in the database holding the data. Data users can submit an email demonstrating that they have contacted the data producers with their plan for use of the data and showing that they have read the producers’ data release policy, conditions and plan for analysis.”
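
As a thought experiment on what such a citable identifier might bundle together, the sketch below composes the components the Editorial names into a single record. The DOI prefix and every field value are invented for illustration; this is not an existing registry scheme.

    # Hypothetical composition of a citable dataset identifier along the
    # lines the Editorial proposes: repository, database, accession, version,
    # contributor and funder bound together under one DOI. The prefix
    # "10.99999" and all values below are invented for illustration.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DataCitation:
        repository: str
        database: str
        accession: str
        version: str
        contributor: str
        funder: str

        def doi(self) -> str:
            # A DOI is an opaque string; deriving a readable suffix from the
            # components is a choice made here purely for the example.
            return "10.99999/%s.%s.%s.v%s" % (
                self.repository, self.database, self.accession, self.version)

    citation = DataCitation(
        repository="ebi", database="ega", accession="EGAD00000000001",
        version="2", contributor="J. Smith", funder="Wellcome Trust")
    print(citation.doi())  # -> 10.99999/ebi.ega.EGAD00000000001.v2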

Please see also the continuing Nature Network online discussions about pre-publication and post-publication data release. We welcome your views there.

Nature Biotechnology: Personal genome data on the line

Continuing the theme of yesterday’s post about data sharing, Nature Biotechnology is running an Editorial this month (Nature Biotechnology 27, 777; 2009), ‘DNA confidential’, pointing out that as “the cost of human genome sequencing plunges and large-scale genome-phenotype studies become possible, society should do more to reward those individuals who choose to disclose their data, despite the risks”. The Editorial continues:

“The genome sequence of Patient Zero is disclosed on p. 847 of this issue (https://www.nature.com/nbt/journal/v27/n9/full/nbt.1561.html). The paper is notable not only because it provides the first description of the performance of a single-molecule platform in sequencing a human genome (90% of it, at least), but also because Stanford professor Stephen Quake (aka Patient Zero) opted to tell the world that it was his DNA that had been sequenced. Like scientific pioneers before him, Quake is heroically self-experimenting—testing the risks in publishing identifiable personal information of the most intimate kind.”

The Editorial goes on to weigh up some of the risks and benefits to an individual and to society at large if people’s genome sequences are generally available, covering healthcare, privacy issues and costs, concluding that “There will be some individuals, like Steve Quake, who will provide samples and data without an incentive; but when it comes to exploring the basis of being human and moving toward the goal of genomic medicine, society needs to do more to provide personal incentives to those who choose to disclose their data, despite the risks. After all, everybody will ultimately benefit—both those who share and those who choose not to.”

Single-molecule sequencing of an individual human genome

Dmitry Pushkarev, Norma F Neff & Stephen R Quake

Nature Biotechnology 27, 847-850; 2009. doi:10.1038/nbt.1561