Data sharing discussed at Nature and Nature Network

Sharing data is good. But sharing your own data? That can get complicated. As two research communities who held meetings on this question in Rome and in Toronto in May report their proposals to promote data sharing in biology, a special issue of Nature (10 September 2009) examines the cultural and technical hurdles that can get in the way of good intentions. Some of the authors of these proposals are participating in two online forums (Rome and Toronto) at Nature Network – so please accept our invitation to visit and have your say on these questions.

More details:

The two research communities held meetings with a broad range of stakeholders to discuss ways to promote data sharing in biology. Data producers and users met at a workshop in Toronto to discuss the benefits and best practices of rapid data release prior to publication. Ewan Birney, Tom Hudson and colleagues report the main conclusions of these discussions in a community statement, free to access here.

The Toronto group propose that the principles for early release of genomics data should be extended to other large datasets in biology and medicine. A grace period should be allowed, if requested, to enable data producers to analyse and publish their dataset, but this should be limited to one year. The authors also suggest a set of best practices for funding agencies, scientists and journal editors.

The recommendations are intended to spark community discussion on this subject. Ewan Birney, Tom Hudson and others will be responding to reader comments in our Nature Network forum. Be sure to have your say.

Mouse researchers, along with funding agencies and publishers, met in Rome to address the barriers preventing more effective sharing of data and biomaterials — particularly mouse strains and embryonic stem cells. Their agenda, free to access here, suggests guidelines to enable sharing of materials under the least restrictive terms, avoiding material transfer agreements where possible.

The Rome participants argue that funding organizations, journals and researchers need to work together to encourage better use of public repositories and to promote a ‘research commons’ in mouse biology.

The recommendations are intended to spark community discussion on this subject. Paul Schofield and others will be responding to reader comments in our Nature Network forum. Be sure to have your say.

See also the Editorial (free to access online) in the same issue of Nature (461, 145; 2009): ‘Data’s shameful neglect’, opining that research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly.

Nature’s special issue on data sharing.

Nature Cell Biology on research integrity and accessibility

The cell biology literature contains manipulated data that distort findings, usually in an attempt to ‘beautify’ and, rarely, to commit fraud, states the September Editorial in Nature Cell Biology (11, 1045; 2009, free to read online). According to the Editorial, a National Academy of Sciences (NAS) report, ‘Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age’, "arrives at no hard and fast rules; the panel found that different fields have quite different requirements. In the words of panel chairs Phillip Sharp and Daniel Kleppner, “the report provides a framework for dealing with the challenges to the community generated by the onrush of digital technology.” Nevertheless, the key tenets that researchers are responsible for ensuring the integrity and accuracy of their data and appropriate training in the management of research data, that all data and experimental details from papers be publicly accessible and carefully archived to allow verification and to facilitate future discoveries, and that field-specific standards have to be developed by researchers, funders, societies and journals, benefit from being spelled out in one document."

Many of the recommendations in the report already are the policy of Nature Cell Biology and the other Nature journals: the Editorial provides further information about these, including references to past Editorials, with particular emphasis on various aspects of data manipulation and plagiarism — which, although widely unrealised, extends to concepts as well as to copying text and illustrations.

Metagenomics analysed at Nature Methods

Metagenomics sprang from advances in sequencing technology, and continued improvements are providing data in quantities unimaginable a few years ago. But without concerted efforts, the amount of data will quickly outpace the ability of scientists to analyse it. The September Editorial of Nature Methods (6, 623; 2009), ‘Metagenomics versus Moore’s law’, draws attention to articles in the same issue of the journal that illustrate some of the dangers and problems, as well as the solutions that are being sought.

Three years ago, the Editorial continues, the first two second-generation metagenomes were reported at less than 40 megabases each. Now, there are more than 4,000 sequenced metagenomes that would take years or tens of years to analyse (depending on the processing power used). Major initiatives are needed to avoid metagenome-analysis gridlock: according to the Editorial, funding agencies need to increase support for data analysis; and the community needs to improve data-sharing through standards and centralized coordination and by aggregating computationally intensive operations. The conclusion:

“This summer, after discussions at the International Conference on Systems for Intelligent Molecular Biology, community members formed the M5 (metagenomics, metadata, metaanalysis, multiscale-models and metainfrastructure) Consortium under the roof of the Genomics Standards Consortium to devise a solution to the coming gridlock. Their proposed ‘M5 Platform’—to be announced later this year—deserves the support of the community, funding agencies and those who hold the keys to the high-performance computing centers. Unless major efforts are taken immediately, researchers will find they have a wealth of data but no way to interpret it.”

Readers’ comments and discussion of this Editorial are welcomed at Methagora, the Nature Methods blog.

No restrictions on tissue distribution

The distribution of human cell lines used in research should not be hindered by restrictions from donors, states an Editorial in Nature last week (460, 933; 2009; free to access online). The occasion of the Editorial is a Corrigendum relating to a paper published in the journal last year (‘Generation of pluripotent stem cells from adult human testis’ by S. Conrad et al., Nature 456, 344-349; 2008). In the Corrigendum, the authors explain how the original patient consent forms to collect the material used to derive the pluripotent stem cells precluded distribution to third parties under the regulations of the relevant hospital ethics committee. (The authors also explain that they are going to cultivate new cells, under different terms of consent, which they can then distribute upon request.)

As the Editorial points out, failures to distribute cell lines are incompatible with Nature journal policies and with the efficient progression of scientific knowledge. The Corrigendum alerts investigators to this situation and the steps being taken to rectify it. Even when clinicians, researchers and their local ethics board follow internal procedures that promote both donor safety and medical research, serious problems can arise regarding the unhindered distribution of samples.

Here is a slightly shortened version of the rest of the Editorial:

The community was not that surprised by this situation — six of seven researchers contacted by Nature thought this could happen again. Researchers developing cell lines must investigate the restrictions associated with the human tissue they are using, particularly if someone else collected the samples, if the samples come from multiple clinical sources or if they come from several legal jurisdictions. If a scientist needs to create cell lines that might be used for as-yet-unforeseen purposes, only tissue with no restrictions should be used.

Journals can remind authors in their policy guidelines that authors of submissions that involve consent forms must make editors aware of any limits that result from those forms. The Nature journals will be revising their policies to make this clearer.

Most importantly, patients, researchers, clinicians, and review and ethics boards worldwide need to agree on conventions that are acceptable to most parties under most circumstances. Internationally standardized consent forms for the donation of human tissue should cover new uses, genomic comparisons, patents and product development, and should discourage limiting access or lifespan.

Ethics and review boards are set up to protect individuals, but can also go much further to promote research. No one can deny that donors need to understand the risks and benefits of a procedure, trial or donation. However, it seems most ethically responsible, given the value of research, for the boards to explain the consequences that restricted access and time limits can have on the value of a donor’s tissue.

Not always so simple to share mouse strains

This is the text of a Correspondence by Richard Behringer of the University of Texas, published in the current issue of Nature (460, 324; 16 July 2009).

I was disappointed by the view expressed in your Editorial ‘The sharing principle’ (Nature 459, 752; 2009 – free to read online) that the mouse community does not share its strains. This is untrue. Most labs are very collegial, spending a considerable amount of time and effort on distributing their mouse strains. Although there are a few labs that withhold distribution, any community may contain such individuals. The fact that a mouse strain is not found in a repository does not mean that it is not being shared.

I was also puzzled by the conclusion of May’s CASIMIR workshop, noted in the Editorial, that “the sharing problem urgently needs resolution” with regard to international mouse gene-knockout projects. Such mutant alleles will mostly be archived as embryonic stem-cell lines. Readers should also realize that repositories cannot keep all their mouse strains live ‘on the shelf’: most strains are frozen. The cost for a user to have a strain thawed is thousands of dollars and it takes many months before the recovered mice become available. This is a big disadvantage for labs on tight budgets. With regard to funding agencies: in grant proposals to the US National Institutes of Health, for example, applicants are required to write a ‘resource-sharing plan’ that includes genetically modified mice.

It was suggested that sharing avoids duplication of effort. But it is essential that more than one group generates mutations in the same gene as a crosscheck. No two labs generate the same allele, and every geneticist knows that the expression of different alleles can lead to very distinct phenotypes.

Your claim that sharing mice “has never been easier” is questionable, considering all the paperwork, health certificates, veterinary screens, special serology screens, costs, time and logistics involved. This is quite different from uploading DNA sequences in the comfort of your office.

It would be great if funding agencies supplemented grants involving the generation of mouse strains to cover the costs of sending the strains to a repository. In these tough financial times, that seems unlikely.

Department of Genetics, University of Texas, M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, Texas 77030, USA.

Nature journals’ policy on data and materials availability.

Nature Biotechnology calls for better data-sharing practices

A universal tagging system that links data sets with the author(s) that generated them is essential to promote data sharing within the proteomics and other research communities. The July Editorial in Nature Biotechnology (27, 579; 2009) reports the results of the journal’s survey of author compliance in depositing proteomics and molecular-interaction data underlying the papers they published. The editors found that even authors who are proponents of data deposition are not making data available in all of the papers they publish. Inhibitory factors include data quality and the user-unfriendliness of some databases. The Editorial concludes:

“One option would be to provide researchers who release data to public repositories with a means of accreditation. This would take the form of a universally standardized tag for data that could be searched and recognized by both funding agencies and employers. An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share. In essence, the tag would be a digital object identifier (DOI), currently best known for its use in unambiguously identifying papers online.

Similar to citation information about publications, citation information about a researcher’s data DOIs could be gathered by funders assessing future support and used by institutions in performance evaluation. Researchers who disclose data sets that subsequently prove particularly useful to the community would end up with highly cited data DOIs, and could thereby be rewarded accordingly.

Such a system would not solve all the problems slowing data disclosure in proteomics and elsewhere. But it would provide greater incentive than the present system of evaluation, which is skewed almost exclusively to publications in high-profile journals and citation metrics. Data DOIs would not only enhance a researcher’s reputation but also establish priority of data generation. Most important of all, they would provide a way to acknowledge the time and effort individuals must invest in sharing data, which ultimately benefits the scientific community as a whole.”

See also a Correspondence in the same issue of Nature Biotechnology (27, 597-598; 2009): PRIDE Converter: making proteomics data-sharing easy, by Harald Barsnes, Juan Antonio Vizcaíno, Ingvar Eidhammer and Lennart Martens, a collaboration between the University of Bergen and the European Bioinformatics Institute.

Nature journal policies on data and materials availability.

US scientist jailed for sharing sensitive data

From Nature News (Nature 460, 163; 8 July 2009):

A former University of Tennessee professor has been sentenced to four years in prison for sharing sensitive technologies with his Chinese and Iranian graduate students.

J. Reece Roth, an emeritus professor of electrical engineering, was sentenced on 1 July by a Tennessee district court for violating the Arms Export Control Act. He had been developing ways to reduce the drag on unmanned planes, and employed two research assistants without obtaining the required licence (see Nature 442, 232–233; 2006). Roth plans to appeal the verdict.

In a separate case, a Chinese-born scientist who has lived in the United States for 23 years is suing the US government for rights violations for expelling him last year from the NASA Ames Research Center, California.

Haiping Su, a US citizen who received his doctorate in 1991 from Kansas State University in Manhattan, alleged in a case filed on 24 June in a San Jose federal court that a 2007 security badge-issuing process led to his illegal ousting.

Su was working on airborne systems for imaging forests. His attorneys say he had no involvement with classified material.

Chemical biologists could help accelerate drug discovery

This month’s (July) Nature Chemical Biology includes two articles describing how access to the highest quality chemical probes will ensure their prominent position in the biological and drug discovery toolboxes.

In their Commentary, ‘Open access chemical and clinical probes to support drug discovery’ (Nature Chemical Biology 5, 436-440; 2009), Aled M Edwards, Chas Bountra, David J Kerr and Timothy M Willson say that drug discovery resources in academia and industry are not used efficiently, to the detriment of industry and society. Duplication could be reduced and productivity increased, they write, by performing basic biology and clinical proofs of concept within open access industry-academia partnerships. Chemical biologists could play a central role in this effort.

The authors’ main argument is that the development of new medicines is being hindered by the way in which academia and industry advance innovative targets. By generating freely available chemical and clinical probes and performing open-access science, the overall system will produce a wider range of clinically validated targets for the same total resource, arguably the most effective way to spur the development of treatments for unmet needs.

In a related article in the same issue of the journal, ‘A crowdsourcing evaluation of the NIH chemical probes’, Tudor I. Oprea et al. (Nature Chemical Biology 5, 441-447; 2009) write that between 2004 and 2008, the US National Institutes of Health Molecular Libraries and Imaging initiative pilot phase funded 10 high-throughput screening centres, resulting in the deposition of 691 assays into PubChem and the nomination of 64 chemical probes. The authors ‘crowdsourced’ the Molecular Libraries and Imaging initiative output to 11 experts, who expressed medium or high levels of confidence in 48 of these 64 probes. Crowdsourcing is a cross-disciplinary alternative way to assess confidence for both chemical probes and drug leads: it pools multiple levels of expertise from translational disciplines, providing a rigorous chemical-probe evaluation process.

Nature Chemical Biology website.

Nature Chemical Biology guide to authors.

Nature Chemical Biology focuses and supplements.

Nature Chemical Biology symposium 2009: Chemical biology in drug discovery.

New rules for presentation of statistics in cell biology

New rules for the presentation of statistics in the Nature journals are described in the June Editorial of Nature Cell Biology (11, 667; 2009). From the Editorial:

Thanks to advanced imaging technologies and better integration with molecular and systems approaches, cell biology is undergoing something of a renaissance as a quantitative science. Robust conclusions from quantitative data require a measure of their variability. Cell biology experiments are often intricate and measure complex processes. Consequently the number of independent repeats of a measurement can be limited for practical reasons, yet the variability of the measurements can be rather high. Cell biologists have developed good intuition to guide their analysis of such constrained datasets. Biological complexity and the reliance on intuition can cause culture shock to physical scientists crossing over into cell biology (a kind of extension of the celebrated ‘two cultures’ concept of C. P. Snow).

With the arrival of quantitative information and ‘-omic’ datasets, statistical analysis becomes a necessity to complement instinct. The problem is that statistical tools are built on basic assumptions such as the independence of replicate measurements and the normality of data distribution. Usually, sizeable datasets are prerequisite for statistical analysis. Alas, these can be as hard to come by as a biostatistician (n is typically well below 5). The result is that all too often statistics (frequently undefined ‘error bars’) are applied to data where they are simply not warranted.

There are no easy solutions to rectify the prevalence of poor statistics in cell biology studies. However, an obvious recommendation is to consult a statistician when planning quantitative experiments. Consider whether n represents independent experiments (you may actually be publishing a measure of the quality of your pipette!) and whether it is large enough for the test applied. Avoid showing statistics when they are not justified; instead, show ‘typical’ data or, better still, all the measurements. Importantly, displaying unwarranted statistics attributes a misleading level of significance to the data. Always describe and justify any statistical analysis applied. We have updated our guidelines to reflect these recommendations. One key rule: if the number of independent repeats is less than the fingers of one hand, show the actual measurements rather than error bars. If you wish to present error bars, include the actual measurements alongside them.

Finally, please remember that you are interrogating a complex system — be careful not to discard ‘outlier’ data points on a whim, as they may well be as relevant as clustered measurements. One is naturally inclined to ignore data that does not match the hypothesis tested, but biology is rarely as black and white as we would like. Do not make ‘hypothesis driven’ research become ‘hypothesis forced’!
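The ‘fingers of one hand’ rule quoted above can be sketched in a few lines of Python. This is only an illustration, not anything published by the journal: the function name `summarize_repeats`, the reading of the threshold as five independent repeats, and the choice of the sample standard deviation for the error bar are all assumptions made here for the example.

```python
from statistics import mean, stdev

# Assumption: "less than the fingers of one hand" is read as n < 5.
MIN_REPEATS_FOR_ERROR_BARS = 5


def summarize_repeats(measurements):
    """Decide how to present a set of independent repeats (hypothetical helper).

    The raw values are always returned, because the quoted guidance asks for
    the actual measurements to be shown even when error bars are presented.
    """
    values = list(measurements)
    summary = {
        "n": len(values),
        "values": values,
        "show_error_bars": False,
        "mean": None,
        "sd": None,
    }
    if len(values) >= MIN_REPEATS_FOR_ERROR_BARS:
        # Assumption: the error bar is the sample standard deviation.
        summary["show_error_bars"] = True
        summary["mean"] = mean(values)
        summary["sd"] = stdev(values)
    return summary


if __name__ == "__main__":
    print(summarize_repeats([0.8, 1.1, 0.9]))            # too few repeats: raw values only
    print(summarize_repeats([0.8, 1.1, 0.9, 1.0, 1.2]))  # enough repeats: mean and sd added
```

The sketch does not check whether the repeats are truly independent experiments or whether the data suit a given test; the Editorial leaves those judgements to the researcher and, ideally, a statistician.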

Genetically modified mouse strains must be made available

This is the text of one of the Editorials in the current issue of Nature, The sharing principle (Nature 459, 752; 2009):

Back in 1996, human-genome scientists signed up to the Bermuda agreement to share their data without delay. Since then, the sharing principle has entered the mainstream — it now applies to all genomic data generated using public funding, as well as to all the relevant resources cited in publications.

But this principle is not universally observed for genetically modified mice, designed as vital resources in the quest to unpick basic biological mechanisms or to model human disease. The size of the problem is unclear, but existing surveys, combined with extensive anecdotal evidence, suggest it is substantial. In April 2006, for example, scientists at the US National Institutes of Health found that nearly 4,000 unique mice strains had been created, yet barely 700 had been placed in a repository.

Some scientists say they do not have the time nor money to breed and distribute their mice, or even to send the animals to publicly funded mouse repositories such as the European Mouse Mutant Archive, the Jackson Laboratory in Bar Harbor, Maine, and RIKEN BioResource Centre in Ibaraki, which would do those chores for them. Others claim that the careers of young and vulnerable researchers (or old and vulnerable researchers) could be harmed if they lost their exclusive access to a resource they made for their own research projects. Or, they say their institution’s technology-transfer offices or companies will not let them part with mouse strains that could perhaps be made to turn a profit.

Such attitudes were noted with concern last month at a workshop in Rome hosted by CASIMIR, a European Union project to coordinate and sustain mouse resources internationally (see background documents). The workshop brought together representatives from funding agencies, publishers and the mouse repositories from Europe, the United States and Australasia. They concluded that the sharing problem urgently needs resolution — not least because international projects to systematically generate mouse lines deficient in each gene in the genome will generate thousands of new strains in the next five years or so.

To solve the problem, however, journals and funding agencies must take a tougher line. The Nature journals are among the very few actually requiring that authors use established public repositories wherever possible as a condition of publication. Most journals simply ‘encourage’ their authors to make mice used in their publications freely available to other laboratories, or ‘suggest’ that the mice be deposited in repositories. Funding agencies similarly prefer such cajoling terms as ‘encourage’ in their policies on sharing mouse resources, and rarely police the outcome.

Journals should now require researchers to place their mice in repositories as a condition of publication. And funding agencies should require repository plans to be included in all grant applications that are likely to generate new mouse strains. Part of the grant money should be reserved for this task and final reports or evaluations of the grants should refer to the repository used. The repositories themselves should help the journals and funding agencies by finding a way to generate a unique accession number for each mouse strain.

The sharing principle allows biology to progress efficiently. It avoids duplication of effort and allows different laboratories to use the same tools. It is essential that scientists sign up to it. Sharing mice has never been easier — the repositories around the world are efficient and professional, and they are coordinated. Just a few changes in the modus operandi of key institutions could ensure that the makers of mice will have no possible excuse not to use them.

Nature journals’ policy on availability of data and materials.