Bring on the box plots

Box plots are excellent for visualizing important core statistics of sample data. We hope that a new online plotting tool BoxPlotR will help encourage their wider use in basic biological research.

The same three samples plotted by bar chart with s.e.m. error bars (left) and Tukey-style box plot (right). The box plot more clearly represents the underlying data.

A bar chart is often a person’s first choice of plot type for comparing values. This is appropriate when the values arise from counting. But when the value is a mean or median of data points taken from a sample, a bar chart is usually inappropriate. As discussed in our March Editorial and the accompanying Points of View and Points of Significance columns, a “mean-and-error” scatter-type plot or a box plot is more appropriate for sampled data. In summary, we strongly recommend that box plots be used when you have at least five data points; for samples with fewer than five data points, mean-and-error plots are more appropriate.
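To make concrete what a Tukey-style box plot encodes, here is a minimal Python sketch (not BoxPlotR’s code; the function name and sample values are made up for illustration) that computes the five values drawn in such a plot: the median, the two quartiles and the whiskers, with points farther than 1.5 × IQR beyond the box flagged as outliers.

```python
import statistics

def tukey_boxplot_stats(data):
    """Compute the summary values drawn in a Tukey-style box plot."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4)  # quartiles
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme data points inside the fences;
    # anything beyond the fences is plotted individually as an outlier.
    lower_whisker = min(x for x in xs if x >= lo_fence)
    upper_whisker = max(x for x in xs if x <= hi_fence)
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whiskers": (lower_whisker, upper_whisker), "outliers": outliers}

sample = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.2, 9.5]
print(tukey_boxplot_stats(sample))
```

Unlike a bar of the mean, these five numbers convey the center, spread and asymmetry of the sample, and the extreme value 9.5 is shown for what it is rather than silently inflating the mean.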

Box plots are heavily used in biomedical research, in which statisticians have historically had considerable input into study design and analysis. But although similar types and quantities of sample data also appear in basic research (such as that published in Nature journals), box plots are much less common than bar charts in these manuscripts. In Nature Methods last year, for example, ~80% of sampled data were plotted using bar charts.

Discussions we had with the community suggested that an impediment to using box plots instead of bar charts to graph sample data was the limited support for box plots in the plotting programs commonly used by researchers. It also became apparent that some software that did support box plots was deficient in communicating to users what the different elements of the plot represented. As a result, strangely labeled box plots were showing up in published papers. At NPG we thought it would be useful to provide authors with a simple online tool they could use to generate basic box plots of their data for publication.

The origin of BoxPlotR
At the VizBi 2013 conference in Cambridge, Massachusetts, I mentioned NPG’s desire for such a tool at a breakout session chaired by Martin Krzywinski in which the participants, including a young researcher named Jan Wildenhain, discussed what the community needed to create better figures. I also happened to mention our interest to Michaela Spitzer while visiting her poster, from the Juri Rappsilber and Mike Tyers labs, which showed how the R package ‘shiny’ by RStudio can be used to easily convert code written in R (a popular scripting language for statistics) into a visual application for exploring data.

Later at the conference Jan approached me and said he was intrigued by our desire for someone to design a webtool to create box plots and that he was interested in working on such a project. I happily told him to get in touch with me after the conference so we could discuss it further.

Three weeks after the conference concluded I still hadn’t heard from Jan and was beginning to worry that he had decided not to pursue this. Then, a few days later, I received an email from him. Much to my surprise it contained a link to a highly functional tool that he and Michaela, on their own initiative, had gone ahead and created using shiny and R. What followed was a productive and rewarding period of discussion and development, during which Michaela incorporated additional functionality and made selected design changes. The tool was so well designed and functional that I encouraged them to submit it to Nature Methods for publication as a Correspondence. After they incorporated further functionality and changes in response to comments raised during peer review, BoxPlotR was ready for publication.

Sample BoxPlotR plots

Top: simple Tukey-style box plot. Bottom: Tukey-style box plot with notches, means (crosses), 83% confidence intervals (gray bars; chosen so that non-overlapping intervals correspond approximately to significance at p = 0.05) and n values.
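The notches in the bottom plot follow the convention used by R’s boxplot.stats: median ± 1.58 × IQR / √n, an approximate 95% interval for the median, so that two boxes whose notches do not overlap have medians that differ at roughly the 5% level. A small Python sketch of that formula (the function name and data are illustrative, not BoxPlotR’s code):

```python
import math
import statistics

def notch_interval(data):
    """Approximate 95% interval for the median, as drawn by box-plot notches:
    median +/- 1.58 * IQR / sqrt(n) (the convention of R's boxplot.stats)."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4)
    half = 1.58 * (q3 - q1) / math.sqrt(len(xs))
    return median - half, median + half

lo, hi = notch_interval([4.2, 4.8, 5.1, 5.6, 5.9, 6.3, 6.8, 7.4, 7.9])
print(f"notch spans {lo:.2f} to {hi:.2f}")
```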

Launch of BoxPlotR
To accompany the publication and launch of BoxPlotR we thought it would be useful to provide some information and practical advice about box plots to our readers. Nils Gehlenborg, a former author of several Points of View articles with Bang Wong, agreed to resurrect that popular column for our February issue with an article on bar charts and box plots. Similarly, Martin Krzywinski and Naomi Altman agreed to delay our planned Points of Significance article on the two-sample and paired t-test and instead devote an article to box plots.

Seeing how the community responded to our interest in creating an online box plot tool and then working with them on this project has been a great experience. This never would have been possible without the initiative and talent of Jan and Michaela or the support they received from their PIs Mike and Juri. We hope both our authors and others find BoxPlotR useful and we encourage feedback. General comments can be made here on our blog or by emailing the journal. For specific bug reports and feature requests please see the contact information at https://boxplot.tyerslab.com.

A retraction resulting from cell line contamination

After nine years in print, Nature Methods today published its first retraction, one that could have been prevented by cell line authentication. What does this mean for journal-mandated cell line testing?

Two-photon fluorescence image of live primary gliomasphere from retracted manuscript.

In a Nature Methods paper published in 2010, Ivan Radovanovic and colleagues described a method to isolate cancer-initiating cells in human glioma without the need for molecular markers. Based on morphology and on a green autofluorescence, the authors reported they could use FACS to sort cancer-initiating cells from gliomasphere cultures (which had been derived from primary tumors). They also detected autofluorescence in cells from fresh glioma specimens, but at a much lower level.

Cells from the autofluorescent fraction could self-renew clonogenically in vitro and were tumorigenic when transplanted into mouse brains, the authors reported; in both cases they performed better than non-autofluorescent cells from the rest of the culture or tissue. The origin of this autofluorescent signal was not understood at the time. The authors speculated that it might be related to the unique metabolism of the cancer-initiating cells.

It turns out that most of the primary gliomasphere lines (7 out of 10) were contaminated with HEK cells expressing GFP, leading to retraction of the paper. Using short-tandem-repeat (STR) profiling of two of the lines, the authors determined that the contamination occurred over the course of culture in the lab: samples taken from early passages matched the original tissue from which the lines were derived, but later passages no longer did.

It is hardly surprising that the first retraction in Nature Methods is due to cell line contamination, a well-acknowledged problem. A 2009 Editorial in Nature pointed to the disturbing results of cell testing by repositories, which indicated that 18-36% of cultures were misidentified. It called on repositories to authenticate all of their lines and on major funders to provide testing support to grantees. Once that was in place, funders could require cell line validation for investigators to retain funding, and Nature would require that all immortalized lines used in a paper be verified before publication. Unfortunately, it is now 2013 and we are still far from this goal.

But progress is being made. Community-based efforts are alerting researchers to this problem and providing resources to help them avoid being misled by erroneous results caused by cell line contamination. A 2012 Correspondence in Nature by John R. Masters on behalf of the International Cell Line Authentication Committee (ICLAC) pointed to the following resources available to researchers:

Please go to the ICLAC website for the most recent version of each of these documents.

Meanwhile in early 2013, at the publication end of the process, the Nature journals published coordinated editorials announcing a reproducibility initiative and stating that “…authors will need to […] provide precise characterization of key reagents that may be subject to biological variability, such as cell lines and antibodies.” In practice, the Nature journals are currently requiring all authors to state whether or not testing was done but are only requiring testing in cases where it makes particular sense.

Advocates of mandatory testing have cogent arguments for a uniform policy. First, it would avoid sending a confusing message; second, without testing researchers can’t be certain that cell identity problems or mycoplasma contamination aren’t affecting their results; and finally, continued publication of inaccurate species and tissue designations for misidentified cell lines propagates misinformation.

In the work described in the retracted 2010 manuscript from Radovanovic and colleagues, mandatory testing would certainly have been beneficial. For probably the majority of work published by Nature Methods, however, testing would clearly have had no impact on the reported results. For example, in 2011 and 2012 we published at least 17 manuscripts reporting new fluorescence microscopy methods that used imaging data from cell lines to assess the performance of the techniques in measuring fundamental cell properties, such as the appearance and width of actin or microtubule filaments, membrane vesicles or other universal cellular structures. Cell line identity and even mycoplasma contamination would not affect the efficacy or conclusions of these measurements. The same situation exists for the validation and testing of many methods in other research disciplines, such as proteomics, genomics and biophysics.

Even if these labs should be doing cell line validation and mycoplasma testing as a matter of course as part of proper cell culture procedure, mandating such testing as a requirement for publication of all these studies is unjustified.

But clearly even our most recent efforts at improving compliance with good testing practice will not be sufficient to eliminate cell contamination as a problem in work published in Nature journals. A possible solution would be to require testing by default but permit authors to argue why, in their case, testing is clearly unnecessary. Editors (possibly with reviewer input) would be the final arbiters and would need to ensure that although the lines must be named and sourced, no species or tissue identifiers are included in the manuscript in the absence of proper validation.

Technology development labs or others that only use cell lines for purposes distinct from biological investigation could continue to avoid testing. But any lab that might potentially use their cell lines to obtain biological results would know that they should institute a proper testing regimen or risk their work not being publishable in a Nature journal.

At this point this is only an idea based on our experience at Nature Methods. We encourage the community to comment and let us know what they think.

Let’s give statistics the attention it deserves

This month we launch a new column ‘Points of Significance’ devoted to statistics, a topic of profound importance for biological research, but one that often doesn’t receive the attention it deserves.

For the past three years Nature Methods has been publishing the Points of View column, one page a month dedicated to practical advice for researchers on how to create accessible and accurate visualizations of their data. The response to the column articles has been fantastic and most recently we organized them by topic here on our blog.

Unfortunately, a truth about data visualization is that no matter how good the visualization, if the experiment wasn’t appropriately designed and the data weren’t analyzed correctly, the resulting visual depiction of the data will be inherently flawed. Nature Methods and the other Nature journals recently made changes to improve data and methods reporting as part of a reproducibility initiative. We feel this is an important first step in improving experimental reproducibility and repeatability, but unfortunately by the time work is submitted for publication it can be difficult to correct shortcomings in experimental design and analysis.

A population distribution and a distribution of sample means.

In our September issue readers will find a new column, Points of Significance, that we hope will be as useful as the column that preceded it, perhaps more so. Martin Krzywinski, who has been writing the visualization column, is now joined by Naomi Altman, Professor of Statistics at The Pennsylvania State University. Among other things, Naomi will be responsible for ensuring that the information and advice we provide about statistics in every Points of Significance article is accurate.

The column has been expanded from one to two pages and will often have an Excel spreadsheet associated with it. This expansion will help us better communicate information that is less well served by display items. However, as illustrated by the figures in the first article of the column and the accompanying spreadsheet, visual displays will continue to play a vital role due to their strength in providing easily interpretable examples that can often be more readily grasped than mathematical or narrative descriptions.

We will strive to present the material so that each article in the column builds on prior ones. In this spirit the first article discusses populations and sampling, a foundation for nearly all topics to follow. The accompanying spreadsheet allows readers to experiment with sampling and see for themselves how often values obtained from samples deviate substantially from the real population. It can be disconcerting to see just how often ‘bad luck’ gives a ‘wrong’ result in one set of measurements while another set gives the ‘right’ result, yet statistical measures suggest that the former is more likely to be ‘correct’ than the latter. This highlights that statistics cannot tell you whether you are right. But this doesn’t mean statistics has limited value. Rather, readers of scientific articles reporting statistical results need a healthy grasp of the limitations of statistical analysis, and users of statistics can always learn ways to improve the power of their analyses.
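The spreadsheet’s exercise can also be reproduced in a few lines of code. Below is a minimal Python sketch (the population parameters and sample size are arbitrary choices, not values from the column) that repeatedly draws small samples from a known normal population and shows how widely the sample means scatter; their spread closely tracks the theoretical standard error σ/√n.

```python
import random
import statistics

random.seed(1)  # make the 'luck' reproducible
pop_mean, pop_sd, n = 10.0, 2.0, 5

# Draw 1,000 samples of size n and record each sample's mean.
sample_means = [
    statistics.fmean(random.gauss(pop_mean, pop_sd) for _ in range(n))
    for _ in range(1000)
]

print(f"mean of sample means: {statistics.fmean(sample_means):.2f}")
print(f"sd of sample means:   {statistics.stdev(sample_means):.2f} "
      f"(theory: sigma/sqrt(n) = {pop_sd / n ** 0.5:.2f})")
```

Sorting the simulated means shows that an appreciable fraction land far from the true mean of 10, which is exactly the ‘bad luck’ the column warns about for small samples.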

The “aura of exactitude” that often surrounds statistics is one of the main notions that the Points of Significance column will attempt to dispel, while providing useful pointers on using and evaluating statistical measures. We expect that readers will find the upcoming October Points of Significance article on error bars and confidence intervals with its practical tips on interpreting these graphical elements to be particularly useful almost every time they read a manuscript containing these popular visual representations of uncertainty.

We hope readers enjoy Points of Significance. It is appropriate that the column is debuting during the International Year of Statistics. To allow readership by a wider audience each article will be free to access for a period of one month after it is published.

Update: All Points of Significance articles are now free access and have been collected together on a dedicated page in the nature.com “Statistics for biologists” resource.

For more on statistics, and particularly statistics training, don’t miss this September’s Editorial.

. . . . . . . .

Update: Below is a continuously updated list of the Points of Significance articles.

Importance of being uncertain – September 2013
How samples are used to estimate population statistics and what this means in terms of uncertainty.
Error Bars – October 2013
The use of error bars to represent uncertainty and advice on how to interpret them.
Significance, P values and t-tests – November 2013
Introduction to the concept of statistical significance and the one-sample t-test.
Power and sample size – December 2013
Using statistical power to optimize study design and sample numbers.
Visualizing samples with box plots – February 2014
Introduction to box plots and their use to illustrate the spread and differences of samples.
Comparing samples—part I – March 2014
How to use the two-sample t-test to compare either uncorrelated or correlated samples.
Comparing samples—part II – April 2014
Adjustment and reinterpretation of P values when large numbers of tests are performed.
Nonparametric tests – May 2014
Use of nonparametric tests to robustly compare skewed or ranked data.
Designing comparative experiments – June 2014
The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.
Analysis of variance and blocking – July 2014
Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.
Replication – September 2014
Technical replication reveals technical variation while biological replication is required for biological inference.
Nested designs – October 2014
Use the relative noise contribution of each layer in nested experimental designs to optimally allocate experimental resources using ANOVA.
Two-factor designs – December 2014
It is common in biological systems for multiple experimental factors to produce interacting effects on a system. A study design that allows these interactions can increase sensitivity.
Sources of variation – January 2015
To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication to collect replicable and meaningful results.
Split plot design – March 2015
When some experimental factors are harder to vary than others, a split plot design can be efficient for exploring the main (average) effects and interactions of the factors.
Bayes’ theorem – April 2015
Use Bayes’ theorem to combine prior knowledge with observations of a system and make predictions about it.
Bayesian statistics – May 2015
Unlike classical frequentist statistics, Bayesian statistics allows direct inference of the probability that a model is correct and it provides the ability to update this probability as new data is collected.
Sampling distributions and the bootstrap – June 2015
Use the bootstrap method to simulate new samples and assess the precision and bias of sample estimates.
Bayesian networks – September 2015
Model interactions between causes and effects in large networks of causal influences using Bayesian networks, which combine network analysis with Bayesian statistics.
Association, correlation and causation – October 2015
Pairwise dependencies can be characterized using correlation but be aware that correlation only implies association, not causation. Conversely, causation implies association, not correlation.
Simple linear regression – November 2015
Linear regression is a flexible way to predict the values of one variable using the values of the other to find a ‘best line’ through the data points.

Importance of data sharing

No more (trade) secrets

Withholding information on the clinical significance of genetic variants from the scientific community impedes the progress of research and medicine.

Imagine you are a physician or researcher seeking confirmation of the clinical impact of particular genetic variants. If your search of public databases comes up empty, this does not necessarily mean that nothing is known about the mutations in question. Rather, the information may be locked away as a trade secret in a genetic testing company’s proprietary database.

Physicians and their patients are unable to independently verify the medical significance of a testing company’s finding; instead, the results have to be taken on blind faith. Researchers are limited in their knowledge of the vast mutational landscape in genes associated with diseases such as cancer, which in turn may limit their understanding of the molecular underpinnings of the disease.

Robert Nussbaum, at the University of California, San Francisco, recently pointed out that in other fields of medicine such an approach would be unthinkable. In a Technology Review article he said, “Imagine if radiological images or histopathology slides of cancers were examined by a single monopoly holder without the medical community being able to assess and learn from what these images and tissue specimens teach us.” He launched the Sharing Clinical Reports Project, an initiative to collect de-identified genetic testing data on the BRCA1 and BRCA2 genes (as discussed in our August editorial).

With more genetic testing companies likely to enter the market after the US Supreme Court invalidated some gene patents, the problems caused by proprietary data may increase. Clinicians may now have more options for obtaining a genetic test, but if they go with a less established testing company they may be left with a suboptimal interpretation, with possibly grave implications for the patient.

A resolution from the American Medical Association passed in June 2013 supports public access to genetic data. The resolution calls for companies, laboratories, researchers and providers to publicly share data on genetic variants in a manner consistent with privacy and HIPAA protections.

Whether such calls will be heeded is another question. In a New York Times Op-Ed aptly titled “Our genes, their secrets”, the author wonders whether the recent Supreme Court decision will prompt genetic testing companies to rely more on this strategy of treating information on the clinical impact of mutations as a trade secret, thereby deterring competition and ensuring revenue.

How can this be prevented? Cook-Deegan et al., in a recent article in the European Journal of Human Genetics, call for joint action by national health systems, insurers, regulators, researchers, providers and patients to ensure broad access to information about the clinical significance of variants. Besides promoting voluntary sharing, their suggestions include making sharing a condition of payment or of regulatory approval of testing laboratories.

The battle over who may offer certain genetic tests is certainly heating up. Ambry Genetics and Gene by Gene, two of the companies now offering BRCA1 and BRCA2 testing, have been sued by Myriad Genetics for patent infringement. A few days later, on July 12, US senator Patrick Leahy, a Democrat from Vermont, wrote to Francis Collins, the director of the NIH, urging him to force Myriad to license the patent on reasonable terms to other parties to ensure affordable life-saving diagnostic tests. As the federal agency that funded the research behind Myriad’s patent, the NIH has the authority to do so, based on a provision in the Bayh-Dole Act that enabled universities to own inventions arising from federal funding. Whether it will exercise this authority is unclear; Collins’s reply is still outstanding.

Ambry Genetics disputes that it infringes any of Myriad’s patents and a company spokesperson told Nature Methods that Ambry plans to share their testing data.

If enough companies follow suit, the desirable equilibrium of compensating a company fairly for the cost of its test and at the same time letting the public benefit from the results of these tests should be within reach.

Data visualization: A view of every Points of View column

We’ve organized all the Points of View columns on data visualization published in Nature Methods and provide this as a guide to accessing this trove of practical advice on visualizing scientific data.

As of July 30, 2013 Nature Methods has published 35 Points of View columns written by Bang Wong, Martin Krzywinski and their co-authors: Nils Gehlenborg, Cydney Nielsen, Noam Shoresh, Rikke Schmidt Kjærgaard, Erica Savig and Alberto Cairo. As we prepare to launch a new column in our September issue we felt this would be a good time to collect and organize links to all the Points of View articles together in one place to make it easier to navigate this wonderful resource that the authors have provided us. For the month of August we will be making all the columns free to access so everyone can benefit from this practical advice on data visualization.

This should not be the end of the Points of View column, though. We will be inviting new visualization experts to author articles on topics that have not been covered so far or that can be expanded on. This page will be continuously updated whenever a new article is published, so stay tuned. If you have a suggestion for a topic you would like to see covered in a future Points of View article, please comment below.

Update of March 28, 2015: A PDF eBook of the 38 Points of View articles published between August 2010 and February 2015 is now available at the Nature Shop for $7.99 under the title “Visual strategies for biological data: the collected Points of View”. The article summaries below provide a nice overview of what is contained in that eBook collection.

. . . . . . . .

Introduction
Visualizing biological data – December 2012
Data visualization is increasingly important, but it requires clear objectives and improved implementation
The overview figure – May 2011
An economic overview figure to convey general concepts helps readers understand a research study

. . . . . . . .

Composition and layout
The design process – December 2011
Use good design to balance self-expression with the need to satisfy an audience in a logical manner
Layout – October 2011
Proper layout reveals the hierarchical relationship of informational elements
Gestalt principles (Part 1) – November 2010
Gestalt principles (Part 2) – December 2010
Exploit perceptual phenomena to meaningfully arrange elements on the page
Negative space – January 2011
Whitespace is a powerful way of improving visual appeal and emphasizing content
Salience to relevance – November 2011
Ensure that viewers notice the right content by making relevant information most noticeable
Elements of visual style – May 2013
Translate the principles of effective writing to the process of figure design
Storytelling – August 2013
Relate your data to the world around them using the age-old custom of telling a story

. . . . . . . .

Using color
Color coding – August 2010
Choose colors appropriately to avoid bias and unwanted artifacts in visuals
Color blindness – June 2011
Make your graphics accessible to those with color vision deficiencies
Avoiding color – July 2011
Improve the overall clarity and utility of data displays by using alternatives to color
Mapping quantitative data to color – August 2012
Color is useful for compact visualizations of large data sets but must highlight salient features
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively

. . . . . . . .

Elements of a figure
Typography – April 2011
Choose typefaces, sizes and spacing to clarify the structure and meaning of the text
Axes, ticks and grids – March 2013
Make navigational elements distinct and unobtrusive to maintain visual priority of data
Labels and callouts – April 2013
Figure labels require the same consistency and alignment in their layout as text
Plotting symbols – June 2013
Choose distinct symbols that overlap without ambiguity and communicate relationships in data
Arrows – September 2011
Use well-proportioned arrows sparingly and consistently as a guide through complex information

. . . . . . . .

Plot types
Bar charts and box plots – February 2014
Choose the appropriate plot according to the nature of the data and the task at hand
Sets and intersections – July 2014
Euler and Venn diagrams are appropriate for up to three sets but for greater numbers use more scalable plots
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively
Temporal data – February 2015
Use inherent properties of time to create effective visualizations
Unentangling complex plots – July 2015
Carefully designed subplots scaled to the data are often superior to a single complex overview plot
Pathways – January 2016
Apply visual grouping principles to add clarity to information flow in pathway diagrams
Neural circuit diagrams – March 2016
Use alignment and consistency to untangle complex neural circuit diagrams

. . . . . . . .

Improving figure clarity
Simplify to clarify – August 2011
Simplify your presentation to improve clarity
Design of data figures – September 2010
Improve figure decoding by using strong visual cues to encode data
Salience – October 2010
Use salience to differentiate graphical symbols and speed up figure reading
Points of review (Part 1) – February 2011
Examples of figure redesigns
Points of review (Part 2) – March 2011
Simple tips to improve pie chart, scatter plot and color scale data displays

. . . . . . . .

Multidimensional data
Into the third dimension – September 2012
3D visualizations are effective for spatial data but rarely for other data types
Power of the plane – October 2012
Combine 2D plots for effective visualization of multivariate data
Multidimensional data – July 2013
Visually organize complex data by mapping them onto familiar representations of biological systems

. . . . . . . .

Data exploration
Pencil and paper – November 2012
Quick sketches and doodles of data or models aid thinking and the scientific process
Data exploration – January 2012
Create ‘slices’ of data to enhance the process of pattern discovery
Networks – February 2012
Choose your network visualization based on the patterns you are looking for
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively
Integrating data – April 2012
Combine visualizations of multiple data types to find correlations and potential relationships
Representing the genome – May 2012
Limit what is displayed based on the question being asked
Managing deep data in genome browsers – June 2012
Compaction and summarization help find patterns in overwhelming data
Representing genomic structural variation – July 2012
Use arcs, color, dot plots and node graphs to show relations between distant genomic positions

. . . . . . . .

Return of the Points of View column

Our popular “Points of View” column returns this month after a brief hiatus. Here is a bit of history of the column and an introduction to its new author.

On this day four years ago Sean O’Donoghue contacted Nature Methods about a workshop he was organizing on visualizing biological data. This culminated in a Nature Methods Supplement on Visualizing Biological Data published one year later that coincided with the first VizBi meeting in Heidelberg, Germany.

During this meeting Bang Wong and I hatched the idea of a Nature Methods column that would provide researchers with practical advice on the visual presentation of data. Later that year our August issue featured Bang’s very first Points of View column, “Color coding“. What followed was a labor of love by both Bang and me, with plenty of stress over deadlines, that extended over two years.

The column seemed to fill a need in the community and generated considerable positive feedback, including from authors and reviewers who would sometimes refer to advice from Bang’s columns. At the end of 2012 Bang took a needed break and the column went on hiatus. But in the meantime I had again met someone at a meeting in Germany who was passionately interested in the visual display of data.

The Points of View column returns in our March issue authored by Martin Krzywinski (staff scientist, creator of the visualization software Circos, and former fashion photographer).

I decided we couldn’t let someone with Martin’s varied experiences debut as the new Points of View columnist without learning a bit more about him so I asked our Technology Editor, Vivien Marx, to see what she could dig up.

Martin Krzywinski

Current mode: Makes cancer research and genome analysis visual.
Introduction to genomics: Built computing infrastructure at the Genome Sciences Center.
Past activities (incomplete): fashion photography, computer security, particle physics.
Published information graphics (incomplete): Book covers, American Scientist, EMBO Journal, PNAS, The New York Times, Wired, Conde Nast Portfolio.

Alex the rat

Q: You photographed Alex (2000-2002) and helped her become the poster rat for genome sequencing. For example, she was Genome Research’s rat cover-girl. She frequently rode on your shoulder and seems like a groovy friend.

M.K.: Don’t be fooled by Alex’s visual presentation. She bit me countless times. But what do you expect from a rat? Maybe it is I that never learned.

Q: In addition to photo-shoots with Alex, you have had human fashion models in front of your lens. Fashion is pretty. Why should science be pretty?


New video functionality in online manuscripts

Data in research papers that is best presented in the form of videos gets short shrift compared to data that can be easily presented in figures and tables. Printing of representative video frames is a poor surrogate. Embedding videos in PDFs is possible but rare. Even online, where embedding videos in an HTML page is technologically easy, videos are usually provided only as links in the supplementary information for downloading video files.

This week, Nature Methods published two manuscripts from Keller and colleagues and Hufnagel and colleagues describing improved light-sheet microscopy technology that captures amazing time-lapse 3D images of fluorescently labeled cells in developing Drosophila embryos. To help showcase the beautiful videos containing this data we debuted new video functionality that Nature Publishing Group will be rolling out to other journals over time.

We invite you to watch these videos and let us know what you think about the new streaming video player, or the imaging method used to obtain this data. Some of the videos are very large and will take some time to start if you have a slow Internet connection but we hope that even in these cases you find this to be an improvement.

Of course, we still offer the ability to download the original video files supplied by the authors so you can see them in their original resolution, regardless of the speed of your connection.

To share or not to share

Many in the mass spectrometry community agree that MS data should be made publicly available for everybody’s benefit. All data, including the raw files generated by the mass spectrometers.

In the May editorial we support this request and introduce a new raw data repository run by the EBI that offers a replacement for the declining TRANCHE, until very recently the only repository for such data.

Several good arguments can be made for making raw data available; one of them is the re-analysis of published data to validate claims. Consider, for example, the controversy that arose in the wake of the analysis of fossilized Tyrannosaurus rex bones by Asara and colleagues, which led them to suggest that T. rex is more closely related to birds than to reptiles (Asara et al., Science, 2007). Their findings were eventually corroborated in 2009 (Bern et al., J. Proteome Res.), but they could have been examined much more quickly if access to the raw data had been granted at the time of publication.

Re-analysis aside, raw data present a treasure trove of information that can be examined from different angles and, over time, with new tools that bring aspects to light that the original experimenters did not think of. To create such new analysis tools, software developers rely on raw data to benchmark against established techniques.

Having access to raw files does not mean that they are easy to use. We realize that the diversity of file formats and the difficulty of converting one file type to another make their analysis not as straightforward as it could be with a single, community-supported format. We also realize that these files are large, and uploading them to the new EBI repository, or any other repository, will take time and some effort, particularly if important metadata about the experiment are included.

Still, we think the effort is worth it to ensure the field can move forward. We’d love to hear your views, particularly if you disagree.

Where’s your ground truth?

When using or developing experimental and observational methods, it is crucial to assess a method’s performance to ensure that the information it provides reflects reality. For experimental biologists this often means conducting carefully chosen control experiments with alternative methods or different experimental settings. More rigorous assessment, particularly for high-throughput or large-scale methods, often requires the use of ‘ground truth’ or ‘gold standard’ data sets. But talk to different people and you will get different answers regarding what ‘ground truth’ or ‘gold standard’ data are, often accompanied by a nice historical explanation of where the term ‘ground truth’ comes from.

For developers of signal processing and image analysis algorithms, though, the situation is clearer: the ground truth is the signal or image you start with. But add a living system into the mix and things get far more complicated. The Editorial in the November issue of Nature Methods discusses the challenges facing developers and users of algorithms for automated analysis of biological data, with a focus on image data. In short, traditional ground truth data are often insufficient. Adding integrated editing and change-logging capabilities to these software tools can increase the quality of the analysis, aid further algorithm development and increase the likelihood of biologists adopting the software in the first place.

Cloud computing in biology

The sheer amount of data being generated in large-scale high-throughput biological studies is challenging current capabilities for data storage and analysis. One solution to this has been to move to cloud computing. In our editorial this month we discuss current efforts in this direction and the particular challenges of biological analysis in the cloud.