Let’s give statistics the attention it deserves

This month we launch a new column, ‘Points of Significance’, devoted to statistics, a topic of profound importance for biological research but one that often doesn’t receive the attention it deserves.

For the past three years Nature Methods has been publishing the Points of View column, one page a month dedicated to practical advice for researchers on how to create accessible and accurate visualizations of their data. The response to the column articles has been fantastic and most recently we organized them by topic here on our blog.

Unfortunately, a truth about data visualization is that no matter how good the visualization, if the experiment wasn’t appropriately designed and the data wasn’t analyzed correctly, the resulting visual depiction of the data will be inherently flawed. Nature Methods and the other Nature journals recently made changes to improve data and methods reporting as part of a reproducibility initiative. We feel this is an important first step in improving experimental reproducibility and repeatability, but by the time work is submitted for publication it can be difficult to correct shortcomings in experimental design and analysis.

A population distribution and a distribution of sample means.

In our September issue readers will find a new column, Points of Significance, that we hope will be as useful as the column that preceded it, perhaps more so. Martin Krzywinski, who has been writing the visualization column, is now joined by Naomi Altman, Professor of Statistics at The Pennsylvania State University. Among other things, Naomi will be responsible for ensuring that the information and advice we provide about statistics in every Points of Significance article is accurate.

The column has been expanded from one to two pages and will often have an Excel spreadsheet associated with it. This expansion will help us better communicate information that is less well served by display items. However, as illustrated by the figures in the first article of the column and the accompanying spreadsheet, visual displays will continue to play a vital role due to their strength in providing easily interpretable examples that can often be more readily grasped than mathematical or narrative descriptions.

We will strive to present the material so that each article in the column builds on prior ones. In this spirit the first article discusses populations and sampling, a foundation for nearly all topics to follow. The accompanying spreadsheet allows readers to play around with sampling and see for themselves how often values obtained from samples deviate substantially from those of the real population. It can be disconcerting to see just how often ‘bad luck’ gives a ‘wrong’ result in one set of measurements and the ‘right’ result in another, even though statistical measures suggest that the former is more likely to be ‘correct’ than the latter. This highlights that statistics cannot tell you whether you are right. But this doesn’t mean statistics has limited value. Instead, readers of scientific articles reporting statistical results need a healthy grasp of the limitations of statistical analysis, and users of statistics can always learn ways to improve the power of their analysis.
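For readers who would rather experiment in code than in a spreadsheet, the same kind of sampling exercise can be sketched in a few lines of Python. This is not the spreadsheet that accompanies the column: the normal population, its mean and standard deviation, the sample size and the function names are all assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate_sample_means(pop_mean=10.0, pop_sd=2.0, sample_size=5,
                          n_samples=10_000, seed=1):
    """Draw repeated samples from a normal population and return their means.

    Every default value here is an illustrative assumption, not a value
    taken from the column or its spreadsheet.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


def fraction_far_from_truth(means, pop_mean, threshold):
    """Fraction of sample means farther than `threshold` from the population mean."""
    far = sum(1 for m in means if abs(m - pop_mean) > threshold)
    return far / len(means)


if __name__ == "__main__":
    means = simulate_sample_means()
    # Spread of the sample means; theory predicts pop_sd / sqrt(sample_size).
    print("SD of sample means:", round(statistics.stdev(means), 3))
    # How often a sample of 5 misses the true mean by more than half a population SD.
    print("Fraction off by more than 1.0:",
          fraction_far_from_truth(means, pop_mean=10.0, threshold=1.0))
```

Changing `sample_size` and rerunning shows how quickly, or how slowly, larger samples tame this ‘bad luck’.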

The “aura of exactitude” that often surrounds statistics is one of the main notions that the Points of Significance column will attempt to dispel, while providing useful pointers on using and evaluating statistical measures. We expect that readers will find the upcoming October Points of Significance article on error bars and confidence intervals with its practical tips on interpreting these graphical elements to be particularly useful almost every time they read a manuscript containing these popular visual representations of uncertainty.

We hope readers enjoy Points of Significance. It is appropriate that the column is debuting during the International Year of Statistics. To allow readership by a wider audience each article will be free to access for a period of one month after it is published.

Update: All Points of Significance articles are now free access and have been collected together on a dedicated page in the nature.com “Statistics for biologists” resource.

For more on statistics, and particularly statistics training, don’t miss this September’s Editorial.

. . . . . . . .

Update: Below is a continuously updated list of the Points of Significance articles.

Importance of being uncertain – September 2013
How samples are used to estimate population statistics and what this means in terms of uncertainty.
Error Bars – October 2013
The use of error bars to represent uncertainty and advice on how to interpret them.
Significance, P values and t-tests – November 2013
Introduction to the concept of statistical significance and the one-sample t-test.
Power and sample size – December 2013
Using statistical power to optimize study design and sample numbers.
Visualizing samples with box plots – February 2014
Introduction to box plots and their use to illustrate the spread and differences of samples.
Comparing samples—part I – March 2014
How to use the two-sample t-test to compare either uncorrelated or correlated samples.
Comparing samples—part II – April 2014
Adjustment and reinterpretation of P values when large numbers of tests are performed.
Nonparametric tests – May 2014
Use of nonparametric tests to robustly compare skewed or ranked data.
Designing comparative experiments – June 2014
The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.
Analysis of variance and blocking – July 2014
Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.
Replication – September 2014
Technical replication reveals technical variation while biological replication is required for biological inference.
Nested designs – October 2014
Use the relative noise contribution of each layer in nested experimental designs to optimally allocate experimental resources using ANOVA.
Two-factor designs – December 2014
It is common in biological systems for multiple experimental factors to produce interacting effects on a system. A study design that allows these interactions can increase sensitivity.
Sources of variation – January 2015
To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication to collect replicable and meaningful results.
Split plot design – March 2015
When some experimental factors are harder to vary than others, a split plot design can be efficient for exploring the main (average) effects and interactions of the factors.
Bayes’ theorem – April 2015
Use Bayes’ theorem to combine prior knowledge with observations of a system and make predictions about it.
Bayesian statistics – May 2015
Unlike classical frequentist statistics, Bayesian statistics allows direct inference of the probability that a model is correct and it provides the ability to update this probability as new data is collected.
Sampling distributions and the bootstrap – June 2015
Use the bootstrap method to simulate new samples and assess the precision and bias of sample estimates.
Bayesian networks – September 2015
Model interactions between causes and effects in large networks of causal influences using Bayesian networks, which combine network analysis with Bayesian statistics.
Association, correlation and causation – October 2015
Pairwise dependencies can be characterized using correlation but be aware that correlation only implies association, not causation. Conversely, causation implies association, not correlation.
Simple linear regression – November 2015
Linear regression is a flexible way to predict the values of one variable using the values of the other to find a ‘best line’ through the data points.

Data visualization: A view of every Points of View column

We’ve organized all the Points of View columns on data visualization published in Nature Methods and provide this as a guide to accessing this trove of practical advice on visualizing scientific data.

As of July 30, 2013 Nature Methods has published 35 Points of View columns written by Bang Wong, Martin Krzywinski and their co-authors: Nils Gehlenborg, Cydney Nielsen, Noam Shoresh, Rikke Schmidt Kjærgaard, Erica Savig and Alberto Cairo. As we prepare to launch a new column in our September issue we felt this would be a good time to collect and organize links to all the Points of View articles together in one place to make it easier to navigate this wonderful resource that the authors have provided us. For the month of August we will be making all the columns free to access so everyone can benefit from this practical advice on data visualization.

This should not be the end of the Points of View column though. We will be inviting new visualization experts to author articles on new topics that have not been covered so far or which can be expanded on. This page will be continuously updated whenever a new article is published, so stay tuned. If you have a suggestion for a topic you would like to see covered in a future Points of View article, please comment below.

Update of March 28, 2015: A PDF eBook of the 38 Points of View articles published between August 2010 and February 2015 is now available at the Nature Shop for $7.99 under the title “Visual strategies for biological data: the collected Points of View”. The article summaries below provide a nice overview of what is contained in that eBook collection.

. . . . . . . .

Introduction
Visualizing biological data – December 2012
Data visualization is increasingly important, but it requires clear objectives and improved implementation
The overview figure – May 2011
An economical overview figure to convey general concepts helps readers understand a research study

. . . . . . . .

Composition and layout
The design process – December 2011
Use good design to balance self-expression with the need to satisfy an audience in a logical manner
Layout – October 2011
Proper layout reveals the hierarchical relationship of informational elements
Gestalt principles (Part 1) – November 2010
Gestalt principles (Part 2) – December 2010
Exploit perceptual phenomena to meaningfully arrange elements on the page
Negative space – January 2011
Whitespace is a powerful way of improving visual appeal and emphasizing content
Salience to relevance – November 2011
Ensure that viewers notice the right content by making relevant information most noticeable
Elements of visual style – May 2013
Translate the principles of effective writing to the process of figure design
Storytelling – August 2013
Relate your data to the world around them using the age-old custom of telling a story

. . . . . . . .

Using color
Color coding – August 2010
Choose colors appropriately to avoid bias and unwanted artifacts in visuals
Color blindness – June 2011
Make your graphics accessible to those with color vision deficiencies
Avoiding color – July 2011
Improve the overall clarity and utility of data displays by using alternatives to color
Mapping quantitative data to color – August 2012
Color is useful for compact visualizations of large data sets but must highlight salient features
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively

. . . . . . . .

Elements of a data figure
Typography – April 2011
Choose typefaces, sizes and spacing to clarify the structure and meaning of the text
Axes, ticks and grids – March 2013
Make navigational elements distinct and unobtrusive to maintain visual priority of data
Labels and callouts – April 2013
Figure labels require the same consistency and alignment in their layout as text
Plotting symbols – June 2013
Choose distinct symbols that overlap without ambiguity and communicate relationships in data
Arrows – September 2011
Use well-proportioned arrows sparingly and consistently as a guide through complex information

. . . . . . . .

Plot types
Bar charts and box plots – February 2014
Choose the appropriate plot according to the nature of the data and the task at hand
Sets and intersections – July 2014
Euler and Venn diagrams are appropriate for up to three sets but for greater numbers use more scalable plots
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively
Temporal data – February 2015
Use inherent properties of time to create effective visualizations
Unentangling complex plots – July 2015
Carefully designed subplots scaled to the data are often superior to a single complex overview plot
Pathways – January 2016
Apply visual grouping principles to add clarity to information flow in pathway diagrams
Neural circuit diagrams – March 2016
Use alignment and consistency to untangle complex neural circuit diagrams

. . . . . . . .

Improving figure clarity
Simplify to clarify – August 2011
Simplify your presentation to improve clarity
Design of data figures – September 2010
Improve figure decoding by using strong visual cues to encode data
Salience – October 2010
Use salience to differentiate graphical symbols and speed up figure reading
Points of review (Part 1) – February 2011
Examples of figure redesigns
Points of review (Part 2) – March 2011
Simple tips to improve pie chart, scatter plot and color scale data displays

. . . . . . . .

Multidimensional data
Into the third dimension – September 2012
3D visualizations are effective for spatial data but rarely for other data types
Power of the plane – October 2012
Combine 2D plots for effective visualization of multivariate data
Multidimensional data – July 2013
Visually organize complex data by mapping them onto familiar representations of biological systems

. . . . . . . .

Data exploration
Pencil and paper – November 2012
Quick sketches and doodles of data or models aid thinking and the scientific process
Data exploration – January 2012
Create ‘slices’ of data to enhance the process of pattern discovery
Networks – February 2012
Choose your network visualization based on the patterns you are looking for
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively
Integrating data – April 2012
Combine visualizations of multiple data types to find correlations and potential relationships
Representing the genome – May 2012
Limit what is displayed based on the question being asked
Managing deep data in genome browsers – June 2012
Compaction and summarization help find patterns in overwhelming data
Representing genomic structural variation – July 2012
Use arcs, color, dot plots and node graphs to show relations between distant genomic positions

. . . . . . . .

Return of the Points of View column

Our popular “Points of View” column returns this month after a brief hiatus. Here is a bit of history of the column and an introduction to its new author.

On this day four years ago Sean O’Donoghue contacted Nature Methods about a workshop he was organizing on visualizing biological data. This culminated in a Nature Methods Supplement on Visualizing Biological Data published one year later that coincided with the first VizBi meeting in Heidelberg, Germany.

During this meeting Bang Wong and I hatched the idea of a Nature Methods column that would provide practical advice on the visual presentation of data for researchers. Later that year our August issue featured Bang’s very first Points of View column, “Color coding”. What followed was a labor of love by both Bang and me, with plenty of stress over deadlines, that extended over two years.

The column seemed to fill a need in the community and generated considerable positive feedback, including from authors and reviewers who would sometimes refer to advice from Bang’s columns. At the end of 2012 Bang took a needed break and the column went on hiatus. But in the meantime I had again met someone at a meeting in Germany who was passionately interested in the visual display of data.

The Points of View column returns in our March issue authored by Martin Krzywinski (staff scientist, creator of the visualization software Circos, and former fashion photographer).

I decided we couldn’t let someone with Martin’s varied experiences debut as the new Points of View columnist without learning a bit more about him so I asked our Technology Editor, Vivien Marx, to see what she could dig up.

Martin Krzywinski

Current mode: Makes cancer research and genome analysis visual.
Introduction to genomics: Built computing infrastructure at Genome Sciences Center.
Past activities (incomplete): fashion photography, computer security, particle physics.
Published information graphics (incomplete): Book covers, American Scientist, EMBO Journal, PNAS, The New York Times, Wired, Conde Nast Portfolio.

Alex the rat

Q: You photographed Alex (2000-2002) and helped her become the poster rat for genome sequencing. For example, she was Genome Research’s rat cover-girl. She frequently rode on your shoulder and seems like a groovy friend.

M.K.: Don’t be fooled by Alex’s visual presentation. She bit me countless times. But what do you expect from a rat? Maybe it is I that never learned.

Q: In addition to photo-shoots with Alex, you have had human fashion models in front of your lens. Fashion is pretty. Why should science be pretty?
