Bring on the box plots

Box plots are excellent for visualizing the important summary statistics of sample data. We hope that a new online plotting tool, BoxPlotR, will help encourage their wider use in basic biological research.

The same three samples plotted by bar chart with s.e.m. error bars (left) and Tukey-style box plot (right). The box plot more clearly represents the underlying data.

A bar chart is often a person’s first choice of plot type for comparing values. This is appropriate when the values arise from counting. But when each value is a mean or median of data points taken from a sample, a bar chart is usually inappropriate. As discussed in our March Editorial and the accompanying Points of View and Points of Significance columns, a “mean-and-error” scatter-type plot or a box plot is more appropriate for sampled data. In summary, we strongly recommend box plots when you have at least five data points; for samples with fewer data points, mean-and-error plots are more appropriate.
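Part of what makes a box plot informative is that each of its elements is a well-defined statistic of the sample. As an illustrative sketch (in Python rather than the R used by BoxPlotR), the elements of a Tukey-style box plot can be computed like this:

```python
# Sketch of the statistics drawn by a Tukey-style box plot: the box spans
# the first and third quartiles, whiskers extend to the most extreme data
# points within 1.5 * IQR of the box, and anything beyond the whiskers is
# plotted individually as an outlier. Illustrative only; tools such as
# BoxPlotR compute these for you.
import statistics

def tukey_boxplot_stats(sample):
    data = sorted(sample)
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    whisker_lo = min(x for x in data if x >= lo_fence)
    whisker_hi = max(x for x in data if x <= hi_fence)
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whiskers": (whisker_lo, whisker_hi), "outliers": outliers}
```

Unlike a bar with error bars, every number this returns survives in the plot, which is why a box plot more faithfully represents the spread and skew of the underlying sample.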

Box plots are heavily used in biomedical research, a field in which statisticians have historically had considerable input into study design and analysis. But although similar types and quantities of sample data also appear in basic research (such as that published in Nature journals), box plots are much less common than bar charts in these manuscripts. Last year in Nature Methods, for example, ~80% of sampled data were plotted using bar charts.

Discussions we had with the community suggested that an impediment to using box plots instead of bar charts to graph sample data was the limited support for box plots in the plotting programs commonly used by researchers. It also became apparent that some software that did support box plots was deficient in communicating to users what the different elements of the plot represented. As a result, strangely labeled box plots were showing up in published papers. At NPG we thought it would be useful to provide authors with a simple online tool they could use to generate basic box plots of their data for publication.

The origin of BoxPlotR
At the VizBi 2013 conference in Cambridge, Massachusetts, I mentioned NPG’s desire for such a tool at a breakout session chaired by Martin Krzywinski in which the participants, including a young researcher named Jan Wildenhain, discussed what the community needed to create better figures. I also happened to mention our interest to Michaela Spitzer while visiting her poster, from the Juri Rappsilber and Mike Tyers labs, showing how the ‘shiny’ package by RStudio can be used to easily convert code written in R (a popular scripting language for statistics) into a visual application for exploring data.

Later at the conference Jan approached me and said he was intrigued by our desire for someone to design a webtool to create box plots and that he was interested in working on such a project. I happily told him to get in touch with me after the conference so we could discuss it further.

Three weeks after the conference concluded, I still hadn’t heard from Jan and was beginning to worry that he had decided not to pursue the project. Then, a few days later, I received an email from him. Much to my surprise, it contained a link to a highly functional tool that he and Michaela, on their own initiative, had gone ahead and created using shiny and R. What followed was a productive and rewarding period of discussion and development during which Michaela incorporated additional functionality and made selected design changes. The tool was so well designed and functional that I encouraged them to submit it to Nature Methods for publication as a Correspondence. After they incorporated further functionality and changes based on comments raised during peer review, BoxPlotR was ready for publication.

Sample BoxPlotR plots

Sample BoxPlotR plots. Top: simple Tukey-style box plot. Bottom: Tukey-style box plot with notches, means (crosses), 83% confidence intervals (gray bars; chosen so that non-overlapping intervals indicate significance at p = 0.05) and n values.

Launch of BoxPlotR
To accompany the publication and launch of BoxPlotR we thought it would be useful to provide some information and practical advice about box plots to our readers. Nils Gehlenborg, a former author of several Points of View articles with Bang Wong, agreed to resurrect that popular column for our February issue with an article on bar charts and box plots. Similarly, Martin Krzywinski and Naomi Altman agreed to delay our planned Points of Significance article on the two-sample and paired t-test and instead devote an article to box plots.

Seeing how the community responded to our interest in creating an online box plot tool and then working with them on this project has been a great experience. This never would have been possible without the initiative and talent of Jan and Michaela or the support they received from their PIs Mike and Juri. We hope both our authors and others find BoxPlotR useful and we encourage feedback. General comments can be made here on our blog or by emailing the journal. For specific bug reports and feature requests please see the contact information at https://boxplot.tyerslab.com.

Alberto Cairo on storytelling in science communication

Alberto Cairo responds to a Correspondence criticizing the use of storytelling techniques in scientific research articles and journalism.

Nature Methods’ August Points of View article by Alberto Cairo and Martin Krzywinski described how to use techniques of storytelling to design better scientific figures. That article prompted a passionate response from Yarden Katz arguing that storytelling has no place in scientific articles. Cairo and Krzywinski respond that their article was overinterpreted. This exchange prompted us to argue in the November Editorial that storytelling serves an important role when used properly.

In this guest post, Alberto Cairo expands on their printed response.

Yarden Katz’s thoughtful response to our short column about visual storytelling techniques in science communication makes many cogent observations. We will use them as a starting point for a deeper discussion of the contents of the column itself.

First of all, Katz reads too much into our words. As explained in our published response to his Correspondence, we didn’t advocate using storytelling to drive experiments. That is a legitimate concern, but promoting this idea was not our goal, so we won’t comment further on it.

Second, Katz presents an incomplete picture of what storytelling and journalism are. He says that “great storytellers embellish and conceal information as necessary to evoke a response in their audience. Inconvenient truths are swept away while marginalities are amplified or spun to make a point more spectacular.” This is a rather bold claim that may be guilty of the same malady it denounces: it highlights the worst and obscures the best in order to be emotionally powerful.

It is true that many journalists begin with a preconceived idea—a narrative structure—and then choose the data that best fit it. They cherry-pick evidence to make a stronger and clearer point. They magnify outliers without mentioning the overwhelming prevalence of average values. This is the problem Christopher Chabris identified in the work of the famous journalist Malcolm Gladwell in a recent long-form article (1).

This is not the approach we were trying to explain in our column. Proceeding this way, as Katz wrote, is wrong, and it is as wrong in science as it is in journalistic storytelling.

Moreover, we would like to remind Katz that there is a long-standing tradition in journalism of sticking to standards of truth close to those used in science. It was defined forty years ago by Professor Philip Meyer, of the University of North Carolina at Chapel Hill, as “precision journalism”: the use of social science research techniques in news reporting, such as surveys, statistics, data analysis and visualization. Ideally, all journalism should be based on a careful evaluation of data and evidence, but precision journalism tried to elevate the standard of what proper evidence really is, even given the pressures and tight deadlines journalists must endure and meet.

That tradition has evolved into different, greatly overlapping branches of journalism, among them computer-assisted reporting (CAR) and data-driven journalism (2). Its most famous exponent nowadays is Nate Silver, author of the blog FiveThirtyEight, who, using mainly Bayesian techniques, correctly predicted the results of several elections (3).

What is the method of journalists—storytellers—in these areas? They don’t pitch an idea and then try to find the best data to support and embellish it. Ideally, they begin with a fuzzy notion of what they want to focus on, then collect evidence systematically and let stories emerge from it. These stories may be the complete opposite of the notion they had in mind at the beginning. Finally, they write those stories or, as we suggested in our column, visualize them, in many cases with the close advice of experts in the areas they are covering (4). This is the storytelling tradition we were thinking of when writing our column.

Another point that we made is that the techniques described are helpful mainly when researchers need to communicate with non-specialized audiences. Journalists and storytellers are aware that people cannot absorb large amounts of information at once, and that in many cases they lack the background necessary to understand complex scientific research. As we wrote, “inviting readers to draw their own conclusions is risky because even simple messages can hide in simple data sets.”

However, and this is a critical point, nothing prevents researchers or journalists from presenting two or more competing interpretations when they are equally founded on evidence or when there is great uncertainty. Nor from first presenting their main conclusions in the form of an evidence-based visual story, a narrative or, at least, a compelling composition—not all information can be framed as a story, after all—and then letting those readers interested in exploring the multiple nuances or angles of an investigation access the data gathered and analyzed for it. This is something data journalists do today (5).

Any of those approaches would help avoid the challenge correctly pointed out by Katz: “complex experiments afford multiple interpretations and so such deviances from the singular narrative must be present somewhere.” Indeed. Just not at the first level of the presentation. To communicate effectively, information needs to be layered and sequenced in a way that can be processed correctly by audiences (6) while respecting all its nuances. For good examples of journalistic work that is both engaging and evidence-based, see the books by David Quammen, Carl Zimmer, or David Dobbs.

And it’s not just journalists who embrace this particular kind of storytelling technique. Many scientists do, too. As a recent example, take Michael E. Mann’s The Hockey Stick and the Climate Wars: Dispatches from the Front Lines, a book that presents the evidence for global warming in the form of a narrative that is deep, rich, and captivating at the same time.

I’d like to conclude by quoting the Yale University professor Robert P. Abelson, whose words we included in our column. In his best-known book, Statistics as Principled Argument (1995), Abelson wrote that he used to ask his students, “If your study were reported in the newspaper, what would the headline be?” That doesn’t mean the headline is the only element that should be reported. Rather, it should be the first element, followed by a discourse based on—to borrow Katz’s beautiful description—“evidence and arguments that are used—with varying degrees of certainty—to support models and theories.” Such a discourse is interesting to read and thoroughly respects the integrity and complexity of the underlying data. We therefore believe that storytelling, if carefully handled, can be compatible with the framing for presenting scientific results that Katz outlines.

Footnotes
(1) See https://blog.chabris.com/2013/10/why-malcolm-gladwell-matters-and-why.html

(2) The academic literature in communication studies and journalism has not reached agreement on how these categories should be defined. Basically, CAR focuses on the use of data and databases to inform traditional reporting (writing and speaking). Data-driven journalism expands the scope to also include the design of tools, such as visualizations and mobile apps, that let readers explore data.

(3) Silver’s blog used to be hosted by The New York Times. It has recently moved to ESPN.

(4) Journalists are, by tradition and training, jacks-of-all-trades, even those who specialize in research, statistics and computing.

(5) ProPublica and the Texas Tribune, for instance, are two independent, non-profit investigative journalism organizations that frame their projects as stories but usually let readers access the databases they put together and analyzed.

(6) Multiple recent books warn against the dangers of storytelling, cognitive biases and patternicity, the tendency to see patterns where none exist. Arguably, the most popular are Kahneman (2011) and Shermer (2011). However, both authors also concede that humans love stories and that we understand complicated information better when it is presented as a story. So why not take advantage of that, provided we are conscious of its possible shortcomings?

REFERENCES
Abelson, Robert P. (1995). Statistics as Principled Argument. Psychology Press.

Kahneman, Daniel (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

Mann, Michael E. (2012). The Hockey Stick and the Climate Wars: Dispatches from the Front Lines. Columbia University Press.

Meyer, Philip (1973). Precision Journalism: A Reporter’s Introduction to Social Science Methods. Indiana University Press.

Shermer, Michael (2011). The Believing Brain: From Ghosts and Gods to Politics and Conspiracies—How We Construct Beliefs and Reinforce Them as Truths. Times Books.

Silver, Nate (2012). The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t. Penguin Press.

Data visualization: A view of every Points of View column

We’ve organized all the Points of View columns on data visualization published in Nature Methods and provide this page as a guide to that trove of practical advice on visualizing scientific data.

As of July 30, 2013, Nature Methods has published 35 Points of View columns written by Bang Wong, Martin Krzywinski and their co-authors: Nils Gehlenborg, Cydney Nielsen, Noam Shoresh, Rikke Schmidt Kjærgaard, Erica Savig and Alberto Cairo. As we prepare to launch a new column in our September issue, we felt this would be a good time to collect and organize links to all the Points of View articles in one place to make it easier to navigate this wonderful resource the authors have provided. For the month of August we will be making all the columns free to access so everyone can benefit from this practical advice on data visualization.

This should not be the end of the Points of View column, though. We will be inviting new visualization experts to author articles on topics that have not yet been covered or that can be expanded on. This page will be updated whenever a new article is published, so stay tuned. If you have a suggestion for a topic you would like to see covered in a future Points of View article, please comment below.

Update of March 28, 2015: A PDF eBook of the 38 Points of View articles published between August 2010 and February 2015 is now available at the Nature Shop for $7.99 under the title “Visual strategies for biological data: the collected Points of View”. The article summaries below provide a nice overview of what is contained in that eBook collection.

. . . . . . . .

Introduction
Visualizing biological data – December 2012
Data visualization is increasingly important, but it requires clear objectives and improved implementation
The overview figure – May 2011
An economical overview figure to convey general concepts helps readers understand a research study

. . . . . . . .

Composition and layout
The design process – December 2011
Use good design to balance self-expression with the need to satisfy an audience in a logical manner
Layout – October 2011
Proper layout reveals the hierarchical relationship of informational elements
Gestalt principles (Part 1) – November 2010
Gestalt principles (Part 2) – December 2010
Exploit perceptual phenomena to meaningfully arrange elements on the page
Negative space – January 2011
Whitespace is a powerful way of improving visual appeal and emphasizing content
Salience to relevance – November 2011
Ensure that viewers notice the right content by making relevant information most noticeable
Elements of visual style – May 2013
Translate the principles of effective writing to the process of figure design
Storytelling – August 2013
Relate your data to the world around them using the age-old custom of telling a story

. . . . . . . .

Using color
Color coding – August 2010
Choose colors appropriately to avoid bias and unwanted artifacts in visuals
Color blindness – June 2011
Make your graphics accessible to those with color vision deficiencies
Avoiding color – July 2011
Improve the overall clarity and utility of data displays by using alternatives to color
Mapping quantitative data to color – August 2012
Color is useful for compact visualizations of large data sets but must highlight salient features
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively

. . . . . . . .

Elements of a figure
Typography – April 2011
Choose typefaces, sizes and spacing to clarify the structure and meaning of the text
Axes, ticks and grids – March 2013
Make navigational elements distinct and unobtrusive to maintain visual priority of data
Labels and callouts – April 2013
Figure labels require the same consistency and alignment in their layout as text
Plotting symbols – June 2013
Choose distinct symbols that overlap without ambiguity and communicate relationships in data
Arrows – September 2011
Use well-proportioned arrows sparingly and consistently as a guide through complex information

. . . . . . . .

Plot types
Bar charts and box plots – February 2014
Choose the appropriate plot according to the nature of the data and the task at hand
Sets and intersections – July 2014
Euler and Venn diagrams are appropriate for up to three sets but for greater numbers use more scalable plots
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively
Temporal data – Feb 2015
Use inherent properties of time to create effective visualizations
Unentangling complex plots – July 2015
Carefully designed subplots scaled to the data are often superior to a single complex overview plot
Pathways – January 2016
Apply visual grouping principles to add clarity to information flow in pathway diagrams
Neural circuit diagrams – March 2016
Use alignment and consistency to untangle complex neural circuit diagrams

. . . . . . . .

Improving figure clarity
Simplify to clarify – August 2011
Simplify your presentation to improve clarity
Design of data figures – September 2010
Improve figure decoding by using strong visual cues to encode data
Salience – October 2010
Use salience to differentiate graphical symbols and speed up figure reading
Points of review (Part 1) – February 2011
Examples of figure redesigns
Points of review (Part 2) – March 2011
Simple tips to improve pie chart, scatter plot and color scale data displays

. . . . . . . .

Multidimensional data
Into the third dimension – September 2012
3D visualizations are effective for spatial data but rarely for other data types
Power of the plane – October 2012
Combine 2D plots for effective visualization of multivariate data
Multidimensional data – July 2013
Visually organize complex data by mapping them onto familiar representations of biological systems

. . . . . . . .

Data exploration
Pencil and paper – November 2012
Quick sketches and doodles of data or models aid thinking and the scientific process
Data exploration – January 2012
Create ‘slices’ of data to enhance the process of pattern discovery
Networks – February 2012
Choose your network visualization based on the patterns you are looking for
Heat maps – March 2012
Color, clustering and parallel coordinate plots are essential for using heatmaps effectively
Integrating data – April 2012
Combine visualizations of multiple data types to find correlations and potential relationships
Representing the genome – May 2012
Limit what is displayed based on the question being asked
Managing deep data in genome browsers – June 2012
Compaction and summarization help find patterns in overwhelming data
Representing genomic structural variation – July 2012
Use arcs, color, dot plots and node graphs to show relations between distant genomic positions

. . . . . . . .

Return of the Points of View column

Our popular “Points of View” column returns this month after a brief hiatus. Here is a bit of history of the column and an introduction to its new author.

On this day four years ago, Sean O’Donoghue contacted Nature Methods about a workshop he was organizing on visualizing biological data. This culminated in a Nature Methods Supplement on visualizing biological data, published one year later to coincide with the first VizBi meeting in Heidelberg, Germany.

During this meeting, Bang Wong and I hatched the idea of a Nature Methods column that would provide researchers with practical advice on the visual presentation of data. Later that year, our August issue featured Bang’s very first Points of View column, “Color coding“. What followed was a labor of love by both Bang and me, with plenty of stress over deadlines, that extended over two years.

The column seemed to fill a need in the community and generated considerable positive feedback, including from authors and reviewers who would sometimes refer to advice from Bang’s columns. At the end of 2012, Bang took a needed break and the column went on hiatus. But in the meantime I had once again met someone at a meeting in Germany who was passionately interested in the visual display of data.

The Points of View column returns in our March issue authored by Martin Krzywinski (staff scientist, creator of the visualization software Circos, and former fashion photographer).

I decided we couldn’t let someone with Martin’s varied experiences debut as the new Points of View columnist without learning a bit more about him so I asked our Technology Editor, Vivien Marx, to see what she could dig up.

Martin Krzywinski

Current mode: Makes cancer research and genome analysis visual.
Introduction to genomics: Built computing infrastructure at the Genome Sciences Centre.
Past activities (incomplete): fashion photography, computer security, particle physics.
Published information graphics (incomplete): book covers, American Scientist, EMBO Journal, PNAS, The New York Times, Wired, Condé Nast Portfolio.

Alex the rat

Q: You photographed Alex (2000-2002) and helped her become the poster rat for genome sequencing. For example, she was Genome Research’s rat cover-girl. She frequently rode on your shoulder and seems like a groovy friend.

M.K.: Don’t be fooled by Alex’s visual presentation. She bit me countless times. But what do you expect from a rat? Maybe it is I that never learned.

Q: In addition to photo-shoots with Alex, you have had human fashion models in front of your lens. Fashion is pretty. Why should science be pretty?
