It’s been nearly a decade since Eric Welsh first noticed some weirdness with Microsoft Excel. A senior staff scientist in the Cancer Informatics Core at the H. Lee Moffitt Cancer Center and Research Institute in Tampa, Florida, Welsh was using Microsoft’s venerable spreadsheet application to view mouse and human gene expression data, the better to sort and understand the numbers. But a quick glance revealed the import hadn’t gone exactly as planned. “Excel would screw them up every time,” he says.
How so? When data are imported into Excel, the program works hard to figure out what kind of value each cell holds. Most of the time, Excel is smart enough to do that correctly, and values like ‘BRCA1’ and ‘12345’ are converted into text and integers, as expected. But “Excel is a little too smart for its own good,” Welsh says. If a cell reads “SEPT7,” the program assumes the author meant to write a date, and converts it automatically. It also sometimes translates what appear to be numbers in scientific notation – say, ‘2310009E13’ – into actual scientific notation (‘2.31E+13’). The problem is, those two terms are neither dates nor numbers – they are proper names, scientifically speaking: gene names, sample identifiers or accession numbers. And when Excel autoconverts them, those names are lost, or at least, obscured.
Researchers have been aware of this issue since at least 2004, when the first report was published in BMC Bioinformatics. In 2016, Mark Ziemann of the Baker IDI Heart and Diabetes Institute in Melbourne, Australia, and colleagues revisited the problem, reporting in Genome Biology that 704 of 3,597 papers from 18 journals that contained gene lists in Excel format contained mangled gene names. That’s nearly one in five. An independent analysis by the editors of Nature Genetics obtained similar results. Fake news site The Allium summed up the problem admirably, writing, “Scientific community capitulates to Microsoft, officially changes all gene names to dates.”
Excel “tries to be helpful but makes the wrong choice and ends up corrupting the data,” Welsh says. “And this can be silent. And there’s no way to properly turn these smart features off.” (It is possible, on a column-by-column basis, to manually control how Excel treats spreadsheets in some cases, he notes.)
Several years ago, Welsh found a workaround online. As it turns out, Excel will ignore text that is embedded in an equation. He encoded that solution in a program called ‘Escape Excel,’ and posted a preprint describing the tool on bioRxiv on 27 January. (ETA: That manuscript was published 27 September in PLOS ONE.) The abstract has been read over 2,800 times, according to coauthor Paul Stewart.
Welsh explains the name thusly: “In computer terms, an ‘escape sequence’ is something you put in front of text to keep software from interpreting that text as something other than what you intended it to be.” Executed from the command line or the Galaxy bioinformatics environment, Escape Excel scans tab-delimited text files – a format popular in bioinformatics – for sequences likely to be mangled by Excel, and converts those values into equations, outputting a new datafile that can be safely opened.
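For the curious, the general idea can be sketched in a few lines of Python. This is not the Escape Excel code itself: the two patterns below are simplified assumptions about what a spreadsheet might mistake for a date or a number, the function names are hypothetical, and the `="..."` wrapper is assumed to be the equation form the workaround relies on.

```python
import re

# Hypothetical patterns for values a spreadsheet might reinterpret. These are
# illustrative assumptions, not the rules Escape Excel itself applies.
MONTH_LIKE = re.compile(
    r"^(jan|feb|mar|march|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)[-/ ]?\d{1,2}$",
    re.IGNORECASE,
)
SCI_NOTATION_LIKE = re.compile(r"^\d+(\.\d+)?e[+-]?\d+$", re.IGNORECASE)


def looks_risky(value):
    """Return True if the value resembles a date or scientific notation."""
    return bool(MONTH_LIKE.match(value) or SCI_NOTATION_LIKE.match(value))


def escape_value(value):
    """Wrap a risky value in an equation so it is kept as text."""
    if looks_risky(value):
        return '="' + value + '"'
    return value


def escape_tsv(text):
    """Escape every risky field in a block of tab-delimited text."""
    escaped_lines = []
    for line in text.splitlines():
        fields = line.split("\t")
        escaped_lines.append("\t".join(escape_value(f) for f in fields))
    return "\n".join(escaped_lines)
```

Here, `escape_value("SEPT7")` returns `="SEPT7"`, while `escape_value("BRCA1")` and `escape_value("12345")` come back unchanged. A real tool would need a far more thorough set of patterns than these two.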
Few dyed-in-the-wool computational biologists would bother to do so, of course; they can extract and manipulate the data programmatically using languages like R and Python. But most biologists aren’t fluent in those languages, and Excel is ubiquitous, says Stewart, a postdoc at Moffitt who developed Escape Excel’s Galaxy implementation. “Excel is and still can be a useful tool,” he says.
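As a rough illustration of the programmatic route, here is a minimal Python sketch using only the standard library. The sample data and the column names `gene` and `expression` are made up for the example. The point is that nothing is converted unless the code asks for it.

```python
import csv
import io

# Hypothetical sample data standing in for a real tab-delimited file.
SAMPLE = "gene\texpression\nSEPT7\t12.5\nMARCH1\t3.2\nBRCA1\t8.9\n"


def read_gene_table(text):
    """Parse tab-delimited text into a list of dicts; every value stays a string."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_gene_table(SAMPLE)

# Convert only the column meant to be numeric, and only when sorting.
ranked = sorted(rows, key=lambda row: float(row["expression"]), reverse=True)
names = [row["gene"] for row in ranked]
```

The gene names stay as plain text throughout, and `names` ends up holding them in descending order of expression.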
The fact of the matter is, says Jennifer Bryan, an associate professor of statistics at the University of British Columbia in Vancouver (currently on leave as a developer at RStudio), researchers need some way to view and sort tabular data. Lightweight alternatives to Excel have been developed, including Comma Chameleon and TableTool. But none is really ready for prime time, Bryan says. Google Sheets can also open tab-delimited files without name-mangling, but it is web-based, and can be relatively slow when handling very large tables.
Ideally, Bryan says, biologists would wean themselves off Excel with bioinformatics training. And funding agencies would sponsor the development of tools to fill the gap. “This is the equivalent of building sewers and highways, but for science,” she says. “These are really unsexy, important projects, and this is what your government should be helping you build.”
That this isn’t happening clearly frustrates the bioinformatics community. Upon hearing the news about Escape Excel, Bryan echoed many when she tweeted, “I applaud the effort to do [something] about this problem, but this is what it sounds like when genomics cries for help.”
9 October 2017: This post has been updated to reflect the publication of the Escape Excel manuscript in PLOS ONE.