Open data can help us avoid inherent biases in our work, says Ayushi Sood, winner of the Naturejobs ‘Better Science through Better Data’ writing competition.
Ayushi is an undergraduate microbiology student at Amity University, Noida, Uttar Pradesh. Her interest in what makes life tick made her fall in love with bacteria and astrobiology, and her passion for making scientific research more efficient and accessible led her to explore bioinformatics. She has been a part of research projects investigating nanoparticle-plant interactions, transgenic algae, and bacteria-algae associations.
Recently, an economist friend told me that “scientific inquiry is inherently cursed.” At first I was offended. But I had to agree after he elaborated further – science today suffers from something economists enigmatically call the “winner’s curse”.
The first scientific journals were print editions, something akin to a printed memo, circulated among researchers to update them on the findings of others in the field. A manuscript submitted for publication needed to include only the observations required to prove its results, and rightly so: if every paper included every shred of data, journals would run into thousands of pages. This means, though, that what was communicated to the scientific community was only a fraction of what could have been communicated: only the observations that were ‘winners’ – the ones which best supported a result – would be presented, and the others would be effectively relegated to obscurity. Although we’re not limited by paper and page counts today, the same pattern of data use continues. This leads us to the problem of the winner’s curse: by the process of selection, the ‘winning’ observation oversells itself.
In economics, the winner’s curse refers to auctions in which the winner tends to overpay, because the actual value of the item is better estimated by the average of the bids than by the highest bid. In scientific research, the curse takes hold when scientists aim for publication in the most selective journals, which favor the most impressive results. This ignores all the other results, the ones which weren’t so impressive, while giving disproportionate importance to the ‘winning result’.
The problem with this phenomenon isn’t immediately evident: isn’t the result what actually matters? The data is, after all, just a tool, necessary only to prove what’s important, namely the conclusion. In looking for conclusions in data, however, researchers can forget to ask: “does the conclusion effectively justify my repeated sampling of the real world?” In other words, is reality accurately reflected by the dataset presented? All the observations we take, whether they are inconclusive, negative, or ‘winners’, represent an analysis of the natural world. When we report only the ones that work, the other sampling efforts cannot be used by anyone else. This process confers on a small, selected number of observations the authority to predict an unpredictable future! Back in the auction house, this would mean the value of the product is set only by the winning bid. When we report only the best set of data, we are relegating the less impressive observations to obscurity, even though these also represent an analysis of the real world, with real potential to inform.
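A toy simulation can make the curse concrete. The sketch below is only an illustration under assumptions of my own: the true effect size, the noise level, the number of studies and the function name `simulate_winners_curse` are all hypothetical, chosen for the example rather than drawn from any real dataset. It imagines many labs measuring the same quantity with random error, then compares what we would believe if only the most impressive measurement were reported with what we would believe if every measurement were shared.

```python
import random
import statistics


def simulate_winners_curse(true_effect=1.0, noise_sd=0.5, n_studies=20,
                           n_rounds=2000, seed=0):
    """Compare the 'winning' estimate with the pooled estimate.

    All default values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    winners = []   # the most impressive estimate from each round
    pooled = []    # the average of every estimate from each round

    for _ in range(n_rounds):
        # Each study samples the same reality, plus measurement noise.
        estimates = [rng.gauss(true_effect, noise_sd) for _ in range(n_studies)]
        winners.append(max(estimates))               # only the 'winner' is reported
        pooled.append(statistics.fmean(estimates))   # every observation is shared

    return statistics.fmean(winners), statistics.fmean(pooled)


winner_mean, pooled_mean = simulate_winners_curse()
print(f"true effect:            {1.0:.2f}")
print(f"mean 'winning' result:  {winner_mean:.2f}")
print(f"mean of all results:    {pooled_mean:.2f}")
```

Under these assumptions the ‘winning’ result sits well above the true effect, while the average of all results stays close to it. This is the auction house again, with the highest bid standing in for the value of the product.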
So what does this mean for us? How should scientists avoid falling into the trap of the winner’s curse? One way would be to save, store and share all data, not just positive results. We are only human, and we make mistakes. By making our data openly accessible, we let others catch the inconsistencies we miss: the smallest of mistakes could be corrected by fresh eyes poring over the very same data.
More importantly, open data could prove to be a shot in the arm for scientific inquiry as a whole. The data I find important may be perfect for my study, yet a small cluster of ignored numbers in my dataset could lead to a breakthrough for someone else, possibly in a way that I could never have imagined! Gene expression data in cancer cells could provide insights into cell signaling pathways in neurodegenerative disorders. Algal bloom observations in polluted lakes could help in effective biomass production for algal biofuel. The analysis and application of open data could usher in a new age of scientific connectivity, with the available knowledge transcending traditional discipline boundaries in a way never seen before.
Well, if it’s so good, why hasn’t open data been the norm since science began? We come back to the thousand-page journal here: the question was never why not, but how. Transmitting every single byte of data through papers and talks was impossible before the advent of computers and the emergence of the internet in the 1990s. In 2017, however, we have the tools at our disposal to store, parse, organize and retrieve every single digit. The burgeoning field of data science and analysis is ours to exploit, just waiting to script the next scientific success story.
So, I have to hand it to the economists on this one — the winner’s curse is alive and kicking in science. But, like any good scientist, I’m thinking of solutions, and every clue suggests that open data, accessibility and collaboration could be just the spell that breaks this curse.