Mercenary computer coders are helping scientists cope with the deluge of data pouring out of research labs. A contest to write software to analyse immune-system genes garnered more than 100 entries, including many that vastly outperformed existing programs.
The US$6,000 contest was launched by researchers at Harvard Medical School and Harvard Business School, both in Boston, Massachusetts. TopCoder.com, a community of more than 400,000 coders who compete in programming competitions, hosted the contest. The results are described in a letter published today in Nature Biotechnology.
The challenge was to analyse the genes involved in the production of antibodies and immune-system sentinels called T-cell receptors. These genes are assembled from dozens of modular DNA segments located throughout the genome. The segments can be mixed and matched to yield trillions of unique proteins, each capable of recognizing a different pathogen or foreign molecule.
Computer programs used to determine the origin of the segments that make up antibody and T-cell receptor genes are typically slow, and some run only on centralized supercomputers. Scientists “should be able to do this on their own laptop with a publicly available algorithm,” says Eva Guinan, a radiation oncologist at Harvard Medical School. Guinan helped to set up the contest with Karim Lakhani of Harvard Business School and with Ramy Arnaout and Kevin Boudreau, both at Harvard Medical School.
With the help of employees at TopCoder.com, Guinan’s team created a contest that expressed the problem in generic, non-biological terms, such as strings and substrings instead of gene sequences and gene segments. The contest ran for two weeks and awarded weekly $500 prizes to top performers.
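To give a sense of what the problem looks like once it is stated in terms of strings and substrings, here is a minimal Python sketch. It is not the contest's specification or any entrant's algorithm. The segment names, the toy sequences and the shared-substring scoring rule are all hypothetical, chosen only to illustrate the idea of asking which known reference string a query string most likely came from.

```python
def kmers(s, k):
    """Return every length-k substring of s."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def best_matching_segment(query, segments, k=4):
    """Score each reference string by how many of its distinct
    length-k substrings also occur in the query, and return the
    best-scoring name together with all scores.

    This is an illustrative scoring rule (an assumption), not the
    method used in the contest.
    """
    query_kmers = set(kmers(query, k))
    scores = {
        name: len(set(kmers(seq, k)) & query_kmers)
        for name, seq in segments.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three reference "segments" and one query
# string that embeds most of segment_B between unrelated letters.
segments = {
    "segment_A": "ACGTACGTTAGC",
    "segment_B": "TTGACCAGGTCA",
    "segment_C": "GGCATCGATCGA",
}
query = "CCCTTGACCAGGTGGG"

best, scores = best_matching_segment(query, segments)
print(best, scores)
```

Stated this way, the task needs no knowledge of immunology: a programmer only has to match strings quickly and accurately, which is the kind of problem that competitive coders already practise.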
In total, 122 people submitted computer programs to characterize the genes involved in immune responses. Half of the entrants were professional computer programmers, yet none worked as computational biologists. Contestants spent an average of 22 hours on the problem, accumulating a total of nearly 2,700 hours of development time.
Many of the submissions, including an algorithm designed by the immunology lab that came up with the challenge, ran much faster than existing software, returning solutions in seconds instead of minutes or hours. Sixteen of the submissions were also more accurate than the existing software. The code for the best-performing programs is now available for free download.
Guinan believes her team’s approach can easily be applied to other Big Data challenges. She is helping other Harvard scientists to crowdsource bioinformatics problems, such as interpreting computed tomography scans and distinguishing mutations from sequencing errors in HIV genetic data, and hopes scientists elsewhere will look outside academia for computational expertise. In late January, NASA sponsored a challenge on TopCoder.com to design software to more efficiently deploy the solar panels on the International Space Station.
“Expressing the problem in a way that computer scientists could understand was a key element in the success of this work,” says Alex Bateman, a computational biologist at the European Bioinformatics Institute near Cambridge, UK. “I think that this article should motivate many more groups to consider this kind of outsourcing.”
However, Manuel Corpas, a bioinformatician at the Genome Analysis Centre in Norwich, UK, sees some limits to crowdsourcing ‘Big Data’ analysis, particularly for problems that involve patient data, which is subject to strict controls.