The University of California at Santa Cruz (UC-Santa Cruz) has opened a US$10.3-million cancer-genomics hub (CGHub) that will store and make available all the data from the three major US cancer-sequencing projects, including the Cancer Genome Atlas (TCGA).
The hub will combine all of the data from the projects in one place, making it the largest collection of cancer genomes accessible to researchers around the world. Project leader David Haussler, who also directs the Center for Biomolecular Science and Engineering at UC-Santa Cruz, says that the aggregation of all the data in one place is crucial to advancing the application of personal-genomics technologies in cancer.
“If we don’t do something about it,” he told the Sage Bionetworks Commons Congress meeting in San Francisco on 20 April, “[the data] will all be locked up in different medical centres and people will be concentrating on small cohorts, and never getting the statistical power we need to attack the disease.” (Watch Haussler’s talk about cancer genomics from the meeting here.)
The data were previously stored in a database maintained by the National Center for Biotechnology Information, but the sheer amount of data involved in the cancer projects threatened to overwhelm that resource. Haussler says that CGHub is planning to store a terabyte (1 trillion bytes) of data from each of the 10,000 patient tumours that will be sequenced as part of the Cancer Genome Atlas. CGHub’s current capacity is 5 petabytes (1 petabyte is roughly 1,000 terabytes), and it can be scaled up to 20 petabytes; the Cancer Genome Atlas alone could produce 10 petabytes of data in the next four years.
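The storage figures above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes decimal units (1 terabyte = 10^12 bytes, 1 petabyte = 10^15 bytes), and the function name `projected_storage_pb` is made up for this example.

```python
# Decimal storage units, as an assumption: 1 TB = 10**12 bytes, 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15


def projected_storage_pb(n_tumours, bytes_per_tumour=TB):
    """Return the total storage needed, in petabytes."""
    return n_tumours * bytes_per_tumour / PB


# Figures quoted in the article.
tcga_pb = projected_storage_pb(10_000)  # 10,000 tumours at 1 TB each
current_capacity_pb = 5
max_capacity_pb = 20

print(f"Projected TCGA data: {tcga_pb} PB")
print(f"Fits current capacity: {tcga_pb <= current_capacity_pb}")
print(f"Fits scaled-up capacity: {tcga_pb <= max_capacity_pb}")
```

On these numbers, the Cancer Genome Atlas alone would need twice the hub's current capacity but half of its scaled-up maximum.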
Haussler also hopes that aggregating the data in one place will enable a broader community of scientists to develop better tools for analysing them. Researchers might one day be able to access the CGHub data through cloud computing, as is now possible with data from the 1000 Genomes Project. They can already install their own computers in the San Diego Supercomputer Center, where the CGHub data are physically housed, so that they can run their own analyses.
Haussler hopes that this broader accessibility could lead to speedier solutions to some of the more difficult problems confronting the genome-analysis field. For instance, he told the Sage Commons Congress, different centres participating in the Cancer Genome Atlas project are using different methods to identify somatic mutations, which arise after conception and may contribute to cancer. As a result, the centres are identifying different mutations in exactly the same sequence data, mapped to the same reference genome in exactly the same way.
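One simple way to picture that disagreement is to treat each centre's output as a set of mutation calls and measure the overlap. The sketch below is purely illustrative and is not the method used by any Cancer Genome Atlas centre: the function `compare_call_sets`, the tuple layout (chromosome, position, reference base, variant base) and all of the coordinates are invented for this example.

```python
def compare_call_sets(calls_a, calls_b):
    """Compare two sets of mutation calls made on the same sequence data."""
    shared = calls_a & calls_b
    union = calls_a | calls_b
    return {
        "shared": shared,
        "only_a": calls_a - calls_b,
        "only_b": calls_b - calls_a,
        # Jaccard index: 1.0 means identical call sets, 0.0 means no overlap.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Invented calls: (chromosome, position, reference base, variant base).
centre_a = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "A")}
centre_b = {("chr1", 1000, "A", "G"), ("chr3", 3000, "G", "A"), ("chr4", 4000, "T", "C")}

result = compare_call_sets(centre_a, centre_b)
print(f"Shared calls: {len(result['shared'])}")
print(f"Unique to centre A: {len(result['only_a'])}")
print(f"Unique to centre B: {len(result['only_b'])}")
print(f"Jaccard index: {result['jaccard']}")
```

A low overlap between call sets made on identical input points to differences in the calling methods rather than in the data, which is the problem Haussler describes.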
“We do not have extremely precise tools that are guaranteed to give you the right answer,” Haussler said. “This is an incredibly hard problem. If it’s just a few gurus sitting around trying to work this out, it won’t happen.”
Follow Erika on Twitter at @Erika_Check.