The rise of data-intensive research is increasing the need for collaborative science.
Guest contributor Lakshini Mendis
Big data, a term thought to have originated in the mid-1990s, is a current buzzword amongst scientific communities. Rather than referring solely to the size of complex datasets, the term broadly encompasses all aspects of working with large datasets, from acquisition to analysis.
Big data in science
At its core, scientific research is driven by our curiosity to understand the relationship between cause and effect. Traditionally, ‘hypothesis-driven’ experiments are designed to answer a specific question about a cause-effect relationship.
However, over the last sixty years there has been a trillion-fold increase in computing performance, and the per-capita capacity to store information has roughly doubled every forty months since the 1980s. These technological advances are revolutionising almost all facets of human life, including how scientific research is conducted.
In contrast to the traditional ‘hypothesis-driven’ approach, advancing technology allows us to acquire larger, more complex datasets, encompassing as many variables as possible, without bias from preconceived ideas. Powerful computation also enables us to finally realize the full potential of decades-old mathematical and statistical concepts. We can now sift through many variables and identify numerous cause-effect relationships in the same dataset, which would have previously been undetectable to the unaided human mind. These principles are now being applied to diverse fields, from astronomy to neuroscience, from particle physics to genomics.
The need for collaboration
The National Human Genome Research Institute reports that the cost of sequencing a human-sized genome was almost US$10 million in 2001, a figure that halved within a couple of years. The Human Genome Project took 13 years and cost about US$2.7 billion; today, however, human whole-genome sequencing is more affordable and accessible than ever. Illumina’s HiSeq X Ten System can now sequence “over 18,000 human genomes per year at the price of about $1000 per genome”. Advances such as this have allowed scientists like Theodora Ross from UT Southwestern Medical Center to use whole-genome sequencing to identify novel mutations in “mystery breast cancer patients” – those with a strong family history of cancer but no BRCA mutation. Advances in human whole-genome sequencing are also paving the way for large-population studies, which in turn are inching us toward precision medicine.
Thus, a lack of data is no longer the bottleneck to discovery. Rather, the challenge now lies in effectively managing, analysing, and sharing large datasets.
Initiatives such as the Open Science Data Cloud and the Multi-Institutional Open Storage Research InfraStructure provide online repositories where large datasets can be stored efficiently and shared between different groups. Effectively analysing complex datasets, however, requires abilities that often extend beyond a single researcher’s immediate skillset. Even the most tech-savvy researcher can lack some of the mathematical and computational expertise needed to interpret large datasets correctly. Thus, collaboration is key. A versatile team comprising researchers, software engineers, bioinformaticians, and statisticians lets each member focus on what they do best. The sole researcher no longer needs to become a ‘jack-of-all-trades’; what is needed instead is clear communication between the experts of each field.
Addressing the barrier to collaboration
When it comes to projects conducted on a smaller scale, however, many researchers are still apprehensive about openly sharing their data. The reasons cited include intellectual property concerns and the fear of being scooped. These concerns have been generated, in part, by the hypercompetitive environment of research, where a high-impact-factor publication has become the ultimate goal for scientists, no matter the cost.
Journals such as Scientific Data and GigaScience encourage researchers to share their data openly by recognizing those contributions as publications. Disseminating the entire dataset also helps validate the interpretation of the data and the findings drawn from it, and it enables other researchers to reuse the data to investigate their own hypotheses while guaranteeing proper acknowledgment of the source. For instance, different researchers can make maximal use of a single large mass spectrometry dataset to investigate different proteins of interest, without spending additional time and resources on data collection. This approach can help streamline scientific discovery and make efficient use of funding.
There are already discernible changes in the scientific research landscape that address the challenges of big data projects. However, the rise of data-intensive research requires a change of mindset amongst scientists. There is an increased need for multidisciplinary research teams, with clear communication between experts in different fields. Scientists also need to be innovative and become more aware of the tools that will enable them to collaborate widely and share data openly. These changes will help us fully grasp the potential of big data and accelerate understanding.
Lakshini Mendis is a winner of the 2015 Scientific Data writing competition. She is also a PhD student at the Centre for Brain Research in Auckland, where she studies how the human brain changes in Alzheimer’s disease. She is passionate about good science communication, is a strong advocate for women in STEM, and volunteers as Editor-in-Chief at The Scientista Foundation. Follow her musings on Twitter!