#Scidata15: Make the most of your research: Publish better data

Primary research papers are the currency of academics, but they’re also part of a much wider body of knowledge that is restricted by a lack of transparency.

Guest contributor Lakshini Mendis

{credit}Image credit: SCIENTIFIC DATA/LUDIC GROUP{/credit}

Historically, a great deal of trust has been placed in statements made in research papers for which the underlying data have not been shared. The invention of the laser was described in a paper containing just three data points, for instance, and Watson and Crick first described the structure of DNA in a paper without any data at all. But with about 1,500 papers retracted since 2012, 26.6% of them for misconduct, scientific papers are now firmly under the microscope.

Improving the availability and readability of original research data would go a long way towards addressing this. And as scientific publishers largely determine how research data are disseminated, their involvement will be central to any change. Speaking at the Publishing Better Science Through Better Data conference in late October 2015, Dr Joerg Heber and Dr Andrew Hufton, editors at Nature Communications and Scientific Data respectively, emphasised that to make the most of research data, we must make them more open.

Overcoming the data-sharing challenge

According to Hufton, the status quo is for researchers to share data with others only on direct request. As well as being inefficient, this means that data associated with published work disappear at a rate of about 17% a year as researchers fail to properly catalogue their findings. There is now, therefore, a move by scientific publishers to make data findable, accessible, interoperable and re-useable – or, to use an acronym as those of a scientific persuasion are so often inclined to do, FAIR.
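What FAIR means in practice is easiest to see from a dataset's metadata record. Below is a minimal sketch in Python; the field names, identifiers and values are all invented for illustration and do not follow any particular repository's schema.

```python
# A hypothetical, minimal metadata record illustrating the FAIR principles:
# a persistent identifier makes the data findable; a stated licence and an
# open format make them accessible and re-usable; standard keywords make
# them interoperable. All values are invented for illustration.
import json

dataset_metadata = {
    "identifier": "doi:10.1234/example.5678",      # hypothetical DOI
    "title": "RNA-seq of example tissue samples",
    "creators": ["A. Researcher", "B. Collaborator"],
    "publication_year": 2015,
    "format": "FASTQ",
    "licence": "CC-BY-4.0",
    "keywords": ["transcriptomics", "RNA-seq"],
    "related_publication": "doi:10.1234/paper.9012",
}

# A machine-readable serialisation is what repositories index and search.
print(json.dumps(dataset_metadata, indent=2))
```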

#Scidata15: Big data: Challenges create opportunities

The era of big data brings with it a sea of opportunities for development and innovation.

Guest contributor Daniela Quaglia

{credit}Image credit: SCIENTIFIC DATA/LUDIC GROUP{/credit}

Big data is here to stay. As scientists, we stand to benefit by being part of this exciting revolution. At the second Publishing Better Science through Better Data conference, held in London on October 23rd, Dr. Ewan Birney, joint associate director of the European Bioinformatics Institute (EBI), and Dr. Timo Hannay, founder of SchoolDash (a website that provides statistics about schools in England), walked us through some of the opportunities that arise from working with big data.

Opportunities in biology

Birney spoke about how the increase in big data is influencing the way we do biology. He promised to give the audience “an EBI-centric view of the world”. I’m glad he did, because every scientist wanting to use big data should understand how EBI can help them.

EBI takes data provided by laboratories and stores, verifies, classifies and shares it. This approach means that a wealth of molecular-biology data, from DNA sequences to full systems (such as biomolecular pathways and metabolomics data), can be found in one place. As most scientists do not want to have to work from shared data in their raw form, the institute also works with the scientific community to convert original data into useful formats. Data from the Human Genome Project provides a compelling example of how such transformations can benefit the community — as Birney pointed out, not even the most experienced researchers want to analyse such complex raw data.
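One concrete way into those resources is programmatic access. The snippet below sketches fetching a nucleotide sequence from EBI's European Nucleotide Archive over HTTP; the endpoint form and accession are illustrative assumptions, so consult the current ENA documentation before relying on them.

```python
# Sketch of retrieving shared data from an EBI resource over HTTP.
# The endpoint form and accession are assumptions for illustration;
# check the current ENA documentation for the supported API.
import urllib.request

accession = "U00096"  # E. coli K-12 genome, used here as an example
url = f"https://www.ebi.ac.uk/ena/browser/api/fasta/{accession}"

with urllib.request.urlopen(url) as response:
    fasta = response.read().decode("utf-8")

print(fasta.splitlines()[0])  # the FASTA header line
```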

Data sharing: Contribute to the community

Data sharing can make a significant contribution to the scientific community, but it comes with challenges, says Caroline Weight.

Guest contributor Caroline Weight

We have all heard of it. We are all worried about it. We hear whispers of it in the corridors. We are advised to be careful what we say to ‘others’. We constantly check the literature. It matters to us. After all, it is our careers on the line.

‘Scooped’.

The process of publication is rigorous, competitive and tricky. It’s not uncommon for five years to pass between writing the grant application and publishing the work. Big labs with state-of-the-art facilities stand a better chance of getting their work out first, given their extra manpower and often more-established protocols. This race for ownership of the data makes it difficult to share information and present new findings at meetings or conferences. Even at manuscript submission, authors often have the chance to exclude particular referees, whether because of conflicts of interest or personal rivalries, to keep novel concepts and data out of competitors’ hands until they are public. Not until the publication has been accepted and is in print can you heave a sigh of relief and move on to the next project. Yet sharing data is essential to the progression of science in the modern world.

Data sharing: Fewer experiments, more knowledge

Data sharing will reduce the number of experiments needed in the lab and increase the speed of knowledge generation by cutting the time spent producing equivalent datasets.

Guest contributor Ana Sofia Figueiredo

I’m a postdoctoral scientist in systems biology at the University of Magdeburg, Germany. There, I build mathematical models to understand the mechanisms behind certain biological processes, such as energy production by cells under extreme conditions. These mathematical models are simplified representations of reality: all of them are wrong, but some can be useful. When well parameterized with data, these models give a quantitative representation, and a better understanding, of such biological processes. Using a systems biology approach, I can do experiments in silico that are very difficult or technically impossible to do in vitro or in vivo. However, a model is only as good as the data it incorporates.

When I have access to publicly available experimental datasets, I can plug the data into my models and, from the synergy of combining mathematical models with experimental data, learn more about the biological system at hand.
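As a minimal sketch of that workflow (the kinetic model, data and rate constants below are invented for illustration, not taken from any real study), one might fit a toy model of metabolite production to a shared time course and then run an in silico experiment with it:

```python
# Minimal sketch of parameterizing a toy kinetic model with shared data.
# Invented model: a metabolite produced at constant rate k_prod and
# degraded at rate k_deg, starting from zero concentration:
#   m(t) = (k_prod / k_deg) * (1 - exp(-k_deg * t))
import numpy as np
from scipy.optimize import curve_fit

def model(t, k_prod, k_deg):
    return (k_prod / k_deg) * (1.0 - np.exp(-k_deg * t))

# Stand-in for a publicly shared time course (time in hours, concentration
# in arbitrary units); in practice this would come from a repository.
t_data = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
m_data = np.array([0.9, 1.6, 2.6, 3.6, 4.1])

# Parameterize the model with the experimental data.
(k_prod, k_deg), _ = curve_fit(model, t_data, m_data, p0=[1.0, 0.5])
print(f"fitted k_prod={k_prod:.2f}, k_deg={k_deg:.2f}")

# An 'in silico experiment': predict behaviour beyond the measured range.
print(f"predicted m at 24 h: {model(24.0, k_prod, k_deg):.2f}")
```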

Sharing data, models and experimental protocols can push forward the generation of knowledge in science.

Big data: Collaborative science

The rise of data-intensive research is increasing the need for collaborative science.

Guest contributor Lakshini Mendis

{credit}Image credit: iStock/Thinkstock{/credit}

Big data, a term thought to have originated in the mid-90s, is a current buzzword amongst scientific communities. Rather than referring solely to the size of complex datasets, the term broadly encompasses all aspects of working with large datasets, from acquisition to analysis.

Big data in science

At its core, scientific research is driven by our curiosity to understand the relationship between cause and effect. Traditionally, ‘hypothesis-driven’ experiments are designed to answer a specific question about a cause-effect relationship.

However, over the last sixty years there has been a trillion-fold increase in computing performance. The per-capita capacity to store information has roughly doubled every forty months since the 1980s. These technological advances are revolutionising almost all facets of human life, including how scientific research is conducted.

In contrast to the traditional ‘hypothesis-driven’ approach, advancing technology allows us to acquire larger, more complex datasets, encompassing as many variables as possible, without bias from preconceived ideas. Powerful computation also enables us to finally realize the full potential of decades-old mathematical and statistical concepts. We can now sift through many variables and identify numerous cause-effect relationships in the same dataset that would previously have gone undetected by the unaided human mind. These principles are now being applied to diverse fields, from astronomy to neuroscience, from particle physics to genomics.
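As a rough illustration of this kind of hypothesis-free analysis, the sketch below screens a thousand simulated variables for association with an outcome and applies a multiple-testing correction; all numbers are randomly generated.

```python
# Sketch of a hypothesis-free screen: test many variables for association
# with an outcome, then apply a Bonferroni correction so that running a
# thousand tests does not flood the analysis with false positives.
# All data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_variables = 200, 1000
data = rng.normal(size=(n_samples, n_variables))

# Give the outcome a genuine dependence on variable 42, so the screen
# has something real to find.
outcome = 0.5 * data[:, 42] + rng.normal(size=n_samples)

p_values = np.array(
    [stats.pearsonr(data[:, j], outcome)[1] for j in range(n_variables)]
)
hits = np.flatnonzero(p_values < 0.05 / n_variables)  # Bonferroni threshold
print("variables associated with the outcome:", hits)
```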

The need for collaboration

The National Human Genome Research Institute reports that the cost of sequencing a human-sized genome was almost US$100 million in 2001, a figure that had halved within a couple of years. The Human Genome Project took 13 years and cost about US$2.7 billion; human whole-genome sequencing is now more affordable and accessible than ever. Today, Illumina’s HiSeq X Ten System can sequence “over 18,000 human genomes per year at the price of about $1000 per genome”. Advances such as this have allowed scientists like Theodora Ross of UT Southwestern Medical Center to use whole-genome sequencing to identify novel mutations in “mystery breast cancer patients” – those with a strong family history of cancer but no BRCA mutation. These advances are also paving the way for large-population studies, which in turn are inching us toward precision medicine.
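A quick back-of-the-envelope comparison, using only the figures quoted above, shows the scale of the change:

```python
# Rough comparison using the figures quoted above.
hgp_cost = 2.7e9          # Human Genome Project: ~US$2.7 billion for one genome
cost_per_genome = 1000.0  # approximate HiSeq X Ten price per genome
genomes_per_year = 18000  # approximate HiSeq X Ten throughput

print(f"per-genome cost ratio: {hgp_cost / cost_per_genome:,.0f}x")
print(f"a full year of sequencing: ${cost_per_genome * genomes_per_year:,.0f}")
```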

Thus, a lack of data is no longer the bottleneck to discovery. Rather, it is the effective management, analysis, and sharing of large datasets that now pose a challenge.

Initiatives such as the Open Science Data Cloud and the Multi-Institutional Open Storage Research InfraStructure provide online repositories where large datasets can be stored efficiently and shared between different groups. Effectively analysing complex datasets requires abilities that often extend beyond a single researcher’s immediate skillset: even the most tech-savvy researcher can struggle with some of the mathematical and computational expertise needed to interpret large datasets correctly. Thus, collaboration is key. A versatile team of researchers, software engineers, bioinformaticians and statisticians lets each member focus on what they do best. There is no longer any requirement for a lone researcher to be a ‘jack-of-all-trades’. There is, however, a need for clear communication between the experts in each field.

Current global big data projects, such as the Sloan Digital Sky Survey, the Blue Brain Project, the Human Proteome Project and HapMap, effectively demonstrate the value of collaboration.

Addressing the barrier to collaboration

However, when it comes to smaller-scale projects, many researchers are still apprehensive about openly sharing their data. The reasons cited include intellectual property concerns and the fear of being scooped. These concerns have been generated, in part, by the hypercompetitive environment of research, in which a high-impact publication alone has become the ultimate goal of scientists, no matter the cost.

Journals such as Scientific Data and GigaScience encourage researchers to share their data openly by recognizing their contributions as publications. Further, disseminating the entire dataset helps validate the interpretation of the data and the findings drawn from it. It also opens the door for other researchers to reuse the data to investigate their own hypotheses, while guaranteeing proper acknowledgment of the source. For instance, different researchers can make maximal use of a large mass spectrometry dataset to investigate different proteins of interest, without spending additional time and resources on data collection. This approach can help streamline scientific discovery and make efficient use of funding.
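As a sketch of that kind of reuse (the table, accessions and column names here are hypothetical stand-ins for a deposited dataset), two groups might mine the same data for different proteins:

```python
# Sketch of dataset reuse: two groups query one shared proteomics table
# for different proteins. The table is a hypothetical stand-in for a
# dataset downloaded from a public repository.
import pandas as pd

dataset = pd.DataFrame({
    "protein_id": ["P69905", "P69905", "P04637", "P04637"],
    "sample":     ["s1", "s2", "s1", "s2"],
    "intensity":  [1200.0, 1150.0, 310.0, 295.0],
})

# Group A and group B mine the same deposited data for different targets,
# with no new bench work required.
group_a = dataset[dataset["protein_id"] == "P69905"]  # haemoglobin subunit alpha
group_b = dataset[dataset["protein_id"] == "P04637"]  # cellular tumour antigen p53

print(group_a["intensity"].mean(), group_b["intensity"].mean())
```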

There are already discernible changes to the scientific research landscape that address the challenges of big data projects. However, the rise of data-intensive research requires a change of mindset amongst scientists. There is an increased need for multidisciplinary research teams, with clear communication between experts from different fields. Scientists also need to be innovative and become more aware of the tools that will enable them to collaborate widely and share data openly. These changes will help us fully grasp the potential of big data and accelerate understanding.

{credit}Image credit: Lakshini Mendis{/credit}

Lakshini Mendis is a winner of the 2015 Scientific Data writing competition. She is also a PhD student at the Centre for Brain Research in Auckland, where she studies how the human brain changes in Alzheimer’s disease. She is passionate about good science communication, is a strong advocate for women in STEM, and volunteers as Editor-in-Chief at The Scientista Foundation. Follow her musings on Twitter!

Big data: The impact of the Human Genome Project

The Human Genome Project led to a paradigm shift in the way science is conducted and data is shared, says Rehma Chandaria.

Guest contributor Rehma Chandaria

{credit}iStockphoto/Thinkstock{/credit}

In 1996, an international group of scientists came together in Bermuda to discuss how sequence data from the Human Genome Project (HGP) should be released. The meeting concluded with the formation of the ‘Bermuda Principles’, a set of rules ensuring that the data would be shared immediately on publicly accessible databases as they were generated. This ground-breaking accord contravened the conventional practice of releasing data only after publication in scientific journals. It changed the way we see data sharing and, ultimately, the way scientific research is conducted.

Its success demonstrated how a global community of scientists could collectively produce and use data far more efficiently than any individual could. This greatly benefited scientific progress and led to many important new insights and discoveries. For example, information on 30 disease-associated genes was published before the draft sequence itself appeared in 2001.

Recognising the ability of data sharing to accelerate progress, funders and journals are now pushing hard for all scientists to make raw data publicly available for others to analyse and use. It is becoming increasingly common for journals and funding bodies to insist on openly shared data as a prerequisite for publication or grant funding.

Sharing data: Why it should be done

As data continues to be produced at staggering rates, scientists need to become more aware of the benefits of data sharing, says Eleni Liapi.

Guest contributor Eleni Liapi

{credit}PhotoDisc/Getty Images{/credit}

The scientific community is currently experiencing an explosion in data generation. At CERN (the European Organization for Nuclear Research), the Large Hadron Collider (LHC) produces data at a rate of 1 petabyte (10¹⁵ bytes) per day, which is comparable to 210,000 DVDs. At the European Bioinformatics Institute, 20 petabytes of biological data were stored between 2004 and 2012. In the US alone, the volume of data produced by the healthcare industry in 2011 was estimated at 150 exabytes (10¹⁸ bytes). Undoubtedly, this volume of information brings with it several problems, including data storage and sharing.
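The DVD comparison is a straightforward back-of-the-envelope calculation, assuming a standard 4.7 GB single-layer disc:

```python
# Back-of-the-envelope check of the DVD comparison, assuming a standard
# single-layer DVD capacity of 4.7 GB.
petabyte = 10**15            # bytes
dvd_capacity = 4.7 * 10**9   # bytes per single-layer DVD

print(round(petabyte / dvd_capacity))  # ~212766, i.e. roughly 210,000 DVDs
```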

Access to data is a topic that generates much discussion among scientists and other communities, for a plethora of reasons: concerns about inappropriate use, for instance, or restrictive institutional and industrial policies where gigabytes of genomic data are destined for pharmaceutical research. There have already been attempts to estimate the extent of the problem: in one survey, 67% of participants expressed the view that inaccessible data hinder scientific progress.

Data sharing: Why it’s all ‘mine’

Data sharing makes scientific sense, but the career-conscious nature of scientists may stand in the way.

Guest contributor Rachel Yoho

As with many aspects of society, human nature shapes interactions in scientific research. When we consider “data sharing,” the likely response is a shrug. We’ve all been there: group work and competition at its finest. The increasingly competitive environment for grant funding and the ‘publish or perish’ culture promote a “mine, mine, mine” attitude among scientists. To see how career-protecting objections to data sharing might be overcome, however, we can look at several trends.

Data ownership
Many factors, including budget cuts, sequestration and economic downturns, have made grant funding scarce, creating financial stress in labs. “Big grants” such as the NIH R01 had lower success rates for new applications in 2014 than in four of the previous five years. In turn, data come to be seen as the possession of the PI and the lab, rather than of the funding agency or institution. Simply put: it’s our grant money, so it’s our data. Working towards and finally winning a grant, often after many attempts, breeds a sense of accomplishment and pride of ownership.