How is the rise of data-intensive research changing what it means to be a scientist?

Data-intensive research requires a new breed of scientist: interdisciplinary analysts who enjoy swimming in data, says Atma Ivancevic.

There has always been an emphasis on generating novel data in science. Being a scientist means progressing from observation to hypothesis to experiment to output. In the past, a combination of scarce data and low-throughput machinery for generating more meant limited experimental outcomes.



Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools for making sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life, owing to the pressure to constantly improve and innovate; others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base calling. In 2008, Yaniv Erlich and Gregory Hannon reduced base-calling errors in Illumina data with Alta-Cyclic, which uses machine learning to reduce noise.

Once bases are called, the reads most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009, Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and many more read mappers have been written since. mrsFAST, a cache-oblivious, seed-and-extend short-read mapper, was presented in 2010 by Cenk Sahinalp and colleagues. Bowtie2, a gapped read aligner by Ben Langmead and Steven Salzberg, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
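The seed-and-extend idea behind mappers like mrsFAST can be sketched in a few lines: index the reference’s k-mers, look up a seed from the start of the read, and extend each candidate position by counting mismatches. This is an illustrative toy (exact-match seeds, Hamming-distance extension, no gaps or quality scores), not how any of the published tools is actually implemented:

```python
# Toy seed-and-extend read mapper (illustrative only).

def build_seed_index(reference, k):
    """Index every k-mer in the reference by its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then extend by Hamming distance."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTAGCCGATCGATCGGGATTA"
index = build_seed_index(reference, k=5)
print(map_read("TAGCCGATCG", reference, index, k=5))  # [(8, 0)]
```

Production mappers replace the hash index with compressed full-text indexes (such as the FM-index behind Bowtie2) and use banded dynamic programming for gapped extension.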

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The continued high citation rate of this paper, and of other work pointing out the limits of current assembly programs, highlights that de novo read assembly remains a challenge.
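A minimal sketch of the de Bruijn graph approach that most short-read assemblers build on: break reads into k-mers, connect overlapping (k-1)-mers, and walk the graph from its source node. It assumes error-free reads and no repeated (k-1)-mers, which is exactly the assumption that repeats in real genomes violate:

```python
# Toy de Bruijn graph assembler: reads -> k-mers -> (k-1)-mer graph -> contig.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Edges (k-1)-mer -> (k-1)-mer for every k-mer in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def assemble(reads, k):
    """Walk the graph from the unique source node; assumes error-free
    reads and an unbranched graph (no repeated (k-1)-mers)."""
    graph = de_bruijn_graph(reads, k)
    targets = {t for outs in graph.values() for t in outs}
    node = next(n for n in graph if n not in targets)  # no incoming edge
    contig = node
    while graph[node]:
        node = graph[node].pop()
        contig += node[-1]  # each step adds one base
    return contig

print(assemble(["ACGGTT", "GGTTCA"], k=4))  # ACGGTTCA
```

Real assemblers must resolve branches caused by sequencing errors and repeats, typically using coverage information and read pairing, which is where the limitations discussed above arise.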

Finding structural variants

In 2009, Paul Medvedev and Michael Brudno reviewed tools for discovering structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20-50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variants ranging in size from 10 base pairs to 1 megabase. In 2011, Evan Eichler and colleagues added Splitread, which finds indels, de novo structural variants and copy-number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; by combining mapping and de novo assembly, it detects transmitted as well as de novo indels in exome data.
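The read-pair signal that BreakDancer-style callers exploit can be illustrated simply: estimate the library’s insert-size distribution from concordant pairs, then classify each pair by how far its mapped insert size deviates. Function names, thresholds and numbers below are illustrative, not taken from any published tool:

```python
# Read-pair principle behind insert-size-based SV detection.
import statistics

def insert_size_model(normal_sizes):
    """Mean and standard deviation of a calibration set of insert sizes."""
    return statistics.mean(normal_sizes), statistics.stdev(normal_sizes)

def classify_pair(size, mean, sd, n_sd=3):
    """Call the implied event for one read pair's mapped insert size."""
    if size > mean + n_sd * sd:
        return "deletion"    # reads map too far apart: sequence is missing
    if size < mean - n_sd * sd:
        return "insertion"   # reads map too close: extra sequence in sample
    return "concordant"

# A well-behaved ~300 bp library, then two anomalous pairs.
normal_sizes = [295, 301, 298, 305, 299, 302, 297, 300, 303, 296]
mean, sd = insert_size_model(normal_sizes)
print(classify_pair(900, mean, sd))  # deletion
print(classify_pair(150, mean, sd))  # insertion
```

Real callers combine this signal with split reads (as in Splitread) and local assembly (as in Scalpel) to refine breakpoints to base-pair resolution.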

Handling RNA-seq data

In 2008, Mortazavi et al. and Cloonan et al. published two of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then, the number of RNA-seq analysis tools has grown steadily throughout the literature.

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year, Chris Burge and colleagues published the MISO model to estimate the expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later, Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.
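Before expression levels can be compared, read counts must be normalized for transcript length and sequencing depth. A common unit is transcripts per million (TPM); here is a minimal sketch with made-up counts:

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: normalize read counts by transcript length
    (in kilobases), then scale so each sample sums to one million."""
    rates = [c / (l / 1000) for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1_000_000 for r in rates]

# Hypothetical counts for three transcripts of different lengths.
counts = [500, 500, 100]
lengths = [1000, 2000, 500]
print([round(x) for x in tpm(counts, lengths)])  # [526316, 263158, 210526]
```

Note how the 2 kb transcript gets half the TPM of the 1 kb transcript despite identical counts; differential-expression tools such as those above add statistical models of count variability on top of this kind of normalization.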

Last year, Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment and assessed the performance of software for reconstructing transcripts.

In 2010, David Haussler and colleagues showed with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that RNA functional motifs can be discovered from structural data.

Despite the many computational tools we have published, it is still not always easy to predict a priori which ones will be taken up by the community. We’d love to hear what you think makes a top-notch analysis tool.


The Method of the Year for 2013 is… single-cell sequencing

Single-cell sequencing edged out other contenders as our choice of Method of the Year for 2013. These techniques truly came into their own in 2013 and are fast providing insights into the workings of single cells that ensemble methods cannot deliver.

Back in 2008 we chose next-generation sequencing as our Method of the Year, not only because the new techniques would improve performance in conventional sequencing applications, but also because they opened up whole new applications that were unthinkable with traditional Sanger sequencing. Our choice of Method of the Year for 2013 bears this out, as none of these single-cell sequencing applications would be possible without next-generation sequencing. In some applications, the sequencing is used almost exclusively to identify and count tagged molecules.
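One example of sequencing used almost exclusively to identify and count tagged molecules is molecule counting with unique molecular identifiers (UMIs), a common strategy in single-cell protocols. The sketch below shows only the counting step; barcode demultiplexing and UMI error correction are omitted, and all sequences are made up:

```python
# Counting molecules from UMI-tagged reads: PCR duplicates share the same
# (cell barcode, gene, UMI) triple, so the molecule count is the number of
# distinct triples, not the raw read count.
from collections import defaultdict

def count_molecules(reads):
    """reads: iterable of (cell_barcode, gene, umi) tuples, one per read."""
    molecules = defaultdict(set)
    for cell, gene, umi in reads:
        molecules[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}

reads = [
    ("AAAC", "GAPDH", "TTGC"),
    ("AAAC", "GAPDH", "TTGC"),  # PCR duplicate of the read above
    ("AAAC", "GAPDH", "CCAT"),  # a second molecule of GAPDH in this cell
    ("TGCA", "GAPDH", "TTGC"),  # same UMI, but a different cell
]
print(count_molecules(reads))
# {('AAAC', 'GAPDH'): 2, ('TGCA', 'GAPDH'): 1}
```

Collapsing duplicates this way removes PCR amplification bias, which is especially severe when starting from the tiny amounts of material in a single cell.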

Our choice likely comes as a surprise to all those who were certain that we would pick CRISPR/Cas9 technology for targeted genome modification. This is certainly an exciting technology, and not only for genome engineering but also for epigenome editing, as described in a Method to Watch. But genome editing with engineered nucleases was our pick for the 2011 Method of the Year, and although CRISPR/Cas9 provides a huge practical improvement by largely dispensing with the need to engineer the nuclease, relying instead on a programmable guide RNA, the advance over 2011 is mostly one of ease of use.

Methods to investigate biology at the level of single cells have been of keen interest to Nature Methods since the journal started. Our first research article from Robert Singer described a paraffin-embedded tissue FISH (peT-FISH) method to simultaneously detect expression of several genes in situ in single cells while maintaining tissue morphology (Capodieci, P. 2005). This was followed by many other imaging-based methods for such things as measuring cell growth (Groisman, A. 2006), quantifying mRNA (Raj, A. 2008) and protein (Gordon, A. 2006) levels, profiling intracellular signaling (Krutzik, P.O. & Nolan, G.P. 2006)(Loo, L.-H. 2007) and DNA insertion-site analysis (Schmidt, M. 2008) in single cells.

The number of original research articles published in Nature journals exploded in 2013. These numbers may not be complete.

The publication in 2009 of M. Azim Surani’s article on mRNA-seq whole-transcriptome analysis of a single cell (Tang, F. 2009) helped signal the rise of sequencing-based methods for single-cell analysis. But even two years later, the Reviews and Perspectives in our supplement on single-cell analysis were more focused on imaging-based than on sequencing-based approaches.

It was only in 2013 that we finally saw an explosion of original research articles using or reporting single-cell sequencing methods in Nature-family journals. Numerous studies reported new biological results that relied on sequencing of whole or partial genomes or transcriptomes from single cells.

Our Method of the Year special feature has three Commentaries by researchers in the field, including some of the earliest developers and users of methods for single-cell analysis. An Editorial, News Feature and Primer describe our choice and provide helpful background information. We hope you enjoy the selection of articles in our special feature.

A star is born: the updated Human Reference Genome

The release of the 38th build of the human reference genome gets a well-deserved rock-star greeting by the scientific community.

The new GRCh38 is already a rock star. Credit: Wikimedia Commons/Flickr: Starman/K. Spencer

Fans know it is worth camping out for tickets to a concert by a beloved rock, pop or country star. GRCh38, the newest build of the human reference genome, is that kind of star. Delayed by a few snags, and held up further by the US government shutdown, the sequence has just arrived in GenBank for use by the scientific community.

Not only has Genome Reference Consortium build 38 (GRCh38) eliminated some pesky gaps from previous builds; it is also the first human reference assembly to include sequence information for centromeres. Until now, centromeres, the specialized structural components of chromosomes, have been represented in the reference by gaps of 3 million base pairs. The news of centromere sequence will interest cell biologists and genomics researchers alike.

“This will be a major boon to evolutionary studies of human populations and to the many groups doing mechanistic work on human centromeres and kinetochores,” says Stanford University researcher Aaron Straight, whose work focuses on cell division and chromosome segregation. “Finally, now we can stop saying ‘mind the gap’.”

The reference genome finishers are the members of the Genome Reference Consortium (GRC) at the European Bioinformatics Institute, the US National Center for Biotechnology Information, The Wellcome Trust Sanger Institute and The Genome Institute at Washington University.

Scientists may not have physically camped like concert-goers in front of the buildings where genome finishers scurry to get the sequence out the door. But the throngs have been virtually present. The GRC, which works on human, mouse and zebrafish reference genomes, is “having to field a lot of questions from folks who want to know the minute they can have the assembly,” says Deanna Church, a genomicist formerly at the US National Center for Biotechnology Information and who has, since this interview, moved to Personalis, a genetic testing and analysis company.

The din has faded from the 2001 celebration marking the end of the Human Genome Project. But the sequence was not complete then, nor is it complete now. As colleagues at Nature Methods have pointed out here and here, the sequence originally had around 150,000 gaps.

The most recent reference genome, Genome Reference Consortium build 37 (GRCh37), has 357 gaps and is missing sequence around the centromeres. No longer.

Come here, centromere
The structure and repetitive nature of centromeric regions have made them largely inaccessible to the methods used to create the reference assembly, says Church. The concept and methods used to produce the centromere sequences for this build were developed by a research team at the University of California, Santa Cruz (UCSC). They constructed sequences using the Sanger technique, and the data helped the team behind GRCh38 fill in these important gaps.

The centromere community will be happy to no longer say this. Credit: Wikimedia Commons/Clicsouris

In a paper, the UCSC team, led by Karen Miga and Jim Kent, a member of the GRC’s scientific advisory board, noted that centromeric regions are replete with near-identical tandem repeats, known as satellite DNA. The difficulty of assembling these regions has frequently led to their exclusion from genomic studies. For the new reference genome, the scientists used reads generated during the Venter genome assembly and created models for the centromeres, says Church.
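Why near-identical tandem repeats defeat standard assembly can be seen with a toy measure: the fraction of k-mers that occur exactly once, which is what gives an assembler unique anchors. In satellite-like sequence that fraction collapses to zero. The sequences below are made up for illustration:

```python
# Toy illustration of why satellite DNA resists assembly: assemblers need
# k-mers that occur exactly once to anchor contigs, and in near-identical
# tandem repeats almost no k-mer is unique.
from collections import Counter

def unique_kmer_fraction(seq, k):
    """Fraction of k-mer positions in seq whose k-mer occurs exactly once."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    positions = len(seq) - k + 1
    return sum(1 for c in counts.values() if c == 1) / positions

satellite = "GGAAT" * 40                      # a perfect tandem repeat
unique_seq = "ACGTTGCAAGGCTTACGGATCCGTA"      # non-repetitive sequence
print(unique_kmer_fraction(satellite, 10))    # 0.0 -- no anchors at all
print(unique_kmer_fraction(unique_seq, 10))   # 1.0 -- every 10-mer anchors
```

This is why the GRC resorted to sequence models of the "average" centromere rather than base-perfect assemblies of each one.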

“These models don’t exactly represent the centromere sequences in the Venter assembly, but they are a good approximation of the ‘average’ centromere in this genome,” she says. And these sequence models are not exact representations of any one centromere, either. But including these sequences in the reference assembly “will likely improve genome analysis using current methods, and allow for some further study of population variation in centromere sequences,” says Church.


Preview to the 7th Genomics of Common Diseases

The seventh annual Genomics of Common Diseases conference takes place this weekend, September 7-10, at Keble College, University of Oxford. At this conference, we seek to present a top selection of the latest research characterizing the genetic basis of a range of common diseases.

We held the first Genomics of Common Diseases conference in 2007, with a program that highlighted rapid advancements in identifying common variants associated with a range of common diseases, made possible by new methods enabling genome-wide association studies (GWAS). Over the past seven years, our understanding of the genetic architecture of disease has been progressively redefined by GWAS characterizing common variation, the fine mapping of associated regions, the emergence and growth of new sequencing technologies and the assessment of rare variant association. We have represented the progress in the field facilitated by rapid improvements in and reduced costs of genotyping and sequencing technologies. We have also seen rapid growth in the scale of genetic datasets, with the need to analyze progressively larger sample sizes. Our sixth annual conference focused both on presenting the latest applied technologies and on how to meet challenges posed by the analysis and interpretation of these large-scale genetic datasets.
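At its core, a single-marker GWAS test compares allele counts between cases and controls, for example with a Pearson chi-square test on a 2x2 table. The sketch below uses made-up counts and omits covariates and population-structure correction, both of which real analyses require:

```python
# Pearson chi-square statistic for a 2x2 table of allele counts
# (cases vs controls, alternate vs reference allele).

def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square statistic with 1 degree of freedom for a 2x2 table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row = sum(table[i])
            col = table[0][j] + table[1][j]
            expected = row * col / total  # under no association
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Alternate allele enriched in cases: 400/1000 vs 250/1000 chromosomes.
print(round(allele_chi_square(400, 600, 250, 750), 1))  # 51.3
```

A statistic of 51.3 far exceeds the 3.84 critical value for one degree of freedom, though genome-wide testing of millions of markers demands far stricter significance thresholds.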

Illumina, BGI spar over Complete Genomics

In a battle that will shape the market for DNA sequencing services, sequencing company Complete Genomics has received letters from rival suitors. Illumina, a dominant supplier of high-throughput DNA sequencing machines, and BGI, a giant sequencing-services firm, have both offered to purchase the Mountain View, California-based company, which has highly accurate, proprietary technology for sequencing human genomes.

Complete Genomics accepted BGI’s merger offer in September, but the deal still requires US regulatory approval. In addition, shareholders have filed suit to block the deal, saying that BGI’s purchase price of US$3.15 per share is too low.

Today’s letter from China-based BGI-Shenzhen says that its offer is in the best interest of stockholders, customers, and employees and argues that the offer from Illumina, based in San Diego, California, to buy Complete Genomics is intended merely to eliminate competition. “It is a thinly veiled attempt by Illumina to disrupt and interfere with our Merger Agreement in order to prevent Complete’s technology from posing a competitive threat to Illumina’s market dominance,” wrote BGI chief operational officer Ye Yin in a letter that also accuses Illumina of hypocrisy and knowingly making false assertions.

Yin also rejected allegations that the deal raises US national-security issues, noting that BGI has been a major purchaser of Illumina machines and reagents and is a private company, not a state-owned enterprise.

BGI’s letter is a response to one sent last week by Illumina that touted the merits of its unsolicited proposal: a 5% premium over BGI’s bid and no need for approval by a US committee that considers foreign investment. What’s more, pointed out Illumina chief executive Jay Flatley, BGI’s proposal still requires receipt of financing and other approvals that had still not been completed two months after Complete Genomics and BGI announced their agreement.

Experts at Leerink Swann have noted that the offer makes sense for Illumina. If the deal goes through, Illumina keeps a competitor away from a large customer and expands its technology.

Several sequencing platforms are on the market, including ones made by 454, Life Technologies and Pacific Biosciences, and other technologies are in development (see ‘The battle for sequencing supremacy’; subscription required). However, Illumina, which successfully fought off a takeover bid earlier this year, dominates the space. The company itself claims that more than 90% of sequencing data comes from its machines. When the agreement between BGI and Complete Genomics was first announced, researchers largely welcomed it, saying that it would encourage competition and keep a valued technology available.

Meanwhile, Complete Genomics and Illumina are engaged in a patent dispute. In another announcement today, Illumina said that a judge would reconsider an earlier ruling that invalidated claims in one of its patents, which Illumina believes Complete Genomics has infringed upon.