Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools to make sense of the ever-increasing amount of high throughput (HTP) sequencing data. Some tools in this field have a short half-life because of the pressure to always improve and innovate; others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base-calling. In 2008 Yaniv Erlich and Gregory Hannon reduced base-calling errors in Illumina data with Alta-Cyclic, which uses machine learning to reduce noise.

Once bases are called, they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009 Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST is a cache-oblivious, seed-and-extend short-read mapper presented in 2010 by Cenk Sahinalp and colleagues. Bowtie2, a gapped read aligner by Ben Langmead and Steven Salzberg, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.
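To make the seed-and-extend idea concrete, here is a minimal Python sketch. It is a toy illustration only and not the algorithm used by mrsFAST, Bowtie2 or GEM; the function names, the k-mer size and the mismatch threshold are all assumptions chosen for the example. Short exact matches (seeds) found through a k-mer index propose candidate positions, and each candidate is then checked (extended) across the full read.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) pairs for acceptable placements."""
    hits = {}
    # Seed: every k-mer of the read is tried as an exact-match anchor.
    for offset in range(len(read) - k + 1):
        for seed_pos in index.get(read[offset:offset + k], []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            # Extend: compare the whole read with the candidate window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())

# Hypothetical toy data: the read differs from the reference at one base.
reference = "ACGTTGCATGCCGATAGCTTAGGCTA"
index = build_index(reference, k=4)
hits = map_read("GCCGATTGC", reference, index, k=4)
print(hits)
```

Real mappers add compressed indexes, gapped extension and quality-aware scoring; the sketch only shows why an exact seed narrows the search before the more expensive comparison.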

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The ongoing high citation level of this paper and other work pointing out limits in current assembly programs highlight that de novo read assembly continues to be a challenge.

Finding structural variants

In 2009 Paul Medvedev and Michael Brudno looked at tools to discover structural variants and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20-50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variation ranging in size from 10 base pairs to 1 megabase. In 2011 Evan Eichler and colleagues added Splitread to find indels, de novo structural variants and copy number polymorphisms with high specificity and sensitivity. More recently in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; a combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.

Handling RNA-seq data

In 2008 Mortazavi et al. and Cloonan et al. published some of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then the number of RNA-seq analysis tools in the literature has grown steadily.

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year Chris Burge and colleagues published the MISO model to estimate expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.

Last year Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment. They also looked at the performance of software to reconstruct transcripts.

David Haussler and colleagues showed in 2010 with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that RNA functional motifs can be discovered in their structure.

Despite the many computational tools we have published it is still not always easy to predict a priori which one will be taken up by the community. We’d love to hear from you what you think makes a top notch analysis tool.

Sequencing: Ship-Seq sails the seas

To study a primordial nervous system, Leonid Moroz brings the tools of biology to the open sea. Nature Methods spoke with the neurobiologist turned sea adventurer.

Leonid Moroz diving in Palau, collecting Nautilus.{credit}Aggressor Fleet / L.L. Moroz{/credit}

Meet neurobiologist Leonid Moroz of the University of Florida, the inventor of Ship-Seq. His hair is not always this wild, although his ideas tend to be.

Ship-Seq is a boat with a sequencing lab on board. On the high seas, Moroz and his crew of sailor-scientists do high-throughput sequencing of DNA and RNA from single cells, as well as neurobiology experiments. And they analyze results, too.

The ctenophore Beroe ovata.{credit}J. Netherton/ L.L. Moroz{/credit}

He is especially intrigued by ctenophores, now believed to be the first multicellular organisms, which have a nervous system that is utterly unlike ours. It is likely, he says, that their ‘elementary brains,’ their neural and muscular systems, have evolved independently from those in all other animal lineages, such as the ones found in molluscs and basal metazoans.

In a recently published Nature paper, he and his colleagues present the genome of a ctenophore, the Pacific sea gooseberry (Pleurobrachia bachei), along with transcriptome analysis of other ctenophores; the data are here. He and his colleagues also present metabolic and physiological data about these organisms. The authors describe how ctenophores have evolved neuronal organizations that show ‘molecular innovations.’ There is also an accompanying News and Views piece by Andreas Hejnol of the University of Bergen in Norway and a Nature news story by Ewen Callaway.

Labs can be outdoors and on-ship. {credit}L.L. Moroz{/credit}

Although organisms can be taken from the sea to the lab, they often need ocean depths or a certain temperature to survive. And when samples are prepared for travel, they need optimized conditions to not degrade. Three decades of dealing with dead organisms, degraded samples, delayed shipments and customs snafus have led Moroz to try something new: Ship-Seq. “We cannot bring the sea to the lab, but we can bring a whole lab to the sea,” he says.

After completing two proof-of-concept Ship-Seq voyages, one to the Bahamas and another near the Florida Keys, plus a preparatory trip to Palau, Moroz shares some of his findings here and offers a glimpse at his logistics and future plans. He hopes others can follow his example, because probing and analyzing nature while in and around nature is an adventure with biomedical value.

Leonid Moroz wanted to bring the lab to the sea. {credit}L.L. Moroz{/credit}

Biologist and entrepreneur Craig Venter and his Global Ocean Sampling Expedition in some ways parted the seas for Moroz’s project. Moroz wanted to explore biodiversity through sequencing but also take an extra step to do on-site ‘integrative experimental biology,’ which is about using many types of tools to study whole organisms, their behavior and their cells and genomes.

Field biology tends to be an observational science, because in the field, biologists do not usually have an entire high-tech molecular biology lab in tow. And, says Moroz, field scientists may not be completely familiar with new genomics tools, which is too bad since nature has performed genetics experiments waiting to be evaluated. On the boat he studied regeneration, which is hard or even impossible to accomplish “in a dish,” he says, because the animals he studies are incredibly fragile.

King of Regeneration
Meet the comb jelly Bolinopsis, which Moroz calls ‘the king of regeneration.’

Bolinopsis can regenerate its brain in three to five days. {credit}L.L. Moroz{/credit}

These transparent organisms from the phylum Ctenophora propel themselves through the water with rows of iridescent combs of tiny hairs. Though they may be small and unassuming, they perform an amazing feat: they can regenerate their entire ‘elementary’ brain in three to five days.

Moroz calls their aboral organ with gravity sensors an ‘elementary’ brain; it is not homologous to the human brain. But it is a control center with many neuron types and it coordinates behaviors and motions. In that sense it is an “analog” of the human brain, he says. What astounded Moroz is that when it is dissected from the animal, it grows back.

Other marine organisms such as Hydra are known to regenerate organs, but examples are limited, particularly for organisms that can be maintained in the lab. Finding models for such biological phenomena is crucial in neurobiology, he says, and for regenerative medicine, too. Aplysia, the marine sea slug, has long been helping scientists study memory. There are more such organisms to find, and he wants to use them for ‘real-time’ experiments and analysis, for example to look at the dialogue between pre- and post-synaptic neurons.

Bolinopsis has another intriguing trait that Moroz discovered by accident. He was making some small incisions and then briefly interrupted his work. “When I came back around 40 minutes or an hour later, I couldn’t find my cut,” he says. He made another incision and watched the wound begin to close before his eyes. Overnight, the wound became invisible. “It’s very cool,” says Moroz.

Sequencing team on the first Ship-Seq voyage, from left to right: Tatiana Moroz, Andrea Kohn, Rachel Sanford. {credit}L.L. Moroz{/credit}

He found this wound-healing ability in five or six ctenophore species. It is likely an adaptation to life close to the water surface, where there are predators and formidable waves that can inflict bodily harm on these organisms. A related ctenophore species that lives in deeper waters appears to have lost this wound-healing ability. In this sense, he says, “nature already performed knock-out experiments for us,” inviting researchers to investigate which genes might play a role in these instances. Some species in the same lineage are slow regenerators, others fast, another aspect that invites genomic analysis.

Traditional ways of exploring the biochemical underpinnings of physiology and behavior can be slow. With new technologies such as high-throughput sequencing, it is possible to connect data types more quickly. One can, for instance, observe an organism’s behavior and use genomics to track molecular changes in gene expression or epigenetic markers. Being on the boat lets scientists directly address observed biology; “you basically follow up with what nature suggests to you,” says Moroz.

One-way ticket

The Ship-Seq sequencing team for the second trip (from left to right: Suzette, Lauran, Rachel, Gabby, Andrea, Greg, Emily, Leonid, Gustav). {credit}L.L. Moroz{/credit}

Ship-Seq is also an environmental research project. Roughly every six hours a species is lost, Moroz says. The disappearance of these organisms means ecological harm and the loss of important molecular blueprints, which is not unlike losing precious art and heritage sites, he says.

Comparative biologists face the criticism that their work does not have ‘translational value’ for biomedicine. But Moroz believes Ship-Seq shows that marine organisms have tremendous biomedical value. Bolinopsis is one example of many.

A small volcanic island in Antarctica. Moroz nicknamed it Aplysia Island because it looks like the sea slug, Aplysia, a model organism. {credit}L.L. Moroz {/credit}

Too many human diseases, such as age-related memory loss, are “a one way ticket,” he says. Spinal cord injury and stroke lead to irreparable damage. But genomic analysis, including genome-wide expression studies, can help researchers explore how to lessen the impact of these diseases and injuries. Scientists need to “jump” from the genome to complex functions and brain circuits, which recruit many parts of the genome.

By delivering the basic alphabet of an organism, sequencing is a boon to many fields. What scientists also need is the grammar with which this alphabet creates the biological equivalent of language, which is behavior and physiology.

With his approach to ‘real-time genomics,’ he wants to help expose this grammar, says Moroz. For example, scientists might want to capture epigenetic changes over the course of learning or regeneration.

Ship-Seq logistics

Copasetic with the mobile sequencing lab aboard{credit}Ian van der Watt{/credit}

This is Leonid Moroz’s boat, the Copasetic, a 141-foot yacht. Actually it isn’t his boat. And the story of how he gained access to it is a tale of Moroz’s brand of determination.

Logistics expenses for field expeditions are usually not covered by traditional grants, so Moroz built a collaboration between companies and non-profits to make Ship-Seq a reality. Over the years, he found opportunities, but the tide was against him. One time, everything was ready to go, but the boat’s owner decided to sell the boat a mere week before the scientists wanted to set sail. Ship-Seq’s maiden voyage was cancelled.

Then Moroz came across the Florida-based International Seakeepers Society, through which yacht-owners loan out their boats for research purposes when they are not using them.

In late 2012, Moroz was invited to an International Seakeepers Society dinner. He had a semiconductor chip in his pocket that is used in semiconductor-based sequencers from Life Technologies, now a part of Thermo Fisher. The scheduled presentation was delayed due to a glitch with the projector. Until the projector was fixed, Moroz gave an impromptu talk about how the small chip could help save the oceans’ heritage and tell the world about the genomic blueprints of marine organisms. He had already been using the technology in his lab and saw how the instrument was accelerating his work.

Some of the listeners smiled politely and ignored him, he says, but a few were excited. Around nine months after that dinner, finally an opportunity presented itself that allowed Ship-Seq to leave the dock.

Boat, crew, captain

Steven Sablotsky designed the Copasetic{credit}L.L. Moroz{/credit}

Steven Sablotsky, a University of Florida alumnus, engineer, businessperson, yacht owner and member of the International Seakeepers Society, approached Moroz. Sablotsky had designed his own boat, the 141-foot Copasetic, with marine research in mind. Sablotsky offered his boat and his crew for Moroz’s “proof-of-concept” trips for free.

The added crew was important. Private boat owners can be their own skippers, but large boats are legally obliged to have a competent crew. “It’s pretty complicated machinery,” says Moroz. “You really have to work around the clock.”

The Copasetic crew{credit}L.L. Moroz{/credit}

At the time, Moroz was also speaking with sequencer manufacturers. He had set up a Life Technologies Personal Genome Machine (PGM), a bench-top, semiconductor-based sequencer. The instrument’s semiconductor chip uses millions of wells to capture DNA sequence information. DNA is fragmented, and each fragment is attached to a bead and copied so that each bead is covered with copies of the same fragment. One bead is deposited into each of many wells on the chip, which is then flooded with one of the four DNA nucleotides. When a nucleotide is incorporated into DNA, a hydrogen ion is released, changing the pH in the well. The instrument detects the change and converts it to a voltage signal, which registers that the base was incorporated and adds it to the growing sequence of the fragment. Another nucleotide floods the wells and the process repeats.
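The flow-by-flow logic described above can be sketched in a few lines of Python. This is an idealized illustration and not the instrument's actual signal processing: the cycling order `TACG`, the function names and the assumption that the signal scales with the number of identical bases incorporated in one flow are simplifications made for the example, and the sketch works directly on the strand being synthesized rather than on its template.

```python
FLOW_ORDER = "TACG"  # assumed cycling order, for illustration only

def simulate_flows(synthesized, n_flows, flow_order=FLOW_ORDER):
    """Return one idealized signal per flow: the number of bases incorporated."""
    signals = []
    pos = 0
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]
        count = 0
        # A flow extends the strand for as long as the next base matches it.
        while pos < len(synthesized) and synthesized[pos] == base:
            count += 1
            pos += 1
        signals.append(float(count))
    return signals

def call_bases(signals, flow_order=FLOW_ORDER):
    """Round each signal to a whole number of bases and rebuild the sequence."""
    calls = []
    for flow, signal in enumerate(signals):
        calls.append(flow_order[flow % len(flow_order)] * round(signal))
    return "".join(calls)

fragment = "TTACGGA"  # hypothetical fragment
signals = simulate_flows(fragment, n_flows=8)
print(signals)
print(call_bases(signals))
```

Flows in which nothing is incorporated give a zero signal, which is why the number of flows is usually larger than the number of bases read.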

After testing the PGM, Moroz decided that it should be the sequencer for Ship-Seq. He was not sure where to install it along with the other necessary lab equipment. It was the Copasetic’s captain Ian van der Watt who suggested housing the lab in a shipping container. A construction manager at Florida Biodiversity Institute helped to organize one such container and design the mobile lab with Moroz. A few weeks later it was ready to be placed on the boat’s deck.

The mobile lab is placed on the boat’s deck….{credit}L.L. Moroz{/credit}

…and is ready to travel anywhere. {credit}L.L. Moroz{/credit}

The advantage of a container, says Moroz, is that it offers a completely controlled environment. He and his lab collected the supplies and instruments they needed such as benches, anti-vibration tables, PCR machine, and enrichment systems to measure RNA and DNA and run quality controls.

They needed a high-quality water purification system for the sequencing. It is, he says, “somewhat ironic” that the team needed to produce ‘clean pure water’ even though they were in the middle of the ocean. Thermo Fisher engineers got the sequencer ship-shape for a seafaring environment. “Basically we made a full-scale molecular lab” for genomics and imaging, says Moroz.

He still had concerns about variables such as temperature and vibration. They set up the lab and tested all the instruments. While at the dock, he asked the captain to power the motor forwards and backwards, simulating high waves. The lab aced the test.

Ship-Seq set sail on its first voyage and the lab was humming from the moment they left, Moroz says. Sablotsky came along, too. Every day they did two sequencing runs and sent the data via a satellite link to HiPerGator, a high-performance computer with 24,000 processing cores installed at the University of Florida.

Ship-Seq’s core lab. {credit}L.L. Moroz{/credit}

Moroz had set up an analysis pipeline with computational tools and scripts to assemble and annotate the incoming sequence information. After automated analysis, data was beamed back to the boat. The sailor-scientists had considered taking a Thermo Fisher engineer along but that did not pan out “so we were on our own,” says Moroz. The good news was “everything worked.”

The second trip, to the Gulf Stream and Florida Keys, was windy, with rough seas. Seasickness immobilized half of the lab staff for part of the trip, says Moroz, including his wife. “People could not cope with the field conditions but the PGM machine could,” he says of the sequencer on board. Actually, he says, Ship-Seq’s sequencing runs were of higher quality than those in the lab on land. He speculates that the waves enhanced the mixing of chemicals.

“The versatility of our bench top sequencers is only limited by the imagination of today’s scientists,” says Mark Stevenson, executive vice president of Thermo Fisher Scientific in an e-mail to Nature Methods. “Clearly, Dr. Moroz has taken an ingenious idea to a new level and demonstrated that great data can be attained and analyzed in real time – even on a ship that’s rocking on the high seas.”

Seasick but happy
On both trips and despite the seasickness on the second venture, the lab’s team was especially motivated, says Moroz. “It is easy to work a 16-18 hour day when you have the beautiful sea, beautiful creatures around.” People have been important for the overall success of the venture, he says.

Moroz wants to do more trips and expand Ship-Seq’s scientific scope. Using a prototype of the PII chip (which is not yet on the market), he performed single-neuron RNA sequencing in the lab. He projects it might cost around $3 per individual neuronal transcriptome if one wanted to do a census of neuronal cell types in the brain of a marine organism such as Bolinopsis or other ctenophores, plankton and other ‘aliens of the sea,’ as he calls them.

It took a while before Ship-Seq could set sail. {credit}L.L. Moroz{/credit}

Ship-Seq and its ‘lab-in-a-container’ offer many opportunities, he says. “The beauty is that it is mobile.” The container could be put on a ship in Florida or it could be sent to Palau or Antarctica and placed on a boat there for not much greater cost. “You can get anywhere,” he says, maybe even set up a “sequencing fleet.”

The planning for the next Ship-Seq trips is underway—but the geographic and scientific directions are not yet finalized. And the finances, too, need to be organized. The trip might focus on more complex marine organisms. For example, cephalopods have complex brains, lending them their nickname ‘primates of the sea.’ Moroz hopes to one day study their neurobiology, integrating field biology, behavior, and genomics. He also wants to be part of the ongoing ‘race to save species,’ to not only study but also “preserve our planet.”

Moroz has encountered plenty of detractors and skeptics. Whenever he is criticized and told he should stick to the traditional way of doing science, his path of taking the lab to the sea feels right. He says it reinforces his sense: “I must do it.” To him, doing science on Ship-Seq feels like “the investigation of a new planet.”

Ship-Seq Protocol
1 x 141-foot boat
1 x generous entrepreneur
1 x ship’s crew
1 x mobile molecular biology lab equipped with lab benches, a sequencer, reagents
1 x manufacturer of a high-throughput sequencer willing to donate an instrument
1 x satellite link to a supercomputer
1 x lab staff and scientist/wife willing to be scientist-sailors
1 x diving equipment
1 x funding from the National Institutes of Health (NIH), National Science Foundation (NSF), National Aeronautics and Space Administration (NASA)
3 x support from non-profit organizations: Florida Biodiversity Institute, Florida Museum of Natural History, the International Seakeepers Society
1,000 international units of patience
Several remedies for seasickness

The Method of the Year for 2013 is… single-cell sequencing

Single-cell sequencing edged out other contenders as our choice of Method of the Year in 2013. These techniques really came into their own in 2013 and are fast providing new insights into the workings of single cells that ensemble methods are incapable of providing.

Back in 2008 we chose next-generation sequencing as our Method of the Year not only because of how the new techniques would improve performance in conventional sequencing applications, but also because they opened up whole new applications, unthinkable with traditional Sanger sequencing. Our choice of Method of the Year in 2013 bears this out, as none of these single-cell sequencing applications would be possible without next-generation sequencing. And in some applications the sequencing is used almost exclusively for identifying and counting tagged molecules.

Our choice likely comes as a surprise to all those who were certain that we would pick CRISPR/Cas9 technology for targeted genome modification. This is certainly an exciting technology, and not only for genome engineering, but also for epigenome editing as described in a Method to Watch. But genome editing with engineered nucleases was our pick for the 2011 Method of the Year and although CRISPR/Cas9 provides a huge practical improvement by largely dispensing with the need to engineer the nuclease and relying instead on a programmable guide RNA, the advance over 2011 is mostly one of ease-of-use.

Methods to investigate biology at the level of single cells have been of keen interest to Nature Methods since the journal started. Our first research article from Robert Singer described a paraffin-embedded tissue FISH (peT-FISH) method to simultaneously detect expression of several genes in situ in single cells while maintaining tissue morphology (Capodieci, P. 2005). This was followed by many other imaging-based methods for such things as measuring cell growth (Groisman, A. 2006), quantifying mRNA (Raj, A. 2008) and protein (Gordon, A. 2006) levels, profiling intracellular signaling (Krutzik, P.O. & Nolan, G.P. 2006) (Loo, L.-H. 2007) and DNA insertion-site analysis (Schmidt, M. 2008) in single cells.

The number of original research articles published in Nature journals exploded in 2013. These numbers may not be complete.

The publication of M. Azim Surani’s article on mRNA-Seq whole-transcriptome analysis of a single cell (Tang, F. 2009) in 2009 helped signal the rise of sequencing-based methods for single-cell analysis. But even two years later the Reviews and Perspectives in our supplement on single-cell analysis were more focused on imaging-based than sequencing-based approaches to single-cell analysis.

It was only in 2013 that we finally saw an explosion of original research articles using or reporting single-cell sequencing methods in Nature-family journals. Numerous studies reported new biological results that relied on sequencing of whole or partial genomes or transcriptomes from single cells.

Our Method of the Year special feature has three Commentaries by researchers in the field, including some of the earliest developers and users of methods for single-cell analysis. An Editorial, News Feature and Primer describe our choice and provide helpful background information. We hope you enjoy the selection of articles in our special feature.

A star is born: the updated Human Reference Genome

The release of the 38th build of the human reference genome gets a well-deserved rock-star greeting by the scientific community.

The new GRCh38 is already a rock-star{credit}Wikimedia Commons/Flickr:Starman/K.Spencer{/credit}

Fans know it is worth the effort to camp out for tickets to a concert by a beloved rock, pop or country star. GRCh38, the newest build of the human reference genome, is that kind of star. Delayed by a few snags and also held up by the US government shut-down, the sequence has just traveled to GenBank for use by the scientific community.

Not only has Genome Reference Consortium build 38 (GRCh38) eliminated some pesky previous gaps, it will be the first human reference assembly to have sequence information for centromeres. Up until now, centromeres, which are specialized structural components of chromosomes, have been represented in the reference by gaps of 3 million base pairs. The news about centromere sequence will be of interest to cell biologists and genomics researchers alike.

“This will be a major boon to evolutionary studies of human populations and to the many groups doing mechanistic work on human centromeres and kinetochores,” says Stanford University researcher Aaron Straight, whose work focuses on cell division and chromosome segregation. “Finally, now we can stop saying ‘mind the gap’.”

The reference genome finishers are the members of the Genome Reference Consortium (GRC) at the European Bioinformatics Institute, the US National Center for Biotechnology Information, The Wellcome Trust Sanger Institute and The Genome Institute at Washington University.

Scientists may not have physically camped like concert-goers in front of the buildings where genome finishers scurry to get the sequence out the door. But the throngs have been virtually present. The GRC, which works on human, mouse and zebrafish reference genomes, is “having to field a lot of questions from folks who want to know the minute they can have the assembly,” says Deanna Church, a genomicist formerly at the US National Center for Biotechnology Information and who has, since this interview, moved to Personalis, a genetic testing and analysis company.

The din has faded from the 2001 celebration marking the end of the Human Genome Project. But the sequence was not complete nor is it complete now. As colleagues at Nature Methods have pointed out here and here, the sequence originally had around 150,000 gaps.

The most recent reference genome, Genome Reference Consortium build 37 (GRCh37), has 357 gaps and is missing sequence around the centromeres. No longer.

Come here, centromere
The structure and repetitive nature of centromeric regions have made them largely inaccessible to methods used to create the reference assembly, says Church. The concept and the methods to produce the centromere sequences for this reference build were developed by a research team at University of California at Santa Cruz (UCSC). They constructed sequences using the Sanger technique and the data helped the team behind GRCh38 to fill in these important gaps.

The centromere community will be happy to no longer say this.{credit}Wikimedia Commons/Clicsouris{/credit}

In a paper, the UCSC team, led by Karen Miga and Jim Kent, a member of GRC’s scientific advisory board, noted that centromeric regions are replete with near-identical tandem repeats known as satellite DNA. Because these regions are difficult to assemble, they have frequently been excluded from genomic studies. In the new reference genome, the scientists used reads generated during the Venter genome assembly and created models for the centromeres, says Church.

“These models don’t exactly represent the centromere sequences in the Venter assembly, but they are a good approximation of the ‘average’ centromere in this genome,” she says. And these sequence models are not exact representations of any one centromere, either. But including these sequences in the reference assembly “will likely improve genome analysis using current methods, and allow for some further study of population variation in centromere sequences,” says Church.

Stephen Quake responds to Lior Pachter

Stephen Quake responds to a blog post by Lior Pachter that analyzes data from his recent analysis of single-cell RNA sequencing methods published in Nature Methods.

In October, we published an Analysis by Quake and colleagues that evaluated a number of single-cell RNA-seq approaches on the basis of their sensitivity, accuracy and reproducibility. In a subsequent blog post, Pachter challenged their data reporting. At issue is whether the failure rate among 96 samples sequenced using the Fluidigm C1 microfluidic instrument should have been presented differently.

We encourage animated discussion of published research and hope that this can serve as a useful forum. In this guest post, Quake responds to Pachter’s blog entry. The views expressed below are solely his and do not necessarily represent those of Nature Methods.

In a recent blog post, Lior Pachter appears to question my scientific integrity and suggest that I unfairly manipulated data in a recent publication on single cell RNAseq.

Pachter has not contacted me directly with his questions nor did he give any warning before publishing his blog post. While I am happy that he is carefully scrutinizing publications and independently re-analyzing primary data, his rather sensationalistic approach to reporting his results in the absence of discussion or peer-review risks doing a disservice to science and adds more heat than light.

Pachter tries to have it both ways – based on our published data he accuses me of 1) wasting effort by sequencing lower quality samples and 2) selectively publishing data from only the better samples. It is hard to see how these accusations can simultaneously both be true. As described in the methods section of our paper, the C1 capture rate is not perfectly efficient and therefore we manually inspected all the chambers. We found 93 chambers had single cells, 1 chamber had two cells, and 2 chambers had no cells. Of the 93 chambers with single cells, 91 of the cells appeared to be alive as measured by a live/dead stain and 2 did not. Our single cell RNAseq experiments included all 91 of the “live” single cells and 1 of the “dead” single cells; the data from the latter was indistinguishable from the former and thus it was included in all further analyses. There was absolutely no selection or manipulation of the data. All of the raw data as well as our R scripts were made available for Pachter and others to download and analyze upon publication of our paper.

The sequencing library prep and workflow that we use is geared around 96 parallel samples, and we decided it would be valuable to process control samples in exactly the same batch as the single cell samples. We therefore included four control samples with the single cells: amplification products from a chamber on the chip that did not have a cell (C09, which was unfortunately not given a distinguishing filename during the file upload), a single cell tube amplification, a no template control (NTC, C70) tube experiment that did not have a single cell, and a bulk control sample. Pachter correctly points out that C70 is dominated by the ERCC spike-in controls and has essentially no human transcripts, as expected; similarly, the other negative control, C09, performs very poorly next to the actual single cell data. It is not clear to me why Pachter thinks I should be embarrassed for performing negative control experiments; indeed, biochemical amplifiers are known to be so sensitive that there are many stories of contamination occurring through aerosol dispersal from nearby benches, etc. In our own analyses C09 and the other controls were excluded from the single cell data.

Pachter also noticed that ~3 of the single cell RNAseq experiments have significantly lower quality than the other 89, as measured by the fraction of spike-in sequenced or by log-correlation coefficient. If taken at face value, this corresponds to a failure rate of 3/92, or about 3%. The experiments therefore had a 97% success rate by this metric and it is hard to see where his complaint lies. We conservatively included ALL of the single cell data in our analyses, and thus if one follows Pachter’s prescription to analyze only the experiments that he deems “successful”, then the results will be even better than we reported.

Finally, Pachter makes a misleading argument concerning the statistical methods used to generate figure 4a. This figure is concerned with the question of whether an ensemble of single-cell RNAseq experiments produces gene expression values similar to those of a bulk experiment. The reason for sub-sampling to equal depth is the worry of introducing artifacts by comparing two RNAseq experiments of dramatically differing sequencing depth (see e.g. Cai, Guoshuai, et al. “Accuracy of RNA-Seq and its dependence on sequencing depth.” BMC bioinformatics 13.Suppl 13 (2012) and Tarazona, Sonia, et al. “Differential expression in RNAseq: a matter of depth.” Genome research 21.12 (2011): 2213-2223.). This figure has little to do with estimating the quality of the individual RNAseq experiments.
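For readers unfamiliar with the idea, sub-sampling to equal depth can be illustrated with a minimal Python sketch. This is not the code used for figure 4a (the paper's own R scripts are the reference for that); the gene names, read counts and function names below are hypothetical, and the sketch simply assumes that reads are drawn uniformly at random without replacement until both samples have the same total.

```python
import random
from collections import Counter


def subsample_counts(counts, target_depth, seed=0):
    """Draw `target_depth` reads without replacement from a gene -> read-count mapping."""
    total = sum(counts.values())
    if target_depth > total:
        raise ValueError("target depth exceeds the number of available reads")
    rng = random.Random(seed)
    # Expand the table to one entry per read, then sample reads uniformly.
    reads = [gene for gene, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, target_depth))


def equalize_depth(sample_a, sample_b, seed=0):
    """Sub-sample both samples down to the depth of the shallower one."""
    depth = min(sum(sample_a.values()), sum(sample_b.values()))
    return (subsample_counts(sample_a, depth, seed),
            subsample_counts(sample_b, depth, seed))


# Hypothetical toy data: a deep bulk sample and a shallower pooled ensemble.
bulk = {"GENE_A": 5000, "GENE_B": 3000, "GENE_C": 2000}
pooled_single_cells = {"GENE_A": 480, "GENE_B": 310, "GENE_C": 210}

bulk_sub, pooled_sub = equalize_depth(bulk, pooled_single_cells)
```

After this step both tables contain the same number of reads, so differences between them are less likely to reflect sequencing depth alone.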

Whose genes are they?

The first gene patent was granted in 1982 for the sequence encoding insulin. Today ~20% of the human genome is patented. And the controversy is as alive as ever, as we discuss in this month’s editorial.

Gene patents have supporters who see them as essential for the development of diagnostic and therapeutic products and detractors who see them as hindering research directly and indirectly.

We think that Myriad Genetics’ patents on BRCA 1 and 2, and how these patents have been upheld in court challenges in the US, illustrate how gene patents can elicit cases of preemptive obedience, which is problematic for patients and researchers alike.

While it is not clear that Myriad’s patents on BRCA 1 and 2 are infringed by technology that does not rely on isolated cDNA, such as next generation sequencing (NGS), no company in the US seems to be willing to take that risk. Ambry Genetics, for example, a company offering an NGS-based test for mutations in 14 breast cancer related genes, specifically excludes BRCA 1 and 2.

For patients in the US with a family history of breast cancer this is bad news, since it leaves them with Myriad as the only provider for mutation testing in BRCA. And there is no way to obtain a second opinion, particularly for negative results.

Not so in the UK. There Myriad holds fewer and more restricted patents and NewGene, a company owned by the Newcastle Hospitals NHS Foundation Trust and Newcastle University, offers full sequencing of the BRCA1 and 2 coding regions and some introns. We are not saying that NewGene’s test is superior to Myriad’s, though it is certainly cheaper. But it shows that the latest technology can quite rapidly be translated into the clinic, if there are no legal hurdles.

To understand the impact of the many different mutations in the BRCA genes, researchers need to have access to a large patient population. For this purpose the NIH founded the Breast Cancer Information Core (BIC) mutation database to gather such patient data.

Myriad, as the sole provider of BRCA mutation tests, acquires large amounts of patient data, and until about 2006 the company contributed variants to BIC. Then they decided to keep them proprietary. Researchers were not happy about losing access to such valuable information. In response, Robert Nussbaum from UCSF started an initiative to gather mutations directly from physicians. He is asking them to black out a patient’s identity and forward the mutation report Myriad provides so it can be incorporated into BIC. This information, together with many international contributions, has led to over 14 000 variants in the database.

With personalized medicine on the rise and many people interested in having their genome sequenced, restricting who can look at a particular gene makes little sense, and the much coveted $1000 genome will hopefully not be rendered ineffective by patent laws.

As always we are keen to hear your views.

Data overload

How do you handle terabytes of data? That is a question that more and more investigators must face, on a weekly basis.

Are you one of them? Light-sheet fluorescence imaging, for example, generates so much data in each experimental run that handling and storing the raw data is a challenge. Next-generation sequencing is another, much more ubiquitous, case.

Read the July issue editorial “Byte-ing off more than you can chew” and let us know about your own experience, problems and practical (or impractical) solutions.

Next-next DNA sequencing

While the technology feature, “DNA sequencing: generation next-next”, was at press, Pacific Biosciences of Menlo Park, California, stunned the community with their announcement of a single molecule sequencing technology they claim will provide a complete human genome in 15 minutes by the year 2013. Although Pacific Biosciences was founded in 2004, the company had been very ‘hush hush’ about their technology development. But that veil of secrecy was lifted during the Advances in Genome Biology and Technology meeting held February 6th to 9th at Marco Island, Florida, where Stephen Turner, chief technology officer, presented the first preliminary data on the system.
