Analyzing high throughput sequencing data

Nature Methods has published popular analysis tools to make sense of the ever-increasing amount of high-throughput (HTP) sequencing data. Some tools in this field have a short half-life because of the pressure to always improve and innovate; others have staying power. Let’s look back over some of the highlights in our pages.

Mapping and assembling genomic reads

One of the first steps in any sequence analysis pipeline is base-calling, and in 2008 Yaniv Erlich and Gregory Hannon reduced base-calling errors in Illumina data with Alta-Cyclic, a base caller that uses machine learning to reduce noise.

Once bases are called, they most often need to be aligned to a reference, and high speed, sensitivity and accuracy are key requirements for mapping tools. In 2009 Paul Flicek and Ewan Birney discussed the basic principles behind methods for read alignment and assembly, and since then many more read mappers have been written. mrsFAST is a cache-oblivious, seed-and-extend short-read mapper presented in 2010 by Cenk Sahinalp and colleagues. Bowtie2 by Ben Langmead and Steven Salzberg, a gapped read aligner, promises exceptional speed and accuracy. The GEM mapper by Paolo Ribeca and colleagues combines speed with an exhaustive search that returns all existing matches.

If no reference genome is available, de novo assembly is the way to go. Many tools for genome assembly have been published, but in 2010 Evan Eichler and colleagues demonstrated some of the limitations of popular assemblers used for the human genome. The ongoing high citation level of this paper, and other work pointing out limits in current assembly programs, highlight that de novo read assembly continues to be a challenge.

Finding structural variants

In 2009 Paul Medvedev and Michael Brudno looked at tools to discover structural variants, and later the same year they presented MoDIL, an insertion-deletion (indel) finder that focuses on a size range of 20-50 base pairs. Ken Chen et al. published the aptly named BreakDancer, a tool to predict a wide variety of structural variation ranging in size from 10 base pairs to 1 megabase. In 2011 Evan Eichler and colleagues added Splitread to find indels, de novo structural variants and copy number polymorphisms with high specificity and sensitivity. More recently, in 2013, DeNovoGear from Donald Conrad and colleagues showed high validation rates in finding de novo indels in somatic tissue. This year Scalpel, written by Michael Schatz and colleagues, came on the scene; a combination of mapping and de novo assembly allows it to detect transmitted as well as new indels in exome data.

Handling RNA-seq data

In 2008 Mortazavi et al. and Cloonan et al. published some of the first RNA-seq papers in our pages, and in 2009 Wold and Mortazavi presented an overview of tools for RNA-seq data analysis and the principles behind them. Since then the number of RNA-seq analysis tools in the literature has grown steadily.

To assess differential expression in RNA-seq data, Malachi Griffith et al. wrote ALEXA-seq in 2010. The same year Chris Burge and colleagues published the MISO model to estimate expression of alternatively spliced exons and isoforms, and Inanc Birol and colleagues presented Trans-ABySS for de novo transcriptome assembly. A year later Manuel Garber and colleagues discussed the challenges in transcriptome mapping, reconstruction and expression quantification.

Last year Paul Bertone and colleagues from the RGASP consortium compared popular tools for spliced alignment. They also looked at the performance of software to reconstruct transcripts.

David Haussler and colleagues showed in 2010 with FragSeq that RNA-seq data can also be used to probe the structure of a transcript. And by combining SHAPE with HTP sequencing in their SHAPE-MaP approach, Kevin Weeks and colleagues showed this year that RNA functional motifs can be discovered in their structure.

Despite the many computational tools we have published, it is still not always easy to predict a priori which one will be taken up by the community. We’d love to hear from you what you think makes a top-notch analysis tool.

 

 

Synthetic Biology at Nature Methods

Since its launch, Nature Methods has seen many papers that have influenced the Synthetic Biology community. As a supplement to our May Focus on Synthetic Biology we take a nostalgic trip through the highlights of our papers in this area for different aspects of synthetic biology.

Cloning
In 2007 Stephen Elledge and Mamie Li developed SLIC (sequence and ligation-independent cloning), a strategy that uses homologous recombination to assemble many DNA fragments in vitro in a single reaction. Later the same year Mitsuhiro Itaya and colleagues also used homologous recombination in their bottom-up assembly to unite larger DNA pieces into genomes of about 140 kb in size.

In 2009 Daniel Gibson and colleagues presented their one-pot enzymatic reaction, which successfully assembled genomes hundreds of kilobases in size and has since been dubbed ‘Gibson Assembly’. The method reached fame on YouTube when the Cambridge iGEM team for 2010 created a music video showing how Gibson Assembly saves frustrated scientists.

Gene and genome synthesis
In 2007, to improve error-free DNA synthesis, Duhee Bang and George Church developed circular assembly amplification that eliminated error-containing oligonucleotides from the assembly. A few years later Jay Shendure and colleagues introduced their dial-out PCR to retrieve desired DNA molecules from a library for gene assembly.

For an in-depth review on the topic of DNA synthesis, error correction and gene assembly, visit Sriram Kosuri and George Church’s review in our Focus issue.

In 2010, on the heels of their breakthrough with Mycoplasma mycoides JCVI-syn1.0, the first chemically synthesized bacterial genome (Gibson D. et al., Science 329, 2010), Gibson et al. published the chemical synthesis of the mouse mitochondrial genome in our pages. They adapted Gibson Assembly to begin at the oligonucleotide level to rapidly make larger fragments that were then combined into the desired genome, exclusively in vitro. Once synthesized, a bacterial genome might need to be further modified, but doing so in an organism other than E. coli proved challenging. In 2013 Bogumil Karas et al. showed that whole genomes as large as 1.8 megabases can be directly transferred from bacteria to yeast, where genetic manipulation is routine.

In our current Focus issue Gibson reviews the state of the art in genome assembly techniques, compares strategies and discusses what the future may hold.

Genome modification
To quickly generate large libraries of promoters in targeted regions of a bacterial chromosome, George Church and colleagues presented coselection MAGE (multiplex automated genome engineering) in 2012. The increasingly popular CRISPR system can also rapidly edit genomes with few off-target effects when Cas9 is used as a nickase, as William Skarnes and colleagues showed earlier this year.

Gene activation can be tuned by targeting transcription factors via the CRISPR-Cas9 system, as Charles Gersbach demonstrated in 2013.

Circuit design
To ease construction of complex circuits, Adam Arkin and colleagues adapted known translational regulators to control transcriptional elongation in 2012. A bit later the same year Jim Collins and colleagues showed that an iterative plug-and-play method makes use of a large repository of genetic components when designing circuits. This year Jeff Tabor and colleagues showed how gene circuit dynamics can be controlled with light. On April 28 Douglas Densmore and his team introduced Raven, software that calculates assembly plans for complex circuits.

Parts characterization
To be successful in any of the above applications, one needs reliable and well-characterized parts. Last year Drew Endy, Adam Arkin and colleagues presented a method to quantify the performance of genetic elements, and in a companion paper they introduced a library of standardized transcription and translation initiation elements available through BIOFAB.

Towards the end of 2013 Christopher Voigt and colleagues expanded the designer’s toolbox with over 500 well-characterized transcriptional terminators. Robert Landick discussed how these ‘better stop signs’, as he termed them, provide insight into the mechanism of termination.

UPDATE: There is now a joint special on Synthetic Biology at nature.com/synbio with articles from Nature, Nature Reviews Microbiology and Nature Methods.

Enjoy reading. The papers mentioned above are listed below in chronological order.


How to write a rebuttal letter

A well written rebuttal letter is critical in any resubmission. 

Once the initial reaction to feedback from editors and reviewers has subsided, be that joy, anger or frustration, it’s time for our authors to make one of two decisions: continue to go after a Nature Methods paper or take their work to another journal.

A realistic look at how the reviewers’ requests can be met will go a long way in helping to determine whether a revision is likely to succeed and in avoiding a futile resubmission.

If the editorial decision was negative, the referees were critical and asked for a lot of additional information, and the authors still want to resubmit, the first step, before embarking on any revision, should be an appeal (see the post on “How to write an appeal letter” for more details) and a rebuttal letter to the editor to discuss whether a proposed list of additional information is likely to address the referees’ concerns.

Authors who receive a positive editorial decision and who are confident that they can address the reviewers’ points nevertheless have to submit a rebuttal letter with their revision.

The rebuttal letter is an author’s chance to directly reply to the reviewers, announce plans to improve the work, clear up misunderstandings or defend aspects of the work. How it is written can make a big difference in whether or not an appeal is granted and how the reviewers judge the revision.

The DOs:

  • Do acknowledge that the reviewers spent a substantial amount of time looking over the paper – rebuttal letters that thank the referees for their time and comments set a positive tone and ensure that the exchange takes place on a productive footing.
  • Do acknowledge that a misunderstanding may be due to poor presentation on your part, not lack of expertise on the reviewers’, and phrase your reply accordingly, taking the opportunity to clarify.
  • Do copy the full text of each reviewer’s comments into your rebuttal and reply to every concern immediately after the point that raises it. Keep each reply concise and state clearly how you plan to address the concern (experimentally or editorially), or point to data that already address it and that the reviewer appears to have missed.
  • If you cannot address a point at all, explain why not.
  • Do number the comments or at least break them into paragraphs, and use different fonts or text colors to distinguish the reviewer comments from your replies, rather than writing a single reply to an entire review in summary form.
  • Do include relevant citations with full references or DOIs so they can be easily looked up, rather than just citing by First Author et al.
  • Do include pertinent new data as embedded figures, tables or attachments, and indicate where in the manuscript you added the information: give page numbers, figure panels, Supplementary material etc., so editors and reviewers don’t have to go on a search for the new data. If any of this information will not be included in the revised paper, explain why not.
  • Do be succinct and to the point and avoid epic discourses. Where more than one referee has raised the same concern, it’s best to cross-reference, for example “see response to point 2 from Reviewer #1”.
  • Do remember that each reviewer sees all comments and your replies, so be equally respectful to all.

The DON’Ts:

  • Don’t vent or accuse the reviewers of bias or incompetence. We have read countless times that “ref 2 is lacking expertise and completely misses the point” etc., and one wonders what the goal of such blanket statements is. They serve no productive purpose and instead potentially bias all referees, even the positive ones, against the work.
  • Don’t plead that critically important experiments can’t be performed for personal or monetary reasons. While we hear the plight of underfunded labs, we don’t make exceptions for these reasons.
  • Don’t ignore specific requests by referees without comment or selectively answer only a few queries.
  • Don’t rephrase a referee’s point to give it a slightly different meaning that you can more easily address.

Don’t miss parts 1 and 3 of this series of posts covering cover letters and appeal letters. We encourage questions, comments and feedback below. The editors will do their best to answer any questions you have.

Importance of data sharing

No more (trade) secrets

Withholding information on the clinical significance of genetic variants from the scientific community impedes the progress of research and medicine.

Imagine you are a physician or researcher and seek to get more confirmation on the clinical impact of particular genetic variants. If your search of public databases comes up empty this does not necessarily mean that nothing is known about the mutations in question. Rather, the information may be locked away as a trade secret in a genetic testing company’s proprietary database.

Physicians and their patients are not able to independently verify the medical significance of a testing company’s finding; instead the results have to be taken on blind faith. Researchers are limited in their knowledge of the vast mutational landscape in genes associated with diseases such as cancer, which in turn may limit their understanding of the molecular underpinning of the disease.

Robert Nussbaum, at the University of California, San Francisco, recently pointed out that in other fields of medicine such an approach would be unthinkable. In a Technology Review article he said, “Imagine if radiological images or histopathology slides of cancers were examined by a single monopoly holder without the medical community being able to assess and learn from what these images and tissue specimens teach us.” He launched the Sharing Clinical Reports Project, an initiative to collect de-identified genetic testing data on the BRCA1 and 2 genes (as discussed in our August editorial).

With more genetic testing companies likely to enter the market, after the US Supreme Court invalidated some gene patents, the problems caused by proprietary data may increase. Clinicians may now have more options to obtain a genetic test, but, if they go with the less established testing company, they are then left with a suboptimal interpretation with possibly grave implications for the patient.

A resolution from the American Medical Association passed in June 2013 supports public access to genetic data. The resolution calls for companies, laboratories, researchers and providers to publicly share data on genetic variants in a manner consistent with privacy and HIPAA protections.

Whether such calls will be heeded is another question. In a New York Times Op-Ed piece aptly named “Our genes, their secrets” the author wonders if the recent Supreme Court decision will prompt genetic testing companies to rely more on the strategy of treating information on the clinical impact of mutations as trade secrets, and thereby try to deter competition and ensure revenue.

How can this be prevented? Cook-Deegan et al., in a recent article in the European Journal of Human Genetics, call for joint action by national health systems, insurers, regulators, researchers, providers and patients to ensure broad access to information about the clinical significance of variants. Besides the promotion of voluntary sharing, their suggestions include making sharing a condition of payment or of regulatory approval of the testing laboratories.

The battle about who may offer certain genetic tests is certainly heating up. Ambry Genetics and Gene by Gene, two of the companies now offering BRCA1 and 2 testing, have been sued by Myriad Genetics for patent infringement. A few days later, on July 12, US Senator Patrick Leahy, a Democrat from Vermont, wrote to Francis Collins, the director of the NIH, urging him to force Myriad to license the patent on reasonable terms to other parties to ensure affordable life-saving diagnostic tests. As the federal agency that provided the funding for the research behind Myriad’s patent, the NIH has the authority to do so, based on a provision in the Bayh-Dole Act, the law that enabled universities to own inventions based on federal funding. Whether it will exercise this authority is unclear. Collins’s reply is still outstanding.

Ambry Genetics disputes that it infringes any of Myriad’s patents and a company spokesperson told Nature Methods that Ambry plans to share their testing data.

If enough companies follow suit, the desirable equilibrium of compensating a company fairly for the cost of its test and at the same time letting the public benefit from the results of these tests should be within reach.

What you always wanted to know about histones

Nature Methods and Nature Biotechnology will host a live discussion on why histone modifications matter in health and disease.

Some call it a code, some call it a language. The fact is that the core histone proteins that make up the nucleosome can be modified by a range of post-translational modifications (the current tally is 16) and that these PTMs, individually or collectively, send a message to the transcription machinery, either attracting or repelling it.

If you have wondered about the nature of the histone code, if you have questions about the importance of its writers, readers and erasers, or if you wonder how these are changed in some diseases and what can be done about it, an upcoming webcast will give you a chance to raise these questions.

On February 26 we will discuss the importance of histone modifications from two aspects. First: What is the biology behind it? Which enzymes write the code and how important is crosstalk between different modifications? Second: How can one efficiently target these enzymes to fight disease?


Our speakers, Ali Shilatifard and James Bradner, will present their views and then they will engage in a live discussion fueled by questions from the audience.

Sign up for the webcast, and post your questions here before February 26, or during the webcast on the event website. Either way, we will try our best to get them answered.

Note: The live webcast has now concluded. Anyone who wants to see it may still register at the link above and view a recording of the webcast at their leisure.

Whose genes are they?

The first gene patent was granted in 1982 for the sequence encoding insulin. Today about 20% of the human genome is patented. And the controversy is as alive as ever, as we discuss in this month’s editorial.

Gene patents have supporters who see them as essential for the development of diagnostic and therapeutic products and detractors who see them as hindering research directly and indirectly.

We think that Myriad Genetics’ patents on BRCA1 and 2, and how these patents have been upheld in court challenges in the US, illustrate how gene patents can elicit preemptive obedience, which is problematic for patients and researchers alike.

While it is not clear that Myriad’s patents on BRCA1 and 2 are infringed by technology that does not rely on isolated cDNA, such as next generation sequencing (NGS), no company in the US seems to be willing to take that risk. Ambry Genetics, for example, a company offering an NGS-based test for mutations in 14 breast cancer-related genes, specifically excludes BRCA1 and 2.

For patients in the US with a family history of breast cancer this is bad news, since it leaves them with Myriad as the only provider of mutation testing in BRCA. And there is no way to obtain a second opinion, which matters particularly for negative results.

Not so in the UK. There Myriad holds fewer and more restricted patents and NewGene, a company owned by the Newcastle Hospitals NHS Foundation Trust and Newcastle University, offers full sequencing of the BRCA1 and 2 coding regions and some introns. We are not saying that NewGene’s test is superior to Myriad’s, though it is certainly cheaper. But it shows that the latest technology can quite rapidly be translated into the clinic, if there are no legal hurdles.

To understand the impact of the many different mutations in the BRCA genes, researchers need to have access to a large patient population. For this purpose the NIH founded the Breast Cancer Information Core (BIC) mutation database to gather such patient data.

Myriad, as the sole provider of BRCA mutation tests, acquires large amounts of patient data, and until about 2006 the company contributed variants to BIC. Then it decided to keep them proprietary. Researchers were not happy about losing access to such valuable information. In response, Robert Nussbaum from UCSF started an initiative to gather mutations directly from physicians. He is asking them to black out a patient’s identity and forward the mutation report Myriad provides so it can be incorporated into BIC. This information, together with many international contributions, has led to over 14,000 variants in the database.

With personalized medicine on the rise and many people interested in having their genome sequenced, restricting who can look at a particular gene makes little sense, and the much coveted $1000 genome will hopefully not be rendered ineffective by patent laws.

As always we are keen to hear your views.

To share or not to share

Many in the mass spectrometry community agree that MS data, all of it, including the raw files generated by the mass spectrometers, should be made publicly available for everybody’s benefit.

In the May editorial we support this request and introduce a new raw data repository run by the EBI that offers to replace the declining TRANCHE, until very recently the only repository for such data.

Several good arguments can be made for making raw data available; one of them is that published data can be re-analyzed to validate claims. Consider, for example, the controversy arising in the wake of the analysis of fossilized Tyrannosaurus rex bones by Asara and colleagues, which led them to suggest that T. rex is more closely related to birds than to reptiles (Asara et al., Science 2007). Their findings were finally corroborated in 2009 (Bern et al., J. Proteome Res.) but could have been examined much more quickly if access to raw data had been given at the time of publication.

Re-analysis aside, raw data present a treasure trove of information that can be examined from different angles and, over time, with new tools that bring aspects to light that the original experimenters did not think of. To create such new analysis tools, software developers rely on raw data to benchmark against established techniques.

Having access to raw files does not mean that they are easy to use. We realize that the diversity in file formats and the difficulty in converting one file type to another make their analysis less straightforward than it could be with a single community-supported format. And we also realize that these files are large and that uploading them to the new EBI repository, or any other repository, will take time and some effort, particularly if important metadata about the experiment are included.

Still, we think the effort is worth it to ensure the field can move forward. We’d love to hear your views, particularly if you disagree.

Transparency in NIH funding

Obtaining research funding, particularly from the NIH, is an increasingly daunting task that requires a lot of background information in addition to presenting a stellar proposal.

Researchers need to decide which of the 25 institutes at the NIH is most likely to fund a particular line of work. A new database, linked to RePORTER, NIH’s research portfolio reporting tool, will make this decision a bit easier.

The editorial in the June issue of Nature Methods highlights the importance of this database and a Correspondence in the same issue describes its functionality.

The developers of this database welcome community feedback and we encourage you to try it out.

Blogging at meetings

Social media are rapidly becoming a part of scientific meetings. It is no longer unusual to tweet from meetings and summary reports of talks can often be found on blogs.

Many meeting organizers support bloggers and microbloggers. To give only a few examples: the upcoming Experimental Biology meeting of the Federation of American Societies for Experimental Biology (FASEB) is supportive of scientist bloggers discussing the meeting content online. For the past few years the International Conference on Intelligent Systems for Molecular Biology (ISMB) has linked FriendFeed discussions about every talk to the meeting’s homepage, ensuring that these exchanges are archived and easily accessible.

Organizers at the recent Keystone meeting on Stem Cells, Cancer and Metastasis provided a Twitter hashtag to initiate dialog between meeting participants and discuss questions raised at the meeting. Similarly, at the Workshop on Visualizing Biological Data in March, tweeting was encouraged and eagerly embraced by attendees.

At this year’s Advances in Genome Biology and Technology meeting one speaker underscored his support of social media by wearing a T-shirt displaying “Tweet me” in large print. While we do not suggest such a dress code be made mandatory, we do, in principle, support the spirit behind the openness, as long as reasonable and clearly communicated restrictions by presenters are honored, as discussed in the editorial in our April issue.

Given that social media are still rapidly evolving, the scientific community needs to keep up a dialog as to how best to use them. We are keen to hear about our readers’ experiences with meeting blogs and tweets.