Computable sugars: some computational resources in glycoscience

Glycoscience is sweet science{credit}PhotoDisc/Getty Images{/credit}

As glycoscience advances, labs will increasingly want to ask questions about glycosylation sites on a protein or the structure of a sugar, says Raja Mazumder, a bioinformatician at George Washington University. They might ask, for example, which glycosyltransferases are expressed in the liver but not in the heart, or which ones are overexpressed threefold in more than two cancers. Such questions require infrastructure building, he says, because right now there is no mechanism to support such queries. But he and others are building these capabilities. Mazumder, along with William York at the University of Georgia, is starting to build a glycoscience informatics portal.

Mazumder wants to leverage existing ontologies in the developer community in order to build systems that can be queried on a large scale. For example, he is working with Cathy Wu at Georgetown University, who is developing the Protein Ontology. Such ontologies are collected, for example, by the non-profit OBO Foundry. To allow flexible querying, the computational resources will draw on different ontologies, including ones that relate to glycans, genes, proteins, tissues, diseases and more.

Ontologies are part of the team’s effort to build application programming interfaces (APIs) that expose the data in a given database to incoming queries. Given how complex sugars are, the informatics framework has to be well organized for both human and machine-based querying, says Mazumder.

When using the resource, a researcher will receive results that also document the search process itself, such as the version of the queried database. “You need to be able to tell where you got that information from,” says Mazumder. Tracking data provenance matters especially in an age when databases continuously integrate information emerging in the literature.
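
As a rough illustration of the kind of provenance tracking Mazumder describes, a query wrapper might return the database name, release version and retrieval time alongside the hits themselves. The sketch below is purely hypothetical; the database, field names and glycan identifiers are placeholders, not an actual portal API.

```python
from datetime import datetime, timezone

# Hypothetical in-memory "database" release; a real portal would expose this via its API.
GLYCO_DB = {
    "name": "ExampleGlycoDB",   # placeholder name, not a real resource
    "version": "2016-04",       # release of the queried database
    "records": {
        "G00001": {"protein": "P12345", "site": 147, "glycan": "Man5GlcNAc2"},
        "G00002": {"protein": "P67890", "site": 61,  "glycan": "Man9GlcNAc2"},
    },
}

def query_glycosylation(protein_accession):
    """Return matching records plus provenance: which database, which version, when."""
    hits = [rec for rec in GLYCO_DB["records"].values()
            if rec["protein"] == protein_accession]
    return {
        "query": {"protein": protein_accession},
        "provenance": {
            "database": GLYCO_DB["name"],
            "version": GLYCO_DB["version"],
            "retrieved": datetime.now(timezone.utc).isoformat(),
        },
        "results": hits,
    }

print(query_glycosylation("P12345"))
```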

For the Food and Drug Administration, Mazumder is developing computational standards for high-throughput sequencing, which he wants to also apply to glycoscience. His ‘biocompute object’ captures the computational workflow a lab used to generate its results: the software used, the databases queried and their versions, and identifiers of data inputs and outputs. These biocompute objects are intended to help regulatory scientists interpret submitted work. They can also help scientists more generally see whether, for example, the version of software they used worked as it should, says Mazumder.
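
A biocompute object is essentially a structured record of a workflow. The fields in the sketch below follow the description above (software and versions, databases queried, input and output identifiers), but the layout is a simplified illustration, not the formal biocompute object schema.

```python
import json

# Simplified sketch of a workflow record; field names are illustrative only.
biocompute_record = {
    "id": "example-bco-0001",
    "pipeline": [
        {"step": 1, "software": "read_mapper",    "version": "2.3.1"},
        {"step": 2, "software": "variant_caller", "version": "1.9"},
    ],
    "databases": [
        {"name": "reference_genome_db", "version": "GRCh38.p7"},
    ],
    "inputs":  ["SRR0000001"],          # identifiers of input data sets
    "outputs": ["variants_run42.vcf"],  # identifiers of generated results
}

print(json.dumps(biocompute_record, indent=2))
```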

Too often labs use computational tools without benchmarking them, says Mazumder. “It would be unthinkable for a wet-lab scientist to not have a positive and negative control,” he says.  In informatics, developers benchmark their software but users often do not have these habits. “They don’t even know: if I don’t find anything, is it because my software did not run well or not?”
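
Translating the wet-lab idea of positive and negative controls into informatics might look like the sketch below: run the same tool on an input where hits are known to exist and on one where none should be found, and flag a silent failure. The tool and data here are hypothetical stand-ins, not any specific software.

```python
def run_tool(dataset):
    """Stand-in for any analysis tool; returns a list of detections."""
    return [item for item in dataset if item.startswith("HIT")]

positive_control = ["HIT_glycopeptide_1", "HIT_glycopeptide_2"]  # known true positives
negative_control = ["background_1", "background_2"]              # nothing should be found

assert run_tool(positive_control), \
    "Tool found nothing in the positive control: the run itself may have failed"
assert not run_tool(negative_control), \
    "Tool reports hits in the negative control: false positives likely"
print("Controls behaved as expected; an empty result on real data now means something.")
```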

As labs move to big data analysis in genomics and, eventually, in glycoscience, this aspect becomes ever more important, says Mazumder. In his view, biocompute objects will help glycobiology researchers communicate with one another about their results, such as where on a protein they found a sugar with a given structure. More generally, they will give glycoscientists a better way to connect the available sugar resources as they pursue their questions of interest.


Here are some resources that glycoscientists can tap into:                             

General resources and funding information

- Transforming Glycoscience: A Roadmap for the Future: report by the National Research Council of the National Academies
- NIH Common Fund program in glycoscience: funding opportunities from the NIH Common Fund program in glycoscience
- A Roadmap for Glycoscience in Europe: glycoscience roadmap for Europe from the BBSRC, EGSF and the European Science Foundation
- GlycoNet: resources related to glycoscience research in Canada, based at the University of Alberta, home of the Alberta Glycomics Centre
- National Center for Functional Glycomics: a glycomics-related Biomedical Technology Resource Center based at Beth Israel Deaconess Medical Center, Harvard Medical School, with resources on, for example, microarrays and microarray services, protocols, training and databases

Databases and portals

- CAZy: Carbohydrate-Active Enzymes, a database of enzyme families that degrade, modify or create glycosidic bonds
- Consortium for Functional Glycomics: resources and glycoscience data; part of the National Center for Functional Glycomics
- ExPASy: software tools and databases to simulate, predict and visualize glycans, glycoproteins and glycan-binding proteins
- Glycan Library: a list of lipid-linked, sequence-defined glycan probes
- Glyco3D: a portal for structural glycoscience
- GlycoBase 3.2: a database of N- and O-linked glycan structures with HPLC, UPLC, exoglycosidase sequencing and mass spectrometry data
- GlycoPattern: portal for glycan array experimental results from the Consortium for Functional Glycomics
- Glycosciences.de: collection of databases and tools in glycoscience
- GlyTouCan: repository for glycan structures, based in Japan
- MatrixDB: a database of experimental data on interactions involving proteoglycans, polysaccharides and extracellular matrix proteins
- Repository of Glyco-enzyme Expression Constructs: University of Georgia Complex Carbohydrate Research Center repository for glyco-enzyme constructs
- SugarBind: a database of carbohydrate sequences to which bacteria, toxins and viruses adhere
- UniCarbKB: a resource curated by scientists in five countries; it includes GlycoSuiteDB, a database of glycan structures; EUROCarbDB, an experimental and structural database; and UniCarb-DB, a mass spectrometry database of glycan structures

Software tools

- CASPER: web-based tool to calculate NMR chemical shifts of oligo- and polysaccharides
- Glycan Builder: an online tool at ExPASy for predicting possible oligosaccharide structures on proteins
- GlycoMiner/GlycoPattern: software tools to automatically identify mass spectra of N-glycopeptides
- GlyMAP: an online resource for mapping glyco-active enzymes
- NetOGlyc: software tool for predicting O-glycosylation sites on proteins
- SweetUnityMol: molecular visualization software

Sources: NIH, R. Mazumder, George Washington University; New England Biolabs, Thermo Fisher Scientific, Nature Research

Mass spectrometry-based proteomics at Nature Methods

A look back at highlights in proteomics technology developments published in Nature Methods.

The last decade has seen amazing advances in mass spectrometry-based proteomics technology as well as ever-expanding use of the technology for varied biological applications. Here we take a look back at some proteomics technology development highlights published in Nature Methods over the last 10 years. (A second entry covering biological applications of mass spectrometry-based proteomics is planned for the near future; stay tuned.)

Sample preparation

The first step in a successful proteomics experiment is sample preparation. In 2009 Matthias Mann’s lab published a filter-aided sample preparation (FASP) method that is widely used by the proteomics community. In 2014 the same lab published an optimized approach that performs all sample processing tasks in a single enclosed tube.

Proteins are digested into peptides for ‘shotgun’ proteomics analysis. While trypsin is most widely used, it also comes with known limitations. Albert Heck and colleagues and Neil Kelleher and colleagues described useful alternatives to trypsin.

Proteomics researchers are always striving for higher sensitivity. The DigDeAPr method from John Yates’s lab and the use of DMSO to enhance electrospray response from Bernhard Kuster’s lab allow researchers to perform deeper proteomic analyses.

Quantitative methods

Proteomics researchers want to quantify, as well as identify, peptides and proteins. Stable isotope labeling, either through metabolic incorporation or chemical labeling during sample preparation, enables researchers to quantitatively compare multiple samples. Spiking labeled concatenated signature peptides into samples enables absolute quantification, as shown by Robert Beynon and colleagues.
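
The arithmetic behind spiked-in standards is simple: if a known amount of an isotope-labeled signature peptide is added, the endogenous amount follows from the light-to-heavy intensity ratio. A minimal sketch with made-up numbers:

```python
def absolute_amount(light_intensity, heavy_intensity, spiked_fmol):
    """Endogenous peptide amount estimated from the ratio to a spiked, isotope-labeled standard."""
    return (light_intensity / heavy_intensity) * spiked_fmol

# Made-up example values: 50 fmol of labeled standard spiked into the sample.
endogenous = absolute_amount(light_intensity=2.4e6, heavy_intensity=1.2e6, spiked_fmol=50.0)
print(f"Estimated endogenous peptide: {endogenous:.1f} fmol")  # 100.0 fmol
```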

The SILAC metabolic method has proved to be extremely popular, and we have published applications of SILAC for quantifying proteins and phosphorylation sites in human tissues, and in nematodes (Larance et al. and Fredens et al.).

A limitation of SILAC is that it cannot be used to compare more than three samples at one time. Joshua Coon and colleagues provided a clever way around this with their NeuCode SILAC approach, which in theory could enable up to 39-plex experiments.

Chemical labeling approaches (such as iTRAQ and TMT) currently offer higher multiplexing capability than SILAC, but can suffer from problems of quantitative accuracy. Coon’s lab and Steven Gygi’s lab each provided methods to obtain accurate quantitative data in multiplexed experiments.

Shotgun data analysis

In a typical ‘shotgun’ proteomics (discovery-based) experiment, MS/MS fragmentation spectra are generated for all peptides that can be detected by the mass spectrometer. The proteins are identified by matching these experimental spectra to theoretical or actual MS/MS peptide spectra found in databases. Well-performing tools to do this and methods to control for false discoveries are therefore crucial.
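
As a rough illustration of the matching step, a very simplified scorer can count how many fragment peaks an experimental spectrum shares with each candidate theoretical spectrum within a mass tolerance. Real search engines use far more sophisticated scoring, so the peptides and m/z values below are invented for the sketch only.

```python
def shared_peak_score(experimental_mz, theoretical_mz, tolerance=0.5):
    """Count theoretical fragment m/z values matched by an experimental peak within the tolerance."""
    return sum(
        any(abs(exp - theo) <= tolerance for exp in experimental_mz)
        for theo in theoretical_mz
    )

# Invented m/z lists: one experimental MS/MS spectrum and two candidate peptides' predicted fragments.
spectrum = [175.1, 262.2, 375.3, 488.3, 603.4]
candidates = {
    "PEPTIDEA": [175.1, 262.1, 375.3, 490.0],
    "PEPTIDEB": [147.1, 234.1, 391.2, 504.3],
}
best = max(candidates, key=lambda pep: shared_peak_score(spectrum, candidates[pep]))
print(best, shared_peak_score(spectrum, candidates[best]))  # PEPTIDEA 3
```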

To generate good proteomics data, one must get the best possible performance out of the mass spectrometer. The HCD method from Stevan Horning and Matthias Mann and colleagues and a decision tree algorithm from the Coon lab enable researchers to obtain improved MS/MS data for protein identification.

We have published tools for peptide identification – Percolator, SpectraST and MS-Cluster – and quantitative data analysis (Census). Lennart Martens’ group showed that combining various data processing workflows leads to greater proteome coverage. Proteogenomics-type approaches using custom databases generated from genomic data are becoming popular, as they allow novel peptides not found in standard protein databases to be identified (see Evans et al. and Branca et al.).

Researchers must be careful to not overinterpret their proteomics data. Gygi’s lab wrote a useful Perspective on the target-decoy approach for determining false discovery rate, a metric that has become broadly adopted by the field.
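
The target-decoy idea itself reduces to a small calculation: search the spectra against real (target) and reversed or shuffled (decoy) sequences, and estimate the false discovery rate above a score threshold from the decoy matches. A minimal sketch using one common estimator (decoys divided by targets) and invented scores:

```python
def estimated_fdr(matches, threshold):
    """matches: list of (score, is_decoy). FDR estimated as decoys / targets above the threshold."""
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys  = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Invented peptide-spectrum match scores; True marks a decoy hit.
psms = [(95, False), (90, False), (88, True), (86, False), (80, False), (79, True), (60, False)]
print(f"FDR at score >= 85: {estimated_fdr(psms, 85):.2f}")  # 0.33
```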

In order to keep tools sharp and highlight areas for development, it is important to systematically put them to the test. In 2005, Gygi’s lab performed a comparison of three platforms. In 2009, a large group of researchers tested their ability to identify proteins in a small test sample. This analysis highlighted common problems that occur especially during data analysis in proteomics investigations.

Targeted proteomics

Targeted proteomics, which we chose as our Method of the Year in 2012, offers a fundamentally different way of analyzing data compared to discovery-based proteomics. Targeted approaches, most commonly selected reaction monitoring (SRM), utilize mass spectrometry assays to identify and quantify peptides selected to represent proteins of interest, akin to Western blotting, but in a multiplexed fashion.
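
In practice, an SRM assay reduces each targeted peptide to a handful of precursor-to-fragment transitions whose peak areas are summed and, typically, compared with a heavy-labeled internal standard. A bare-bones sketch with invented transition areas:

```python
# Invented transition peak areas for one target peptide (light) and its heavy-labeled standard.
light_transitions = {"y7": 1.8e5, "y8": 2.2e5, "y9": 1.5e5}
heavy_transitions = {"y7": 3.6e5, "y8": 4.4e5, "y9": 3.0e5}

light_area = sum(light_transitions.values())
heavy_area = sum(heavy_transitions.values())
ratio = light_area / heavy_area  # relative abundance of the endogenous peptide vs the standard

print(f"Light/heavy ratio: {ratio:.2f}")  # 0.50 with these made-up values
```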

These SRM assays can be laborious to generate, however. Methods for high-throughput SRM assay generation are therefore important (see Picotti et al., Stergachis et al. and Kennedy et al.). In 2008, Ruedi Aebersold’s group set up a database of assays for the yeast proteome, called SRMAtlas, which has since grown to include assays for M. tuberculosis and human. Amanda Paulovich and colleagues just this year presented the CPTAC Assay Portal, a new repository of analytically validated targeted proteomics assays.

Statistical validation is just as important in targeted proteomics as it is in discovery-based proteomics. Aebersold’s lab developed the mProphet tool and also provides a useful guide to SRM in its 2012 Review.

Biological applications of targeted proteomics are growing. Bart Deplancke and colleagues showed that transcription factors could be followed during cellular differentiation using SRM. Olga Vitek’s group showed that targeted proteins could be quantified using sparse reference labeling. In this current 10th Anniversary issue, Claus Jørgensen’s group reports a quantitative method for monitoring human kinases, and Paola Picotti’s lab describes a panel of assays to quantify ‘sentinel’ proteins reporting on 188 different yeast processes.

Data-independent analysis

Our very first issue in October 2004 featured an interesting paper from Yates and colleagues describing a data-independent mass spectrometry scanning approach for acquiring MS/MS spectra. In contrast to the common data-dependent approach, where the most prominent peptide ions are selected for MS/MS, the data-independent approach can enable more reproducible results as it overcomes issues of peptide ion sampling stochasticity. It took nearly a decade for this clever idea to really catch on, but within the last year or so, we have published practical data-independent analysis implementations from Michael MacCoss’s and Stefan Tenzer’s labs.

Anne-Claude Gingras and Stephen Tate and colleagues, along with Aebersold and colleagues, showed how a quantitative targeted data-independent analysis method called SWATH provides advantages for analyzing protein interactomes by affinity purification-mass spectrometry.

We look forward to many more strong advances in mass spectrometry-based proteomics in the decade to come!

Sunset on the PSI

As discussed in this month’s Editorial, the Protein Structure Initiative (PSI), a 15-year, nearly $1 billion structural genomics project funded by the National Institute of General Medical Sciences (NIGMS), will be coming to an end in 2015. The impact of ending this project should be minimized to avoid the loss of valuable resources and expertise.

The PSI was begun in 2000 when the US NIH budget was in the midst of substantial growth, spurring the creation of many new “big science” projects. At that time, protein structure determination was painfully slow. The PSI’s initial goals were to develop tools and methods to improve the speed and ability to solve protein structures, as well as to generate a large resource of novel and unique protein structures to facilitate homology modeling and to promote follow-up functional studies.

Protein Structures

{credit}Image Credit: Erin Boyle{/credit}

The PSI drew criticism from many in the structural biology community from its inception, however. Critics of the first two phases of the PSI pointed out that it focused mainly on solving small bacterial protein structures that were not biologically very interesting, simply because they were relatively easy to express, purify and crystallize. This led the PSI to substantially change course in the third and current phase, PSI-Biology, which began in 2010. The emphasis shifted from throughput to solving important, difficult structures such as human membrane proteins and drug targets. This change of course has led to several successful structures of highly interesting yet difficult proteins such as GPCRs.

In 2013, a scientific advisory panel produced a mid-point evaluation report of PSI-Biology for NIGMS (PDF), assessing its strengths and weaknesses. The panel commended the impressive number of high-quality structures and methodological advances of the PSI centers, but noted that outreach to the broader biological community was inadequate. The panel found that the PSI’s community resources – the Structural Biology Knowledgebase, a portal for research, news and resources produced by the PSI (in collaboration with Nature Publishing Group), and the Materials Repository, which provides over 80,000 plasmids and 106 empty vectors to the community – were not widely taken advantage of by researchers outside the PSI. The panel also noted that NIGMS should start planning to transition the PSI from its current set-aside funding structure to a different funding model that maintains its unique resources and capabilities – but that it should also extend PSI-Biology for another 3 to 5 years past 2015 to allow it to reach its full potential.

However, budget cuts and a reassessment of NIGMS’s large-scale research initiatives by its new director, Jon Lorsch, have led the institute to prematurely cut the program after the current phase, PSI-Biology, ends in 2015.

The loss of this large-scale program will certainly shake up structural biology in the US, particularly for all those involved in the PSI but also for the many researchers – even outside of traditional structural biology – who benefit from PSI resources. As we discuss in this month’s Editorial, NIGMS now has the opportunity to set an example for other funding agencies in how to wind down a big science project with minimal negative impact. Internal and external transition planning committees have been created by NIGMS to determine which resources and capabilities developed by the PSI should be preserved and how this can be done. NIGMS has also put out a Request for Information seeking community input on the utility of resources developed by the PSI; the response date (May 23, 2014) has only just passed.

NIGMS is expected to make a decision before the end of 2014 as to what will be done with the substantial high-throughput expression, purification and crystallization facilities developed during the PSI’s tenure. In the Editorial we argue that this infrastructure – and the large-scale raw data and metadata that are not in any database but are valuable for mining and algorithm development – should be preserved as much as possible. We argue that a project to systematically sample protein folds should be continued on a smaller scale. We also argue that NIGMS should continue to facilitate team research – as the internal transition planning committee chair Douglas Sheeley has said it will – to tackle particularly challenging structural biology research problems that require hybrid methods to solve.

We will be watching closely to see whether the negative fallout from the end of the PSI can indeed be minimized.

People, publishing, and policy: Q&A with Janet Thornton, director of the European Bioinformatics Institute.

Janet Thornton has been named Dame Commander of the Order of the British Empire. She feels it is an important recognition of bioinformatics.{credit}EMBL-EBI{/credit}

The scientist profiled in the February issue of Nature Methods (the Author File) is Janet Thornton, the director of the European Bioinformatics Institute.

Here, she shares some additional insight about publishing, science policy, and mentoring. What follows is an edited excerpt of her conversation with Nature Methods. Read more here.

VM: In an era of not-so-plentiful funds, ELIXIR (interviewer looks up acronym: the European Life-sciences Infrastructure for Biological Information) and other initiatives take you deep into policy-making, which tends not to resemble a picnic on a sunny Nottinghamshire day. What motivates you?

JT: ELIXIR was launched Dec 18 and now has its own director. It does feel a bit that it’s my child. But it’s a child that has grown up and is really on its way to becoming independent and moving forward to being an independent adult. It’s still got a long way to go. It’s a bit like a teenager, actually. (laughs)

I honestly believe that these initiatives are the best way forward because, despite the setbacks, everyone broadly agrees. So it is a case of getting through the politics and making the science happen. As we know, science has no borders—and all scientists agree with this—so in the end, common sense will win and we can go forward.

VM: You have published around 400 papers. What does a paper mean to you?

JT: Probably for me the most important part of the process of science is publishing a paper. Because it’s the time when you really sort out what matters, why you did it, what you discovered and then you try and make it understandable for other people. And I have to say I get really upset when my papers are rejected.

VM: What types of papers do you enjoy reading?

JT: I love reading good solid papers, which are logical and explain how the results are obtained and why they are important. I used to spend hours in the library, like a detective tracking down information and knowledge.

VM: Rumor has it, you still present posters.

JT: I don’t often present posters, but there was one particular occasion when the University of Cambridge organized an event and asked all the senior staff throughout the university to present posters. That was the last sort of official poster presentation. Of course, my students and postdocs have posters all the time. And I do man those posters as appropriate. It’s fun. You talk about your work.

VM: What is the best way for a scientist to select members most suited to his or her lab?

JT: Five things I look for: a) Bright/clever, b) Committed and interested in a project or area of research, c) Relevant expertise – though this is not the most important thing, d) What does the lab think? e) Would I like to have a meeting at 9am on a Monday morning with this person?

VM: Computational resources in the life sciences are not always appreciated. What do you recommend to scientists keen on being and staying tool-builders and resource-providers?

JT: Find a good place to go to follow your dream; find someone you want to work with and prepare yourself for the future. Not all scientists can be principal investigators (PIs), nor indeed want to be, so the key is to find your own niche.

VM: You studied physics at the University of Nottingham, then shifted to biophysics for your PhD at the National Institute for Medical Research. What do you advise when students of any stripe wonder: ‘Shall I choose physics? Computer science? Biology?’

JT: I am afraid I am biased—go with biology—it is amazing, beautiful, complex, but still an open book with lots to discover. And even if this were not enough, it has so many really important applications —many of the so-called grand challenges that will literally affect the future of this planet and everyone on it.

An all-encompassing term to describe protein complexity

Neil Kelleher and Lloyd Smith propose that the scientific community adopt the term ‘proteoform’ to refer to all the different forms that a protein can take. Will the community adopt it?

The field of top-down proteomics, in which intact proteins are analyzed by a mass spectrometer, provides rich information about the genetic variations, alternative splicing and post-translational modifications that can be lost in a bottom-up proteomics approach (where proteins are digested into peptides prior to analysis). An unsolved problem in the top-down field, however, has been what exactly to call these various protein forms. Besides ‘protein forms’, a handful of other terms have been batted around in the literature, including ‘protein variants’, ‘protein isoforms’ and ‘protein species.’

In a Correspondence in the March issue of Nature Methods, Neil Kelleher and Lloyd Smith lay out the reasons why none of these terms are satisfactory. What is needed, they argue, is a novel, unique, intuitive, single-word term with a precise definition that is all-encompassing in describing protein complexity, and is also compatible with a gene-centric approach to protein naming. They believe that they have the perfect term: proteoform.

“It’s not just a term, it’s a movement,” says Kelleher. Kelleher has been one of the key drivers of top down methodology development, and argues that using a controlled vocabulary to describe proteins will serve a catalytic role in moving the field forward. “The implicit thing about this term is that it puts a focal point on the fact that [the proteoforms] are the functional players, insofar as protein primary structure is concerned,” he says. Especially in clinical research, he notes, different proteoforms are tied strongly to function and phenotype.

Kelleher and Smith have been gathering support for their term over the last several months by introducing it at conferences and inviting researchers to comment on a LinkedIn forum. The term also has the full support of the Consortium for Top Down Proteomics. At their latest conference in Florida, about a month ago, Kelleher says that “everyone” was using “proteoform” in their talks. “It just catches on…it fills a void, rolls right off the tongue at conferences and sits well in the gut while digesting text,” he says. The consortium website maintains a repository of proteoforms, which they hope will grow. Kelleher also notes that the term is being embraced by key protein informatics players at UniProt and the Protein Information Resource, both of which have adopted a gene-centric approach to protein naming.

What do you think about the term “proteoform”? Will you adopt it? We’d love to hear from you!

A different kind of Method of the Year for 2012

Our choices of Method of the Year in prior years have tended to be methods that didn’t even exist only a few years earlier but that quickly bounded onto the scientific stage and attracted the attention of a large portion of the scientific community. Targeted proteomics, our choice for 2012, on the other hand has existed for years in scaled-down forms using methods based on antibodies. Western blotting, immunofluorescence, antibody arrays and the like can all be used to detect and measure targeted subsets of the proteins expressed in cells and tissues.

During this time the workhorse of proteomics, the mass spectrometer, has been used mostly for shotgun proteomics experiments in which the goal was to analyze all the proteins in a sample. But the means to use these machines for targeted detection of defined subsets of proteins and obtain more reproducible measurements than shotgun experiments can typically provide have been around for decades.

Shotgun methods have been mostly confined to specialist laboratories as many biologists have been intimidated by the complexity of implementing and analyzing these experiments properly. Targeted proteomics on the other hand offers a tantalizing opportunity to bring a sampling of the power of mass spectrometry to the wider community of biologists. The assays are simpler, easier to run and well suited to the hypothesis-driven experiments that are the mainstay of biological research.

The ubiquitous Western blot has long filled a central role or functioned as a crucial control in many research studies. Unfortunately, performing a high-quality Western blot can feel a bit like roulette. Sometimes you get a fantastic-looking blot with an accurate antibody; other times the blot is blank, the bands look like they ran through some carnival ride, or the blot suffers from any number of other problems. This might prompt people either to look for a goat to appease the Western blot gods or to take unscientific liberties with the presentation of the data in order to make it look the way they believe it should. It also lessens the likelihood that important replicates are performed or reported.

Targeted mass spectrometry offers the possibility for thousands of labs to move away from, or supplement, Western blots and improve the quality and quantity of their protein measurements. This is not as sexy as next-generation sequencing, super-resolution imaging or optogenetics, some of our prior choices of Method of the Year, but the potential for revolutionizing an arguably mundane yet indispensable technique was compelling enough that it played no small role in our decision. Only time will tell what impact the method has, and we eagerly look forward to the answer.

To share or not to share

Many in the mass spectrometry community agree that MS data should be made publicly available for everybody’s benefit. All data, including the raw files generated by the mass spectrometers.

In the May editorial we support this request and introduce a new raw data repository run by the EBI that offers to replace the declining TRANCHE, until very recently the only repository for such data.

Several good arguments can be made for making raw data available; one of them is the re-analysis of published data to validate claims. Consider, for example, the controversy that arose in the wake of the analysis of fossilized Tyrannosaurus rex bones by Asara and colleagues, which led them to suggest that T. rex is more closely related to birds than to reptiles (Asara et al., Science, 2007). Their findings were finally corroborated in 2009 (Bern et al., J. Proteome Res.), but they could have been examined much more quickly if access to the raw data had been provided at the time of publication.

Re-analysis aside, raw data present a treasure trove of information that can be examined from different angles and, over time, with new tools that bring aspects to light that the original experimenters did not think of.  To create such new analysis tools, software developers rely on raw data to benchmark against established techniques.

Having access to raw files does not mean that they are easy to use – we realize that the diversity of file formats and the difficulty of converting one file type to another make their analysis less straightforward than it would be with a single, community-supported format. And we also realize that these files are large and that uploading them to the new EBI repository, or any other repository, will take time and some effort, particularly if important metadata about the experiment are included.

Still, we think the effort is worth it to ensure the field can move forward.  We’d love to hear your views, particularly if you disagree.

Matthias Mann awarded Louis-Jeantet Prize for medicine

The Louis-Jeantet Foundation awarded its prestigious 2012 Louis-Jeantet Prize for medicine to Matthias Mann last Tuesday, Jan 24th, for his contributions to mass spectrometry and the field of proteomics. Matthias Mann, Director of the Department of Proteomics and Signal Transduction at the Max Planck Institute of Biochemistry in Martinsried, and his co-workers have developed several of the key technologies that have made modern proteomics possible, including mass spectrometry-based identification of proteins from electrophoretic gels and the SILAC method that underlies many recent quantitative proteomics studies. The foundation highlighted, in particular, his quantitative analyses of cancer cell proteomes and the promise this work may hold for the future diagnosis and treatment of cancer (e.g. Geiger et al., 2010; Lundberg et al., 2010; Nagaraj et al., 2011).

The 2012 Louis-Jeantet Prize for medicine was also awarded to Fiona Powrie for her work on immunity and host-pathogen interactions within the mammalian gut.  Mann and Powrie will each be awarded CHF 700,000, with the majority of these funds going to help continue these scientists’ research programs.

EMBO Molecular Medicine recently published a related editorial, and contributions from both Mann and Powrie that provide some personal insight into the research paths that led to these important discoveries.


Geiger T, Cox J, Mann M (2010) Proteomic changes resulting from gene copy number variations in cancer cells. PLoS Genet 6: e1001090

Lundberg E, Fagerberg L, Klevebring D, Matic I, Geiger T, Cox J, Algenäs C, Lundeberg J, Mann M, Uhlen M (2010) Defining the transcriptome and proteome in three functionally different human cell lines. Mol Syst Biol 6: 450

Nagaraj N, Wisniewski JR, Geiger T, Cox J, Kircher M, Kelso J, Pääbo S, Mann M (2011) Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol 7: 548

Spotlight on the human proteome

Ambitious plans are underway for an internationally coordinated Human Proteome Project, as discussed in this month’s Editorial.

The proteomics field has certainly had its share of ups and downs. Mass spectrometry, the key technology used for proteome analysis, has long been criticized as an irreproducible technique. In a Commentary this month, six leaders in the field argue that the technology has greatly matured over the past decade and that, when properly applied, it is highly reproducible. The technology is also getting more and more sensitive; mass spectrometry has been used to detect large portions of the proteome of several cell types.

However, the human proteome is enormously complex when one considers all the diverse tissues, fluids and cell types present, and all the possible protein post-translational modifications and alternative splice forms. Whether it is realistic to carry out a Human Proteome Project at all, and what the scope of the project should be, are questions on which the proteomics field does not agree. The anticipated cost and scale of such a project would be on the order of the Human Genome Project, so it is important for the field to come to a consensus on these issues.

What do you think about the proposed Human Proteome Project?

PRIDE Converter – A new tool for proteomics

The administrators of the EBI proteomics repository PRIDE have just announced, in the July issue of Nature Biotechnology, the availability of a software tool to facilitate data deposition.

According to Lennart Martens and co-authors, the new tool, PRIDE Converter, “makes it straightforward to submit proteomics data to PRIDE from most common data formats.” This comes in handy.

Depositing proteomics data into a structured public repository is a very worthwhile effort—one that Nature Biotechnology, Nature Methods, and other Nature journals strongly encourage. To date, however, the problem has been that, in some cases, depositing data remained a considerable effort.

Things could get particularly challenging if you happened to have large datasets in instrument-specific formats. Converting these into the XML-based format needed for data submission required time and informatics skills. Apparently, now, PRIDE Converter does it for you.

Proteomics researchers need effective ways of sharing data. Submission of data to public repositories upon publication should become as automatic in proteomics as it is in some other fields. But realistically, this can only happen if a good infrastructure of databases is in place—it is—and if the submission process is not an undue burden on the researchers.

Kudos to PRIDE, thus, for taking this step in the right direction and for demonstrating a willingness to work with researchers to facilitate submission. It is now up to proteomics researchers to make use of the tool and to work with its authors to continuously improve it as the field, and their needs, evolve.