Guest post by Ruedi Aebersold, Professor of Systems Biology with a joint appointment at ETH Zurich and the University of Zurich, & George Rosenberger, PhD student in the Aebersold group at the Institute of Molecular Systems Biology, ETH Zurich.
Mass spectrometry-based proteomics is a data-intensive research discipline that primarily aims at identifying and quantifying the proteins that constitute the proteome1. This is achieved by generating large numbers (10⁴ to 10⁶) of fragment ion spectra that represent peptides generated by proteolysis of the respective proteome. Mass spectrometers can operate in different data acquisition modes, referred to as data-dependent acquisition (DDA), targeted acquisition exemplified by selected reaction monitoring (SRM), or data-independent acquisition (DIA)2 exemplified by SWATH-MS3,4. Specific software tools then generate processed mass spectra from these raw data, from which sets of identified peptides and proteins and their abundances are inferred and annotated with metadata. Both the generation and the processing of such raw data sets are resource- and time-intensive. Further, if unique, irreplaceable samples are being analyzed, as is often the case with clinical cohorts, the data cannot be regenerated. Therefore, the proteomics community has started to embrace data sharing by means of different specialized public repositories, for example GPMDB5, PRIDE6, PeptideAtlas7 or ProteomicsDB8. For the last few years, the ProteomeXchange9 consortium has provided centralized deposition of raw data and their meta-annotation.
Publicly accessible and well-annotated proteomic datasets serve at least three important objectives. First, their integration and consistent processing define the present state of mass spectrometry-based proteome discovery, i.e. a catalogue or index of entire proteomes. Collaborative efforts like the chromosome-centric human proteome project (C-HPP)10 or the recent draft maps of the human proteome8,11 are examples of such efforts supported by the voluntary data deposition of the proteomics community. Second, the availability of large and diverse datasets is invaluable for the further development of software and statistical tools and the assessment of their performance. This increases the reproducibility and transparency of reported results. Third, highly validated libraries of fragment ion spectra and the peptide sequences they represent have proven highly beneficial for the analysis of data sets generated by DDA by spectral matching12. Spectral libraries are also an essential requirement for targeted and DIA strategies. They are used as a priori information providing sets of validated assays for the detection and quantification of specific proteins.
In 2014, we published our data descriptor “A repository of assays to quantify 10,000 human proteins by SWATH-MS”13. This project grew out of the observation that DIA methods exemplified by SWATH-MS3 benefit from comprehensive spectral or assay libraries for the data analysis14. Because the generation of high-quality spectral libraries is experimentally and computationally complex15, we built a comprehensive human assay library optimized for SWATH-MS from DDA measurements on a wide range of samples. Early on we realized that this resource might also be useful for others and could enable more reproducible comparison of human proteomic data sets acquired in SWATH-MS mode from different samples and in different studies and research groups, analogous to the standardization efforts for protein sequence databases16.
The launch of Scientific Data with its focus on data sharing and annotation encouraged us to prepare and submit our data with an accompanying manuscript. The journal provides a flexible format to describe the samples and methods in a visible section instead of hiding the details in a supplementary materials section. This flexibility enabled us, as authors, to guide the reader’s attention to useful information on how to best make use of the data. For our Data Descriptor, this critical component was the method for statistical correction when large-scale assay libraries are used for targeted data extraction, an issue that requires close attention to avoid the inflation of false positive protein identifications in (large-scale) proteomic experiments17.
The second added value of the Data Descriptor format is the machine-readable ISA-Tab format for metadata tracking. Supported by the Scientific Data editorial team, we generated detailed annotations for all our samples and datasets. In combination with our deposited data on ProteomeXchange and SWATHAtlas, we hope that this annotation will promote reuse of the data for future applications.
Since the original publication, our combined assay library has gained wide adoption for a range of different applications. To date, most users tailor the assay libraries for targeted data extraction to their specific sample type, e.g. by substituting or complementing the public assay libraries with their own data. Others, exemplified by Wang et al.,18 used the intermediate spectral library to demonstrate the application of their new algorithm to human samples and to compare its performance with existing approaches.
We have always considered our combined human spectral assay library a work in progress that will ideally be implemented on top of community-wide large-scale efforts like C-HPP and PeptideAtlas. We believe that our initial efforts have significantly contributed to establishing and promoting the tools and methods to make best use of large-scale and unified assay libraries.
We strongly encourage the proteomics community to continue sharing their data in the established public repositories and to participate in the discussions to improve the annotation of metadata and reusability. Doing so will have a significant effect on community resources and on the transparency and quality of reported findings, benefits that will ultimately pay dividends for the whole field of mass spectrometry-based proteomics.
References
- Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
- Chapman, J. D., Goodlett, D. R. & Masselon, C. D. Multiplexed and data‐independent tandem mass spectrometry for global proteome profiling. Mass Spectrometry Reviews 33, 452–470 (2014).
- Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics 11, O111.016717 (2012).
- Röst, H. L. et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat. Biotechnol. 32, 219–223 (2014).
- Craig, R., Cortens, J. P. & Beavis, R. C. Open Source System for Analyzing, Validating, and Storing Protein Identification Data. J. Proteome Res. 3, 1234–1242 (2004).
- Martens, L. et al. PRIDE: The proteomics identifications database. PROTEOMICS 5, 3537–3545 (2005).
- Desiere, F. et al. The PeptideAtlas project. Nucleic Acids Res. 34, D655–8 (2006).
- Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
- Vizcaíno, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).
- Deutsch, E. W. et al. State of the Human Proteome in 2014/2015 As Viewed through PeptideAtlas: Enhancing Accuracy and Coverage through the AtlasProphet. J. Proteome Res. (2015). doi:10.1021/acs.jproteome.5b00500
- Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
- Lam, H. et al. Development and validation of a spectral library searching method for peptide identification from MS/MS. PROTEOMICS 7, 655–667 (2007).
- Rosenberger, G. et al. A repository of assays to quantify 10,000 human proteins by SWATH-MS. Scientific Data 1, 140031 (2014). doi:10.1038/sdata.2014.31
- Gillet, L. C., Leitner, A. & Aebersold, R. Mass Spectrometry Applied to Bottom-Up Proteomics: Entering the High-Throughput Era for Hypothesis Testing. Annual Review of Analytical Chemistry 9, annurev–anchem–071015–041535 (2015).
- Schubert, O. T. et al. Building high-quality assay libraries for targeted analysis of SWATH MS data. Nat. Protoc. 10, 426–441 (2015).
- Lane, L. et al. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 40, D76–83 (2012).
- Nesvizhskii, A. I., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 4, 787–797 (2007).
- Wang, J. et al. MSPLIT-DIA: sensitive peptide identification for data-independent acquisition. Nat. Methods (2015). doi:10.1038/nmeth.3655