A peptide-retrieval strategy enables significant improvement of quantitative performance without compromising confidence of identification.
Reliable quantification of low-abundance proteins in complex proteomes is challenging largely owing to the limited number of spectra/peptides identified. In this study we developed a straightforward method to improve the quantitative accuracy and precision of proteins by strategically retrieving the less confident peptides that were previously filtered out using the standard target-decoy search strategy. The filtered-out MS/MS spectra matched to confidently-identified proteins were recovered, and the peptide-spectrum-match FDR were re-calculated and controlled at a confident level of FDR≤1%, while protein FDR maintained at ~1%. We evaluated the performance of this strategy in both spectral count- and ion current-based methods. >60% increase of total quantified spectra/peptides was respectively achieved for analyzing a spike-in sample set and a public dataset from CPTAC. Incorporating the peptide retrieval strategy significantly improved the quantitative accuracy and precision, especially for low-abundance proteins (e.g. one-hit proteins). Moreover, the capacity of confidently discovering significantly-altered proteins was also enhanced substantially, as demonstrated with two spike-in datasets. In summary, improved quantitative performance was achieved by this peptide recovery strategy without compromising confidence of protein identification, which can be readily implemented in a broad range of quantitative proteomics techniques including label-free or labeling approaches.
We hypothesize that more quantifiable spectra and peptides in a protein, even including less confident peptides, could help reduce variations and improve protein quantification. Hence the peptide retrieval strategy was developed and evaluated in two spike-in sample sets with different LC-MS/MS variations using both MS1- and MS2-based quantitative approach. The list of confidently identified proteins using the standard target-decoy search strategy was fixed and more spectra/peptides with less confidence matched to confident proteins were retrieved. However, the total peptide-spectrum-match false discovery rate (PSM FDR) after retrieval analysis was still controlled at a confident level of FDR≤1%. As expected, the penalty for occasionally incorporating incorrect peptide identifications is negligible by comparison with the improvements in quantitative performance. More quantifiable peptides, lower missing value rate, better quantitative accuracy and precision were significantly achieved for the same protein identifications by this simple strategy. This strategy is theoretically applicable for any quantitative approaches in proteomics and thereby provides more quantitative information, especially on low-abundance proteins.
Tu C
,Shen S
,Sheng Q
,Shyr Y
,Qu J
... -
《-》
In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.
In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most of the analytical methods are based on the identification of reliable peptides and not the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available for the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated using a highly customizable KNIME workflow using four different public datasets with varying complexities (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the actual composition of groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependant on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended.
Protein inference is one of the major challenges in MS-based proteomics nowadays. Currently, there are a vast number of protein inference algorithms and implementations available for the proteomics community. Protein assembly impacts in the final results of the research, the quantitation values and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. Previously Journal of proteomics has published multiple studies about other benchmark of bioinformatics algorithms (PMID: 26585461; PMID: 22728601) in proteomics studies making clear the importance of those studies for the proteomics community and the journal audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Six different algorithms - ProteinProphet, MSBayesPro, ProteinLP, Fido and PIA- were evaluated using the highly customizable workflow on four public datasets with varying complexities. Five popular database search engines Mascot, X!Tandem, MS-GF+ and combinations thereof were evaluated for every protein inference tool. In total >186 proteins lists were analyzed and carefully compare using three metrics for quality assessments of the protein inference results: 1) the numbers of reported proteins, 2) peptides per protein, and the 3) number of uniquely reported proteins per inference method, to address the quality of each inference method. We also examined how many proteins were reported by choosing each combination of search engines, protein inference algorithms and parameters on each dataset. The results show that using 1) PIA or Fido seems to be a good choice when studying the results of the analyzed workflow, regarding not only the reported proteins and the high-quality identifications, but also the required runtime. 2) Merging the identifications of multiple search engines gives almost always more confident results and increases the number of peptides per protein group. 3) The usage of databases containing not only the canonical, but also known isoforms of proteins has a small impact on the number of reported proteins. The detection of specific isoforms could, concerning the question behind the study, compensate for slightly shorter reports using the parsimonious reports. 4) The current workflow can be easily extended to support new algorithms and search engine combinations.
Audain E
,Uszkoreit J
,Sachsenberg T
,Pfeuffer J
,Liang X
,Hermjakob H
,Sanchez A
,Eisenacher M
,Reinert K
,Tabb DL
,Kohlbacher O
,Perez-Riverol Y
... -
《-》