Quality control of single amino acid variations detected by tandem mass spectrometry.
The study of single amino acid variations (SAVs) of proteins, resulting from single nucleotide polymorphisms, is of great importance for understanding the relationships between genotype and phenotype. In mass spectrometry-based shotgun proteomics, identification of peptides with SAVs often suffers from high error rates on the detected variant sites. These site errors have multiple causes and can be confirmed by manual inspection or genomic sequencing. Here, we present a software tool, named SAVControl, for site-level quality control of variant peptide identifications. It mainly includes strict false discovery rate control of variant peptide identifications and variant site verification by unrestrictive mass shift relocalization. SAVControl was validated on three colorectal adenocarcinoma cell line datasets with genomic sequencing evidence and tested on a colorectal cancer dataset from The Cancer Genome Atlas. The results show that SAVControl can effectively remove false detections of SAVs.
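The abstract does not spell out how the strict false discovery rate control is computed. As a rough illustration only, the sketch below estimates a target-decoy FDR separately for the variant subgroup, which is one common way of avoiding the optimistic estimate obtained when variant and reference peptides are pooled. The `PSM` record, its fields, and the `separate_fdr_threshold` function are hypothetical names introduced here; this is not claimed to be the method implemented in SAVControl.

```python
from typing import Iterable, NamedTuple, Optional


class PSM(NamedTuple):
    """Hypothetical peptide-spectrum match record (not SAVControl's data model)."""
    score: float      # search-engine score, higher is better (assumption)
    is_decoy: bool    # True if the match is to a decoy sequence
    is_variant: bool  # True if the matched peptide carries an SAV


def separate_fdr_threshold(psms: Iterable[PSM], max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest score cutoff at which the estimated FDR of the
    variant subgroup (decoys / targets) is still <= max_fdr, or None."""
    # Only variant PSMs contribute, so the estimate is specific to that subgroup.
    variant = sorted((p for p in psms if p.is_variant),
                     key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = None
    for p in variant:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Accept the current score as cutoff while the running estimate is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = p.score
    return cutoff


example = [
    PSM(10.0, False, True), PSM(9.0, False, True),
    PSM(8.0, True, True), PSM(7.0, False, True),
    PSM(12.0, False, False),  # reference peptide, ignored by the variant-only estimate
]
print(separate_fdr_threshold(example, max_fdr=0.4))
```

Tied scores are not handled specially in this sketch; a production implementation would treat PSMs with equal scores as one step.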
Protein sequence variations caused by single nucleotide polymorphisms (SNPs) are single amino acid variations (SAVs). The investigation of SAVs may provide a chance for understanding the relationships between genotype and phenotype. Mass spectrometry (MS) based proteomics provides a large-scale way to detect SAVs. However, using the current analysis strategy to detect SAVs may lead to a high rate of false positives. The SAVControl we present here is a computational workflow and software tool for site-level quality control of SAVs detected by MS. It assesses the confidence of detected variant sites by relocating the mass shift responsible for an SAV to search for alternative interpretations. In addition, it uses a strict false discovery rate control method for variant peptide identifications. The advantages of SAVControl were demonstrated on three colorectal adenocarcinoma cell line datasets and a colorectal cancer dataset. We believe that SAVControl will be a powerful tool for computational proteomics and proteogenomics.
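The idea of relocating a mass shift can be illustrated with a small sketch. A substitution such as A to S adds about 15.9949 Da, which is indistinguishable by mass alone from oxidation of a methionine elsewhere in the same peptide. The code below only enumerates such mass-consistent alternatives; it does not score them against a spectrum, which a real verification step would need to do. The function name, the tolerance, and the three-entry modification table are assumptions made for illustration, not part of SAVControl.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Small illustrative subset of modifications: name -> (mass shift, target residues).
MODS = {
    "Oxidation": (15.994915, "MW"),
    "Methyl": (14.015650, "KR"),
    "Deamidated": (0.984016, "NQ"),
}


def alternative_explanations(ref_peptide, site, new_residue, tol=0.005):
    """List other placements of the SAV mass shift on the reference peptide.

    site is a 0-based index into ref_peptide. Returns (delta, alternatives),
    where each alternative is (position, description).
    """
    delta = MONO[new_residue] - MONO[ref_peptide[site]]
    alternatives = []
    for j, aa in enumerate(ref_peptide):
        # Could a known modification at position j explain the same shift?
        for name, (mass, residues) in MODS.items():
            if aa in residues and abs(mass - delta) <= tol:
                alternatives.append((j, f"{name}({aa})"))
        # Could a different substitution at position j explain the same shift?
        for other, mass in MONO.items():
            if other == aa or (j, other) == (site, new_residue):
                continue
            if abs((mass - MONO[aa]) - delta) <= tol:
                alternatives.append((j, f"{aa}->{other}"))
    return delta, alternatives


delta, alts = alternative_explanations("AMFK", site=0, new_residue="S")
print(round(delta, 4), alts)
```

A variant site with no mass-consistent alternative is easier to trust than one, like the A to S example here, that competes with a common modification on a neighbouring residue.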
Yi X, Wang B, An Z, Gong F, Li J, Fu Y
《-》
In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.
In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most analytical methods are based on the identification of reliable peptides and not on the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available to the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated with a highly customizable KNIME workflow on four different public datasets with varying complexities (different sample preparation, species, and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only in the number of reported protein groups but also in the composition of the groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but it does increase the number of peptides per protein and thus can generally be recommended.
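To make the protein inference step concrete, here is a minimal sketch of one widely used idea, parsimony: report a small set of proteins that together explain all identified peptides, approximated with a greedy set cover. This is a generic textbook heuristic shown under stated assumptions; it is not the algorithm of PIA, ProteinProphet, Fido, ProteinLP, or MSBayesPro, and the function name and toy data are hypothetical.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the most
    still-unexplained peptides. Ties are broken alphabetically by protein ID.

    protein_to_peptides maps a protein ID to the set of identified peptides
    that could originate from it.
    """
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        # sorted() makes the tie-break deterministic; max() keeps the first maximum.
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected


toy = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepC", "pepD"},
    "P3": {"pepD"},
    "P4": {"pepA"},
}
print(greedy_parsimony(toy))
```

In this toy case `P2` and `P3` explain the remaining peptide equally well, which is exactly the ambiguity that leads inference tools to report protein groups rather than single proteins.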
Protein inference is one of the major challenges in MS-based proteomics today. A vast number of protein inference algorithms and implementations are currently available to the proteomics community. Protein assembly affects the final results of a study, the quantitation values, and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. The Journal of Proteomics has previously published multiple benchmark studies of other bioinformatics algorithms in proteomics (PMID: 26585461; PMID: 22728601), making clear the importance of such studies for the proteomics community and the journal's audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Five different algorithms (ProteinProphet, MSBayesPro, ProteinLP, Fido, and PIA) were evaluated using the highly customizable workflow on four public datasets with varying complexities. Three popular database search engines (Mascot, X!Tandem, and MS-GF+) and combinations thereof were evaluated for every protein inference tool. In total, more than 186 protein lists were analyzed and carefully compared using three metrics for quality assessment of the protein inference results: 1) the number of reported proteins, 2) the number of peptides per protein, and 3) the number of uniquely reported proteins per inference method. We also examined how many proteins were reported for each combination of search engines, protein inference algorithms, and parameters on each dataset.
The results show that 1) PIA or Fido seems to be a good choice within the analyzed workflow, regarding not only the reported proteins and the high-quality identifications, but also the required runtime. 2) Merging the identifications of multiple search engines almost always gives more confident results and increases the number of peptides per protein group. 3) Using databases that contain known isoforms in addition to the canonical protein sequences has a small impact on the number of reported proteins. Depending on the question behind the study, the detection of specific isoforms could compensate for the slightly shorter protein lists obtained with parsimonious reporting. 4) The current workflow can be easily extended to support new algorithms and search engine combinations.
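The three comparison metrics named in the significance statement (number of reported proteins, peptides per protein, and proteins reported by only one inference method) are simple to compute once each tool's report is available. The sketch below assumes a hypothetical input layout, a dictionary from method name to a mapping of protein ID to peptide set; the actual KNIME workflow may organize and compute these values differently.

```python
def inference_metrics(reports):
    """Compute three per-method metrics from protein inference reports.

    reports: {method: {protein_id: set_of_peptides}} (hypothetical layout).
    Returns {method: {"n_proteins", "peptides_per_protein", "unique_proteins"}}.
    """
    metrics = {}
    for method, proteins in reports.items():
        # Proteins reported by any of the other methods.
        others = set().union(*(set(p) for m, p in reports.items() if m != method))
        n = len(proteins)
        total_peptides = sum(len(peps) for peps in proteins.values())
        metrics[method] = {
            "n_proteins": n,
            "peptides_per_protein": total_peptides / n if n else 0.0,
            "unique_proteins": sorted(set(proteins) - others),
        }
    return metrics


toy_reports = {
    "methodA": {"P1": {"a", "b"}, "P2": {"c"}},
    "methodB": {"P1": {"a", "b", "d"}, "P3": {"e"}},
}
for name, values in inference_metrics(toy_reports).items():
    print(name, values)
```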
Audain E, Uszkoreit J, Sachsenberg T, Pfeuffer J, Liang X, Hermjakob H, Sanchez A, Eisenacher M, Reinert K, Tabb DL, Kohlbacher O, Perez-Riverol Y
《-》
Exome-based proteogenomics of HEK-293 human cell line: Coding genomic variants identified at the level of shotgun proteome.
Genomic and proteomic data were integrated into a proteogenomic workflow to identify coding genomic variants of the Human Embryonic Kidney 293 (HEK-293) cell line at the proteome level. Shotgun proteome data for HEK-293 published by Geiger et al. (2012) and Chick et al. (2015), together with data obtained in this work, were searched against a customized genomic database generated from exome data published by Lin et al. (2014). Overall, 112 unique variants were identified at the proteome level out of ∼1200 coding variants annotated in the exome. Seven identified variants were shared among all three proteomic datasets, and 27 variants were found in any two datasets. Some of the variants found belonged to widely known genomic polymorphisms originating from the germline, while the others more likely resulted from somatic mutations. At least eight of the proteins bearing amino acid variants were annotated as cancer-related, including the p53 tumor suppressor. In all the shotgun datasets considered, variant peptides were less likely to be identified than wild-type ones, at a ratio of 1:2.5 relative to the corresponding theoretical peptides. This can be explained by the presence of so-called "passenger" mutations in genes that were never expressed in HEK-293 cells. All MS data have been deposited in the ProteomeXchange with the dataset identifier PXD002613 (http://proteomecentral.proteomexchange.org/dataset/PXD002613).
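A customized search database of the kind described above has to contain the peptides that only exist when a variant is present. The sketch below shows one simple way to derive them: apply a single amino acid substitution to a protein sequence, digest both versions in silico with a basic trypsin rule (cleave after K or R unless followed by P, no missed cleavages), and keep the peptides that are new. The function names and the toy sequence are hypothetical, and the authors' actual database construction from exome data is not described in the abstract.

```python
import re


def tryptic_peptides(sequence):
    """In silico trypsin digest: cleave after K or R, but not before P.
    No missed cleavages are considered in this simplified sketch."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def variant_peptides(sequence, position, ref, alt):
    """Peptides that appear only after substituting ref -> alt.

    position is 1-based, as in conventional variant notation (e.g. A4S).
    """
    if sequence[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}, "
                         f"found {sequence[position - 1]}")
    mutated = sequence[:position - 1] + alt + sequence[position:]
    reference_set = set(tryptic_peptides(sequence))
    return [pep for pep in tryptic_peptides(mutated) if pep not in reference_set]


toy_protein = "MKTAYIAKQRPLGK"  # made-up sequence for illustration
print(tryptic_peptides(toy_protein))
print(variant_peptides(toy_protein, 4, "A", "S"))  # substitution inside one peptide
print(variant_peptides(toy_protein, 5, "Y", "K"))  # substitution creating a cleavage site
```

The second example shows why variant handling is more than a string replacement: a substitution to K or R changes the digestion pattern and yields two new peptides instead of one.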
Lobas AA, Karpov DS, Kopylov AT, Solovyeva EM, Ivanov MV, Ilina IY, Lazarev VN, Kuznetsova KG, Ilgisonis EV, Zgoda VG, Gorshkov MV, Moshkovskii SA
《-》