In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.

Source: PubMed

Authors:

Audain E, Uszkoreit J, Sachsenberg T, Pfeuffer J, Liang X, Hermjakob H, Sanchez A, Eisenacher M, Reinert K, Tabb DL, Kohlbacher O, Perez-Riverol Y

Abstract:

In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most analytical methods are based on the identification of reliable peptides and not the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available for the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated with a highly customizable KNIME workflow on four different public datasets with varying complexities (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the actual composition of groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended.

Protein inference is one of the major challenges in MS-based proteomics nowadays. Currently, there are a vast number of protein inference algorithms and implementations available for the proteomics community. Protein assembly impacts the final results of the research, the quantitation values and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. Previously, the Journal of Proteomics has published multiple benchmark studies of other bioinformatics algorithms in proteomics (PMID: 26585461; PMID: 22728601), making clear the importance of such studies for the proteomics community and the journal audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Five different algorithms (ProteinProphet, MSBayesPro, ProteinLP, Fido and PIA) were evaluated using the highly customizable workflow on four public datasets with varying complexities. Three popular database search engines (Mascot, X!Tandem, MS-GF+) and combinations thereof were evaluated for every protein inference tool. In total, >186 protein lists were analyzed and carefully compared using three metrics for quality assessment of the protein inference results: 1) the number of reported proteins, 2) the number of peptides per protein, and 3) the number of uniquely reported proteins per inference method. We also examined how many proteins were reported for each combination of search engines, protein inference algorithms and parameters on each dataset. The results show that 1) PIA or Fido seems to be a good choice when studying the results of the analyzed workflow, regarding not only the reported proteins and the high-quality identifications, but also the required runtime. 2) Merging the identifications of multiple search engines almost always gives more confident results and increases the number of peptides per protein group. 3) The usage of databases containing not only the canonical sequences but also known isoforms of proteins has a small impact on the number of reported proteins. Depending on the question behind the study, the detection of specific isoforms could compensate for the slightly shorter parsimonious reports. 4) The current workflow can be easily extended to support new algorithms and search engine combinations.
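The three metrics named in the abstract (number of reported proteins, peptides per protein, and proteins reported by only one inference method) can be illustrated with a minimal Python sketch. This is not the authors' KNIME workflow: the function name `summarize_inference`, the input layout (method name mapped to protein accession mapped to a set of peptide sequences), and the toy data are all assumptions made for illustration, and the sketch works on single accessions rather than the protein groups the paper reports.

```python
from collections import Counter


def summarize_inference(results):
    """Compute three simple comparison metrics per inference method.

    `results` is a hypothetical structure assumed for this sketch:
    {method_name: {protein_accession: set_of_peptide_sequences}}.
    """
    # Count how many methods report each protein accession.
    seen_in = Counter(acc for proteins in results.values() for acc in proteins)

    summary = {}
    for method, proteins in results.items():
        n_proteins = len(proteins)
        # Mean number of distinct peptides supporting each reported protein.
        mean_peptides = (
            sum(len(peps) for peps in proteins.values()) / n_proteins
            if n_proteins
            else 0.0
        )
        # Proteins that no other method reported.
        unique = sorted(acc for acc in proteins if seen_in[acc] == 1)
        summary[method] = {
            "n_proteins": n_proteins,
            "peptides_per_protein": mean_peptides,
            "unique_proteins": unique,
        }
    return summary


# Toy input with made-up method names, accessions, and peptides.
toy = {
    "method_A": {"P1": {"AAK", "CCR"}, "P2": {"DDK"}},
    "method_B": {"P1": {"AAK", "CCR", "EEK"}, "P3": {"FFR"}},
}
summary = summarize_inference(toy)
for method, stats in summary.items():
    print(method, stats)
```

In a real comparison the input would come from each tool's exported protein list, and ambiguity groups would need to be matched between tools before counting uniquely reported entries.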

DOI:

10.1016/j.jprot.2016.08.002

Cited by:

20

Similar literature:

Protein Inference Using PIA Workflows and PSI Standard File Formats.

Uszkoreit J, Perez-Riverol Y, Eggers B, Marcus K, Eisenacher M ...

Cited by: 27

Algorithms for database-dependent search of MS/MS data.

Matthiesen R

Cited by: 4. Published: 2013

Using the entrapment sequence method as a standard to evaluate key steps of proteomics data analysis process.

Feng XD, Li LW, Zhang JH, Zhu YP, Chang C, Shu KX, Ma J ... - BMC Genomics

Cited by: 10

Comparative database search engine analysis on massive tandem mass spectra of pork-based food products for halal proteomics.

Mass spectrometry-based proteomics relies on dedicated software for peptide and protein identification. This software includes open-source and commercial search engines, which employ different algorithms for scoring and protein identification. Although previous comparative studies have differentiated the proteomics results from different software, no studies have yet been conducted specifically to compare and evaluate search engines in the field of halal analysis. This is important because halal analysis often uses commercial meat samples that have been subjected to various processing, further complicating the analysis. Thus, this study aimed to assess three open-source search engines (Comet, X! Tandem, and ProteinProspector) and a commercial search engine (ProteinPilot™) against 135 raw tandem mass spectrometry data files from 15 types of pork-based food products for halal analysis. Each database search engine yielded a high false-discovery rate (FDR); however, a post-search algorithm called PeptideProphet managed to reduce the FDR, except for ProteinProspector and ProteinPilot™. In this study, the combined database search (executed by iProphet) revealed a thorough protein list for pork-based food products, in which the most abundant proteins are myofibrillar proteins. Thus, this proteomics study will aid the identification of potential peptide and protein biomarkers for future precision halal analysis.

SIGNIFICANCE: A critical challenge of halal proteomics is the availability of a database to confirm the inferred peptides as well as proteins. Currently, established databases such as UniProtKB relate to animal proteomes, whereas halal proteomics deals with highly processed meat-based food products. This study highlights the use of different database search engines (Comet, X! Tandem, ProteinProspector, and ProteinPilot™) and their respective algorithms to analyse 135 raw tandem mass spectrometry data files from 15 types of pork-based food products. This is the first attempt to compare different database search engines in the context of halal proteomics to ensure effective control of the FDR. Previous studies focused only on the advantages of one algorithm over another. Moreover, other previous studies have mainly reported the use of mass spectrometry-based shotgun proteomics for meat authentication (the field most similar to halal analysis), but none addressed halal aspects using samples originating from highly processed food products. Hence, a systematic comparative study is needed for a more comprehensive and thorough proteomics analysis of such samples. In this study, our combinatorial approach to the halal proteomics results from the different search engines used (Comet, X! Tandem, and ProteinProspector) successfully generated a comprehensive spectral library for pork-based meat products. This combined spectral library is freely available at https://data.mendeley.com/datasets/6dmm8659rm/3. Thus far, this is the first attempt at establishing a spectral library for halal proteomics. We also believe this is a pioneering study for halal proteomics, aimed at non-conventional and non-model organism proteomics, protein analytics, protein bioinformatics, and potential biomarker discovery.

Amir SH, Yuswan MH, Aizat WM, Mansor MK, Desa MNM, Yusof YA, Song LK, Mustafa S ...

Cited by: 3
