Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data.

Source: PubMed

Authors:

Hall MB, Wick RR, Judd LM, Nguyen AN, Steinig EJ, Xie O, Davies M, Seemann T, Stinear TP, Coin L

Abstract:

Variant calling is fundamental in bacterial genomics, underpinning the identification of disease transmission clusters, the construction of phylogenetic trees, and antimicrobial resistance detection. This study presents a comprehensive benchmarking of variant calling accuracy in bacterial genomes using Oxford Nanopore Technologies (ONT) sequencing data. We evaluated three ONT basecalling models and both simplex (single-strand) and duplex (dual-strand) read types across 14 diverse bacterial species. Our findings reveal that deep learning-based variant callers, particularly Clair3 and DeepVariant, significantly outperform traditional methods and even exceed the accuracy of Illumina sequencing, especially when applied to ONT's super-high accuracy model. ONT's superior performance is attributed to its ability to overcome Illumina's errors, which often arise from difficulties in aligning reads in repetitive and variant-dense genomic regions. Moreover, the use of high-performing variant callers with ONT's super-high accuracy data mitigates ONT's traditional errors in homopolymers. We also investigated the impact of read depth on variant calling, demonstrating that 10× depth of ONT super-accuracy data can achieve precision and recall comparable to, or better than, full-depth Illumina sequencing. These results underscore the potential of ONT sequencing, combined with advanced variant calling algorithms, to replace traditional short-read sequencing methods in bacterial genomics, particularly in resource-limited settings.

Keywords:

C. jejuni, E. coli, K. pneumoniae, L. monocytogenes, M. tuberculosis, S. aureus, S. enterica, S. pyogenes, bacteria, benchmark, bioinformatics, computational biology, deep learning, infectious disease, microbiology, nanopore, systems biology, variant calling

DOI:

10.7554/eLife.98300

Similar literature (153)

A Comparison of Structural Variant Calling from Short-Read and Nanopore-Based Whole-Genome Sequencing Using Optical Genome Mapping as a Benchmark.

The identification of structural variants (SVs) in genomic data represents an ongoing challenge because of difficulties in reliable SV calling leading to reduced sensitivity and specificity. We prepared high-quality DNA from 9 parent-child trios, who had previously undergone short-read whole-genome sequencing (Illumina platform) as part of the Genomics England 100,000 Genomes Project. We reanalysed the genomes using both Bionano optical genome mapping (OGM; 8 probands and one trio) and Nanopore long-read sequencing (Oxford Nanopore Technologies [ONT] platform; all samples). To establish a "truth" dataset, we asked whether rare proband SV calls (n = 234) made by the Bionano Access (version 1.6.1)/Solve software (version 3.6.1_11162020) could be verified by individual visualisation using the Integrative Genomics Viewer with either or both of the Illumina and ONT raw sequence. Of these, 222 calls were verified, indicating that Bionano OGM calls have high precision (positive predictive value 95%). We then asked what proportion of the 222 true Bionano SVs had been identified by SV callers in the other two datasets. In the Illumina dataset, sensitivity varied according to variant type, being high for deletions (115/134; 86%) but poor for insertions (13/58; 22%). In the ONT dataset, sensitivity was generally poor using the original Sniffles variant caller (48% overall) but improved substantially with use of Sniffles2 (36/40; 90% and 17/23; 74% for deletions and insertions, respectively). In summary, we show that the precision of OGM is very high. In addition, when applying the Sniffles2 caller, the sensitivity of SV calling using ONT long-read sequence data outperforms Illumina sequencing for most SV types.

Pei Y, Tanguy M, Giess A, Dixit A, Wilson LC, Gibbons RJ, Twigg SRF, Elgar G, Wilkie AOM ...

Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction.

Complete, accurate, cost-effective, and high-throughput reconstruction of bacterial genomes for large-scale genomic epidemiological studies is currently only possible with hybrid assembly, combining long- (typically using nanopore sequencing) and short-read (Illumina) datasets. Being able to use nanopore-only data would be a significant advance. Oxford Nanopore Technologies (ONT) have recently released a new flowcell (R10.4) and chemistry (Kit12), which reportedly generate per-read accuracies rivalling those of Illumina data. To evaluate this, we sequenced DNA extracts from four commonly studied bacterial pathogens, namely , , and , using Illumina and ONT's R9.4.1/Kit10, R10.3/Kit12, R10.4/Kit12 flowcells/chemistries. We compared raw read accuracy and assembly accuracy for each modality, considering the impact of different nanopore basecalling models, commonly used assemblers, sequencing depth, and the use of duplex versus simplex reads. 'Super accuracy' (sup) basecalled R10.4 reads - in particular duplex reads - have high per-read accuracies and could be used to robustly reconstruct bacterial genomes without the use of Illumina data. However, the per-run yield of duplex reads generated in our hands with standard sequencing protocols was low (typically <10 %), with substantial implications for cost and throughput if relying on nanopore data only to enable bacterial genome reconstruction. In addition, recovery of small plasmids with the best-performing long-read assembler (Flye) was inconsistent. R10.4/Kit12 combined with sup basecalling holds promise as a singular sequencing technology in the reconstruction of commonly studied bacterial genomes, but hybrid assembly (Illumina+R9.4.1 hac) currently remains the highest throughput, most robust, and cost-effective approach to fully reconstruct these bacterial genomes.

Sanderson ND, Kapel N, Rodger G, Webster H, Lipworth S, Street TL, Peto T, Crook D, Stoesser N ... - Microbial Genomics

Published: 2023

Nanopore-only assemblies for genomic surveillance of the global priority drug-resistant pathogen, Klebsiella pneumoniae.

Oxford Nanopore Technologies (ONT) sequencing has rich potential for genomic epidemiology and public health investigations of bacterial pathogens, particularly in low-resource settings and at the point of care, due to its portability and affordability. However, low base-call accuracy has limited the reliability of ONT data for critical tasks such as antimicrobial resistance (AMR) and virulence gene detection and typing, serotype prediction, and cluster identification. Thus, Illumina sequencing remains the standard for genomic surveillance despite higher capital and running costs. We tested the accuracy of ONT-only assemblies for common applied bacterial genomics tasks (genotyping and cluster detection, implemented via Kleborate, Kaptive and Pathogenwatch), using data from 54 unique isolates. ONT reads generated via MinION with R9.4.1 flowcells were basecalled using three alternative models [Fast, High-accuracy (HAC) and Super-accuracy (SUP), available within ONT's Guppy software], assembled with Flye and polished using Medaka. Accuracy of typing using ONT-only assemblies was compared with that of Illumina-only and hybrid ONT+Illumina assemblies, constructed from the same isolates as reference standards. The most resource-intensive ONT-assembly approach (SUP basecalling, with or without Medaka polishing) performed best, yielding reliable capsule (K) type calls for all strains (100 % exact or best matching locus), reliable multi-locus sequence type (MLST) assignment (98.3 % exact match or single-locus variants), and good detection of acquired AMR genes and mutations (88-100 % correct identification across the various drug classes). Distance-based trees generated from SUP+Medaka assemblies accurately reflected overall genetic relationships between isolates. The definition of outbreak clusters from ONT-only assemblies was problematic due to inflation of SNP counts by high base-call errors. 
However, ONT data could be reliably used to 'rule out' isolates of distinct lineages from suspected transmission clusters. HAC basecalling + Medaka polishing performed similarly to SUP basecalling without polishing. Therefore, we recommend investing compute resources into basecalling (SUP model), wherever compute resources and time allow, and note that polishing is also worthwhile for improved performance. Overall, our results show that MLST, K type and AMR determinants can be reliably identified with ONT-only R9.4.1 flowcell data. However, cluster detection remains challenging with this technology.

Foster-Nyarko E, Cottingham H, Wick RR, Judd LM, Lam MMC, Wyres KL, Stanton TD, Tsang KK, David S, Aanensen DM, Brisse S, Holt KE ...

Published: 2023

Closing the gap: Oxford Nanopore Technologies R10 sequencing allows comparable results to Illumina sequencing for SNP-based outbreak investigation of bacterial pathogens.

Whole-genome sequencing has become the method of choice for bacterial outbreak investigation, with most clinical and public health laboratories currently routinely using short-read Illumina sequencing. Recently, long-read Oxford Nanopore Technologies (ONT) sequencing has gained prominence and may offer advantages over short-read sequencing, particularly with the recent introduction of the R10 chemistry, which promises much lower error rates than the R9 chemistry. However, limited information is available on its performance for bacterial single-nucleotide polymorphism (SNP)-based outbreak investigation. We present an open-source workflow, Prokaryotic Awesome variant Calling Utility (PACU) (https://github.com/BioinformaticsPlatformWIV-ISP/PACU), for constructing SNP phylogenies using Illumina and/or ONT R9/R10 sequencing data. The workflow was evaluated using outbreak data sets of Shiga toxin-producing and by comparing ONT R9 and R10 with Illumina data. The performance of each sequencing technology was evaluated not only separately but also by integrating samples sequenced by different technologies/chemistries into the same phylogenomic analysis. Additionally, the minimum sequencing time required to obtain accurate phylogenetic results using nanopore sequencing was evaluated. PACU allowed accurate identification of outbreak clusters for both species using all technologies/chemistries, but ONT R9 results deviated slightly more from the Illumina results. ONT R10 results showed trends very similar to Illumina, and we found that integrating data sets sequenced by either Illumina or ONT R10 for different isolates into the same analysis produced stable and highly accurate phylogenomic results. The resulting phylogenies for these two outbreaks stabilized after ~20 hours of sequencing for ONT R9 and ~8 hours for ONT R10. 
This study provides a proof of concept for using ONT R10, either in isolation or in combination with Illumina, for rapid and accurate bacterial SNP-based outbreak investigation.

Bogaerts B, Van den Bossche A, Verhaegen B, Delbrassinne L, Mattheus W, Nouws S, Godfroid M, Hoffman S, Roosens NHC, De Keersmaecker SCJ, Vanneste K ...

Source journal

eLife

Impact factor: 8.704

JCR quartile: not available

Chinese Academy of Sciences (CAS) journal ranking: not available