Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Source: PubMed

Authors:

Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F

Abstract:

Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html (contact: fsun@usc.edu). Supplementary data are available at Bioinformatics online.
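The idea of pooling word counts across many short reads, rather than counting along one long sequence, can be illustrated with a minimal Python sketch. It is not the NGS.MC package and does not reproduce the paper's normal or gamma approximations or its order-estimation statistics. It only shows plain maximum-likelihood estimation of the transition probabilities of a Markov chain of a given order from pooled counts. The function names `count_words` and `estimate_transitions` and the toy reads are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_words(reads, k):
    """Count every k-letter word, pooling the counts over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip words that contain ambiguous bases such as N.
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def estimate_transitions(reads, order):
    """Estimate P(next base | previous `order` bases) from pooled counts.

    Uses the (order + 1)-word counts; contexts that never occur are omitted.
    """
    word_counts = count_words(reads, order + 1)
    probs = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(word_counts[context + base] for base in ALPHABET)
        if total > 0:
            probs[context] = {
                base: word_counts[context + base] / total for base in ALPHABET
            }
    return probs


if __name__ == "__main__":
    # Hypothetical toy reads; real NGS data would be parsed from FASTQ files.
    toy_reads = ["ACGTAC", "GTACGT", "TTACGA"]
    for context, row in sorted(estimate_transitions(toy_reads, order=1).items()):
        print(context, row)
```

Words are never counted across read boundaries, which is the main practical difference from the single-long-sequence setting: each read of length L contributes only L - k + 1 words of length k.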

DOI:

10.1093/bioinformatics/btv395

Citations:

21

Year:

2016

Similar articles (123)

References (32)

Citing articles (21)

Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F ...

Citations: 21; Published: 2016

Normal and compound Poisson approximations for pattern occurrences in NGS reads.

Zhai Z, Reinert G, Song K, Waterman MS, Luan Y, Sun F ...

Citations: 3; Published: 2012

New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F ...

Citations: -; Published: 2014

Optimal choice of word length when comparing two Markov sequences using a χ²-statistic.

Bai X, Tang K, Ren J, Waterman M, Sun F ... - BMC Genomics

Citations: -; Published: 2017

Hidden Markov Models in Bioinformatics: SNV Inference from Next Generation Sequence.

Bian J, Zhou X

Citations: -; Published: 2017
