Evaluating de Bruijn graph assemblers on 454 transcriptomic data.
Next-generation sequencing (NGS) technologies have greatly changed the landscape of transcriptomic studies of non-model organisms. Because no reference genome is available for these organisms, de novo assembly methods play a key role in the analysis of such data sets. Because of the huge amount of data generated by each NGS run, many assemblers, e.g., ABySS, Velvet and Trinity, have been built on the de Bruijn graph for its time and space efficiency. However, most of these assemblers were initially developed for the Illumina/Solexa platform, and their performance on 454 transcriptomic data is unknown. In this study, we evaluated and compared the relative performance of these de Bruijn graph-based assemblers on both simulated and real 454 transcriptomic data. The results suggest that Trinity, the Illumina/Solexa-specialized transcriptomic assembler, performs best among the de Bruijn graph assemblers and is comparable to, or even outperforms, the standard 454 assembler Newbler, which is based on the overlap-layout-consensus algorithm. Our evaluation is expected to provide helpful guidance for researchers choosing assemblers for 454 transcriptomic data.
Ren X, Liu T, Dong J, Sun L, Yang J, Zhu Y, Jin Q, ...
《PLoS One》
Clover: a clustering-oriented de novo assembler for Illumina sequences.
Next-generation sequencing technologies have revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate de novo assembly and affect the quality of downstream genomic research.
In this paper, we develop a de Bruijn graph assembler, called Clover (clustering-oriented de novo assembler), that uses a novel k-mer clustering approach derived from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We further evaluate Clover's performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with all of these assemblers except SOAPdenovo. In addition, Clover was used in the sequencing projects of the bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT.
The novel clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach with the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.
Hsieh MF, Lu CL, Tang CY
《BMC BIOINFORMATICS》
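As a rough illustration of the de Bruijn graph idea that the assemblers in the entry above build on, the following minimal sketch splits reads into k-mers and records which (k-1)-mer follows which. It is not Clover's clustering method or any listed assembler's implementation; the function name `build_de_bruijn` and the toy reads are assumptions made for this example, and real assemblers add error correction, reverse complements and graph simplification.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, for illustration only).

    Nodes are (k-1)-mers. Each k-mer found in a read adds one edge from its
    (k-1)-length prefix to its (k-1)-length suffix. Edge multiplicities are
    kept because low-count edges often correspond to sequencing errors.
    """
    graph = defaultdict(Counter)
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


if __name__ == "__main__":
    toy_reads = ["ACGT", "CGTA"]  # made-up reads, not real data
    for node, successors in build_de_bruijn(toy_reads, 3).items():
        print(node, dict(successors))
```

Because overlaps are implied by shared k-mers rather than computed read against read, construction is linear in the total read length, which is the efficiency argument the abstracts refer to.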
Benchmarking and Assessment of Eight De Novo Genome Assemblers on Viral Next-Generation Sequencing Data, Including the SARS-CoV-2.
Viral genomics has become crucial in clinical diagnostics and ecology, not to mention in stemming the COVID-19 pandemic. Whole-genome sequencing (WGS) is pivotal in gaining an improved understanding of viral evolution, genomic epidemiology, infectious outbreaks, pathobiology, clinical management, and vaccine development. Genome assembly is one of the crucial steps in WGS data analysis. A series of different assemblers has been developed with the advent of high-throughput next-generation sequencing (NGS). Various studies have evaluated these assembly tools on distinct datasets; however, those evaluations lack data of viral origin. In this study, we performed a comparative evaluation and benchmarking of eight assemblers: SOAPdenovo, Velvet, assembly by short sequences (ABySS), iterative graph assembler (IDBA), SPAdes, Edena, iterative virus assembler, and VICUNA on viral NGS data from distinct Illumina (GAIIx, Hiseq, Miseq, and Nextseq) platforms. WGS data of diverse viruses, that is, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), dengue virus 3, human immunodeficiency virus 1, hepatitis B virus, human herpesvirus 8, human papillomavirus 16, rhinovirus A, and West Nile virus, were used to assess these assemblers. Performance metrics such as genome fraction recovery, assembly lengths, NG50, N50, contig length, contig numbers, mismatches, and misassemblies were analyzed. Overall, three assemblers, that is, SPAdes, IDBA, and ABySS, performed consistently well, including for genome assembly of SARS-CoV-2. These assembly methods should be considered and recommended for future studies of viruses. The study also suggests that implementing two or more assembly approaches should be considered in viral NGS studies, especially in clinical settings. Taken together, the benchmarking of eight genome assemblers reported in this study can inform future public health and ecology research concerning viruses, the COVID-19 pandemic, and viral outbreaks.
Gupta AK, Kumar M
《-》
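Two of the contiguity metrics named in the entry above, N50 and NG50, can be sketched as follows under their usual definitions: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length, and NG50 uses half of the reference genome size instead. The function names and the contig lengths below are assumptions made for this example and are not taken from the study.

```python
def _length_at_threshold(contig_lengths, threshold):
    """Return the contig length at which the running total, taken from the
    longest contig downwards, first reaches `threshold`; None if it never does."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


def n50(contig_lengths):
    """N50: threshold is half of the total assembly length."""
    return _length_at_threshold(contig_lengths, sum(contig_lengths) / 2)


def ng50(contig_lengths, genome_size):
    """NG50: threshold is half of the (known or estimated) genome size."""
    return _length_at_threshold(contig_lengths, genome_size / 2)


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]  # made-up contig lengths
    print("N50:", n50(lengths))
    print("NG50:", ng50(lengths, genome_size=400))
```

NG50 is the more comparable figure across assemblers because its threshold does not depend on how much sequence each assembler happened to output; it is undefined (None here) when the contigs cover less than half of the genome.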
Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.
Improved DNA sequencing methods have transformed the field of genomics over the last decade. This has become possible due to the development of inexpensive short-read sequencing technologies, which have now resulted in three generations of sequencing platforms. More recently, a fourth generation of Nanopore-based single-molecule sequencing technology was developed around the MinION® sequencer, which is portable, inexpensive and fast. It is capable of generating reads longer than 100 kb. Although it has many specific advantages, the two major limitations of MinION reads are high error rates and the need to develop downstream pipelines. Algorithms for error correction have already emerged, while the development of pipelines is still at a nascent stage.
In this study, we benchmarked available assembly algorithms to find an appropriate framework that can efficiently assemble Nanopore-sequenced reads. To this end, we used genome-scale Nanopore-sequenced datasets available for the E. coli and yeast genomes. To comprehensively evaluate multiple algorithmic frameworks, we included assemblers based on de Bruijn graphs (Velvet and ABySS), Overlap Layout Consensus (OLC) (Celera) and greedy extension (SSAKE) approaches. We analyzed the quality and accuracy of the assemblies as well as the computational performance of each assembler in our benchmark. Our analysis showed that the OLC-based algorithm Celera could generate a high-quality assembly with ten times higher N50 and mean contig values, as well as one-fifth the total number of contigs, compared to the other tools. Celera was also found to exhibit an average genome coverage of 12 % on the E. coli dataset and 70 % on the yeast dataset, as well as relatively shorter run times. In contrast, the de Bruijn graph-based assemblers Velvet and ABySS generated assemblies of moderate quality in less time when there was no limitation on memory allocation, while the greedy extension-based algorithm SSAKE generated an assembly of very poor quality but with a genome coverage of 90 % on the yeast dataset.
OLC can be considered a favorable algorithmic framework for the development of assembly tools for Nanopore-based data, followed by de Bruijn graph-based algorithms, whose run times are shorter than or similar to those of OLC-based algorithms irrespective of the memory allocated for the task. However, a few improvements must be made to the existing de Bruijn implementations in order to generate assemblies of reasonable quality. Our findings should help stimulate the development of novel assemblers for handling Nanopore sequence data.
Cherukuri Y, Janga SC
《BMC GENOMICS》
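The "overlap" step of the Overlap Layout Consensus approach discussed in the entry above can be illustrated with a minimal exact suffix-prefix overlap function. This is only a sketch: the name `suffix_prefix_overlap`, the `min_len` parameter and the toy reads are assumptions made for this example, and it is not how Celera is implemented. Exact matching would also not be sufficient for error-prone Nanopore reads, where OLC assemblers rely on error-tolerant alignment instead.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Return the length of the longest suffix of `a` that exactly matches a
    prefix of `b`, or 0 if no such overlap of at least `min_len` exists."""
    # Try the longest possible overlap first and shrink until a match is found.
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


if __name__ == "__main__":
    # Made-up reads: the last three bases of the first match the start of the second.
    print(suffix_prefix_overlap("ACGTTG", "TTGCA"))
```

In an OLC assembler, overlaps like this are computed between pairs of reads to build the overlap graph, which is then laid out into contigs and polished into a consensus sequence.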
Comparative analysis of de novo transcriptome assembly.
The rapid development of next-generation sequencing technology presents a major computational challenge for data processing and analysis. A fast approach, the de Bruijn graph, has been successfully used for de novo assembly of genomic DNA; nevertheless, its performance for transcriptome assembly is unclear. In this study, we used both simulated and real RNA-Seq data, from either artificial RNA templates or human transcripts, to evaluate five de novo assemblers: ABySS, Mira, Trinity, Velvet and Oases. Of these assemblers, ABySS, Trinity, Velvet and Oases are all based on the de Bruijn graph, and Mira uses an overlap graph algorithm. Various numbers of short RNA reads were selected from the External RNA Control Consortium (ERCC) data and human chromosome 22. A number of statistics were then calculated for the resulting contigs from each assembler. Each experiment was repeated multiple times to obtain the mean statistics and standard error estimates. Trinity performed relatively well on both ERCC and human data, but it may not consistently generate full-length transcripts. ABySS was the fastest method, but its assembly quality was low. Mira achieved a good rate of mapping its contigs onto human chromosome 22, but its computational speed was not satisfactory. Our results suggest that transcript assembly remains a challenging problem for the bioinformatics community. Therefore, a novel assembler is needed for assembling transcriptome data generated by next-generation sequencing techniques.
Clarke K, Yang Y, Marsh R, Xie L, Zhang KK, ...
《-》