Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index.
Alignment of reads to a reference genome sequence is one of the key steps in the analysis of human whole-genome sequencing data obtained through next-generation sequencing (NGS) technologies. The quality of the subsequent steps of the analysis, such as the clinical interpretation of genetic variants or a genome-wide association study, depends on each read being placed at its correct position during alignment. The amount of human NGS whole-genome sequencing data is constantly growing. A number of human genome sequencing projects worldwide have resulted in large-scale databases of genetic variants from sequenced human genomes. Such information about known genetic variants can be used to improve alignment quality when analysing sequencing data obtained for a new individual, for example, by creating a genomic graph. While existing methods for aligning reads to a linear reference genome have high alignment speed, methods for aligning reads to a genomic graph have greater accuracy in variable regions of the genome. A read alignment method that takes known genetic variants into account in the index of the linear reference sequence combines the advantages of both sets of methods.
In this paper, we present the minimap2_index_modifier tool, which enables the construction of a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. The use of the modified minimap2 index improves variant calling quality without modifying the bioinformatics pipeline and without significant additional computational overhead. Using the PrecisionFDA Truth Challenge V2 benchmark data (for HG002 short-read data aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14) it was demonstrated that the number of false negative genetic variants decreased by more than 9500, and the number of false positives decreased by more than 7000 when modifying the index with genetic variants from the Human Pangenome Reference Consortium.
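The idea of adding known variants to a minimizer index can be illustrated with a minimal sketch. It is not the implementation of minimap2_index_modifier: the function names `minimizers` and `extra_minimizers` are hypothetical, k-mers are ordered lexicographically (minimap2 itself orders hashed, strand-canonical k-mers), and a toy k = 3, w = 2 stands in for the k = 27, w = 14 used in the benchmark.

```python
def minimizers(seq, k, w):
    """Return the set of (k-mer, position) minimizers of seq.

    A minimizer is the smallest k-mer in each window of w consecutive
    k-mers. Assumption: plain lexicographic order, leftmost wins ties.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda j: (kmers[j], j))
        picked.add((kmers[best], best))
    return picked


def extra_minimizers(ref, pos, alt, k, w):
    """Minimizers that exist only when the SNV ref[pos] -> alt is applied.

    These are the entries a variant-aware index would store in addition
    to the reference minimizers. A SNV does not shift coordinates, so
    positions remain reference positions.
    """
    alt_seq = ref[:pos] + alt + ref[pos + 1:]
    return minimizers(alt_seq, k, w) - minimizers(ref, k, w)


if __name__ == "__main__":
    ref = "GATTACA"
    print(sorted(minimizers(ref, 3, 2)))
    print(sorted(extra_minimizers(ref, 3, "G", 3, 2)))
```

A read carrying the alternate allele can then seed on the extra entries, while reads matching the reference still use the original minimizers.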
Guguchkin E, Kasianov A, Belenikin M, Zobkova G, Kosova E, Makeev V, Karpulevich E, ...
《BMC BIOINFORMATICS》
Calling known variants and identifying new variants while rapidly aligning sequence data.
Whole-genome sequencing studies can identify causative mutations for subsequent use in genomic evaluations. Speed and accuracy of sequence alignment can be improved by accounting for known variant locations during alignment instead of calling the variants after alignment as in previous programs. The new programs Findmap and Findvar were compared with alignment using Burrows-Wheeler alignment (BWA) or SNAP and variant identification using Genome Analysis ToolKit (GATK) or SAMtools. Findmap stores the reference map and any known variant locations while aligning reads and counting reference and alternate alleles for each DNA source. Findmap also outputs potential new single nucleotide variant, insertion, and deletion alleles. Findvar separates likely true variants from read errors and outputs genotype probabilities. Strategies were tested using cattle, human, and a completely random reference map and simulated or actual data. Most tests simulated 10 bulls, each with 10× simulated sequence reads containing 39 million variants from the 1000 Bull Genomes Project. With 10 processors, clock times for processing 100× data were 105 h for BWA, 25 h for GATK, and 11 h for SAMtools but only about 4 h for SNAP, 3 h for Findmap, and 1 h for Findvar. Alignment programs required about the same total memory; BWA used 46 GB (4.6 GB/processor), whereas >10 processors can share the same memory in SNAP and Findmap, which used 40 and 46 GB, respectively. Findmap correctly mapped 92.9% of reads (compared with 92.6% from SNAP and 90.5% from BWA) and had high accuracy of calling alleles for known variants. For new variants, Findvar found 99.8% of single nucleotide variants, 79% of insertions, and 67% of deletions; GATK found 99.4, 95, and 90%, respectively; and SAMtools found 99.8, 12, and 16%, respectively. 
False positives (as percentages of true variants) were 11% of single nucleotide variants, 0.4% of insertions, and 0.3% of deletions from Findvar; 12, 8.4, and 2.9%, respectively, from GATK; and 37, 1.3, and 0.4%, respectively, from SAMtools. Advantages of Findmap and Findvar are fast processing, precise alignment, more useful data summaries, more compact output, and fewer steps. Calling known variants during alignment allows more efficient and accurate sequence-based genotyping.
VanRaden PM, Bickhart DM, O'Connell JR
《-》
Fast and SNP-aware short read alignment with SALT.
DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of the transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variation in the population. This limitation can introduce bias and reduce the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome and a large number of genetic variants. However, compared to linear reference aligners, an aligner that stores and indexes all genetic variants has a high memory (RAM) cost and an extremely long run time. Aligning reads to a graph-model-based index that includes all types of variants is ultimately an NP-hard problem in theory. By contrast, considering only single nucleotide polymorphism (SNP) information reduces the complexity of the index and improves the speed of sequence alignment.
The SNP-aware alignment tool (SALT) is a fast, memory-efficient, and SNP-aware short read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) and incorporates 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has a similar speed but higher accuracy.
Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome that incorporates an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and can reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT .
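What "SNP-aware" means at the level of a single read placement can be sketched as follows. This is an illustration under stated assumptions, not SALT's algorithm or data structure: the function `snp_aware_mismatches` and the dictionary layout of `known_snps` (reference position mapped to a set of alternate bases) are hypothetical.

```python
def snp_aware_mismatches(read, ref, offset, known_snps):
    """Count mismatches of read placed at ref[offset:].

    A read base equal to a known alternate allele at that reference
    position is treated as a match rather than a mismatch.
    known_snps: dict mapping 0-based reference position -> set of alt bases
    (hypothetical layout chosen for this sketch).
    """
    if offset < 0 or offset + len(read) > len(ref):
        raise ValueError("read does not fit on the reference at this offset")
    mismatches = 0
    for i, base in enumerate(read):
        pos = offset + i
        allowed = {ref[pos]} | known_snps.get(pos, set())
        if base not in allowed:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    ref = "ACGTACGT"
    snps = {2: {"T"}}  # known SNP: G>T at position 2
    print(snp_aware_mismatches("ACTT", ref, 0, snps))  # alt allele accepted
    print(snp_aware_mismatches("ACTT", ref, 0, {}))    # plain linear reference
```

With the SNP table the alternate-allele read scores as a perfect match, which is the kind of reference bias reduction the abstract describes.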
Quan W, Liu B, Wang Y
《BMC BIOINFORMATICS》
Fast and memory efficient approach for mapping NGS reads to a reference genome.
Next-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Aligning these reads to a reference genome is a core step in next-generation sequencing data analysis, for example in genetic variation analysis and genome re-sequencing. There is therefore a need for a new approach that is efficient in both memory and time when aligning this enormous number of reads to the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require huge amounts of memory for whole-reference-genome indexing and read alignment. Gapped-alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, we propose WIT, an efficient approach to reference genome indexing and read alignment based on the Burrows-Wheeler Transform (BWT) and the Wavelet Tree (WT). It supports both exact and approximate alignment. Experimental work shows that WIT performs best for protein sequence indexing. For indexing, WIT requires 0.6 N of space (where N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25 N and 5 N. Experimentally, even with such a small index, the alignment time of WIT is comparable to that of BWA, Subread, Kart, and Minimap2. Other alignment parameters, accuracy and confidentiality, are also experimentally shown to be better than those of Minimap2. The source code of WIT is available at http://www.algorithm-skg.com/wit/home.html .
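The BWT-based exact matching that such indexes rely on can be sketched in a few lines. This is a generic textbook illustration, not WIT's code: the function names are hypothetical, the suffix array is built by naive sorting, and the occurrence counts are computed by scanning the BWT string, which is the step a wavelet tree would answer compactly in a real index.

```python
def build_bwt_index(text):
    """Build a naive suffix array, the BWT, and the first-column offsets."""
    text += "$"  # sentinel, assumed absent from text and smaller than any base
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the sentinel
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total  # number of characters in text smaller than ch
        total += counts[ch]
    return sa, bwt, first


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i] (naive scan; a wavelet tree does this compactly)."""
    return bwt[:i].count(ch)


def backward_search(pattern, sa, bwt, first):
    """Return sorted start positions of exact matches of pattern in the text."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ(bwt, ch, lo)
        hi = first[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, bwt, first = build_bwt_index("GATTACATTA")
    print(backward_search("TTA", sa, bwt, first))
```

Each pattern character narrows a suffix-array interval, so matching cost depends on the pattern length and the cost of the occurrence query rather than on the genome length.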
Kumar S, Agarwal S, Ranvijay
《-》
Fast read alignment with incorporation of known genomic variants.
Many genetic variants have been reported from sequencing projects due to decreasing experimental costs. Compared to the current typical paradigm, read mapping incorporating existing variants can improve the performance of subsequent analysis. This method is supposed to map sequencing reads efficiently to a graphical index with a reference genome and known variation to increase alignment quality and variant calling accuracy. However, storing and indexing various types of variation require costly RAM space.
Aligning reads to a graph model-based index including the whole set of variants is ultimately an NP-hard problem in theory. Here, we propose a variation-aware read alignment algorithm (VARA), which generates the alignment between read and multiple genomic sequences simultaneously utilizing the schema of the Landau-Vishkin algorithm. VARA dynamically extracts regional variants to construct a pseudo tree-based structure on-the-fly for seed extension without loading the whole genome variation into memory space.
We developed the novel high-throughput sequencing read aligner deBGA-VARA by integrating VARA into deBGA. The deBGA-VARA is benchmarked both on simulated reads and the NA12878 sequencing dataset. The experimental results demonstrate that read alignment incorporating genetic variation knowledge can achieve high sensitivity and accuracy.
Due to its efficiency, VARA provides a promising solution for further improvement of variant calling while maintaining small memory footprints. The deBGA-VARA is available at: https://github.com/hitbc/deBGA-VARA.
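The Landau-Vishkin scheme mentioned in the abstract above can be sketched for two plain sequences. This is not VARA and does not handle variants or multiple genomic sequences: it only shows the diagonal-wise "furthest reaching" recurrence with a bound on the number of edits. The function name `bounded_edit_distance` is hypothetical, and matching runs are extended by direct character comparison rather than by the constant-time longest-common-extension queries of the original algorithm.

```python
def bounded_edit_distance(a, b, max_edits):
    """Edit distance between a and b if it is <= max_edits, else None.

    For each edit count e and diagonal d = j - i, keep the furthest row i
    of a that can be reached on that diagonal with at most e edits.
    """
    n, m = len(a), len(b)
    target = m - n  # diagonal containing the end cell (n, m)
    if abs(target) > max_edits:
        return None

    def slide(i, d):
        # Extend along diagonal d for free while characters match.
        while i < n and i + d < m and a[i] == b[i + d]:
            i += 1
        return i

    prev = {0: slide(0, 0)}
    if target == 0 and prev[0] == n:
        return 0
    for e in range(1, max_edits + 1):
        cur = {}
        for d in range(-e, e + 1):
            best = -1
            if d in prev:            # substitution: stay on the diagonal
                best = max(best, prev[d] + 1)
            if d + 1 in prev:        # skip a character of a
                best = max(best, prev[d + 1] + 1)
            if d - 1 in prev:        # skip a character of b
                best = max(best, prev[d - 1])
            best = min(best, n, m - d)  # stay inside both sequences
            if best < max(0, -d):
                continue
            cur[d] = slide(best, d)
        if cur.get(target) == n:
            return e
        prev = cur
    return None


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 3))
    print(bounded_edit_distance("kitten", "sitting", 2))
```

Only 2e + 1 diagonals are visited per edit count, which is why the scheme is attractive for seed extension when the allowed number of differences is small.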
Guo H, Liu B, Guan D, Fu Y, Wang Y, ...
《BMC Medical Informatics and Decision Making》