Reference Mapping Considering Swaps of Adjacent Bases

: Since the time of the HGP, research into next-generation sequencing, which can reduce the cost and time of sequence analysis using computer algorithms, has been actively conducted. Mapping is a next-generation sequencing method that identiﬁes sequences by aligning short reads with a reference genome for which sequence information is known. Mapping can be applied to tasks such as SNP calling, motif searches, and gene identiﬁcation. Research on mapping that utilizes BWT and GPU has been undertaken in order to obtain faster mapping. In this paper, we propose a new mapping algorithm with additional consideration for base swaps. The experimental results demonstrate that when the penalty score for swaps was − 1, − 2, and − 3 in paired-end alignment, for the human whole genome, SOAP3-swap aligned 4667, 2318, and 972 more read pairs, respectively, than SOAP3-dp, and for the drosophila genome, SOAP3-swap aligned 1253, 454, and 129 more read pairs, respectively, than SOAP3-dp. SOAP3-swap has the same functionality as that of SOAP3-dp and also improves the alignment ratio by taking biologically signiﬁcant swaps into account for the ﬁrst time.


Introduction
Since the time of the Human Genome Project (HGP) [1,2], next-generation sequencing, which can reduce the cost and time of sequence analysis using computer algorithms, has been a very active area of research [3,4]. Mapping is a next-generation sequencing method that identifies sequences by aligning short reads with a reference genome for which sequence information is known. It can be applied to various types of analyses of genetic mutations and genetic polymorphisms, such as single-nucleotide polymorphism (SNP) calling, motif searches, and gene identification [5,6]. Most mapping algorithms consist of an indexing step, in which a data structure for a reference or reads is created, and an alignment step, in which fast mapping is performed using the data structures generated.
In order to enhance mapping speed, studies utilizing various data structures have been conducted. Burrows-Wheeler transform (BWT)-based mapping tools have been developed [7][8][9][10]. Among them, Bowtie [7] performs mapping based on the Ferragina and Manzini's algorithm [11] and takes mismatches into account. The Burrows-Wheeler aligner (BWA) [8] first generates a prefix trie for the reference sequence and then performs mapping, taking mismatches and gaps into account with the use of a top-down traversal method. SOAP2 [9] considers mismatches and gaps using a bi-directional BWT search (or 2way-BWT search) [12] and the Smith-Waterman algorithm [13], and it reduces space usage through a sampled suffix array. BWA-MEM [10] improved the performance of BWA using the seed alignments with maximal exact matches (MEM) and the seed extensions with the affine-gap Smith-Waterman algorithm. Meanwhile, tophat2 [14], considering very large deletions, inversions on the same chromosome and and translocations involving different chromosomes were introduced.
As the performance of graphics processing units (GPUs) has improved, several mapping studies that utilize this technology have been conducted [15][16][17][18][19]. SOAP3 [15] was developed based on SOAP2, in which the memory access method of the auxiliary data structure was improved with the consideration of the characteristics of Compute Unified Device Architecture (CUDA). Hard patterns referring to the read that causes some of the GPU processors to idle as a result of too many branches in alignment are processed, and a coalesced memory access strategy is employed. Subsequently, SOAP3-dp [16], which is based on SOAP3, improved the alignment ratio using the Smith-Waterman algorithm. BarraCUDA [17] was developed based on BWA, utilizing the texture memory of CUDA and a depth-first search algorithm. CUSHAW [18] uses 64 threads per thread block and a shared memory to improve performance, and it uses a depth-first search algorithm that considers the number of mismatches and the quality score of the mismatched base positions. CUSHAW2-GPU [19] improved upon the CUSHAW2 [20] algorithm by utilizing CUDA, a collaborative calculation of CPU and GPU, seed-based alignment, and a tile-based Smith-Waterman algorithm.
Early mapping tools performed alignment only with exact sequence matches, but later alignment algorithms allowed a limited number of mismatches to search for SNPs. In recent years, mapping tools that consider not only mismatches but also gaps have been developed to facilitate the analysis of various genetic mutations. Approximate string matching algorithms that compare sequences in the presence of errors are used [5][6][7][8][9]12,[15][16][17][18][19][20]. In approximate string matching algorithms, distance functions are used for measuring errors. Typical distance functions include the Hamming distance, edit distances, and weighted edit distances [21]. Studies have also been conducted on the extended edit distance, which takes into consideration swaps that change the positions of two adjacent bases [22][23][24][25] ( Figure 1). Swaps occur during genetic mutation and replication and are known to be associated with spinal muscular atrophy [26,27]. In this paper, we propose the SOAP3-swap, which is based on SOAP3-dp and performs reference mapping with additional consideration for swaps. SOAP3-swap can produce alignments in the presence of mismatches, small gaps, and swaps.

Previous Work
SOAP3-dp is a mapping tool that is an extension of SOAP3 [15], the first GPU-based mapping tool. SOAP3 and SOAP3-dp use an improved memory access method and 2way-BWT and allocate reads to different groups according to the amount of calculation required when considering the characteristics of CUDA. SOAP3-dp has a similar execution time to that of SOAP3 but performs an additional alignment of the reads not aligned during the initial attempt using dynamic programming. The addition of this step leads to an enhanced alignment ratio compared to that of SOAP3. While SOAP3 performs alignments that only consider mismatches, SOAP3-dp performs alignments that consider mismatches and also small gaps using the Smith-Waterman algorithm [13], a well-known sequence comparison algorithm.
When a paired-end alignment is carried out using SOAP3-dp, the alignment is first performed through a 2way-BWT search considering only mismatches. For some read pairs that do not align during this process, a search is made for the position with the highest alignment score, which satisfies the interval condition between the pair, taking mismatches and gaps into account. At this point, the region of the reference sequence in which the corresponding read pair may exist is called the candidate region. When a read R (|R| = n) likely to be aligned to a candidate region T (|T| = m) is provided, the sub-region of T with the highest alignment score between the T's sub-region and R is searched. Let T[i..j] represent the subsequence from i to j in T. Among the subsequences of T[1..i], the alignment score of the subsequence with the highest alignment score for R[1.
.j] is expressed as M(i, j). I(i, j) refers to the alignment score that inserts R[j] next to T[1..i], and D(i, j) refers to the alignment score that deletes T[i] from T[1..i]. Thus, the highest score at which R is aligned to T is max 1≤i≤m M(i, n). The starting position at which R is aligned to T can be calculated by tracing back the errors used to align from max 1≤i≤m M(i, n). The match score was expressed as S MA , and the penalty scores for mismatches, gap openings, and gap extensions were expressed as S MI , S GO , and S GE , respectively. The initial values of each alignment score are as follows: . Algorithm 1 shows the Smith-Waterman algorithm for SOAP3-dp [16]. Here, δ(a, b) denotes the match or mismatch score between two sequence elements a and b. If for j ← 1 to n do 5: end for 9: end for

Materials and Methods
Consider two sequences, T and R.
, we can determine that a swap of length k took place at T[i] and R[j]. A swap of length k means that the order of two adjacent subsequences with a length of k is changed. Let S SW be the penalty score for swaps. The alignment score of the swap is calculated by adding S SW to the score before the swap occurred. In Algorithm 2, which computes the tables for SOAP3-swap, the alignment score M(i, j) is defined as the largest score among the alignment scores that considers matches, mismatches, gaps, and swaps (line 9). In SOAP3-swap, swaps of length 1 to 3 can be considered, and the time complexity is the same as that of SOAP3-dp, as shown in Algorithm 2.
The Sequence Alignment/Map (SAM) format is a text format for storing read alignments against reference sequences, and it supports short and long reads (up to 128 Mbp) produced by different sequencing platforms [28,29]. The Binary Alignment/Map (BAM) format is a binary representation of the SAM and keeps exactly the same information as the SAM. To achieve fast random access in a zlib-compatible compressed file [30][31][32][33], the BAM can be compressed using the BGZF library, a generic library developed by Handsaker and modified by Li for remote file access and in-memory caching [28]. The SAM/BAM format includes the Compact Idiosyncratic Gapped Alignment Report (CIGAR) string format to describe how a read aligns to a reference. The CIGAR operations in SAM include the following: 'M' for match/mismatch, 'I' for insertion compared with the reference, 'D' for deletion, 'N' for skipped bases on the reference, 'S' for soft clipping where the clipped subsequence is shown in the read sequence in the SAM format, 'H' for hard clipping where the clipped subsequence is not shown in the read sequence in the SAM format, 'P' for padding, '=' for match, and 'X' for mismatch [28,29].
Algorithm 2 Compute I, D, and M considering swaps. 1: Assign tables I, D, and M of size (m + 1) × (n + 1) 2: Initialize(I, D, M) 3: for i ← 1 to m do 4: for j ← 1 to n do 5: Since a mapping tool that considers swaps was not available, no alignment information regarding swaps was available in the CIGAR string. Therefore, the symbols representing the swap were added to the CIGAR string of SOAP3-swap. When outputting the alignment results in SAM format, the reads where a swap occurred are marked as 'T', 'W', and 'B' in the CIGAR string when the lengths of the two exchanged base sequences are 1, 2, and 3, respectively.

Results
The data used in these experiments were the human genome sequence and the drosophila genome sequence. The human genome sequence is GRCh38 (3.3 GB), and 100 bp-paired-end reads (SRR211279, 8.7 GB each) containing 25,467,888 reads in total. The reads were generated by Illumina GAIIx from the Washington University Genome Sequencing Center. The drosophila genome sequence is DhydRS2 (148.5 MB, Drosophila hydei), and 100 bp-paired-end reads (SRR6326389, 12.7 GB each) containing 38,873,031 reads in total. The reads were generated by Illumina HiSeq 2500 from Texas A&M University.
In order to examine the performance and effectiveness of SOAP3-swap, the algorithm was compared to SOAP3-dp, SOAP3, CUSHAW2-GPU, BarraCUDA, BWA, Bowtie2, and CUSHAW2. The experimental conditions were as follows: SOAP3-swap, SOAP3-dp, CUSHAW2-GPU, and BarraCUDA were tested using 32 CPU threads and one GPU device. SOAP3 was tested using six CPU threads, which were maximal and one GPU device. BWA, Bowtie2, and CUSHAW2 were tested using 32 CPU threads. In SOAP3-swap and SOAP3-dp, S MA was set to 1, and S MI , S GO , and S GE were set to −2, −3, and −1, respectively, as in [16]. In SOAP3-swap, S SW was tested at −1, −2, and −3. Users can adjust the penalty scores to determine which operation is more likely to occur. The average running time of 20 experiments was taken to represent the running time of each tool. Tables 1 and 2 show the running times and alignment ratios of the paired-end alignments produced by the various tools for the human genome and for the drosophila genome, respectively. Table 3 shows the number of read pairs including at least one swap according to the swap cost and the swapped block size for the human genome and the drosophila genome.

Discussion
The two main strengths of SOAP3-swap are its speed and its alignment ratio. As shown in Tables 1 and 2, except for SOAP3-dp, SOAP3-swap was significantly faster than the other tools in our experiments. With respect to alignment ratio, SOAP3-swap aligned more read pairs than any of the other reference mapping tools. For example, when S SW was set to −1, −2, and −3, for the human genome, SOAP3-swap aligned 4667 more read pairs, 2318 more read pairs, and 972 more read pairs, respectively, than SOAP3-dp, and for the drosophila genome, SOAP3-swap aligned 1253 more read pairs, 454 more read pairs, and 129 more read pairs, respectively, than SOAP3-dp. Since the alignment results can be the basis for reconstructing genomes and identifying mutations of genomes, the higher alignment ratio can be more helpful to analyze genomes.
When S SW was set to −1, −2, and −3, for the human genome, SOAP3-swap aligned 273, 379, 202, 998, and 145, 117 read pairs, respectively, and for the drosophila genome, SOAP3-swap aligned 318, 343, 213, 178, and 159, 104 read pairs, respectively, as shown in Table 3. Since we performed the paired-end alignment in our experiment, we considered it as a swap occurrence when a swap occurred in at least one read. This result shows that some of the read pairs aligned by SOAP3-dp may actually have been aligned by swaps rather than insertion/deletion. Therefore SOAP3-swap can provide a different alignment result compared to SOAP3-dp because it performs alignment by considering all of the mismatches, gaps, and swaps.

Conclusions
The SOAP3-swap proposed in this paper has the same functionality as that of SOAP3dp and also improves the alignment ratio by taking biologically significant swaps into account for the first time. The reason why we considered the swaps with lengths of up to three is because swaps with lengths longer than three hardly occur in our experiment.
Further research is necessary to develop more efficient alignment methods that can be applied to situations including exchanges involving non-adjacent bases.
Author Contributions: Y.K. and M.K. are the main developers and analyzed SOAP3-dp and implemented SOAP3-swap. J.-H.J. and D.W.K. analyzed SOAP2 and SOAP3 and designed the parallel technique together. S.J.P. provided biological support and designed the experiments. J.S.S. provided algorithmic and biological support and is the manager in charge of the project. All authors read and approved the final manuscript.