A Faster Quick Search Algorithm

: We present the FQS (faster quick search) algorithm, an improved variation of the quick search algorithm. The quick search (QS) exact pattern matching algorithm and its variants are among the fastest practical matching algorithms today. The FQS algorithm computes a statistically expected shift value, which allows maximal shifts and a smaller number of comparisons between the pattern and the text. Compared to the state-of-the-art QS variants of exact pattern matching algorithms, the proposed FQS algorithm is the fastest when | Σ | ≤ 128 , where | Σ | is the alphabet size. FQS also has a competitive running time when | Σ | > 128 . Running on three practical text ﬁles, E. coli ( | Σ | = 4 ), Bible ( | Σ | = 63 ) and World192 ( | Σ | = 94 ), FQS resulted in the best performance in practice. Our FQS algorithm will have important applications in the domain of genomic database searching, involving DNA or RNA sequence databases with four symbols Σ = { A, C, G, T ( /U ) } or protein databases with | Σ | = 20 .


Introduction
Given a text T = T [0, ..., n − 1] of length n and a pattern P = P [0, ..., m − 1] of length m over an alphabet, Σ, the exact string matching problem is to find all occurrences of pattern P in the text, T .In general, n ≥ m.The exact string matching problem is an important and well-studied subject [1,2].
Three popular exact matching algorithms with linear time complexity are the Knuth-Morris-Pratt (KMP) algorithm [3], the Karp-Rabin (KR) algorithm [4] and the Boyer-Moore (BM) algorithm [5].Like KMP, the BM algorithm matches the pattern and the text by skipping characters that are not likely to result in exact matching with the pattern.Unlike the other methods, it compares the strings from right to left of the pattern.These algorithms need an O(|Σ| + m) time for preprocessing the pattern, and search in O(n) or sometimes even sublinearly in practice.The total time will be O(|Σ| + n + m).A different approach to pattern matching based on bitwise operations was introduced by R. Baeza-Yates and G. Gonnet [6].Here, the pattern is represented by a binary mask.Bitwise SHIFT and AND operations that are considered as constant time are used to find the patterns.Under this framework, SHIFT and AND correspond to the pattern movement and matching, respectively.The algorithm is effective for small patterns, when the pattern length is less than a computer word (typically 32 or 64 characters), which is usual for the text searching problem.Bitwise parallelism is the basis of some recent improved algorithms for exact pattern matching.See [7,8] for examples.
Since Boyer and Moore published their famous BM algorithm in 1977 [5], many variants of the BM algorithm have been proposed [2,9].Among these variants, Sunday's quick search [10] is widely used because of its simplicity and efficiency.
Algorithms based on character comparisons can be classified into these three categories by the way they scan the text [9]: forward orientation, backward orientation and no specific direction.Forward orientation is comparing the text to the pattern from left to right.The KMP algorithm is in this category.See, also, Apostolico et al. [11] and Crochemore et al. [12].Under backward orientation, we compare the text to the pattern from right to left; the BM is in this category.For the third category, some algorithms used both forward and backward comparisons at the same time, for example, quick search by Sunday, and its variants, the Franek-Jennings-Smyth (FJS) algorithm [13] and the Horspool algorithm [14].See, also, the book by Charras and Lecroq for other similar algorithms [1].The other strategy is to determine the preprocessing shift array according to the probability of symbol occurrences in the pattern [1,9].
The QS algorithm and its variants remain among the fastest practical exact pattern matching algorithms to date [9].In this paper, we introduce faster quick search (FQS), an improved algorithm based on the QS exact pattern matching algorithm.The FQS algorithm computes a statistically expected shift length, which allows for maximal shifts and a smaller number of comparisons between the pattern and the text.FQS also utilizes the QS algorithm's bad-character shift table (array) in preprocessing the pattern.Compared to the state-of-the-art algorithms of the QS variety, FQS is the fastest algorithm when |Σ| ≤ 128 and has a competitive running time when |Σ| > 128.Our FQS algorithm will have important applications in the domain of genome database searching, where the DNA (RNA) sequence databases consists of four symbols {A, C, G, T (/U )} and for protein databases with |Σ| = 20.
In this work, we have focused on the QS variants of the Boyer-Moore string matching algorithm.More general discussions on exact string matching can be found in the textbooks, [2,12,15,16].Two recent reviews related to the topic are [9,17].
This paper is organized as follows.First, we introduce the BM algorithm, the QS algorithm and its variants in Section 2. Next, we present the proposed FQS algorithm in Section 3. In Section 4, we present experimental results, including a comparison with three variants of the QS algorithm.Section 5 concludes the paper.

Boyer-Moore Algorithm and Its Variants
The Boyer-Moore (BM) algorithm is an efficient string searching algorithm introduced by Boyer and Moore in 1977 [5].The BM algorithm has been the standard benchmark algorithm in the exact string matching literature since it was introduced [5].The BM algorithm preprocesses the pattern, P , and utilizes the information gathered during the preprocessing step to skip blocks of text (rather than character by character comparisons) during matching, resulting in a faster running time than many other string algorithms.In general, the BM algorithm runs faster as the pattern length increases.
First, the BM preprocesses pattern P to construct a bad character shift array (abbreviated as bad_shif t) of length |Σ|, which is determined using Equation (1).Then, the BM uses the bad character rule.The bad character rule stipulates that once a mismatch occurs, the algorithm jumps to the next position, which is determined by the bad_shif t array without performing brute-force comparisons.
The BM also uses the good suffix rule.The BM starts the comparison between text T and pattern P from right to left.When a mismatch occurs in P [i] = T [j + i] with 0 < i < m and 0 < j < n, the suffix of pattern P [i+1, ..., m−1] matches text T [i+j +1, ..., j +m−1]; the suffix of pattern P [i+1, ..., m−1] is called the good suffix.The algorithm calculates a good-shift array of length m + 1 that determines the next jumping position using the maximum possible shift distance from the structure of the pattern.The overall shift value is then determined by choosing the longer distance between both the bad-shift and good-shift arrays.The classic quick search algorithm and our improved variant do not use the good suffix rule; hence, the corresponding good shift array equation is not presented here.Interested readers, please refer to the original paper by Boyer and Moore [5].The original BM algorithm has a worst-case running time of O(mn) and a best-case time in O( n m ).It has very good performance in general, and there are simple modifications to achieve an overall worst-case time in O(n + m + |Σ|) time [18,19].

Quick Search Algorithm
The quick search(QS) algorithm introduced by Sunday [10] is a simplification of the Boyer-Moore algorithm without the good suffix rule.QS preprocesses pattern P using a modified bad_shift array (called qbad_shif t) of length |Σ| in a time complexity of Θ(m + |Σ|).The modified quick search bad shift array is defined as follows: The preprocessing steps of the quick search algorithm are shown in Algorithm 1.In Algorithm 1, array qsBc is the quick search bad character shift array, which is initialized to value m from Line 1 to Line 3. Lines 4-6 implement Equation (2).For example, in the case of pattern P = "GCAGTCAG" with m = 8 and Σ = {A, C, G, T }.Each element in bad_shift array qsBc[A, C, G, T ] is initialized to eight.After executing the f orloop from Line 4 to Line 6, we have the bad_shift array qsBc[A, C, G, T ] = [2,3,1,4].
Algorithm 2 shows the quick search algorithm.First, it calls the preprocessing procedure, preQS, to compute the bad shift array.Lines 3-9 use a whileloop to compare the text, T , and the pattern, P .Line 4 compares P [0, ..., m − 1] and T [j, ..., j + m − 1], where 0 ≤ j ≤ n − m.When a mismatch occurs, the QS algorithm shifts to a new position as determined by the bad character in T , that is, using the corresponding shift value for the symbol, T [j + m].

Algorithm 2
The quick search algorithm.QS(P, m, T, n, |Σ|) 1 shift ← preQS(P,m) 2 j ← 0 3 while (j ≤ n − m) 4 Compare P [0, ..., m − 1] and T [j, ..., j + m − 1] 5 if all matched then do 6 output j 7 end if The searching phase of the QS has a worst case time complexity of O(mn).In the case of each time, a shift distance is maintained as on,e and the bad character is found in the last comparison of P [0] to the corresponding text (QS starts the comparison from right to left).For example, if T = A n and P = BA m−1 , in this case, the shift distance qsBc[A] = 1.That is, when each bad character occurs, the shift distance is one.Additionally, the bad character is found at the last comparison of P [0] to the corresponding text place, because the QS comparison is from right to left.However, this extreme worst case is rare.Just like the BM, the QS has a very good practical performance in general [10].

Variants of the QS Algorithm
The QS algorithm was motivated by another simplification of the BM algorithm proposed earlier by Horspool in 1980 [14].It has a better performance than the BM in the case of smaller alphabet sizes.The average number of comparisons for one character is between 1  |Σ| and 2 |Σ|+1 [14].It has the same preprocessing time of Θ(m+|Σ|) and the worst-case searching time of O(mn), as with the QS algorithm.
Another QS variant is the FJS algorithm, introduced by Franek, Jennings and Smyth [13] in 2007.FJS is a hybrid exact string matching algorithm that uses both the QS (i.e., Boyer-Moore) and Knuth-Moris-Pratt(KMP) algorithms.It has a Θ(n + m) preprocessing time, similar to the QS.FJS uses the KMP algorithm to ensure that, in the worst-case, its searching phase is O(n), which is better in theory than the O(mn) of BM, QS and other QS variants.As shown in [13], when the pattern length is small (less than 10 characters), FJS's performance is slightly better than the other algorithms.
Another variation of the QS algorithm was proposed by Sheik et al. [20] and Thathoo et al. [21], by combing the QS algorithm with an initial pre-testing stage, as earlier proposed by Raita [22].That is, after pre-computing the shift tables based on the QS algorithm, at the search phase, they introduce a pre-testing step, before full pattern matching can commence.Within a pattern matching window on the text, the last and first symbols in the pattern are first compared with their respective counterparts in the window on the text.If both tests succeed, pattern matching on the remaining symbols will then proceed as usual from right to left, using the QS algorithm.The idea is to establish some level of similarity between the pattern and the text window, before pattern matching will continue.A similar idea was used by Thathoo et al. [21], where they improved the basic approach and required a smaller number of comparisons and larger shifts on average.Experimental results in the recent comprehensive survey by Faro and Lecroq [9] showed that, indeed, the method of Thathoo et al. [21] was slightly better than the approach of Sheik et al. [20], in general.However, the FJS algorithm produced an overall better result when compared with the two methods.Thus, in our comparative analysis, we focused on FJS, HOR and QS.

The FQS Algorithm
Faster quick search (FQS) is an improved version of the quick search algorithm.QS calculates a shift table (array) using Equation (2).In addition to the same shift table in the QS algorithm, FQS calculates two more elements: one is the maximal expected shift position (called pos); the other is a new shift table for the prefix P [0, ..., pos − 1] using the QS algorithm.The expected shift (ES) is the sum of shifts when a mismatch occurs in the pattern current position.In our algorithm, the shift is calculated by the bad character rule, which shifts to right when matching the symbol of the text.In the uniform distribution of symbols, the maximal expected shift position is the left most position of the pattern that has the maximal expected shift value in all positions of the pattern.When the mismatching occurs in this position, it will have the largest shift value in the average case.Equation (3) calculates the expected shift distance for each position in pattern P .The maximal expected position, pos, is calculated in pattern P by using Equation (4).Finally, the algorithm identifies a maximal location, pos, which has the maximal expected shift position.
Before we introduce Equation (3), we first need to consider the array, prepos j (c).Given the current position, j, in pattern P and a symbol, c ∈ Σ, prepos j (c) records the most recent occurrence position of symbol c.For example, given a pattern P = "GCAG", let us examine the prepos j (c) array.First, the size of array prepos j (c) is the same as the alphabet size.Array prepos j (c) is calculated by scanning pattern P from left to right.The initial value of prepos j (c) is set to "−1".After scanning j = 0, prepos 0 (G) is changed to zero, and all of the other symbols of prepos j (c) are still "−1" (the initial value).After scanning j = 1, prepos 1 (C) is updated to one, because the second character is C; all of the other corresponding prepos 1 (c) remain the same as prepos 0 (c).After j = 2, prepos 2 (A) is updated to two; the other elements are unchanged.After j = 3, prepos 3 (G) is updated to three, the other elements remain unchanged: Now, consider Equation ( 3).ES j is the sum of the shifts in the current position, j, of pattern P if the bad character rule is applied.For each position, j in P , where m − 1 ≥ j ≥ 0, ES j is calculated by using Equation ( 3), which indicates the sum of shift values for each symbol, c ∈ Σ.
The maximal expected shift position (pos) for pattern P is computed using Equation ( 4).pos is defined as the first position in pattern P where the maximal ES j occur.

Preprocessing Phase
In the preprocessing phase, FQS needs to determine three elements: (1) The maximal expected shift position (pos) for pattern P using Equation ( 4); (2) a shift table for pattern P using the QS algorithm; and (3) a shift table for P [0, ..., pos − 1], the prefix of P , using the QS algorithm.The maximal expected shift position (pos, from Equation ( 4)) is the maximal expected shift distance using the bad character rule.pos is calculated from pattern P in the preprocessing phase.

Computing the ES array
In the naive computation, ES j = c (j − prepos(c)) for each symbol c, c ∈ Σ .The total time complexity for computing all ES j , where 0 ≤ j ≤ m − 1, is O(m|Σ|).Needless to say, it can be improved.The ES j can be calculated from ES j−1 , when j > 0. That is, the expected skip value at the current position can be calculated by utilizing the known expected skip value at the previous position.The difference between ES j and ES j−1 is: For each c ∈ Σ and c = P [j], prepos j−1 (c) = prepos j (c).That is, all symbols in pattern P have the property that prepos j−1 (c) = prepos j (c), except for the symbol at the current position, j.Put another way, except the current symbol in pattern P , for all of the other symbols in Σ, their current prepos j (c) is equal to prepos j−1 (c).The difference between ES j and ES j−1 can be further analyzed: Finally, we get:

Preprocessing algorithm
The preprocessing procedure is shown in Algorithm 3. We use an array, P reP os, of length |Σ|, to keep the previous position for each symbol, c, where c ∈ Σ.Following the above analysis, we can get ES j from ES j = ES j−1 + |Σ| − (j − prepos j−1 (P [j])) (Equation ( 7)), where m − 1 ≥ j ≥ 1.The computation can be done in constant time for each given j.Algorithm 3 shows the detailed preprocessing steps to compute the maximal expected shift position (pos) for pattern P [0, ..., m − 1] using Equation (4).In Algorithm 3, variable ES is the expected skip value, which is initialized to zero.In the first step of the loop in Lines 5-12, ES 0 will be set to |Σ| − 1. Variable maxES is the maximal expected shift value.Additionally, pos, a position in pattern P , is the location where the maximal expected shift value resides in the pattern, P .
Lines 2-4 initialize the value at each symbol to "−1" for the recent occurrence position array, P reP os (denoted as prepos in Equation ( 3)).Lines 5-12 are a f or loop, which calculates each position's expected shift value, ES, and determines the maximal expected value.Line 6 calculates the expected shift value, ES, using the incremental method, as discussed above (Equation ( 7)).Lines 9-12 search for the maximal expected shift value, maxES.The algorithm finally returns the maximal expected shift position, pos, in Line 13.Note that this preprocessing is only performed once for the pattern, P , using Θ(m + |Σ|) time.
Recall that FQS calculates three elements in its preprocessing phase, namely: (1) the maximal expected shift position (pos) for pattern P ; (2) a shift table for pattern P using the QS algorithm; and (3) a shift table for P [0, ..., pos − 1], the pos-length prefix of P , again using the QS algorithm.From the above calculation of the maximal expected shift position (pos), we know that the time complexity is Θ(m + |Σ|).For element ( 2) and ( 3), the computations are based on the QS algorithm, requiring time in Θ(pos + |Σ|) and Θ(m + |Σ|), respectively.Together, the overall preprocessing time complexity for FQS is Θ(m + |Σ|), since pos < m.

Search Phase
In the search phase, FQS starts to compare the position in the pattern, P , which has the maximal expected shift value, rather than the rightest-most position in P , as in the QS (and the other BM variants).Algorithm 4 shows the detailed steps.In Algorithm 4, the first three lines are the preprocessing steps.Line 1 calls Algorithm 3 to get the location, pos, with the maximal expected shift.Lines 2 and 3 calculate two shift tables (called next and shif t) for the prefix P [0, ..., pos − 1] and the entire pattern, P , respectively, using the same procedure as the classic QS preprocessing algorithm.Compared to the QS algorithm as shown in Algorithm 2, in the preprocessing phase, FQS adds two more lines: Lines 1 and 2. The total time complexity of the three steps is still O(m + |Σ|).
FQS determines the maximal expected shift position.This maximal expected shift position has the statistical maximum shift distance.Once a mismatch is found, the algorithm jumps to a new position, which has the expected maximal shift distance.This mechanism significantly speeds up the FQS algorithm (see the section on the results).
After the preprocessing step, the searching strategy of FQS is as follows: • In Algorithm 4, Lines 5-17 capture the searching phase.Compared with the QS algorithm, FQS adds Lines 6-11 in the search phase.In this phase, initially, text T is aligned with pattern P , at positions T [j] and P [0], respectively, where 0 ≤ j ≤ n − m.FQS first starts to compare the position of the maximal expected shift, pos in P , to the corresponding position, j + pos in T .If a mismatch occurs, the pattern is shifted to a position that is determined by the value, next[T [j + pos]].These steps are performed in Lines 6-11.Otherwise, the FQS algorithm does the same thing as the QS algorithm by starting to compare pattern P [0, ..., m − 1] and T [j, ..., j + m − 1] from right to left.

Correctness Analysis
The correctness of the FQS algorithm essentially follows from the correctness of the QS algorithm.In the search phase, the FQS algorithm uses two bad character shift arrays in two steps.When comparing pattern P to text T , FQS first checks the position, pos in P , the expected maximal shift position, comparing it to the position, j + pos in T .If there is a mismatch, it uses the shift array, next, to shift the pattern to the next right position.The shift value is at most pos + 1.It will not miss any potential matching position.After the first symbol comparison (P [pos] vs. T [j + pos]), the remaining steps are the same as in the QS algorithm.

Complexity Analysis
Section 3.1 provided details on computing the expected maximal shift position and showed the time complexity of Algorithm 3 to be in O(m + |Σ|).The other two preprocessing steps compute the shift arrays using the bad character rule, hence the time required for these two steps are also in O(m + |Σ|) according to the QS algorithm.
In the searching phase, the FQS algorithm integrates a pre-testing stage with the QS algorithm.The time complexity of one pre-test is constant, and the total time complexity is in Θ(n).The worst-case time complexity for searching phase in the FQS is O(mn), and the average time complexity is O(n).The extra space required by the FQS is in O(|Σ|).The FQS algorithm has the same worst-case and average-case time and space requirements as the QS algorithm.As with the general BM algorithm, the worst case complexity can be improved to O(n + m + |Σ|) using the good suffix heuristic with memorization [2,19].

Computing the Maximal Expected Shift Position (pos) for Pattern P
For calculating the maximal expected shift value of the pattern P = "GCAGTCAG", in Line 3 of Algorithm 3, the recent occurrence position (P reP os array in Algorithm 3) for these four symbols are initialized to "−1".Then, Algorithm 3 calculates pos for the pattern, P , by scanning from left to right.
When the first character P [0] = G is read (i = 0), P reP os[G] is the initial value of −1.Line 6 sets ES = 0 + 4 − (0 − (−1)1) = 3.This ES = 3 is the expected shift distance for Position 0 in pattern P .In Line 7, P reP os[G] is set to the current Position 0; this indicates that character G has appeared at least once at this time.In Lines 8-10, the maximal expected shift distance is set to maxES = 3.
When the fourth character P [3] = G is read (i = 3), the P reP os[G] value has been changed to its previous appearing position; in this case, P reP os[G] = 0. Line 6 will set ES = 6 + 4 − (3 − (0)) = 7.In Line 7, P reP os[G] is set to its current position, 3.In Line 9, the maximal expected shift distance is set to seven.
The remaining characters in pattern P are processed in a similar manner.The final expected shift distances for each position in pattern P [0, ..., m − 1] are 3, 5, 6, 7, 6, 6, 6, 6.The maximal expected shift position is in P [3] = G, which has a value of seven.Hence, we have the maximal expected shift distance in Position 3 of pattern P , that is, pos = 3 (see Table 1).

Attempt 1 x next[A,C,G,T] = {1, 2, 3, 4} G C A T C G C A G T C A G T A T A C A G T A C G C A G T C A G
Attempt 2: The second attempt is shown in Figure 2. The algorithm still starts to compare P

Experimental Results
We conducted a number of experiments to compare the FQS algorithm with other state-of-the-art QS algorithms, which are known to be among the fastest in practice: FJS [13], Horspool [14] and the QS itself [10].The implementation of FJS is provided by their authors in the paper [13].The implementation of the other two competitive algorithms are downloaded from the website developed by Christian Charras and Thierry Lecroq (http://www-igm.univ-mlv.fr/lecroq/string/).Their website provides the C code for a large number of exact string pattern matching algorithms, which they reviewed in [1,9].Our implementation of the FQS algorithm is also based on the codes for the QS algorithm provided at the site.
The experiments were conducted on two sets of data: one is a set of randomized text files, the other contains three practical text files.These three practical text files, E. coli, Bible and World192, were downloaded from the Large Canterbury Corpus (http://corpus.canterbury.ac.nz/).The computing environment was a personal computer with an Intel Core2 CPU with 1.66 GHz and 8 GB of RAM working in the Ubuntu 12.04 operating system.
The experimental results are shown separately for two cases: (1) the pattern length is less than or equal to 100 (m ≤ 100); and (2) the pattern length is greater or equal to 100 (m ≥ 100).The results show the following: • When m ≤ 100 and |Σ| = 2, 4, 8, 16, FQS is much faster than the others.Figure 6 shows the performance of the algorithms in these cases.When |Σ| = 2, 4, 8, 16, the trends are similar: FQS is the fastest algorithm among the four.The QS is the second best, which is slightly better than Horspool (denoted as HOR in the figures); • When m ≤ 100 and |Σ| = 32, 64, 128, 256, FQS and FJS demonstrate a competitive performance, which is better than the QS and HOR. Figure 7 shows the performances of the four algorithms in these situations.With the increasing of the alphabet size, the performance of the four algorithms tends to be similar.Although FQS is still among the best, the performance advantage over the others is less obvious.From Figure 7, we can observe that, when the pattern length is small (e.g., with m < 20), FJS provided the best performance among the four algorithms; • When m ≥ 100 and |Σ| = 2, 4, 8, 16, 32, 64, FQS provides the best results among the four algorithms.Figure 8 shows the comparative results.When the alphabet size is two, four and eight, respectively, QS is the second best.When the alphabet size is 32, and 64, FJS is ranked as the second best, only inferior to FQS; • When m ≥ 100 and |Σ| = 128, 256, QS is the best algorithm; FQS is similar to FJS, ranked as the second.Figure 9 shows the experimental results.When the length of the pattern is longer than 800, QS, FJS and FQS all have a very similar performance.When the alphabet size is small or medium (|Σ| = 2, 4, 8, 16, 32 and 64, respectively), the performance of FQS is significantly better than others.When the alphabet size is large (|Σ| = 128 or 256), FQS still has a competitive running performance.FQS is suitable to be used with a small or medium alphabet size (not more than 128).The longer the pattern is, the better FQS performs.We took a closer look at the impact of alphabet sizes on the performance.Figure 10 shows the average execution time plotted against alphabet size when the pattern length is fixed, for the cases with m = 10, m = 50, m = 100 and m = 800, respectively.From the figure, we can observe the overall trend for all of the algorithms: with increasing alphabet size |Σ|, the execution time decreases.FQS has better performance when |Σ| is small, especially for cases of long patterns (see m = 800 in Figure 10, for example).This suggests that FQS will have important potential applications in the analysis of a genomic database, since the alphabet size is usually very small, typically four (for DNA or RNA sequences) or 20 (for protein sequences).
We summarize our observations on random texts as follows.

Practical Text Files
The algorithms were also compared using the following three practical text files downloaded from the Large Canterbury Corpus: (1) E. coli: the sequence of the Escherichia coli genome consisting of 4,638,690 base pairs with |Σ| = 4; (2) The Bible: The King James version of the Bible consisting of 4,047,392 characters with |Σ| = 63; (3) World192: A CIA World Fact Book consisting of 2,473,400 characters with |Σ| = 94.
The experiments were carried out the same way as in the case of randomized text files.
Figure 11 shows the execution time versus pattern length from 10 to 100 in the three practical text files.Figure 12 shows the results for pattern length from 100 to 1,000.In all of these cases, FQS outperforms the others.
For E. coli, FQS is much better than the other algorithms.The QS and HOR are in the second group rank (Figures 11a and 12a).For the Bible, the four algorithms have a similar performance.FQS is slightly better than the others (Figures 11b and 12b).For World192, QS, FQS and HOR have a similar performance, with FQS showing a slightly better performance (Figures 11c and 12c).
For these practical files, FQS is the overall best algorithm among the four.Each of the three practical files has a symbol alphabet with size |σ| ≤ 128.This suggests that FQS might be the algorithm of choice for practical use, especially for searching genomic databases with typically smaller alphabets.

Number of Symbol Comparisons
To put the practical running times presented above in context, we also investigated the number of comparisons required by the algorithms and the number of pattern shifts performed during the match.These two parameters are the basic determinants of the running time of the algorithms.Below, we report on the performance of the two best algorithms, QS and FQS.
Table 2 shows the number of comparisons, their corresponding standard deviation (STD) and statistical significance (p-value) from algorithms QS and FQS, for pattern lengths m = 10, 100, 500 and 1,000, respectively.From the table, we can observe that, in all cases, the number of comparisons used by FQS is less than that of QS.The Student's t-test compares whether there is a statistical difference between these two algorithms by using p-value = 0.05 as the threshold.The p-value is shown in bold where there is a significant difference.For Bible and E. coli, there are significant differences in all cases.For World192, there is a statistically significant difference when pattern length m = 1,000; the other three cases (m = 10, 100, 500) do not show any statistically significant difference.Table 3 shows the corresponding results for the number of pattern shifts, the corresponding standard deviation (STD) and statistical difference (p-value) from algorithm QS and FQS, for the pattern lengths m = 10, 100, 500, 1,000.Again, the results show that in all cases, the number of pattern shifts performed by FQS is less than the number for QS.From a statistical point of view, in seven out of 12 cases, there are statistically significant difference in the performance of FQS over QS.Taken together, these two tables provide an explanation for the superior performance of FQS on the practical files when compared with the other QS variants.More importantly, the results show the effectiveness of the innovative use of an intelligent pre-testing stage before embarking on the more time-consuming pattern matching.In our FQS algorithm, this pre-testing is performed using pos, the location with the maximal expected shift in our FQS algorithm.

Statistical Analysis
We applied the Student's t-test to compare FQS vs. FJS, FQS vs. HOR and FQS vs. QS in 19 varied pattern lengths using randomized files and three practical files.The testing results are shown in Figure 13.The x-axis represents the 19 varied pattern lengths (m); the y-axis is arranged accordingly with the alphabet size |Σ| (11 total cases).From the figure, we can find that the statistical testing results match with the previous performances well.In most cases, FQS performed better than FJS, HOR and QS.When comparing FQS vs. HOR (Figure 13b), in a total of 209 cases, both had statically the same performance in 15 cases, and FQS outperformed HOR in the other 194 cases.Comparing FQS vs. QS (Figure 13c), FQS was statistically worse than QS in eight cases; both had similar performance in 34 cases, and FQS showed a statistically better performance in the remaining 167 cases.

Conclusions
The FQS algorithm improves on the quick search (QS) algorithm, by applying the bad character rule, aided with a statistically maximal expected shift value introduced in this work and a pre-testing stage before full pattern matching.Unlike previous approaches that blindly tested the first and last symbols in the pattern [20,21], our pre-testing stage is performed by computing the statistical maximal expected shift position.We have compared FQS against three other competitive QS variants: the QS itself, FJS and the Horspool algorithm.A range of text files were searched, including randomly generated text files with different alphabet sizes (2 ≤ |Σ| ≤ 256), and practical benchmark text files, namely E. coli, Bible and World192, from the Canterbury Corpus.The pattern lengths were varied from 10 to 1,000 with 19 varieties.We find that, statistically, FQS has the overall best performance (practical running time, number of symbol comparisons and number of pattern shifts) over all of the other three algorithms, mostly especially for text files with alphabet sizes less than 128.The results suggest that FQS could have important applications in practice, especially for genomic data sets, such as DNA or RNA sequences with four symbols or protein sequences with 20 symbols.

Step 1 :• Step 4 :• Step 5 :
Check the symbols at maximal expected shift position pos, that is, compare symbols P [pos] and T [j + pos]; • Step 2: If there is a mismatch, shift pattern P based on the distance determined by next[T [j+pos]].Go to Step 1 to continue checking position pos; • Step 3: If otherwise, compare P [0, ..., m − 1] to T [j, ..., j + m − 1], the same way as in the QS algorithm.If all matched, a matching pattern is found at position j in T ; Whether all matched or not, shift the pattern to the right based on the value of shif t[T [j + m]] using the classic quick search algorithm; Repeat the above Steps 1-4 in a loop until text T is exhausted (j > n − m).

3. 4 . 3 .
Searching Pattern P in T After preprocessing steps, the search phase begins.Attempt 1: The first attempt compares the pattern, P , to the text, T , from the beginning, as shown in Figure 1.Because the maximum of expected shift (pos) is three (pos = 3), the comparison starts at P [3] = G against the corresponding position in text T [j + pos] = T [0 + 3] = T [3].This will be the symbol, T , thus leading to a mismatch.The algorithm shifts the pattern, P , to the next position with the shift distance determined by next[T [3]] = next[T ] = 4. Additionally, the value of j is updated to j = j + next[T [3]] = 4.

Figure 2 .
Figure 2. The second attempt.Attempt 2 x next[A,C,G,T] = {1, 2, 3, 4} G C A T C G C A G T C A G T A T A C A G T A C G C A G T C A G

Attempt 3 :
The third attempt is shown in Figure 3.The algorithm compares P [3] = G to text T [j + pos] = T [5 + 3] = T [8] = G.The characters match.Then, the algorithm proceeds as the classic QS algorithm.After a one by one comparison, the algorithm finds an exact match here.It reports the occurrence position and determines the shift distance, j.This shift distance is determined by the classic QS algorithm j = j + shif t[T [j + m]] = 5 + shif t[T [5 + 8]] = 5 + shif t[T [13]] = 5 + shif t[T ] = 5 + 4 = 9.

Figure 3 .
Figure 3.The third attempt.Attempt 3 → shift[A,C,G,T] = {2, 3, 1, 4} G C A T C G C A G T C A G T A T A C A G T A C G C A G T C A G

Figure 4 .
Figure 4.The forth attempt.Attempt4 shift[A,C,G,T] = {2, 3, 1, 4} x → G C A T C G C A G T C A G T A T A C A G T A C G C A G T C A G

Attempt 5 :
The fifth attempt is shown in Figure 5.The algorithm compares P [3] = G to the corresponding position in text T [j + pos] = T [12 + 3] = T [15] = T .It is a mismatch.The shift distance is determined by FQS shift value next[T [15]] = 4, and the value of j is updated to j = j + next[T [15]] = 12 + 4 = 16.For n = 23, m = 8, when j = 16 > n − m = 15, text T is exhausted.The search phase stops.

Figure 5 .
Figure 5.The fifth attempt.Attempt 5 next[A,C,G,T] = {1, 2, 3, 4} x G C A T C G C A G T C A G T A T A C A G T A C G C A G T C A G

Figure 13 .
Figure13.The Student t-test for: (a) FQS vs. FJS (b) FQS vs. HOR and (c) FQS vs. QS in the condition of execution time versus pattern length, m. "B" denotes cases when FQS has a statistically significant better performance over the comparison algorithm; "S" denotes cases where FQS and the comparison algorithm have statistically the same performance; "W" denotes cases where FQS has a statistically significant worse performance than the comparison algorithm.

Table 1 .
ES, next and shif t arrays for an example pattern.

Table 2 .
The number of symbol comparisons used by QS and FQS.

Table 3 .
The number of pattern shifts used by QS and FQS.