On String Matching with Mismatches

In this paper, we consider several variants of the pattern matching with mismatches problem. In particular, given a text T = t1t2 · · · tn and a pattern P = p1p2 · · · pm, we investigate the following problems: (1) pattern matching with mismatches: for every i, 1 ≤ i ≤ n − m + 1, output the distance between P and titi+1 · · · ti+m−1; and (2) pattern matching with k mismatches: output those positions i where the distance between P and titi+1 · · · ti+m−1 is no more than a given threshold k. The distance metric used is the Hamming distance. We present some novel algorithms and techniques for solving these problems. We offer deterministic, randomized and approximation algorithms. We consider variants of these problems where there could be wild cards in either the text or the pattern or both. We also present an experimental evaluation of these algorithms. The source code is available at http://www.engr.uconn.edu/~man09004/kmis.zip.


Introduction
The problem of string matching has been studied extensively due to its wide range of applications from Internet searches to computational biology. String matching can be defined as follows. Given a text $T = t_1 t_2 \cdots t_n$ and a pattern $P = p_1 p_2 \cdots p_m$, with letters from an alphabet Σ, find all of the occurrences of the pattern in the text. This problem can be solved in O(n + m) time by using well-known algorithms (e.g., KMP [1]). A variation of this problem is to search for multiple patterns at the same time. An algorithm for this version is given in [2].
A more general formulation allows "don't care" or "wild card" characters in the text and the pattern. A wild card matches any character. An algorithm for pattern matching with wild cards is given in [3] and has a runtime of O(n log |Σ| log m). The algorithm maps each character in Σ to a binary code of length log |Σ|. Then, a constant number of convolution operations is used to check for mismatches between the pattern and any position in the text. For the same problem, a randomized algorithm that runs in O(n log n) time with high probability is given in [4]. A slightly faster randomized O(n log m) algorithm is given in [5]. A simple deterministic O(n log m) time algorithm based on convolutions is given in [6].
A more challenging formulation of the problem is pattern matching with mismatches. This formulation appears in two versions: (1) for every alignment of the pattern in the text, find the distance between the pattern and the alignment; or (2) identify only those alignments where the distance between the pattern and the text is less than a given threshold. The distance metric can be the Hamming distance, edit distance, $L_1$ metric, and so on. The problem has been generalized to use trees instead of sequences or to use sets of characters instead of single characters (see [7]).
A survey of string matching with mismatches is given in [8]. A description of practical on-line string searching algorithms can be found in [9].
The Hamming distance between two strings A and B, of equal length, is defined as the number of positions where the two strings differ and is denoted by Hd(A, B).
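In this paper, we are interested in the following two problems, with and without wild cards.
1. Pattern matching with mismatches: Given a text $T = t_1 t_2 \cdots t_n$ and a pattern $P = p_1 p_2 \cdots p_m$, output $Hd(P, t_i t_{i+1} \cdots t_{i+m-1})$ for every $i$, $1 \le i \le n - m + 1$.
2. Pattern matching with k mismatches (or the k mismatches problem): Take the same input as above, plus an integer k. Output all $i$, $1 \le i \le n - m + 1$, for which $Hd(P, t_i t_{i+1} \cdots t_{i+m-1}) \le k$.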
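The two problem statements can be read almost directly as code. The following is a minimal Python sketch of the naive O(nm) baseline, using 0-based alignment indices; the function names are illustrative and do not come from the paper or its source archive.

```python
def hamming_distances(text, pattern):
    """Return Hd(P, T_i) for every alignment i (0-based), by direct comparison."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def k_mismatch_positions(text, pattern, k):
    """Return the alignments whose Hamming distance to the pattern is at most k."""
    return [i for i, d in enumerate(hamming_distances(text, pattern)) if d <= k]


if __name__ == "__main__":
    print(hamming_distances("abcabd", "abd"))        # one distance per alignment
    print(k_mismatch_positions("abcabd", "abd", 1))  # alignments within 1 mismatch
```

Every faster algorithm discussed in the paper must return the same answers as this baseline, which makes it a convenient reference when testing.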

Pattern Matching with Mismatches
For pattern matching with mismatches, a naive algorithm computes the Hamming distance for every alignment of the pattern in the text, in time O(nm). A faster algorithm, in the absence of wild cards, is Abrahamson's algorithm [10] that runs in $O(n\sqrt{m \log m})$ time. Abrahamson's algorithm can be extended to solve pattern matching with mismatches and wild cards, as we prove in Section 2.2.1. The new algorithm runs in $O(n\sqrt{g \log m})$ time, where g is the number of non-wild card positions in the pattern. This gives a simpler and faster alternative to an algorithm proposed in [11].
In the literature, we also find algorithms that approximate the number of mismatches for every alignment. For example, an approximate algorithm for pattern matching with mismatches, in the absence of wild cards, that runs in O(rn log m) time, where r is the number of iterations of the algorithm, is given in [12]. Every distance reported has a variance bounded by $(m - c_i)/r^2$, where $c_i$ is the exact number of matches for alignment i.
Furthermore, a randomized algorithm that approximates the Hamming distance for every alignment within an $\epsilon$ factor and runs in $O(n \log^c m/\epsilon^2)$ time, in the absence of wild cards, is given in [13]. Here, c is a small constant. We extend this algorithm to pattern matching with mismatches and wild cards, in Section 2.3. The new algorithm approximates the Hamming distance for every alignment within an $\epsilon$ factor in time $O(n \log^2 m/\epsilon^2)$ with high probability.
Recent work has also addressed the online version of pattern matching, where the text is received in a streaming model, one character at a time, and it cannot be stored in its entirety (see, e.g., [14][15][16]). Another version of this problem matches the pattern against multiple input streams (see, e.g., [17]). Another interesting problem is to sample a representative set of mismatches for every alignment (see, e.g., [18]).

Pattern Matching with K Mismatches
For the k mismatches problem, without wild cards, two algorithms that run in O(nk) time are presented in [19,20]. A faster algorithm, which runs in $O(n\sqrt{k \log k})$ time, is given in [11]. This algorithm combines the two main techniques known in the literature for pattern matching with mismatches: filtering and convolutions. We give a significantly simpler algorithm in Section 2.2.3, having the same worst case run time. The new algorithm will never perform more operations than the one in [11] during marking and convolution.
An intermediate problem is to check if the Hamming distance is less than or equal to k for a subset of the aligned positions. This problem can be solved with the Kangaroo method proposed in [11] at a cost of O(k) time per alignment, using O(n + m) additional memory. We show how to achieve the same run time per alignment using only O(m) additional memory, in Section 2.2.2.
Further, we look at the version of k mismatches where wild cards are allowed in the text and the pattern. For this problem, two randomized algorithms are presented in [17]. The first one runs in $O(nk \log n \log m)$ time, and the second one in $O(n \log m(k + \log n \log\log n))$ time. Both are Monte Carlo algorithms, i.e., they output the correct answer with high probability. The same paper also gives a deterministic algorithm with a run time of $O(nk^2 \log^3 m)$. Furthermore, a deterministic $O(nk \log^2 m(\log^2 k + \log\log m))$ time algorithm is given in [21]. We present a Las Vegas algorithm (that always outputs the correct answer), in Section 2.4.3, which runs in time $O(nk \log^2 m + n \log^2 m \log n + n \log m \log n \log\log n)$ with high probability.
An algorithm for k mismatches with wild cards in either the text or the pattern (but not both) is given in [22]. This algorithm runs in $O(n m^{1/3} k^{1/3} \log^{2/3} m)$ time.

Our Results
The contributions of this paper can be summarized as follows.
For pattern matching with mismatches:
• An algorithm for pattern matching with mismatches and wild cards that runs in $O(n\sqrt{g \log m})$ time, where g is the number of non-wild card positions in the pattern; see Section 2.2.1.
• A randomized algorithm that approximates the Hamming distance for every alignment, when wild cards are present, within an $\epsilon$ factor in time $O(n \log^2 m/\epsilon^2)$ with high probability; see Section 2.3.
For pattern matching with k mismatches:
• An algorithm for pattern matching with k mismatches, without wild cards, that runs in $O(n\sqrt{k \log k})$ time; this algorithm is simpler and has a better expected run time than the one in [11]; see Section 2.2.3.
The rest of the paper is organized as follows. First, we introduce some notations and definitions. Then, we describe the exact, deterministic algorithms for pattern matching with mismatches and for k mismatches. Then, we present the randomized and approximate algorithms: first the algorithm for approximate counting of mismatches in the presence of wild cards, then the Las Vegas algorithm for k mismatches with wild cards. Finally, we present an empirical run time comparison of the deterministic algorithms and conclusions.

Some Definitions
Given two strings $T = t_1 t_2 \ldots t_n$ and $P = p_1 p_2 \ldots p_m$ (with $m \le n$), the convolution of T and P is a sequence $C = c_1, c_2, \ldots, c_{n-m+1}$ where $c_i = \sum_{j=1}^{m} t_{i+j-1}\, p_j$, for $1 \le i \le (n - m + 1)$. This convolution can be computed in O(n log m) time using the fast Fourier transform. If the convolutions are applied on binary inputs, as is often the case in pattern matching applications, some speedup techniques are presented in [23].
In the context of randomized algorithms, by high probability, we mean a probability greater than or equal to $(1 - n^{-\alpha})$, where n is the input size and $\alpha$ is a probability parameter usually assumed to be a constant greater than 0. The run time of a Las Vegas algorithm is said to be $O(f(n))$ if the run time is no more than $c\,\alpha f(n)$ with probability greater than or equal to $(1 - n^{-\alpha})$ for all $n \ge n_0$, where c and $n_0$ are some constants and for any constant $\alpha \ge 1$.
In the analysis of our algorithms, we will employ the following Chernoff bounds.
Chernoff bounds [24]: These bounds can be used to closely approximate the tail ends of a binomial distribution.
A Bernoulli trial has two outcomes, namely success and failure, the probability of success being p.A binomial distribution with parameters n and p, denoted as B(n, p), is the number of successes in n independent Bernoulli trials.
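Let X be a binomial random variable whose distribution is B(n, p). If m is any integer > np, then the following are true:
$$\Pr[X > m] \le \left(\frac{np}{m}\right)^{m} e^{m - np}; \qquad (1)$$
$$\Pr[X > (1 + \delta)np] \le e^{-\delta^2 np/3}; \qquad (2)$$
$$\Pr[X < (1 - \delta)np] \le e^{-\delta^2 np/2}, \qquad (3)$$
for any 0 < δ < 1.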

Deterministic Algorithms
In this section, we present deterministic algorithms for pattern matching with mismatches. We start with a summary of two well-known techniques for counting matches: convolution and marking (see, e.g., [11]). In terms of notation, $T_{i..j}$ is the substring of T between i and j, and $T_i$ stands for $T_{i..i+m-1}$. Furthermore, the value at position i in array X is denoted by X[i].
Convolution: Given a string S and a character α, define string $S_\alpha$, such that $S_\alpha[i] = 1$ if $S[i] = \alpha$, and 0 otherwise. Let $C_\alpha = \text{convolution}(T_\alpha, P_\alpha)$. Then, $C_\alpha[i]$ gives the number of matches between P and $T_i$ where the matching character is α. Then, $\sum_{\alpha \in \Sigma} C_\alpha[i]$ is the total number of matches between P and $T_i$.
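The convolution technique can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, indices are 0-based, and the convolution is evaluated directly in O(nm) time per character, whereas the paper assumes an FFT-based O(n log m) routine.

```python
def convolution(t, p):
    """Directly evaluate c_i = sum_j t[i + j] * p[j] for every alignment i.

    A real implementation would use the FFT; the direct sum keeps the sketch
    dependency-free.
    """
    n, m = len(t), len(p)
    return [sum(t[i + j] * p[j] for j in range(m)) for i in range(n - m + 1)]


def indicator(s, alpha):
    """Return S_alpha: 1 where s has character alpha, 0 elsewhere."""
    return [1 if ch == alpha else 0 for ch in s]


def count_matches_by_convolution(text, pattern, alphabet=None):
    """Sum the per-character convolutions C_alpha over the chosen alphabet."""
    if alphabet is None:
        alphabet = set(pattern)  # characters absent from the pattern add nothing
    total = [0] * (len(text) - len(pattern) + 1)
    for alpha in alphabet:
        c_alpha = convolution(indicator(text, alpha), indicator(pattern, alpha))
        total = [a + b for a, b in zip(total, c_alpha)]
    return total
```

Passing a restricted `alphabet` counts only the matches contributed by those characters, which is how the technique is combined with marking later in the section.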
Marking: Given a character α, let $Pos[\alpha]$ be the set of positions where character α is found in P (i.e., $Pos[\alpha] = \{i \mid 1 \le i \le m,\ p_i = \alpha\}$). Note that, if $t_i = \alpha$, then the alignment between P and $T_{i-j+1}$ will match $t_i = p_j = \alpha$, for all $j \in Pos[\alpha]$. This gives the marking algorithm: for every position i in the text, increment the number of matches for alignment $i - j + 1$, for all $j \in Pos[t_i]$. In practice, we are interested in doing the marking only for certain characters, meaning we will do the incrementing only for the positions $t_i = \alpha$ where $\alpha \in \Gamma \subseteq \Sigma$. The algorithm then takes $O(n \max_{\alpha \in \Gamma} |Pos[\alpha]|)$ time. The pseudocode is given in Algorithm 1.
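A possible Python rendering of the marking technique is shown below. It is a sketch rather than a transcription of Algorithm 1: indices are 0-based (so text position i and pattern position j mark alignment i - j), and the function and parameter names are assumptions made for illustration.

```python
from collections import defaultdict


def count_matches_by_marking(text, pattern, gamma=None):
    """Count matches per alignment using marking.

    gamma is the set of characters to mark with; None means every character.
    """
    n, m = len(text), len(pattern)
    # pos[alpha] lists the pattern positions holding character alpha.
    pos = defaultdict(list)
    for j, ch in enumerate(pattern):
        if gamma is None or ch in gamma:
            pos[ch].append(j)

    marks = [0] * (n - m + 1)
    for i, ch in enumerate(text):
        for j in pos.get(ch, ()):
            alignment = i - j
            if 0 <= alignment <= n - m:  # skip alignments that fall off the text
                marks[alignment] += 1
    return marks
```

The inner loop runs once per element of Pos[t_i], which is where the O(n max |Pos[α]|) bound comes from.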

Pattern Matching with Mismatches
For pattern matching with mismatches, without wild cards, Abrahamson [10] gave the following $O(n\sqrt{m \log m})$ time algorithm. Let A be a set of the most frequent characters in the pattern: (1) using convolutions, count how many matches each character in A contributes to every alignment; (2) using marking, count how many matches each character in Σ − A contributes to every alignment; and (3) add the two numbers to find, for every alignment, the number of matches between the pattern and the text. The convolutions take O(|A| n log m) time. A character in Σ − A cannot appear more than m/|A| times in the pattern; otherwise, each character in A has a frequency greater than m/|A|, which is not possible. Thus, the run time for marking is O(nm/|A|). If we equate the two run times, we find the optimal $|A| = \sqrt{m/\log m}$, which gives a total run time of $O(n\sqrt{m \log m})$.
In Step 1, convolution is used to count the number of matches contributed by each character in A. We obtain an array $M_1[1:12]$, such that $M_1[i]$ is the number of matches contributed by characters in A to the alignment of P with $T_i$, for $1 \le i \le 12$. In this example, $M_1 = [0, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1]$. In Step 2, we compute, using marking, the number of matches contributed by the characters 3 and 4 to each alignment between T and P. We get another array $M_2[1:12]$, such that $M_2[i]$ is the number of matches contributed by 3 and 4 to the alignment between $T_i$ and P, for $1 \le i \le 12$. Specific to this example, $M_2 = [0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1]$. In Step 3, we add $M_1$ and $M_2$ to get the number of matches between $T_i$ and P, for $1 \le i \le 12$. In this example, this sum yields $[0, 1, 1, 1, 0, 4, 1, 0, 0, 1, 0, 2]$.
For pattern matching with mismatches and wild cards, a fairly complex algorithm is given in [11]. The run time of this algorithm is $O(n\sqrt{g \log m})$, where g is the number of non-wild card positions in the pattern. The problem can also be solved through a simple modification of Abrahamson's algorithm, in time $O(n\sqrt{m \log m})$, as pointed out in [17]. We now prove the following result:
Theorem 1. Pattern matching with mismatches and wild cards can be solved in $O(n\sqrt{g \log m})$ time, where g is the number of non-wild card positions in the pattern.
Proof. Ignoring the wild cards for now, let A be the set of the most frequent characters in the pattern. As above, count matches contributed by characters in A and Σ − A using convolution and marking, respectively. By a similar reasoning as above, the characters used in the marking phase will not appear more than g/|A| times in the pattern. If we equate the run times for the two phases, we obtain $O(n\sqrt{g \log m})$ time. We are now left to count how many matches are contributed by the wild cards. For a string S and a character α, define $S_{\neg\alpha}$ as $S_{\neg\alpha}[i] = 1 - S_\alpha[i]$. Let w be the wild card character. Compute $C = \text{convolution}(T_{\neg w}, P_{\neg w})$. Then, for every alignment i, the number of positions that have a wild card either in the text, or the pattern, or both, is $m - C[i]$. Add $m - C[i]$ to the previously computed counts and output. The total run time is $O(n\sqrt{g \log m})$.
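The wild card counting step of the proof can be illustrated with a short Python sketch. It is not the full algorithm: for brevity every non-wild character is handled by a directly evaluated convolution (no split into frequent and infrequent characters, no FFT), and the wild card symbol "?" and all function names are assumptions made for the example.

```python
WILD = "?"  # hypothetical wild card symbol, chosen for this example only


def convolution(t, p):
    """Direct O(nm) evaluation of c_i = sum_j t[i + j] * p[j]."""
    n, m = len(t), len(p)
    return [sum(t[i + j] * p[j] for j in range(m)) for i in range(n - m + 1)]


def matches_with_wildcards(text, pattern, wild=WILD):
    """Count matches per alignment, where a wild card matches anything."""
    n, m = len(text), len(pattern)
    counts = [0] * (n - m + 1)

    # Matches between equal non-wild characters.
    for alpha in set(pattern) - {wild}:
        t_alpha = [1 if c == alpha else 0 for c in text]
        p_alpha = [1 if c == alpha else 0 for c in pattern]
        counts = [a + b for a, b in zip(counts, convolution(t_alpha, p_alpha))]

    # C[i] counts positions where neither side is wild, so m - C[i] is the
    # number of positions with a wild card in the text, the pattern, or both.
    t_not_w = [0 if c == wild else 1 for c in text]
    p_not_w = [0 if c == wild else 1 for c in pattern]
    c = convolution(t_not_w, p_not_w)
    return [counts[i] + (m - c[i]) for i in range(n - m + 1)]
```

Subtracting each returned value from m gives the Hamming distance with wild cards for that alignment.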

Pattern Matching with K Mismatches
For the k mismatches problem, without wild cards, an O(k(m log m + n)) time algorithm that requires O(k(m + n)) additional space is presented in [19]. Another algorithm, that takes O(m log m + kn) time and uses only O(m) additional space, is presented in [20]. We define the following problem, which is of interest in the discussion.
Subset k mismatches: given a text T of length n, a pattern P of length m, an integer k and a set S of alignment positions, output the positions $i \in S$ for which $Hd(P, T_i) \le k$.
When S contains every alignment, this is the regular k mismatches problem. Thus, it can be solved by the O(nk) algorithms mentioned above. However, if $|S| \ll n$, then the O(nk) algorithms are too costly. A better alternative is to use the Kangaroo method proposed in [11].
The Kangaroo method can verify if $Hd(P, T_i) \le k$ in O(k) time for any i. The method works as follows. Build a suffix tree of T#P and enhance it to support O(1) lowest common ancestor (LCA) queries. For a given i, perform an LCA query to find the position of the first mismatch between P and $T_i$. Let this position be j. Then, perform another LCA query to find the first mismatch between $P_{j+1..m}$ and $T_{i+j..i+m-1}$, which is the second mismatch of alignment i. Continue to "jump" from one mismatch to the next, until the end of the pattern is reached or we have found more than k mismatches. The Kangaroo method can process |S| positions in O(n + m + |S|k) time, and it uses O(n + m) additional memory for the LCA-enhanced suffix tree. We now prove the following result:
Theorem 2. Subset k mismatches can be solved in O(n + m + |S|k) time using only O(m) additional memory.
Proof. The algorithm is the following. Build an LCA-enhanced suffix tree of the pattern. Scan the text from left to right: (1) find the longest unscanned region of the text that can be found somewhere in the pattern, say starting at position i of the pattern; call this region of the text R; therefore, R is identical to $P_{i..i+|R|-1}$; and (2) for every alignment in S that overlaps R, count the number of mismatches between R and the alignment, within the overlap region. To do this, consider an alignment in S that overlaps R, such that the beginning of R aligns with the j-th character in the pattern. We want to count the number of mismatches between R and $P_{j..j+|R|-1}$. However, since R is identical to $P_{i..i+|R|-1}$, we can simply compare $P_{i..i+|R|-1}$ and $P_{j..j+|R|-1}$. This comparison can be done efficiently by jumping from one mismatch to the next, as in the Kangaroo method. Repeat from Step 1 until the entire text has been scanned. Every time we process an alignment, in Step 2, we either discover at least one additional mismatch or we reach the end of the alignment. This is true, because, otherwise, the alignment would match the text for more than |R| characters, which is not possible, from the way we defined R.
Every alignment for which we have found more than k mismatches is excluded from further consideration to ensure O(k) time per alignment. It takes O(m) time to build the LCA-enhanced suffix tree of the pattern and O(n) additional time to scan the text from left to right. Thus, the total run time is O(n + m + |S|k) with O(m) additional memory. The pseudocode is given in Algorithm 2.
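The jumping pattern of the Kangaroo method can be sketched as follows. Note the substitution: the O(1) LCA query on a suffix tree is replaced here by a longest common extension computed through direct character comparison, so this sketch shows the control flow only and does not achieve O(k) time per alignment. All names are illustrative and indices are 0-based.

```python
def lce(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:].

    Stand-in for an O(1) LCA query on an LCA-enhanced suffix tree.
    """
    length = 0
    while i + length < len(a) and j + length < len(b) and a[i + length] == b[j + length]:
        length += 1
    return length


def kangaroo_within_k(text, pattern, i, k):
    """Return True if Hd(pattern, text[i:i+m]) <= k, jumping between mismatches."""
    m = len(pattern)
    j = 0            # current position in the pattern
    mismatches = 0
    while j < m:
        j += lce(text, i + j, pattern, j)  # jump over the matching run
        if j >= m:
            break                          # reached the end of the pattern
        mismatches += 1                    # position j is a mismatch
        if mismatches > k:
            return False                   # stop early after k + 1 mismatches
        j += 1
    return True


def subset_k_mismatches(text, pattern, positions, k):
    """Filter the alignments in positions down to those within k mismatches."""
    return [i for i in positions if kangaroo_within_k(text, pattern, i, k)]
```

Each loop iteration either finishes the alignment or records a new mismatch, so at most k + 1 extension queries are made per alignment.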

An
For the k mismatches problem, without wild cards, a fairly complex $O(n\sqrt{k \log k})$ time algorithm is given in [11]. The algorithm classifies the inputs into several cases. For each case, it applies a combination of marking followed by a filtering step, the Kangaroo method, or convolutions. The goal is to not exceed $O(n\sqrt{k \log k})$ time in any of the cases. We now present an algorithm with only two cases that has the same worst case run time. The new algorithm can be thought of as a generalization of the algorithm in [11], as we will discuss later. This generalization not only greatly simplifies the algorithm, but it also reduces the expected run time. This happens because we use information about the frequency of the characters in the text and try to minimize the work done by convolutions and marking.
We will now give the intuition for this algorithm. For any character α ∈ Σ, let $f_\alpha$ be its frequency in the pattern and $F_\alpha$ be its frequency in the text. Note that in the marking algorithm, a specific character α will contribute to the runtime a cost of $F_\alpha \times f_\alpha$. On the other hand, in the case of convolution, a character α costs us one convolution, regardless of how frequent α is in the text or the pattern. Therefore, we want to use infrequent characters for marking and frequent characters for convolution. The balancing of the two will give us the desired runtime.
A position j in the pattern where p j = α is called an instance of α.Consider every instance of character α as an object of size 1 and cost F α .We want to fill a knapsack of size 2k at a minimum cost and without exceeding a given budget B. The 2k instances will allow us to filter some of the alignments with more than k mismatches, as will become clear later.This problem can be optimally solved by a greedy approach, where we include in the knapsack all of the instances of the least expensive character, then all of the instances of the second least expensive character, and so on, until we have 2k items or we have exceeded B. The last character considered may have only a subset of its instances included, but for the ease of explanation, assume that there are no such characters.
Note: Even though the above is described as a knapsack problem, the particular formulation can be optimally solved in linear time.This formulation should not be confused with other formulations of the knapsack problem that are NP-complete.
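The greedy filling of the knapsack might look as follows in Python. This is a sketch under assumptions: the function name, the return format and the tie-breaking between equally expensive characters are choices made for the example, and sorting the characters makes it O(n + m + |Σ| log |Σ|) rather than strictly linear.

```python
from collections import Counter


def fill_knapsack(text, pattern, k, budget):
    """Greedily pick up to 2k pattern instances with the cheapest text frequency.

    Returns (chosen, cost, filled): chosen maps a character to the number of
    its instances placed in the knapsack, cost is the total marking cost, and
    filled says whether 2k instances fit within the budget B.
    """
    text_freq = Counter(text)        # F_alpha: cost of one instance of alpha
    pattern_freq = Counter(pattern)  # f_alpha: number of instances of alpha

    chosen = {}
    size = 0
    cost = 0
    # Cheapest characters first; ties broken by character for determinism.
    for ch in sorted(pattern_freq, key=lambda c: (text_freq[c], c)):
        if size >= 2 * k or cost > budget:
            break
        take = min(pattern_freq[ch], 2 * k - size)  # last character may be partial
        chosen[ch] = take
        size += take
        cost += take * text_freq[ch]

    filled = size >= 2 * k and cost <= budget
    return chosen, cost, filled
```

When `filled` is true the algorithm proceeds with marking and filtering (Case 1 in the text that follows); otherwise it falls back to marking plus convolutions (Case 2).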
Case (1): Assume we can fill the knapsack at a cost C ≤ B. We apply the marking algorithm for the characters whose instances are included in the knapsack. It is easy to see that the marking takes time O(C) and creates C marks. For alignment i, if the pattern and the text match for all of the 2k positions in the knapsack, we will obtain exactly 2k marks at position i. Conversely, any position that has fewer than k marks must have more than k mismatches, so we can filter it out. Therefore, there will be at most C/k positions with k marks or more. For such positions, we run subset k mismatches to confirm which of them have no more than k mismatches. The total runtime of the algorithm in this case is O(C).
Case (2): If we cannot fill the knapsack within the given budget B, we do the following: for the characters we could fit in the knapsack, we use the marking algorithm to count the number of matches that they contribute to each alignment. For characters not in the knapsack, we use convolutions to count the number of matches that they contribute to each alignment. We add the two counts and get the exact number of matches for every alignment.
Note that at least one of the instances in the knapsack has a cost larger than B/(2k) (if all of the instances in the knapsack had a cost less than or equal to B/(2k), then we would have at least 2k instances in the knapsack). Furthermore, note that all of the instances not in the knapsack have a cost at least as high as any instance in the knapsack, because we greedily fill the knapsack starting with the least costly instances. This means that every character not in the knapsack appears in the text at least B/(2k) times. This means that the number of characters not in the knapsack does not exceed n/(B/(2k)). Since each convolution takes O(n log m) time, the total cost of convolutions is $O((n^2 k/B) \log m)$. Since the cost of marking was O(B), we can see that the best value of B is the one that equalizes the two costs. This gives $B = O(n\sqrt{k \log m})$. Therefore, the algorithm takes $O(n\sqrt{k \log m})$ time. If $k < m^{1/3}$, we can employ a different algorithm that solves the problem in linear time, as in [11]. For larger k, O(log m) = O(log k), so the run time becomes $O(n\sqrt{k \log k})$. We call this algorithm knapsack k mismatches. The pseudocode is given in Algorithm 3. The following theorem results.
Theorem 3. Knapsack k mismatches has a worst case run time of $O(n\sqrt{k \log k})$.
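The choice of the budget B in Case (2) can be written out step by step. The derivation below is a sketch that ignores constant factors and assumes, as in the definitions section, that one convolution costs O(n log m).

```latex
\begin{align*}
\#\{\text{characters outside the knapsack}\} &\le \frac{n}{B/(2k)} = \frac{2nk}{B},\\
\text{cost of convolutions} &= O\!\left(\frac{2nk}{B}\cdot n\log m\right)
  = O\!\left(\frac{n^{2}k\log m}{B}\right),\\
\text{cost of marking} &= O(B),\\
B = \frac{n^{2}k\log m}{B}
  \;&\Longrightarrow\; B^{2} = n^{2}k\log m
  \;\Longrightarrow\; B = n\sqrt{k\log m}.
\end{align*}
```

With this budget both phases cost the same amount up to constants, which is where the square root in the overall bound comes from.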
We can think of the algorithm in [11] as a special case of our algorithm where, instead of trying to minimize the cost of the 2k items in the knapsack, we just try to find 2k items for which the cost is less than $O(n\sqrt{k \log m})$. As a result, it is easy to verify the following:
Theorem 4. Knapsack k mismatches spends at most as much time as the algorithm in [11] to do convolutions and marking.
Proof. Observation: In all of the cases presented below, knapsack k mismatches can have a run time as low as O(n), for example if there exists one character α with $f_\alpha = O(k)$ and $F_\alpha = O(n/k)$.
Case 1: |Σ| ≥ 2k. The algorithm in [11] chooses 2k instances of distinct characters to perform marking. Therefore, for every position of the text, at most one mark is created. If the number of marks is M, then the cost of the marking phase is O(n + M). The number of remaining positions after filtering is no more than M/k, and thus, the algorithm takes O(n + M) time. Our algorithm puts in the knapsack 2k instances, of not necessarily different characters, such that the number of marks B is minimized. Therefore, B ≤ M, and the total runtime is O(n + B).
Case 2: $|\Sigma| < 2\sqrt{k}$. The algorithm in [11] performs one convolution per character to count the total number of matches for every alignment, for a run time of Ω(|Σ| n log m). In the worst case, knapsack k mismatches cannot fill the knapsack at a cost B < |Σ| n log m, so it defaults to the same run time. However, in the best case, the knapsack can be filled at a cost B as low as O(n) depending on the frequency of the characters in the pattern and the text. In this case, the runtime will be O(n).
Case 3: $2\sqrt{k} \le |\Sigma| < 2k$. A symbol that appears in the pattern at least $2\sqrt{k}$ times is called frequent.
Case 3.1: There are at least $\sqrt{k}$ frequent symbols. The algorithm in [11] chooses $2\sqrt{k}$ instances of $\sqrt{k}$ frequent symbols to do marking and filtering at a cost $M \le 2n\sqrt{k}$. Since knapsack k mismatches will minimize the marking time B, we have B ≤ M, so the run time is the same as for [11] only in the worst case.
Case 3.2: There are A < √ k frequent symbols.The algorithm in [11] first performs one convolution for each frequent character for a run time of O(An log m).Two cases remain: Case 3.2.1:All of the instances of the non-frequent symbols number less than 2k positions.The algorithm in [11] replaces all instances of frequent characters with wild cards and applies a O(n √ g log m) algorithm to count mismatches, where g is the number of non-wild card positions.Since Case 3.2.2:All of the instances of the non-frequent symbols number at least 2k positions.The algorithm in [11] chooses 2k instances of infrequent characters to do marking.Since each character has a frequency less than 2 √ k, the time for marking is M < 2n √ k, and there are no more than M/k positions left after filtering.Knapsack k mismatches chooses characters in order to minimize the time B for marking, so again B ≤ M .

Approximate Counting of Mismatches
The algorithm of [13] takes as input a text $T = t_1 t_2 \ldots t_n$ and a pattern $P = p_1 p_2 \ldots p_m$ and approximately counts the Hamming distance between $T_i$ and P for every $1 \le i \le (n - m + 1)$. In particular, if the Hamming distance between $T_i$ and P is $H_i$ for some i, then the algorithm outputs $h_i$ where $H_i \le h_i \le (1 + \epsilon)H_i$ for any $\epsilon > 0$ with high probability (i.e., a probability of $\ge (1 - m^{-\alpha})$). The run time of the algorithm is $O(n \log^2 m/\epsilon^2)$. In this section, we show how to extend this algorithm to the case where there could be wild cards in the text and/or the pattern.
Let Σ be the alphabet under concern, and let σ = |Σ|. The algorithm runs in phases, and in each phase, we randomly map the elements of Σ to {1, 2}. A wild card is mapped to a zero. Under this mapping, we transform T and P to $T'$ and $P'$, respectively. We then compute a vector C where $C[i] = \sum_{j=1}^{m} (t'_{i+j-1} - p'_j)^2\, t'_{i+j-1}\, p'_j$. This can be done using O(1) convolution operations (as in Section 2.4.1; see also [17]). A series of r such phases (for some relevant value of r) is done, at the end of which, we produce estimates on the Hamming distances. The intuition is that if a character x in T is aligned with a character y in P, then across all of the r phases, the expected contribution to C from these characters is r if $x \ne y$ (assuming that x and y are non-wild cards). If x = y or if one or both of x and y are a wild card, the contribution to C is zero.
In particular, each element of Σ is mapped to 1 or 2 randomly with equal probability. Each wild card is mapped to a zero. Obtain two strings $T'$ and $P'$ where $t'_i = Q(t_i)$ and $p'_j = Q(p_j)$, Q being the mapping chosen in the current phase, and compute the vector C for this phase. After the r phases, let $C'[i]$ be the sum of the values $C[i]$ over all of the phases, and output $h_i = C'[i]/r$. Here, $h_i$ is an estimate on the Hamming distance $H_i$ between $T_i$ and P.
Analysis: Let x be a character in T, and let y be a character in P. Clearly, if x = y or if one or both of x and y are a wild card, the contribution of x and y to any $C'[i]$ is zero. If x and y are non-wild cards and if $x \ne y$, then the expected contribution of these to any $C[i]$ is 1. Across all of the r phases, the expected contribution of x and y to any $C'[i]$ is r. For a given x and y, we can think of each phase as a Bernoulli trial with equal probabilities for success and failure. A success refers to the possibility of $Q(x) \ne Q(y)$. The expected number of successes in r phases is $r/2$, and each success contributes 2. Using Chernoff bounds (Equation 2), this contribution is no more than $(1 + \epsilon)r$ with probability $\ge 1 - \exp(-\epsilon^2 r/6)$. The probability that this statement holds for every pair (x, y) is $\ge 1 - m^2 \exp(-\epsilon^2 r/6)$. This probability will be $\ge 1 - m^{-\alpha}$ when $r \ge 6(\alpha + 2) \ln m/\epsilon^2$. Similarly, we can show that for any pair of non-wild card characters, the contribution of them to any $C'[i]$ is no less than $(1 - \epsilon)r$ with high probability for r of the same order. Put together, for any pair (x, y) of non-wild cards with $x \ne y$, the contribution of x and y to any $C'[i]$ is in the interval $(1 \pm \epsilon)r$ with high probability when $r = O(\log m/\epsilon^2)$. Let $H_i$ be the Hamming distance between $T_i$ and P for some i ($1 \le i \le (n - m + 1)$). Then, the estimate $h_i$ on $H_i$ will be in the interval $(1 \pm \epsilon)H_i$ with probability $\ge (1 - m^{-\alpha})$. As a result, we get the following theorem.
Theorem 5. Given a text T and a pattern P, we can estimate the Hamming distance between $T_i$ and P, for every i, $1 \le i \le (n - m + 1)$, in $O(n \log^2 m/\epsilon^2)$ time. If $H_i$ is the Hamming distance between $T_i$ and P, then the above algorithm outputs an estimate that is in the interval $(1 \pm \epsilon)H_i$ with high probability.
Observation 1. In the above algorithm, we can ensure that $h_i \ge H_i$ and $h_i \le (1 + \epsilon)H_i$ with high probability by changing the estimate computed in Step 3 of Algorithm 4 to $\frac{C'[i]}{(1 - \epsilon)r}$.
Observation 2. As in [13], with $O\left(\frac{m^2 \log m}{\epsilon^2}\right)$ pre-processing, we can ensure that Algorithm 4 never errs (i.e., the error bounds on the estimates will always hold).

The 1 Mismatch Problem
Problem definition: For this problem, also, the inputs are two strings T and P with |T | = n, |P | = m, m ≤ n and possible wild cards in T and P .Let T i stand for the substring t i t i+1 . . .t i+m−1 , for any i, with 1 ≤ i ≤ (n − m + 1).The problem is to check if the Hamming distance between T i and P is exactly 1, for 1 ≤ i ≤ (n − m + 1).The following Lemma is shown in [17].
Lemma 1.The 1 mismatch problem can be solved in O(n log m) time using a constant number of convolution operations.
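The algorithm: Assume that each wild card in the pattern as well as the text is replaced with a zero. Furthermore, assume that the characters in the text, as well as the pattern, are integers in the range [1 : |Σ|], where Σ is the alphabet of concern. Let $e_{i,j}$ stand for the "error term" introduced by the character $t_{i+j-1}$ in $T_i$ and the character $p_j$ in P, and its value is $(t_{i+j-1} - p_j)^2\, t_{i+j-1}\, p_j$. Furthermore, let $E_i = \sum_{j=1}^{m} e_{i,j}$. There are four steps in the algorithm:
1. Compute $E_i$ for $1 \le i \le (n - m + 1)$. Note that $E_i$ will be zero if $T_i$ and P match (assuming that a wild card can be matched with any character). Expanding the square gives $E_i = \sum_{j=1}^{m} t_{i+j-1}^3 p_j - 2\sum_{j=1}^{m} t_{i+j-1}^2 p_j^2 + \sum_{j=1}^{m} t_{i+j-1} p_j^3$. Thus, this step can be completed with three convolution operations.
2. Compute $E'_i$ for $1 \le i \le (n - m + 1)$, where $E'_i = \sum_{j=1}^{m} (i + j - 1)\,(t_{i+j-1} - p_j)^2\, t_{i+j-1}\, p_j$. Like Step 1, this step can also be completed with three convolution operations.
3. Let $B_i = E'_i/E_i$ if $E_i \ne 0$, for $1 \le i \le (n - m + 1)$. Note that if the Hamming distance between $T_i$ and P is exactly one, then $B_i$ will give the position in the text where this mismatch occurs.
4. If for any i ($1 \le i \le (n - m + 1)$), $E_i \ne 0$ and $(t_{B_i} - p_{B_i - i + 1})^2\, t_{B_i}\, p_{B_i - i + 1} = E_i$, then conclude that the Hamming distance between $T_i$ and P is exactly one.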
Note: If the Hamming distance between $T_i$ and P is exactly 1 (for any i), then the above algorithm will not only detect it, but also will identify the position where there is a mismatch. Specifically, it will identify the integer j, such that $t_{i+j-1} \ne p_j$.
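The 1 mismatch test can be illustrated with a small Python sketch. It assumes the setting described above (characters encoded as positive integers, wild cards as 0) and evaluates the error sums directly in O(nm) time instead of with convolutions. The function name, the position-weighted sum `e_pos` and the returned dictionary are choices made for this example; alignments are 0-based, while the position weights are 1-based text positions.

```python
def one_mismatch(t, p):
    """Find alignments with exactly one mismatch.

    t and p are lists of non-negative integers, with 0 standing for a wild
    card. Returns {alignment: text position of the mismatch}, both 0-based.
    """
    n, m = len(t), len(p)
    result = {}
    for i in range(n - m + 1):
        # Error terms (t - p)^2 * t * p vanish on matches and on wild cards.
        terms = [(t[i + j] - p[j]) ** 2 * t[i + j] * p[j] for j in range(m)]
        e = sum(terms)
        if e == 0:
            continue  # the alignment matches exactly
        # Position-weighted sum, using 1-based text positions as weights.
        e_pos = sum((i + j + 1) * terms[j] for j in range(m))
        if e_pos % e != 0:
            continue  # no integer position, so more than one mismatch
        b = e_pos // e            # candidate 1-based text position
        j = b - 1 - i             # corresponding 0-based pattern position
        # Accept only if the single term at the candidate explains all of e.
        if 0 <= j < m and terms[j] == e:
            result[i] = b - 1
    return result
```

The final check cannot succeed when there are two or more mismatches, because every individual error term is then strictly smaller than their sum.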
In Step 1, we compute $E_i$, for $1 \le i \le 17$. For example, $E_1 = (5 - 2) \cdots$ Note that since $t_5$ is a wild card, it matches with any character in the pattern. Furthermore, $E_9 = 182$.
Run Algorithm 5 or Algorithm 6;
for $\ell \leftarrow 1$ to $w$ do
for $r \leftarrow 1$ to $ck$ ($c$ being a constant) do
Uniformly randomly choose $\frac{m}{2^{\ell+1} k}$ positions of the pattern;
Generate a string $P'$, such that $|P'| = |P|$ and $P'$ has the same characters as P in these randomly chosen positions and zero everywhere else;
Run the 1 mismatch algorithm on T and $P'$; as a result, if there is a single mismatch between $T_i$ and $P'$, then add the position of mismatch to $L(i)$ and reduce the value of $E_i$ by the right amount;
Proof. As shown in [17], the run time of Algorithm 5 is $O(nk \log m \log n)$, and that of Algorithm 6 is $O(n \log m(k + \log n \log\log n))$. The analysis will be done with respect to an arbitrary $T_i$. In particular, we will show that after the specified amount of time, with high probability, we will either know $q_i$ or realize that $q_i > k$. It will then follow that the same statement holds for every $T_i$ (for $1 \le i \le n - m + 1$).
Consider phase $\ell$ of Step 2 (for an arbitrary $1 \le \ell \le w$). Let $2^{\ell} k < q_i \le 2^{\ell+1} k$ for some i. Using the fact that $\binom{a}{b} \approx \left(\frac{ae}{b}\right)^b$, the probability of isolating one of the mismatches in one run of the sampling step is bounded below by a constant. As a result, using Chernoff bounds (Equation (3) with δ = 1/2, for example), it follows that if 13ke sampling steps are made in phase $\ell$, then at least 6k of these steps will result in the isolation of single mismatches (not all of them need be distinct) with high probability (assuming that k = Ω(log n)). Moreover, we can see that at least 1.1k of these mismatches will be distinct. This is because the probability that ≤ 1.1k of these are distinct is $\le \binom{q_i}{1.1k} \left(\frac{1.1k}{q_i}\right)^{6k} \le 2^{-2.64k}$, using the fact that $q_i \ge 2k$. This probability will be very low when k = Ω(log n).
In the above analysis, we have assumed that k = Ω(log n). If this is not the case, in any phase of Step 2, we can do cα log n sampling steps, for some suitable constant c. In this case as well, we can perform an analysis similar to that of the above case using Chernoff bounds. Specifically, we can show that with high probability, we will be able to identify all of the mismatches between T_i and P. As a result, each phase of Step 2 takes O(n log m(k + log n)) time. We have O(log m) phases. Thus, the run time of Step 2 is O(n log^2 m(k + log n)). Furthermore, the probability that the condition in Step 3 holds is very high.
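One sampling step of a phase of Step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_masked_pattern`, the 0-based positions, the integer encoding of characters, and the sample size m/(2^{ℓ+1}k) (rounded down, at least 1) are assumptions made for the example. The value 0 is assumed to be reserved, so that no real character is encoded as 0.

```python
import random

def sample_masked_pattern(pattern, ell, k, rng=random):
    """Return (positions, masked) for one sampling step of phase `ell`.

    `pattern` is a list of non-zero integer character codes (assumption).
    `masked` equals `pattern` on the sampled positions and 0 elsewhere.
    """
    m = len(pattern)
    # Assumed sample size: m / (2^(ell+1) * k), rounded down, at least 1.
    count = max(1, m // (2 ** (ell + 1) * k))
    positions = sorted(rng.sample(range(m), count))
    masked = [0] * m
    for j in positions:
        masked[j] = pattern[j]
    return positions, masked
```

Each masked pattern would then be passed to the 1 mismatch algorithm. With between 2^ℓ k and 2^{ℓ+1} k mismatches in an alignment, a sample of this size is expected to contain roughly one of them.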
Therefore, the run time of the entire algorithm is O(nk log^2 m + n log^2 m log n + n log m log n log log n).
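The overall bound can be assembled from the two step costs quoted in the proof. The sketch below assumes that Step 1 uses Algorithm 6 and that the remaining work is dominated by Steps 1 and 2:

```latex
\begin{align*}
\underbrace{O\big(n\log m\,(k+\log n\log\log n)\big)}_{\text{Step 1}}
 \;+\; \underbrace{O\big(n\log^{2} m\,(k+\log n)\big)}_{\text{Step 2}}
 &= O\big(nk\log^{2} m + n\log^{2} m\log n + n\log m\log n\log\log n\big).
\end{align*}
```

The nk log m term of Step 1 is absorbed by the nk log^2 m term of Step 2.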

Results
The above algorithms are based on symbol comparison, arithmetic operations or a combination of both. Therefore, it is interesting to see how these algorithms compare in practice.
In this section, we compare deterministic algorithms for pattern matching. Some of these algorithms solve the pattern matching with mismatches problem, and others solve the k mismatches problem. For the sake of comparison, we treated all of them as algorithms for the k mismatches problem, which is a special case of the pattern matching with mismatches problem.
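To illustrate how an algorithm for pattern matching with mismatches can be used for the k mismatches problem, the sketch below computes all Hamming distances with the naive O(nm) method and then thresholds them. The function names and the 0-based positions are assumptions for this example, and the threshold is taken as "at most k", following the definition of subset k mismatches.

```python
def hamming_distances(text, pattern):
    """Naive O(nm) computation of the distance at every alignment."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]

def k_mismatch_positions(text, pattern, k):
    """Positions (0-based) where the pattern matches with at most k mismatches."""
    distances = hamming_distances(text, pattern)
    return [i for i, d in enumerate(distances) if d <= k]
```

For example, `k_mismatch_positions("abcabd", "abd", 1)` reports alignments 0 and 3, whose distances are 1 and 0.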
We implemented the following algorithms: the naive O(nm) time algorithm, Abrahamson's algorithm [10], subset k mismatches (Section 2.2.2) and knapsack k mismatches (Section 2.2.3). For subset k mismatches, we simulate the suffix tree and LCA extensions by a suffix array with an LCP (longest common prefix; [25]) table and data structures to answer range minimum queries (RMQs; [26]) on it. This adds an O(log n) factor to preprocessing. For searching in the suffix array, we use a simple forward traversal with a cost of O(log n) per character. The traversal uses binary search to find the interval of suffixes that start with the first character of the pattern. Then, another binary search is performed to find the suffixes that start with the first two characters of the pattern, and so on. However, more efficient implementations are possible (e.g., [27]). For subset k mismatches, we also tried a simple O(m^2) time pre-processing using dynamic programming to precompute LCPs and hashing to quickly determine whether a portion of the text is present in the pattern. This method takes more preprocessing time, but it does not have the O(log n) factor when searching. Knapsack k mismatches uses subset k mismatches as a subroutine, so we have two versions of it, as well.
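The O(m^2) dynamic programming pre-processing for LCPs can be sketched as follows. This is one standard way to fill such a table and is not necessarily the authors' exact implementation. The function name `lcp_table` and the 0-based indexing are assumptions, and this simple version also uses O(m^2) memory.

```python
def lcp_table(pattern):
    """lcp[i][j] = length of the longest common prefix of the suffixes
    of `pattern` that start at positions i and j (0-based)."""
    m = len(pattern)
    # One extra row and column of zeros stand for the empty suffix.
    lcp = [[0] * (m + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pattern[i] == pattern[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1
    return lcp
```

Once the table is filled, each LCP query is a single array lookup, which is why the search phase avoids the O(log n) factor of the suffix array traversal.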
We tested the algorithms on protein, DNA and English inputs generated randomly. We randomly selected a substring of length m from the text and used it as the pattern. The algorithms were tested on an Intel Core i7 machine with 8 GB of RAM, Linux Mint 17.1 Operating System and gcc 4.8.2. All convolutions were performed using the fftw [28] library Version 3.3.3. We used the suffix array algorithm RadixSA of [29].
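A test instance of this kind could be generated along the following lines. This is only a sketch: the concrete letters in `ALPHABETS`, the uniform character distribution, the seed handling and the function name `random_instance` are assumptions, chosen to match the alphabet sizes of 4, 20 and 26.

```python
import random

# Assumed alphabets with 4, 20 and 26 symbols.
ALPHABETS = {
    "dna": "ACGT",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
    "english": "abcdefghijklmnopqrstuvwxyz",
}

def random_instance(kind, n, m, seed=0):
    """Return a random text of length n and a pattern of length m
    taken from a random position of that text."""
    rng = random.Random(seed)
    sigma = ALPHABETS[kind]
    text = "".join(rng.choice(sigma) for _ in range(n))
    start = rng.randrange(n - m + 1)
    return text, text[start:start + m]
```

Because the pattern is copied from the text, every instance contains at least one alignment with zero mismatches.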
Figure 1 shows run times for varying the length of the text n. All algorithms scale linearly with the length of the text. Figure 2 shows run times for varying the length of the pattern m. Abrahamson's algorithm is expensive, because, for alphabet sizes smaller than m/log m, it computes one convolution for every character in the alphabet. The convolutions proved to be expensive in practice, so Abrahamson's algorithm was competitive only for DNA data, where the alphabet is small. Figure 3 shows run times for varying the maximum number of mismatches k allowed. The naive algorithm and Abrahamson's algorithm do not depend on k; therefore, their run time is constant. Subset k mismatch, with its O(nk) run time, is competitive for relatively small k. Knapsack k mismatch, on the other hand, scaled very well with k. Figure 4 shows run times for varying the alphabet from four (DNA) to 20 (protein) to 26 (English). As expected, Abrahamson's algorithm is the most sensitive to the alphabet size.

Overall, the naive algorithm performed well in practice, most likely due to its simplicity and cache locality. Abrahamson's algorithm was competitive only for small alphabet size or for large k. Subset k mismatches performed well for relatively small k. In most cases, the suffix array version was slower than the hashing-based one with O(m^2) time pre-processing because of the added O(log n) factor when searching in the suffix array. It would be interesting to investigate how the algorithms compare with a more efficient implementation of the suffix array. Knapsack k mismatches was the fastest among the algorithms compared, because in most cases, the knapsack could be filled with less than the given "budget", and thus, the algorithm did not have to perform any convolution operations.
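The dependence on alphabet size comes from counting matches one symbol at a time. The sketch below shows only this per-symbol decomposition and is not Abrahamson's full algorithm. It uses a direct correlation where an implementation would use an FFT-based convolution, it ignores wild cards, and the function name `mismatches_by_symbol` is an assumption for the example.

```python
def mismatches_by_symbol(text, pattern):
    """Hamming distance at every alignment, computed one symbol at a time."""
    n, m = len(text), len(pattern)
    matches = [0] * (n - m + 1)
    for c in set(pattern):
        # Indicator vectors: 1 where the symbol c occurs, 0 elsewhere.
        t_ind = [1 if x == c else 0 for x in text]
        p_ind = [1 if x == c else 0 for x in pattern]
        # Direct correlation; an FFT would produce the same sums
        # in O(n log m) time per symbol.
        for i in range(n - m + 1):
            matches[i] += sum(t_ind[i + j] * p_ind[j] for j in range(m))
    # Every position that is not a match is a mismatch.
    return [m - x for x in matches]
```

One correlation is needed per distinct pattern symbol, so the cost of this approach grows with the alphabet from DNA to protein to English.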

Conclusions
We have introduced several deterministic and randomized, exact and approximate algorithms for the pattern matching with mismatches and k mismatches problems, with or without wild cards. These algorithms improve the run time of, simplify or extend previous algorithms to handle wild cards. We have also implemented the deterministic algorithms. An empirical comparison of these algorithms showed that the algorithms based on character comparison outperform those based on convolutions.

Problem 1. Subset k mismatches: Given a text T of length n, a pattern P of length m, a set of positions S ⊆ {i | 1 ≤ i ≤ n − m + 1} and an integer k, output the positions i ∈ S for which Hd(P, T_i) ≤ k. The subset k mismatches problem becomes the regular k mismatches problem if |S| = n − m + 1.

Theorem 2. Subset k mismatches can be solved in O(n + m + |S|k) time using only O(m) additional memory.

Algorithm 2: Subset k mismatches(S, T, P, k)
input: S - set of positions in the text; T_{1..n} - text; P_{1..m} - pattern; k - max number of mismatches;
output: M - the positions in S for which the pattern matches the text with at most k mismatches;
begin
    Assume we have a suffix tree/array of the pattern;
    for a ∈ S do M[a] = 0;
    i = 1;
    while i ≤ n do
        Find the largest l, such that T_{i..i+l−1} describes a path in the suffix tree;

number of mismatches; s_1, s_2 - starting positions of two suffixes of the pattern; l - a maximum length;
output: compare the two suffixes on their first l positions and add to c the number of mismatches found; if c exceeds k, return k + 1, otherwise return the updated c;
begin
    while l > 0 and c ≤ k do
        d = lcp(s_1, s_2); // longest common prefix
        if d ≥ l then return c;
        c = c + 1;
        d = d + 1;
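The suffix comparison routine can be sketched in runnable form as follows. The listing above breaks off after `d = d + 1`, so the updates of s_1, s_2 and l below are an assumption about how the loop continues. The names `lcp_naive` and `count_mismatches` are hypothetical, positions are 0-based, and the naive LCP stands in for the constant-time LCP query that a suffix array with RMQ, or a precomputed table, would provide. The caller is assumed to guarantee that both suffixes have at least l characters.

```python
def lcp_naive(pattern, s1, s2):
    """Longest common prefix of two suffixes (placeholder for an O(1) query)."""
    d = 0
    m = len(pattern)
    while s1 + d < m and s2 + d < m and pattern[s1 + d] == pattern[s2 + d]:
        d += 1
    return d

def count_mismatches(pattern, s1, s2, l, c, k):
    """Add to c the mismatches between the first l positions of the suffixes
    starting at s1 and s2; return k + 1 as soon as c exceeds k."""
    while l > 0 and c <= k:
        d = lcp_naive(pattern, s1, s2)
        if d >= l:
            return c          # the rest of the window matches
        c += 1                # position d is a mismatch
        d += 1                # skip past the mismatching position
        # Assumed continuation: advance both suffixes and shrink the window.
        s1 += d
        s2 += d
        l -= d
    return k + 1 if c > k else c
```

Each loop iteration either finishes the comparison or consumes one mismatch, so a call performs at most k + 1 LCP queries. This is what yields the O(k) cost per alignment.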

g < 2k ,
the run time for this stage is O(n √ k log m), and the total run time is O(An log m + n √ k log m). Knapsack k mismatches can always include in the knapsack all of the instances of non-frequent symbols, since their total cost is no more than O(n √ k), and in the worst case, do convolutions for the remaining characters. The total run time is O(An log m + n √ k). Of course, depending on the frequency of the characters in the pattern and text, knapsack k mismatch may not have to do any convolutions.

Figure 2. Run times for pattern matching on DNA, protein and English alphabet data, when the length of the pattern (m) varies. The length of the text is n = 10 million. The maximum number of mismatches allowed is k = 10% of the pattern length. Our algorithms are subset k mismatch with hashing-based preprocessing (Section 2.2.2), subset k mismatch with suffix array preprocessing (Section 2.2.2), knapsack k mismatch with hashing-based preprocessing (Section 2.2.3) and knapsack k mismatch with suffix array preprocessing (Section 2.2.3).

Figure 3. Run times for pattern matching on DNA, protein and English alphabet data, when the maximum number of mismatches allowed (k) varies. The length of the text is n = 10 million. The length of the pattern is m = 2000. Our algorithms are subset k mismatch with hashing-based preprocessing (Section 2.2.2), subset k mismatch with suffix array preprocessing (Section 2.2.2), knapsack k mismatch with hashing-based preprocessing (Section 2.2.3) and knapsack k mismatch with suffix array preprocessing (Section 2.2.3).

Figure 4. Run times for pattern matching when the size of the alphabet varies from four (DNA) to 20 (protein) to 26 (English). The length of the text is n = 10 million. The length of the pattern is m = 200 in the first graph and m = 1000 in the second. The maximum number of mismatches allowed is k = 20 in the first graph and k = 100 in the second. Our algorithms are subset k mismatch with hashing-based preprocessing (Section 2.2.2), subset k mismatch with suffix array preprocessing (Section 2.2.2), knapsack k mismatch with hashing-based preprocessing (Section 2.2.3) and knapsack k mismatch with suffix array preprocessing (Section 2.2.3).

• An algorithm that tests if the Hamming distance is less than k for a subset of the alignments, without wild cards, at a cost of O(k) time per alignment, using only O(m) additional memory; see Section 2.2.2.
• A Las Vegas algorithm for the k mismatches problem with wild cards that runs in time O(nk log^2 m + n log^2 m log n + n log m log n log log n) with high probability; see Section 2.4.3.
Theorem 3. Knapsack k mismatches has worst case run time O(n √ k log k).
input: T_{1..n} - text; P_{1..m} - pattern; k - max number of mismatches;
output: S - set of positions in the text where the pattern matches with at most k mismatches;
begin
    Compute F_i and f_i for every i ∈ Σ;
    Sort Σ with respect to F

Figure 1. Run times for pattern matching on DNA, protein and English alphabet data, when the length of the text (n) varies. The length of the pattern is m = 1000. The maximum number of mismatches allowed is k = 100. Our algorithms are subset k mismatch with hashing-based preprocessing (Section 2.2.2), subset k mismatch with suffix array preprocessing (Section 2.2.2), knapsack k mismatch with hashing-based preprocessing (Section 2.2.3) and knapsack k mismatch with suffix array preprocessing (Section 2.2.3).