2.2.1. Pattern Matching with Mismatches
For pattern matching with mismatches, without wild cards, Abrahamson [10] gave the following $O(n\sqrt{m \log m})$ time algorithm. Let $A$ be a set of the $\sqrt{m/\log m}$ most frequent characters in the pattern: (1) using convolutions, count how many matches each character in $A$ contributes to every alignment; (2) using marking, count how many matches each character not in $A$ contributes to every alignment; and (3) add the two numbers to find, for every alignment, the number of matches between the pattern and the text. The convolutions take $O(|A| n \log m) = O(n\sqrt{m \log m})$ time. A character not in $A$ cannot appear more than $\sqrt{m \log m}$ times in the pattern; otherwise, each character in $A$ would have a frequency greater than $\sqrt{m \log m}$, which is not possible, because the $\sqrt{m/\log m}$ characters of $A$ alone would then occupy more than $m$ positions. Thus, the run time for marking is $O(n\sqrt{m \log m})$. If we equate the two run times, we find the optimal $|A| = \sqrt{m/\log m}$, which gives a total run time of $O(n\sqrt{m \log m})$.
An Example: Consider a text $T = t_0 t_1 \ldots t_{n-1}$ and a pattern $P$ over the alphabet $\{1, 2, 3, 4\}$. Since each character in the pattern occurs an equal number of times, we can pick $A$ arbitrarily. Let $A = \{1, 2\}$. In Step 1, convolution is used to count the number of matches contributed by each character in $A$. We obtain an array $C_1$, such that $C_1[i]$ is the number of matches contributed by characters in $A$ to the alignment of $P$ with $t_i t_{i+1} \ldots t_{i+m-1}$, for $0 \le i \le n - m$. In Step 2, we compute, using marking, the number of matches contributed by the characters 3 and 4 to each alignment between $T$ and $P$. We get another array $C_2$, such that $C_2[i]$ is the number of matches contributed by 3 and 4 to the alignment between $t_i t_{i+1} \ldots t_{i+m-1}$ and $P$. In Step 3, we add $C_1[i]$ and $C_2[i]$ to get the number of matches between $t_i t_{i+1} \ldots t_{i+m-1}$ and $P$, for $0 \le i \le n - m$.
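The convolution-plus-marking scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `match_counts` is ours, and we use `np.convolve` for clarity, whereas an FFT-based convolution would be needed to meet the stated $O(n \log m)$ per-character bound.

```python
import math
from collections import Counter, defaultdict

import numpy as np

def match_counts(text, pattern):
    """For every alignment i, count matches between pattern and text[i:i+m],
    splitting the work between convolutions (frequent chars) and marking."""
    n, m = len(text), len(pattern)
    counts = np.zeros(n - m + 1, dtype=np.int64)

    # A = the most frequent pattern characters (about sqrt(m / log m) of them).
    freq = Counter(pattern)
    a_size = max(1, int(math.sqrt(m / max(1.0, math.log2(m)))))
    frequent = set(c for c, _ in freq.most_common(a_size))

    # Step 1: one convolution per character in A (FFT-based in the real algorithm).
    for c in frequent:
        t_ind = np.array([1.0 if x == c else 0.0 for x in text])
        p_ind = np.array([1.0 if x == c else 0.0 for x in pattern])
        # correlation of t_ind with p_ind = convolution with the reversed pattern
        conv = np.convolve(t_ind, p_ind[::-1])
        counts += np.rint(conv[m - 1:n]).astype(np.int64)

    # Step 2: marking for the infrequent characters.
    positions = defaultdict(list)          # char -> its positions in the pattern
    for j, c in enumerate(pattern):
        if c not in frequent:
            positions[c].append(j)
    for i, c in enumerate(text):
        for j in positions.get(c, ()):     # text[i] matches pattern[j] ...
            if 0 <= i - j <= n - m:
                counts[i - j] += 1         # ... contributing to alignment i - j
    return counts.tolist()
```

For example, `match_counts("12341234", "1234")` reports four matches at alignments 0 and 4 and none elsewhere.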
For pattern matching with mismatches and wild cards, a fairly complex algorithm is given in [11]. The run time of this algorithm is $O(n\sqrt{g \log m})$, where $g$ is the number of non-wild card positions in the pattern. The problem can also be solved through a simple modification of Abrahamson's algorithm, in $O(n\sqrt{m \log m})$ time, as pointed out in [17]. We now prove the following result:
Theorem 1. Pattern matching with mismatches and wild cards can be solved in $O(n\sqrt{g \log m})$ time, where $g$ is the number of non-wild card positions in the pattern.
Ignoring the wild cards for now, let $A$ be the set of the $\sqrt{g/\log m}$ most frequent characters in the pattern. As above, count the matches contributed by characters in $A$ and not in $A$ using convolution and marking, respectively. By a similar reasoning as above, the characters used in the marking phase will not appear more than $\sqrt{g \log m}$ times in the pattern. If we equate the run times of the two phases, we obtain $O(n\sqrt{g \log m})$ time. We are now left to count how many matches are contributed by the wild cards. For a string $S$ and a character $\alpha$, define $S_\alpha$ as the binary string with $S_\alpha[i] = 1$ if $S[i] = \alpha$ and $S_\alpha[i] = 0$ otherwise. Let $w$ be the wild card character. Compute the correlation $C[i] = \sum_{j=0}^{m-1} T_w[i+j] P_w[j]$ by one convolution. Then, for every alignment $i$, the number of positions that have a wild card either in the text, or the pattern, or both, is, by inclusion-exclusion, $W[i] = \sum_{j=0}^{m-1} T_w[i+j] + \sum_{j=0}^{m-1} P_w[j] - C[i]$, where the first sum is obtained from the prefix sums of $T_w$ and the second is a constant. Add $W[i]$ to the previously-computed counts and output. The total run time is $O(n\sqrt{g \log m})$.
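The count of positions holding a wild card in the text window, in the pattern, or in both can be computed by inclusion-exclusion with two convolutions, as sketched below. The function name and the use of `np.convolve` (rather than a hand-rolled FFT) are our own choices for illustration.

```python
import numpy as np

def wildcard_overlap_counts(text, pattern, wild="*"):
    """For every alignment i, count positions that hold a wild card in the
    text window, in the pattern, or in both (inclusion-exclusion)."""
    n, m = len(text), len(pattern)
    t_w = np.array([1.0 if c == wild else 0.0 for c in text])
    p_w = np.array([1.0 if c == wild else 0.0 for c in pattern])
    both = np.convolve(t_w, p_w[::-1])[m - 1:n]       # wild card in both
    in_text = np.convolve(t_w, np.ones(m))[m - 1:n]   # wild cards in text window
    in_pat = p_w.sum()                                # wild cards in pattern
    return [int(round(in_text[i] + in_pat - both[i])) for i in range(n - m + 1)]
```

For instance, with `text = "a*c"` and `pattern = "*b"`, alignment 0 has two wild card positions (one in each string) and alignment 1 has one (the position where both strings hold a wild card counts once).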
2.2.2. Pattern Matching with K Mismatches
For the $k$ mismatches problem, without wild cards, an $O(nk)$ time algorithm that requires $O(n)$ additional space is presented in [19]. Another algorithm, which takes $O(nk)$ time and uses only $O(m)$ additional space, is presented in [20]. We define the following problem, which is of interest in the discussion.
Subset k mismatches: Given a text $T$ of length $n$, a pattern $P$ of length $m$, a set $S$ of positions in the text and an integer $k$, output the positions $i \in S$ for which $P$ matches $t_i t_{i+1} \ldots t_{i+m-1}$ with at most $k$ mismatches.
The subset $k$ mismatches problem becomes the regular $k$ mismatches problem if $S = \{0, 1, \ldots, n-m\}$. Thus, it can be solved by the $O(nk)$ algorithms mentioned above. However, if $|S|$ is small, then the $O(nk)$ algorithms are too costly. A better alternative is to use the Kangaroo method proposed in [11]. The Kangaroo method can verify whether alignment $i$ has at most $k$ mismatches in $O(k)$ time, for any $i$.
The method works as follows. Build a suffix tree of the concatenation of the text and the pattern and enhance it to support constant-time lowest common ancestor (LCA) queries. For a given $i$, perform an LCA query to find the position of the first mismatch between $P$ and $t_i t_{i+1} \ldots t_{i+m-1}$. Let this position be $j$. Then, perform another LCA query to find the first mismatch between $p_{j+1} \ldots p_{m-1}$ and $t_{i+j+1} \ldots t_{i+m-1}$, which is the second mismatch of alignment $i$. Continue to "jump" from one mismatch to the next, until the end of the pattern is reached or we have found more than $k$ mismatches. The Kangaroo method can process all of the alignments in $S$ in $O(n + m + |S|k)$ time, and it uses $O(n + m)$ additional memory for the LCA-enhanced suffix tree. We now prove the following result:
Theorem 2. Subset k mismatches can be solved in $O(n + m + |S|k)$ time, using only $O(m)$ additional memory.
The algorithm is the following. Build an LCA-enhanced suffix tree of the pattern. Scan the text from left to right: (1) find the longest prefix of the unscanned portion of the text that occurs somewhere in the pattern, say starting at position $i$ of the pattern; call this region of the text $R$; therefore, $R$ is identical to $p_i \ldots p_{i+|R|-1}$; and (2) for every alignment in $S$ that overlaps $R$, count the number of mismatches between $R$ and the alignment, within the overlap region. To do this, consider an alignment in $S$ that overlaps $R$, such that the beginning of $R$ aligns with the $j$-th character in the pattern. We want to count the number of mismatches between $R$ and $p_j \ldots p_{j+|R|-1}$. However, since $R$ is identical to $p_i \ldots p_{i+|R|-1}$, we can simply compare $p_i \ldots p_{i+|R|-1}$ and $p_j \ldots p_{j+|R|-1}$. This comparison can be done efficiently by jumping from one mismatch to the next, as in the Kangaroo method, using LCA queries on the suffix tree of the pattern. Repeat from Step 1 until the entire text has been scanned. Every time we process an alignment in Step 2, we either discover at least one additional mismatch or we reach the end of the alignment. This is true because, otherwise, the alignment would match the text for more than $|R|$ characters, which is not possible, from the way we defined $R$. Every alignment for which we have found more than $k$ mismatches is excluded from further consideration, to ensure $O(k)$ time per alignment. It takes $O(m)$ time to build the LCA-enhanced suffix tree of the pattern and $O(n)$ additional time to scan the text from left to right. Thus, the total run time is $O(n + m + |S|k)$, with $O(m)$ additional memory. The pseudocode is given in Algorithm 2.
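The mismatch-jumping idea can be illustrated in Python. This is a hedged sketch, not the paper's method: instead of an LCA-enhanced suffix tree with $O(1)$ longest-common-extension queries, it binary-searches over polynomial hashes (so each jump costs $O(\log m)$ rather than $O(1)$), and the helper names `LCE`, `kangaroo_count` and `subset_k_mismatches` are our own.

```python
class LCE:
    """Prefix hashes of a string; a stand-in for the O(1) LCA queries on a
    suffix tree that the Kangaroo method uses for longest common extensions."""
    MOD, BASE = (1 << 61) - 1, 131

    def __init__(self, s):
        self.h, self.p = [0], [1]
        for c in s:
            self.h.append((self.h[-1] * self.BASE + ord(c)) % self.MOD)
            self.p.append(self.p[-1] * self.BASE % self.MOD)

    def hash(self, i, j):                  # hash of s[i:j]
        return (self.h[j] - self.h[i] * self.p[j - i]) % self.MOD

def kangaroo_count(ht, hp, i, m, k):
    """Mismatches between text[i:i+m] and the pattern, jumping from one
    mismatch to the next and stopping once more than k are found."""
    mism, j = 0, 0
    while j < m and mism <= k:
        lo, hi = 0, m - j                  # longest match starting at offset j
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if ht.hash(i + j, i + j + mid) == hp.hash(j, j + mid):
                lo = mid
            else:
                hi = mid - 1
        j += lo
        if j < m:                          # text[i+j] != pattern[j]
            mism += 1
            j += 1
    return mism

def subset_k_mismatches(text, pattern, S, k):
    ht, hp = LCE(text), LCE(pattern)
    m = len(pattern)
    return [i for i in S if kangaroo_count(ht, hp, i, m, k) <= k]
```

For example, `subset_k_mismatches("abcdabcd", "abcd", [0, 1, 4], 1)` keeps alignments 0 and 4 and discards alignment 1 as soon as its second mismatch is found.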
2.2.3. An $O(n\sqrt{k \log k})$ Time Algorithm for K Mismatches
For the $k$ mismatches problem, without wild cards, a fairly complex $O(n\sqrt{k \log k})$ time algorithm is given in [11]. The algorithm classifies the inputs into several cases. For each case, it applies a combination of marking followed by a filtering step, the Kangaroo method, or convolutions. The goal is to not exceed $O(n\sqrt{k \log k})$ time in any of the cases. We now present an algorithm with only two cases that has the same worst case run time. The new algorithm can be thought of as a generalization of the algorithm in [11], as we will discuss later. This generalization not only greatly simplifies the algorithm, but it also reduces the expected run time, because we use information about the frequency of the characters in the text and try to minimize the work done by convolutions and marking.
Algorithm 2: Subset k mismatches()
We will now give the intuition for this algorithm. For any character $\alpha$, let $f_P(\alpha)$ be its frequency in the pattern and $f_T(\alpha)$ be its frequency in the text. Note that, in the marking algorithm, a specific character $\alpha$ contributes to the runtime a cost of $f_P(\alpha) \cdot f_T(\alpha)$. On the other hand, in the case of convolution, a character $\alpha$ costs us one convolution, regardless of how frequent $\alpha$ is in the text or the pattern. Therefore, we want to use infrequent characters for marking and frequent characters for convolution. The balancing of the two will give us the desired runtime.
A position $j$ in the pattern where $p_j = \alpha$ is called an instance of $\alpha$. Consider every instance of character $\alpha$ as an object of size 1 and cost $f_T(\alpha)$. We want to fill a knapsack of size $2k$ at a minimum cost and without exceeding a given budget $B$. The instances will allow us to filter out some of the alignments with more than $k$ mismatches, as will become clear later. This problem can be optimally solved by a greedy approach, where we include in the knapsack all of the instances of the least expensive character, then all of the instances of the second least expensive character, and so on, until we have $2k$ items or we have exceeded the budget $B$. The last character considered may have only a subset of its instances included, but for the ease of explanation, assume that there are no such characters.
Note: Even though the above is described as a knapsack problem, the particular formulation can be optimally solved in linear time. This formulation should not be confused with other formulations of the knapsack problem that are NP-complete.
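The greedy filling can be sketched as follows. This is a minimal illustration; the function name `fill_knapsack` and its return convention are our own, and the sketch picks instances one by one (so a character's instances may be partially included, which the greedy handles naturally).

```python
from collections import Counter

def fill_knapsack(text, pattern, k, budget):
    """Greedily pick up to 2k pattern positions ("instances"), cheapest text
    frequency first, stopping if the budget would be exceeded.
    Returns (chosen positions, total cost, whether the knapsack was filled)."""
    f_text = Counter(text)
    # order instances by the cost f_T(alpha) of their character
    order = sorted(range(len(pattern)), key=lambda j: f_text[pattern[j]])
    chosen, cost = [], 0
    for j in order:
        if len(chosen) == 2 * k:
            break
        c = f_text[pattern[j]]
        if cost + c > budget:              # budget B exceeded: stop filling
            break
        chosen.append(j)
        cost += c
    return chosen, cost, len(chosen) == 2 * k
```

For example, with `text = "aaaab"` and `pattern = "aabb"`, the two instances of `b` (cost 1 each) are chosen before any instance of `a` (cost 4 each).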
Case (1): Assume we can fill the knapsack at a cost $C \le B$. We apply the marking algorithm for the characters whose instances are included in the knapsack. It is easy to see that the marking takes $O(n + m + C)$ time and creates $C$ marks. For alignment $i$, if the pattern and the text match for all of the positions in the knapsack, we will obtain exactly $2k$ marks at position $i$. Conversely, any position that has fewer than $k$ marks must have more than $k$ mismatches, so we can filter it out. Therefore, there will be at most $C/k$ positions with $k$ marks or more. For such positions, we run subset $k$ mismatches to confirm which of them have at most $k$ mismatches. The total runtime of the algorithm in this case is $O(n + m + C)$.
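The marking-and-filtering step of Case (1) can be sketched as follows (a minimal illustration with our own function name; the verification of the surviving positions by subset $k$ mismatches is omitted).

```python
from collections import defaultdict

def mark_and_filter(text, pattern, instances, k):
    """Marking for a chosen set of pattern positions ("instances"): each text
    occurrence of pattern[j] puts a mark on the alignments it would support;
    alignments with fewer than k marks are filtered out."""
    n, m = len(text), len(pattern)
    by_char = defaultdict(list)
    for j in instances:
        by_char[pattern[j]].append(j)
    marks = defaultdict(int)
    for i, c in enumerate(text):
        for j in by_char.get(c, ()):
            if 0 <= i - j <= n - m:
                marks[i - j] += 1
    # with 2k chosen instances, fewer than k marks implies more than k mismatches
    return sorted(i for i in range(n - m + 1) if marks[i] >= k)
```

For example, `mark_and_filter("abab", "ab", [0, 1], 1)` keeps alignments 0 and 2 (two marks each) and filters out alignment 1 (no marks).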
Case (2): If we cannot fill the knapsack within the given budget B, we do the following: for the characters we could fit in the knapsack, we use the marking algorithm to count the number of matches that they contribute to each alignment. For characters not in the knapsack, we use convolutions to count the number of matches that they contribute to each alignment. We add the two counts and get the exact number of matches for every alignment.
Note that at least one of the instances in the knapsack has a cost larger than $B/(2k)$ (if all of the instances had a cost less than or equal to $B/(2k)$, then we could fit $2k$ instances in the knapsack within the budget). Furthermore, note that all of the instances not in the knapsack have a cost at least as high as any instance in the knapsack, because we greedily fill the knapsack starting with the least costly instances. This means that every character not in the knapsack appears in the text at least $B/(2k)$ times; therefore, the number of characters not in the knapsack does not exceed $2kn/B$. Therefore, the total cost of convolutions is $O((2kn/B) \cdot n \log m)$. Since the cost of marking was $O(B)$, we can see that the best value of $B$ is the one that equalizes the two costs. This gives $B = n\sqrt{2k \log m}$. Therefore, the algorithm takes $O(n\sqrt{k \log m})$ time. If $k = O(m^{1/3})$, we can employ a different algorithm that solves the problem in linear time, as in [11]. For larger $k$, we have $m = O(k^3)$, so $\log m = O(\log k)$ and the run time becomes $O(n\sqrt{k \log k})$. We call this algorithm knapsack $k$ mismatches. The pseudocode is given in Algorithm 3. The following theorem results.
Theorem 3. Knapsack k mismatches has worst case run time $O(n\sqrt{k \log k})$.
Algorithm 3: Knapsack k mismatches()
We can think of the algorithm in [11] as a special case of our algorithm where, instead of trying to minimize the cost of the $2k$ items in the knapsack, we just try to find $2k$ items for which the cost is less than the budget $B$. As a result, it is easy to verify the following:
Theorem 4. Knapsack k mismatches spends at most as much time as the algorithm in [11] to do convolutions and marking.
Observation: In all of the cases presented below, knapsack k mismatches can have a run time as low as $O(n + m)$, for example, if there exists one character $\alpha$ with $f_P(\alpha) \ge 2k$ and $f_T(\alpha) = O(n/k)$.
Case 1: The pattern contains at least $2k$ distinct characters. The algorithm in [11] uses $2k$ instances of distinct characters to perform marking. Therefore, for every position of the text, at most one mark is created. If the number of marks is $M$, then the cost of the marking phase is $O(M)$. The number of remaining positions after filtering is no more than $M/k$, and thus, the algorithm takes $O(n + m + M)$ time. Our algorithm puts in the knapsack $2k$ instances, of not necessarily different characters, such that the number of marks $B$ is minimized. Therefore, $B \le M$, and the total runtime is $O(n + m + B)$.
Case 2: The pattern contains fewer than $\sqrt{k/\log k}$ distinct characters. The algorithm in [11] performs one convolution per character to count the total number of matches for every alignment, for a run time of $O(n\sqrt{k \log k})$. In the worst case, knapsack $k$ mismatches cannot fill the knapsack at a cost $C \le B$, so it defaults to the same run time. However, in the best case, the knapsack can be filled at a cost $B$ as low as $O(k)$, depending on the frequency of the characters in the pattern and the text. In this case, the runtime will be $O(n + m + B)$.
Case 3: The number of distinct characters in the pattern is between $\sqrt{k/\log k}$ and $2k$. A symbol that appears in the pattern at least $\sqrt{k \log k}$ times is called frequent.
Case 3.1: There are at least $2\sqrt{k/\log k}$ frequent symbols. The algorithm in [11] uses $2k$ instances of frequent symbols to do marking and filtering, at a cost of $O(n\sqrt{k \log k})$. Since knapsack $k$ mismatches will minimize the marking time $B$, we have $B = O(n\sqrt{k \log k})$, so the run time is the same as for [11] only in the worst case.
Case 3.2: There are fewer than $2\sqrt{k/\log k}$ frequent symbols. The algorithm in [11] first performs one convolution for each frequent character, for a run time of $O(n\sqrt{k \log k})$. Two cases remain:
Case 3.2.1: All of the instances of the non-frequent symbols number less than $2k$ positions. The algorithm in [11] replaces all instances of frequent characters with wild cards and applies an $O(n\sqrt{g \log m})$ algorithm to count mismatches, where $g$ is the number of non-wild card positions. Since $g < 2k$, the run time for this stage is $O(n\sqrt{k \log m})$, and the total run time is $O(n\sqrt{k \log k})$. Knapsack $k$ mismatches can always include in the knapsack all of the instances of non-frequent symbols, since their total cost is no more than $n\sqrt{k \log k}$, and, in the worst case, do convolutions for the remaining characters. The total run time is $O(n\sqrt{k \log k})$. Of course, depending on the frequency of the characters in the pattern and text, knapsack $k$ mismatches may not have to do any convolutions.
Case 3.2.2: All of the instances of the non-frequent symbols number at least $2k$ positions. The algorithm in [11] uses $2k$ instances of infrequent characters to do marking. Since each such character has a frequency less than $\sqrt{k \log k}$, the time for marking is $O(n\sqrt{k \log k})$, and there are no more than $n\sqrt{k \log k}/k$ positions left after filtering. Knapsack $k$ mismatches chooses characters in order to minimize the marking time $B$, so again, the run time is the same as that of [11] only in the worst case.