This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

We propose a framework for the exact probabilistic analysis of window-based pattern matching algorithms, such as Boyer–Moore, Horspool, Backward DAWG Matching, Backward Oracle Matching, and more. In particular, we develop an algorithm that efficiently computes the distribution of a pattern matching algorithm's running time cost (such as the number of text character accesses) for any given pattern in a random text model. Text models range from simple uniform models to higher-order Markov models or hidden Markov models (HMMs). Furthermore, we provide an algorithm to compute the exact distribution of differences in running time cost between two pattern matching algorithms.

The basic pattern matching problem is to find all occurrences of a given pattern in a text.

Let

A question that has apparently so far not been investigated is about the exact probability distribution of the number of required character accesses

In contrast to these results that are special to the Horspool algorithm, we use a general framework called probabilistic arithmetic automata (PAAs).

Here, we show that a similar idea can be applied to the analysis of pattern matching algorithms by constructing an automaton that encodes the behavior of such an algorithm and then combining it with a text model. The PAA framework allows doing this in a natural way, which further highlights its utility. The random text model can be quite general, from simple i.i.d. uniform models to high-order Markov models or HMMs. The approach is applied to the following pattern matching algorithms in the non-asymptotic regime (short patterns, medium-length texts): Horspool, B(N)DM, BOM. We do not treat BDM and BNDM separately as, in terms of text character accesses, they are indistinguishable (see Section 2.2).

This paper is organized as follows. In the next section, we give a brief review of the Horspool, B(N)DM and BOM algorithms. In Section 3, we define deterministic arithmetic automata (DAAs).

An extended abstract of this work has been presented at LATA'10 [

Throughout this paper, Σ denotes a finite alphabet, p the pattern, m ≔ |p| the pattern length, and n the length of the text.

In the following, we summarize the Horspool, B(N)DM and BOM algorithms; algorithmic details can be found in [

We do not discuss the Knuth–Morris–Pratt algorithm because its number of text character accesses is constant: Each character of the text is looked at exactly once. Therefore, its cost on a text of length n always equals n, and its cost distribution is trivial.

We also do not discuss the Boyer–Moore algorithm, since it is never the best one in practice because of its complicated code to achieve optimal asymptotic running time. In contrast to our earlier paper [

The Horspool, B(N)DM and BOM algorithms have the following properties in common: They maintain a search window of length m = |p| that slides along the text. For each window, two quantities are determined:

its cost, given by a cost function cost^p : Σ^m → ℕ, counting the number of text character accesses made in this window, and

its shift, given by a shift function shift^p : Σ^m → {1, …, m}, giving the number of characters the window is advanced afterwards.

First, the rightmost characters of window and pattern are compared; that means, comparison starts at the right end of the window and proceeds to the left.

In any case, the rightmost window character a is used to determine how far the window can be shifted for the next iteration. The shift function ensures that no match can be missed by moving the window such that a becomes aligned with the rightmost occurrence of a among the first m − 1 characters of the pattern (or such that the window moves past a if that character does not occur there).

For concreteness, we state Horspool's algorithm and how we count text character accesses as pseudocode in Algorithm 1. Note that after a shift, even when we know from the previous iteration that some window characters match the pattern, they are read and compared again; the algorithm does not retain this information.

The main advantage of the Horspool algorithm is its simplicity. In particular, a window's shift value depends only on its last character, and its cost is easily computed from the number of consecutive matching characters at its right end. The Horspool algorithm does not require any advanced data structure and can be implemented in a few lines of code.
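As an illustration, here is a minimal Python sketch of Horspool's search that counts text character accesses. It is not the paper's Algorithm 1; the counting convention (every read of a text character counts, including re-reads after a shift) is an assumption of this sketch.

```python
def horspool_access_count(pattern: str, text: str) -> int:
    """Run Horspool's algorithm on `text` and count text character accesses.

    Assumed counting convention: every read of a text character counts,
    including re-reads of characters already seen in a previous window.
    """
    m, n = len(pattern), len(text)
    # Shift table: distance from the rightmost occurrence of each character
    # among the first m-1 pattern characters to the pattern's end.
    shift = {c: m for c in set(text) | set(pattern)}
    for i, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - i
    accesses = 0
    pos = 0  # current window is text[pos:pos+m]
    while pos + m <= n:
        j = m - 1
        while j >= 0:
            accesses += 1                      # one text character access
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        # j < 0 means a full match at position `pos`
        a = text[pos + m - 1]                  # rightmost window character
        pos += shift[a]                        # Horspool shift
    return accesses
```

The rightmost window character needed for the shift is read during the first comparison anyway, so the shift lookup incurs no extra access in this convention.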


The main idea of the BDM algorithm is to build a deterministic finite automaton (in this case, a suffix automaton, which is a directed acyclic word graph or DAWG) that recognizes all substrings of the reversed pattern, accepts all suffixes of the reversed pattern (including the empty suffix), and enters a FAIL state if a string has been read that is not a substring of the reversed pattern.

The suffix automaton processes the window right-to-left. As long as the FAIL state has not been reached, we have read a substring of the reversed pattern. If we are in an accepting state, we have even found a suffix of the reversed pattern (that is, the characters read so far form a prefix of the pattern).

So, cost^p(w) is the number of window characters read until the FAIL state is reached, or m if the automaton never fails; and shift^p(w) ≔ m − ℓ, where ℓ < m is the length of the longest proper prefix of p found as a suffix of w (i.e., the longest accepting state encountered during the backward scan, excluding the full window).

Note that ^{p}^{p}

The advantage of BDM is that it makes long shifts, but its main disadvantage is the necessary construction of the suffix automaton, which is possible in O(m) time, but with considerable overhead in practice.

Constructing a nondeterministic finite automaton (NFA) instead of the deterministic suffix automaton is much simpler. However, processing a text character then does not take constant time, but O(⌈m/w⌉) time using bit-parallel simulation, where w denotes the machine word size.

From the “text character accesses” analysis point of view, BDM and BNDM are equivalent, as they have the same shift and cost functions.
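Assuming the usual B(N)DM semantics (cost = characters read until the automaton fails, or m; shift = m minus the longest proper pattern prefix found as a window suffix), the per-window cost and shift can be sketched with naive substring and prefix tests standing in for the suffix automaton. Function name and the convention that the failing access is counted are assumptions of this sketch.

```python
def bndm_cost_and_shift(pattern: str, window: str):
    """Cost (characters read) and shift of B(N)DM for a single window.

    Naive sketch: direct substring/prefix tests replace the suffix
    automaton of the reversed pattern; cost and shift are identical
    for the deterministic (BDM) and bit-parallel (BNDM) variants.
    """
    m = len(pattern)
    assert len(window) == m
    last_prefix = 0  # longest proper prefix of the pattern seen as window suffix
    for k in range(1, m + 1):
        suffix = window[m - k:]
        if suffix not in pattern:          # automaton would enter FAIL here
            return k, m - last_prefix      # the failing access is counted
        if k < m and pattern.startswith(suffix):
            last_prefix = k                # accepting state: pattern prefix
    # whole window is a substring of the pattern (match iff window == pattern)
    return m, m - last_prefix
```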

BOM is similar to B(N)DM, but the suffix automaton of the reversed pattern is replaced by a simpler deterministic automaton, the factor oracle [

For a string of length m, the factor oracle has the following properties:

The only string of length m that it recognizes is the reversed pattern itself.

It has the minimal number of states (namely m + 1).

It has between m and 2m − 1 transitions.

It may recognize more strings than the substrings of the reversed pattern.

The cost function cost^p(w) again counts the number of window characters read until the automaton, now the factor oracle, reaches the FAIL state, or m if it does not fail; the shift function shift^p(w) is derived from the position at which the oracle fails: no pattern occurrence can start before that position, so the window may be shifted just past it.

By construction, BOM never gives longer shifts than B(N)DM. The main advantage of BOM over BDM is reduced space usage and preprocessing time; the factor oracle only has m + 1 states and at most 2m − 1 transitions and can be built with a simple online construction.
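The online factor-oracle construction with supply links can be sketched in a few lines (variable names are ours):

```python
def factor_oracle(word: str):
    """Build the factor oracle of `word` using the classic online
    supply-link construction. Returns the transition table as a list of
    dicts, one per state 0..len(word)."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)  # supply (failure) links; supply[0] = -1
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i              # the "spine" transition
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i              # external transitions
            k = supply[k]
        supply[i] = trans[k][c] if k > -1 else 0
    return trans

def oracle_accepts(trans, s: str) -> bool:
    """Every factor of `word` is recognized; some non-factors may be, too."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True
```

For BOM, the oracle is built for the reversed pattern and used to scan each window right-to-left.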

In this section, we introduce deterministic arithmetic automata (DAAs). They extend ordinary deterministic finite automata (DFAs) by performing a computation while one moves from state to state. Even though DAAs can be shown to be formally equivalent to families of DFAs on an appropriately defined larger state space, they are a useful concept before introducing probabilistic arithmetic automata (PAAs) and allow us to construct PAAs for the analysis of pattern matching algorithms in a simpler way. By using the PAA framework, we emphasize the connection between the problems discussed in the present article and those solved before using the same formalism: Other applications in biological sequence analysis include the exact computation of clump size distributions and

A deterministic arithmetic automaton (DAA) is a tuple D = (Q, q_0, Σ, δ, V, v_0, ℰ, (η_q)_{q∈Q}, (θ_q)_{q∈Q}), where Q is a finite set of states, q_0 ∈ Q is the start state, Σ is a finite alphabet, δ : Q × Σ → Q is the transition function, V is a set of values, v_0 ∈ V is the start value, ℰ is a finite set of emissions, η_q ∈ ℰ is the emission associated with state q, and θ_q : V × ℰ → V is the operation associated with state q.

Informally, a DAA starts with the state-value pair (q_0, v_0) and reads a sequence of symbols from Σ. Being in state q with value v and reading σ ∈ Σ, the DAA moves to state q′ ≔ δ(q, σ) and updates the value to v′ ≔ θ_{q′}(v, η_{q′}).

Further, we define the associated joint transition function δ̄ : (Q × V) × Σ → Q × V by δ̄((q, v), σ) ≔ (q′, θ_{q′}(v, η_{q′})), where q′ ≔ δ(q, σ).

As usual, we extend the definitions of δ and δ̄ inductively from single characters to strings, setting δ(q, xσ) ≔ δ(δ(q, x), σ) for x ∈ Σ* and σ ∈ Σ, and analogously for δ̄.

When the DAA reads a string x ∈ Σ*, starting from the pair (q_0, v_0), it ends in the state-value pair δ̄((q_0, v_0), x). We define value_D(x) to be the value component of this final pair.

For each state _{q}

As a simple example for a DAA, take a standard DFA (Q, q_0, Σ, δ, F) and convert it into a DAA that counts how many prefixes of the input it accepts: let the emissions indicate accepting states, η_q ≔ 1 if q ∈ F and η_q ≔ 0 otherwise, set V ≔ ℕ with v_0 ≔ 0, and let the operation in each state be the usual addition: θ_q(v, e) ≔ v + e. Then value_D(x) equals the number of accepted prefixes of x; if the DFA recognizes strings ending with a pattern p, this is the number of occurrences of p in x.
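This counting example can be made concrete with a small sketch; the class layout and helper names are illustrative, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

@dataclass
class DAA:
    """Minimal deterministic arithmetic automaton: a transition function
    plus per-state emissions and operations that update a value while
    symbols are read."""
    delta: Callable[[Hashable, str], Hashable]           # transition function
    q0: Hashable                                          # start state
    v0: object                                            # start value
    eta: Callable[[Hashable], object]                     # emission of a state
    theta: Callable[[Hashable, object, object], object]   # per-state operation

    def value(self, x: str):
        q, v = self.q0, self.v0
        for sigma in x:
            q = self.delta(q, sigma)
            v = self.theta(q, v, self.eta(q))  # new state's operation applies
        return v

def make_occurrence_counter(pattern: str) -> DAA:
    """DAA counting occurrences of `pattern`: emission 1 in the accepting
    state of the naive pattern DFA, 0 elsewhere; operation = addition."""
    def delta(q: str, sigma: str) -> str:
        s = q + sigma
        # KMP-style state: longest suffix of s that is a pattern prefix
        for k in range(min(len(s), len(pattern)), -1, -1):
            if pattern.startswith(s[len(s) - k:]):
                return s[len(s) - k:] if k else ""
        return ""
    return DAA(delta=delta, q0="", v0=0,
               eta=lambda q: 1 if q == pattern else 0,
               theta=lambda q, v, e: v + e)
```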

For a given algorithm A and pattern p, we now construct a DAA D over Σ such that, for every text t ∈ Σ*, value_D(t) equals X^{A,p}(t), the total number of text character accesses of A when searching p in t. The DAA is built from the window-wise functions cost^p : Σ^m → ℕ and shift^p : Σ^m → {1, …, m}.

While different constructions are possible (see also [

Given a window-based pattern matching algorithm with window size m, and the associated shift and cost functions, shift^p : Σ^m → {1, …, m} and cost^p : Σ^m → ℕ, the cost DAA is defined by the following components. Its states and transitions are

Q ≔ Σ^m × {0, …, m}, where a state (w, j) records the last m characters read and the number j of characters that still have to be read before the current window is complete,

q_0 ≔ (w°, m) for an arbitrary w° ∈ Σ^m; the first window is complete after m characters have been read,

δ((w, j), σ) ≔ (w′, j′), where w′ results from w by removing the first character and appending σ, and j′ ≔ j − 1 if j > 0, while j′ ≔ shift^p(w) − 1 if j = 0.

The remaining components are defined as

v_0 ≔ 0 and V ≔ ℕ,

ℰ ≔ {0, 1, …, m},

η_{(w,j)} ≔ cost^p(w) if j = 0 and η_{(w,j)} ≔ 0 otherwise,

θ_q : (v, e) ↦ v + e for every state q ∈ Q.

Let D be the cost DAA constructed above for algorithm A and pattern p. Then value_D(t) = X^{A,p}(t) for every text t ∈ Σ*, where X^{A,p}(t) denotes the total number of text character accesses.

The total cost is the sum of the individual window costs, X^{A,p}(t) = Σ_{i∈I} cost^p(w_i), where I denotes the set of window positions examined when running the algorithm on t and w_i the window content at position i.

We have to prove that the DAA computes this value for every text t ∈ Σ*.

Let (q_i, v_i) ≔ δ̄((q_0, v_0), t_0 … t_{i−1}) be the state-value pair after reading the first i characters, and write q_i = (w_i, j_i). By construction of δ, the component w_i always consists of the last m characters read (initially padded by the arbitrary start window), and v_i is the sum of the emissions of the states visited so far.

It remains to show that the emissions pick up exactly the window costs. By definition of δ, we have j_{i+1} = j_i − 1 if j_i > 0 and j_{i+1} = shift^p(w_i) − 1 if j_i = 0. As q_0 = (w°, m), the first state with j = 0 is reached after exactly m characters, when the first window is complete; thereafter, whenever j_i = 0, the next state with j = 0 is reached after exactly shift^p(w_i) further characters. Hence the states with j = 0 correspond exactly to the examined windows; precisely these states emit cost^p(w), while all other states emit 0. Summing the emissions therefore yields X^{A,p}(t).

The size of the constructed DAA's state space is O(m · |Σ|^m). Since a pattern of length m has only O(m^2) different factors, it is reasonable to expect that the factor |Σ|^m can be replaced by O(m^2), for a total state space of size O(m^3). Therefore, for each algorithm, a specialized construction may exist that directly constructs the minimal state space whose size may only grow polynomially with m; for the Horspool algorithm, a state space of size O(m^2) suffices, as the construction of Tsai [
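For illustration, a generic cost DAA over states (window, countdown) can be generated on the fly, keeping only states reachable from the start; the state encoding and function names are assumptions of this sketch.

```python
def build_cost_daa(alphabet, m, shift, cost):
    """Sketch of a generic cost DAA: a state (w, j) stores the last m read
    characters and the number j of characters still to be read until the
    current window is complete; a completed window (j = 0) emits its cost,
    and the countdown restarts at shift(w) - 1. Only reachable states are
    generated. `shift` and `cost` map a window string to int."""
    start = (alphabet[0] * m, m)  # placeholder window; first window after m chars
    delta = {}
    stack, seen = [start], {start}
    while stack:
        w, j = stack.pop()
        for sigma in alphabet:
            w2 = w[1:] + sigma
            j2 = j - 1 if j > 0 else shift(w) - 1
            nxt = (w2, j2)
            delta[((w, j), sigma)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    emission = {q: (cost(q[0]) if q[1] == 0 else 0) for q in seen}
    return start, delta, emission

def total_cost(daa, text):
    """Run the DAA on `text`, summing emissions (the DAA's value)."""
    q, delta, emission = daa
    v = 0
    for sigma in text:
        q = delta[(q, sigma)]
        v += emission[q]
    return v
```

With shift ≡ 1 and cost ≡ 1, the value is simply the number of windows, n − m + 1.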

Hopcroft's algorithm [

The algorithm can straightforwardly be adapted to minimize DAAs by choosing the initial state set partition appropriately. In our case, each DAA state is associated with the same operation. The only differences in the states' behavior thus stem from different emissions. Therefore, Hopcroft's algorithm can be initialized with the partition induced by the emissions and then continued as usual.

As we exemplify in Section 7, this leads to a considerable reduction of the number of states.
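A simple partition-refinement sketch illustrates the idea; it is Moore-style refinement rather than Hopcroft's O(n log n) algorithm, but it starts, as described, from the partition induced by the emissions and computes the same final partition.

```python
def minimize_daa(states, alphabet, delta, emission):
    """Moore-style partition refinement for DAAs whose states all share the
    same operation: start from the partition induced by the emissions, then
    refine until transitions respect the blocks. Returns a map from state
    to block id; equivalent states get the same id."""
    # initial partition: states with equal emission belong together
    block = {q: emission(q) for q in states}
    while True:
        # signature of a state: its own block plus the blocks of successors
        sig = {q: (block[q], tuple(block[delta(q, a)] for a in alphabet))
               for q in states}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values()), key=repr))}
        new_block = {q: ids[sig[q]] for q in states}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # fixpoint: no block was split
        block = new_block
```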

This section introduces finite-memory random text models and explains how to construct a PAA from such a text model and a given DAA.

Given an alphabet Σ, a random text is a stochastic process (S_t)_{t∈ℕ_0}, where each S_t takes values in Σ. For a string s ∈ Σ*, we write ℙ(S_0 … S_{|s|−1} = s) for the probability that the random text starts with s.

A finite-memory text model is a tuple (C, c_0, Σ, φ), where C is a finite set of contexts, c_0 ∈ C is the start context, and φ : C × Σ × C → [0, 1] gives transition probabilities such that Σ_{σ∈Σ, c′∈C} φ(c, σ, c′) = 1 for all c ∈ C. Let (C_t)_{t∈ℕ_0} be the associated context process with C_0 :≡ c_0. A probability measure is now induced by stipulating

ℙ(S_0 … S_{n−1} = s, C_1 = c_1, …, C_n = c_n) ≔ Π_{t=0}^{n−1} φ(c_t, s_t, c_{t+1}) for all n ∈ ℕ, s ∈ Σ^n and (c_1, …, c_n) ∈ C^n.

The idea is that the model given by (C, c_0, Σ, φ) generates a random text character by character: in context c, a character σ and a successor context c′ are chosen jointly with probability φ(c, σ, c′).

Note that the probability ℙ(S_0 … S_{|s|−1} = s) of observing a string s is obtained by summing over all possible context sequences.

Let (
_{0}, Σ, _{0}, ^{n}, σ

We have

Renaming _{n}

Similar text models are used in [_{0}, Σ,

For an i.i.d. model, we set C ≔ {c_0} and φ(c_0, σ, c_0) ≔ p_σ, where p_σ denotes the occurrence probability of character σ.
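The marginal string probability ℙ(S_0 … S_{|s|−1} = s) can be computed by tracking the context distribution while reading s; the following sketch encodes a model as a tuple (contexts, start context, φ), a representation chosen only for this illustration.

```python
def string_probability(model, s: str) -> float:
    """Probability that a finite-memory text model (C, c0, phi) starts
    with string s, obtained by summing over all context sequences: we
    propagate the distribution over contexts character by character."""
    contexts, c0, phi = model  # phi[(c, sigma, c_next)] = probability
    dist = {c0: 1.0}
    for sigma in s:
        new_dist = {}
        for c, pr in dist.items():
            for c2 in contexts:
                p = phi.get((c, sigma, c2), 0.0)
                if p > 0.0:
                    new_dist[c2] = new_dist.get(c2, 0.0) + pr * p
        dist = new_dist
    return sum(dist.values())

# An i.i.d. model has a single context; e.g., uniform over {"a", "b"}:
iid = (["c"], "c", {("c", "a", "c"): 0.5, ("c", "b", "c"): 0.5})
# A first-order Markov model would use the previous character as context.
```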

Probabilistic arithmetic automata (PAAs), as introduced in [

A probabilistic arithmetic automaton (PAA) is a tuple (Q, q_0, T, V, v_0, ℰ, (μ_q)_{q∈Q}, (θ_q)_{q∈Q}), where Q, q_0, V, v_0, ℰ and θ_q are as in the definition of a DAA, T = (T(q, q′))_{q,q′∈Q} is a stochastic matrix of transition probabilities, and each μ_q is a probability distribution on ℰ, the emission distribution of state q.

A PAA induces three stochastic processes: (1) the state process (Q_t)_{t∈ℕ} with values in Q, which is a Markov chain with transition matrix T and Q_0 :≡ q_0; (2) the emission process (E_t)_{t∈ℕ} with values in ℰ, where E_t is drawn according to μ_{Q_t}; and (3) the value process (V_t)_{t∈ℕ} with values in V, defined by V_0 :≡ v_0 and V_t ≔ θ_{Q_t}(V_{t−1}, E_t).

We now restate the PAA recurrences from [ ] that allow us to compute the joint state-value distribution f_t(q, v) ≔ ℙ(Q_t = q, V_t = v) by f_t(q, v) = Σ_{q′∈Q} Σ_{(v′,e) : θ_q(v′,e) = v} f_{t−1}(q′, v′) · T(q′, q) · μ_q(e).

The recurrence is initialized with f_0(q_0, v_0) ≔ 1 and f_0(q, v) ≔ 0 for all other pairs (q, v).

The recurrence in Lemma 2 resembles the Forward recurrences known from HMMs.

Note that the range of V_t grows with t. Defining V_n ≔ ∪_{0≤t≤n} range(V_t), the runtime and space of the recurrences depend on |V_n|. For the cost DAAs constructed in Section 4, each of at most n windows contributes a cost of at most m, so |V_n| = O(n · m).

We now formally state how to combine a DAA and a text model into a PAA that allows us to compute the distribution of values produced by the DAA when processing a random text.

Let a text model (C, c_0, Σ, φ) and a DAA D = (Q^D, q_0^D, Σ, δ, V, v_0, ℰ, (η_q), (θ_q)) be given. The combined PAA consists of

a state space C × Q^D,

a start state (c_0, q_0^D),

transition probabilities T((c, q), (c′, q′)) ≔ Σ_{σ∈Σ : δ(q,σ)=q′} φ(c, σ, c′),

(deterministic) emission probability vectors μ_{(c,q)}, assigning probability one to η_q,

operations θ_{(c,q)} ≔ θ_q.

Note that states having zero probability of being reached from the start state may be omitted from the PAA.

Let a text model (C, c_0, Σ, φ) and a DAA D be given, and let (Q, q_0, T, V, v_0, ℰ, (μ_q)_{q∈Q}, (θ_q)_{q∈Q}) be the PAA given by Definition 5. Then,

ℒ(V_t) = ℒ(value_D(S_0 … S_{t−1})) for all t ∈ ℕ_0, where (S_t) is the random text defined by the text model;

the value distribution ℒ(V_n) can be computed with O(n · |Q^D| · |C|^2 · |Σ| · |V_n|) operations and O(|Q^D| · |C| · |V_n|) space, where V_n denotes the set of values attainable with positive probability within the first n steps;

if for all ^{
}| · |
_{n}

As in Section 5.2, we define f_t(q, v) ≔ ℙ(Q_t = q, V_t = v), now for PAA states q = (c, d) ∈ C × Q^D, initialized with f_0((c_0, q_0^D), v_0) = 1. For t > 0, the recurrence of Lemma 2 applies with the deterministic emissions substituted.

In the above derivation, we used that the emission vectors μ_q are deterministic, i.e., that each μ_q assigns probability one to the DAA emission η_q, and the definition of the PAA transition probabilities T.

To compute the distribution of V_n, we start with the table f_0 and perform n update steps. To obtain f_{t+1} from f_t, we initialize the new table with zeros and iterate over all pairs (q, v) with f_t(q, v) > 0 and all successor states q′, adding f_t(q, v) · T(q, q′) to the entry f_{t+1}(q′, θ_{q′}(v, η_{q′})).
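This update step can be sketched directly; the representation (a transition list per state, deterministic emissions as a function) is an implementation choice of this sketch, not the paper's data structure.

```python
from collections import defaultdict

def paa_value_distribution(start, T, eta, theta, v0, n):
    """Forward recurrence f_t(q, v) = P(Q_t = q, V_t = v) for a PAA with
    deterministic emissions (as produced by combining a DAA with a text
    model): each table entry is pushed along all outgoing transitions,
    applying the target state's operation. Returns the distribution of V_n.
    `T` maps a state to a list of (successor, probability) pairs."""
    f = defaultdict(float)
    f[(start, v0)] = 1.0
    for _ in range(n):
        g = defaultdict(float)
        for (q, v), pr in f.items():
            for q2, p in T[q]:
                g[(q2, theta(q2, v, eta(q2)))] += pr * p
        f = g
    dist = defaultdict(float)
    for (q, v), pr in f.items():  # marginalize out the state
        dist[v] += pr
    return dict(dist)
```

For a two-state chain with uniform transitions, emission 1 in one state and 0 in the other, and addition as operation, V_n is Binomial(n, 1/2).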

As a direct consequence of the above lemma and of the DAA construction from Section 4, we arrive at our main theorem.

Let a finite-memory text model (C, c_0, Σ, φ), a pattern p, and a window-based pattern matching algorithm A with cost and shift functions be given. Then the exact distribution of the total cost X^{A,p} on random texts of length n can be computed in O(n^2 · m · |Q^D| · |C|^2 · |Σ|) time and O(n · m · |Q^D| · |C|) space.

Using optimal algorithm-dependent DAA construction schemes (e.g., the O(m^2) construction for the Horspool algorithm by Tsai [ ]), these bounds improve accordingly: the term |Q^D| is replaced by a polynomial in m instead of a quantity of order m · |Σ|^m.

Computing the cost distribution for two algorithms allows us to compare their performance characteristics. One natural question, however, cannot be answered by comparing these two (one-dimensional) distributions: What is the probability that one algorithm needs fewer text character accesses than the other on the same random text?

We start by giving a general construction of a DAA that computes the difference of the emission sums of two given DAAs.

Let a finite alphabet Σ and two DAAs D^1 and D^2 over Σ be given, both with value set ℕ, start value 0, numeric emissions, and addition as the operation in every state (as is the case for the cost DAAs from Section 4). The difference DAA D of D^1 and D^2 is defined as

Q ≔ Q^1 × Q^2 and q_0 ≔ (q_0^1, q_0^2),

V ≔ ℤ and v_0 ≔ 0,

ℰ ≔ ℰ^1 × ℰ^2 and η_{(q^1,q^2)} ≔ (η^1_{q^1}, η^2_{q^2}),

δ((q^1, q^2), σ) ≔ (δ^1(q^1, σ), δ^2(q^2, σ)),

θ_q : (v, (e^1, e^2)) ↦ v + e^1 − e^2.

Let D^1 and D^2 be DAAs meeting the criteria given in Definition 6 and let D be their difference DAA. Then, value_D(x) = value_{D^1}(x) − value_{D^2}(x) for all x ∈ Σ*.

Follows directly from Definition 6.

Lemma 4 can now be applied to the DAAs constructed for the analysis of two algorithms as described in Section 4. Since the above construction builds the product of both state spaces, it is advisable to minimize both DAAs before generating the product. Furthermore, in an implementation, only reachable states of the product automaton need to be constructed. Before being used to build a PAA (by applying Lemma 3), the product DAA should again be minimized.
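The product construction can be sketched with additive DAAs represented as (transition function, start state, emission function) triples, a representation assumed for this illustration.

```python
def difference_daa(d1, d2):
    """Product construction for two additive DAAs (Definition 6 style):
    the product runs both automata in lockstep and its emission is the
    difference of the component emissions, so its value is the difference
    of the two emission sums. Each DAA is a (delta, q0, eta) triple."""
    delta1, q01, eta1 = d1
    delta2, q02, eta2 = d2
    delta = lambda q, s: (delta1(q[0], s), delta2(q[1], s))
    eta = lambda q: eta1(q[0]) - eta2(q[1])
    return delta, (q01, q02), eta

def run_value(daa, x: str):
    """Total value of an additive DAA on input x (sum of emissions)."""
    delta, q, eta = daa
    v = 0
    for s in x:
        q = delta(q, s)
        v += eta(q)
    return v
```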

As discussed in Section 5.2, at most _{n}

In Section 2, we considered three practically relevant algorithms, namely Horspool's algorithm, backward oracle matching (BOM), and backward (non)-deterministic DAWG matching (B(N)DM). Now, we compare the distributions of running time costs of these algorithms for several patterns over the DNA alphabet {A, C, G, T}.

It is remarkable that for BOM we find zero probabilities with a fixed period. The period equals ^{p}^{p}

Let a window-based pattern matching algorithm ^{A,p}^{A,p}^{A,p}^{A,p}^{m}^{n}

Let _{i}_{k}^{n}^{A,p}_{i}_{i}

By using the assumption that ^{A,p}^{A,p}^{A,p}

The probability that one pattern matching algorithm is faster than another depends on the pattern. Using the technique introduced in Section 6, we can quantify the strength of this effect.

Worth noting, and perhaps surprising, is the fact that there is a non-zero probability of BOM being faster than B(N)DM although shift^{B(N)DM,p}(w) ≥ shift^{BOM,p}(w) for all windows w ∈ Σ^m.

To assess the effect of DAA minimization before constructing PAAs, we constructed minimized DAAs for all 21840 patterns of lengths 2 to 7 over Σ = {A, C, G, T}.

The algorithms were implemented in JAVA and are available as part of the MoSDi software package available at

Using PAAs, we have shown how the exact distribution of the number of character accesses for window-based pattern matching algorithms can be computed algorithmically. The framework admits general finite-memory text models, including i.i.d. models, Markov models of arbitrary order, and character-emitting hidden Markov models. The given construction results in an asymptotic runtime of O(n^2 · m · |Q^D| · |C|^2 · |Σ|). The number of DAA states |Q^D| can be as large as O(m · |Σ|^m), but minimization yields O(m^3) sizes for B(N)DM and BOM; this is consistent with the numbers in Table 1. For the Horspool algorithm, a direct O(m^2) construction is known [

The behavior of BOM deserves further attention: first, periodic zero probabilities are found in its distribution of text character accesses; and second, it may (unexpectedly) need fewer text accesses than B(N)DM on some patterns, although BOM's shift values are never better than B(N)DM's.

We focused on algorithms for single patterns, but the presented techniques also apply to algorithms to search for multiple patterns like the Wu-Manber algorithm [

Other metrics than text character accesses might be of interest and could easily be substituted; for example, one could simply count the number of windows by defining cost^p(w) ≔ 1 for all w ∈ Σ^m.

The given constructions allow us to analyze an algorithm's performance for each pattern individually. While this is desirable for a detailed analysis, the cost distribution resulting from randomly choosing the pattern as well as the text might also be of interest.

The results of this paper were obtained while Tobias Marschall was a PhD student with Sven Rahmann at TU Dortmund. The thesis is available at

Factor Oracle for

Illustration of the DAA encoding the behavior of Horspool's algorithm when searching the text

Exact distributions of character access counts for patterns

Exact distributions of differences in character access counts for different patterns using a second-order Markovian text model estimated from the human genome and random texts of length 100.

Example of a string for which BOM performs fewer character accesses than B(N)DM when searching for the pattern

Histogram of the number of states of minimal DAAs over all patterns of length 6 over Σ = {A, C, G, T}.

Comparison of DAA sizes for all patterns of length m = 2, …, 7 over the DNA alphabet: minimal, mean, and maximal number of states of the minimized DAAs.

m | (m + 1) · |Σ|^m (unminimized) | Horspool (min / mean / max) | B(N)DM (min / mean / max) | BOM (min / mean / max)
---|---|---|---|---

2 | 48 | 4 / 4.8 / 5 | 4 / 4.0 / 4 | 4 / 4.8 / 5 |

3 | 256 | 7 / 8.3 / 9 | 7 / 8.3 / 9 | 7 / 9.6 / 10 |

4 | 1280 | 11 / 14.3 / 15 | 11 / 15.6 / 18 | 11 / 17.0 / 19 |

5 | 6144 | 16 / 23.6 / 25 | 16 / 26.5 / 30 | 16 / 27.9 / 31 |

6 | 28672 | 22 / 37.0 / 39 | 22 / 41.8 / 47 | 22 / 42.8 / 48 |

7 | 131072 | 29 / 55.2 / 58 | 29 / 62.4 / 70 | 29 / 62.6 / 70 |

Most work was carried out while both authors were affiliated with TU Dortmund.