Fast Full Permuted Pattern Matching Algorithms on Multi-track Strings

A multi-track string is a tuple of strings of the same length. The full permuted pattern matching problem is, given two multi-track strings $T = (t_1, t_2, \ldots, t_N)$ and $P = (p_1, p_2, \ldots, p_N)$ such that $|p_1| = \cdots = |p_N| \le |t_1| = \cdots = |t_N|$, to find all positions $i$ such that $P = (t_{r_1}[i : i+m-1], \ldots, t_{r_N}[i : i+m-1])$ for some permutation $(r_1, \ldots, r_N)$ of $(1, \ldots, N)$, where $m = |p_1|$ and $t[i : j]$ denotes the substring of $t$ from position $i$ to $j$. We propose new algorithms that perform full permuted pattern matching fast in practice. The first and second algorithms are based on the Boyer-Moore algorithm and the Horspool algorithm, respectively. The third algorithm is based on the Aho-Corasick algorithm, where we use a multi-track character instead of a single character in the so-called goto function. The fourth algorithm is an improvement of the multi-track Knuth-Morris-Pratt algorithm that uses an automaton instead of the failure function of the original algorithm. Our experimental results demonstrate that these algorithms perform permuted pattern matching faster than existing algorithms.


Introduction
The pattern matching problem on strings is to find all occurrences of a pattern string in a text string. Pattern matching algorithms such as the Knuth-Morris-Pratt (KMP) algorithm [8], the Boyer-Moore algorithm [2], and the Horspool algorithm [5] perform pattern matching fast by preprocessing the pattern. On the other hand, pattern matching can also be performed by preprocessing the text into a data structure such as a suffix tree [12], a suffix array [10], or a position heap [4].
The permuted pattern matching problem, proposed by Katsura et al. [6,7], is a generalization of the pattern matching problem, where we compare tuples of strings. Tuples of strings can model various types of real data such as multiple-sensor data, polyphonic music data, and multiple genomes. We call a tuple of strings of the same length a multi-track string. The permuted pattern matching problem is, given two multi-track strings $T = (t_1, t_2, \ldots, t_N)$ and $P = (p_1, p_2, \ldots, p_M)$ where $M \le N$ and $|t_1| = \cdots = |t_N| = n \ge |p_1| = \cdots = |p_M| = m$, to find all positions $i$ such that $P$ is a permutation of a sub-tuple of $(t_1[i : i+m-1], \ldots, t_N[i : i+m-1])$, where $w[i : j]$ denotes the substring of $w$ from position $i$ to $j$. As an example, data from multiple sensors such as a three-dimensional accelerometer can be considered as a multi-track string. This problem can be solved by constructing a data structure from the text, such as a multi-track suffix tree [6] or a multi-track position heap [7], or by preprocessing the pattern, as the Aho-Corasick (AC) automaton based algorithm [6] and the KMP based algorithm [3] do.
In this paper, we focus on the full permuted matching problem, which is a special case of the permuted matching problem where we have $M = N$. We propose several new algorithms that perform full permuted pattern matching fast in practice. The first algorithm, MT-BM, is based on the Boyer-Moore algorithm [2] and the second one, MT-H, is based on the Horspool algorithm [5], on which we made a significant improvement by using a data structure called a track trie. The third algorithm, the multi-track AC-automaton, is an algorithm for dictionary matching on multi-track strings based on the AC algorithm [1], where we use a multi-track character instead of a single character in the so-called goto function. The fourth algorithm, the multi-track permuted matching automaton, is an improvement of the multi-track KMP algorithm [3] that uses an automaton instead of the failure function in the KMP algorithm. Moreover, we conduct experiments and show that our algorithms perform permuted pattern matching faster than existing algorithms. The worst-case running times of the proposed and existing algorithms are summarized in Table 1, where $d$ is the total length of the patterns and $\sigma$ is the size of the alphabet.

Table 1. Worst-case running times.
Algorithm                   Preprocessing time   Matching time   Online
AC-automaton based [6]      O(mM log σ)          O(nN log σ)     yes
Multi-track KMP* [3]        O(mM)                O(nN)           no
Filter-MTKMP [3]            O…

Proceedings of the Prague Stringology Conference 2016

Preliminaries
Let $w \in \Sigma^n$ be a string of length $n$ over an alphabet $\Sigma$, and let $\sigma = |\Sigma|$ be the alphabet size. The length $n$ of $w$ is denoted by $|w|$. The empty string, denoted by $\varepsilon$, is a string of length 0. For two strings $x$ and $y$, we denote by $x \prec y$ that $x$ is lexicographically smaller than $y$, and by $x \preceq y$ that either $x = y$ or $x \prec y$.
A multi-track string (or multi-track for short) $W = (w_1, w_2, \ldots, w_N)$ is an $N$-tuple of strings $w_i \in \Sigma^n$, and each $w_i$ is called the $i$-th track of $W$. A multi-track character $C = (c_1, c_2, \ldots, c_N)$ is an $N$-tuple of characters $c_i \in \Sigma$. The length $n$ of the strings in $W$ is called the length of $W$ and is denoted by $|W|_{len}$. Let $r = (r_1, r_2, \ldots, r_N)$ be a permutation of $(1, 2, \ldots, N)$. For a multi-track $W = (w_1, w_2, \ldots, w_N)$, $W_r = W_{(r_1, r_2, \ldots, r_N)} = (w_{r_1}, \ldots, w_{r_N})$ is called a permuted multi-track of $W$. The sorted index $SI(W)$ of a multi-track $W$ is a permutation $(r_1, \ldots, r_N)$ such that $w_{r_i} \preceq w_{r_{i+1}}$ for any $1 \le i < N$, where we assume $r_i < r_{i+1}$ in the case $w_{r_i} = w_{r_{i+1}}$. The sorted multi-track $sort(W)$ is defined as $W_{SI(W)}$. The reverse of a multi-track $W = (w_1, \ldots, w_N)$ is $W^R = (w_1^R, \ldots, w_N^R)$. The sorted index of the reverse multi-track, denoted by $RI(W)$, is a permutation $(r_1, \ldots, r_N)$ such that $w^R_{r_i} \preceq w^R_{r_{i+1}}$ for any $1 \le i < N$. Note that $SI(W[i:])$ and $RI(W[:i])$ for $1 \le i \le n$ can be computed in $O(nN \log \sigma)$ time offline by using a suffix tree [6,11] or a suffix array [9], and $RI(W[:i])$ for $1 \le i \le n$ can be computed in $O(n(N + \sigma))$ time online by using radix sort.
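As a small illustration of the definitions above, the following Python sketch computes the sorted index, the sorted multi-track, and the reverse sorted index by plain comparison sorting. The function names are hypothetical, indices are 1-based to follow the text, and applying the same tie-breaking rule to RI is an assumption. It does not use the suffix-tree or radix-sort methods mentioned above.

```python
def sorted_index(w):
    """SI(W): 1-based permutation that sorts the tracks lexicographically.

    Ties are broken by the smaller track index, as the text assumes.
    """
    order = sorted(range(len(w)), key=lambda i: (w[i], i))
    return tuple(i + 1 for i in order)


def permute(w, r):
    """W_r: reorder the tracks of w by the 1-based permutation r."""
    return tuple(w[i - 1] for i in r)


def sort_multitrack(w):
    """sort(W) = W_{SI(W)}."""
    return permute(w, sorted_index(w))


def reverse_sorted_index(w):
    """RI(W): sorted index of the reversed tracks (tie-breaking assumed)."""
    return sorted_index(tuple(track[::-1] for track in w))


W = ("bba", "aab", "bba")
print(sorted_index(W))          # (2, 1, 3)
print(sort_multitrack(W))       # ('aab', 'bba', 'bba')
print(reverse_sorted_index(W))  # (1, 3, 2)
```

Two multi-tracks have the same sorted multi-track exactly when one is a permutation of the other, which is why sorting can serve as a test for permuted matching.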
For two multi-tracks $X = (x_1, x_2, \ldots, x_N)$ and $Y = (y_1, y_2, \ldots, y_N)$, $X$ permuted-matches $Y$, denoted by $X \overset{\bowtie}{=} Y$, if $X = Y_r$ for some permutation $r$. Throughout the paper, we assume that $P$ is a pattern with $|P|_{num} = M$ and $|P|_{len} = m$, and $T$ is a text with $|T|_{num} = N = M$ and $|T|_{len} = n \ge m$. The pattern matching problem on multi-tracks is defined as follows: given a text $T$ and a pattern $P$, output all positions $i$ such that $P \overset{\bowtie}{=} T[i : i+m-1]$.
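To make the problem statement concrete, here is a naive baseline in Python that tests every text position by sorting the window tracks. It is only a reference sketch, not one of the proposed algorithms; the function names are hypothetical and positions are 0-based, unlike the 1-based positions used in the text.

```python
def permuted_match(x, y):
    """True if multi-track x is a permutation of the tracks of y."""
    return sorted(x) == sorted(y)


def full_permuted_search(text, pattern):
    """Return all 0-based positions i where the length-m window of
    text starting at i permuted-matches pattern (naive check)."""
    n, m = len(text[0]), len(pattern[0])
    target = sorted(pattern)
    return [
        i
        for i in range(n - m + 1)
        if sorted(track[i:i + m] for track in text) == target
    ]


pattern = ("aaabb", "abbba", "bbaba")
text = ("abbbaba", "baaaabb", "aaabbba")
print(full_permuted_search(text, pattern))  # [2]
```

Each window costs a sort of $N$ strings of length $m$, so this baseline is useful mainly for checking the output of faster algorithms.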

Boyer-Moore and Horspool algorithms for multi-track strings
In this section, we propose two permuted pattern matching algorithms that are based on the Boyer-Moore algorithm and the Horspool algorithm, which we call MT-BM and MT-H, respectively.

Multi-track Boyer-Moore algorithm
The original Boyer-Moore algorithm uses two failure functions, GS (good suffixes) and BC (bad characters), to determine how far the position of the substring to compare should be shifted when a mismatch is found between the input pattern and the substring of the text. These functions are defined on multi-tracks as follows.
Algorithm 1: MT-BM and MT-H preprocessing functions

Definition 4 (Bad character). For a multi-track $P$ of length $|P|_{len} = m$ and a multi-track character $C$, $BC(C)$ is the first occurrence position of $sort(C)$ in $P^R[2:]$. The function $BC(C)$ returns $m$ if there is no occurrence of $sort(C)$ in $P^R[2:]$.

Algorithm 3: MT-H
In the implementation, suf and GS can be represented as arrays, while BC can be realized as a trie of multi-track characters. We perform permuted-match instead of exact match when computing GS. Algorithm 1 shows how to construct GS and BC. The array GS is computed by ComputeGS, which uses the array suf computed by ComputeSuf. Note that we compute RI at the beginning of the algorithm (Lines 2 and 24) and do not recompute it when we use the values later.

Proof. From Lemmas 5, 6, and 7, Algorithm 2 needs $O(m(M \log \sigma + \sigma))$ time for preprocessing. Next, $RI(T[:i])$ can be computed in $O(n(N + \sigma))$ time by using radix sort. In the outer while loop starting at line 6, the value of $j$ is increased by at least 1, …

Proof. Similar to the proof of Theorem 8, except that MT-H uses BC only. □

Algorithm 4: Track-trie construction algorithm (constructTrackTrie(P))

Algorithm 5: Track-trie matching algorithm (matchTrackTrie(T, j))
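The bad-character rule of Definition 4 can be illustrated with a Horspool-style search in Python. This is a simplified sketch under stated assumptions, not the paper's Algorithm 3: BC is stored in a dictionary keyed by the sorted column rather than in a trie, each window is verified by sorting its tracks instead of using RI, positions are 0-based, and all function names are hypothetical.

```python
def sorted_column(tracks, pos):
    """sort(C) for the multi-track character at column pos."""
    return tuple(sorted(track[pos] for track in tracks))


def build_bc(pattern):
    """BC table: first occurrence position (1-based) of each sorted
    column in P^R[2:], i.e. scanning columns m-2, m-3, ..., 0."""
    m = len(pattern[0])
    bc = {}
    for k in range(1, m):
        # setdefault keeps the first (smallest) position for a column
        bc.setdefault(sorted_column(pattern, m - 1 - k), k)
    return bc


def mt_horspool(text, pattern):
    """Return all 0-based positions where pattern permuted-matches."""
    n, m = len(text[0]), len(pattern[0])
    bc = build_bc(pattern)
    target = sorted(pattern)
    occurrences = []
    j = 0
    while j + m <= n:
        if sorted(track[j:j + m] for track in text) == target:
            occurrences.append(j)
        # shift by BC of the last column of the window (m if absent)
        j += bc.get(sorted_column(text, j + m - 1), m)
    return occurrences


pattern = ("aaabb", "abbba", "bbaba")
text = ("abbbaba", "baaaabb", "aaabbba")
print(mt_horspool(text, pattern))  # [2]
```

The shift is safe because a permuted match forces every aligned column of the text and the pattern to contain the same multiset of characters, so no occurrence can start inside a skipped range.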

Boyer-Moore and Horspool matching algorithms with track-trie
The two algorithms presented in the previous subsections decide whether two multi-tracks permuted-match by sorting them. In this subsection, we present another idea for this task using a data structure called a track trie. The track trie of a multi-track $P$ stores all the reversed strings of the tracks of $P$, that is, $\{p_1^R, p_2^R, \ldots, p_M^R\}$. Fig. 1(a) shows the track trie of a multi-track pattern $P$ = (aaabb, abbba, bbaba).
Algorithm 4 is the construction algorithm for the track-trie of $P$. For a node $s$ of the track trie and a character $c \in \Sigma$, the goto function $goto(s, c)$ returns the child of $s$ that has an edge labeled $c$. We naturally extend it to the domain $\Sigma^*$ by $goto(s, \varepsilon) = s$ and $goto(s, aw) = goto(goto(s, a), w)$ for any $a \in \Sigma$ and $w \in \Sigma^*$. We also associate a weight with each node to find a mismatch on a text, as we will explain later. For a given multi-track text $T$ and a position $i$, Algorithm 5 finds a mismatch position in two cases: (1) when a track cannot find its goto destination, and (2) when the number of tracks that have the same string $w$ is more than the weight of the node $goto(root, w)$ that represents $w$. These mismatch conditions are illustrated in Fig. 1 (b) and (c), respectively. Fig. 1 (b) shows that the track trie cannot find a transition for the second character b of the third track. On the other hand, Fig. 1 (c) shows that $T_2[3:]$ has two tracks equal to 'bba', whereas $P[3:]$ has only one, i.e. the node that represents 'bba' has weight one. Although the worst-case time complexity remains the same, by using the track-trie both MT-BM and MT-H can match the pattern against the text faster in practice, because we do not need to compute the reverse sorted index of the text. First, we construct the track-trie of the pattern by using constructTrackTrie(P). Then, we replace line 8 (resp. line 7) of Algorithm 2 (resp. Algorithm 3) by matchTrackTrie(T, j) to find a mismatch position.
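The two mismatch conditions can be sketched in Python as follows. This is an illustrative reading of the description above, not the paper's Algorithms 4 and 5: it assumes that the weight of a node is the number of pattern tracks whose reversed string passes through that node, and the names, the 0-based offsets, and the return convention (None for a permuted match) are hypothetical.

```python
from collections import Counter


class Node:
    def __init__(self):
        self.children = {}  # character -> child node
        self.weight = 0     # assumed: pattern tracks passing through


def construct_track_trie(pattern):
    """Insert the reversed tracks of the pattern into a trie."""
    root = Node()
    root.weight = len(pattern)
    for track in pattern:
        node = root
        for c in reversed(track):
            node = node.children.setdefault(c, Node())
            node.weight += 1
    return root


def match_track_trie(root, text, j, m):
    """Scan the window text[.][j:j+m] from right to left.

    Returns None on a permuted match, otherwise the 0-based offset
    inside the window of the column where a mismatch is detected.
    """
    pointers = [root] * len(text)
    for k in range(m - 1, -1, -1):
        counts = Counter()
        for t, track in enumerate(text):
            child = pointers[t].children.get(track[j + k])
            if child is None:            # case (1): no goto destination
                return k
            counts[child] += 1
            if counts[child] > child.weight:  # case (2): too many tracks
                return k
            pointers[t] = child
    return None


root = construct_track_trie(("aaabb", "abbba", "bbaba"))
print(match_track_trie(root, ("bbaba", "aaabb", "abbba"), 0, 5))  # None
print(match_track_trie(root, ("aaabb", "abbba", "bbaaa"), 0, 5))  # 3
print(match_track_trie(root, ("aaabb", "abbba", "bbbba"), 0, 5))  # 2
```

When the text and the pattern have the same number of tracks, reaching depth $m$ without violating either condition means the per-node counts equal the weights, so the window is a permutation of the pattern.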

Multi-track AC-automaton
In this section, we explain a data structure called a multi-track AC-automaton that can perform dictionary matching on multi-tracks. Given a set $D = \{P_1, P_2, \ldots, P_r\}$ of multi-track patterns called a dictionary and a multi-track text $T$, by preprocessing the patterns, the multi-track AC-automaton can find all occurrence positions of each pattern in the text. Let $d = \sum_{i=1}^{r} m_i$ be the total length of the patterns in $D$, where …

Unlike the original AC-automaton, the multi-track AC-automaton uses a multi-track character instead of a single character to define goto. The states and goto in $MTAC(D)$ form a trie of $sort(P_i)$ for all $P_i \in D$. Each state in $MTAC(D)$ represents a prefix of $sort(P_i)$, thus each state can be denoted by $S(W)$, where $W$ is the string obtained by concatenating the labels of the edges from the root to the state. Therefore, we can define $goto(S(P_i[:j]), P_i[j+1]) = S(P_i[:j+1])$ for $1 \le i \le r$ and $1 \le j < m_i$. For convenience, we denote goto(goto(s, …

For a state $s$ and a multi-track character $C$, $goto(s, C)$ can be implemented by using a multi-track character trie of depth at most $M$, thus $goto(s, C)$ can be executed in $O(M \log \sigma)$ time. The function goto can be constructed by using Algorithm 6.
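The trie of sorted patterns that underlies the goto function can be sketched in Python as below. This covers only the goto construction and the initial output sets, not the failure function or the matching procedure, and it is not the paper's Algorithm 6: each transition is looked up in a dictionary keyed by the whole multi-track character instead of the character trie described above, and the names and the integer state numbering are hypothetical.

```python
def build_goto(dictionary):
    """Build a trie over the columns of sort(P_i) for every pattern.

    goto[s] maps a multi-track character (a tuple of characters) to
    the next state; output[s] holds the indices of the patterns whose
    sorted form ends at state s. State 0 is the root.
    """
    goto = [{}]
    output = [set()]
    for idx, pattern in enumerate(dictionary):
        tracks = sorted(pattern)          # sort(P_i)
        state = 0
        for column in zip(*tracks):       # one multi-track character
            nxt = goto[state].get(column)
            if nxt is None:
                goto.append({})
                output.append(set())
                nxt = len(goto) - 1
                goto[state][column] = nxt
            state = nxt
        output[state].add(idx)
    return goto, output


D = [("ab", "ba"), ("ba", "ab"), ("aa", "ab")]
goto, output = build_goto(D)
print(len(goto))   # 5
print(output[2])   # {0, 1}
print(output[4])   # {2}
```

Patterns that are permutations of each other share the same sorted form, so they end in the same state and are reported together.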
Next, the failure function of a state $S(P_i[:j])$ is defined as $flink(S(P_i[:j])) = S(sort(P_i[k:j]))$, where $P_i[k:j]$ is the longest proper suffix of $P_i[:j]$ such that $P_i[k:j]$ is a prefix of some $sort(P_\ell)$ with $P_\ell \in D$. Algorithm 7 shows a construction algorithm for the failure function of a multi-track AC-automaton.
Finally, the output function of the multi-track AC-automaton is similar to that of the original AC-automaton. For a state $S(P_i[:j])$, the output $output(S(P_i[:j]))$ is the set of patterns $P_\ell \in D$ such that $P_\ell \overset{\bowtie}{=} P_i[k:j]$ for some $1 \le k \le j$. The initial output function is constructed by Algorithm 6, and then updated by Algorithm 7 to get the final output function. Fig. 2 shows an example of $MTAC(D)$. In order to simplify the construction algorithm, we use a special state that reads any multi-track character to get to the root state.

Proof. The running time of Algorithm 8 can be evaluated by counting the number of executions of goto. First, for each $i$, goto is executed at least once on the activeState transition. Next, goto is executed to check whether the transition fails or not. In this case, the number of executions of goto is the same as that of failure. The latter is at most $n$, because whenever goto is executed, the depth of activeState is increased by one, and whenever failure is executed, the depth of activeState is decreased by at least one. Therefore, the number of executions of goto is $O(n)$.

Multi-track permuted matching automaton
In this section, we describe a data structure called a multi-track permuted matching automaton that can perform permuted pattern matching on a multi-track text $T$ online, by preprocessing a multi-track pattern $P$. A multi-track permuted matching automaton is constructed from two functions, goto and failure. In addition, similarly to a track-trie, each state of the multi-track permuted matching automaton has a weight in order to determine whether failure should be executed or not. Fig. 3 shows an example of a multi-track permuted matching automaton.

Figure 3. Multi-track permuted matching automaton of $P$ = (aaabb, abaab, bbaaa). The asterisk '*' is a special character that matches any character in $\Sigma$.
For a multi-track pattern $P = (p_1, p_2, \ldots, p_M)$, the multi-track permuted matching automaton of the pattern is denoted by $MTPMA(P)$. The goto function of the multi-track permuted matching automaton is similar to that of an AC-automaton; thus, each state in $MTPMA(P)$ represents a prefix of some $p_i$ and is denoted by $S(w)$, where $w$ is the string obtained by concatenating the labels of the edges from the root to the state. Each state $S(w)$ has a weight, which is the number of tracks containing $w$ as a prefix. Moreover, a state $S(w)$ is called an accept state if $w = p_i$ for some $i$. Algorithm 9 constructs the goto function of a multi-track permuted matching automaton. Algorithm 10 constructs the failure function of a multi-track permuted matching automaton. We use a state pointer for each track in the pattern. Similarly to a track-trie, there are two conditions that are considered as failure in a multi-track permuted matching automaton. The first condition is when the automaton cannot find the goto transition, and the second condition is when the number of state pointers in a state is more than the weight of the state.
Theorem 16. Algorithm 10 constructs the failure function of a multi-track permuted matching automaton in $O(mM \log \sigma)$ time.

We set the parameter values as follows: $n = 100000$, $m = 10$, $N = M = 1000$, and $\sigma = 2$, and changed one of the parameters in each experiment to see how the running time of the algorithms depends on it. We used randomly generated texts and patterns, and inserted 50 occurrences of a pattern into each text to make sure that the pattern occurs in the text.
The results of the experiments are shown in Fig. 4 (a)-(d), where one of the parameters $n$, $N$, $m$, and $\sigma$ is changed, respectively. First, we can see that the running times of the algorithms increase linearly with respect to the length and the track count of the text, and are not much affected by the pattern length or the alphabet size. The running times of MT-BM and MT-H are almost the same, and both algorithms run faster when a track-trie is used.
The multi-track AC-automaton is slower than the MTKMP algorithm on single-pattern matching, although it can support dictionary matching on multi-track strings. We can also see that the multi-track permuted matching automaton runs faster than the MTKMP and Filter-MTKMP algorithms, as it is an improvement of the MTKMP algorithm.

Concluding remarks
In this paper, we focused on the full permuted pattern matching problem, where the track count $N$ of a text equals the track count $M$ of a pattern. In general, the permuted pattern matching problem is more difficult if $N > M$. For example, when we construct GS, we might miss occurrences of the pattern if we use the same shift as in the case of full permuted pattern matching. This problem also arises in the multi-track AC-automaton and the multi-track permuted matching automaton when we try to construct the failure function. We should find another condition to define the failure function for these algorithms.