Permuted Pattern Matching Algorithms on Multi-Track Strings

Diptarama Hendrian; Yohei Ueki; Kazuyuki Narisawa; Ryo Yoshinaka; Ayumi Shinohara

doi:10.3390/a12040073

Abstract

A multi-track string is a tuple of strings of the same length. Given the pattern and text of two multi-track strings, the permuted pattern matching problem is to find the occurrence positions of all permutations of the pattern in the text. In this paper, we propose several algorithms for permuted pattern matching. Our first algorithm, which is based on the Knuth–Morris–Pratt (KMP) algorithm, has a fast theoretical computing time with

O (m k)

as the preprocessing time and

O (n k log σ)

as the matching time, where n, m, k,

σ

, and

occ

denote the length of the text, the length of the pattern, the number of strings in the multi-track, the alphabet size, and the number of occurrences of the pattern, respectively. We then improve the KMP-based algorithm by using an automaton, which has a better experimental running time. The next proposed algorithms are based on the Boyer–Moore algorithm and the Horspool algorithm that try to perform pattern matching. These algorithms are the fastest experimental algorithms. Furthermore, we propose an extension of the AC-automaton algorithm that can solve dictionary matching on multi-tracks, which is a task to find multiple multi-track patterns in a multi-track text. Finally, we propose filtering algorithms that can perform permuted pattern matching quickly in practice.

Keywords:

multi-track string; permuted pattern matching; AC-automaton

1. Introduction

The pattern matching problem on strings is to find all occurrence positions of a pattern string in a text string. Pattern matching algorithms such as the Knuth–Morris–Pratt (KMP) algorithm [1], the Boyer–Moore algorithm [2], and the Horspool algorithm [3] perform pattern matching quickly by preprocessing the pattern. On the other hand, pattern matching can be accelerated by preprocessing the text into some indexing structure such as suffix trees [4], suffix arrays [5], or position heaps [6].

The permuted pattern matching problem, proposed by Katsura et al. [7,8], is a generalization of the pattern matching problem, where we compare tuples of strings. We call a tuple of strings of the same length a multi-track string. The permuted pattern matching problem is as follows: given two multi-track strings

T = (T_{1}, T_{2}, \dots, T_{k})

and

P = (P_{1}, P_{2}, \dots, P_{k})

where

| T_{1} | = \dots = | T_{k} | = n \geq | P_{1} | = \dots = | P_{k} | = m

, to find all positions i such that

(T_{1} [i : i + m - 1], \dots, T_{k} [i : i + m - 1])

is a permutation of

P

. The problem can be solved by constructing indexing structures from the text such as multi-track suffix trees [7], multi-track position heaps [8], and filtering multi-set trees [9], or by preprocessing the pattern such as the AC-automaton-based algorithm [7].

Multi-track strings can model various types of real data when sequences in the data have the same meaning. For example, data from a three-dimensional accelerometer in a smartphone can be considered a multi-track string. Consider two smartphones with the same position, and rotate one of them several times where each rotation is 90 degrees on the x, y, or z axis. If we perform the same motion on both smartphones, the generated data from three-dimensional accelerometer of one phone is a permutation of the other one. By using permuted pattern matching, we can find occurrences of a pattern on both smartphones without permuting the pattern.

In this paper, we propose several algorithms that solve permuted pattern matching quickly. These algorithms can be classified into three groups. The first group consists of algorithms that are based on the KMP algorithm [1]. This group includes the multi-track KMP, multi-track AC-automaton, and multi-track permuted matching automaton algorithms. The second group consists of algorithms that are based on the Boyer–Moore algorithm [2] and the Horspool algorithm [3]. This group includes the multi-track Boyer–Moore and multi-track Horspool algorithms. The third group consists of filtering algorithms. Moreover, we conduct experiments and show that our algorithms perform permuted pattern matching faster than existing algorithms. The worst case running times of the proposed algorithms and existing algorithms are summarized in Table 1, where d is the total length of the patterns and

σ

is the size of the alphabet.

Table 1. Comparison of the algorithms for permuted pattern matching. Multi-track AC-automaton can find occurrences of multiple patterns.

Preliminary versions of this work appeared in [10,11].

2. Notation and Definition on Multi-Track Strings

Let

Σ

be a totally-ordered alphabet and

σ = | Σ |

be the size of the alphabet. An element of

Σ^{*}

is called a string. For a string

W \in Σ^{*}

, the length of W is denoted by

| W |

. The empty string, denoted by

ε

, is a string of length zero. For a string

W \in Σ^{*}

of length n,

W [i]

denotes the

i^{th}

symbol of W, and

W [i : j] = W [i] W [i + 1] \dots W [j]

denotes a substring of W that begins at i and ends at j for

1 \leq i \leq j \leq n

. For convenience, we abbreviate

W [1 : i]

to

W [: i]

and

W [i : n]

to

W [i :]

; these terms are known as the prefix and the suffix of W, respectively. Moreover, let

W [i : j] = ε

if

i > j

. The reverse string of W is denoted by

W^{R} = W [n] W [n - 1] \dots W [2] W [1]

. For two strings, X and Y,

X ≺ Y

denotes that X is lexicographically smaller than Y, and

X ⪯ Y

denotes that either

X = Y

or

X ≺ Y

.

A multi-track symbol

C = (c_{1}, c_{2}, \dots, c_{k})

is a k-tuple of symbols

c_{i} \in Σ

. A multi-track string (or multi-track for short)

W = (W_{1}, W_{2}, \dots, W_{k})

is a k-tuple of strings

W_{i} \in Σ^{*}

where

| W_{1} | = | W_{2} | = \dots = | W_{k} |

, and each

W_{i}

is called the

i^{th}

track of

W

. The length n of strings in

W

is called the length of

W

and denoted by

{| W |}_{len}

, and the number k of tracks in

W

is called the track count of

W

and is denoted by

{| W |}_{num}

. For a multi-track symbol

C = (c_{1}, c_{2}, \dots, c_{k})

, we define

C [i] = c_{i}

. For a multi-track

W

of track count

k = {| W |}_{num}

,

W [i] = (W_{1} [i], W_{2} [i], \dots, W_{k} [i])

denotes the

i^{th}

multi-track symbol, and

W [i] [j]

denotes

W_{j} [i]

for

1 \leq i \leq {| W |}_{len}

and

1 \leq j \leq {| W |}_{num}

. Moreover,

W [i : j] = (W_{1} [i : j], W_{2} [i : j], \dots, W_{k} [i : j])

denotes a substring of

W

that begins at i and ends at j for

1 \leq i \leq j \leq {| W |}_{len}

. Similarly to the notation for strings,

W [: i]

and

W [i :]

mean

W [1 : i]

and

W [i : | W |_{len}]

, which are called the prefix and the suffix of

W

, respectively.

Let

r = (r_{1}, r_{2}, \dots, r_{k})

be a permutation of

(1, 2, \dots, k)

. For a multi-track

W = (W_{1}, W_{2}, \dots, W_{k})

,

W ⟨ r ⟩ = W ⟨ r_{1}, r_{2}, \dots, r_{k} ⟩ = (W_{r_{1}}, \dots, W_{r_{k}})

is called a permuted multi-track of

W

. The sorted index

SI (W)

of a multi-track

W

is a permutation

(r_{1}, \dots, r_{k})

such that

W_{r_{i}} ⪯ W_{r_{i + 1}}

for any

1 \leq i < k

, where we assume

r_{i} < r_{i + 1}

in the case

W_{r_{i}} = W_{r_{i + 1}}

. The sorted multi-track

sort (W)

is defined as

W ⟨ SI (W) ⟩

. The reverse of a multi-track is

W = (W_{1}, \dots, W_{k})

is

W^{R} = (W_{1}^{R}, \dots, W_{k}^{R})

. The sorted index of the reverse multi-track, denoted by

RI (W)

, is a permutation

(r_{1}, \dots, r_{k})

such that

w_{r_{i}}^{R} ⪯ w_{r_{i + 1}}^{R}

for any

1 \leq i < k

.

Lemma 1

([7]). Let

{SI}_{W}

be a two-dimensional array such that

{SI}_{W} [i] = SI (W [i :])

for

1 \leq i \leq n

.

{SI}_{W}

can be computed in

O (n k)

time ffline, where

n = {| P |}_{len}

and

k = {| P |}_{num}

.

Lemma 2.

Let

{RI}_{W}

be a two-dimensional array such that

{RI}_{W} [i] = RI (W [i :])

for

1 \leq i \leq n

.

{RI}_{W}

can be computed in

O (n (k + σ))

time nline, where

n = {| P |}_{len}

and

k = {| P |}_{num}

.

Proof.

Clearly,

{RI}_{W}

can be computed in

O (n (k + σ))

by using a radix sort algorithm. □

For two multi-tracks

X = (X_{1}, X_{2}, \dots, X_{k})

and

Y = (Y_{1}, Y_{2}, \dots, Y_{k})

of the same length, we say that

X

permuted-matches

Y

if

X = Y ⟨ r ⟩

for some permutation

r

, and it is denoted by

X ⋈ Y

. □

Lemma 3

([7]). For two multi-tracks

X

and

Y

,

X ⋈ Y

if and only if

sort (X) = sort (Y)

.

Throughout this paper, we assume that

P

is a pattern with

{| P |}_{num} = k

and

{| P |}_{len} = m

, and

T

is text with

{| T |}_{num} = k

and

{| T |}_{len} = n \geq m

. The pattern matching problem on multi-tracks is defined as follows.

Definition 1

(Permuted pattern matching ([7]). Given a multi-track text

T

and a multi-track pattern

P

, compute all positions i that satisfy

P ⋈ T [i : i + m - 1]

. We call such positions ccurrence position of

P

in

T

.

For example, given a text

T = (\begin{matrix} aabaaaaa & , \\ abaabbaa & , \\ baaababa \end{matrix})

and a pattern

P = (\begin{matrix} aba & , \\ baa & , \\ aaa \end{matrix})

, we can see that the pattern

P

matches

T

at two because

T [2 : 4] = P

. Moreover, the pattern

P

permuted-matches

T

at six, since

P ⟨ 3, 2, 1 ⟩ = (\begin{matrix} aaa & , \\ baa & , \\ aba \end{matrix}) = T [6 : 8]

. Therefore, we should output two and six in this case.

We remark that Katsura et al. [7] defined a more general problem called the sub-permuted pattern matching, where we have

{| T |}_{num} \geq {| P |}_{num}

, and our task is to find a partial permutation

(r_{1}, \dots, r_{{| P |}_{num}})

of

(1, \dots, | T |_{num})

and a position i for which

P ⋈ T ⟨ r_{1}, \dots, r_{{| P |}_{num}} ⟩ [i : i + m - 1]

holds. However, in this paper, we only consider the case

{| T |}_{num} = {| P |}_{num}

, which was called full-permuted pattern matching in [7].

3. KMP-Based Permuted Pattern Matching Algorithms

In this section, we propose three algorithms for permuted pattern matching based on the KMP algorithm. In Section 3.1, we introduce our first algorithm, the multi-track KMP algorithm (MTKMP algorithm) which uses the border array of a multi-track pattern. We remark that this algorithm is a variant of the generalized KMP algorithm proposed by Matsuoka et al. [12]. They showed that the KMP algorithm can be applied to string classes that satisfy some equality and border properties. In this paper, we show some properties of multi-track strings so that the KMP algorithm can be applied to multi-track strings and give the exact computation time of the algorithm.

In Section 3.2, we extend it to the multi-track AC-automaton algorithm, which can perform dictionary matching on multi-tracks. Last, in Section 3.3, we introduce another extension of the MTKMP algorithm, the permuted matching automaton algorithm, which can perform permuted pattern matching faster than the MTKMP algorithm.

3.1. Multi-Track KMP Algorithm

Similar to the original KMP algorithms, the multi-track KMP algorithm uses multi-track border arrays to compute the shift amount when the algorithm finds a mismatch. First, we define borders for multi track strings.

Definition 2

(Borders of multi-tracks). A orde of a multi-track

P

is any multi-track that permuted-matches with a prefix and a suffix of

P

. Moreover, a border

B

of

P

is called rope if

B

P

.

We can easily confirm the following lemma.

Lemma 4.

For any two borders

B_{1}

and

B_{2}

of

P

, if

| B_{1} |_{len} < {| B_{2} |}_{len}

, then

B_{1}

is a border of

B_{2}

.

Let

T [i : i + j] ⋈ P [1 : j + 1]

for some i and j. If

T [i + x : i + x + m - 1] ⋈ P [1 : m]

for some

x \leq j

, then

T [i : i + (j - x)] ⋈ P [1 : (j - x) + 1]

and

T [i + x : i + j] ⋈ P [1 : (j - x) + 1]

hold, i.e.,

P [1 : (j - x) + 1]

is a proper border of

P [1 : j + 1]

. Therefore, if

T [i : i + j] ⋈ P [1 : j + 1]

and

T [i : i + j]

P [1 : j + 1]

, we can safely shift

P

by x, where

(j - x + 1)

is the length of the longest proper border of

P [1 : j + 1]

. We store the length of the longest proper border of

P [1 : j + 1]

for

1 \leq j \leq m

as the border array of

P

.

Definition 3

(Multi-track border array). Given a multi-track

P

of

{| P |}_{len} = m

, he border arra

{Border}_{P}

of

P

is an array of length

m + 1

, where the

i^{th}

element of

{Border}_{P}

is the length of the longest proper border of

P [1 : i]

. Formally,

{Border}_{P} [i] = max {j ∣ P [1 : j] ⋈ P [i - j + 1 : i], 0 \leq j < i} .

(1)

For algorithmic purpose, we define

{Border}_{P} [0] = - 1

. Algorithm 1 constructs the multi-track border array of an input

P

.

Lemma 5.

Given a multi-track

P

of

{| P |}_{len} = m

and

{| P |}_{num} = k

, Algorithm 1 constructs the border array of

P

in

O (m k)

time.

Proof.

First, we show the correctness of the algorithm by induction. Assume we have computed the longest proper border of

P [: j]

and stored it in

Border [j]

for

1 \leq j \leq i

. Let

P [: b]

be the longest proper border of

P [: i + 1]

. Clearly,

P [: b - 1]

is a border of

P [: i]

. Using Lemma 4, we can find

b - 1

by finding

max {j ∣ P [: j] is a proper border of P [: i] and P [: j + 1] ⋈ P [i - j + 1 : i + 1]}

. Since

Border [j]

is the longest proper border of

P [: j]

, the algorithm can find

b - 1

by updating

j \leftarrow Border [j]

from

j = i

until

P [: Border [j] + 1] ⋈ P [i - Border [j] + 1 : i + 1]

holds. Therefore, Algorithm 1 computes the border array of

P

correctly.

Next, we show the computation time of the algorithm. Using Lemma 1,

SI

can be computed in

O (m k)

. Both while loops are called

O (m)

times at most, since

i - j \leq i \leq m + 1

, and the value of i always increases each time the outer loop is called, while the value of

i - j

always increases each time the inner loop is called. Each comparison of

P [j] ⟨ SI (P [1 :]) ⟩

and

P [i] ⟨ SI (P [i - j :]) ⟩

consumes

O (k)

time; hence, Algorithm 1 runs in

O (m k)

time. □

Algorithm 2 shows the pseudocode of the MTKMPalgorithm. The MTKMP algorithm performs permuted pattern matching from left to right of the pattern and the text. The sorted pattern

P_{sort}

and the array

SI

save the sorted indices of suffixes of text used to perform permuted-matching between the pattern and a substring of the text. If symbols on the text and the pattern are mismatched, we shift the matching position of the pattern using

Border

. If the pattern matches a substring of the text, then the MTKMP algorithm outputs the occurrence position at the text and shifts the pattern according to

Border

.

Algorithm 1: Multi-track border array construction algorithm.

Algorithm 2: The multi-track KMP algorithm.

Theorem 1.

Given a text

T

and a pattern

P

, the MTKMP algorithm computes all occurrence positions of

P

in

T

in

O (n k + m k)

time.

Proof.

According to the above arguments, we can shift

P

by x, where

(j - x + 1)

is the length of the longest proper border of

P [1 : j + 1]

if

T [i : i + j] ⋈ P [1 : j + 1]

and

T [i : i + j]

P [1 : j + 1]

. Since

Border [i]

is the longest proper border of

P [: i]

, the algorithm finds the occurrence position of

P

in

T

correctly.

Using Lemma 1,

SI

can be computed in

O (n k)

. From Lemma 5, constructMTBorderArray

P

runs in

O (m k)

. Both while loops are called

O (n)

times at most, since

i - j \leq i \leq n + 1

, and the value of i always increases each time the outer loop is called, while the value of

i - j

always increases each time the inner loop is called. Each comparison of

P [j] ⟨ SI (P [1 :]) ⟩

and

T [i] ⟨ SI (T [i - j :]) ⟩

consumes

O (k)

time. Hence, Algorithm 2 runs in

O (n k + m k)

time. □

3.2. Multi-Track AC-Automaton

In this subsection, we introduce a data structure called a multi-track AC-automaton that can perform dictionary matching on multi-tracks, namely permuted dictionary matching. Let

D = {P_{1}, P_{2}, \dots, P_{r}}

be a set of patterns of the same track count k, called a dictionary, and let

d = \sum_{i = 1}^{r} m_{i}

be the total length of the patterns in D, where

m_{i} = {| P_{i} |}_{len}

. The multi-track AC-automaton of D, denoted by

MTAC (D)

, consists of a set of states and three functions: goto, failure, and output functions. Figure 1 shows an example of

MTAC (D)

.

Figure 1. The multi-track AC-automaton

MTAC (D)

for

D = {P_{1}, P_{2}, P_{3}}

, where

P_{1} = (aaabb, abaab, bbaaa)

,

P_{2} = (abab, abba, bbab)

, and

P_{3} = (aabbab, bababb, baaaab)

. The asterisk “*” is a special symbol that matches any symbol in

Σ

.

For a dictionary D, the states of

MTAC (D)

are

S = {sort (P_{i}) [: j] ∣ P_{i} \in D, 0 \leq j \leq m_{i}}

. Each state corresponds to a prefix of

sort (P_{i})

for some

P_{i} \in D

. Unlike the original AC-automaton, the multi-track AC-automaton uses a multi-track symbol, instead of a single symbol, to define the goto function. The goto, failure, and output functions of

MTAC (D)

are defined as follows.

Definition 4

(Goto function). We define the goto function δ of

MTAC (D)

by

δ (sort (P_{i}) [: j]), C) = sort (P_{i}) [: j + 1]

iff

C = sort (P_{i}) [j + 1]

for a state

sort (P_{i}) [: j]

and a multi-track symbol

C

.

Definition 5

(Failure function). The failure link of a state

sort (P_{i}) [: j]

is defined as

flink (sort (P_{i}) [: j]) = sort (P_{i} [l : j])

, where

P_{i} [l : j]

is the longest proper suffix of

P_{i} [: j]

such that

P_{i} [l : j]

permuted-matches a prefix of some pattern in D.

Definition 6

(Output function).The output

output (P_{i} [: j])

of a state

sort (P_{i}) [: j]

is the set of patterns that permuted-matches

P_{i} [l : j]

for some

1 \leq l \leq j

.

Next, we explain how to construct

MTAC (D)

. We use multi-track symbol tries to implement the goto function.

Definition 7

(Multi-track symbol trie). Let

C = {C_{1}, . . ., C_{z}}

be a set of multi-track symbols. The multi-track symbol trie of

C

is the trie for the set of strings

{C_{1} [1] C_{1} [2] \dots C_{1} [| C_{1} |_{num}], \dots, C_{z} [1] C_{z} [2] \dots C_{z} [| C_{z} |_{num}]}

.

For a state

s \in S

, we use the multi-track symbol trie of a set

{C ∣ δ (s, C) \neq NULL}

and each leaf of multi-track symbol tries labeled by

δ (s, C)

. Since a child of a node in multi-track symbol tries can be searched in

O (log σ)

time,

δ (s, C)

can be computed in

O (k log σ)

time. The goto function

δ

can be constructed by using Algorithm 3.

Algorithm 3: Multi-track AC-automaton goto function construction algorithm.

Lemma 6.

Algorithm 3 constructs the goto function of the multi-track AC-automaton of a dictionary

D = {P_{1}, P_{2}, \dots, P_{r}}

in

O (d k log σ)

time.

Proof.

The goto function

δ (s, C)

is executed

O (d)

times, and

δ (s, C)

can be computed in

O (k log σ)

time. □

After constructing the goto function, we can construct the failure function of

MTAC (D)

. Algorithm 4 shows a construction algorithm for the failure function of a multi-track AC-automaton. In order to simplify the construction algorithm, we use a special state that reads any multi-track symbol to get to the root state.

Algorithm 4: Multi-track AC-automaton failure function construction algorithm.

Lemma 7.

Algorithm 4 constructs the failure function of the multi-track AC-automaton of a dictionary

D = {P_{1}, P_{2}, \dots, P_{r}}

in

O (d k log σ)

time.

Proof.

We can bound the running time of Algorithm 4 by counting the number of executions of

δ (s, C)

. For each pattern

P_{i}

, let

s_{i, j}

be a state such that

s_{i, j} = δ (root, P_{i} [: j])

for

1 \leq j \leq m_{i}

. Let

f_{i, j}

be the number of executions of

flink

when finding

flink (s_{i, j})

. The maximum value of

f_{i, j}

is bounded by

depth (flink (s_{i, j - 1})) + 1

. Because the depth of

flink (s_{i, j})

is, at most,

depth (flink (s_{i, j - 1})) - f_{i, j} + 1

, we get

f_{i, j} \leq depth (flink (s_{i, j - 2})) - f_{i, j - 1} + 2

recursively. By solving this formula, we get

\sum_{j = 1}^{m_{i}} f_{i, j} \leq 2 m_{i}

and

\sum_{i = 1}^{r} \sum_{j = 1}^{m_{i}} f_{i, j} \leq \sum_{i = 1}^{r} 2 m_{i} = 2 d

. Moreover, each

δ (s, C)

is executed in

O (k log σ)

time. □

Last, by using the goto and failure functions, Algorithm 4 constructs the output function of a multi-track AC-automaton.

Lemma 8.

Algorithm 5 constructs the output function of the multi-track AC-automaton of a dictionary

D = {P_{1}, P_{2}, \dots, P_{r}}

in

O (d k log σ)

time.

Proof.

Updating the output function can be performed in

O (1)

time by using lists to save the output function and updating it by concatenating the list (see [13]). The proof is similar to the proofs of Lemmas 3 and 4. □

From Lemmas 6–8, we get the following theorem.

Theorem 2.

The multi-track AC-automaton of a dictionary

D = {P_{1}, P_{2}, \dots, P_{r}}

can be constructed in

O (d k log σ)

time.

Algorithm 5: Multi-track AC-automaton output function construction algorithm.

By using the multi-track AC-automaton of D, we can perform permuted dictionary matching on a text

T

, as shown in Algorithm 6. Let

v

be the current state of the multi-track AC-automaton. For each position i on

T

, Algorithm 6 uses the sorted index of

T [i - depth (v) :]

to determine the permutation of

T [i]

that is used in the goto function.

Algorithm 6: Permuted dictionary matching algorithm by using multi-track AC-automaton.

Theorem 3.

Permuted dictionary matching on a multi-track text

T

can be performed in

O (n k log σ)

time by using the multi-track AC-automaton of D.

Proof.

The running time of Algorithm 6 can be evaluated by counting the number of executions of

δ (s, C)

. First, for each i,

δ (s, C)

is executed at least once on the

v

transition. Next,

δ (s, C)

is executed to check whether the transition is

NULL

or not. In this case, the number of executions of

δ (s, C)

is the same as that of

flink

. The latter is, at most, n, because whenever

δ (s, C)

is executed, the depth of

v

is increased by one, and whenever

flink

is executed, the depth of

v

is decreased by at least one. Therefore, the number of executions of

δ (s, C)

is

O (n)

. □

3.3. Multi-Track Permuted Matching Automaton

In this subsection, we describe a data structure called a multi-track permuted matching automaton that can perform permuted pattern matching on a multi-track text

T

online, by preprocessing a multi-track pattern

P

. For a multi-track pattern

P = (P_{1}, P_{2}, . . ., P_{k})

, the multi-track permuted matching automaton of

P

, denoted by

MTPMA (P)

, consists of a set of states and two functions, goto and failure functions. In addition, each state of the multi-track permuted matching automaton has a weight in order to determine whether

flink

should be executed or not. Figure 2 shows an example of a multi-track permuted matching automaton.

Figure 2. The multi-track permuted matching automaton

MTPMA (P)

for

P = (aaabb, abaab, bbaaa)

. The asterisk is a special symbol that matches with any symbols in

Σ

.

Let

S

be the set of states of

MTPMA (P)

. Each state of

MTPMA (P)

corresponds to a prefix of

P_{i}

, i.e.,

S = {P_{i} [: j] ∣ 1 \leq i \leq k, 0 \leq j \leq m}

. Each state s has a weight, which is the number of tracks containing s as a prefix. Moreover, a state s is called an accept state if

s = P_{i}

for some i. The goto and failure functions of

MTPMA (P)

are defined as follows.

Definition 8

(Goto function).We define the goto function δ as

δ (P_{i} [: j], c) = P_{i} [: j + 1]

iff

c = P_{i} [j + 1]

for each state

P_{i} [: j]

and symbol c.

Definition 9

(Failure function). Let

S_{j} = {P_{i} [: j] ∣ 1 \leq i \leq k}

be the set of states whose depth is j. The failure link of the state is

flink (P_{i} [: j]) = P_{x} [: y] \in S_{y}

such that

P_{x} [: y]

is a proper suffix of

P_{i} [: j]

and

P [: y]

is the longest prefix of

P

that permuted-matches with a proper suffix of

P [: j]

. The latter condition is similar to that of the multi-track KMP algorithm.

Next, we explain how to construct

MTPMA (P)

. Algorithm 7 constructs the goto function of a multi-track permuted matching automaton. We add a weight on each state each time we visit the state when reading each track.

Lemma 9.

Algorithm 7 constructs the goto function of

MTPMA (P)

in

O (m k log σ)

time.

Proof.

For each track, the number of executions of

δ (s, c)

is m, and there are k tracks in a pattern

P

. Moreover,

δ (s, c)

can be executed in

O (log σ)

time. □

After constructing the goto function, Algorithm 8 constructs the failure function of

MTPMA (P)

. For each state s, we use a multi-track border array to determine the depth of

flink (s)

.

Lemma 10.

Algorithm 8 constructs the failure function of

P

in

O (m k log σ)

time.

Proof.

Similarly to the proof of Theorem 7, the failure and goto functions are executed

O (m k)

times. Moreover, the execution time of the failure function is

O (1)

, and that of the goto function is

O (log σ)

. □

From Lemmas 9 and 10 we get the following theorem.

Theorem 4.

The multi-track permuted matching automaton of a pattern

P

can be constructed in

O (m k log σ)

time.

Finally, by using

MTPMA (P)

, Algorithm 9 can perform permuted pattern matching on a multi-track text

T

. Algorithm 9 uses k pointers

activeStates

to point to the current states. Note that all

activeStates

always have the same depth. Algorithm 9 uses two conditions to determine whether it should execute the failure function or not. The first condition is when it cannot find the goto transition, and the second condition is when the number of state pointers in the state is more than the weight of the state. If any of the

activeStates

fail, then all of the

activeStates

execute the failure function, otherwise

activeStates

executes the goto function.

Theorem 5.

By using

MTPMA (P)

, Algorithm 9 performs permuted pattern matching on a multi-track string

T

in

O (n k log σ)

time.

Proof.

Similarly to the proof of Theorem 3, the number of executions of the failure and goto function is

O (n k)

. Since the execution time of the failure function is

O (1)

and the goto function is

O (log σ)

, Algorithm 9 runs in

O (n k log σ)

time. □

Algorithm 7: Multi-track permuted matching automaton goto function construction algorithm.

Algorithm 8: Multi-track permuted matching automaton failure function construction algorithm.

Algorithm 9: Multi-track permuted matching automaton matching algorithm.

4. Multi-Track Boyer–Moore and Horspool Algorithms

In this section, we propose two permuted pattern matching algorithms that are based on the Boyer–Moore algorithm and the Horspool algorithm, which we call MT-BM and MT-H , respectively. The original Boyer–Moore algorithm uses two functions

GoodSuf

(good suffixes) and

BadSym

(bad symbols), and the Horspool algorithm only uses

BadSym

to determine how much the position of a substring to compare should be shifted when a mismatch is found between the input patten and the substring of the text. For multi-track strings, we use a permuted match to define

GoodSuf

instead of a complete match, and we use sorted multi-track symbols for

BadSym

instead of symbols. Formally, those functions are defined as follows on multi-tracks.

Definition 10

(Good suffixes function). For a multi-track

P

of length

{| P |}_{len} = m

, let

A [i] = {0 < s < i ∣ P [i - s + 1 : m - s] ⋈ P [i + 1 : m], P [i - s : m - s]

P [i : m]}

and

B [i] = {i \leq s < m ∣ P [1 : m - s] ⋈ P [s + 1 : m]}

. The good suffixes function is defined as

{GoodSuf}_{P} [m] = 1

and

{GoodSuf}_{P} [i] = min A [i] \cup B [i] \cup {m}

for

0 \leq i < m

.

Definition 11

(Bad symbols function). For a multi-track

P

of length

{| P |}_{len} = m

and a multi-track symbol

C

,

{BadSym}_{P} (C)

is the first occurrence position of

sort (C)

in

P^{R} [2 :]

. The function

{BadSym}_{P} (C)

returns m if there is no occurrence of

sort (C)

in

P^{R} [2 :]

.

In the implementation,

GoodSuf

is computed by using an array, while

BadSym

can be computed by using a trie of the multi-track symbols. We perform permuted-match instead of exact matching when computing

GoodSuf

. We also use another array

suf

to compute

GoodSuf

.

Definition 12

(Suffixes). For a multi-track

P

of length

{| P |}_{len} = m

,

{suf}_{P} [i]

is the maximum value of l such that

P [i - l + 1 : i] ⋈ P [m - l + 1 : m]

for

1 \leq i \leq m

.

Algorithm 10 shows how to construct

GoodSuf

and

BadSym

. The array

GoodSuf

is computed by

ComputeGoodSuf

, which uses array

suf

computed by

ComputeSuf

. Note that we compute

RI (P [: i])

at the beginning (Lines 2 and 26) of the algorithm and will not recompute them when we use the values later.

Algorithm 10: MT-BM and MT-H preprocessing functions.

Lemma 11.

The function

ComputeSuf

computes the array

suf

in

O (m (k + σ))

time.

Proof.

First,

RI

can be computed in

O (m (k + σ))

time by using radix sort. The for loop is executed

m - 1

times, and the while loop in Line 11 is executed m times at most through the whole run, because k is always reduced in each loop. A comparison of two multi-track symbols of the pattern that was executed in each loop can be computed in

O (k)

time. □

Lemma 12.

The function

ComputeGoodSuf

computes

GoodSuf

in

O (m)

time.

Proof.

All the for loops are executed m times at most. The while loop is executed m times at most through the whole execution of the algorithm, since j is always increased and does not exceed m. □

Lemma 13.

The function

ComputeBadSym

computes

BadSym

in

O (m (k log σ + σ))

time.

Proof.

RI

can be computed in

O (m (k + σ))

time by using radix sort. Each edge in the trie of

BadSym

can be accessed in

O (log σ)

time by using a binary search. Since the depth of the trie is, at most, k, each

BadSym (P [i])

for

1 \leq i \leq m

can be added and accessed in

O (k log σ)

time. □

By using both

{GoodSuf}_{P}

and

{BadSym}_{P}

, MT-BM outputs the positions of the text that are permuted-matched with the pattern. The matching algorithm of MT-BM is shown in Algorithm 11.

Algorithm 11: Multi-track Boyer–Moore algorithm.

Theorem 6.

Given a multi-track text

T

and a pattern

P

, MT-BM outputs all occurrence positions of

P

in

T

in

O (n k (m + log σ) + n (k + σ))

time with

O (m (k log σ + σ))

preprocessing time.

Proof.

From Lemmas 11–13, Algorithm 11 needs

O (m (k log σ + σ))

time for preprocessing. Next,

RI

can be computed in

O (n (k + σ))

time by using radix sort. In the outer while loop starting at Line 7, the value of j is increased by at least one, so the loop is executed

n - m + 2

times at most. In each execution of the outer loop, the inner while loop is executed m times at most, where multi-track symbols of the pattern and the text can be compared in

O (k)

time.

BadSym

can be accessed in

O (k log σ)

time, and

GoodSuf

can be executed in

O (1)

time. □

Similarly to the original Horspool algorithm, the multi-track Horspool algorithm (MT-H) uses

BadSym

to shift the pattern.

Theorem 7.

Given a multi-track text

T

and a pattern

P

, MT-H outputs all occurrence positions of

P

in

T

in

O (n k (m + log σ) + n (k + σ))

time with

O (m (k log σ + σ))

preprocessing time.

Proof.

Similar to the proof of Theorem 6. □

Boyer–Moore and Horspool Matching Algorithms with Track Trie

The two algorithms presented in the previous subsections determine if two multi-tracks permuted-match by sorting them. In this subsection, we present another idea for this task using a data structure called a track trie. The track trie

TrackTrie (P)

of a multi-track

P

stores all the reversed strings of the tracks of

P

, that is

{P_{1}^{R}, P_{2}^{R}, \dots, P_{k}^{R}}

. Figure 3a shows the track trie of a multi-track pattern

P = (bbaba, abbba, aaabb)

.

Figure 3. (a) Track trie of

P = (bbaba, abbba, aaabb)

; (b) example of mismatch when the track trie cannot find the transition; (c) example of mismatch when the number of tracks is more than the weight of the node.

Algorithm 12 is the construction algorithm of

TrackTrie (P)

. For a node s of

TrackTrie (P)

and a symbol

c \in Σ

, the goto function

δ (s, c)

returns the child of s that has an edge labeled c. We naturally extend it to the domain

Σ^{*}

by

δ (s, ε) = s

and

δ (s, a w) = δ (δ (s, a), w)

for any

a \in Σ

and

w \in Σ^{*}

. We also associate a weight with each node to find mismatch on a text, as we explain later.

Algorithm 12: Track trie construction algorithm.

Theorem 8.

Algorithm 12 constructs

TrackTrie (P)

in

O (m k log σ)

time.

Proof.

The function

goto

can be calculated in

O (log σ)

time by binary search. On each execution of the inner for loop (Line 6), Algorithm 12 executes

goto

to check the child nodes of

activeNode

. If there is no node with an edge labeled

P [j] [i]

, then a new node is constructed, which can be done in

O (1)

time. On the other hand, if there is a node with an edge labeled

P [j] [i]

, Algorithm 12 accesses the child node and then increases its weight by one. The total number of iterations of the inner loop is

m k

. □

For a given multi-track text

T

and a position i, Algorithm 13 finds a mismatch position in two cases: (1) when a track cannot find its

goto

destination, and (2) when the number of tracks that have the same string w is more than the weight of the node that represents the string

δ (r o o t, w)

. Those mismatch conditions are illustrated in Figure 3b,c, respectively. Figure 3b shows that the track trie cannot find a transition for the second symbol

b

of the third track. On the other hand, Figure 3c shows that

T_{2} [3 :]

has two “

bba

” on its track; however, the

P [3 :]

has only one “

bba

” on its track, i.e., the node that represents “

bba

” has one on its weight.

Algorithm 13: Track trie matching algorithm.

Theorem 9.

Given a multi-track text

T

and a position j, Algorithm 13 finds a mismatch position in the pattern in

O (m k log σ)

time.

Proof.

For each position

i + j

on the text, Algorithm 13 executes

goto

to check whether

activeNodes [l]

has a child node with an edge labeled

T [i + j] [l]

for

1 \leq l \leq k

. If there is no child node with an edge labeled

T [i + j] [l]

, then Algorithm 13 considers it a mismatch and returns the mismatch position. On the other hand, if there is such a child node, Algorithm 13 updates

activeNodes [l]

to the child node and then checks whether the number of tracks of

T [i + j :]

that contain

T [i + j : i + m] [l]

as a prefix is more than the weight of the child node. If the number of tracks exceeds the weight, then Algorithm 13 treats it as mismatch and returns the mismatch position. The total number of iterations of the inner loop is, at most,

m k

. □

Although the worst case time complexity remains the same, by using track trie, both MT-BM and MT-H can match the pattern to the text faster, because we do not need to compute the reverse-sorted index of the text. First, we construct the track trie of the pattern by using

constructTrackTrie (P)

. Then, we replace Line 10 of Algorithm 11 by

matchTrackTrie (T, j)

to find a mismatch position.

5. Filtering Algorithm on a Multi-Track String

In this section, we propose filtering algorithms for permuted pattern matching. The filtering algorithm uses a function to transform a multi-track string into another sequence. Then, the algorithm implements a pattern matching algorithm on the transformed pattern and text to get candidate positions where the pattern may permuted-match. Finally, every candidate position is checked for whether the pattern is permuted-matched in each position or not.

The filtering algorithm uses a function

ϕ

that inputs a multi-track

W

and outputs a sequence of length

{| W |}_{len}

. The function

ϕ

must have a false-positive property, that is, for two multi-tracks

X

and

Y

, if

X ⋈ Y

, then

ϕ (X) = ϕ (Y)

. We can use any hash function such as the Karp–Rabin fingerprint [14] and locality-sensitive hashing [15]. We describe the filtering algorithm by using a simple function

β

that is defined as follows.

Definition 13

(Bucket function β). For

W = (W_{1}, W_{2}, \dots, W_{k})

,

β (W) = (B_{1}, B_{2}, \dots, B_{σ})

such that

B_{j} [i]

is the number of the

j^{th}

symbol in

Z [i]

.

For example, for a multi-track

W = (abab, bbac, aabb, cabb, abba)

,

β (W)

is as follows,

(\begin{matrix} a & b & a & b \\ b & b & a & c \\ a & a & b & b \\ c & a & b & b \\ a & b & b & a \end{matrix}), β (Z) = (\begin{matrix} 3 & 2 & 2 & 1 \\ 1 & 3 & 3 & 3 \\ 1 & 0 & 0 & 1 \end{matrix}) .

(2)

β (W)

can be computed in

O (n (k + σ))

time by counting all symbols in

W

.

Algorithm 14 shows an example of filtering algorithm for permuted pattern matching. This algorithm uses

β

as a hash function, and the KMP algorithm as a patten matching algorithm. First, the algorithm computes

P^{'} = β (P)

and

T^{'} = β (T)

and then constructs the border array

{Border}_{P^{'}}

using the same procedure as the KMP algorithm. Next, the algorithm performs pattern matching on

P^{'}

and

T^{'}

. When a mismatch occurs, the pattern is shifted by using

{Border}_{P^{'}}

. If

P^{'}

matches

T^{'} [i : j]

, then the algorithm checks whether the pattern

P

permuted-matches

T [i : j]

or not, and it outputs the position if true.

Algorithm 14: Filtering algorithm for permuted pattern matching.

Theorem 10.

Given a multi-track text

T

and a pattern

P

, Algorithm 14 computes all occurrence positions of

P

in

T

in

O ((m + n) (σ + k) + c m k)

time, where c is the number of candidates.

Proof.

As described above,

P^{'} = β (P)

and

T^{'} = β (T)

can be computed in

O (m (σ + k))

and

O (n (σ + k))

, respectively. The border array

{Border}_{P^{'}}

can be constructed in

O (m σ)

time, and pattern matching can be performed in

O (n σ)

time, since the comparison

P^{'} [i] = P^{'} [i]

needs

O (σ)

time. In addition, the algorithm takes

O (c m k)

time to check the candidates. □

6. Experiments

We performed some sets of experiments to evaluate the practical performance of the proposed algorithms. We measured the running time of our algorithms and the AC-automaton-based algorithm [7]. All experiments were performed on a computer with an Intel Xeon CPU E5-2609 8 core, 2.40 GHz, 256 GB memory, and a Debian Wheezy operating system. We set the parameter values as

n =

100,000,

m = 10

,

k =

1000, and

σ = 2

, and we changed one of the parameters in each experiment to see the relation between the parameters and the running time of the algorithms. We used randomly-generated texts and patterns. The source code of our algorithm is available at https://github.com/ushitora/permutedpatternmatching.

The first set of experiments compared the running times of KMP-based algorithms, namely the multi-track KMP algorithm, the multi-track AC-automaton algorithm, and the multi-track permuted matching automaton algorithm. The purpose of this experiment was to see the changes in the running time of the multi-track KMP algorithm when extended to the multi-track AC-automaton algorithm and the multi-track permuted matching automaton algorithm. The results of the experiment are shown in Figure 4. From the results, we can see that the multi-track AC-automaton algorithm was slower than the multi-track KMP algorithm. The reason for this is that the multi-track AC-automaton algorithm uses state transitions instead of comparing multi-track symbols directly. This time difference is not significant considering that the multi-track AC-automaton algorithm focuses on dictionary matching instead of single pattern matching. Next, we can see that the multi-track permuted matching automaton algorithm is faster than the multi-track KMP algorithm. The multi-track permuted matching automaton algorithm can perform matching faster because the multi-track permuted matching automaton algorithm does not need to sort the text.

Figure 4. Running time of the multi-track KMP (MTKMP), multi-track AC-automaton (MTACA), multi-track permuted matching automaton (MTPMA) algorithms on permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

The next experiment compared the running time of the multi-track Boyer–Moore algorithm and the multi-track Horspool algorithm. We also checked the effectiveness of track tries in reducing the matching times of both algorithms. Figure 5 shows the result of the experiment. While the multi-track Horspool was slightly faster than the multi-track Boyer–Moore algorithm, there was no significant difference between these algorithms. We can also see that track tries significantly reduced the matching times of both algorithms, since we do not need to sort the text when using track tries.

Figure 5. Running time of the multi-track Buyer–Moore (MTBM) and multi-track Horspool (MTH) algorithms on permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

The third experiment set compared the running times of the filtering algorithms. We used three algorithms, the KMP algorithm, the Boyer–Moore algorithm, and the Horspool algorithm, as filtering algorithms. Figure 6 shows the running times of the filtering algorithms. There was no significant difference in running time between the filtering algorithms. We can see that the filtering algorithms were fast apart from when the pattern length was

m = 1

. The algorithms became very slow, because there were many occurrence position candidates when

m = 1

. The occurrence position candidates will be fewer if the pattern length is longer.

Figure 6. Running time of the filtering algorithms (Knuth–Morris–Pratt (Filter-KMP), Boyer–Moore (Filter-BM), and Horspool (Filter-H)) for permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

Lastly, we compared the proposed algorithms with the AC-automaton-based algorithm [7]. Figure 7 shows the running time of the AC-automaton-based algorithm [7] and the proposed algorithms. We chose the fastest algorithms among the groups. We can see that the proposed algorithms were faster than the AC-automaton-based algorithm overall. The multi-track Boyer–Moore algorithm with track trie was the fastest among the algorithms.

Figure 7. Comparison of the running time of the AC-automaton-based algorithm with the proposed algorithms for permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

7. Conclusions

We proposed some algorithms for permuted pattern matching on multi-track strings. We primarily focused on the algorithms that preprocess the pattern before performing permuted pattern matching, instead of constructing indexing structures from the text. We showed that our proposed algorithms are faster than the AC-automaton-based algorithm. Moreover, we proposed an algorithm for dictionary matching on multi-track strings, which, to the best of our knowledge, is the first algorithm for this problem. Because of its considerable difficulty owing to a larger number of permutations, only a few algorithms for sub-permuted pattern matching have been proposed. Therefore, developing an algorithm for the sub-permuted pattern matching problem must be addressed in future work.

Author Contributions

Conceptualization, D.H., K.N., R.Y. and A.S.; methodology, D.H., Y.U., R.Y. and A.S.; software, D.H.; validation, D.H., R.Y. and A.S.; formal analysis, D.H., K.N., R.Y. and A.S.; investigation, D.H.; data curation, D.H.; writing—original draft preparation, D.H.; writing—review and editing, D.H., R.Y. and A.S.; visualization, D.H.; funding acquisition, D.H. and A.S.

Funding

This work was supported by JSPS KAKENHI Grant Numbers JP15H05706, JP24106010, JP25240003, and JP19K20208. This work was also supported by ImPACT Program of the Council for Science, Technology and Innovation (Cabinet Office, Government of Japan). Diptarama Hendrian was supported by a research grant from Tohoku University Division for International Advanced Research and Education.

Conflicts of Interest

The authors declare no conflict of interest.

References

Knuth, D.E.; Morris, J.H., Jr.; Pratt, V.R. Fast Pattern Matching in Strings. SIAM J. Comput. 1977, 6, 323–350. [Google Scholar] [CrossRef]
Boyer, R.S.; Moore, J.S. A fast string searching algorithm. Commun. ACM 1977, 20, 762–772. [Google Scholar] [CrossRef]
Horspool, R.N. Practical fast searching in strings. Softw. Pract. Exp. 1980, 10, 501–506. [Google Scholar] [CrossRef]
Weiner, P. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), Iowa City, IA, USA, 15–17 October 1973; pp. 1–11. [Google Scholar] [CrossRef]
Manber, U.; Myers, G. Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 1993, 22, 935–948. [Google Scholar] [CrossRef]
Ehrenfeucht, A.; McConnell, R.M.; Osheim, N.; Woo, S.W. Position heaps: A simple and dynamic text indexing data structure. J. Discret. Algorithms 2011, 9, 100–121. [Google Scholar] [CrossRef]
Katsura, T.; Narisawa, K.; Shinohara, A.; Bannai, H.; Inenaga, S. Permuted Pattern Matching on Multi-track Strings. In Proceedings of the SOFSEM 2013: 39th International Conference on Current Trends in Theory and Practice of Computer Science, Špindlerův Mlýn, Czech Republic, 26–31 January 2013; pp. 280–291. [Google Scholar]
Katsura, T.; Otomo, Y.; Narisawa, K.; Shinohara, A. Position Heaps for Permuted Pattern Matching on Multi-Track Strings. In Proceedings of the Student Research Forum Papers and Posters at SOFSEM 2015, Pec pod Snezkou, Czech Republic, 24–29 January 2015; pp. 41–53. [Google Scholar]
Narisawa, K.; Katsura, T.; Ota, H.; Shinohara, A. Filtering Multi-set Tree: Data Structure for Flexible Matching Using Multi-track Data. Interdiscip. Inf. Sci. 2015, 21, 37–47. [Google Scholar] [CrossRef][Green Version]
Diptarama, Y.U.; Narisawa, K.; Shinohara, A. KMP based pattern matching algorithms for multi-track strings. In Proceedings of the Student Research Forum Papers and Posters at SOFSEM 2016, Harrachov, Czech Republic, 23–28 January 2016; Volume 2016, pp. 100–107. [Google Scholar]
Diptarama, Y.U.; Yoshinaka, R.; Shinohara, A. Fast Full Permuted Pattern Matching Algorithms on Multi-track Strings. In Proceedings of the Prague Stringology Conference 2016, Prague, Czech Republic, 29–31 August 2016; pp. 7–21. [Google Scholar]
Matsuoka, Y.; Aoki, T.; Inenaga, S.; Bannai, H.; Takeda, M. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theor. Comput. Sci. 2016, 656, 225–233. [Google Scholar] [CrossRef]
Aho, A.V.; Corasick, M.J. Efficient string matching: An aid to bibliographic search. Commun. ACM 1975, 18, 333–340. [Google Scholar] [CrossRef]
Karp, R.M.; Rabin, M.O. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 1987, 31, 249–260. [Google Scholar] [CrossRef]
Indyk, P.; Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing—STOC ’98, Dallas, TX, USA, 24–26 May 1998; ACM Press: New York, NY, USA, 1998; pp. 604–613. [Google Scholar] [CrossRef]

Figure 1. The multi-track AC-automaton

MTAC (D)

for

D = {P_{1}, P_{2}, P_{3}}

, where

P_{1} = (aaabb, abaab, bbaaa)

,

P_{2} = (abab, abba, bbab)

, and

P_{3} = (aabbab, bababb, baaaab)

. The asterisk “*” is a special symbol that matches any symbol in

Σ

.

Figure 1. The multi-track AC-automaton

MTAC (D)

for

D = {P_{1}, P_{2}, P_{3}}

, where

P_{1} = (aaabb, abaab, bbaaa)

,

P_{2} = (abab, abba, bbab)

, and

P_{3} = (aabbab, bababb, baaaab)

. The asterisk “*” is a special symbol that matches any symbol in

Σ

.

Figure 2. The multi-track permuted matching automaton

MTPMA (P)

for

P = (aaabb, abaab, bbaaa)

. The asterisk is a special symbol that matches with any symbols in

Σ

.

Figure 2. The multi-track permuted matching automaton

MTPMA (P)

for

P = (aaabb, abaab, bbaaa)

. The asterisk is a special symbol that matches with any symbols in

Σ

.

Figure 3. (a) Track trie of

P = (bbaba, abbba, aaabb)

; (b) example of mismatch when the track trie cannot find the transition; (c) example of mismatch when the number of tracks is more than the weight of the node.

Figure 3. (a) Track trie of

P = (bbaba, abbba, aaabb)

; (b) example of mismatch when the track trie cannot find the transition; (c) example of mismatch when the number of tracks is more than the weight of the node.

Figure 4. Running time of the multi-track KMP (MTKMP), multi-track AC-automaton (MTACA), multi-track permuted matching automaton (MTPMA) algorithms on permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

Figure 5. Running time of the multi-track Buyer–Moore (MTBM) and multi-track Horspool (MTH) algorithms on permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

Figure 6. Running time of the filtering algorithms (Knuth–Morris–Pratt (Filter-KMP), Boyer–Moore (Filter-BM), and Horspool (Filter-H)) for permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

Figure 7. Comparison of the running time of the AC-automaton-based algorithm with the proposed algorithms for permuted pattern matching with respect to (a) the text length, (b) the track count, (c) the pattern length, and (d) the alphabet size.

Table 1. Comparison of the algorithms for permuted pattern matching. Multi-track AC-automaton can find occurrences of multiple patterns.

Algorithm	Preprocessing Time	Matching Time
AC-automaton-based [7]	$O (m k log σ)$	$O (n k log σ)$
Multi-track KMP	$O (m k)$	$O (n k)$
MTAC-automaton	$O (d k log σ)$	$O (n k log σ)$
MT permuted matching automaton	$O (m k log σ)$	$O (n k log σ)$
MT Boyer–Moore	$O (m (k log σ + σ))$	$O (n k (m + log σ) + n (k + σ))$
MT Horspool	$O (m (k log σ + σ))$	$O (n k (m + log σ) + n (k + σ))$

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.