siEDM: an efficient string index and search algorithm for edit distance with moves

Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm that supports editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinary editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind edit sensitive parsing (ESP), an efficient algorithm that approximately computes EDM with guaranteed upper and lower bounds for the exact EDM. siEDM efficiently prunes the search space for query strings, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM's efficiency.


Introduction
Vast amounts of text data are created, replicated, and modified with the increasing use of the internet and advances in data-centric technology. Many of these data contain repetitions of long substrings with slight differences; such collections, called highly repetitive texts, include Wikipedia and software repositories like GitHub with a large number of revisions. Recent biological databases also store a large number of human genomes, while the genetic differences among individuals are less than 0.1 percent, which makes collections of human genomes highly repetitive. Therefore, there is a strong need to develop powerful methods for processing highly repetitive text collections on a large scale.
Building indexes is the de facto standard method to search large databases of highly repetitive texts. Several methods have been presented for indexing and searching large-scale and highly repetitive text collections. Examples include the ESP-index [19], SLP-index [1] and LZ77-based index [5]. Recently, Gagie and Puglisi [6] presented a general framework called kernelization for indexing and searching highly repetitive texts. Although these methods enable fast query searches, their applicability is limited to exact match searches.
The edit distance between two strings is the minimum cost of edit operations (insertions, deletions, and replacements of characters) to transform one string into the other. It has been proposed for detecting evolutionary changes in biological sequences [4], detecting typing errors in documents [3], and correcting errors on lossy communication channels [10]. To accelerate the quadratic time upper bound on computing the edit distance, Cormode and Muthukrishnan introduced a new technique called edit sensitive parsing (ESP) [2]. This technique allows us to compute a modified edit distance in near linear time by sacrificing accuracy with theoretical bounds. The modified distance is known as edit distance with moves (EDM) [2], which includes substring move operations in addition to insertions and deletions. While the exact computation of EDM is known to be intractable [18], the approximate computation of EDM using ESP achieves a good approximation ratio O(lg N lg* N), and runs in almost linear time O(N lg* N) for the string length N, where lg denotes the logarithm of base two.
ESP is extended to various applications for highly repetitive texts. Examples are data compressions called grammar compression [17,12,14,13], indexes for exact matches [11,19,21], an approximated frequent pattern discovery [15] and an online pattern matching for EDM [20]. Despite several attempts to efficiently compute EDM and various extensions of ESP, there is no method for indexing and searching texts with EDM. Such a method is required in bioinformatics where approximated text searches are used to analyze massive genome sequences. Thus, an open challenge is to develop an efficient string index and search algorithm for EDM.
We propose a novel method called siEDM that efficiently indexes massive text, and performs query searches for EDM. As far as we know, siEDM is the first string index for searching queries for EDM. A space-efficient index structure for a string is built by succinctly encoding a parse tree obtained from ESP, and query searches are performed on the encoded index structures. siEDM prunes useless portions of the search space based on the lower bound of EDM without missing any matching patterns, enabling fast query searches. As in existing methods, similarity searches of siEDM are approximate but have the same guarantee of the approximation ratio as in ESP.
Experiments were performed on indexing and searching repetitive texts for EDM on standard benchmark datasets. The performance comparison with an online pattern matching for EDM [20] demonstrates siEDM's practicality.

Basic notations
Let Σ be a finite alphabet, and σ be |Σ|. All elements in Σ are totally ordered. Let us denote by Σ* the set of all strings over Σ, and by Σ^q the set of strings of length q over Σ, i.e., Σ^q = {w ∈ Σ* : |w| = q}; an element in Σ^q is called a q-gram. The length of a string S is denoted by |S|. The empty string ε is a string of length 0, namely |ε| = 0. For a string S = αβγ, α, β and γ are called the prefix, substring, and suffix of S, respectively. The i-th character of a string S is denoted by S[i] for 1 ≤ i ≤ |S|. Let S[i, j] denote the substring of S that begins at position i and ends at position j, and let S[i, j] be ε when i > j. For a string S and integer q ≥ 0, let pre(S, q) = S[1, q] and suf(S, q) = S[|S| − q + 1, |S|]. We assume a recursively enumerable set X of variables with Σ ∩ X = ∅. All elements in Σ ∪ X are totally ordered, where all elements in Σ must be smaller than those in X. In this paper, we call a sequence of symbols from Σ ∪ X a string. Let us define lg^(1) u = lg u, and lg^(i+1) u = lg(lg^(i) u) for i ≥ 1. The iterated logarithm of u is denoted by lg* u, and defined as the number of times the logarithm function must be applied before the result is less than or equal to 1, i.e., lg* u = min{i : lg^(i) u ≤ 1}.
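The iterated logarithm grows extremely slowly, which is why the lg* factors in the bounds of this paper are small in practice. The following Python sketch evaluates the formula lg* u = min{i : lg^(i) u ≤ 1} directly; the function name `lg_star` is our own and floating-point logarithms are assumed to be accurate enough for illustration.

```python
import math


def lg_star(u: float) -> int:
    """Smallest i >= 1 such that applying lg i times to u gives a value <= 1."""
    if u <= 0:
        raise ValueError("u must be positive")
    i = 1
    value = math.log2(u)
    while value > 1:
        # value > 1 here, so the next logarithm is well defined.
        value = math.log2(value)
        i += 1
    return i


if __name__ == "__main__":
    for u in (2, 4, 16, 65536):
        print(u, lg_star(u))
```

For powers of two such as 16 or 65536 the intermediate values are exact, so the loop count matches the definition without rounding concerns.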

Straight-line program (SLP)
A context-free grammar (CFG) in Chomsky normal form is a quadruple G = (Σ, V, D, X_s), where V is a finite subset of X, D is a finite subset of V × (V ∪ Σ)^2, and X_s ∈ V is the start symbol. An element in D is called a production rule. Denote by X_l(k) (resp. X_r(k)) the left symbol (resp. right symbol) on the right hand side of the production rule with variable X_k on the left hand side, i.e., X_k → X_l(k) X_r(k). For a variable X_i ∈ V, val(X_i) denotes the string derived from X_i. A grammar compression of S is a CFG G that derives S and only S. The size of a CFG is the number of variables, i.e., |V|, and we let n = |V|.
The parse tree of G is a rooted ordered binary tree such that (i) internal nodes are labeled by variables in V and (ii) leaves are labeled by symbols in Σ, i.e., the label sequence in leaves is equal to the input string. In a parse tree, any internal node Z corresponds to a production rule Z → XY , and has the left child with label X and the right child with label Y .
Straight-line program (SLP) [9] is defined as a grammar compression over Σ∪V , and its production rules are in the form of X k → X i X j where X i , X j ∈ Σ ∪ V and 1 ≤ i, j < k ≤ n + σ.
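As a small illustration of how an SLP derives its string, the sketch below expands val(X) from a rule table. The dictionary layout, the function name `val`, and the toy grammar are our own choices for illustration and are not taken from the paper.

```python
def val(symbol, rules, cache=None):
    """Return the string derived from `symbol` in an SLP.

    `rules` maps each variable to its pair (left symbol, right symbol);
    any symbol that is not a key of `rules` is treated as a terminal.
    """
    if cache is None:
        cache = {}
    if symbol not in rules:
        return symbol
    if symbol not in cache:
        left, right = rules[symbol]
        cache[symbol] = val(left, rules, cache) + val(right, rules, cache)
    return cache[symbol]


# Toy SLP (hypothetical): X1 -> ab, X2 -> X1 X1, X3 -> X2 X1.
TOY_RULES = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "X1")}

if __name__ == "__main__":
    print(val("X3", TOY_RULES))
```

Because each variable is expanded once and cached, shared variables such as X1 are not re-derived, which mirrors why an SLP can be much smaller than the string it represents.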

Rank/select dictionaries
A rank/select dictionary for a bit string B [8] supports the following queries: rank_c(B, i) returns the number of occurrences of c ∈ {0, 1} in B[0, i]; select_c(B, i) returns the position of the i-th occurrence of c ∈ {0, 1} in B; access(B, i) returns the i-th bit in B. Data structures that use only |B| + o(|B|) bits of storage and answer rank and select queries in O(1) time [16] have been presented. GMR [7] is a rank/select dictionary for large alphabets and supports rank/select/access queries for strings in (Σ ∪ V)*. GMR uses (n + σ) lg (n + σ) + o((n + σ) lg (n + σ)) bits while computing both rank and access queries in O(lg lg (n + σ)) time and select queries in O(1) time.

Problem
We first review the notion of EDM. The distance d(S, Q) between two strings S and Q is the minimum number of edit operations to transform S into Q. The edit operations are the character edits and the substring move described in the Introduction. Shapira and Storer [18] proved the NP-completeness of EDM and proposed a polynomial-time algorithm for a restricted EDM. Cormode and Muthukrishnan [2] presented an approximation algorithm named ESP for computing EDM. We present a string index and search algorithm by leveraging the idea behind ESP for solving Problem 1. Our method consists of two parts: (i) an efficient index structure for a given string S and (ii) a fast algorithm for searching query Q on the index structure of S with respect to EDM. Although our method is also an approximation algorithm, it guarantees upper and lower bounds for the exact EDM. We first review ESP in the next section and then discuss the two parts.

ESP revisit
We review the edit sensitive parsing algorithm for building SLPs [17]. This algorithm, referred to as ESP-comp, computes an SLP from an input string S. The tasks of ESP-comp are to (i) partition S into s_1 s_2 · · · s_ℓ such that 2 ≤ |s_i| ≤ 3 for each 1 ≤ i ≤ ℓ, (ii) if |s_i| = 2, generate the production rule X → s_i and replace s_i by X (this subtree is referred to as a 2-tree), and if |s_i| = 3, generate the production rules Y → AX and X → BC for s_i = ABC, and replace s_i by Y (referred to as a 2-2-tree), and (iii) iterate this process until S becomes a single symbol. Finally, ESP-comp builds an SLP representing the string S.
We focus on how to determine the partition S = s 1 s 2 · · · s ℓ . A string of the form a r with a ∈ Σ∪V and r ≥ 2 is called a repetition. First, S is uniquely partitioned into the form w 1 x 1 w 2 x 2 · · · w k x k w k+1 by its maximal repetitions, where each x i is a maximal repetition of a symbol in Σ ∪ V , and each w i ∈ (Σ ∪ V ) * contains no repetition. Then, each x i is called type1, each w i of length at least 2 lg * |S| is type2, and any remaining w i is type3. If |w i | = 1, this symbol is attached to x i−1 or x i with preference x i−1 when both cases are possible. Thus, if |S| > 2, each x i and w i is longer than or equal to two. One of the substrings is referred to as S i .
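Next, ESP-comp parses each S_i depending on the type. For type1 and type3 substrings, the algorithm performs the left aligned parsing as follows. If |S_i| is even, the algorithm builds a 2-tree from each consecutive pair S_i[2j − 1, 2j]; otherwise, it builds 2-trees from the consecutive pairs in the same way except for the last three symbols, from which it builds a 2-2-tree. For a type2 S_i, the algorithm further partitions it into short substrings of length two or three by alphabet reduction [2].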
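The left aligned parsing can be sketched in a few lines. The version below assumes the usual convention for ESP: symbols are paired from the left, and an odd-length string ends with one block of three, so that every block has length two or three. The function name `left_aligned_parse` is hypothetical.

```python
def left_aligned_parse(symbols):
    """Split `symbols` into blocks of length 2, with a final block of length 3
    when the input length is odd (assumed left aligned parsing convention)."""
    n = len(symbols)
    if n < 2:
        raise ValueError("left aligned parsing needs at least two symbols")
    # For odd n, the last three symbols are reserved for the closing block.
    end_of_pairs = n if n % 2 == 0 else n - 3
    blocks = [tuple(symbols[i:i + 2]) for i in range(0, end_of_pairs, 2)]
    if n % 2 == 1:
        blocks.append(tuple(symbols[n - 3:]))
    return blocks


if __name__ == "__main__":
    print(left_aligned_parse("aaaa"))   # two blocks of length 2
    print(left_aligned_parse("aaaaa"))  # one pair followed by one triple
```

Each pair would then become a 2-tree and each triple a 2-2-tree in the grammar construction.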
The alphabet reduction transforms S into L * (S) such that any substring of L * (S) of length at least 2 lg * |S| contains at least one landmark because L * (S) is also type2. Using this characteristic, the algorithm ESP-comp determines the bigrams S[i, i+1] to be replaced for any landmark S[i], where any two landmarks are not adjacent, and then the replacement is deterministic. After replacing all landmarks, any remaining maximal substring s is replaced by the left aligned parsing, where if |s| =1, it is attached to its left or right block.
We give an example of the edit sensitive parsing of an input string in Figure 1. The input string is divided into substrings depending on the types. The label string L is computed for the type2 string. Originally, L is iteratively computed until [L] ≤ lg* |S|. This case shows that a single iteration satisfies this condition. After the alphabet reduction, three landmarks S[i] are found, and then each S[i, i + 1] is parsed. Any other remaining substrings including type1 and type3 are parsed by the left aligned parsing shown in Figure 1-(ii). In this example, a dashed node denotes that it is an intermediate node in a 2-2-tree. Originally, an ESP tree is a ternary tree in which each node has at most three children. The intermediate node is introduced to represent the ESP tree as a binary tree.
As shown in [2], the alphabet reduction approximates the minimum CFG as follows. Let S be a type2 string containing a substring α at least twice. When α is sufficiently long (e.g., |α| ≥ 2 lg* |S|), there is a partition α = α_1 α_2 such that |α_1| = O(lg* |S|) and each landmark of α_2 within α is decided by only α_1. This means the long suffix α_2 of α is replaced by the same variables, independent of the occurrence of α.
ESP-comp generates a new shorter string S′ of length from |S|/3 to |S|/2, and it parses S′ iteratively. Given a string S, ESP builds an ESP-tree of height O(lg |S|).

Approximate computations of EDM from ESP-trees
ESP-trees enable us to approximately compute EDM for two strings. After constructing ESP-trees for two strings, their characteristic vectors are defined as follows. Let T(S) be the ESP-tree for string S. We define an integer vector F(S) to be the characteristic vector if F(S)(X) represents the number of times the variable X appears in T(S) as the root of a 2-tree. For a string S, T(S) and its characteristic vector are illustrated in Figure 2. The EDM between two strings S and Q can be approximated by the L1-distance between two characteristic vectors F(S) and F(Q) as follows:
$\|F(S) - F(Q)\|_1 = \sum_{e \in V(S) \cup V(Q)} |F(S)(e) - F(Q)(e)|$.
Cormode and Muthukrishnan [2] showed upper and lower bounds on the L1-distance between characteristic vectors in terms of the exact EDM (Theorem 1).
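The L1-distance between two characteristic vectors is straightforward to compute when the vectors are stored sparsely. The sketch below uses Python's `collections.Counter` as a sparse vector keyed by variable name; the variable names and counts in the example are made up for illustration.

```python
from collections import Counter


def l1_distance(f_s: Counter, f_q: Counter) -> int:
    """Sum of |F(S)(e) - F(Q)(e)| over every symbol e present in either vector."""
    # Counter returns 0 for a missing key, which matches a zero entry.
    return sum(abs(f_s[e] - f_q[e]) for e in set(f_s) | set(f_q))


if __name__ == "__main__":
    # Hypothetical characteristic vectors of two strings.
    f_s = Counter({"X1": 3, "X2": 1})
    f_q = Counter({"X1": 1, "X4": 2})
    print(l1_distance(f_s, f_q))
```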

5 Index Structure for ESP-trees

Efficient encoding scheme
siEDM encodes an ESP-tree built from a string for fast query searches. This encoding scheme sorts the production rules in an ESP-tree such that the left symbols on the right hand side of the production rules are in monotonically increasing order, which enables encoding of these production rules efficiently and supporting fast operations for ESP-trees. The encoding scheme is performed from the first and second levels to the top level (i.e., root) in an ESP-tree.
First, the set of production rules at the first and second levels in the ESP-tree is sorted in increasing order of the left symbols on the right hand side of the production rules, i.e., X_l(i) in the form of X_i → X_l(i) X_r(i), which results in a sorted sequence of these production rules. The variables on the left hand side of the sorted production rules are renamed in the sorted order, generating a set of new production rules that is assigned to the corresponding nodes in the ESP-tree. The same scheme is applied to the next level of the ESP-tree, which iterates until it reaches the root node. Figure 3 shows an example of the encoding scheme for the ESP-tree built from an input string S = babababaaba. At the first and second levels in the ESP-tree, the set of production rules, {X_1 → ab, X_2 → bX_1, X_3 → aa, X_4 → ba}, is sorted in the lexicographic order of the left symbols on the right hand sides of the production rules, which results in the sequence of production rules (X_1 → ab, X_3 → aa, X_2 → bX_1, X_4 → ba). The variables on the left hand side of the production rules are renamed in the sorted order, resulting in the new sequence (X_1 → ab, X_2 → aa, X_3 → bX_1, X_4 → ba), whose production rules are assigned to the corresponding nodes in the ESP-tree. This scheme is repeated until it reaches level 4.
Using the above encoding scheme, we obtain a monotonically increasing sequence of left symbols on the right hand side of the production rules, i.e., X_l(i) in the form of X_i → X_l(i) X_r(i). Let A_l be the increasing sequence; A_l can be efficiently encoded into a bit string by using the gap-encoding and the unary coding. For example, the gap-encoding represents the sequence (1, 1, 3, 5, 8) by (1, 0, 2, 2, 3), and it is further transformed to the bit string 0^1 1 0^0 1 0^2 1 0^2 1 0^3 1 = 0110010010001 by unary coding. Generally, for a sequence (x_1, x_2, . . . , x_n), its unary code U represents x_i by rank_0(U, select_1(U, i)). Because the number of 0s and the number of 1s is n + σ and n, respectively, the size of U is 2n + σ bits. The bit string is indexed by the rank/select dictionary.
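The gap and unary encoding of a monotone sequence, and the way x_i is recovered as rank_0(U, select_1(U, i)), can be checked on the example sequence (1, 1, 3, 5, 8). The sketch below uses linear-time rank and select over a Python string purely for illustration; it stands in for a succinct rank/select dictionary and is not the data structure used by siEDM. Positions are 1-based, and the function names are our own.

```python
def unary_encode(seq):
    """Encode a non-decreasing sequence: each gap g becomes g zeros and a one."""
    bits, prev = [], 0
    for x in seq:
        if x < prev:
            raise ValueError("sequence must be non-decreasing")
        bits.append("0" * (x - prev) + "1")
        prev = x
    return "".join(bits)


def select1(bits, i):
    """1-based position of the i-th '1' in bits (naive linear scan)."""
    count = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            count += 1
            if count == i:
                return pos
    raise IndexError("fewer than i ones")


def rank0(bits, pos):
    """Number of '0' bits among the first pos bits."""
    return bits[:pos].count("0")


def decode(bits, i):
    """Recover x_i as rank_0(U, select_1(U, i))."""
    return rank0(bits, select1(bits, i))


if __name__ == "__main__":
    U = unary_encode([1, 1, 3, 5, 8])
    print(U)
    print([decode(U, i) for i in range(1, 6)])
```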
Let A r be the sequence consisting of the right symbols on the right hand side of the production rules, i.e., X r (i) in the form of X i → X l(i) X r(i) . A r is represented using (n + σ) lg (n + σ) bits. A r is indexed by GMR [7].
The space for storing A l and A r is (n+σ) lg (n + σ)+2n+σ+o((n+σ) lg (n + σ)) bits in total. A l and A r enable us to simulate fast queries on encoded ESP-trees, which is presented in the next subsection.

Query processing on tree
The encoded ESP-trees support the operations LeftChild, RightChild, LeftParents and RightParents, which are used in our search algorithm. LeftChild(X k ) returns the left child X l(k) of X k and can be implemented on bit string A l in O(1) time as m = select 1 (A l , X k ) and LeftChild(X k ) = m − X k . RightChild(X k ) returns the right child X r(k) of X k and can be implemented on array A r in O(lg lg (n + σ)) time as X j = access(A r , X k ).
LeftParents(X_k) and RightParents(X_k) return sets of parents of X_k as left and right children, respectively, i.e., LeftParents(X_k) = {X_i ∈ V : X_l(i) = X_k} and RightParents(X_k) = {X_i ∈ V : X_r(i) = X_k}. Because A_l is a monotonic sequence, any X_k appears consecutively in A_l. Using the unary encoding of A_l, LeftParents(X_k) is computed by {p + i : p = select_1(A_l, X_k), rank_0(A_l, p + i) = rank_0(A_l, p)} in O(|LeftParents(X_k)|) time.
RightParents(X k ) can be computed by repeatedly applying select operations for X k on A r until no more X k appear, i.e., select X k (A r , p) for 1 ≤ p ≤ n. Thus, RightParents(X k ) for X k ∈ V can be computed in O(|RightParents(X k )|) time.
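To make the four tree operations concrete, the following sketch simulates them on the renamed rules (X_1 → ab, X_2 → aa, X_3 → bX_1, X_4 → ba) from the encoding example. It assumes an integer coding of symbols that is our own choice (a = 1, b = 2, X_1 = 3, and so on), identifies each rule by its 1-based index k, and uses naive scans instead of succinct rank/select structures. In particular, `left_parents` locates the run of ones that follows the c-th zero, which is one way to exploit the fact that equal left symbols are consecutive; it is not claimed to be the paper's exact formula.

```python
def unary_encode(seq):
    """Gap + unary code of a non-decreasing sequence (g zeros, then a one)."""
    bits, prev = [], 0
    for x in seq:
        bits.append("0" * (x - prev) + "1")
        prev = x
    return "".join(bits)


def select(bits, c, i):
    """1-based position of the i-th occurrence of character c (naive scan)."""
    count = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            count += 1
            if count == i:
                return pos
    raise IndexError("not enough occurrences")


def left_child(U, k):
    """Left symbol of rule k: zeros before the k-th one = select_1(U, k) - k."""
    return select(U, "1", k) - k


def right_child(A_r, k):
    """Right symbol of rule k: a plain access on the right-symbol array."""
    return A_r[k - 1]


def left_parents(U, c):
    """Indices of rules whose left symbol is c (a consecutive run in A_l)."""
    if U.count("0") < c:
        return []
    q = select(U, "0", c) + 1
    parents = []
    while q <= len(U) and U[q - 1] == "1":
        parents.append(q - c)  # ones up to q = q minus the c zeros before it
        q += 1
    return parents


def right_parents(A_r, c):
    """Indices of rules whose right symbol is c (stands in for repeated select)."""
    return [k for k, s in enumerate(A_r, start=1) if s == c]


# Assumed coding: a=1, b=2, X1=3, X2=4, X3=5, X4=6; rule k defines X_k.
A_L = [1, 1, 2, 2]  # left symbols of ab, aa, bX1, ba
A_R = [2, 1, 3, 1]  # right symbols of ab, aa, bX1, ba
U_L = unary_encode(A_L)

if __name__ == "__main__":
    print(U_L, left_child(U_L, 3), right_child(A_R, 3))
    print(left_parents(U_L, 2), right_parents(A_R, 1))
```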

Other data structures
As a supplemental data structure, siEDM computes the node characteristic vector, denoted by F(X_i), for each variable X_i: the characteristic vector consisting of the frequency of any variable derived from X_i. The space for storing all node characteristic vectors of n variables is at most n^2 lg |S| bits. Figure 3-(V) shows an example of the node characteristic vectors for the ESP-tree in Figure 3-(III). In addition, let V(X_i) be the set of X_i and the variables appearing in all the descendant nodes under X_i, i.e., V(X_i) = {e ∈ (V ∪ Σ) : F(X_i)(e) ≠ 0}. Practically, F(X_i) is represented by a sequence of pairs of X_j ∈ V(X_i) and F(X_i)(X_j). Additionally, because F(X_i) = F(LeftChild(X_i)) + F(RightChild(X_i)) + (X_i, 1), where +(X_i, 1) represents adding 1 to dimension X_i, the characteristic vectors can be stored per level 2 of the ESP-tree. The data structure is represented by a bit array FB indexed by a rank/select dictionary and the characteristic vectors reduced per level 2 of the ESP-tree. The i-th bit of FB is set to 1 if F(X_i) is stored; otherwise it is 0. Then, F(X_i) can be computed by a rank_1 query on FB.
Another data structure that siEDM uses is a non-negative integer vector named the length vector, each dimension of which is the length of the substring derived from the corresponding variable (see Figure 3-(VI)). The space for storing length vectors of n variables is n lg |S| bits.
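The recurrence F(X_i) = F(LeftChild(X_i)) + F(RightChild(X_i)) + (X_i, 1) and the length vector can both be filled in one bottom-up pass. The sketch below does this for a toy grammar of our own (not the one in Figure 3). It assumes that a terminal contributes a count of 1 for itself, so that V(X_i) also contains the terminals under X_i; that convention is an assumption made for the example.

```python
from collections import Counter


def node_vectors(rules):
    """Return (F, length): node characteristic vectors and derived lengths.

    `rules` maps a variable to (left symbol, right symbol); symbols that are
    not keys of `rules` are terminals.
    """
    F, length = {}, {}

    def visit(sym):
        if sym in F:
            return
        if sym not in rules:
            # Assumption: a terminal counts itself once and has length 1.
            F[sym] = Counter({sym: 1})
            length[sym] = 1
            return
        left, right = rules[sym]
        visit(left)
        visit(right)
        F[sym] = F[left] + F[right] + Counter({sym: 1})
        length[sym] = length[left] + length[right]

    for sym in rules:
        visit(sym)
    return F, length


# Toy grammar (hypothetical): X1 -> ab, X2 -> X1 X1, X3 -> X2 X1.
TOY_RULES = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "X1")}

if __name__ == "__main__":
    F, length = node_vectors(TOY_RULES)
    print(dict(F["X3"]), length["X3"])
```

Storing only some of these vectors and recomputing the rest from children, as the text describes, trades space for a few extra additions per lookup.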
6 Search Algorithm

Baseline algorithm
Given an ESP tree T(S), the maximal subtree decomposition of S[i, j] is a sequence (X_1, X_2, . . . , X_m) of variables in T(S) defined recursively as follows. X_1 is the variable of the root of the maximal subtree satisfying that S[i] is its leftmost leaf and |val(X_1)| ≤ j − i. If val(X_1) = S[i, j], then (X_1) is the maximal subtree decomposition of S[i, j]. Otherwise, let X_1, X_2, . . . , X_m be already determined and |val(X_1)val(X_2) · · · val(X_m)| = k < j − i. Then, let X_m+1 be the variable of the root of the maximal subtree satisfying that S[i + k + 1] is its leftmost leaf and |val(X_m+1)| ≤ j − i − k. Repeating this process until val(X_1)val(X_2) · · · val(X_m) = S[i, j], the maximal subtree decomposition is determined.
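A maximal subtree decomposition can be obtained by a top-down walk that keeps a subtree whenever its span lies entirely inside the requested range and descends otherwise; on a binary parse tree this yields the same left-to-right sequence of maximal subtrees as the greedy definition. The sketch below uses 1-based inclusive positions and a toy grammar of our own; the helper names are hypothetical.

```python
def decompose(root, rules, i, j):
    """Maximal subtree decomposition of S[i, j] (1-based, inclusive)."""
    sizes = {}

    def size(sym):
        if sym not in rules:
            return 1  # terminal
        if sym not in sizes:
            left, right = rules[sym]
            sizes[sym] = size(left) + size(right)
        return sizes[sym]

    out = []

    def walk(sym, start):
        end = start + size(sym) - 1
        if end < i or start > j:
            return                # subtree lies outside the range
        if i <= start and end <= j:
            out.append(sym)       # subtree lies inside: keep it whole
            return
        left, right = rules[sym]  # partial overlap: descend into children
        walk(left, start)
        walk(right, start + size(left))

    walk(root, 1)
    return out


# Toy grammar (hypothetical) deriving S = ababab.
TOY_RULES = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "X1")}

if __name__ == "__main__":
    print(decompose("X3", TOY_RULES, 2, 5))
```

A terminal always has length 1, so it is either inside or outside the range and the walk never tries to descend below a leaf.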
Based on the maximal subtree decomposition, we explain the outline of the baseline algorithm, called online ESP [20], for computing an approximation of EDM between two strings. T(S) is constructed beforehand. Given a pattern Q, the online ESP computes T(Q), and for each substring S[i, j] of length |Q|, it computes the approximate EDM as follows. It computes the maximal subtree decomposition (X_1, X_2, . . . , X_m) of S[i, j]. Then, the distance $\|F(Q) - \sum_{t=1}^{m} F(X_t)\|_1$ is used as the approximation of EDM. This baseline algorithm is, however, required to compute the characteristic vector of S[i, j] at each position i. Next, we improve the time and space of the online ESP by finding those |Q|-grams for each variable X in V(S) instead of each position i.

Improvement
siEDM approximately solves Problem 1 with the same guarantees presented in Theorem 1. Let X_i ∈ V(S) such that |val(X_i)| > |Q|. There are |Q|-grams formed by the string suf(val(X_l(i)), |Q| − k) pre(val(X_r(i)), k) with k = 1, 2, . . . , (|Q| − 1). Then, the variable X_i is said to stab the |Q|-grams. The set of the |Q|-grams stabbed by X_i is denoted by itv(X_i). Let itv(S) be the set of itv(X_i) for all X_i appearing in T(S). An important fact is that itv(S) includes any |Q|-gram in S. Using this characteristic, we can reduce the search space.
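The set itv(X_i) can be illustrated by materializing val(X_l(i)) and val(X_r(i)) and gluing suffixes to prefixes. The sketch below does this on a toy grammar of our own and checks that the union over all variables covers every q-gram of the derived string. Expanding the strings explicitly is only for illustration (the index never does this), and skipping values of k for which a child is too short is our own handling of boundary cases.

```python
def expand(sym, rules):
    """String derived from sym; non-keys of `rules` are terminals."""
    if sym not in rules:
        return sym
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)


def itv(sym, rules, q):
    """q-grams stabbed by variable sym: suf(val(left), q-k) + pre(val(right), k)."""
    left, right = rules[sym]
    lv, rv = expand(left, rules), expand(right, rules)
    grams = set()
    for k in range(1, q):
        # Keep only k for which both children are long enough.
        if q - k <= len(lv) and k <= len(rv):
            grams.add(lv[len(lv) - (q - k):] + rv[:k])
    return grams


def all_qgrams(s, q):
    return {s[p:p + q] for p in range(len(s) - q + 1)}


# Toy grammar (hypothetical) deriving S = ababab.
TOY_RULES = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "X1")}

if __name__ == "__main__":
    q = 3
    union = set().union(*(itv(x, TOY_RULES, q) for x in TOY_RULES))
    print(union, all_qgrams(expand("X3", TOY_RULES), q))
```

Every occurrence of a q-gram with q ≥ 2 crosses the boundary between the two children of the lowest node that contains it, which is why the union over all variables misses nothing.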
If a |Q|-gram R is in itv(X_i), there exists a maximal subtree decomposition X_i1, X_i2, . . . , X_im. Then, the L1-distance of F(Q) and $\sum_{j=1}^{m} F(X_{i_j})$ guarantees the same upper bounds as in the original ESP as follows.
Theorem 2 Let R ∈ itv(X_i) be a |Q|-gram on S and X_i1, X_i2, . . . , X_im be its maximal subtree decomposition in the tree T(X_i). Then, it holds that $\|F(Q) - \sum_{j=1}^{m} F(X_{i_j})\|_1 = O(\lg |Q| \lg^* |S|)\, d(Q, R)$.
proof. By Theorem 1, $\|F(Q) - F(R)\|_1$ is bounded in terms of d(Q, R). On the other hand, for an occurrence of R in S, let T(X_i) be the smallest subtree in T(S) containing the occurrence of R, i.e., R ∈ itv(X_i). For T(R) and T(X_i), let s(R) and s(X_i) be the sequences of the level 2 symbols in T(R) and T(X_i), respectively. By the definition of the ESP, it holds that s(R) = αβγ and s(X_i) = α′βγ′ for some strings satisfying |αα′γγ′| = O(lg* |S|), and this is true for the remaining string β iteratively. Thus, $\|F(R) - F(X_i)\|_1 = O(\lg |R| \lg^* |S|)$ since the trees are balanced. Hence, by the equation we obtain the approximation ratio. ✷
To further enhance the search efficiency, we present a lower bound of the L1-distance between characteristic vectors, which can be used for reducing the search space.

Theorem 3 (A lower bound µ) For any
proof. The L1-distance between F(S) and F(Q) is divided into four classes of terms: (i) both members in F(S) and F(Q) are non-zero, (ii) both members in F(S) and F(Q) are zero, (iii) the members in F(S) and F(Q) are zero and non-zero, (iv) the members in F(S) and F(Q) are non-zero and zero, respectively. Terms consisting of class (iii) and (iv) can be written as $\sum_{e \in V(S) \setminus V(Q)} F(S)(e)$, which is a lower bound of the L1-distance. Thus, we obtain the inequality. ✷

proof. Every entry in F(X_k) is less than or equal to the corresponding entry in F(X_i). Thus, the inequality holds. ✷
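The lower bound can be checked numerically: summing the entries of F(S) over the symbols that do not occur in Q never exceeds the full L1-distance, so a candidate whose bound already exceeds the threshold τ can be skipped without computing the distance. The sketch below uses made-up vectors; the function names and the `can_prune` helper are our own illustration of how such a bound might be used, not the paper's Algorithm 1.

```python
from collections import Counter


def l1_distance(f_s: Counter, f_q: Counter) -> int:
    """Full L1-distance between two sparse characteristic vectors."""
    return sum(abs(f_s[e] - f_q[e]) for e in set(f_s) | set(f_q))


def lower_bound_mu(f_s: Counter, f_q: Counter) -> int:
    """Sum of F(S)(e) over symbols e that appear in S but not in Q."""
    return sum(count for e, count in f_s.items() if f_q[e] == 0)


def can_prune(f_s: Counter, f_q: Counter, tau: int) -> bool:
    """True when the lower bound alone shows the distance exceeds tau."""
    return lower_bound_mu(f_s, f_q) > tau


if __name__ == "__main__":
    # Hypothetical vectors for a text node and a query.
    f_s = Counter({"X1": 3, "X2": 1, "X5": 2})
    f_q = Counter({"X1": 1, "X4": 2})
    print(lower_bound_mu(f_s, f_q), l1_distance(f_s, f_q), can_prune(f_s, f_q, 2))
```

Because entries of a descendant's vector never exceed those of its ancestor, a bound that exceeds τ at a node also exceeds τ at every ancestor of that node.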
Applying all variables to Algorithm 1 enables us to find the candidates covering all solutions. There are no possibilities for missing any |Q|-grams in itv(S) such that the L1-distances between their characteristic vectors and F(Q) are at most τ, i.e., there are no false negatives. The set may include a false positive, i.e., the solution set may contain a |Q|-gram such that the L1-distance between its characteristic vector and F(Q) is more than τ. However, false positives are efficiently removed by computing the L1-distance $\|F(Q) - \sum_{j=1}^{m} F(X_{i_j})\|_1$ as a post-processing step.
[Figure: (I) ESP-tree built from a query string Q, a characteristic vector F(Q) and a distance threshold τ. (iii) The computation for suf(val(X_3), 1) and pre(val(X_5), 4) in itv(X_7).]

proof. Because the height of the ESP-tree is O(lg |S|), for each variable X, the number of visited nodes is O(lg |Q| + lg |S|). The computation time of LeftChild(X) and RightChild(X) is O(lg lg (n + σ)), and the time of FindLeft and FindRight is O(|Q| lg lg (n + σ)(lg |S| + lg |Q|)). Thus, for n iterations of the functions, the total computation time is O(n|Q| lg lg (n + σ)(lg |S| + lg |Q|)). ✷

Algorithm 1: outputs the candidate R ⊆ V(S) for X ∈ V(S), a query pattern Q and a distance threshold τ.

Computing positions
The algorithm also computes all the positions of val(X_i), denoted by P(X_i). Starting from X_i, the algorithm goes up to the root in the ESP-tree built from S. p is initialized to 0 at X_i. If a node X_k on the path from X_i to the root is a parent whose right child X_r(k) is on the path, the non-negative integer (|val(X_k)| − |val(X_r(k))|) is added to p. Otherwise, nothing is added to p. When the algorithm reaches the root, p represents a start position of val(X_i) on S. To compute the set P(X_i), the algorithm starts from X_i and goes up to the root for each parent in RightParents(X_i) and LeftParents(X_i), which return sets of parents for X_i. Algorithm 2 shows the pseudo-code.

proof. Using the index structures of RightParents(X) and LeftParents(X), we can traverse the path from any node with label X to the root of T(S) while counting the position. The length of the path is O(lg |S|). ✷
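The upward walk that accumulates p can be sketched on an explicit parse tree with parent pointers. This is a simplification for illustration: the index itself follows LeftParents and RightParents on the encoded grammar rather than storing one node per occurrence, and the `Node` class, the toy tree, and the 0-based offset convention are our own assumptions.

```python
class Node:
    """Parse-tree node with a parent pointer and the length of its derived string."""

    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.parent = None
        self.length = 1 if left is None else left.length + right.length
        for child in (left, right):
            if child is not None:
                child.parent = self


def start_offset(node):
    """0-based offset in S of the string derived from this occurrence."""
    p = 0
    while node.parent is not None:
        parent = node.parent
        if parent.right is node:
            # Coming up from a right child: skip the left sibling's string.
            p += parent.length - node.length
        node = parent
    return p


def build_toy_tree():
    """Tree for S = ababab: X3 -> X2 X1, X2 -> X1 X1, X1 -> ab (hypothetical)."""
    occurrences = [Node("X1", Node("a"), Node("b")) for _ in range(3)]
    x2 = Node("X2", occurrences[0], occurrences[1])
    root = Node("X3", x2, occurrences[2])
    return root, occurrences


if __name__ == "__main__":
    _, occ = build_toy_tree()
    print([start_offset(node) for node in occ])
```

Each walk touches one node per level, so its cost is proportional to the height of the tree, in line with the O(lg |S|) path length used in the proof.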
proof. The time for computing T(Q) and F(Q) is t_1 = O(|Q| lg* |S|). The time for finding candidates is t_2 = O(n|Q| lg lg (n + σ)(lg |S| + lg |Q|)) by Theorem 5. The time for computing positions is t_3 = O(occ lg |S|) by Theorem 6. Thus, the total time is t_1 + t_2 + t_3 = O(n|Q| lg lg (n + σ)(lg |S| + lg |Q|) + occ lg |S|). The size of the data structure is derived from the results in Section 5. ✷
In Theorem 7, n and occ are incomparable because occ > n is possible for a highly repetitive string.

Experiments
We evaluated the performance of siEDM on one core of a quad-core Intel Xeon Processor E5540 (2.53GHz) machine with 144GB memory. We implemented siEDM using the rank/select dictionary and GMR in libcds. We used two standard benchmark datasets, einstein and cere, from the repetitive text collections in the pizza & chili corpus, which are detailed in Table 1. As a comparison method, we used the online pattern matching for EDM called online ESP (baseline) [20], which approximates EDM between a query Q and the substring of length |Q| at each position of an input text. We randomly selected S[i, j] as the query pattern Q for each |Q| = 50, 100, 500, 1000 and examined the performance.
Table 2 shows the memory consumption in the search of siEDM and the baseline. The memory consumption of siEDM was larger than that of the baseline for both texts because the baseline does not have the characteristic vectors of each node or the length vector. Table 3 shows the size of each component of the index structure and the time for building the index structure on the einstein and cere datasets. Most of the size of the index structure was consumed by the characteristic vector F. The index size of cere was much larger than that of einstein. The index sizes of einstein and cere were approximately 16 megabytes and 256 megabytes, respectively, because the number of variables generated from cere was much larger than that generated from einstein. The number of variables generated from einstein was 305,098 and the number of variables generated from cere was 4,512,406. The construction times of the index structures were 118 seconds for einstein and 472 seconds for cere. The results for constructing the index structures demonstrate the applicability of siEDM to moderately large, repetitive texts.
Figure 5 shows the total search time (sec.) of siEDM and the baseline for einstein and cere for distance thresholds τ from 10 to 60.
In addition, this result does not contain the case τ < 10 because siEDM found no candidate under that condition. The query length is one of {50, 100, 500, 1000}. Because the search time of the baseline is linear in |S| + |Q|, we show only the fastest case: q = |Q| = 50. The search time of siEDM was shorter than that of the baseline in most cases. Figure 6 shows the detailed search time in seconds. CF is the time for finding candidates of Q in T(S), DIST is the time for computing the approximated L1-distance by characteristic vectors, and PC is the time for determining the positions of all |Q|-grams within the threshold τ. Figure 7 shows the number of nodes of T(S) visited by the algorithm, #TN, the number of candidate |Q|-grams computed by FindCandidates, #CAND, the number of true positives among candidate |Q|-grams, #TP, and the number of occurrences, #OCC. The most time-consuming task is the candidate finding.
By the monotonicity of characteristic vectors, pruning the search space is more effective for small distance thresholds and long query lengths. Thus, it is expected that siEDM is faster for smaller distance thresholds and longer query lengths, and the experimental results support this. The search time on cere is much longer than that on einstein because the number of production rules generated from cere is much larger than that from einstein, and a large number of iterations of FindCandidates is executed. In addition, the comparison of #CAND and #TP validates the efficiency of siEDM for candidate finding with the proposed pruning method. In Figure 7, the algorithm failed to find a candidate. Such a phenomenon often appears when the required threshold τ is too small, because the ESP-tree T(Q) is not necessarily identical to T(S[i, j]) even if Q = S[i, j]. Generally, the parsing of T(S[i, j]) is affected by a suffix of S[1, i − 1] and a prefix of S[j + 1, |S|] of length at most lg* |S|.
As shown in Table 3 and Figure 5, the search time of siEDM depends on the size of the encoded ESP-tree for the input. Finally, we confirm this feature by an additional experiment on other repetitive texts. Tables 4, 5 and 6 describe several datasets from the pizza & chili corpus. Figure 8 shows the search time of siEDM and the baseline. This result supports our claim that siEDM is suitable for computing EDM of repetitive texts.

Conclusion
We have proposed siEDM, an efficient string index for approximate searching based on EDM. Experimental results demonstrated the applicability of siEDM to real-world repetitive text collections as well as to searches with longer patterns. Future work will make the search algorithm in siEDM faster, which would be beneficial for users performing query searches for EDM.