Open Access
Algorithms 2016, 9(2), 26; https://doi.org/10.3390/a9020026
Article
siEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves
^{1}
Graduate School of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka 820-8502, Japan
^{2}
Computer Centre, Gakushuin University, 1-5-1 Mejiro, Toshima-ku, Tokyo 171-8588, Japan
^{3}
PRESTO, Japan Science and Technology Agency, 4-1-8 Honcho, Kawaguchi-shi, Saitama 332-0012, Japan
^{*}
Author to whom correspondence should be addressed.
Academic Editor: Florin Manea
Received: 25 November 2015 / Accepted: 11 April 2016 / Published: 15 April 2016
Abstract
Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm that supports editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinary editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind edit sensitive parsing (ESP), an efficient parsing algorithm that approximately computes EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the search space for query strings, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM’s efficiency.
Keywords:
edit distance with moves; self-index; grammar-based self-index; string index for edit distance with moves

1. Introduction
Vast amounts of text data are created, replicated, and modified with the increasing use of the internet and advances in data-centric technology. Many of these data contain repetitions of long substrings with slight differences, so-called highly repetitive texts, such as Wikipedia and software repositories like GitHub with large numbers of revisions. In addition, recent biological databases store large collections of human genomes, and since the genetic differences among individuals are less than $0.1$ percent, these collections of human genomes are highly repetitive. Therefore, there is a strong need for powerful methods to process highly repetitive text collections on a large scale.
Building indexes is the de facto standard approach to searching large databases of highly repetitive texts. Several methods have been presented for indexing and searching large-scale, highly repetitive text collections; examples include the ESP-index [1], the SLP-index [2] and the LZ77-based index [3]. Recently, Gagie and Puglisi [4] presented a general framework called kernelization for indexing and searching highly repetitive texts. Although these methods enable fast query searches, their applicability is limited to exact match searches.
The edit distance between two strings is the minimum cost of edit operations (insertions, deletions, and replacements of characters) to transform one string into another. It has been used for detecting evolutionary changes in biological sequences [5], detecting typing errors in documents [6], and correcting errors on lossy communication channels [7]. To overcome the quadratic time upper bound on computing the edit distance, Cormode and Muthukrishnan introduced a technique called edit sensitive parsing (ESP) [8]. This technique computes a modified edit distance in near-linear time by sacrificing accuracy within theoretical bounds. The modified distance is known as edit distance with moves (EDM) [8], which includes substring move operations in addition to insertions and deletions. While the exact computation of EDM is known to be intractable [9], the approximate computation of EDM using ESP achieves a good approximation ratio of $O(\lg N \lg^* N)$ and runs in almost linear time $O(N \lg^* N)$ for the string length $N$, where $\lg$ denotes the logarithm of base two.
ESP has been extended to various applications for highly repetitive texts. Examples include grammar compression [10,11,12,13], indexes for exact matches [1,14,15], approximate frequent pattern discovery [16] and online pattern matching for EDM [17]. Despite several attempts to efficiently compute EDM and various extensions of ESP, there is no method for indexing and searching texts with EDM. Such a method is required in bioinformatics, where approximate text searches are used to analyze massive genome sequences. Thus, an open challenge is to develop an efficient string index and search algorithm for EDM.
We propose a novel method called siEDM that efficiently indexes a massive text and performs query searches for EDM. To the best of our knowledge, siEDM is the first string index for query searches under EDM. A space-efficient index structure for a string is built by succinctly encoding a parse tree obtained from ESP, and query searches are performed on the encoded index structure. siEDM prunes useless portions of the search space based on a lower bound of EDM without missing any matching patterns, enabling fast query searches. As in existing methods, similarity searches in siEDM are approximate but have the same guarantee on the approximation ratio as ESP.
Experiments were performed on indexing and searching repetitive texts for EDM on standard benchmark datasets. A performance comparison with an online pattern matching algorithm for EDM [17] demonstrates siEDM’s practicality.
2. Preliminaries
2.1. Basic Notations
Let Σ be a finite alphabet and $\sigma = |\Sigma|$. All elements in Σ are totally ordered. Let us denote by $\Sigma^{*}$ the set of all strings over Σ, and by $\Sigma^{q}$ the set of strings of length q over Σ, i.e., $\Sigma^q = \{w \in \Sigma^* : |w| = q\}$; an element of $\Sigma^q$ is called a q-gram. The length of a string S is denoted by $|S|$. The empty string ϵ is the string of length 0, namely $|\epsilon| = 0$. For a string $S = \alpha\beta\gamma$, the strings α, β and γ are called a prefix, substring, and suffix of S, respectively. The i-th character of a string S is denoted by $S[i]$ for $i \in [1, |S|]$. For a string S and interval $[i,j]$ ($1 \le i \le j \le |S|$), let $S[i,j]$ denote the substring of S that begins at position i and ends at position j, and let $S[i,j]$ be ϵ when $i > j$. For a string S and integer $q \ge 0$, let $\mathit{pre}(S,q) = S[1,q]$ and $\mathit{suf}(S,q) = S[|S|-q+1, |S|]$. We assume a recursively enumerable set $\mathcal{X}$ of variables with $\Sigma \cap \mathcal{X} = \varnothing$. All elements in $\Sigma \cup \mathcal{X}$ are totally ordered, where all elements in Σ must be smaller than those in $\mathcal{X}$. In this paper, we call a sequence of symbols from $\Sigma \cup \mathcal{X}$ a string. Let us define $\lg^{(1)} u = \lg u$ and $\lg^{(i+1)} u = \lg(\lg^{(i)} u)$ for $i \ge 1$. The iterated logarithm of u is denoted by $\lg^* u$ and defined as the number of times the logarithm function must be applied before the result is less than or equal to 1, i.e., $\lg^* u = \min\{i : \lg^{(i)} u \le 1\}$.
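Since the analysis throughout the paper uses the iterated logarithm, its definition can be sketched in a few lines of Python (an illustrative helper, not part of the paper's algorithms):

```python
import math

def lg_star(u: float) -> int:
    """Iterated logarithm: the number of times lg (log base 2) must be
    applied before the result is less than or equal to 1."""
    count = 0
    while u > 1:
        u = math.log2(u)
        count += 1
    return count

# 16 -> 4 -> 2 -> 1, so lg* 16 = 3
print(lg_star(16))      # 3
# 65536 -> 16 -> 4 -> 2 -> 1, so lg* 65536 = 4
print(lg_star(65536))   # 4
```

The function grows extremely slowly: $\lg^* u \le 5$ for every $u \le 2^{65536}$, which is why factors of $\lg^* |S|$ are regarded as nearly constant.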
2.2. StraightLine Program (SLP)
A context-free grammar (CFG) in Chomsky normal form is a quadruple $G = (\Sigma, V, D, X_s)$, where V is a finite subset of $\mathcal{X}$, D is a finite subset of $V \times (V \cup \Sigma)^2$, and $X_s \in V$ is the start symbol. An element of D is called a production rule. Denote by $X_{l(k)}$ (resp. $X_{r(k)}$) the left (resp. right) symbol on the right hand side of the production rule whose left hand side is the variable $X_k$, i.e., $X_k \to X_{l(k)} X_{r(k)}$. For a variable $X_i \in V$, $\mathit{val}(X_i)$ denotes the string derived from $X_i$. A grammar compression of S is a CFG G that derives S and only S. The size of a CFG is the number of variables, i.e., $|V|$; let $n = |V|$.
The parse tree of G is a rooted ordered binary tree such that (i) internal nodes are labeled by variables in V and (ii) leaves are labeled by symbols in Σ, i.e., the label sequence in leaves is equal to the input string. In a parse tree, any internal node Z corresponds to a production rule $Z\to XY$, and has the left child with label X and the right child with label Y.
A straight-line program (SLP) [18] is a grammar compression over $\Sigma \cup V$ whose production rules are of the form $X_k \to X_i X_j$, where $X_i, X_j \in \Sigma \cup V$ and $1 \le i, j < k \le n + \sigma$.
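To make the SLP definition concrete, here is a minimal Python sketch (the rule table and the derived string are invented for illustration): terminals are single characters, variables are integers, and each rule maps a variable to the pair of symbols on its right hand side.

```python
# Hypothetical SLP: X1 -> ab, X2 -> X1 X1, X3 -> X2 a.
# Terminals are chars; variables are ints referring to earlier symbols only,
# matching the SLP constraint i, j < k.
rules = {
    1: ('a', 'b'),
    2: (1, 1),
    3: (2, 'a'),
}

def val(symbol, rules):
    """Expand a symbol to the string it derives (the paper's val(X_i))."""
    if isinstance(symbol, str):      # a terminal derives itself
        return symbol
    left, right = rules[symbol]      # X_k -> X_{l(k)} X_{r(k)}
    return val(left, rules) + val(right, rules)

print(val(3, rules))  # "ababa"
```

A production can be shared by many occurrences, which is why an SLP of n rules may derive a string exponentially longer than n.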
2.3. Rank/Select Dictionaries
A rank/select dictionary for a bit string B [19] supports the following queries: $\mathit{rank}_c(B,i)$ returns the number of occurrences of $c \in \{0,1\}$ in $B[0,i]$; $\mathit{select}_c(B,i)$ returns the position of the i-th occurrence of $c \in \{0,1\}$ in B; $\mathit{access}(B,i)$ returns the i-th bit of B. Data structures using only $|B| + o(|B|)$ bits of storage that achieve $O(1)$-time rank and select queries [20] have been presented.
GMR [21] is a rank/select dictionary for large alphabets that supports rank/select/access queries for strings in $(\Sigma \cup V)^*$. GMR uses $(n+\sigma)\lg(n+\sigma) + o((n+\sigma)\lg(n+\sigma))$ bits, computing rank and access queries in $O(\lg\lg(n+\sigma))$ time and select queries in $O(1)$ time.
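As an illustration of the rank/select interface only (not the succinct constant-time structures of [20,21]), a naive Python sketch of the queries on a bit string might look as follows; positions here are 0-based, matching the $B[0,i]$ convention above.

```python
def rank(B: str, c: str, i: int) -> int:
    """Number of occurrences of bit c in B[0..i] (inclusive)."""
    return B[:i + 1].count(c)

def select(B: str, c: str, i: int) -> int:
    """0-based position of the i-th (1-based) occurrence of c in B."""
    count = 0
    for pos, bit in enumerate(B):
        if bit == c:
            count += 1
            if count == i:
                return pos
    raise ValueError("fewer than i occurrences of c in B")

B = "0110010010001"
print(rank(B, '1', 5))    # 3: three 1s appear in B[0..5] = "011001"
print(select(B, '1', 2))  # 2: the 2nd 1 sits at position 2
```

The naive versions take linear time per query; the cited structures answer the same queries in $O(1)$ or $O(\lg\lg(n+\sigma))$ time within succinct space.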
3. Problem
We first review the notion of EDM. The distance $d(S,Q)$ between two strings S and Q is the minimum number of edit operations to transform S into Q. The edit operations are defined as follows:
Insertion: a character a is inserted at position i in S, which generates $S[1,i-1]\,a\,S[i,|S|]$;
Deletion: the character at position i in S is deleted, which generates $S[1,i-1]\,S[i+1,|S|]$;
Replacement: the character at position i in S is replaced by a, which generates $S[1,i-1]\,a\,S[i+1,|S|]$;
Substring move: a substring $S[i,j]$ is deleted from position i and inserted at position k in S, which generates $S[1,i-1]\,S[j+1,k-1]\,S[i,j]\,S[k,|S|]$ for $1 \le i \le j \le k \le |S|$, and $S[1,k-1]\,S[i,j]\,S[k,i-1]\,S[j+1,|S|]$ for $1 \le k \le i \le j \le |S|$.
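The four operations can be sketched directly on Python strings using the 1-based indexing of the definitions above (an illustrative sketch; the `move` helper covers only the forward case $i \le j \le k$):

```python
def insert(S: str, i: int, a: str) -> str:
    """Insert character a at position i (1-based): S[1,i-1] a S[i,|S|]."""
    return S[:i - 1] + a + S[i - 1:]

def delete(S: str, i: int) -> str:
    """Delete the character at position i: S[1,i-1] S[i+1,|S|]."""
    return S[:i - 1] + S[i:]

def replace(S: str, i: int, a: str) -> str:
    """Replace the character at position i by a: S[1,i-1] a S[i+1,|S|]."""
    return S[:i - 1] + a + S[i:]

def move(S: str, i: int, j: int, k: int) -> str:
    """Move S[i,j] so it precedes position k (forward case i <= j <= k):
    S[1,i-1] S[j+1,k-1] S[i,j] S[k,|S|]."""
    assert 1 <= i <= j < k <= len(S) + 1
    return S[:i - 1] + S[j:k - 1] + S[i - 1:j] + S[k - 1:]

S = "abcdef"
print(move(S, 1, 2, 5))  # "cdabef": "ab" moved behind "cd"
```

A substring move counts as a single operation in EDM, whereas simulating it with insertions and deletions would cost proportionally to the length of the moved substring.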
Problem 1
(Query search for EDM). For a string $S \in \Sigma^*$, a query $Q \in \Sigma^*$ and a distance threshold $\tau \ge 0$, find all $i \in [1, |S|]$ satisfying $d(S[i, i+|Q|-1], Q) \le \tau$.
Shapira and Storer [9] proved the NP-completeness of computing EDM and proposed a polynomial-time algorithm for a restricted EDM. Cormode and Muthukrishnan [8] presented an approximation algorithm named ESP for computing EDM. We present a string index and search algorithm that leverages the idea behind ESP to solve Problem 1. Our method consists of two parts: (i) an efficient index structure for a given string S and (ii) a fast algorithm for searching for a query Q on the index structure of S with respect to EDM. Although our method is also an approximation algorithm, it guarantees upper and lower bounds on the exact EDM. We first review ESP in the next section and then present the two parts.
4. Edit Sensitive Parsing (ESP) for Building SLPs
4.1. ESP Revisit
We review the edit sensitive parsing algorithm for building SLPs [10]. This algorithm, referred to as ESP-comp, computes an SLP from an input string S. The tasks of ESP-comp are to (i) partition S into $s_1 s_2 \cdots s_\ell$ such that $2 \le |s_i| \le 3$ for each $1 \le i \le \ell$; (ii) if $|s_i| = 2$, generate the production rule $X \to s_i$ and replace $s_i$ by X (this subtree is referred to as a 2-tree), and if $|s_i| = 3$, generate the production rules $Y \to AX$ and $X \to BC$ for $s_i = ABC$ and replace $s_i$ by Y (referred to as a 2-2-tree); (iii) iterate this process until S becomes a single symbol. Finally, ESP-comp has built an SLP representing the original string S.
We focus on how to determine the partition $S = s_1 s_2 \cdots s_\ell$. A string of the form $a^r$ with $a \in \Sigma \cup V$ and $r \ge 2$ is called a repetition. First, S is uniquely partitioned into the form $w_1 x_1 w_2 x_2 \cdots w_k x_k w_{k+1}$ by its maximal repetitions, where each $x_i$ is a maximal repetition of a symbol in $\Sigma \cup V$ and each $w_i \in (\Sigma \cup V)^*$ contains no repetition. Then, each $x_i$ is called type1, each $w_i$ of length at least $2\lg^* |S|$ is type2, and any remaining $w_i$ is type3. If $|w_i| = 1$, this symbol is attached to $x_{i-1}$ or $x_i$, with preference for $x_{i-1}$ when both cases are possible. Thus, if $|S| > 2$, each $x_i$ and $w_i$ has length at least two. Each of these substrings is referred to as an $S_i$.
Next, ESP-comp parses each $S_i$ depending on its type. For type1 and type3 substrings, the algorithm performs the left-aligned parsing as follows. If $|S_i|$ is even, the algorithm builds a 2-tree from $S_i[2j-1, 2j]$ for each $j \in \{1, 2, \dots, |S_i|/2\}$; otherwise, the algorithm builds a 2-tree from $S_i[2j-1, 2j]$ for each $j \in \{1, 2, \dots, \lfloor(|S_i| - 3)/2\rfloor\}$ and builds a 2-2-tree from the last trigram $S_i[|S_i|-2, |S_i|]$. For a type2 $S_i$, the algorithm further partitions it into short substrings of length two or three by alphabet reduction [8].
Alphabet reduction: given a type2 string S, consider $S[i]$ and $S[i-1]$ as binary integers. Let p be the position of the least significant bit in which $S[i] \ne S[i-1]$, and let $\mathit{bit}(p, S[i])$ be the bit of $S[i]$ at the p-th position. Then, $L[i] = 2p + \mathit{bit}(p, S[i])$ is defined for any $i \ge 2$. Because S is repetition-free (i.e., type2), the label string $L(S) = L[2]L[3]\cdots L[|S|]$ is also type2. If the number of different symbols in S is n (denoted by $[S]$), then $[L(S)] = O(\lg n)$. From $L(S)$, the next label string is iteratively computed until a final $L^*(S)$ satisfying $[L^*(S)] \le \lg^* |S|$ is obtained. $S[i]$ is called a landmark if $L[i] > \max\{L[i-1], L[i+1]\}$.
The alphabet reduction transforms S into $L^*(S)$ such that any substring of $L^*(S)$ of length at least $2\lg^* |S|$ contains at least one landmark, because $L^*(S)$ is also type2. Using this characteristic, ESP-comp determines the bigrams $S[i,i+1]$ to be replaced for every landmark $S[i]$; since no two landmarks are adjacent, the replacement is deterministic. After replacing all landmarks, any remaining maximal substring s is replaced by the left-aligned parsing, where if $|s| = 1$, it is attached to its left or right block.
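One round of the label computation in the alphabet reduction can be sketched as follows, treating symbols as small integers (an illustrative sketch; the example symbol values are invented, and only a single reduction round is shown):

```python
def label(prev: int, cur: int) -> int:
    """L[i] = 2p + bit(p, cur), where p is the position (from 0) of the
    least significant bit in which cur differs from prev.
    Requires prev != cur, which holds in a repetition-free (type2) string."""
    diff = prev ^ cur
    p = (diff & -diff).bit_length() - 1  # index of the lowest set bit
    return 2 * p + ((cur >> p) & 1)

def alphabet_reduction(symbols):
    """One round: map a type2 integer string S to its label string L(S),
    which is defined for positions 2..|S| and is again repetition-free."""
    return [label(symbols[i - 1], symbols[i]) for i in range(1, len(symbols))]

S = [5, 3, 6, 2]               # binary: 101, 011, 110, 010
print(alphabet_reduction(S))   # [3, 0, 4]
```

Each label needs only $O(\lg[S])$ distinct values, so iterating the round shrinks the effective alphabet to at most $\lg^* |S|$ symbols, after which landmarks are chosen as local maxima of the labels.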
We give an example of the edit sensitive parsing of an input string in Figure 1(i) and (ii). The input string S is divided into three maximal substrings depending on their types. The label string L is computed for the type2 string. In general, L is iteratively computed until $[L] \le \lg^* |S|$; in this case a single iteration satisfies the condition. After the alphabet reduction, three landmarks $S[i]$ are found, and each $S[i,i+1]$ is parsed. All other remaining substrings, including type1 and type3, are parsed by the left-aligned parsing shown in Figure 1(ii). In this example, a dashed node denotes an intermediate node of a 2-2-tree. Originally, an ESP tree is a ternary tree in which each node has at most three children; the intermediate nodes are introduced to represent an ESP tree as a binary tree.
As shown in [8], the alphabet reduction approximates the minimum CFG as follows. Let S be a type2 string containing a substring α at least twice. When α is sufficiently long (e.g., $|\alpha| \ge 2\lg^* |S|$), there is a partition $\alpha = \alpha_1 \alpha_2$ such that $|\alpha_1| = O(\lg^* |S|)$ and each landmark of $\alpha_2$ within α is determined by $\alpha_1$ alone. This means the long suffix $\alpha_2$ of α is replaced by the same variables, independent of the occurrence of α.
ESP-comp generates a new, shorter string $S'$ of length between $|S|/3$ and $|S|/2$, and parses $S'$ iteratively. Given a string S, ESP builds the ESP-tree of height $O(\lg |S|)$ in $O(|S| \lg^* |S|)$ time and $O(|\Sigma \cup V| \lg |\Sigma \cup V|)$ space. The approximation ratio of ESP to the smallest grammar is $O(\lg^* |S| \lg |S|)$ [10].
4.2. Approximate Computations of EDM from ESPTrees
ESP-trees enable us to approximately compute the EDM of two strings. After constructing the ESP-trees of the two strings, their characteristic vectors are defined as follows. Let $T(S)$ be the ESP-tree for a string S. We define the integer vector $F(S)$ to be the characteristic vector of S, where $F(S)(X)$ is the number of times the variable X appears in $T(S)$ as the root of a 2-tree. For a string S, $T(S)$ and its characteristic vector are illustrated in Figure 2. The EDM between two strings S and Q can be approximated by the $L_1$ distance between the characteristic vectors $F(S)$ and $F(Q)$:
$$\|F(S) - F(Q)\|_1 = \sum_{e \in V(S) \cup V(Q)} |F(S)(e) - F(Q)(e)|$$
Cormode and Muthukrishnan showed the upper and lower bounds on the ${L}_{1}$distance between characteristic vectors for the exact EDM.
Theorem 1
(Upper and lower bounds of the approximated EDM [8]). For $N = \max(|S|, |Q|)$,
$$d(S,Q) \le 2\|F(S) - F(Q)\|_1 = O(\lg N \lg^* N)\, d(S,Q)$$
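The $L_1$ distance between two characteristic vectors is just a sum of absolute differences over all variables appearing in either vector; a sketch holding the vectors as Python `Counter`s (the variable names and counts are invented for illustration):

```python
from collections import Counter

def l1_distance(F_S: Counter, F_Q: Counter) -> int:
    """L1 distance between two characteristic vectors: sum of
    |F(S)(e) - F(Q)(e)| over every variable e seen in either vector.
    A Counter returns 0 for a missing key, matching a zero entry."""
    keys = set(F_S) | set(F_Q)
    return sum(abs(F_S[e] - F_Q[e]) for e in keys)

F_S = Counter({'X1': 3, 'X2': 1, 'X3': 2})
F_Q = Counter({'X1': 2, 'X3': 2, 'X4': 1})
print(l1_distance(F_S, F_Q))  # |3-2| + |1-0| + |2-2| + |0-1| = 3
```

By Theorem 1, twice this quantity upper-bounds the exact EDM and overshoots it by at most an $O(\lg N \lg^* N)$ factor.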
5. Index Structure for ESPTrees
5.1. Efficient Encoding Scheme
siEDM encodes the ESP-tree built from a string to enable fast query searches. The encoding scheme sorts the production rules of the ESP-tree such that the left symbols on the right hand sides of the production rules are in monotonically increasing order, which enables efficient encoding of the production rules and fast operations on the ESP-tree. The encoding proceeds from the first and second levels to the top level (i.e., the root) of the ESP-tree.
First, the set of production rules at the first and second levels of the ESP-tree is sorted in increasing order of the left symbols on the right hand sides of the production rules, i.e., $X_{l(i)}$ in rules of the form $X_i \to X_{l(i)} X_{r(i)}$, which results in a sorted sequence of these production rules. The variables on the left hand sides of the sorted production rules are renamed in the sorted order, generating a set of new production rules that is assigned to the corresponding nodes of the ESP-tree. The same scheme is applied to the next level of the ESP-tree, iterating until the root node is reached.
Figure 3 shows an example of the encoding scheme for the ESP-tree built from the input string $S = babababaaba$. At the first and second levels of the ESP-tree, the set of production rules $\{X_1 \to ab, X_2 \to bX_1, X_3 \to aa, X_4 \to ba\}$ is sorted in lexicographic order of the left symbols on the right hand sides, which results in the sequence $(X_1 \to ab, X_3 \to aa, X_2 \to bX_1, X_4 \to ba)$. The variables on the left hand sides of the production rules are renamed in the sorted order, resulting in the new sequence $(X_1 \to ab, X_2 \to aa, X_3 \to bX_1, X_4 \to ba)$, whose production rules are assigned to the corresponding nodes of the ESP-tree. This scheme is repeated until level 4 is reached.
Using the above encoding scheme, we obtain a monotonically increasing sequence of the left symbols on the right hand sides of the production rules, i.e., the $X_{l(i)}$ in rules $X_i \to X_{l(i)} X_{r(i)}$. Let $A_l$ be this increasing sequence; $A_l$ can be efficiently encoded into a bit string by gap encoding followed by unary coding. For example, gap encoding represents the sequence $(1,1,3,5,8)$ by $(1,0,2,2,3)$, which is further transformed into the bit string $0^1 1\, 0^0 1\, 0^2 1\, 0^2 1\, 0^3 1 = 0110010010001$ by unary coding. In general, for a sequence $(x_1, x_2, \dots, x_n)$, its unary code U recovers $x_i$ as $\mathit{rank}_0(U, \mathit{select}_1(U, i))$. Because the numbers of 0s and 1s are at most $n + \sigma$ and n, respectively, the size of U is at most $2n + \sigma$ bits. The bit string is indexed by a rank/select dictionary.
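The gap-plus-unary encoding and the $\mathit{rank}_0(U, \mathit{select}_1(U, i))$ decoding can be sketched naively as follows, reproducing the example sequence above (a plain-loop sketch, not a constant-time rank/select dictionary):

```python
def unary_gap_encode(seq):
    """Encode a non-decreasing sequence by gaps, each gap g as 0^g 1."""
    bits, prev = [], 0
    for x in seq:
        bits.append('0' * (x - prev) + '1')
        prev = x
    return ''.join(bits)

def decode_ith(U: str, i: int) -> int:
    """Recover seq[i] (1-based) as rank_0(U, select_1(U, i)):
    count the 0s up to and including the position of the i-th 1."""
    ones = 0
    for pos, b in enumerate(U):
        if b == '1':
            ones += 1
            if ones == i:
                return U[:pos + 1].count('0')
    raise IndexError("fewer than i ones in U")

U = unary_gap_encode([1, 1, 3, 5, 8])
print(U)                 # 0110010010001
print(decode_ith(U, 4))  # 5
```

The encoding stores n values in $2n + \sigma$ bits at most, and with a real rank/select dictionary each value is recovered in constant time.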
Let $A_r$ be the sequence of the right symbols on the right hand sides of the production rules, i.e., the $X_{r(i)}$ in rules $X_i \to X_{l(i)} X_{r(i)}$. $A_r$ is represented in $(n+\sigma)\lg(n+\sigma)$ bits and indexed by GMR [21].
The total space for storing $A_l$ and $A_r$ is $(n+\sigma)\lg(n+\sigma) + 2n + \sigma + o((n+\sigma)\lg(n+\sigma))$ bits. $A_l$ and $A_r$ enable fast query processing on encoded ESP-trees, as presented in the next subsection.
5.2. Query Processing on Tree
The encoded ESP-trees support four tree operations, $\mathit{LeftChild}$, $\mathit{RightChild}$, $\mathit{LeftParents}$ and $\mathit{RightParents}$, which are used in our search algorithm. $\mathit{LeftChild}(X_k)$ returns the left child $X_{l(k)}$ of $X_k$ and can be implemented on the bit string $A_l$ in $O(1)$ time as $m = \mathit{select}_1(A_l, X_k)$ and $\mathit{LeftChild}(X_k) = m - X_k$. $\mathit{RightChild}(X_k)$ returns the right child $X_{r(k)}$ of $X_k$ and can be implemented on the array $A_r$ in $O(\lg\lg(n+\sigma))$ time as $X_{r(k)} = \mathit{access}(A_r, X_k)$.
$\mathit{LeftParents}(X_k)$ and $\mathit{RightParents}(X_k)$ return the sets of parents having $X_k$ as a left and a right child, respectively, i.e., $\mathit{LeftParents}(X_k) = \{X_i \in V : X_i \to X_k X_j,\ X_j \in (\Sigma \cup V)\}$ and $\mathit{RightParents}(X_k) = \{X_i \in V : X_i \to X_j X_k,\ X_j \in (\Sigma \cup V)\}$.
Because $A_l$ is a monotonic sequence, any $X_k$ appears consecutively in $A_l$. Using the unary encoding of $A_l$, $\mathit{LeftParents}(X_k)$ is computed as $\{p + i : p = \mathit{select}_1(A_l, X_k),\ \mathit{rank}_0(A_l, p+i) = \mathit{rank}_0(A_l, p)\}$ in $O(|\mathit{LeftParents}(X_k)|)$ time. $\mathit{RightParents}(X_k)$ can be computed by repeatedly applying select operations for $X_k$ on $A_r$ until no more occurrences of $X_k$ appear, i.e., $\mathit{select}_{X_k}(A_r, p)$ for $1 \le p \le n$. Thus, $\mathit{RightParents}(X_k)$ for $X_k \in V$ can be computed in $O(|\mathit{RightParents}(X_k)|)$ time.
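A naive sketch of $\mathit{LeftChild}$ and $\mathit{LeftParents}$ on the unary-coded $A_l$, using the four rules of the example in Figure 3 with symbols numbered $a = 1$, $b = 2$ and rules indexed $1, \dots, 4$ (plain loops stand in for constant-time rank/select, and `left_parents` scans all rules rather than exploiting the consecutive run):

```python
def select1(B: str, i: int) -> int:
    """1-based position of the i-th '1' in bit string B."""
    count = 0
    for pos, b in enumerate(B, start=1):
        if b == '1':
            count += 1
            if count == i:
                return pos
    raise IndexError("fewer than i ones in B")

def left_child(A_l_bits: str, k: int) -> int:
    """Left symbol of rule k: m - k, i.e. the number of 0s
    preceding the k-th 1 in the unary-coded A_l."""
    return select1(A_l_bits, k) - k

# Rules sorted by left symbol (1 = 'a', 2 = 'b'):
#   X1 -> ab, X2 -> aa, X3 -> bX1, X4 -> ba  =>  A_l = (1, 1, 2, 2)
A_l_bits = "011011"  # gaps (1, 0, 1, 0) in unary

def left_parents(A_l_bits: str, c: int, n: int):
    """All rule indices whose left symbol equals c (a consecutive run)."""
    return [k for k in range(1, n + 1) if left_child(A_l_bits, k) == c]

print([left_child(A_l_bits, k) for k in (1, 2, 3, 4)])  # [1, 1, 2, 2]
print(left_parents(A_l_bits, 2, 4))                     # [3, 4]
```

Because equal left symbols occupy one consecutive run of $A_l$, the real structure enumerates a run directly from one select and a few rank queries, paying time proportional only to the output size.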
5.3. Other Data Structures
As a supplemental data structure, siEDM computes the node characteristic vector $F(X_i)$ for each variable $X_i$: the vector of the frequencies of all variables derived from $X_i$. The space for storing the node characteristic vectors of all n variables is at most $n^2 \lg|S|$ bits. Figure 3(V) shows an example of the node characteristic vectors for the ESP-tree in Figure 3(III). In addition, let $V(X_i)$ be the set consisting of $X_i$ and the variables appearing in the descendants of $X_i$, i.e., $V(X_i) = \{e \in (V \cup \Sigma) : F(X_i)(e) \ne 0\}$. In practice, $F(X_i)$ is represented by a sequence of pairs of $X_j \in V(X_i)$ and $F(X_i)(X_j)$. Moreover, because $F(X_i) = F(\mathit{LeftChild}(X_i)) + F(\mathit{RightChild}(X_i)) + (X_i, 1)$ (where $+(X_i, 1)$ denotes adding 1 to dimension $X_i$), the characteristic vectors need only be stored at every other level of the ESP-tree. The data structure consists of a bit array $FB$ indexed by a rank/select dictionary and the characteristic vectors retained at every other level of the ESP-tree. The i-th bit of $FB$ is set to 1 if $F(X_i)$ is stored, and to 0 otherwise. Then, $F(X_i)$ is obtained as the $\mathit{rank}_1(FB, i)$-th stored characteristic vector if the i-th bit of $FB$ is 1, and as $F(\mathit{LeftChild}(X_i)) + F(\mathit{RightChild}(X_i)) + (X_i, 1)$ otherwise.
Another data structure used by siEDM is a non-negative integer vector named the length vector, whose entry for each variable is the length of the substring derived from that variable (see Figure 3(VI)). The space for storing the length vectors of n variables is $n \lg|S|$ bits.
From the above argument, the total space of siEDM’s index structure for n variables is $n(n+1)\lg|S| + (n+\sigma)\lg(n+\sigma) + 2n + \sigma + o((n+\sigma)\lg(n+\sigma))$ bits.
6. Search Algorithm
6.1. Baseline Algorithm
Given $T(S)$, the maximal subtree decomposition of $S[i,j]$ is a sequence $(X_1, X_2, \dots, X_m)$ of variables in $T(S)$ defined recursively as follows. $X_1$ is the variable at the root of the maximal subtree such that $S[i]$ is its leftmost leaf and $|\mathit{val}(X_1)| \le j - i + 1$. If $\mathit{val}(X_1) = S[i,j]$, then $(X_1)$ is the maximal subtree decomposition of $S[i,j]$. Otherwise, suppose $X_1, X_2, \dots, X_m$ have already been determined and $|\mathit{val}(X_1)\mathit{val}(X_2)\cdots \mathit{val}(X_m)| = k < j - i + 1$. Then, $X_{m+1}$ is the variable at the root of the maximal subtree such that $S[i+k]$ is its leftmost leaf and $|\mathit{val}(X_{m+1})| \le j - i + 1 - k$. Repeating this process until $\mathit{val}(X_1)\mathit{val}(X_2)\cdots \mathit{val}(X_m) = S[i,j]$, the maximal subtree decomposition is determined.
Based on the maximal subtree decomposition, we outline the baseline algorithm, called online ESP [17], for computing an approximation of the EDM between two strings. $T(S)$ is constructed beforehand. Given a pattern Q, the online ESP computes $T(Q)$, and for each substring $S[i,j]$ of length $|Q|$, it computes the approximate EDM as follows. It computes the maximal subtree decomposition $(X_1, X_2, \dots, X_m)$ of $S[i,j]$. Then, the distance $\|F(Q) - F(S[i,j])\|_1$ is approximated by $\|F(Q) - \sum_{k=1}^{m} F(X_k)\|_1$, because the ESP-tree is balanced and hence $\|F(S[i,j]) - \sum_{k=1}^{m} F(X_k)\|_1 = O(\lg m)$. This baseline algorithm, however, must compute the characteristic vector of $S[i,j]$ at each position i. Next, we improve the time and space of the online ESP by finding the $|Q|$-grams for each variable X in $V(S)$ instead of for each position i.
6.2. Improvement
siEDM approximately solves Problem 1 with the same guarantees as in Theorem 1. Let $X_i \in V(S)$ be a variable such that $|\mathit{val}(X_i)| > |Q|$. Consider the $|Q|$-grams formed by the strings $\mathit{suf}(\mathit{val}(X_{l(i)}), |Q|-k)\,\mathit{pre}(\mathit{val}(X_{r(i)}), k)$ for $k = 1, 2, \dots, |Q|-1$. The variable $X_i$ is said to stab these $|Q|$-grams. The set of $|Q|$-grams stabbed by $X_i$ is denoted by $\mathit{itv}(X_i)$. Let $\mathit{itv}(S)$ be the union of $\mathit{itv}(X_i)$ over all $X_i$ appearing in $T(S)$. An important fact is that $\mathit{itv}(S)$ includes every $|Q|$-gram of S. Using this characteristic, we can reduce the search space.
If a $|Q|$-gram R is in $\mathit{itv}(X_i)$, there exists a maximal subtree decomposition $X_{i_1}, X_{i_2}, \dots, X_{i_m}$ of R. Then, the $L_1$ distance between $F(Q)$ and $\sum_{j=1}^{m} F(X_{i_j})$ guarantees the same upper bound as the original ESP, as follows.
Theorem 2.
Let $R \in \mathit{itv}(X_i)$ be a $|Q|$-gram of S and $X_{i_1}, X_{i_2}, \dots, X_{i_m}$ be its maximal subtree decomposition in the tree $T(X_i)$. Then, it holds that
$$\Big\| F(Q) - \sum_{j=1}^{m} F(X_{i_j}) \Big\|_1 = O(\lg|Q|\, \lg^* |S|)\, d(Q, R)$$
Proof.
By Theorem 1, $\|F(Q) - F(R)\|_1 = O(\lg|Q|\, \lg^* |S|)\, d(Q,R)$. On the other hand, for an occurrence of R in S, let $T(X_i)$ be the smallest subtree of $T(S)$ containing the occurrence of R, i.e., $R \in \mathit{itv}(X_i)$. For $T(R)$ and $T(X_i)$, let $s(R)$ and $s(X_i)$ be the sequences of the level-2 symbols of $T(R)$ and $T(X_i)$, respectively. By the definition of ESP, it holds that $s(R) = \alpha\beta\gamma$ and $s(X_i) = \alpha'\beta\gamma'$ for some strings satisfying $|\alpha|, |\alpha'|, |\gamma|, |\gamma'| = O(\lg^* |S|)$, and this holds for the remaining string β iteratively. Thus, $\|F(R) - F(X_i)\|_1 = O(\lg|R|\, \lg^* |S|)$ since the trees are balanced. Hence, by the equation
$$\begin{aligned}
\Big\| F(Q) - \sum_{j=1}^{m} F(X_{i_j}) \Big\|_1 &= O(\lg|Q|\, \lg^* |S|)\, d(Q,R) + O(\lg|Q|\, \lg^* |S|) \\
&= O(\lg|Q|\, \lg^* |S|)\, d(Q,R)
\end{aligned}$$
we obtain the approximation ratio. ☐
To further enhance the search efficiency, we present a lower bound of the ${L}_{1}$distance between characteristic vectors, which can be used for reducing the search space.
Theorem 3
(A lower bound μ). For any $X_i \in V(S) \cup V(Q)$, the inequality $\|F(S) - F(Q)\|_1 \ge \mu(X_i)$ holds, where $\oplus$ denotes the symmetric difference of two sets and
$$\mu(X_i) = \sum_{e \in V(S) \oplus V(Q)} F(X_i)(e)$$
Proof.
The $L_1$ distance between $F(S)$ and $F(Q)$ consists of four classes of terms: (i) the entries of both $F(S)$ and $F(Q)$ are non-zero, (ii) both entries are zero, (iii) the entries of $F(S)$ and $F(Q)$ are zero and non-zero, respectively, and (iv) the entries of $F(S)$ and $F(Q)$ are non-zero and zero, respectively. The terms of classes (iii) and (iv) sum to $\sum_{e \in V(S) \oplus V(Q)} (F(S)(e) + F(Q)(e))$, which is therefore a lower bound on the $L_1$ distance. For any $X_i \in V(S) \cup V(Q)$, each entry of $F(X_i)$ is at most the corresponding entry of $F(S) + F(Q)$, so $\|F(S) - F(Q)\|_1 \ge \mu(X_i)$. ☐
Theorem 4
(Monotonicity of μ). If a variable ${X}_{i}$ derives ${X}_{k}$, the inequality $\mu \left({X}_{i}\right)\ge \mu \left({X}_{k}\right)$ holds.
Proof.
Every entry in $F\left({X}_{k}\right)$ is less than or equal to the corresponding entry in $F\left({X}_{i}\right)$. Thus, the inequality holds. ☐
6.3. Candidate Finding
By Theorems 2, 3 and 4, the task of the algorithm reduces to finding a maximal subtree decomposition $(X_{i_1}, X_{i_2}, \dots, X_{i_m})$ within $X_i$. Given a threshold $\tau \ge 0$, for each $|Q|$-gram in $\mathit{itv}(S)$, the algorithm finds the candidate: the maximal subtree decomposition $(X_{i_1}, X_{i_2}, \dots, X_{i_m})$ satisfying $\mu(X_{i_1}) + \mu(X_{i_2}) + \cdots + \mu(X_{i_m}) \le \tau$.
For an $X_i$ and an occurrence of some $|Q|$-gram in $\mathit{itv}(X_i)$, the $|Q|$-gram is formed by the expression $\mathit{suf}(\mathit{val}(X_{l(i)}), |Q|-k)\,\mathit{pre}(\mathit{val}(X_{r(i)}), k)$ for some $k$ $(1 \le k \le |Q|-1)$. The algorithm computes the maximal subtree decompositions $(x_1, x_2, \dots, x_p)$ covering $\mathit{suf}(\mathit{val}(X_{l(i)}), |Q|-k)$ and $(y_1, y_2, \dots, y_q)$ covering $\mathit{pre}(\mathit{val}(X_{r(i)}), k)$, and outputs $(x_1, \dots, x_p, y_1, \dots, y_q)$ covering the $|Q|$-gram when $\sum_{1 \le i \le p} \mu(x_i) + \sum_{1 \le i \le q} \mu(y_i) \le \tau$. We illustrate the computation of candidates satisfying $\mu(X_{i_1}) + \mu(X_{i_2}) + \cdots + \mu(X_{i_m}) \le \tau$ in Figure 4 and show the pseudocode in Algorithm 1.
Applying Algorithm 1 to all variables enables us to find candidates covering all solutions. No $|Q|$-gram in $\mathit{itv}(S)$ whose characteristic vector is within $L_1$-distance τ of $F(Q)$ can be missed, i.e., there are no false negatives. The candidate set may include false positives, i.e., $|Q|$-grams whose characteristic vectors are at $L_1$-distance greater than τ from $F(Q)$. However, false positives are efficiently removed by computing the exact $L_1$-distance $\|F(Q) - \sum_{j=1}^{m} F(X_{i_j})\|_1$ as a post-processing step.
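The threshold test at the heart of this pruning can be sketched as follows. This is an illustrative fragment, not the siEDM code: it assumes the per-node lower bounds μ have already been precomputed into a dictionary, and that each candidate is represented simply as a list of node names forming a subtree decomposition.

```python
def prune_candidates(decompositions, mu_value, tau):
    """Keep only the subtree decompositions whose summed lower bounds
    do not exceed the threshold tau. By the lower-bound property, any
    decomposition discarded here cannot be within L1 distance tau of
    F(Q), so pruning introduces no false negatives; survivors are
    verified afterwards with the exact L1 distance."""
    return [d for d in decompositions
            if sum(mu_value[x] for x in d) <= tau]
```

Survivors of this test are the candidates handed to the exact $L_1$-distance verification.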
Theorem 5.
The computation time of FindCandidates is $O(n|Q|\,\lg\lg(n+\sigma)(\lg|S| + \lg|Q|))$.
Proof.
Because the height of the ESP-tree is $O(\lg|S|)$, for each variable X, the number of visited nodes is $O(\lg|Q| + \lg|S|)$. The computation time of $\mathit{LeftChild}(X)$ and $\mathit{RightChild}(X)$ is $O(\lg\lg(n+\sigma))$, and the time of FindLeft and FindRight is $O(|Q|\,\lg\lg(n+\sigma)(\lg|S| + \lg|Q|))$. Thus, for n iterations of the functions, the total computation time is $O(n|Q|\,\lg\lg(n+\sigma)(\lg|S| + \lg|Q|))$. ☐
Algorithm 1: output the candidate set $R \subseteq V(S)$ for $X \in V(S)$, a query pattern Q, and a distance threshold τ.

6.4. Computing Positions
The algorithm also computes all the positions of $\mathit{val}(X_i)$, denoted by $P(X_i) = \{p \in \{1, 2, \dots, |S|\} : S[p, p+|\mathit{val}(X_i)|-1] = \mathit{val}(X_i)\}$. Starting from $X_i$, the algorithm goes up to the root of the ESP-tree built from S; p is initialized to 0 at $X_i$. If $X_k$ on the path from $X_i$ to the root is a parent whose right child $X_{r(k)}$ lies on the path, the nonnegative integer $|\mathit{val}(X_k)| - |\mathit{val}(X_{r(k)})|$ is added to p; otherwise, nothing is added to p. When the algorithm reaches the root, p represents a start position of $\mathit{val}(X_i)$ in S, i.e., $\mathit{val}(X_i) = S[p, p+|\mathit{val}(X_i)|-1]$. To compute the set $P(X_i)$, the algorithm starts from $X_i$ and goes up to the root for each parent in $\mathit{RightParents}(X_i)$ and $\mathit{LeftParents}(X_i)$, which return the sets of parents of $X_i$. Algorithm 2 shows the pseudocode.
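The upward walk for a single occurrence can be sketched as follows. This is a simplified illustration, assuming parent and right-child links are available as plain dictionaries and that positions are reported 1-based (hence the final `+ 1`); the function and variable names are ours, not from the siEDM implementation:

```python
def position_of(node, length, parent, right_child):
    """Compute one start position (1-based) of val(node) in S by walking
    from the node to the root. Whenever the path enters a parent X_k
    through its right child, add |val(X_k)| - |val(right child)|,
    i.e., the length of the left sibling's expansion."""
    p = 0
    x = node
    while x in parent:          # the root has no parent entry
        k = parent[x]
        if right_child.get(k) == x:
            p += length[k] - length[x]
        x = k
    return p + 1                # convert character offset to 1-based position
```

For a tiny tree with root $X_3$ (length 5) whose left child $X_1$ has length 2 and right child $X_2$ has length 3, the walk reports position 1 for $X_1$ and position 3 for $X_2$, matching $S[1,2] = \mathit{val}(X_1)$ and $S[3,5] = \mathit{val}(X_2)$.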
Algorithm 2: compute the set P of all occurrences of $\mathit{val}(X)$ in S for $X \in V(S)$.

Theorem 6.
The computation time of $P(X)$ is $O(\mathit{occ}\,\lg|S|)$, where $\mathit{occ}$ is the number of occurrences of X in $T(S)$.
Proof.
Using the index structures of $\mathit{RightParents}(X)$ and $\mathit{LeftParents}(X)$, we can traverse the path from any node labeled X to the root of $T(S)$ while counting the position. The length of the path is $O(\lg|S|)$. ☐
Theorem 7.
The search time is $O(n|Q|\,\lg\lg(n+\sigma)(\lg|S| + \lg|Q|) + \mathit{occ}\,\lg|S|)$ using the data structure of $n(n+1)\lg|S| + (n+\sigma)\lg(n+\sigma) + 2n + \sigma + o((n+\sigma)\lg(n+\sigma))$ bits.
Proof.
The time for computing $T(Q)$ and $F(Q)$ is $t_1 = O(|Q|\,\lg^*|S|)$. The time for finding candidates and computing $\|F(Q) - \sum_{j=1}^{m} F(X_{i_j})\|_1$ is $t_2 = O(n|Q|\,\lg\lg(n+\sigma)(\lg|S| + \lg|Q|))$ by Theorem 5. The time for computing positions is $t_3 = O(\mathit{occ}\,\lg|S|)$ by Theorem 6. Thus, the total time for a query search is $t_1 + t_2 + t_3 = O(n|Q|\,\lg\lg(n+\sigma)(\lg|S| + \lg|Q|) + \mathit{occ}\,\lg|S|)$. The size of the data structure is derived from the results in Section 5. ☐
In Theorem 7, n and $occ$ are incomparable because $occ>n$ is possible for a highly repetitive string.
7. Experiments
We evaluated the performance of siEDM on one core of a quad-core Intel Xeon E5540 (2.53 GHz) machine with 144 GB of memory. We implemented siEDM using the rank/select dictionaries and GMR in libcds (https://github.com/fclaude/libcds). We used two standard benchmark datasets, einstein and cere, from the repetitive text collections in the pizza & chili corpus (http://pizzachili.dcc.uchile.cl/repcorpus.html), detailed in Table 1. As a comparison method, we used the online pattern matching algorithm for EDM called online ESP (baseline) [17], which approximates the EDM between a query Q and the substring of length $|Q|$ at each position of the input text. We randomly selected $S[i,j]$ as the query pattern Q for each $|Q| = 50, 100, 500, 1000$ and examined the performance.
Table 2 shows the memory consumption during search for siEDM and the baseline. The memory consumption of siEDM was larger than that of the baseline for both texts because the baseline does not store the characteristic vectors of the nodes or the length vector.
Table 3 shows the size of each component of the index structure and the time for building the index structure on the einstein and cere datasets. Most of the index size is consumed by the characteristic vectors F. The index of cere was much larger than that of einstein, approximately 255 megabytes versus 17 megabytes, because the number of variables generated from cere ($4,512,406$) was much larger than that generated from einstein ($305,098$). The construction times of the index structures were 118 s for einstein and 472 s for cere. These results demonstrate the applicability of siEDM to moderately large, repetitive texts.
Figure 5 shows the total search time (in seconds) of siEDM and the baseline for einstein and cere with distance thresholds τ from 10 to 60. The case $\tau < 10$ is not shown because siEDM found no candidates under that condition. The query length is one of $\{50, 100, 500, 1000\}$. Because the search time of the baseline is linear in $|S| + |Q|$, we show only its fastest case, $q = |Q| = 50$. The search time of siEDM was faster than that of the baseline in most cases.
Figure 6 shows the detailed search times in seconds: CF is the time for finding candidates of Q in $T(S)$, DIST is the time for computing the approximate $L_1$-distance via characteristic vectors, and PC is the time for determining the positions of all $|Q|$-grams within the threshold τ.
Figure 7 shows the number of nodes of $T(S)$ visited by the algorithm (#TN), the number of candidate $|Q|$-grams computed by FindCandidates (#CAND), the number of true positives among the candidate $|Q|$-grams (#TP), and the number of occurrences (#OCC). The most time-consuming task is candidate finding.
By the monotonicity of the characteristic vectors, pruning the search space is more effective for small distance thresholds and long queries. Thus, siEDM is expected to be faster for smaller distance thresholds and longer query lengths, and the experimental results support this. The search time on cere is much slower than that on einstein because the number of production rules generated from cere is much larger than that from einstein, so a large number of iterations of FindCandidates is executed. In addition, the comparison of #CAND and #TP validates the efficiency of the proposed pruning method for candidate finding.
In Figure 7, there are settings in which the algorithm found no candidate. This phenomenon often appears when the required threshold τ is too small, because the ESP-tree $T(Q)$ is not necessarily identical to $T(S[i,j])$ even if $Q = S[i,j]$. Generally, the parsing of $T(S[i,j])$ is affected by a suffix of $S[1, i-1]$ and a prefix of $S[j+1, |S|]$ of length at most $\lg^*|S|$.
As shown in Table 3 and Figure 5, the search time of siEDM depends on the size of the encoded ESP-tree for the input. Finally, we confirmed this feature by an additional experiment on other repetitive texts. Table 4, Table 5 and Table 6 describe several additional datasets from the pizza & chili corpus and the corresponding memory consumption and index sizes. Figure 8 shows the search times of siEDM and the baseline. This result supports our claim that siEDM is suitable for computing the EDM of repetitive texts.
8. Conclusions
We have proposed siEDM, an efficient string index for approximate string search based on EDM. Experimental results demonstrated the applicability of siEDM to real-world repetitive text collections as well as to longer pattern searches. Future work will focus on making the search algorithm in siEDM faster, which would benefit users performing query searches under EDM.
Acknowledgments
This work was supported by a Grant-in-Aid for JSPS Fellows, JSPS KAKENHI (24700140, 26280088) and the JST PRESTO program. The authors thank the anonymous reviewers for their valuable feedback. The authors also thank T. Ohnishi and S. Fukunaga for their assistance.
Author Contributions
All the authors contributed equally to this work.
Conflicts of Interest
The authors declare no conflict of interest.
References
 Takabatake, Y.; Tabei, Y.; Sakamoto, H. Improved ESP-index: A Practical Self-Index for Highly Repetitive Texts. In Proceedings of the 13th International Symposium on Experimental Algorithms (SEA), Copenhagen, Denmark, 29 June–1 July 2014; pp. 338–350.
 Claude, F.; Navarro, G. Self-indexed grammar-based compression. Fundam. Inform. 2011, 111, 313–337. [Google Scholar]
 Gagie, T.; Gawrychowski, P.; Kärkkäinen, J.; Nekrich, Y.; Puglisi, S.J. LZ77-Based Self-Indexing with Faster Pattern Matching. In Proceedings of the 11th Latin American Theoretical Informatics Symposium (LATIN), Montevideo, Uruguay, 31 March–4 April 2014; pp. 731–742.
 Gagie, T.; Puglisi, S.J. Searching and Indexing Genomic Databases via Kernelization. Front. Bioeng. Biotechnol. 2015, 3, 12. [Google Scholar] [CrossRef] [PubMed]
 Durbin, R.; Eddy, S.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
 Crochemore, M.; Rytter, W. Text Algorithms; Oxford University Press: Oxford, UK, 1994. [Google Scholar]
 Levenshtein, V.I. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
 Cormode, G.; Muthukrishnan, S. The String Edit Distance Matching Problem with Moves. ACM Trans. Algor. 2007, 3, 1–19. [Google Scholar] [CrossRef]
 Shapira, D.; Storer, J.A. Edit distance with move operations. J. Discret. Algorithms 2007, 5, 380–392. [Google Scholar] [CrossRef]
 Sakamoto, H.; Maruyama, S.; Kida, T.; Shimozono, S. A Space-Saving Approximation Algorithm for Grammar-Based Compression. IEICE Trans. Inf. Syst. 2009, E92-D, 158–165. [Google Scholar] [CrossRef]
 Maruyama, S.; Sakamoto, H.; Takeda, M. An Online Algorithm for Lightweight Grammar-Based Compression. Algorithms 2012, 5, 213–235. [Google Scholar] [CrossRef]
 Maruyama, S.; Tabei, Y.; Sakamoto, H.; Sadakane, K. Fully-online grammar compression. In Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE), Jerusalem, Israel, 7–9 October 2013; pp. 218–229.
 Maruyama, S.; Tabei, Y. Fully-online grammar compression in constant space. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 26–28 March 2014; pp. 218–229.
 Maruyama, S.; Nakahara, M.; Kishiue, N.; Sakamoto, H. ESP-Index: A Compressed Index Based on Edit-Sensitive Parsing. J. Discret. Algorithms 2013, 18, 100–112. [Google Scholar] [CrossRef]
 Takabatake, Y.; Tabei, Y.; Sakamoto, H. Online Self-Indexed Grammar Compression. In Proceedings of the 22nd International Symposium on String Processing and Information Retrieval (SPIRE), London, UK, 1–4 September 2015; pp. 258–269.
 Nakahara, M.; Maruyama, S.; Kuboyama, T.; Sakamoto, H. Scalable Detection of Frequent Substrings by Grammar-Based Compression. IEICE Trans. Inf. Syst. 2013, E96-D, 457–464. [Google Scholar] [CrossRef]
 Takabatake, Y.; Tabei, Y.; Sakamoto, H. Online Pattern Matching for String Edit Distance with Moves. In Proceedings of the 21st International Symposium on String Processing and Information Retrieval (SPIRE), Ouro Preto, Brazil, 20–22 October 2014; pp. 203–214.
 Karpinski, M.; Rytter, W.; Shinohara, A. An efficient pattern-matching algorithm for strings with short descriptions. Nord. J. Comput. 1997, 4, 172–186. [Google Scholar]
 Jacobson, G. Space-Efficient Static Trees and Graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS), Research Triangle Park, NC, USA, 30 October–1 November 1989; pp. 549–554.
 Raman, R.; Raman, V.; Rao, S.S. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor. 2007, 3. [Google Scholar] [CrossRef]
 Golynski, A.; Munro, J.I.; Rao, S.S. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Miami, FL, USA, 22–26 January 2006; pp. 368–373.
Figure 1.
The edit sensitive parsing. In (i), an underlined $S[i]$ denotes a landmark, and $p \ge 0$. In (i) and (ii), a dashed node corresponds to an intermediate node in a 2-2-tree.
Figure 6.
Details of the search time for different $|Q|$ and τ: time for candidate finding (CF), time for $L_1$-distance computation (DIST), and time for position computation (PC). (a) and (b) correspond to CF, (c) and (d) to DIST, and (e) and (f) to PC, for einstein and cere, respectively.
Figure 7.
Statistical information of the query search: the number of traversed nodes (#TN), the number of candidate $|Q|$-grams (#CAND), the number of true positives (#TP), and the number of occurrences (#OCC). (a) and (b) correspond to #TN, (c) and (d) to #CAND, (e) and (f) to #TP, and (g) and (h) to #OCC, for einstein and cere, respectively.
Table 1. Description of the benchmark datasets.
Dataset  Length  |Σ|  Size (MB)
einstein  467,626,544  139  446
cere  461,286,644  5  440
Table 2. Memory consumption in the search.
Dataset  Einstein  Cere
siEDM (MB)  17.12  254.75
baseline (MB)  6.98  10.95
Table 3. Size of each component of the index structure and construction time.
Dataset  Einstein  Cere
Index size: encoded ESP-tree (MB)  1.18  19.92
Index size: characteristic vector F (MB)  15.35  227.34
Index size: length vector L (MB)  0.59  7.49
Construction time (s)  117.65  472.21
Table 4. Description of the additional datasets.
Dataset  Length  |Σ|  Size (MB)
influenza  154,808,555  15  147.64
Escherichia_Coli  112,689,515  15  107.47
Table 5. Memory consumption in the search on the additional datasets.
Dataset  Influenza  Escherichia_Coli
siEDM (MB)  164.87  262.01
baseline (MB)  53.01  100.81
Table 6. Size of each component of the index structure and construction time for the additional datasets.
Dataset  Influenza  Escherichia_Coli
Index size: encoded ESP-tree (MB)  9.92  20.21
Index size: characteristic vector F (MB)  150.87  234.91
Index size: length vector L (MB)  4.08  6.88
Construction time (s)  290.33  420.43
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).