Abstract
Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinary editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm that approximately computes EDM with guaranteed upper and lower bounds for the exact EDM. siEDM efficiently prunes the search space for query strings, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM’s efficiency.
1. Introduction
Vast amounts of text data are created, replicated, and modified with the increasing use of the internet and advances of data-centric technology. Many of these data contain repetitions of long substrings with slight differences, so-called highly repetitive texts, such as Wikipedia and software repositories like GitHub with a large number of revisions. Also, recent biological databases store a large number of human genomes, while the genetic differences among individuals amount to only a small fraction of the genome, which results in the collections of human genomes being highly repetitive. Therefore, there is a strong need to develop powerful methods for processing highly repetitive text collections on a large scale.
Building indexes is the de facto standard method to search large databases of highly repetitive texts. Several methods have been presented for indexing and searching large-scale and highly repetitive text collections. Examples include the ESP-index [1], SLP-index [2] and LZ77-based index [3]. Recently, Gagie and Puglisi [4] presented a general framework called kernelization for indexing and searching highly repetitive texts. Although these methods enable fast query searches, their applicability is limited to exact match searches.
The edit distance between two strings is the minimum cost of edit operations (insertions, deletions, and replacements of characters) to transform one string to another. It has been proposed for detecting evolutionary changes in biological sequences [5], detecting typing errors in documents [6], and correcting errors on lossy communication channels [7]. To improve on the quadratic time upper bound for computing the edit distance, Cormode and Muthukrishnan introduced a new technique called edit sensitive parsing (ESP) [8]. This technique allows us to compute a modified edit distance in near linear time by sacrificing accuracy with theoretical bounds. The modified distance is known as edit distance with moves (EDM) [8], which includes substring move operations in addition to insertions and deletions. While the exact computation of EDM is known to be intractable [9], the approximate computation of EDM using ESP achieves a good approximation ratio $O(\lg N \lg^* N)$, and runs in almost linear time $O(N \lg^* N)$ for the string length N, where lg denotes the logarithm of base two.
ESP is extended to various applications for highly repetitive texts. Examples are data compressions called grammar compression [10,11,12,13], indexes for exact matches [1,14,15], an approximated frequent pattern discovery [16] and an online pattern matching for EDM [17]. Despite several attempts to efficiently compute EDM and various extensions of ESP, there is no method for indexing and searching texts with EDM. Such a method is required in bioinformatics where approximated text searches are used to analyze massive genome sequences. Thus, an open challenge is to develop an efficient string index and search algorithm for EDM.
We propose a novel method called siEDM that efficiently indexes massive text, and performs query searches for EDM. As far as we know, siEDM is the first string index for searching queries for EDM. A space-efficient index structure for a string is built by succinctly encoding a parse tree obtained from ESP, and query searches are performed on the encoded index structures. siEDM prunes useless portions of the search space based on the lower bound of EDM without missing any matching patterns, enabling fast query searches. As in existing methods, similarity searches of siEDM are approximate but have the same guarantee of the approximation ratio as in ESP.
Experiments were performed on indexing and searching repetitive texts for EDM on standard benchmark datasets. The performance comparison with an online pattern matching for EDM [17] demonstrates siEDM’s practicality.
2. Preliminaries
2.1. Basic Notations
Let Σ be a finite alphabet, and σ be $|\Sigma|$. All elements in Σ are totally ordered. Let us denote by $\Sigma^*$ the set of all strings over Σ, and by $\Sigma^q$ the set of strings of length q over Σ, i.e., $\Sigma^q = \{w \in \Sigma^* : |w| = q\}$, and an element in $\Sigma^q$ is called a q-gram. The length of a string S is denoted by $|S|$. The empty string ϵ is a string of length 0, namely $|\epsilon| = 0$. For a string $S = \alpha\beta\gamma$, α, β and γ are called the prefix, substring, and suffix of S, respectively. The i-th character of a string S is denoted by $S[i]$ for $i \in [1, |S|]$. For a string S and interval $[i, j]$ ($1 \le i \le j \le |S|$), let $S[i, j]$ denote the substring of S that begins at position i and ends at position j, and let $S[i, j]$ be ϵ when $i > j$. For a string S and integer $q \ge 0$, let $pre(S, q) = S[1, q]$ and $suf(S, q) = S[|S| - q + 1, |S|]$. We assume a recursively enumerable set $\mathcal{X}$ of variables with $\Sigma \cap \mathcal{X} = \emptyset$. All elements in $\Sigma \cup \mathcal{X}$ are totally ordered, where all elements in Σ must be smaller than those in $\mathcal{X}$. In this paper, we call a sequence of symbols from $\Sigma \cup \mathcal{X}$ a string. Let us define $\lg^{(1)} u = \lg u$, and $\lg^{(i+1)} u = \lg(\lg^{(i)} u)$ for $i \ge 1$. The iterated logarithm of u is denoted by $\lg^* u$, and defined as the number of times the logarithm function must be applied before the result is less than or equal to 1, i.e., $\lg^* u = \min\{i : \lg^{(i)} u \le 1\}$.
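The iterated logarithm grows extremely slowly, which is why bounds involving it are described as almost linear. A minimal Python sketch of the definition above follows; the function name `lg_star` is chosen here for illustration and is not taken from the paper.

```python
import math


def lg_star(u: float) -> int:
    """Count how many times lg (base 2) must be applied to u > 1
    before the result drops to 1 or below."""
    count = 0
    while u > 1:
        u = math.log2(u)
        count += 1
    return count


# 65536 -> 16 -> 4 -> 2 -> 1 takes four applications of lg.
print(lg_star(65536))
```

Even for inputs as large as $2^{65536}$ the value is only 5, so a factor of $\lg^* N$ behaves like a small constant for any realistic text length.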
2.2. Straight-Line Program (SLP)
A context-free grammar (CFG) in Chomsky normal form is a quadruple $G = (\Sigma, V, D, X_s)$, where V is a finite subset of $\mathcal{X}$, D is a finite subset of $V \times (V \cup \Sigma)^2$, and $X_s \in V$ is the start symbol. An element in D is called a production rule. Denote $X_{l(k)}$ (resp. $X_{r(k)}$) as a left symbol (resp. right symbol) on the right hand side for a production rule with a variable $X_k$ on the left hand side, i.e., $X_k \to X_{l(k)} X_{r(k)}$. $val(X_i)$ for variable $X_i \in V$ denotes the string derived from $X_i$. A grammar compression of S is a CFG G that derives S and only S. The size of a CFG is the number of variables, i.e., $|V|$, and let $n = |V|$.
The parse tree of G is a rooted ordered binary tree such that (i) internal nodes are labeled by variables in V and (ii) leaves are labeled by symbols in Σ, i.e., the label sequence in leaves is equal to the input string. In a parse tree, any internal node Z corresponds to a production rule $Z \to XY$, and has the left child with label X and the right child with label Y.
Straight-line program (SLP) [18] is defined as a grammar compression over $\Sigma \cup V$, and its production rules are in the form of $X_k \to X_i X_j$ where $X_i, X_j \in \Sigma \cup V$ and $1 \le i, j < k \le n + \sigma$.
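To make the notion of an SLP concrete, the following Python sketch expands a variable into the string it derives. The dictionary representation, the variable names `X1` to `X3`, and the function name `expand` are illustrative assumptions, not the encoding used by siEDM.

```python
def expand(symbol, rules):
    """Return the string derived from `symbol`.

    `rules` maps each variable to its (left, right) pair; any symbol
    that is not a key of `rules` is treated as a terminal character.
    An explicit stack is used so deep grammars do not hit the
    recursion limit.
    """
    out = []
    stack = [symbol]
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)  # pushed first, so it is expanded last
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)


# Hypothetical SLP: every rule only refers to earlier symbols.
rules = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "a")}
print(expand("X3", rules))
```

Because each rule refers only to terminals or earlier variables, the expansion always terminates, and the grammar derives exactly one string.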
2.3. Rank/Select Dictionaries
A rank/select dictionary for a bit string B [19] supports the following queries: $rank_c(B, i)$ returns the number of occurrences of $c \in \{0, 1\}$ in $B[1, i]$; $select_c(B, i)$ returns the position of the i-th occurrence of $c \in \{0, 1\}$ in B; $access(B, i)$ returns the i-th bit in B. Data structures with only $|B| + o(|B|)$ bits of storage that achieve $O(1)$ time rank and select queries [20] have been presented.
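GMR [21] is a rank/select dictionary for large alphabets and supports rank/select/access queries for strings over $\Sigma \cup V$. For a string of length m over an alphabet of size s, GMR uses $m \lg s + o(m \lg s)$ bits while computing both rank and access queries in $O(\lg \lg s)$ time and also computing select queries in $O(1)$ time.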
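The semantics of the three queries can be illustrated with a naive Python sketch. It scans the sequence in linear time and uses 0-based positions, so it only shows what the queries return; it is not a succinct data structure and does not reflect the libcds or GMR implementations.

```python
def rank(seq, c, i):
    """Number of occurrences of symbol c in seq[0..i] (inclusive)."""
    return seq[: i + 1].count(c)


def select(seq, c, i):
    """0-based position of the i-th (i >= 1) occurrence of c, or -1."""
    seen = 0
    for pos, sym in enumerate(seq):
        if sym == c:
            seen += 1
            if seen == i:
                return pos
    return -1


def access(seq, i):
    """Symbol stored at 0-based position i."""
    return seq[i]


bits = "0110101"
print(rank(bits, "1", 3), select(bits, "1", 3), access(bits, 0))
```

The same three functions work unchanged on sequences over a larger alphabet, which is the setting that GMR addresses.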
3. Problem
We first review the notion of EDM. The distance between two strings S and Q is the minimum number of edit operations to transform S into Q. The edit operations are defined as follows:
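- Insertion: A character a is inserted at position i in S, which generates $S[1, i-1]\, a\, S[i, |S|]$,
- Deletion: A character is deleted at position i in S, which generates $S[1, i-1]\, S[i+1, |S|]$,
- Replacement: A character is replaced by a at position i in S, which generates $S[1, i-1]\, a\, S[i+1, |S|]$,
- Substring move: A substring $S[i, j]$ is deleted from the position i, and inserted at the position k in S, which generates $S[1, i-1]\, S[j+1, k-1]\, S[i, j]\, S[k, |S|]$ for $i \le j < k$, and $S[1, k-1]\, S[i, j]\, S[k, i-1]\, S[j+1, |S|]$ for $k \le i \le j$.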
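The four operations can be written down directly on Python strings. The sketch below assumes 1-based positions, and assumes that a moved block is reinserted immediately before the character that originally stood at position k; the function names are illustrative.

```python
def insert(s, i, a):
    """Insert character a so that it becomes position i (1-based)."""
    return s[: i - 1] + a + s[i - 1:]


def delete(s, i):
    """Delete the character at position i (1-based)."""
    return s[: i - 1] + s[i:]


def replace(s, i, a):
    """Replace the character at position i (1-based) by a."""
    return s[: i - 1] + a + s[i:]


def move(s, i, j, k):
    """Move the block s[i..j] in front of the character originally at k."""
    block = s[i - 1: j]
    if j < k:   # block moves to the right
        return s[: i - 1] + s[j: k - 1] + block + s[k - 1:]
    if k <= i:  # block moves to the left
        return s[: k - 1] + block + s[k - 1: i - 1] + s[j:]
    raise ValueError("k must lie outside the moved block")


print(move("abcdef", 2, 3, 6))  # moves "bc" in front of "f"
```

A single substring move costs one operation regardless of the block length, which is what distinguishes EDM from the ordinary edit distance.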
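Problem 1
(Query search for EDM). For a string $S \in \Sigma^*$, a query $Q \in \Sigma^*$ and a distance threshold $\tau \ge 0$, find all $i \in [1, |S|]$ satisfying $d(S[i, i + |Q| - 1], Q) \le \tau$, where $d(\cdot, \cdot)$ denotes the EDM between two strings.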
Shapira and Storer [9] proved the NP-completeness of EDM and proposed a polynomial-time algorithm for a restricted EDM. Cormode and Muthukrishnan [8] presented an approximation algorithm named ESP for computing EDM. We present a string index and search algorithm by leveraging the idea behind ESP for solving Problem 1. Our method consists of two parts: (i) an efficient index structure for a given string S and (ii) a fast algorithm for searching query Q on the index structure of S with respect to EDM. Although our method is also an approximation algorithm, it guarantees upper and lower bounds for the exact EDM. We first review ESP in the next section and then discuss the two parts.
4. Edit Sensitive Parsing (ESP) for Building SLPs
4.1. ESP Revisit
We review the edit sensitive parsing algorithm for building SLPs [10]. This algorithm, referred to as ESP-comp, computes an SLP from an input string S. The tasks of ESP-comp are to (i) partition S into $s_1 s_2 \cdots s_\ell$ such that $2 \le |s_i| \le 3$ for each $1 \le i \le \ell$, (ii) if $|s_i| = 2$, generate the production rule $X \to s_i$ and replace $s_i$ by X (this subtree is referred to as a 2-tree), and if $|s_i| = 3$, generate the production rules $Y \to AX$ and $X \to BC$ for $s_i = ABC$, and replace $s_i$ by Y (referred to as a 2-2-tree), (iii) iterate this process until S becomes a symbol. Finally, the ESP-comp builds an SLP representing the string S.
We focus on how to determine the partition . A string of the form with and is called a repetition. First, S is uniquely partitioned into the form by its maximal repetitions, where each is a maximal repetition of a symbol in , and each contains no repetition. Then, each is called type1, each of length at least is type2, and any remaining is type3. If , this symbol is attached to or with preference when both cases are possible. Thus, if , each and is longer than or equal to two. One of the substrings is referred to as .
Next, ESP-comp parses each depending on the type. For type1 and type3 substrings, the algorithm performs the left aligned parsing as follows. If is even, the algorithm builds 2-tree from for each ; otherwise, the algorithm builds a 2-tree from for each and builds a 2-2-tree from the last trigram . For type2 , the algorithm further partitions it into short substrings of length two or three by alphabet reduction [8].
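The left aligned parsing described above can be sketched in a few lines of Python: bigrams are taken from the left, and an odd-length input ends with one trigram. The function name is an assumption made for this illustration.

```python
def left_aligned_parse(s):
    """Split s (length >= 2) into blocks of length two from the left;
    if the length is odd, the last block is a trigram."""
    if len(s) < 2:
        raise ValueError("a block needs at least two symbols")
    blocks = []
    i = 0
    while len(s) - i > 3:
        blocks.append(s[i: i + 2])
        i += 2
    blocks.append(s[i:])  # remaining two or three symbols
    return blocks


print(left_aligned_parse("aaaaa"))
```

Each block of length two would become a 2-tree and the final trigram, if any, a 2-2-tree.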
Alphabet reduction: Given a type2 string S, consider $S[i]$ and $S[i-1]$ as binary integers. Let p be the position of the least significant bit in which $S[i] \ne S[i-1]$, and let $bit(p, S[i])$ be the bit of $S[i]$ at the p-th position. Then, $L[i] = 2p + bit(p, S[i])$ is defined for any $i \ge 2$. Because S is repetition-free (i.e., type2), the label string L is also type2. If the number of different symbols in S is n (denoted by $[S] = n$), then $[L] \le 2 \lg n$. For the L, the next label string is iteratively computed until the final $L^*$ satisfying $[L^*] \le 6$ is obtained. $S[i]$ is called the landmark if $L^*[i] > \max\{L^*[i-1], L^*[i+1]\}$.
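One round of alphabet reduction can be sketched as follows, assuming that symbols are given as non-negative integers, that bit positions are counted from 0 at the least significant bit, and that the label of a symbol is twice the position of the lowest differing bit plus the symbol's own bit at that position, as in Cormode and Muthukrishnan's construction. The function name is illustrative.

```python
def alphabet_reduction_round(symbols):
    """Return one label per symbol except the first.

    `symbols` is a list of non-negative integers in which no two
    adjacent entries are equal (a repetition-free string).
    """
    labels = []
    for prev, cur in zip(symbols, symbols[1:]):
        if prev == cur:
            raise ValueError("input must be repetition-free")
        diff = prev ^ cur
        p = (diff & -diff).bit_length() - 1  # lowest differing bit
        bit = (cur >> p) & 1                 # bit of cur at position p
        labels.append(2 * p + bit)
    return labels


labels = alphabet_reduction_round([3, 1, 4, 1, 5, 9, 2, 6])
print(labels)
```

Adjacent labels are always distinct, so the label string is again repetition-free and the round can be applied repeatedly on an ever smaller alphabet.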
The alphabet reduction transforms S into such that any substring of of length at least contains at least one landmark because is also type2. Using this characteristic, the algorithm ESP-comp determines the bigrams to be replaced for any landmark , where any two landmarks are not adjacent, and then the replacement is deterministic. After replacing all landmarks, any remaining maximal substring s is replaced by the left aligned parsing, where if = 1, it is attached to its left or right block.
We give an example of the edit sensitive parsing of an input string in Figure 1-(i) and (ii). The input string S is divided into three maximal substrings depending on the types. The label string L is computed for the type2 string. Originally, L is iteratively computed until . This case shows that a single iteration satisfies this condition. After the alphabet reduction, three landmarks are found, and then each is parsed. Any other remaining substrings including type1 and type3 are parsed by the left aligned parsing shown in Figure 1-(ii). In this example, a dashed node denotes that it is an intermediate node in a 2-2-tree. Originally, an ESP tree is a ternary tree in which each node has at most three children. The intermediate node is introduced to represent ESP tree as a binary tree.
Figure 1.
The edit sensitive parsing. In (i), an underlined means a landmark, and . In (i) and (ii), a dashed node is corresponding to the intermediate node in a 2-2-tree.
As shown in [8], the alphabet reduction approximates the minimum CFG as follows. Let S be a type2 string containing a substring α at least twice. When α is sufficiently long (e.g., ), there is a partition such that and each landmark of within α is decided by only . This means the long prefix of α is replaced by the same variables, independent of the occurrence of α.
ESP-comp generates a new shorter string of length from to , and it parses iteratively. Given a string S, ESP builds the ESP-tree of height in time and in space. The approximation ratio of the smallest grammar by ESP is [10].
4.2. Approximate Computations of EDM from ESP-Trees
ESP-trees enable us to approximately compute EDM for two strings. After constructing ESP-trees for two strings, their characteristic vectors are defined as follows. Let $T_S$ be the ESP-tree for string S. We define an integer vector $F(S)$ to be the characteristic vector if $F(S)(X)$ represents the number of times the variable X appears in $T_S$ as the root of a 2-tree. For a string S, $T_S$ and its characteristic vector are illustrated in Figure 2. The EDM between two strings S and Q can be approximated by the $L_1$-distance between two characteristic vectors $F(S)$ and $F(Q)$ as follows:

$$L_1(F(S), F(Q)) = \sum_{X} \left| F(S)(X) - F(Q)(X) \right|,$$

where the sum ranges over all symbols X appearing in $T_S$ or $T_Q$.
Figure 2.
Illustration of edit sensitive parsing (ESP)-tree and characteristic vector.
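Cormode and Muthukrishnan showed the upper and lower bounds on the $L_1$-distance between characteristic vectors for the exact EDM.
Theorem 1
(Upper and lower bounds of the approximated EDM [8]). For strings S and Q with $N = \max(|S|, |Q|)$, $d(S, Q) \le 2 L_1(F(S), F(Q)) = O(\lg N \lg^* N)\, d(S, Q)$, where $d(S, Q)$ denotes the EDM between S and Q.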
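The distance between characteristic vectors is an ordinary $L_1$-distance over sparse count vectors. A minimal Python sketch, assuming each characteristic vector is held as a `collections.Counter` keyed by variable name (the names `X1` to `X3` are hypothetical):

```python
from collections import Counter


def l1_distance(f_s, f_q):
    """Sum of absolute count differences over all symbols that occur
    in either vector; a Counter returns 0 for a missing key."""
    keys = set(f_s) | set(f_q)
    return sum(abs(f_s[k] - f_q[k]) for k in keys)


f_s = Counter({"X1": 2, "X2": 1})
f_q = Counter({"X1": 1, "X3": 1})
print(l1_distance(f_s, f_q))
```

Only symbols that occur in at least one of the two trees contribute, so the cost depends on the number of distinct variables rather than on the text length.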
5. Index Structure for ESP-Trees
5.1. Efficient Encoding Scheme
siEDM encodes an ESP-tree built from a string for fast query searches. This encoding scheme sorts the production rules in an ESP-tree such that the left symbols on the right hand side of the production rules are in monotonically increasing order, which enables encoding of these production rules efficiently and supporting fast operations for ESP-trees. The encoding scheme is performed from the first and second levels to the top level (i.e., root) in an ESP-tree.
First, the set of production rules at the first and second levels in the ESP-tree is sorted in increasing order of the left symbols on the right hand of the production rules, i.e., in the form of , which results in a sorted sequence of these production rules. The variables in the left hand side in the sorted production rules are renamed in the sorted order, generating a set of new production rules that is assigned to the corresponding nodes in the ESP-tree. The same scheme is applied to the next level of the ESP-tree, which iterates until it reaches the root node.
Figure 3 shows an example of the encoding scheme for the ESP-tree built from an input string. At the first and second levels in the ESP-tree, the set of production rules is sorted in the lexicographic order of the left symbols on the right hand sides of the production rules. The variables on the left hand side of the sorted production rules are then renamed in the sorted order, and the resulting production rules are assigned to the corresponding nodes in the ESP-tree. This scheme is repeated until it reaches level 4.
Figure 3.
Illustration of encoding scheme.
Using the above encoding scheme, we obtain a monotonically increasing sequence of left symbols on the right hand side of the production rules. This increasing sequence can be efficiently encoded into a bit string by using the gap-encoding and the unary coding: the gap-encoding represents each element by its difference from the preceding element, and the unary coding writes each gap g as g 0s followed by a single 1. Generally, for a sequence $(x_1, x_2, \ldots, x_n)$, its unary code U represents $x_i$ by $rank_0(U, select_1(U, i))$. Because the number of 0s and the number of 1s is $x_n$ and n, respectively, the size of U is $x_n + n$ bits. The bit string is indexed by the rank/select dictionary.
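The gap-encoding followed by unary coding can be sketched in Python as below. The sample sequence is a made-up illustration, not the example from the paper, and the decoder scans the bit string linearly where the index would use rank and select.

```python
def unary_gap_encode(seq):
    """Encode a non-decreasing sequence of non-negative integers:
    each gap g to the previous element becomes g zeros and a one."""
    bits = []
    prev = 0
    for x in seq:
        if x < prev:
            raise ValueError("sequence must be non-decreasing")
        bits.append("0" * (x - prev) + "1")
        prev = x
    return "".join(bits)


def decode_at(bits, i):
    """Recover the i-th element (1-based): the number of 0s that
    precede the i-th 1, i.e. rank_0 at the position select_1(i)."""
    ones = zeros = 0
    for b in bits:
        if b == "1":
            ones += 1
            if ones == i:
                return zeros
        else:
            zeros += 1
    raise IndexError("fewer than i elements are encoded")


seq = [1, 1, 2, 3, 5, 5, 5, 6]  # hypothetical sorted left symbols
code = unary_gap_encode(seq)
print(code, len(code))
```

The code length equals the largest element plus the number of elements, so repeated left symbols cost a single bit each.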
Let be the sequence consisting of the right symbols on the right hand side of the production rules, i.e., in the form of . is represented using bits. is indexed by GMR [21].
The space for storing and is bits in total. and enable us to simulate fast queries on encoded ESP-trees, which is presented in the next subsection.
5.2. Query Processing on Tree
The encoded ESP-trees support four tree operations, , , and , which are used in our search algorithm. returns the left child of and can be implemented on bit string in time as and . returns the right child of and can be implemented on array in time as .
and return sets of parents of as left and right children, respectively, i.e., and .
Because is a monotonic sequence, any appears consecutively in . Using the unary encoding of , is computed by in time. can be computed by repeatedly applying select operations for on until no more appear, i.e., for . Thus, for can be computed in time.
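The four tree operations can be imitated on plain Python lists. The sketch below assumes symbols are integers, with terminals numbered below a hypothetical alphabet size `SIGMA` and the k-th sorted rule defining variable `SIGMA + k`; the list names and the tiny grammar are illustrative, and ordinary lists stand in for the succinct bit string and GMR array.

```python
from bisect import bisect_left, bisect_right

SIGMA = 2  # hypothetical alphabet size: terminals are 0 and 1

# Rule k defines variable SIGMA + k -> left_syms[k] right_syms[k].
left_syms = [0, 1, 2, 2]   # non-decreasing after sorting and renaming
right_syms = [1, 0, 3, 4]


def left_child(var):
    return left_syms[var - SIGMA]


def right_child(var):
    return right_syms[var - SIGMA]


def left_parents(sym):
    """Variables whose left child is sym; equal entries are
    consecutive because left_syms is sorted."""
    lo = bisect_left(left_syms, sym)
    hi = bisect_right(left_syms, sym)
    return [SIGMA + k for k in range(lo, hi)]


def right_parents(sym):
    """Variables whose right child is sym; found by scanning, which
    plays the role of repeated select queries on the array."""
    return [SIGMA + k for k, r in enumerate(right_syms) if r == sym]


print(left_parents(2), right_parents(3))
```

Sorting by left symbol is what makes the left-parent query a contiguous range, whereas right parents can be scattered and must be enumerated one occurrence at a time.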
5.3. Other Data Structures
As a supplemental data structure, siEDM computes the node characteristic vector, denoted by , for each variable : the characteristic vector consisting of the frequency of any variable derived from . The space for storing all node characteristic vectors of n variables is at most bits. Figure 3-(V) shows an example of the node characteristic vectors for ESP-tree in Figure 3-(III). In addition, let be a set of and variables appearing in all the descendant nodes under , i.e., . Practically, is represented by a sequence of a pair of and . Additionally, because ( represents adding 1 to dimension ), the characteristic vectors can be stored per level 2 of the ESP-tree. The data structure is represented by a bit array indexed by a rank/select dictionary and the characteristic vectors reduced per level 2 of ESP-tree. is set to 1 for i-th bit if is stored, otherwise it is 0. Then, can be computed by -th characteristic vector if the i-th bit of is 1; otherwise, .
Another data structure that siEDM uses is a non-negative integer vector named length vector, each dimension of which is the length of the substring derived from the corresponding variable (See Figure 3-(VI)). The space for storing length vectors of n variables is bits.
From the above argument, the space of the siEDM’s index structure for n variables is bits in total.
6. Search Algorithm
6.1. Baseline Algorithm
Given a , the maximal subtree decomposition of is a sequence of variable in defined recursively as follows. is the variable of the root of the maximal subtree satisfying that is its leftmost leaf and . If , then is the maximal subtree decomposition of . Otherwise, let be already determined and . Then, let be the variable of the root of the maximal subtree satisfying that is its leftmost leaf and . Repeating this process until , the maximal subtree decomposition is determined.
Based on the maximal subtree decomposition, we explain the outline of the baseline algorithm, called online ESP [17], for computing an approximation of EDM between two strings. is constructed beforehand. Given a pattern Q, the online ESP computes , and for each substring of length , it computes the approximate EDM as follows. It computes the maximal subtree decomposition of . Then, the distance is approximated by because ESP-tree is balanced and then . This baseline algorithm is, however, required to compute the characteristic vector of at each position i. Next, we improve the time and space of the online ESP by finding those -grams for each variable X in instead of each position i.
6.2. Improvement
The siEDM approximately solves Problem 1 with the same guarantees presented in Theorem 1. Let such that . There are -grams formed by the string with . Then, the variable is said to stab the -grams. The set of the -grams stabbed by is denoted by . Let be the set of for all appearing in . An important fact is that includes any -gram in S. Using this characteristic, we can reduce the search space .
If a -gram R is in , there exists a maximal subtree decomposition . Then, the -distance of and guarantees the same upper bounds in the original ESP as follows.
Theorem 2.
Let be a -gram on S and be its maximal subtree decomposition in the tree . Then, it holds that
Proof.
By Theorem 1, . On the other hand, for an occurrence of R in S, let be the smallest subtree in containing the occurrence of R, i.e., . For and , let and be the sequences of the level 2 symbols in and , respectively. By the definition of the ESP, it holds that and for some strings satisfying , and this is true for the remaining string β iteratively. Thus, since the trees are balanced. Hence, by the equation
we obtain the approximation ratio. ☐
To further enhance the search efficiency, we present a lower bound of the -distance between characteristic vectors, which can be used for reducing the search space.
Theorem 3
(A lower bound μ). For any , the inequality holds where
Proof.
The distance between and is divided into four classes of terms: (i) both members in and are non-zero, (ii) both members in and are zero, (iii) the members in and are zero and non-zero, (iv) the members in and are non-zero and zero, respectively. Terms consisting of class (iii) and (iv) can be written as , which is a lower bound of the -distance. Thus, . ☐
Theorem 4
(Monotonicity of μ). If a variable derives , the inequality holds.
Proof.
Every entry in is less than or equal to the corresponding entry in . Thus, the inequality holds. ☐
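The pruning argument can be checked numerically. The sketch below does not reproduce the exact definition of μ; it assumes one bound of this kind, namely the total count that a node's characteristic vector places on symbols absent from the query's vector, and verifies on hypothetical vectors that it never exceeds the $L_1$-distance and never decreases when moving from a descendant to its ancestor.

```python
from collections import Counter


def l1_distance(f_x, f_q):
    keys = set(f_x) | set(f_q)
    return sum(abs(f_x[k] - f_q[k]) for k in keys)


def mu(f_x, f_q):
    """Assumed lower bound: counts in f_x on symbols missing from f_q.
    Each such symbol contributes its full count to the L1-distance."""
    return sum(c for sym, c in f_x.items() if f_q[sym] == 0)


f_q = Counter({"A": 2, "B": 1})               # query vector
f_child = Counter({"A": 1, "C": 1})           # descendant node
f_parent = Counter({"A": 2, "C": 1, "D": 3})  # ancestor, entrywise >= child

print(mu(f_child, f_q), mu(f_parent, f_q))
```

Once the bound at a node exceeds the threshold τ, every ancestor also exceeds it, so the search never needs to climb past that node.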
6.3. Candidate Finding
By Theorems 2, 3 and 4, the task of the algorithm is reduced to finding a maximal subtree decomposition within . Given a threshold , for each -gram in , the algorithm finds the candidate: the maximal subtree decomposition satisfying .
For an and an occurrence of some -gram in , the -gram is formed by the expression for a k . The algorithm computes the maximal subtree decompositions covering and covering , and outputs covering the -gram when . We illustrate the computation of candidates satisfying in Figure 4 and show the pseudo-code in Algorithm 1.
Figure 4.
Illustration of candidate finding and -distance computation.
Applying all variables to Algorithm 1 enables us to find the candidates covering all solutions. There are no possibilities for missing any -grams in such that the -distances between their characteristic vectors and are at most τ, i.e., false negatives. The set may include a false positive, i.e., the solution set encodes a -gram such that the -distance between its characteristic vector and is more than τ. However, false positives are efficiently removed by computing the -distance as a post-processing.
Theorem 5.
The computation time of FindCandidates is .
Proof.
Because the height of the ESP-tree is , for each variable X, the number of visited nodes is . The computation time of and is , and the time of FindLeft and FindRight is . Thus, for n iterations of the functions, the total computation time is . ☐
Algorithm 1 to output the candidate for a given variable, a query pattern Q and a distance threshold τ.
6.4. Computing Positions
The algorithm also computes all the positions of , denoted by . Starting from , the algorithm goes up to the root in the ESP-tree built from S. p is initialized to 0 at . If through the pass from to the root is the parent with the right child on the pass, non-negative integer is added to p. Otherwise, nothing is added to p. When the algorithm reaches the root, p represents a start position of on S, i.e., . To compute the set , the algorithm starts from and goes up to the root for each parent in and , which return sets of parents for . Algorithm 2 shows the pseudo-code.
Algorithm 2 to compute the set P of all occurrences of a given variable on S.
Theorem 6.
The computation time of is , where is the number of occurrences of X in .
Proof.
Using the index structures of and , we can traverse the path from any node with label to the root of counting the position. The length of the path is . ☐
Theorem 7.
The search time is using the data structure of size bits.
Proof.
The time for computing and is . The time for finding candidates and computing is by Theorem 5. The time for computing positions is by Theorem 6. Thus, the total time for a query search is . The size of the data structure is derived by the results in Section 5. ☐
In Theorem 7, n and are incomparable because is possible for a highly repetitive string.
7. Experiments
We evaluated the performance of siEDM on one core of a quad-core Intel Xeon Processor E5540 (2.53 GHz) machine with 144 GB memory. We implemented siEDM using the rank/select dictionary and GMR in libcds (https://github.com/fclaude/libcds). We used two standard benchmark datasets, einstein and cere, from the repetitive text collections in the Pizza & Chili corpus (http://pizzachili.dcc.uchile.cl/repcorpus.html), which are detailed in Table 1. As a comparison method, we used the online pattern matching for EDM called online ESP (baseline) [17], which approximates EDM between a query Q and the substring of length $|Q|$ at each position of an input text. We randomly selected substrings of the input text as the query pattern Q for each query length and examined the performance.
Table 1.
Summary of datasets.
Table 2 shows the memory consumption in the search of the siEDM and baseline. The memory consumption of siEDM was larger than the baseline for both texts because the baseline does not have characteristic vectors of each node and length vector.
Table 2.
Comparison of the memory consumption for the query search.
Table 3 shows the size for each component of the index structure and the time for building the index structure on einstein and cere datasets. Most of the size of the index structure was consumed by the characteristic vector F. The index size of cere was much larger than that of einstein. The index sizes of einstein and cere were approximately 16 megabytes and 256 megabytes, respectively, because the number of variables generated from cere was much larger than that generated from einstein. The number of variables generated from einstein was and the number of variables generated from cere was . The construction times of the index structures were 118 s for einstein and 472 s for cere. The results for constructing the index structures demonstrate the applicability of siEDM to moderately large, repetitive texts.
Table 3.
Comparison of the index size and construction time.
Figure 5 shows the total search time (sec.) of siEDM and the baseline for einstein and cere in distance thresholds τ from 10 to 60. In addition, this result does not contain the case because siEDM found no candidate under the condition. The query length is one of . Because the search time of baseline is linear in , we show only the fastest case: . The search time of siEDM was faster than baseline in most cases.
Figure 5.
Comparison of the search time for einstein (left) and cere (right).
Figure 6 shows the detailed search time in seconds. CF is the time for finding candidates of Q in the ESP-tree of the input text, DIST is the time for computing approximated distance by characteristic vectors, and PC is the time for determining the positions of all $|Q|$-grams within the threshold τ.
Figure 6.
Details of search time for different $|Q|$ and τ: time for candidate findings, CF, time for $L_1$-distance computations, DIST, and time for position computations, PC. (a) and (b) correspond to CF, (c) and (d) correspond to DIST, and (e) and (f) correspond to PC of einstein and cere, respectively.
Figure 7 shows the number of nodes visited by the algorithm, #TN, the number of candidate $|Q|$-grams computed by FindCandidates, #CAND, the number of true positives among candidate $|Q|$-grams, #TP, and the number of occurrences, #OCC. The most time-consuming task is the candidate finding.
Figure 7.
Statistical information of the query search: the number of traversed nodes, #TN, the number of candidate $|Q|$-grams, #CAND, the number of true positives, #TP, the number of occurrences, #OCC. (a) and (b) correspond to #TN, (c) and (d) correspond to #CAND, (e) and (f) correspond to #TP, and (g) and (h) correspond to #OCC of einstein and cere, respectively.
By the monotonicity of characteristic vectors, pruning the search space for small distance thresholds and long query length is more efficient. Thus, it is expected that siEDM is faster for smaller distance thresholds and longer query lengths and the experimental results support this. The search time on cere is much slower than that on einstein because the number of generated production rules from cere is much larger than that from einstein, and a large number of iterations of FindCandidates is executed. In addition, the comparison of #CAND and #TP validates the efficiency of siEDM for candidate finding with the proposed pruning method.
In Figure 7, the algorithm failed to find a candidate. Such a phenomenon often appears when the required threshold τ is too small, because the ESP-tree is not necessarily identical to even if . Generally, the parsing of is affected by a suffix of and a prefix of of length at most .
As shown in Table 3 and Figure 5, the search time of siEDM depends on the size of encoded ESP-tree for the input. Finally, we confirm this feature by an additional experiment for other repetitive texts. Table 4, Table 5 and Table 6 are the description of several datasets from the pizza & chili corpus. Figure 8 shows the search time of siEDM and baseline. This result supports our claim that siEDM is suitable for computing EDM of repetitive texts.
Table 4.
Summary of additional datasets.
Table 5.
Comparison of the memory consumption for the query search.
Table 6.
Comparison of the index size and construction time for additional datasets.
Figure 8.
Search time (sec.) for repetitive texts: E. coli (left) and influenza (right).
8. Conclusions
We have proposed siEDM, an efficient string index for approximate searching based on EDM. Experimental results demonstrated the applicability of siEDM to real-world repetitive text collections as well as to searches with longer patterns. Future work will make the search algorithm in siEDM faster, which would be beneficial for users performing query searches for EDM.
Acknowledgments
This work was supported by Grant-in-Aid for JSPS Fellows, JSPS KAKENHI (24700140,26280088) and the JST PRESTO program. The authors thank the anonymous reviewers for their valuable feedback. The authors also thank T. Ohnishi and S. Fukunaga for their assistance.
Author Contributions
All the authors contributed equally to this work.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Takabatake, Y.; Tabei, Y.; Sakamoto, H. Improved ESP-index: A Practical Self-Index for Highly Repetitive Texts. In Proceedings of the 13th International Symposium on Experimental Algorithms (SEA), Copenhagen, Denmark, 29 June–1 July 2014; pp. 338–350.
- Claude, F.; Navarro, G. Self-indexed grammar-based compression. Fundam. Inform. 2011, 111, 313–337.
- Gagie, T.; Gawrychowski, P.; Kärkkäinen, J.; Nekrich, Y.; Puglisi, S.J. LZ77-Based Self-Indexing with Faster Pattern Matching. In Proceedings of the 11th Latin American Theoretical Informatics Symposium (LATIN), Montevideo, Uruguay, 31 March–4 April 2014; pp. 731–742.
- Gagie, T.; Puglisi, S.J. Searching and Indexing Genomic Databases via Kernelization. Front. Bioeng. Biotechnol. 2015, 3, 12.
- Durbin, R.; Eddy, S.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998.
- Crochemore, M.; Rytter, W. Text Algorithms; Oxford University Press: Oxford, UK, 1994.
- Levenshtein, V.I. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
- Cormode, G.; Muthukrishnan, S. The String Edit Distance Matching Problem with Moves. ACM Trans. Algor. 2007, 3, 1–19.
- Shapira, D.; Storer, J.A. Edit distance with move operations. J. Discret. Algorithms 2007, 5, 380–392.
- Sakamoto, H.; Maruyama, S.; Kida, T.; Shimozono, S. A Space-Saving Approximation Algorithm for Grammar-Based Compression. IEICE Trans. Inf. Syst. 2009, 92-D, 158–165.
- Maruyama, S.; Sakamoto, H.; Takeda, M. An Online Algorithm for Lightweight Grammar-Based Compression. Algorithms 2012, 5, 213–235.
- Maruyama, S.; Tabei, Y.; Sakamoto, H.; Sadakane, K. Fully-online grammar compression. In Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE), Jerusalem, Israel, 7–9 October 2013; pp. 218–229.
- Maruyama, S.; Tabei, Y. Fully-online grammar compression in constant space. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 26–28 March 2014; pp. 218–229.
- Maruyama, S.; Nakahara, M.; Kishiue, N.; Sakamoto, H. ESP-Index: A Compressed Index Based on Edit-Sensitive Parsing. J. Discret. Algorithms 2013, 18, 100–112.
- Takabatake, Y.; Tabei, Y.; Sakamoto, H. Online Self-Indexed Grammar Compression. In Proceedings of the 22nd International Symposium on String Processing and Information Retrieval (SPIRE), London, UK, 1–4 September 2015; pp. 258–269.
- Nakahara, M.; Maruyama, S.; Kuboyama, T.; Sakamoto, H. Scalable Detection of Frequent Substrings by Grammar-Based Compression. IEICE Trans. Inf. Syst. 2013, 96-D, 457–464.
- Takabatake, Y.; Tabei, Y.; Sakamoto, H. Online Pattern Matching for String Edit Distance with Moves. In Proceedings of the 21st International Symposium on String Processing and Information Retrieval (SPIRE), Ouro Preto, Brazil, 20–22 October 2014; pp. 203–214.
- Karpinski, M.; Rytter, W.; Shinohara, A. An efficient pattern-matching algorithm for strings with short descriptions. Nord. J. Comput. 1997, 4, 172–186.
- Jacobson, G. Space-Efficient Static Trees and Graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS), Research Triangle Park, NC, USA, 30 October–1 November 1989; pp. 549–554.
- Raman, R.; Raman, V.; Rao, S.S. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor. 2007, 3.
- Golynski, A.; Munro, J.I.; Rao, S.S. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Miami, FL, USA, 22–26 January 2006; pp. 368–373.
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).