# Approximate String Matching with Compressed Indexes

## Abstract

## 1. Introduction and Related Work

## 2. Basics and Contribution Overview

#### 2.1. Basic Concepts

We denote by T the **text** string, by P the **pattern** string, and by O an occurrence of P in T; by Σ the strings' **alphabet**, of size σ; by $T[i]$ the symbol (or character) at position $(i \bmod u)$ of T; by $S.S'$ the **concatenation** of strings; by $S = S[..i-1].S[i..j].S[j+1..]$ respectively a **prefix**, a **substring**, and a **suffix** of a string S; by $S \sqsubseteq S'$ that S is a substring of $S'$; and by ε the empty string.

A **trie** is a character-labeled tree where no two children of a node have the same label. The concatenation of the labels from the root to a node v is called the **path-label** of v, and is also denoted v. A **compact trie** is obtained by collapsing the unary paths in a trie into single edges, labeled with the concatenation of the collapsed symbols. The **suffix tree** [10, 42] of T is a compact trie such that each suffix of $T\$$ is the path-label of a leaf and vice versa, where $\$ \notin \Sigma$ is a terminator symbol. String labels in a suffix tree are substrings of T, thus they can be represented by a starting position in T plus a length. We will assume that u is the length of $T\$$. The **suffix array** [11] $SA[0,u-1]$ stores the suffix indexes of the leaves in lexicographical order. The suffix tree nodes can be identified with suffix array intervals: each node v corresponds to the range of leaves that descend from v. For a detailed explanation see, e.g., Gusfield's book [43]. Figure 1 shows an example of a suffix tree and a suffix array.
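The correspondence between suffix tree nodes and suffix array intervals can be illustrated with a minimal Python sketch. It is not the compressed representation used in the paper: the construction is the naive sort of all suffixes, the helper names `suffix_array` and `sa_interval` are hypothetical, and it assumes that `$` sorts before every letter, which happens to hold for Python's character order.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, label):
    """Inclusive SA range (lo, hi) of the suffixes that start with `label`, or None.

    Because SA is sorted, these suffixes are contiguous; the range plays the
    role of the suffix tree node whose path-label starts with `label`.
    """
    ranks = [r for r, i in enumerate(sa) if text.startswith(label, i)]
    return (ranks[0], ranks[-1]) if ranks else None


T = "cbdbddcbababa$"            # the text of Figure 1, terminator included
SA = suffix_array(T)
lo, hi = sa_interval(T, SA, "ba")
print(SA[lo:hi + 1])            # start positions of the suffixes below "ba"
```

A linear scan is used to find the interval only to keep the sketch short; a real index would locate both ends by binary search or backward search.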

**Ziv-Lempel** compression [23, 44, 45, 46] is based on cutting a text T into **phrases** of varying length, so that, essentially, each phrase already appears somewhere earlier in the text. Each phrase is usually encoded with a fixed number of bits, so more compressible texts generate longer phrases. There are many variants of this family, which differ mainly in the rules used to form the phrases from the previous text.
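As a concrete instance of such a phrase-forming rule, the sketch below implements the LZ78 parsing [23], in which each new phrase is the longest previously formed phrase extended by one symbol. It only illustrates the parsing, not any index built on top of it; the function name `lz78_parse` and the sample text are hypothetical choices.

```python
def lz78_parse(text):
    """Cut `text` into LZ78 phrases: longest earlier phrase plus one new symbol."""
    seen = set()      # phrases formed so far
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:     # first time this string appears as a phrase
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                     # trailing piece that repeats an earlier phrase
        phrases.append(current)
    return phrases


print(lz78_parse("abababcddbdbc"))
```

The sample text was picked so that its phrases are exactly the set $\{a,b,ab,abc,d,db,dbc\}$. Note that the set of LZ78 phrases is prefix-closed, which is what allows them to be stored in a trie.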

#### 2.2. Approximate String Matching

**Figure 1.** On top, suffix tree for $cbdbddcbababa\$$. Some suffix links are shown (dashed arrows). On the bottom, suffix array of T.

**Dynamic programming.**

**Figure 2.** D table computation for strings $abccba$ and $abbbab$. Left: schematic representation of the alignment. Middle: computation of D. The numbers in bold correspond to the alignment shown on the left, and to the shortest path. Right: computation with increasing error bound.

**edit graph**: Each matrix position $(i,j)$ is a node, which is the source of three arrows towards $(i,j+1)$ (of weight 1), $(i+1,j)$ (of weight 1), and $(i+1,j+1)$ (of weight $\delta_{A[i]=B[j]}$). Hence $ed(A,B)$ is the weight of the shortest path from $(0,0)$ to $(m,m')$.
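The shortest-path view of the edit graph translates directly into the classical dynamic programming recurrence. The following sketch assumes the standard unit-cost edit distance, where the diagonal arrow costs 0 when the two symbols match and 1 otherwise; the function name `edit_distance` is hypothetical, and only two rows of D are kept in memory.

```python
def edit_distance(a, b):
    """Weight of the shortest path from (0, 0) to (len(a), len(b)) in the edit graph."""
    prev = list(range(len(b) + 1))            # row 0: reach (0, j) with j horizontal arrows
    for i, ca in enumerate(a, 1):
        cur = [i]                             # column 0: i vertical arrows
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # vertical arrow, weight 1
                cur[j - 1] + 1,               # horizontal arrow, weight 1
                prev[j - 1] + (ca != cb),     # diagonal arrow, 0 on match, 1 otherwise
            ))
        prev = cur
    return prev[-1]


print(edit_distance("abccba", "abbbab"))      # the strings of Figure 2
```

For the strings of Figure 2 one cheapest path substitutes the first `c`, deletes the second `c`, and inserts a final `b`.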

**Backtracking.**

**Filtration and sampling.**

**Lemma 1**

#### 2.3. Our Contributions

**Reducing to exact matching of pattern pieces.**

**Hybrid indexing on Ziv-Lempel indexes.**

**Hybrid indexing on compressed suffix trees.**

## 3. A Simple Self-Indexing Method

#### 3.1. Using the LZ-index

**Figure 3.** (left) LZ-trie for strings $\{a,b,ab,abc,d,db,dbc\}$. (right) Reverse tree of the LZ78 trie.

#### 3.2. Improving the Basic Solution

#### 3.3. Using the FM-index

## 4. An Improved q-samples Index

#### 4.1. Varying the Error Distribution

**Lemma 2**

**Proof**

**Figure 4.** A schematic edit distance graph between A and B. The dashed lines show the division of A and B into the $A_i$'s and delimit $B'$. The shortest path in the graph is enclosed between two parallel lines.

**Figure 5.** Dynamic programming table D for strings $B=P=axbcxdxexfgxhixjkxlxmnxo$ (vertical) and $A=O=abcdefghijklmno$ (horizontal) with $k=9$. The shortest path in the edit graph is shown in bold; inactive cells are not shown.

**Lemma 3**

**Proof**

#### 4.2. Partial q-sample Matching

#### 4.3. A Hybrid q-samples Index

## 5. Using a Lempel-Ziv Self-Index

#### 5.1. Handling Different Lengths

**Lemma 4**

**Proof**

#### 5.2. A Hybrid Lempel-Ziv Index

**Lemma 5**

1. there is a substring $A'$ of $A_i$ with $|A'| = q$ and $ed(A', B') < s$, or
2. $ed(A_i, B') < k \cdot |A_i|_v$, in which case for any prefix $A'$ of $A_i$ there exists a substring $B''$ of $B'$ such that $ed(A', B'') < k \cdot |A_i|_v - s \lfloor (|A_i| - |A'|)/q \rfloor$.

**Proof**

**Figure 6.** A schematic representation of the two possible cases mentioned in Lemma 5. The boxes represent the A string and the density of the filling represents the density of errors with B.

**(1)** We look for any q-gram contained in a phrase that matches within P with fewer than s errors. We backtrack in the trie of phrases for every $P[y_1..]$, descending in the trie and advancing $y_2$ in $P[y_1, y_2]$ while computing the DP matrix between the current trie node and $P[y_1, y_2]$. We look for all trie nodes at depth q that match some $P[y_1, y_2]$ with fewer than s errors. Since every suffix of a phrase is a phrase in the ILZI, every q-gram within any phrase can be found starting from the root of the trie of phrases. All the phrases Z that descend from each q-gram trie node found must be verified (those are the phrases that start with that q-gram). We must also spot the phrases suffixed by each such Z. Hence we map each phrase Z to the trie of reverse phrases and also verify all the descendants of the reverse trie nodes. This covers case 1 of Lemma 5.
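The backtracking of step (1) can be sketched on a plain dictionary trie. This is a simplified stand-in, not the ILZI: instead of one search per $P[y_1..]$, it uses the standard equivalent of initializing the DP column with zeros, so that a match may start at any position of P. The names `build_trie` and `qgrams_within` are hypothetical, and the verification and reverse-trie mapping that follow in the text are omitted.

```python
def build_trie(phrases):
    """Nested-dict trie; each key is the label of one edge."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
    return root


def qgrams_within(trie, pattern, q, s):
    """Path-labels of depth-q nodes matching some substring of `pattern` with < s errors."""
    found = []

    def descend(node, label, col):
        if len(label) == q:
            found.append(label)               # col already passed the < s test
            return
        for ch, child in sorted(node.items()):
            new = [col[0] + 1]                # empty substring of the pattern
            for j, pc in enumerate(pattern, 1):
                new.append(min(col[j - 1] + (ch != pc), col[j] + 1, new[j - 1] + 1))
            # The column minimum never decreases with depth, so pruning is safe.
            if min(new) < s:
                descend(child, label + ch, new)

    if q > 0:
        descend(trie, "", [0] * (len(pattern) + 1))   # zeros: free start in the pattern
    return found


trie = build_trie(["a", "b", "ab", "abc", "d", "db", "dbc"])
print(qgrams_within(trie, "abd", q=2, s=1))
print(qgrams_within(trie, "abd", q=2, s=2))
```

With the phrase set of Figure 3 and the hypothetical pattern `abd`, only the 2-gram `ab` occurs exactly, while allowing one error also admits `db`.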

**(2)** We look for any phrase $A_i$ matching a portion of P with fewer than $k \cdot |A_i|_v$ errors. This is done over the trie of phrases. Yet, as we go down in the trie (thus considering longer phrases), we can enforce that the number of errors found up to depth d must be less than $k \cdot |A_i|_v - s \lfloor (|A_i| - d)/q \rfloor$. This covers case 2 in Lemma 5, where the equations vary according to the roles described in Section 4.3 (i.e., depending on i):

**(2.1)** $1<i<j$, in which case we are considering a phrase contained inside O that is neither a prefix nor a suffix. The formula $k \cdot |A_i|_v$ (both for the matching condition and the backtracking limit) can be bounded by $(1+\epsilon) \cdot k \cdot \min(|A_i|/(m-k-2v), 1)$, which depends on $|A_i|$. Since $A_i$ may correspond to any trie node that descends from the current one, we determine a priori which $|A_i| \le m-k$ maximizes the backtracking limit. We apply the backtracking for each $P[y_1..]$.

**(2.2)** $i=j$, in which case we are considering a phrase that starts with a suffix of O. Now $k \cdot |A_i|_v$ can be bounded by $(1+\epsilon) \cdot k \cdot \min((d-v)/(m-k-2v), 1)$, yet the limit still depends on $|A_i|$ and must be maximized a priori. This time we are only interested in suffixes of P, that is, we can perform m searches with $y_2 = m$ and different $y_1$. If a node satisfies the condition we must also consider those that descend from it, to get all the phrases that start with the same suffix.

**(2.3)** $i=1$, in which case we are considering a phrase that ends in a prefix of O. This search is like the case $i=j$, with similar formulas. We are only interested in prefixes of P, that is, $y_1 = 0$. As the phrases are suffix-closed, we can conduct a single search for $P[0..]$ from the trie root, finding all phrase suffixes that match each prefix of P. Each such suffix node must be mapped to the reverse trie, and its descendants there must be included. The case $i=j=1$ is different, as it includes the case where O is contained inside a phrase. In this case we do not require the matching trie nodes to be suffixes; prefixes of suffixes are also allowed. That is, we include the descendants of the trie nodes and map each of those nodes to the reverse trie, just as in case 1.
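The a priori maximization mentioned in case (2.1) can be sketched as follows. The sketch plugs the bound $(1+\epsilon) \cdot k \cdot \min(|A_i|/(m-k-2v), 1)$ into the backtracking limit $k \cdot |A_i|_v - s \lfloor (|A_i| - d)/q \rfloor$ and tries every phrase length between the current depth d and $m-k$. The function name `backtrack_limit`, the default `eps=0.0`, and the sample parameter values are assumptions made for illustration only.

```python
def backtrack_limit(d, m, k, v, q, s, eps=0.0):
    """Return (limit, length): the largest backtracking limit at depth d and the
    phrase length d <= length <= m - k that attains it, under the case (2.1) bound."""
    span = m - k - 2 * v
    if span <= 0 or d > m - k:
        raise ValueError("requires m - k - 2v > 0 and d <= m - k")

    def limit(length):
        bound = (1 + eps) * k * min(length / span, 1)   # bound on k * |A_i|_v
        return bound - s * ((length - d) // q)          # minus s per remaining q-gram

    best_len = max(range(d, m - k + 1), key=limit)
    return limit(best_len), best_len


# Hypothetical values: m = 30, k = 6, v = 2, q = 10, s = 3, current depth d = 5.
print(backtrack_limit(5, m=30, k=6, v=2, q=10, s=3))
```

In this example the limit grows with the phrase length until a full q-gram fits between depth d and the end of the phrase, at which point s errors are subtracted; the maximum is therefore reached just before that happens.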

#### 5.3. Homogeneous Lempel-Ziv Phrases

**(2)** of Section 5.2.

**Lemma 6**

**Proof**

**Lemma 7**

**Proof**

**Lemma 8**

**Proof**

## 6. Hierarchical Approximate String Matching

#### 6.1. Bidirectional Compressed Indexes

#### 6.2. Indexed Hierarchical Verification

**Lemma 9**

**Proof**

## 7. Practical Issues and Testing

`http://pizzachili.dcc.uchile.cl`), with 50 MB of English and DNA and 64 MB of proteins. The machine was a Pentium 4, 3.2 GHz, 1 MB L2 cache, 1 GB RAM, running Fedora Core 3, and compiling with `gcc-3.4 -O9`. The pattern strings were sampled randomly from the text and each character was distorted with probability $10\%$. All the patterns had length $m=30$. Every configuration was tested for at least 60 seconds using at least 5 repetitions. Hence the number of repetitions varied between 5 and 130,000. To parametrize the hybrid index we tested all the j values from 1 to $k+1$ and reported the best time. To parametrize we chose $q=\lfloor m/h\rfloor$ and $s=\lfloor k/h\rfloor +1$ for some convenient h, since we can prove that this is the best approach, which our experiments corroborated. To determine the values of h and v we also tested the viable configurations and reported the best results. In our examples, choosing v and h such that $2v$ is slightly smaller than q yielded the best configuration. Figure 9 shows the sensitivity of the ILZI to this parameter. The LZI and DLZI are not parametrized.
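The parameter rule quoted above is simple arithmetic, shown here for the pattern length $m=30$ used in the experiments. The function name `choose_q_s` and the sample values of k and h are hypothetical; the sketch does not model how h and v were tuned.

```python
def choose_q_s(m, k, h):
    """q = floor(m / h) and s = floor(k / h) + 1, for a chosen divisor h >= 1."""
    if h < 1:
        raise ValueError("h must be a positive integer")
    return m // h, k // h + 1


m, k = 30, 6                       # m as in the experiments; k = 6 is an example value
for h in (1, 2, 3, 5):
    q, s = choose_q_s(m, k, h)
    print(f"h={h}: q={q}, s={s}")
```

Larger values of h give shorter q-grams with a smaller error allowance s, so each trie search is cheaper but less selective.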

**Figure 8.** Average user time for finding the occurrences of patterns of size 30 with k errors in DNA, English, and Proteins. The y axis units are in seconds.

**Figure 9.** Average user time, in seconds, that the ILZI takes to find occurrences of patterns of size 30 with k errors, using different v's.

## 8. Conclusions and Future Work

## Acknowledgements

## References and Notes

1. Navarro, G. A guided tour to approximate string matching. ACM Comput. Surv. **2001**, 33, 31–88.
2. Chang, W.; Marr, T. Approximate string matching and local similarity. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM), Asilomar, CA, USA, June 5–8, 1994; pp. 259–273.
3. Fredriksson, K.; Navarro, G. Average-optimal single and multiple approximate string matching. ACM J. Exp. Algorithmics **2004**, 9. No. 1.4.
4. Navarro, G.; Baeza-Yates, R.; Sutinen, E.; Tarhio, J. Indexing methods for approximate string matching. IEEE Data Eng. Bull. **2001**, 24, 19–27.
5. Sung, W.K. Indexed approximate string matching; Springer: Berlin, Germany, 2008; pp. 408–411.
6. Cole, R.; Gottlieb, L.A.; Lewenstein, M. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), Chicago, IL, USA, June 13–16, 2004; pp. 91–100.
7. Maaß, M.; Nowak, J. Text indexing with errors. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), Jeju Island, Korea, June 19–22, 2005; pp. 21–32.
8. Chan, H.L.; Lam, T.W.; Sung, W.K.; Tam, S.L.; Wong, S.S. A linear size index for approximate pattern matching. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM), Barcelona, Spain, July 5–7, 2006; pp. 49–59.
9. Coelho, L.; Oliveira, A. Dotted suffix trees: a structure for approximate text indexing. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE), Glasgow, UK, October 11–13, 2006; pp. 329–336.
10. Weiner, P. Linear pattern matching algorithms. In 14th IEEE Annual Symposium on Switching and Automata Theory, Iowa City, USA, October 15–17, 1973; pp. 1–11.
11. Manber, U.; Myers, E. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. **1993**, 22, 935–948.
12. Gonnet, G. A tutorial introduction to Computational Biochemistry using Darwin; Technical report; Informatik E.T.H.: Zuerich, Switzerland, 1992.
13. Ukkonen, E. Approximate string matching over suffix trees. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM), Asilomar, CA, USA, June 5–8, 1994; pp. 228–242.
14. Cobbs, A. Fast approximate matching using suffix trees. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching (CPM), Espoo, Finland, July 5–7, 1995; pp. 41–54.
15. Sutinen, E.; Tarhio, J. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM), Laguna Beach, CA, USA, June 10–12, 1996; pp. 50–63.
16. Navarro, G.; Baeza-Yates, R. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. **1998**, 1. No. 2.
17. Myers, E.W. A sublinear algorithm for approximate keyword searching. Algorithmica **1994**, 12, 345–374.
18. Navarro, G.; Baeza-Yates, R. A hybrid indexing method for approximate string matching. J. Discrete Algorithms **2000**, 1, 205–239.
19. Navarro, G.; Sutinen, E.; Tarhio, J. Indexing text with approximate q-grams. J. Discrete Algorithms **2005**, 3, 157–175.
20. Kurtz, S. Reducing the space requirement of suffix trees. Softw. Pract. Exper. **1999**, 29, 1149–1171.
21. Navarro, G.; Mäkinen, V. Compressed full-text indexes. ACM Comput. Surv. **2007**, 39. No. 2.
22. Manzini, G. An analysis of the Burrows-Wheeler transform. JACM **2001**, 48, 407–430.
23. Ziv, J.; Lempel, A. Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory **1978**, 24, 530–536.
24. Ferragina, P.; Manzini, G. Indexing compressed text. JACM **2005**, 52, 552–581.
25. Navarro, G. Indexing text using the Ziv-Lempel trie. J. Discrete Algorithms **2004**, 2, 87–114.
26. Kärkkäinen, J.; Ukkonen, E. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP), Recife, Brazil, August 08–09, 1996; pp. 141–155.
27. Arroyuelo, D.; Navarro, G.; Sadakane, K. Reducing the space requirement of LZ-Index. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM), Barcelona, Spain, July 5–7, 2006; pp. 318–329.
28. Russo, L.M.S.; Oliveira, A.L. A compressed self-index using a Ziv-Lempel dictionary. Inf. Retr. **2008**, 11, 359–388.
29. Sadakane, K. New text indexing functionalities of the compressed suffix arrays. J. Algorithms **2003**, 48, 294–313.
30. Grossi, R.; Gupta, A.; Vitter, J. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Baltimore, MD, USA, January 12–14, 2003; pp. 841–850.
31. Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms **2007**, 3. No. 20.
32. Grossi, R.; Vitter, J. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. **2005**, 35, 378–407.
33. Sadakane, K. Compressed suffix trees with full functionality. Theory Comput. Syst. **2007**, 41, 589–607.
34. Fischer, J.; Mäkinen, V.; Navarro, G. An(other) entropy-bounded compressed suffix tree. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM), Pisa, Italy, June 18–20, 2008; pp. 152–165.
35. Russo, L.; Navarro, G.; Oliveira, A. Fully-compressed suffix trees. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN), Búzios, Brazil, April 7–11, 2008; pp. 362–373.
36. Ferragina, P.; González, R.; Navarro, G.; Venturini, R. Compressed text indexes: From theory to practice. ACM J. Exp. Algorithmics (JEA) **2009**, 13. No. 12.
37. Huynh, T.; Hon, W.K.; Lam, T.W.; Sung, W.K. Approximate string matching using compressed suffix arrays. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), Jeju Island, Korea, June 19–22, 2005; pp. 434–444.
38. Lam, T.W.; Sung, W.K.; Wong, S.S. Improved approximate string matching using compressed suffix data structures. In Proceedings of the 16th Annual International Symposium on Algorithms and Computation (ISAAC), Hainan, China, December 19–21, 2005; pp. 339–348.
39. Navarro, G.; Baeza-Yates, R. Improving an algorithm for approximate pattern matching. Algorithmica **2001**, 30, 473–502.
40. Russo, L.M.S.; Navarro, G.; Oliveira, A.L. Approximate string matching with Lempel-Ziv compressed indexes. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE), Santiago, Chile, October 29–31, 2007; pp. 264–275.
41. Russo, L.M.S.; Navarro, G.; Oliveira, A.L. Indexed hierarchical approximate string matching. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE), Melbourne, Australia, November 10–12, 2008; pp. 144–154.
42. Apostolico, A. The myriad virtues of subword trees. In Combinatorial Algorithms on Words; Springer-Verlag: New York, NY, USA, 1985; pp. 85–96.
43. Gusfield, D. Algorithms on Strings, Trees and Sequences; Cambridge University Press: Cambridge, UK, 1997.
44. Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory **1976**, 22, 75–81.
45. Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory **1977**, 23, 337–343.
46. Welch, T. A technique for high performance data compression. IEEE Comput. Mag. **1984**, 17, 8–19.
47. Ukkonen, E. Finding approximate patterns in strings. J. Algorithms **1985**, 6, 132–137.
48. Landau, G.M.; Myers, E.W.; Schmidt, J.P. Incremental string comparison. SIAM J. Comput. **1998**, 27, 557–582.
49. Maaß, M. Linear bidirectional on-line construction of affix trees. Algorithmica **2003**, 37, 43–74.
50. Navarro, G. Implementing the lz-index: Theory versus practice. ACM J. Exp. Algorithmics **2009**, 13. No. 2.
51. Ukkonen, E. Finding approximate patterns in strings. J. Algorithms **1985**, 6, 132–137.
52. Navarro, G.; Baeza-Yates, R. A hybrid indexing method for approximate string matching. J. Discrete Algorithms **2000**, 1, 205–239.
53. Lee, S.; Park, K. Dynamic rank-select structures with applications to run-length encoded texts. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM), Pisa, Italy, June 18–20, 2008; pp. 95–106.
54. Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. JACM **1999**, 46, 395–415.
55. Navarro, G.; Baeza-Yates, R. Very fast and simple approximate string matching. Inf. Proc. Lett. **1999**, 72, 65–70.
56. Raman, R.; Raman, V.; Rao, S.S. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, January 6–8, 2002; pp. 233–242.
57. Mäkinen, V.; Navarro, G. Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms **2008**, 4, 32:1–32:38.
58. Wu, S.; Manber, U. Fast text searching allowing errors. Commun. ACM **1992**, 35, 83–91.

© 2009 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Russo, L.M.S.; Navarro, G.; Oliveira, A.L.; Morales, P. Approximate String Matching with Compressed Indexes. *Algorithms* **2009**, *2*, 1105-1136.
https://doi.org/10.3390/a2031105
