Linear-Time Text Compression by Longest-First Substitution

Abstract: We consider grammar-based text compression with longest-first substitution (LFS), where non-overlapping occurrences of a longest repeating factor of the input text are replaced by a new non-terminal symbol. We present the first linear-time algorithm for LFS. Our algorithm employs a new data structure called sparse lazy suffix trees. We also deal with a more sophisticated version of LFS, called LFS2, that allows better compression. The first linear-time algorithm for LFS2 is also presented.


Introduction
Data compression is the task of reducing data description length. Not only does it enable us to save space for data storage, but it also reduces time for data communication. This paper focuses on text compression, where the data to be compressed are texts (strings). Recent research developments show that text compression has a wide range of applications, e.g., pattern matching [1,2,3], string similarity computation [4,5], detecting palindromic/repetitive structures [4,6], inferring hierarchical structure of natural language texts [7,8], and analyses of biological sequences [9].
Grammar-based compression [10] is a kind of text compression scheme in which a context-free grammar (CFG) that generates only an input text w is output as a compressed form of w. Since the problem of computing the smallest CFG which generates w is NP-hard [11], many attempts have been made to develop practical algorithms that compute a small CFG which generates w. Examples of grammar-based compression algorithms are LZ78 [12], LZW [13], Sequitur [7], and Bisection [14]. Approximation algorithms for optimal grammar-based compression have also been proposed [15,16,17]. The first compression algorithm based on a subclass of context-sensitive grammars was introduced in [18].
Grammar-based compression based on greedy substitutions has been extensively studied. Wolff [19] introduced the concept of most-frequent-first substitution (MFFS), in which a digram (a factor of length 2) that occurs most frequently in the text is recursively replaced by a new non-terminal symbol. He also presented an O(n^2)-time algorithm for it, where n is the input text length. A linear-time algorithm for most-frequent-first substitution, called Re-Pair, was later proposed by Larsson and Moffat [20]. Apostolico and Lonardi [21] proposed the concept of largest-area-first substitution, in which a factor of the largest "area" is recursively replaced by a new non-terminal symbol. Here the area of a factor refers to the product of the length of the factor and the number of its non-overlapping occurrences in the input text. It was reported in [22] that compression by largest-area-first substitution outperforms gzip (based on LZ77 [23]) and bzip2 (based on the Burrows-Wheeler Transform [24]) on DNA sequences. However, to the best of our knowledge, no linear-time algorithm for this compression scheme is known.
This paper focuses on another greedy text compression scheme called longest-first substitution (LFS), in which a longest repeating factor of an input text is recursively replaced by a new non-terminal symbol. For example, for input text w = abaaabbababb$, the grammar S → BaaABA$; A → abb; B → ab, which generates only w, is the output of LFS.
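As a concrete illustration, the greedy scheme can be sketched with a naive (cubic-time, illustration-only) implementation; the linear-time algorithm of this paper is far more involved. When several LRFs of the same length exist, the choice among them is arbitrary, so the resulting grammar may differ from the one above while still generating only w. All function names below are ours, not from the paper.

```python
def lfs_compress(text):
    """Naive longest-first substitution (LFS): repeatedly find a longest
    repeating factor (LRF) and replace its non-overlapping occurrences
    by a fresh non-terminal. Illustration only; runs in cubic time."""
    seq = list(text)            # sequence of terminal/non-terminal symbols
    rules = {}                  # non-terminal name -> right-hand side
    names = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

    def count_nonoverlapping(fac):
        # greedy left-to-right count of non-overlapping occurrences of fac
        count, i = 0, 0
        while i + len(fac) <= len(seq):
            if seq[i:i + len(fac)] == fac:
                count, i = count + 1, i + len(fac)
            else:
                i += 1
        return count

    def find_lrf():
        # scan candidate lengths from longest to 2
        for L in range(len(seq) // 2, 1, -1):
            for i in range(len(seq) - L + 1):
                fac = seq[i:i + L]
                if count_nonoverlapping(fac) >= 2:
                    return fac
        return None

    while True:
        fac = find_lrf()
        if fac is None:
            break
        name = next(names)
        rules[name] = fac
        out, i = [], 0
        while i < len(seq):     # greedy left-to-right substitution
            if seq[i:i + len(fac)] == fac:
                out.append(name)
                i += len(fac)
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Derive the terminal string generated by a symbol."""
    if sym in rules:
        return "".join(expand(s, rules) for s in rules[sym])
    return sym
```

On w = abaaabbababb$ this sketch happens to pick the LRF aba first (rather than abb as in the text), producing a different but equally valid two-rule grammar; expanding the start sequence always recovers w.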
In this paper, we propose the first linear-time algorithm for text compression by LFS. A key idea is the use of a new data structure called sparse lazy suffix trees. Moreover, this paper deals with a more sophisticated version of longest-first text compression (named LFS2), where we also consider repeating factors of the right-hand sides of the existing production rules. For the same input text w = abaaabbababb$ as above, we obtain the following grammar: S → BaaABA$; A → Bb; B → ab. This method allows better compression since the total grammar size becomes smaller. In this paper, we present the first linear-time algorithm for text compression based on LFS2. Preliminary versions of our paper appeared in [25] and [26].

Related Work
Several algorithms for LFS or LFS2 have already been proposed, but in fact none of them runs in linear time in the worst case. Bentley and McIlroy [27] proposed an algorithm for LFS, but Nevill-Manning and Witten [8] pointed out that the algorithm does not run in linear time. Nevill-Manning and Witten also claimed that the algorithm can be improved so as to run in linear time, but they gave only a very short sketch of how, which is insufficient to reconstruct the whole algorithm. Lanctot et al. [28] proposed an algorithm for LFS2 and stated that it runs in linear time, but a careful analysis reveals that it actually takes O(n^2) time in the worst case for some input strings of length n. See the Appendix for our detailed analysis.

Notations
Let Σ be a finite alphabet of symbols. We assume that Σ is fixed and |Σ| is constant. An element of Σ* is called a string. Strings x, y, and z are said to be a prefix, factor, and suffix of string w = xyz, respectively.
The length of a string w is denoted by |w|. The empty string is denoted by ε, that is, |ε| = 0. Also, we assume that all strings end with a unique symbol $ ∈ Σ that does not occur anywhere else in the strings. Let Σ+ = Σ* \ {ε}. The i-th symbol of a string w is denoted by w[i] for 1 ≤ i ≤ |w|, and the factor of a string w that begins at position i and ends at position j is denoted by w[i : j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i : j] = ε for j < i, and w[i :] = w[i : |w|] for 1 ≤ i ≤ |w|. For any strings x, w, let BP_w(x) denote the set of the beginning positions of all the occurrences of x in w. That is, BP_w(x) = { i | w[i : i + |x| − 1] = x }. We say that strings x, y overlap in w if there exist integers i, j such that x = w[i : i + |x| − 1], y = w[j : j + |y| − 1], and the intervals [i, i + |x| − 1] and [j, j + |y| − 1] intersect. Let #occ_w(x) denote the maximum possible number of non-overlapping occurrences of x in w. If #occ_w(x) ≥ 2, then x is said to be repeating in w. We abbreviate a longest repeating factor of w to an LRF of w. Remark that there can exist more than one LRF for w.
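These definitions translate directly into code; the helper names below (bp, num_occ) are ours. Note that a factor can have two or more beginning positions yet still be non-repeating when all of its occurrences overlap:

```python
def bp(w, x):
    """BP_w(x): the 1-based beginning positions of all occurrences of x in w."""
    return [i + 1 for i in range(len(w) - len(x) + 1) if w[i:i + len(x)] == x]

def num_occ(w, x):
    """#occ_w(x): the maximum number of non-overlapping occurrences of x in w.
    Greedy left-to-right selection is optimal for a single fixed pattern."""
    count, i = 0, 0
    while i + len(x) <= len(w):
        if w[i:i + len(x)] == x:
            count, i = count + 1, i + len(x)
        else:
            i += 1
    return count
```

For w = ababa$, bp(w, "ab") = [1, 3] and num_occ(w, "ab") = 2, so ab is repeating, while num_occ(w, "aba") = 1 even though aba occurs twice, so aba is not repeating.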
Let Σ and Π be the sets of terminal and non-terminal symbols, respectively, such that Σ ∩ Π = ∅. A context-free grammar G is a formal grammar in which every production rule is of the form A → u, where A ∈ Π and u ∈ (Σ ∪ Π)*. Let u = xBy and v = xβy with x, y, β ∈ (Σ ∪ Π)* and B ∈ Π. If there exists a production rule B → β in G, then v = xβy is said to be directly derived from u = xBy by G, and it is denoted by u ⇒_G v. If there exists a sequence w_0, w_1, . . ., w_n such that w_i ∈ (Σ ∪ Π)*, u = w_0, v = w_n, and w_{i−1} ⇒_G w_i for every 1 ≤ i ≤ n, then we say that v is derived from u. The length of a non-terminal symbol A, denoted |A|, is the length of the string z ∈ Σ* that is derived from A. For convenience, we assume that any non-terminal symbol A in G occupies |A| positions. The size of a production rule A → v is the number of terminal and non-terminal symbols v contains.
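For instance, a memoized expansion routine computes |A| for every non-terminal of a grammar given as a dict from non-terminal names to right-hand sides (a representation we choose purely for illustration):

```python
def expand_symbol(rules, sym, memo=None):
    """Return the terminal string derived from sym; its length is |sym|.
    Terminals derive themselves, so |a| = 1 for a in the alphabet."""
    if memo is None:
        memo = {}
    if sym not in rules:                 # terminal symbol
        return sym
    if sym not in memo:                  # memoize shared non-terminals
        memo[sym] = "".join(expand_symbol(rules, s, memo) for s in rules[sym])
    return memo[sym]
```

For the LFS grammar of the example above (S → BaaABA$; A → abb; B → ab), this yields |A| = 3, |B| = 2, and the 13-symbol input text for S.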

Data Structures
Our text compression algorithm uses a data structure based on suffix trees [29]. The suffix tree of string w, denoted by STree(w), is defined as follows:

Definition 1 (Suffix Trees) STree(w) is a tree structure such that: (1) every edge is labeled by a nonempty factor of w, (2) every internal node has at least two child nodes, (3) all out-going edge labels of every node begin with mutually distinct symbols, and (4) every suffix of w is spelled out in a path starting from the root node.
Assuming any string w terminates with the unique symbol $ not appearing elsewhere in w, there is a one-to-one correspondence between a suffix of w and a leaf node of STree(w). It is easy to see that the numbers of the nodes and edges of STree(w) are linear in |w|. Moreover, by encoding every edge label x of STree(w) with an ordered pair (i, j) of integers such that x = w[i : j], each edge only needs constant space. Therefore, STree(w) can be implemented with a total of O(|w|) space. Also, it is well known that STree(w) can be constructed in O(|w|) time (e.g., see [29]).
STree(w) for string w = ababa$ is shown in Figure 1. For any node v of STree(w), str(v) denotes the string obtained by concatenating the labels of the edges in the path from the root node to node v.
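For illustration, STree(w) can be built naively in O(|w|^2) time by inserting the suffixes one by one and splitting edges at mismatches, with each edge label stored as a (start, end) index pair into w as described above; linear-time construction requires Ukkonen's algorithm [29]. The class and function names below are ours.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end   # edge label is w[start:end]
        self.children = {}                  # first symbol -> child Node
        self.leaf_id = None                 # 1-based suffix id at leaves

def suffix_tree(w):
    """Naive O(|w|^2) suffix tree construction, for illustration only.
    Assumes w ends with a unique terminator symbol such as $."""
    root = Node()
    for i in range(len(w)):                 # insert suffix w[i:]
        node, j = root, i
        while j < len(w):
            c = w[j]
            if c not in node.children:      # fresh branch: add a leaf
                leaf = Node(j, len(w))
                leaf.leaf_id = i + 1
                node.children[c] = leaf
                break
            child = node.children[c]
            k = child.start
            while k < child.end and w[k] == w[j]:
                k += 1                      # walk down the edge label
                j += 1
            if k == child.end:              # consumed the whole edge
                node = child
                continue
            mid = Node(child.start, k)      # split the edge at the mismatch
            node.children[c] = mid
            child.start = k
            mid.children[w[k]] = child
            leaf = Node(j, len(w))
            leaf.leaf_id = i + 1
            mid.children[w[j]] = leaf
            break
    return root
```

For w = ababa$ the resulting tree has six leaves (one per suffix) and four branching nodes (the root and the nodes for a, ba, and aba), matching Figure 1.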
The length of node v, denoted len(v), is defined to be |str(v)|. It is an easy application of the Ukkonen algorithm [29] to compute the lengths of all nodes while constructing STree(w). The leaf node ℓ such that str(ℓ) = w[i :] is denoted by leaf_i, and i is said to be the id of the leaf. Every node v of STree(w) except for the root node has a suffix link, denoted by suf(v), such that suf(v) = v′, where str(v′) is the suffix of str(v) with len(v′) + 1 = len(v). Linear-time suffix tree construction algorithms (e.g., [29]) make extensive use of the suffix links.
A sparse suffix tree [30] of w ∈ Σ* is a kind of suffix tree which represents only a subset of the suffixes of w. The sparse suffix tree of w ∈ (Σ ∪ Π)* represents the subset {w[i :] | w[i] ∈ Σ} of suffixes of w which begin with a terminal symbol. Let ℓ be the length of the LRFs of w. A reference node of the sparse suffix tree of w ∈ (Σ ∪ Π)* is any node v such that len(v) ≥ ℓ + 1, and there is no node u such that str(u) is a proper prefix of str(v) and len(u) ≥ ℓ + 1.
Our algorithm uses the following data structure.
Definition 2 (Sparse Lazy Suffix Trees) A sparse lazy suffix tree (SLSTree) of string w ∈ (Σ ∪ Π)*, denoted by SLSTree(w), is a kind of sparse suffix tree such that: (1) all paths from the root node to the reference nodes coincide with those of the sparse suffix tree of w, and (2) every reference node v stores an ordered triple ⟨min(v), max(v), card(v)⟩, where min(v) = min BP_w(str(v)), max(v) = max BP_w(str(v)), and card(v) = |BP_w(str(v))|.

SLSTree(w) is called "lazy" since its subtrees that are located below the reference nodes may not coincide with those of the corresponding sparse suffix tree of w. Our algorithms of Section 3 run in linear time by "neglecting" to update these subtrees below the reference nodes.
The triples ⟨min(v), max(v), card(v)⟩ for all nodes v can be computed in O(|w|) time.

Proof. By a standard postorder traversal on STree(w), propagating the id of each leaf node.
Since STree(w) can be constructed in O(|w|) time [29], we can build SLSTree(w) in a total of O(|w|) time.

Off-Line Compression by Longest-First Substitution
Given a text string w ∈ Σ*, we here consider a greedy approach to construct a context-free grammar which generates only w. The key is how to select a factor of w to be replaced by a non-terminal symbol from Π. Here, we consider the longest-first-substitution approach, where we recursively replace as many LRFs as possible with non-terminal symbols.
Example. Let w = abaaabbababb$. At the beginning, the grammar is of the following simple form S → abaaabbababb$, where the right-hand side of the production rule consists only of terminal symbols from Σ. Now we focus on the right-hand side of S, which has two LRFs aba and abb. Let us here choose abb to be replaced by non-terminal A ∈ Π. We obtain the following grammar: S → abaaAabA$; A → abb. The other LRF aba of length 3 is no longer present in the right-hand side of S. Thus we focus on an LRF ab of length 2. Replacing ab by non-terminal B ∈ Π results in the following grammar: S → BaaABA$; A → abb; B → ab. Since the right-hand side of S has no repeating factor longer than 1, we are done.
Let w_0 = w, and let w_k denote the string obtained by replacing an LRF of w_{k−1} with a non-terminal symbol A_k. LRF(w_{k−1}) denotes the LRF of w_{k−1} that is replaced by A_k; namely, we create a new production rule A_k → LRF(w_{k−1}). In the above example, w_0 = w = abaaabbababb$, LRF(w_0) = abb, A_1 = A, w_1 = abaaAabA$, LRF(w_1) = ab, A_2 = B, and w_2 = BaaABA$.
Due to the property of the longest first approach, we have the following observation.
Observation 1 Let A_1, . . ., A_k ∈ Π be the non-terminal symbols which replace LRF(w_0), . . ., LRF(w_{k−1}), respectively. For any 1 ≤ i ≤ k, the right-hand side of the production rule of A_i consists only of terminal symbols from Σ.

In what follows, we show our algorithm which outputs a context-free grammar which generates a given string. Our algorithm heavily uses the SLSTree structure. In this section, we show how to find an LRF of w_k from SLSTree(w_k). The next lemmas characterize an LRF of w_k that is not represented by a node of SLSTree(w_k).
Lemma 1 Let x be an LRF of w_k that is not represented by a node of SLSTree(w_k), and let i = min BP_{w_k}(x) and j = max BP_{w_k}(x). Then j = i + |x|.

Proof. Since x is not represented by a node of SLSTree(w_k), every occurrence of x in w_k is followed by the same symbol a (otherwise x would be a branching point, which contradicts the precondition that x is not represented by a node of SLSTree(w_k)). Hence xa occurs wherever x occurs. Since x is an LRF of w_k, we have j ≥ i + |x|. However, if we assume j > i + |x|, then xa has two non-overlapping occurrences at i and j, and this contradicts the precondition that x is a longest repeating factor of w_k. Hence j = i + |x|.

The above lemma implies that an LRF x is not represented by a node of SLSTree(w_k) only if the first and the last occurrences of x form a square xx in w_k. For example, see Figure 1, which illustrates SLSTree(w_0) for w = ababa$. One can see that ab is an LRF of w_0 but it is not represented by a node of SLSTree(w_0).
However, the following lemma guarantees that it is indeed sufficient to consider the strings represented by nodes of SLSTree(w k ) as candidates for LRF (w k ).
Lemma 2 Let x be an LRF of w_k that is not represented by a node of SLSTree(w_k). Then there exists another LRF y of w_k that is represented by a node of SLSTree(w_k) such that |x| = |y|. Moreover, x is no longer present in w_{k+1} after a substitution for y (see also Figure 2).

Proof. Let i = min BP_{w_k}(x) and j = max BP_{w_k}(x). It follows from Lemma 1 that j = i + |x|. Suppose that x is represented on an edge from some node s to some node t of STree(w). Let u = str(t). Then we have w_k[i : i + |u| − 1] = w_k[j : j + |u| − 1] = u, and since t is a branching node, w_k[i + |u|] ≠ w_k[j + |u|]. Let y = w_k[i + |u| − |x| : i + |u| − 1]. Then y has two non-overlapping occurrences beginning at positions i + |u| − |x| and j + |u| − |x|, so y is an LRF of w_k with |y| = |x|; since these occurrences are followed by the distinct symbols w_k[i + |u|] and w_k[j + |u|], y is represented by a node of SLSTree(w_k). Since x occurs only within the region w_k[i : j + |u| − 1], x does not occur in w_{k+1} after a substitution for y.
In the running example of Figure 1, ba is an LRF of w_0 that is represented by a node of SLSTree(w_0). After its two occurrences are replaced by a non-terminal symbol A_1, the factor ab, which is an LRF of w_0 not represented by a node of SLSTree(w_0), is no longer present in w_1 = aA_1A_1$.
After constructing SLSTree(w_0) = SLSTree(w), we create a bin-sorted list of the internal nodes of SLSTree(w) in decreasing order of their lengths. This can be done in linear time by a standard traversal on SLSTree(w). We remark that a new internal node v may appear in SLSTree(w_k) for some k ≥ 1, which did not exist in SLSTree(w_{k−1}). However, we have len(v) ≤ |LRF(w_{k−1})|. Thus, we can maintain the bin-sorted list by inserting node v in constant time.
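The bin-sorted list can be sketched as an array of buckets indexed by node length; processing walks the buckets from the longest downwards, and a newly appearing internal node is appended to its bucket in constant time. The representation below (nodes as (name, length) pairs) is our own illustration, not the paper's.

```python
def make_bins(nodes, max_len):
    """Bucket internal nodes by len(v); one traversal, O(n) total."""
    bins = [[] for _ in range(max_len + 1)]
    for name, length in nodes:
        bins[length].append(name)
    return bins

def longest_first(bins):
    """Yield nodes in non-increasing length order (lengths >= 2 only)."""
    for length in range(len(bins) - 1, 1, -1):
        for name in bins[length]:
            yield name, length
```

A node created later (as in the remark above) is simply appended to the bucket of its length, which is never longer than the length currently being processed.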
Given a node s in the bin-sorted list, we can determine whether str(s) is repeating or not by using SLSTree(w_k), as follows.
Remark that the values of min BP_{w_{k−1}}(str(s_i)) and max BP_{w_{k−1}}(str(s_i)) are stored in node s_i and can be referred to in constant time. Since the inequality of Lemma 4 is checked at most once for each node s, the check takes amortized constant time.
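Under Lemmas 3 and 4, the repeating test reduces to comparing the extreme occurrence positions collected from the children: the leftmost and rightmost occurrences of str(s) fail to overlap exactly when their distance is at least len(s). A sketch, with our own triple representation (min, max, card):

```python
def is_repeating(child_triples, length):
    """str(s) is repeating iff its leftmost and rightmost occurrences
    are non-overlapping, i.e. max - min >= len(s)."""
    mn = min(t[0] for t in child_triples)
    mx = max(t[1] for t in child_triples)
    return mx - mn >= length
```

For example, in ababa$ the factor ab occurs at positions 1 and 3, so it is repeating for length 2 (3 − 1 ≥ 2) while aba at the same positions is not (3 − 1 < 3).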
Suppose we have found an LRF of w_k as mentioned above. In the sequel, we show our greedy strategy to select occurrences of the LRF in w_k to be replaced with a new non-terminal symbol.
The next lemma is essentially the same as Lemma 2 of Kida et al. [1].
Lemma 5 For any non-repeating factor x of w_k, BP_{w_k}(x) forms a single arithmetic progression.
Therefore, for any non-repeating factor x of w_k, BP_{w_k}(x) can be expressed by an ordered triple consisting of the minimum element min BP_{w_k}(x), the maximum element max BP_{w_k}(x), and the cardinality |BP_{w_k}(x)|, which takes constant space.

Lemma 6 Let s be any node of SLSTree(w_k) such that str(s) is an LRF of w_k, and let s′ be any child of s. Then BP_{w_k}(str(s′)) contains at most two positions corresponding to non-overlapping occurrences of str(s′) in w_k.
Proof. Assume on the contrary that BP_{w_k}(str(s′)) contains three non-overlapping occurrences of str(s′), and let them be i_1, i_2, i_3 in increasing order. Then we have i_2 ≥ i_1 + len(s′) and i_3 ≥ i_2 + len(s′), so i_3 ≥ i_1 + 2 len(s′) ≥ i_1 + len(s) + 2, which implies that w_k[i_1 : i_1 + len(s)] and w_k[i_3 : i_3 + len(s)] are non-overlapping. Moreover, since len(s′) > len(s), we have w_k[i_1 : i_1 + len(s)] = w_k[i_3 : i_3 + len(s)]. However, this contradicts the precondition that str(s) is an LRF of w_k.
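Lemma 5 is what makes the constant-space triple sound: the occurrence set of a non-repeating factor is one arithmetic progression, so ⟨min, max, card⟩ loses no information. A sketch, with function names of our own:

```python
def to_triple(positions):
    """Compress an ascending arithmetic progression of occurrence
    positions into the (min, max, card) triple stored at a node."""
    return positions[0], positions[-1], len(positions)

def from_triple(mn, mx, card):
    """Recover the positions; the common difference of a progression
    with card terms running from mn to mx is (mx - mn) / (card - 1)."""
    if card == 1:
        return [mn]
    d = (mx - mn) // (card - 1)
    return [mn + i * d for i in range(card)]
```

Round-tripping through the triple reconstructs the original position set exactly, which is why the tree never needs to store occurrence lists.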
From Lemma 6, each child s′ of a node s such that str(s) is an LRF corresponds to at most two non-overlapping occurrences of str(s). Due to Lemma 3, we can greedily select occurrences of str(s) to be replaced by a new non-terminal symbol by checking all children s_1, . . ., s_ℓ of node s. According to Lemma 5, it takes amortized constant time to select such occurrences for each node s.
Note that we have to select occurrences of str(s) so that no occurrence of str(s) remains in the text string, and at least two occurrences of str(s) are selected. We remark that we can greedily choose at least max{2, #occ_{w_k}(str(s))/2} occurrences.
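The greedy selection itself scans the occurrence positions from left to right and keeps every occurrence that does not overlap the previously kept one; for a fixed factor this yields a maximum set of non-overlapping occurrences. A sketch (names ours):

```python
def select_nonoverlapping(positions, length):
    """Greedily select non-overlapping occurrences of a factor of the
    given length; positions are ascending 1-based begin positions."""
    chosen, next_free = [], 1
    for p in positions:
        if p >= next_free:        # does not overlap the last chosen one
            chosen.append(p)
            next_free = p + length
    return chosen
```

For ab in ababab all three occurrences survive, whereas for the overlapping pair of aba in ababa only the leftmost is kept.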

How to Update SLSTree(w_k^{i−1}) to SLSTree(w_k^i)
Let L be the set of the greedily selected occurrences of LRF(w_k) in w_k. For any 0 ≤ i ≤ |L|, let w_k^i denote the string obtained after replacing the first i occurrences of LRF(w_k) with the non-terminal symbol A_{k+1}. Namely, w_k^0 = w_k and w_k^{|L|} = w_{k+1}. In this section we show how to update SLSTree(w_k^{i−1}) to SLSTree(w_k^i). Let p be the beginning position of the i-th occurrence in L. Assume that we have SLSTree(w_k^{i−1}), and that we have replaced the occurrence of LRF(w_k) beginning at position p with A_{k+1}. We now have w_k^i, and we have to update SLSTree(w_k^{i−1}) to SLSTree(w_k^i). A naive way to obtain SLSTree(w_k^i) is to remove all the suffixes of w_k^{i−1} from SLSTree(w_k^{i−1}) and insert all the suffixes of w_k^i into it. However, since only the nodes not longer than LRF(w_k) are important for our longest-first strategy, only the suffixes w_k^{i−1}[p − t :] with w_k^{i−1}[r] ∈ Σ for every p − t ≤ r < p have to be removed from SLSTree(w_k^{i−1}), and only the corresponding suffixes w_k^i[p − t :] have to be inserted into the tree (see the light-shaded suffixes of Figure 3).

Lemma 7 For any t, let r be the shortest node of SLSTree(w_k^{i−1}) in the path spelling out w_k^{i−1}[p − t :] from the root.
Hence there exists an edge from the root node to leaf_{p−t} labeled with w_k^i[p − t :] in SLSTree(w_k^i). Consider Case 2 (see also Figure 5). Now assume that we have located nodes x_{p−t} and s_{p−t}. We can then locate s_{p−t+1} as follows. Consider node x_{p−t+1}. Remark that str(suf(x_{p−t})) is a prefix of str(x_{p−t+1}), and thus we can detect x_{p−t+1} in O(|str(x_{p−t+1})| − |str(suf(x_{p−t}))|) time by using the suffix link. After finding x_{p−t+1}, we can locate or create s_{p−t+1} in constant time.
The total time cost for detecting x_{p−t} for all 1 ≤ t ≤ t_max is linear in the sum of the terms |str(x_{p−t+1})| − |str(suf(x_{p−t}))|, which telescopes to O(|LRF(w_k)|) in total. Hence we can locate each s_{p−t} in amortized constant time.
Let v be the reference node in the path from the root to some leaf_{p−t}. Assume that leaf_{p−t} is removed from the subtree of v and redirected to node s in the same path, such that str(s) = w_k^i[p − t : p − 1]. In order to update SLSTree(w_k^{i−1}) to SLSTree(w_k^i), we have to maintain the triple ⟨min(v), max(v), card(v)⟩ for node v. One may be concerned that if p − t is neither min(v) nor max(v) and card(v) ≥ 4 in SLSTree(w_k^{i−1}), the occurrences of str(v) in SLSTree(w_k^i) might no longer form a single arithmetic progression. However, we have the following lemma. For any factor y of w_k^i, let Dead_{w_k^i}(y) denote the set of positions in BP_{w_k^{i−1}}(y) that overlap with the i-th greedily selected occurrence of LRF(w_k) in w_k.

Lemma 9 Let v be any reference node of SLSTree(w_k^{i−1}), and let m, n ∈ BP_{w_k^{i−1}}(str(v)) with m < n. If m, n ∉ Dead_{w_k^i}(str(v)), then there is no integer r such that m < r < n and r ∈ Dead_{w_k^i}(str(v)). (See Figure 6.)
Proof. Assume on the contrary that there exists an integer r such that m < r < n and r ∈ Dead_{w_k^i}(str(v)).

Recall that p is the beginning position of the i-th greedily selected occurrence of LRF(w_k) in w_k. Also, for any 1 ≤ t ≤ |LRF(w_k)| such that w_k^{i−1}[r] ∈ Σ for every p − t ≤ r < p, we have removed leaf_{p−t} from the subtree rooted at the reference node v and have reconnected it to the node s such that str(s) = w_k^i[p − t : p − 1]. According to the above lemma, if min(v) < p − t < max(v), then leaf_j for every p − t ≤ j ≤ max(v) is removed from the subtree of v. After processing leaf_{p−t}, max(v) is updated to p − t − d, where d = (max(v) − min(v))/(card(v) − 1) is the step of the progression, and card(v) is updated accordingly. Notice that leaf_{p+h} for every 0 ≤ h ≤ |LRF(w_k)| − 1 has to be removed from the tree, since w_k^i[p + h] ∉ Σ and therefore this leaf node should not exist in SLSTree(w_k^i) (see the dark-shaded suffixes of Figure 3). Removing each leaf can be done in constant time. Maintaining the information about the triple for the arithmetic progression of the reference nodes can be done in the same way as mentioned above.
The following lemma states how to locate each reference node.
A special case happens when there exists a node s in the path from the root to leaf_j such that len(s) = ℓ and the edge from s in the path starts with some non-terminal symbol A_h with h < k; namely, w_k^i[j + ℓ] = A_h. Due to the property of the longest-first approach, we have |A_h| ≥ |LRF(w_k)|. Thus v(j) = leaf_j. Moreover, for any j ≤ k ≤ p − 1, v(k) = leaf_k. (See also the right illustration of Figure 7.) It is thus clear that each v(k) can be found in constant time. From the above discussions, we conclude Theorem 1 below. Pseudo-codes of our algorithms are shown in Algorithms 1, 2, and 3.

Reducing Grammar Size
In the above sections we considered text compression by longest-first substitution, where we construct a context-free grammar G that generates only a given string w. By Observation 1, for any production rule A_k → x_k of G, x_k contains only terminal symbols from Σ. In this section, we take the factors of x_k into consideration as candidates for LRFs, and also replace LRFs appearing in x_k. This way we can reduce the total grammar size.

A related open problem is the following: Does there exist a linear-time algorithm for text compression by largest-area-first substitution (LAFS)? The algorithm presented in [21] uses minimal augmented suffix trees (MASTrees) [31], which enable us to efficiently find a factor of the largest area. The size of MASTrees is known to be linear in the input size [32], but the state-of-the-art algorithm of [32] to construct MASTrees takes O(n log n) time, where n is the input text length. Also, the algorithm of [21] for LAFS reconstructs the MASTree from scratch every time a factor of the largest area is replaced by a new non-terminal symbol. Would it be possible to update a MASTree, or a relaxed version of it, after each substitution?

To find LRF(z_1), we check the nodes in the list as follows.
• Length 8.The generalized suffix tree has no node representing aaabbbbb, and hence it is not an LRF.
• Length 7. Since node aaaaaaa exists in the generalized suffix tree, we traverse its subtree and find 2 occurrences, 23 and 24, in z_1. However, it is not an LRF of z_1. The other candidate aabbbbb does not have a corresponding node in the tree, so it is not an LRF, either.
• Length 6. Node aaaaaa exists in the generalized suffix tree and we find 3 occurrences, 23, 24 and 25, in z_1 by traversing the tree, but it is not an LRF. The tree has no node corresponding to abbbbb, hence it is not an LRF.
• Length 5. Node aaaaa exists in the generalized suffix tree and we find 4 occurrences, 23, 24, 25 and 26, in z_1 by traversing the tree, but it is not an LRF. There is no node in the tree corresponding to bbbbb.

Focus on the above operations, where we examined factors of lengths from 7 to 5. The total time cost to find the occurrences for the LRF candidates of these lengths is proportional to 2 + 3 + 4, but none of them is an LRF of z_1 in the end.
In general, for any input string of the form w = a^{2k−1} b^{k+1} a^k b^{k+1} c a^{2k} $, the time cost of the Lanctot algorithm for finding LRF(z_1) is proportional to 2 + 3 + · · · + k = Θ(k^2). Since k = O(|w|) = O(n), the Lanctot algorithm takes O(n^2) time.
In his PhD thesis [33], Lanctot modified the algorithm so that all the occurrences of each candidate factor in w are stored in each element of the bin-sorted list (Section 3.1.3, page 55, line 1). However, this clearly requires O(n^2) space. Note that using a suffix array cannot immediately solve this, since the lexicographical ordering of the suffixes can change due to substitution of LRFs, and no efficient methods to edit suffix arrays for such a case are known.
On the contrary, as shown in Section 3, each node v of our data structure stores an ordered triple ⟨min(v), max(v), card(v)⟩, and our algorithm properly maintains this information when the tree is updated. Using this triple, we can check in amortized constant time whether or not each node in the bin-sorted list is an LRF. Hence the total time cost remains O(n).

Figure 2 .
Figure 2. Illustration for the proof of Lemma 2. Since u is represented by a node of SLSTree(w_k), we know that w_k[i + |u|] ≠ w_k[j + |u|].

Lemma 3 Let s be any node of SLSTree(w_k) with len(s) ≤ |LRF(w_k)| and let s_1, . . ., s_ℓ be the children of s. Then BP_{w_k}(str(s)) is a disjoint union of BP_{w_k}(str(s_1)), . . ., BP_{w_k}(str(s_ℓ)).

Proof. Clear from the definition of SLSTree(w_k).

Lemma 4 For any node s of SLSTree(w_{k−1}) such that |LRF(w_k)| ≤ len(s) ≤ |LRF(w_{k−1})|, it takes amortized constant time to check whether or not str(s) is an LRF of w_k.

Proof. Let s_1, . . ., s_ℓ be the children of s. Then str(s) is repeating if and only if max_{1≤i≤ℓ} max BP_{w_{k−1}}(str(s_i)) − min_{1≤i≤ℓ} min BP_{w_{k−1}}(str(s_i)) ≥ len(s).

Figure 3.
Figure 3. LRF(w_k) at position p of w_k^{i−1} is replaced by non-terminal symbol A_k in w_k^i. Every w_k^{i−1}[p − t :] is removed from the tree and every w_k^i[p − t :] is inserted into the tree (the light-shaded suffixes in the right figure). In addition, every w_k^{i−1}[p + h :] for 1 ≤ h ≤ |LRF(w_k)| − 1 is removed from the tree (the dark-shaded suffixes in the right figure).
For any j, let v(j) denote the reference node of SLSTree(w_k^{i−1}) in the path from the root spelling out suffix w_k^{i−1}[j :]. For each j such that p − |LRF(w_k)| ≤ j ≤ p + |LRF(w_k)| − 1, we can locate the reference node v(j) in amortized constant time.

Figure 7 .
Figure 7. The left figure illustrates how to find v(j) from v(j − 1). The right one illustrates a special case where v(j) = leaf_j. Once v(j) = leaf_j, it holds that v(k) = leaf_k for any j ≤ k ≤ p − 1.

Theorem 1
For any string w ∈ Σ*, the proposed algorithm for text compression by longest-first substitution runs in O(|w|) time using O(|w|) space.

© 2009 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
and since r is not longer than the reference node in the path spelling out uLRF(w_k) from the root node of SLSTree(w_k^i), there exists at least one integer m such that m ∈ BP_{w_k^i}(str(r)) and m ∉ BP_{w_k^i}(uA_k). Hence there exists a node s in SLSTree(w_k^i) such that str(s) = u and that has an out-going edge labeled with w_k^i[p :] = A_k w_k^i[p + |A_k| :] leading to leaf_{p−t}. The next lemma states how to locate node s of Case 2 of Lemma 7.

Figure 5. Illustration of Case 2 of Lemma 7.

Consider the largest possible t and denote it by t_max. Since t_max ≤ |LRF(w_k)|, the node x_{p−t_max} can be found in O(|LRF(w_k)|) time by going down the path spelling out w_k^i[p − t_max : p − 1] from the root.

Lemma 8 For each t, we can locate the node s such that str(s) = w_k^i[p − t : p − 1] in amortized constant time.

Proof. Let x_{p−t} be the longest node in the tree such that str(x_{p−t}) is a prefix of w_k^i[p − t : p − 1].