Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Abstract: We present algorithms computing the non-overlapping Lempel–Ziv-77 factorization and the longest previous non-overlapping factor table within small space in linear or near-linear time with the help of modern suffix tree representations fitting into limited space. With similar techniques, we show how to answer substring compression queries for the Lempel–Ziv-78 factorization with a possible logarithmic multiplicative slowdown depending on the used suffix tree representation.


Introduction
The Lempel-Ziv-77 (LZ77) [1] and Lempel-Ziv-78 (LZ78) [2] factorizations are some of the most well-studied techniques for lossless data compression. Several variants such as Lempel-Ziv-Storer-Szymanski (LZSS) [3] have been proposed, and nowadays we often perceive the LZSS factorization as the standard variant of the LZ77 factorization. Both are defined as follows: Given a factorization T = F_1 · · · F_z for a string T:
• it is the LZSS factorization of T if each factor F_x, for x ∈ [1 . . z], is either the leftmost occurrence of a character or the longest prefix of F_x · · · F_z that occurs at least twice in F_1 · · · F_x; or
• it is the classic LZ77 factorization of T if each factor F_x, for x ∈ [1 . . z − 1], is the shortest prefix of F_x · · · F_z that has only one occurrence in F_1 · · · F_x (as a suffix). The last factor F_z is the suffix T[1 + |F_1 · · · F_{z−1}| . .] that may have multiple occurrences in F_1 · · · F_z.
The non-overlapping variation restricts, when computing F_x, all candidate occurrences of F_x to end before F_x starts. For LZSS, this means that a factor F_x must occur at least once in F_1 · · · F_{x−1} unless it is the leftmost occurrence of a character. Given a text T of length n whose characters are drawn from an integer alphabet of size σ = n^{O(1)}, we study the problem of computing the non-overlapping LZSS factorization memory-efficiently with the aid of two suffix tree representations, which were used by Fischer et al. [4] (Section 2.2) to compute the classic LZ77, LZSS, and LZ78 factorizations in linear time within the asymptotic space requirements of the respective suffix tree. In this article, we obtain the non-overlapping LZSS factorization with similar techniques and within the same space bounds:

Theorem 1. Given a text T[1 . . n] of length n whose characters are drawn from an integer alphabet with size σ = n^{O(1)}, we can compute its non-overlapping LZSS factorization
• in O(ε^{−1} n) time using (1 + ε)n lg n + O(n) bits (excluding the read-only text T); or
• in O(n lg^ε n) time using O(n lg σ) bits,
for a selectable constant ε ∈ (0, 1]. We support outputting the factors directly or storing the factors within the (asymptotic) bounds of the working space such that we can retrieve a factor in constant time.
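To make the difference between the two LZSS variants concrete, the following brute-force sketch factorizes a short string directly from the definitions. It is not the suffix tree algorithm of this article: the names `lcp` and `lzss` are hypothetical, the running time is cubic in the worst case, and positions are reported 1-based as in the text.

```python
def lcp(text, a, b):
    """Length of the longest common prefix of text[a:] and text[b:] (0-based)."""
    k = 0
    while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
        k += 1
    return k


def lzss(text, overlapping):
    """Return the LZSS factors of text as (factor, coding) pairs.

    The coding is the character itself for a fresh factor, and otherwise the
    pair (referred position, length) with a 1-based referred position.
    """
    factors = []
    i = 0
    while i < len(text):
        best_len, best_pos = 0, -1
        for j in range(i):
            length = lcp(text, j, i)
            if not overlapping:
                # the previous occurrence must end before position i
                length = min(length, i - j)
            if length > best_len:  # strict comparison keeps the leftmost position
                best_len, best_pos = length, j
        if best_len == 0:
            factors.append((text[i], text[i]))
            i += 1
        else:
            factors.append((text[i:i + best_len], (best_pos + 1, best_len)))
            i += best_len
    return factors
```

Calling `lzss("aaababaaaba", overlapping=False)` and `lzss("aaababaaaba", overlapping=True)` factorizes the string used in Figure 1 with both variants.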
We also show that we can compute the longest previous non-overlapping factor table [5] within the same space and time complexities (Theorem 3) by providing a succinct representation of this table (Lemma 1).
Subsequently, we study the substring compression query problem [6], where the task is to compute the factorization of a given substring of the text in time related to the number of computed factors and possibly with a logarithmic dependency on the text length. So far, this problem has only been considered for the LZ77 factorization family. Here, we provide the first non-trivial solutions for LZ78, again with the help of several suffix tree representations:

Theorem 2. Given a text T[1 . . n] of length n whose characters are drawn from an integer alphabet with size σ = n^{O(1)}, we can compute a data structure on T in O(n) time that computes, given an interval I ⊂ [1 . . n], the LZ78 factorization of T[I] [...], where z_78[I] is the number of computed LZ78 factors and ε ∈ (0, 1] is a selectable constant. In the last result, we additionally need the n lg σ bits of space for the read-only text during the queries if there is any character of the alphabet omitted in the text (otherwise, we can simulate a text access with the function head as described in [4]).
We can further speed up the last two solutions of Theorem 2 by spending more space (Theorem 4). Figure 1 shows a juxtaposition of all Lempel-Ziv factorizations addressed in this article.

Figure 1. Juxtaposition of the overlapping LZSS factorization, the non-overlapping LZSS factorization, and the LZ78 factorization on the string T = aaababaaaba. A factor is visualized by a rounded rectangle. Its coding consists of a mere character if it has no reference; otherwise, its coding consists of its referred position and its length for both LZSS variants or its referred index and its last character for LZ78.

Preliminaries
With lg we denote the logarithm log_2 to base two. Our computational model is the word RAM model with machine word size Ω(lg n) for a given input size n. Accessing a word costs O(1) time.
Let T be a text of length n whose characters are drawn from an integer alphabet Σ = [1 . . σ] with σ = n^{O(1)}. Given X, Y, Z ∈ Σ* with T = XYZ, then X, Y, and Z are called a prefix, substring, and suffix of T, respectively. We call T[i . .] the ith suffix of T and denote the substring T[i] · · · T[j] by T[i . . j]. Given a character c ∈ Σ and an integer j, the rank query T.rank_c(j) counts the occurrences of c in T[1 . . j] and the select query T.select_c(j) gives the position of the jth c in T. We stipulate that rank_c(0) = select_c(0) = 0. If the alphabet is binary, i.e., when T is a bit vector, there are data structures [7,8] that use o(|T|) extra bits of space and can compute rank and select in constant time, respectively. Each of those data structures can be constructed in time linear in |T|. We say that a bit vector has a rank-support and a select-support if it is endowed with data structures providing constant-time access to rank and select, respectively.
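The semantics of rank and select on a bit vector can be illustrated with the following minimal sketch. The class name `BitVector` is hypothetical, and the sketch stores plain prefix sums and position lists, so it does not achieve the o(|T|) extra bits of the cited data structures; it only mirrors the 1-based interface and the convention rank_c(0) = select_c(0) = 0.

```python
class BitVector:
    """Bit vector with naive rank/select support (1-based positions)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[j] = number of 1-bits among the first j bits
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        # positions[c][k] = position of the k-th occurrence of bit c (entry 0 is a sentinel)
        self.positions = {0: [0], 1: [0]}
        for pos, b in enumerate(self.bits, start=1):
            self.positions[b].append(pos)

    def rank(self, c, j):
        """Number of occurrences of bit c among the first j bits."""
        ones = self.prefix[j]
        return ones if c == 1 else j - ones

    def select(self, c, j):
        """Position of the j-th occurrence of bit c; select(c, 0) = 0."""
        return self.positions[c][j]
```

For the bit vector 10110, for example, `rank(1, 3)` counts the two 1-bits among the first three positions, and `select(1, 3)` returns position 4.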
From now on, we assume that T ends with a special character $ smaller than all other characters appearing in T. Under this assumption, there is no suffix of T having another suffix of T as a prefix. The suffix trie of T is the trie of all suffixes of T. There is a one-to-one relationship between the suffix trie leaves and the suffixes of T. The suffix tree ST of T is the tree obtained by compacting the suffix trie of T. Similar to the suffix trie, the suffix tree has n leaves, but the number of internal nodes of the suffix tree is at most n because every internal ST node is branching. The string stored in a suffix tree edge e is called the label of e. We define the function c(e) returning, for each edge e, the length of e's label. The string label of a node v is defined as the concatenation of all edge labels on the path from the root to v; its string depth, denoted by str_depth(v), is the length of its string label. The leaf corresponding to the ith suffix T[i . .] is labeled with the suffix number i ∈ [1 . . n]. We write sufnum(λ) for the suffix number of a leaf λ. The leaf-rank is the preorder rank (∈ [1 . . n]) of a leaf among the set of all ST leaves, denoted by leaf_rank(λ) for a leaf λ. For instance, the leftmost leaf in ST has leaf-rank 1, while the rightmost leaf has leaf-rank n. The function next_leaf(λ) returns the leaf whose suffix number is the suffix number of λ incremented by one, or the leaf with suffix number 1 if the suffix number of λ is n.
Reading the suffix numbers stored in the leaves of ST in leaf-rank order gives the suffix array [9]. We denote the suffix array and the inverse suffix array of T by SA and ISA, respectively. The array ISA is defined such that ISA[SA[i]] = i for every i = 1, . . . , n. The two arrays SA and ISA have the following relation with the two operations leaf_rank and sufnum on the ST leaves: sufnum(λ) = SA[leaf_rank(λ)] and leaf_rank(λ) = ISA[sufnum(λ)] for every leaf λ. [...] [12].
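As a small illustration of SA and ISA, the following sketch builds both arrays by plain sorting. The function names are hypothetical, the construction is far from linear time, and Python's character order is relied upon to make $ smaller than the letters; the arrays hold 1-based suffix numbers and ranks, while the Python lists themselves are indexed from 0.

```python
def suffix_array(text):
    """Suffix numbers (1-based) of text in lexicographic order of the suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def inverse_suffix_array(sa):
    """ISA with ISA[SA[i]] = i, using 1-based ranks and suffix numbers."""
    isa = [0] * len(sa)
    for rank, suffix in enumerate(sa, start=1):
        isa[suffix - 1] = rank
    return isa
```

For T = ababaaa$, reading `suffix_array(T)` from left to right corresponds to visiting the suffix tree leaves in leaf-rank order and reporting their suffix numbers.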
In this article, we focus on the following two suffix tree representations, which are an ensemble of some of the aforementioned data structures:
• The succinct suffix tree (SST), using (1 + ε)n lg n + O(n) bits of space ([4] [Section 2.2.3]) for a selectable constant ε > 0, contains, among others, a (1 + ε)n lg n-bit representation of SA and ISA with O(1/ε) access time for each array.
• The compressed suffix tree (CST), using O(n lg σ) bits of space [10,13], contains, among others, the Ψ-function.
Both suffix tree representations can be constructed in linear time within their final space requirements (asymptotically) when neglecting the space requirements of the read-only text T. They store the PLCP array and a succinct representation of the suffix tree topology such as a balanced parentheses (BP) [7] sequence. The BP sequence represents a rooted, unlabeled but ordered tree of n nodes by a bit vector of length 2n + o(n) bits. Since the suffix tree has at most 2n nodes, the BP representation of the ST topology uses at most 4n + o(n) bits. For example, the BP sequence of the suffix tree given in Figure 2 is [...] ())), where we label the start of an internal node and the center of a leaf '()' with the respective preorder number on top. The BP sequence can be conceptually constructed by performing a preorder traversal on the tree, writing an opening parenthesis when walking down an edge and writing a closing parenthesis when climbing up an edge. We augment the BP sequence of ST with auxiliary data structures [14] of o(n) bits to support queries such as parent(v) returning the parent of a node v, a level ancestor query level_anc(λ, d) returning the ancestor on depth d of the leaf λ, or leaf_rank(λ), all in constant time. Note that the depth of a node v, i.e., the number of edges from v to the root, is at most str_depth(v).

Figure 2. (Left) Suffix tree of the text T = ababaaa$ with the witness nodes and the corresponding leaves of the non-overlapping LZSS factorization highlighted in red and in green, respectively. We additionally mark the string ab with an implicit node whose string label is equal to the factor with Type 3. The nodes are labeled by their preorder numbers. The suffix number of each leaf λ is the underlined number drawn in dark yellow below λ. (Right) Non-overlapping LZSS factorization of T.
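The preorder construction of a BP sequence can be sketched as follows. The tree encoding (a dictionary of ordered child lists) and the function name `bp_sequence` are assumptions made for this illustration, and the sketch also writes a pair of parentheses for the root so that a tree of n nodes yields 2n parentheses.

```python
def bp_sequence(children, root):
    """Balanced-parentheses sequence of an ordered tree given as child lists."""
    out = []

    def visit(node):
        out.append("(")          # entering the node
        for child in children[node]:
            visit(child)
        out.append(")")          # leaving the node

    visit(root)
    return "".join(out)


# Hypothetical hand-built topology of the suffix tree of T = ababaaa$,
# with nodes named by their preorder numbers and children ordered by $ < a < b.
SUFFIX_TREE_TOPOLOGY = {
    1: [2, 3, 11], 2: [], 3: [4, 5, 8], 4: [], 5: [6, 7], 6: [], 7: [],
    8: [9, 10], 9: [], 10: [], 11: [12, 13], 12: [], 13: [],
}
```

In the resulting sequence, every leaf appears as an empty pair '()', so counting these pairs recovers the number of leaves.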
For our algorithms, we want to simulate a linear scan on the text from its beginning to its end by visiting the leaves in ascending order with respect to their suffix numbers (starting with the leaf with suffix number 1, and ending at the leaf with suffix number n). For that, we iteratively call next_leaf. We can compute next_leaf by first computing the leaf-rank of the succeeding leaf next_leaf(λ) of a leaf λ with leaf_rank(next_leaf(λ)) = Ψ[leaf_rank(λ)], and then selecting next_leaf(λ) by its leaf-rank; we can select a leaf by its leaf-rank in constant time due to the BP sequence representation of the suffix tree topology (the BP sequence can be augmented with a rank- and select-support for leaves represented by the empty parentheses '()'). Since we can simulate Ψ with SA and ISA, the SST needs O(1/ε) time for evaluating next_leaf.
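The simulation of Ψ with SA and ISA, and the resulting text-order scan over the leaves, can be sketched as follows. The function names are hypothetical, the suffix array is built by plain sorting, and the sketch works on leaf-ranks only instead of an actual suffix tree topology.

```python
def suffix_array(text):
    """Suffix numbers (1-based) of text in lexicographic order of the suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def psi_array(sa):
    """psi[r - 1] is the leaf-rank of the suffix following the suffix with leaf-rank r."""
    n = len(sa)
    isa = [0] * (n + 1)
    for rank, suffix in enumerate(sa, start=1):
        isa[suffix] = rank
    # suffix % n + 1 increments the suffix number and wraps from n back to 1
    return [isa[suffix % n + 1] for suffix in sa]


def text_order(sa):
    """Suffix numbers visited by following psi n times, starting at suffix number 1."""
    psi = psi_array(sa)
    rank = sa.index(1) + 1
    visited = []
    for _ in range(len(sa)):
        visited.append(sa[rank - 1])
        rank = psi[rank - 1]
    return visited
```

Each step of `text_order` plays the role of one next_leaf call: it maps the current leaf-rank to the leaf-rank of the leaf whose suffix number is larger by one.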
Finally, a factorization of T of size z partitions T into z substrings F_1 · · · F_z = T. Each such substring F_x is called a factor. In what follows, we deal with the non-overlapping LZSS factorization in Section 3, and subsequently (in Section 4) with the LZ78 factorization in the special context that we want to compute it on a substring of T after a preprocessing step.

Non-Overlapping LZSS
Let z_ov and z denote the number of factors of the overlapping LZSS factorization (i.e., the standard LZSS factorization supporting overlaps) and of the non-overlapping LZSS factorization, respectively. Kosolobov and Shur [15] showed that z_ov ≤ z ≤ z_ov · O(lg(n/(z_ov log_σ z_ov))). Although being inferior to the overlapping LZSS factorization with respect to the number of factors, the non-overlapping LZSS factorization is an important tool for finding approximate repetitions [16], periods [17], seeds [18], tandem repeats [19], and other regular structures (cf. the non-overlapping s-factorization in ([20] [Chpt. 8])).
Algorithms computing the non-overlapping LZSS factorization usually compute the longest previous non-overlapping factor table LPnF [...]. We are aware of the algorithms of Crochemore and Tischler [5] and Crochemore et al. [21] computing LPnF in linear time with a linear number of words. There are further practical optimizations [22–24] computing LPnF in linear time for constant alphabets. Finally, Ohlebusch and Weber [25] gave a linear-time conversion algorithm from the longest previous factor table LPF [26] to LPnF if the leftmost possible referred positions P[1 . . n] are provided. It seems possible that, instead of overwriting the LPF array with the LPnF array, we could run their algorithm on a 2n-bit succinct representation of the LPF array supporting sequential scans in constant time per entry ([27] [Corollary 5]) to produce an LPnF array representation within the same space due to the following lemma:

Lemma 1. [...]

Proof. Assume that LPnF[j − 1] > 0 (since LPnF[j] ≥ 0 trivially holds). According to the definition, there exists an occurrence [...]. The upper bound follows from the fact that a factor cannot protrude T to the right.

[...] . . n] in a unary bit sequence, we can linearly decode LPnF from this unary bit sequence because we know that LPnF[1] = 0. Since LPnF[n] + n ≤ n by the above lemma (in particular LPnF[i] ≤ LPF[i]), the sequence has at most 2n bits. Obviously, this sequence can be written sequentially from right to left in constant time per LPnF value in reverse order LPnF[n], . . . , LPnF[1] (the algorithm of Ohlebusch and Weber [25] computes LPnF in this order). It is therefore possible to compute LPnF within O(n) bits on top of P and a compressed indexing data structure such as the FM-index [28] of the text: For that purpose, Okanohara and Sadakane [29] proposed an algorithm computing LPF and P with the FM-index in O(n lg^3 n) time, which was improved by Prezza and Rosone [30] to O(n lg^2 n) time. However, the need for P, using n lg n bits when stored in a plain array, makes an approach that transforms LPF to LPnF after computing LPF and P rather unattractive. In what follows, we present a different way that directly computes the non-overlapping LZSS factorization or LPnF with near-linear or linear running time, without the need for P.
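The unary representation can be illustrated with a brute-force sketch. Since the exact encoding is not fully spelled out above, the sketch assumes one plausible scheme: for each i ∈ [2 . . n], the non-negative value LPnF[i] − LPnF[i − 1] + 1 is written as that many 1-bits followed by a 0-bit. The function names are hypothetical, and `lpnf` computes the table naively in cubic time.

```python
def lpnf(text):
    """LPnF table: longest prefix of text[i:] with an occurrence ending before i."""
    n = len(text)
    table = []
    for i in range(n):
        best = 0
        for j in range(i):
            k = 0
            # k < i - j enforces that the previous occurrence does not overlap
            while k < i - j and i + k < n and text[j + k] == text[i + k]:
                k += 1
            best = max(best, k)
        table.append(best)
    return table


def encode_unary(table):
    """Assumed scheme: (table[i] - table[i-1] + 1) one-bits, then a zero-bit."""
    bits = []
    for prev, cur in zip(table, table[1:]):
        bits.extend([1] * (cur - prev + 1))
        bits.append(0)
    return bits


def decode_unary(bits):
    """Sequentially rebuild the table, starting from the known first entry 0."""
    table = [0]
    run = 0
    for b in bits:
        if b == 1:
            run += 1
        else:
            table.append(table[-1] + run - 1)
            run = 0
    return table
```

The encoding is well defined because consecutive entries decrease by at most one, and under this scheme the bit sequence has length LPnF[n] + 2(n − 1).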

Our idea is an adaptation of the LZSS factorization introduced in ([4] [Section 3]).
To explain our approach, we first stipulate that T ends with a unique character $ that is smaller than all other characters appearing in T. Next, we distinguish between fresh and referencing factors. We say that a factor is fresh if it is the leftmost occurrence of a character. We call all other factors referencing. A referencing factor F_x has a reference pointing to the starting position of its longest previous occurrence (as a tie break, we always select the leftmost such position). We call this starting position the referred position of F_x. More precisely, the referred position of a factor F_x = T[i . . i + ℓ − 1] is the smallest text position j with j + ℓ ≤ i and T[j . . j + ℓ − 1] = T[i . . i + ℓ − 1]. Compared to the overlapping LZSS factorization, we require here the additional restriction that j + ℓ ≤ i. This makes the computation of the referred positions more technical: Let j be the referred position of a factor F := T[i . . i + ℓ − 1], and let S be the longest substring starting before i that is a prefix of T[i . .]. We associate the factor F with one of the following three types:
Type 1: T[j . . j + ℓ − 1] = S (the factor F coincides with the overlapping LZSS factor that would start at T[i . .]);
Type 2: T[j . . j + ℓ − 1] is shorter than S, but T[j + ℓ] ≠ T[i + ℓ] (then there is a suffix tree node that has the string label F); or
Type 3: T[j + ℓ] = T[i + ℓ] and j + ℓ = i (otherwise, the factor F could be extended to the right).
An example is T = a|b|ab|a|a|a|$ [...] . .] share more than ℓ characters such that F is not a string label of any suffix tree node in general, but it is at least a prefix of the string label of a node. This is the case for the third factor ab in the aforementioned example, as can be seen in Figure 2.
To find the referred positions, we mark certain nodes as witnesses, which create a connection between corresponding leaves and their referred positions. A leaf is called corresponding if its suffix number is the starting position of a factor. We say that the witness of a fresh factor is the root. For a referencing factor F, the witness of F is the highest node whose string label has F as a prefix; the witness of F determines the referred position of F, which is the smallest suffix number among all leaves in its subtree.
Despite this increased complexity compared to the overlapping LZSS factorization, the non-overlapping factorization can be computed with the suffix tree in O(n lg σ) time using O(n lg n) bits of space ( [31] [APL16]). Here, we adapt the algorithms of (Fischer et al. [4] [Section 3]) computing the overlapping LZSS factorization to compute the non-overlapping factorization by following the approach of Gusfield [31]. Our goal is to compute the coding of the factors, i.e., the referred position and the length of each factor (cf. Figure 1).

The Factorization Algorithm
All LZSS factorization algorithms of Fischer et al. ([4] [Section 3]) are divided into passes. A pass consists of visiting suffix tree leaves in text order (i.e., in order of their suffix numbers). On visiting a leaf, they conduct a leaf-to-root traversal. In what follows, we present our modification, which merely consists of a modification of Pass (a) in all LZSS factorization variants of ([4] [Section 3]): In Pass (a), Fischer et al. computed the factor lengths and the witnesses. To maintain the witnesses and lengths in future passes, they marked and stored the preorder numbers of the witnesses and the starting positions of the LZSS factors in two bit vectors B_W and B_T, respectively. In succeeding passes, they computed, based on the factor lengths and the witnesses, the referred positions and with that the final coding. Therefore, it suffices to only change Pass (a) according to our definition of witnesses and factors, while keeping the subsequent passes untouched. In this pass, we do the following:

Pass (a) Create B_W and B_T to determine the witnesses and the factor lengths, respectively.

The main technique of a pass in [4] is the leaf-to-root traversal. Here, we do the opposite: We traverse from the root to a specific leaf. We perform a root-to-leaf traversal by level ancestor queries such that visiting a node takes constant time. We perform these traversals only for the corresponding leaves since the other leaves are not useful for determining a factor.
Suppose we visit a leaf λ corresponding to a factor F. We already know the starting position of F (i.e., sufnum(λ)), but not its length, referred position, or witness w. To detect w, we use the following observation: Given that j_u is the smallest suffix number among all leaves in the subtree rooted at a node u, w is the highest node on the path from the root to λ that maximizes

ℓ_w := min(str_depth(w), sufnum(λ) − j_w). (1)

If j_w = sufnum(λ), then F is a fresh factor. Otherwise, w determines the length |F| = ℓ_w and the referred position j_w of F. Note that the two functions v → str_depth(v) and v → sufnum(λ) − j_v are strictly increasing and monotonically decreasing, respectively, when applied to each node v visited when walking downwards the path from the root to λ. Thus, our goal is to find the lowest node u, where the value ℓ_u of Equation (1) still results from str_depth(u), and not from the second argument sufnum(λ) − j_u. We give a sketch in Figure 3 and study a particular case in Figure 4 for factors of Types 2 and 3.
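The observation can be checked by brute force without a suffix tree. The following sketch reads the maximized quantity as min(str_depth(v), sufnum(λ) − j_v) and, as a simplifying assumption, evaluates it for every prefix of T[i . .], i.e., for all suffix trie nodes on the path to the leaf rather than only the suffix tree nodes. The function name is hypothetical.

```python
def factor_by_observation(text, i):
    """Return (length, referred position) for the factor starting at 1-based i.

    For every depth d, the node considered has the string label
    text[i-1 : i-1+d]; j is the leftmost starting position of that label.
    """
    best_len, best_pos = 0, i
    for d in range(1, len(text) - i + 2):
        label = text[i - 1:i - 1 + d]
        j = text.find(label) + 1           # leftmost occurrence, 1-based, j <= i
        value = min(d, i - j)              # string depth versus distance to j
        if value > best_len:               # strict: the highest maximizing node wins
            best_len, best_pos = value, j
    return best_len, best_pos
```

A returned length of 0 signals a fresh factor. For T = ababaaa$ and i = 3, the maximum is attained at depth 2 with j = 1, which corresponds to the factor ab marked as an implicit node in Figure 2.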
Figure 3. Determining the witness of a factor F whose starting position is the suffix number of the leaf λ. Straight arcs symbolize edges, while curly arcs symbolize paths that can visit multiple nodes (which are not visualized). Given that j_w is the smallest suffix number among all leaves in the subtree rooted at a node w, and that u is the lowest [...], then either u or its child v is the witness of F (see Section 3.2 for an explanation).

The idea behind detecting whether the two intervals are intersecting is that a factor starting at sufnum(λ) of length str_depth(u) would be of Type 1 or Type 2 with referred position j_u. In fact, if F is of Type 1, then its witness is the lowest ancestor of λ having a leaf with a suffix number smaller than sufnum(λ) in its subtree (this definition coincides with the witnesses of the overlapping LZSS factorization of ([4] [Section 2.3])). It is possible that j_u = j_v, i.e., the leaf with suffix number j_u is also in the subtree rooted at v. We can observe this case in Figure 4.

Figure 4. Special case of the setting considered in Figure 3 for factors of Types 2 and 3. Here, we assign u and v the same roles as in Figure 3, but we additionally assume that j_v = j_u and sufnum(λ) ∈ I [...]. If [...], as in the right figure, then the factor F of λ starting at sufnum(λ) is of Type 2, and the witness of F is u, although u is not the lowest ancestor of λ having λ and j_u in its subtree. If sufnum(λ) ∈ I \ {j_u + str_depth(u)}, then F is of Type 3 and the witness of F is v; the witness of F is v even if λ and the leaf with suffix number j_v are shared by a descendant of v as shown in the left figure.
To achieve our goal, let I_v := [j_v . . j_v + str_depth(v) − 1] and I_{λ,v} := [sufnum(λ) . . sufnum(λ) + str_depth(v) − 1] be the two text intervals covered by the occurrences of the string label of v starting at j_v and at sufnum(λ), respectively, with the former being its leftmost occurrence in T. We compute the values of j_v, I_v and I_{λ,v} for every node v on the path from the root to λ until reaching a node v such that the intervals I_v and I_{λ,v} overlap (cf. Line 9 in Algorithm 1). Let u be the parent of v. Then, the edge (u, v) determines the factor F: We consider the following two cases that determine whether F is a fresh or referencing factor, and whether the witness and the referred position of F are u and j_u, or v and j_v, respectively, in case F is a referencing factor:

• j_v = sufnum(λ), i.e., there is no leaf in v's subtree with a suffix number smaller than sufnum(λ).
- If u is the root, then there is no candidate for a referred position available, i.e., F is a fresh factor (cf. Line 13 in Algorithm 1).
- Otherwise, str_depth(u) > 0 and I_u ∩ I_{λ,u} = ∅ (since v is the highest node on the path from the root to λ for which I_v ∩ I_{λ,v} ≠ ∅ holds). Hence, the longest substring occurring before sufnum(λ) that is a prefix of T[sufnum(λ) . .] has an occurrence in T[1 . . sufnum(λ) − 1] (Type 1). One of those occurrences starts at position j_u. This means that the referred position is j_u, and the witness of F is u; the length of F is str_depth(u) (cf. Line 17 in Algorithm 1).

• j_v < sufnum(λ). If the factor F refers to the position j_v, then its length is the minimum of sufnum(λ) − j_v and the length of the LCP of the suffixes starting at j_v and sufnum(λ). (Note that this LCP can be longer than the string label of v.) Let us denote the value of this minimum by ℓ, which coincides with ℓ_v of Equation (1) due to str_depth(v) ≥ sufnum(λ) − j_v, and determines whether F refers to j_v or j_u (cf. Line 20 in Algorithm 1):
- If ℓ ≤ str_depth(u), then the referred position of F is actually the suffix number of a leaf contained in u's subtree (Type 2). In this case, the length of F is |I_u| = str_depth(u) because I_u ∩ I_{λ,u} = ∅. The witness of F is u, and j_u is the referred position (cf. Line 21 in Algorithm 1).
- Otherwise, ℓ > str_depth(u), hence F is not the string label of any suffix tree node (Type 3). The node v is the highest node whose string label has F as a prefix. We conclude that the witness, referred position, and length of F are v, j_v, and ℓ, respectively (cf. Line 23 in Algorithm 1).

Complexity Bounds
To determine the value of j_v, we need to answer a range minimum query (RMQ) on SA. Given an array A[1 . . n], [...] According to Lemma 2, we can construct an RMQ data structure in O(t_SA n) time using 2n + o(n) bits of space, where t_SA is the time for accessing SA. We can access SA in O(1/ε) time and in O(lg^ε n) time with the SST and CST, respectively, where the last time complexity is due to the following lemma:

Lemma 3 (Grossi and Vitter [11] [Section 3.2]). There is a data structure using O(ε^{−1} n) bits that can access SA in O(lg^ε n) time, where ε ∈ (0, 1] is a selectable constant.

As shown by Fischer et al. ([33] [Lemma 3]), the operation str_depth(u) for a node u can be computed with SA, LCP, and an RMQ data structure on LCP because the leaf-ranks of the leftmost leaf λ_L and rightmost leaf λ_R in the subtree rooted at u define the interval [leaf_rank(λ_L) + 1 . . leaf_rank(λ_R)] in SA, and selecting the minimum value in LCP within this interval gives the length of the longest common prefix shared among all leaves in u's subtree, which is str_depth(u). However, we do not store LCP explicitly, but instead simulate an access of its jth entry for j ∈ [2 . . n] by LCP[j] = PLCP[SA[j]]. Hence, we can access an entry of LCP in O(t_SA) time. Consequently, we can build the data structure of Lemma 2 on LCP in O(t_SA n) time, which takes 2n + o(n) bits of additional space. Equipped with this data structure, we can finally evaluate str_depth in O(t_SA) time.

The total time bounds are composed as follows: Since the number of visited nodes is at most the factor length of a corresponding leaf λ during a root-to-leaf traversal to λ, and ∑_{x=1}^{z} |F_x| = n, we conclude that the [...] For each root-to-leaf traversal to a leaf corresponding to a factor F, we stop at an edge (u, v) and [...]
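The two kinds of range minimum queries used above can be sketched as follows. The sketch substitutes a sparse table with O(n lg n) words for the succinct 2n + o(n)-bit RMQ data structure, builds SA and PLCP naively, and identifies a node only by the leaf-rank interval of its subtree; all names are hypothetical.

```python
def suffix_array(text):
    """Suffix numbers (1-based) of text in lexicographic order of the suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp_len(text, a, b):
    """Longest common prefix length of the suffixes starting at 1-based a and b."""
    k = 0
    while a + k <= len(text) and b + k <= len(text) and text[a + k - 1] == text[b + k - 1]:
        k += 1
    return k


class RangeMin:
    """Sparse table with constant-time queries (a stand-in for a succinct RMQ)."""

    def __init__(self, values):
        self.rows = [list(values)]
        width = 1
        while 2 * width <= len(values):
            prev = self.rows[-1]
            self.rows.append([min(prev[i], prev[i + width])
                              for i in range(len(values) - 2 * width + 1)])
            width *= 2

    def query(self, left, right):
        """Minimum of values[left..right], 0-based and inclusive."""
        level = (right - left + 1).bit_length() - 1
        row = self.rows[level]
        return min(row[left], row[right - (1 << level) + 1])


def build_queries(text):
    sa = suffix_array(text)
    plcp = [0] * (len(text) + 1)           # PLCP indexed by suffix number
    for r in range(1, len(sa)):
        plcp[sa[r]] = lcp_len(text, sa[r - 1], sa[r])
    lcp = [plcp[s] for s in sa]            # LCP[j] = PLCP[SA[j]]
    rmq_sa, rmq_lcp = RangeMin(sa), RangeMin(lcp)

    def smallest_suffix(l, r):
        """Smallest suffix number among the leaves with leaf-ranks l..r (1-based)."""
        return rmq_sa.query(l - 1, r - 1)

    def str_depth(l, r):
        """String depth of an internal node with leaf-ranks l..r (requires l < r)."""
        return rmq_lcp.query(l, r - 1)     # minimum of LCP[l + 1 .. r]

    return smallest_suffix, str_depth
```

For T = ababaaa$, the node with string label a spans the leaf-ranks 2 to 6, so `smallest_suffix(2, 6)` and `str_depth(2, 6)` report its smallest suffix number and its string depth.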

Storing the Factorization
From here on, we have two options: We can either directly output the referred positions and the lengths of the computed factors during Pass (a), or we can store additional information for retrieving the witnesses in a later pass. Such a later pass is interesting when working with the SST, as we can store the factors in the (1 + ε)n lg n + O(n) bits of working space ([4] [Section 3.3]). There, a later pass overwrites the space occupied by the SST, in particular the suffix array representation, such that later passes can no longer determine witnesses. Although we mark each witness in the bit vector B_W during Pass (a), there can be multiple nodes marked in B_W on the path from the root to a leaf corresponding to a factor F. The overlapping LZSS factorization obeys the invariant that the witness of F is the lowest ancestor of λ that is marked in B_V, given that B_V marks all ancestors of the leaves with a suffix number smaller than sufnum(λ) when conducting a leaf-to-root traversal at λ during the overlapping LZSS computation ([4] [Section 3]). Due to the existence of factors of Types 2 and 3, this invariant does not hold for the non-overlapping factorization.
For the later passes, we want a data structure that finds the witness w of a factor F based on F's starting position in constant time. Fortunately, w is determined by the leaf λ corresponding to F and w's depth due to w = level_anc(λ, depth(w)). To remember the depth of each witness, we maintain a bit vector B_L that stores the depth of each witness in unary coding sorted by the suffix number of the respective corresponding leaf. Given that we find the witness w of a leaf λ in Pass (a) during the traversal from the root to λ, we store the unary code 0^d 1 in B_L, where d := depth(w). For a leaf corresponding to a fresh factor, we store the unary code 1 in B_L. Similar to B_D in ([4] [Sect. 3.4.3 Pass (2)]), we do not need to add a select-support to B_L, since we always process the corresponding leaves sequentially in text order. Given a corresponding leaf λ, we can jump to its witness (or to the root if λ corresponds to a fresh factor) with a level ancestor query from λ with the depth B_L.select_1(sufnum(λ) + 1) − B_L.select_1(sufnum(λ)) − 1. The length of B_L is at most n + z since the depth of a witness is bounded by the length of its corresponding factor and the sum of all factor lengths is n.
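The unary storage of witness depths and their sequential retrieval can be sketched as follows. The function names are hypothetical, the depths are listed per factor in text order, and the example depths in the usage note are hand-derived from the suffix tree of T = ababaaa$ and should be treated as illustrative.

```python
def encode_depths(depths):
    """Concatenate the unary codes 0^d 1 of the given witness depths."""
    bits = []
    for d in depths:
        bits.extend([0] * d)
        bits.append(1)
    return bits


def read_depths(bits):
    """Yield the stored depths in order with a single left-to-right scan."""
    run = 0
    for b in bits:
        if b == 0:
            run += 1
        else:
            yield run      # a 1-bit terminates the current unary code
            run = 0
```

For the factorization a|b|ab|a|a|a|$ with the assumed witness depths 0, 0, 2, 1, 1, 1, 0, the bit vector has 12 bits, which is within the bound n + z = 15; a fresh factor contributes the single bit 1.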

Computing LPnF
Finally, we can compute LPnF with the same algorithm by visiting all leaves (i.e., not only the corresponding ones). However, we can no longer charge the visited nodes during a root-to-leaf traversal with the length of a factor as in Section 3.3 (a). In fact, such an algorithm may visit O(n^2) nodes since ∑_{i=1}^{n} LPnF[i] = O(n^2) (and this sum is Θ(n^2) for the string T = a · · · a). To reduce the number of nodes to visit, we can make use of Lemma 1: having LPnF[1 . . i − 1] computed, we know that LPnF[i] ≥ LPnF[i − 1] − 1; hence, it suffices to start the root-to-leaf traversal at the lowest node w̃ whose string depth is at most LPnF[i − 1] − 1. We find this node w̃ by a suffix link. A suffix link connects a node with string label S ∈ Σ^+ to the node with string label S[2 . .] or to the root node if S ∈ Σ^1. All nodes except the root have a suffix link. However, we do not store suffix links as pointers explicitly, but simulate them with the leaves since we can compute the suffix link of a leaf λ with next_leaf(λ): Suppose that we have processed the leaf λ with suffix number i − 1 for computing LPnF[i − 1]. In what follows, we first assume that the computed factor starting at i − 1 is not of Type 3. Then, the witness of λ is λ's ancestor u with str_depth(u) = LPnF[i − 1] being the computed factor length. First, we select another leaf λ′ of the subtree rooted at u such that the lowest common ancestor (LCA) of λ and λ′ is u (e.g., we can select the leftmost or rightmost leaf in u's subtree). Then, λ̃ := next_leaf(λ) is the leaf with suffix number i, and the LCA ũ of λ̃ and next_leaf(λ′) is the node on the path from the root to λ̃ with str_depth(ũ) = str_depth(u) − 1.

It is left to deal with the text positions i − 1 for which we computed a factor of Type 3. Here, the leaf λ has a witness v with LPnF[i − 1] < str_depth(v), i.e., the computed factor is implicitly represented on the edge from u := parent(v) to v. We apply the same technique (i.e., taking the suffix link) as for the other types, but apply this technique on u instead of the witness v, such that we end up at a node ũ with str_depth(ũ) = str_depth(u) − 1. We sketch the setting in Figure 5. Now, we additionally need to walk down from ũ towards λ̃ = next_leaf(λ) to reach the lowest node ṽ with str_depth(ṽ) ≤ LPnF[i − 1] − 1. There can be at most c(u, v) nodes on the path from ũ to ṽ. We can refine this number to at most δ := LPnF[i − 1] − str_depth(u). We have δ > 0 if and only if the factor starting at text position i − 1 is of Type 3. In that case, we additionally walk down from ũ towards λ̃ to find the lowest node ṽ with str_depth(ṽ) ≤ LPnF[i − 1] − 1 = str_depth(u) + δ − 1. While u and v are directly connected with an edge, the path from ũ to ṽ may contain multiple edges.
To retain our claimed time complexity, we switch from counting nodes to counting characters and use the following charging argument: We charge each edge e we traversed by c(e) characters, or δ_e characters if we only traversed δ_e ≤ c(e) characters on an edge. With the above analysis, we again obtain O(n t_SA) time for the algorithm computing the non-overlapping factorization (as well as the non-Type 3 LPnF values) by spending O(t_SA) time for each charged character (instead of each visited node).
Let us reconsider that the factor of LPnF[i − 1] is of Type 3, where we charge the last edge (u, v) for computing LPnF[i − 1] with δ characters. Here, we observe that we actually spend only O(t_SA) time for processing this edge. Hence, we have δ − 1 characters as a credit left, which we can spend on traversing O(δ) descendants of ũ. If the factor starting at i is again of Type 3, we add the remaining credit to the newly gained credit, and recurse.
Regarding Section 3.3 (b), computing the length of the LCP of T[j_v . .] and T[sufnum(λ) . .] naïvely results again in O(n^2) overall running time since we need to compute these lengths for all n positions. Here, instead of computing the length of such an LCP naïvely, we determine it by computing str_depth(w) of the LCA w of λ and the leaf λ′ with suffix number j_v in O(t_SA) time. We find λ′ with the RMQ data structure on SA, which actually reports the leaf-rank instead of the suffix number j_v; we obtain the latter by accessing SA. Altogether, we obtain the same time and space bounds as for computing the non-overlapping LZSS factorization:

Theorem 3. We can compute the 2n-bit representation of LPnF within the same time and space as the non-overlapping LZSS factorization described in Theorem 1.

Substring Compression Query Problem
The substring compression query problem [6] is to find the compressed representation of T[I], given a query interval I ⊂ [1 . . n]. Cormode and Muthukrishnan [6]  The main idea of tackling the problem for LZSS (and similarly for the classic LZ77 factorization) is to use a data structure answering interval LCP queries, which are usually answered by two-dimensional range successor/predecessor data structures. Most recently, Matsuda et al. [35] proposed a data structure answering an interval LCP query in O(n^ε) time while taking O(ε^−1 n (H_0(T) + 1)) bits of space, where H_0 denotes the zeroth order empirical entropy. Therefore, they could implicitly answer a substring compression query in O(z_SS[I] n^ε) time within compressed space. Recently, Bille et al. [36] proposed data structures storing the LZSS-compressed suffixes of T for answering a pattern matching query of an LZSS-compressed pattern P without decompressing P. Their proposed data structures also seem to be capable of answering substring compression queries.
As a warm-up for the more involved techniques for the LZ78 factorization below (cf. Section 4.5), we show that the techniques studied for the non-overlapping LZSS factorization in Section 3 can be adapted to the substring compression query problem under the restriction that the query interval starts at text position 1 (meaning that we query for prefixes instead of arbitrary substrings). Given an interval I = [1 . . p] for a text position p ∈ [1 . . n], the algorithm of Theorem 1 achieves O(p t_SA) time, where t_SA is the time to access SA. We can improve the running time by replacing the linear scan on Line 10 of Algorithm 1 with an exponential search [37]: As long as the condition on Line 9 is true (the condition for walking downwards), we do not increment the depth d by one, but instead double d. When the condition on Line 9 becomes false, we may have overshot the desired depth (we want the first d for which the condition on Line 9 becomes false). Thus, we additionally need to backtrack by performing a binary search on the interval [d/2 . . d]. If we perform this search for computing a factor of length ℓ, then we double d at most lg ℓ times, and visit O(lg ℓ) depths during the binary search (see also Figure 6 for a visualization). In total, we obtain O(z_SS[1..p] t_SA lg ℓ) time, where ℓ is the length of the longest non-overlapping LZSS factor (here, z_SS[1..p] denotes the number of computed non-overlapping factors). Note that the result is not particularly interesting since we can just store the whole factorization of T[1 . . n], scan for the leftmost factor F_x that ends at p or after, trim F_x's length to end at p, and finally return the factors F_1, . . . , F_x.

To generalize this algorithm for an interval I with b(I) > 1, we need to change the definition of j_v for a node v in Section 3.2 to be the smallest suffix number of at least b(I) among the leaves in the subtree rooted at v.
However, this additional complexity makes the approach of selecting j_v with an RMQ on SA infeasible and leads us back to the interval LCP query problem.

Figure 6. Exponential search on a root-to-leaf path for the first node that does not meet a specific condition. In the setting of the non-overlapping LZSS factorization of Section 4 as well as in the LZ78 factorization of Section 4.5, the path from the root to a leaf λ contains a sub-path π including the root whose contained nodes all share a common property (for LZSS they meet the condition on Line 10 of Algorithm 1, while for LZ78 they are edge witnesses marked in the bit vector B_E). We symbolize the path from the root to λ as an array, where each node is represented by its depth. The sub-path π is visualized by the shaded entries. Here, the leaf λ has depth 18, and we want to find the first unshaded node on depth 15. The exponential search and the subsequent binary search in the range [8 . . 16] are conducted by following the edges below and above the path array, respectively.
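The exponential search followed by a binary search can be sketched as follows. This is a minimal illustration, not the implementation of Algorithm 1: the function name `first_false` and the predicate `pred` are hypothetical, depths are assumed to range over [0 . . m], and `pred` is assumed to be monotone (true on a prefix of depths, false afterwards), as is the case for the shaded sub-path π of Figure 6.

```python
def first_false(pred, m):
    """Return the smallest depth d in [0..m] with pred(d) false, or m + 1 if none.

    Assumption: pred is true on a (possibly empty) prefix of depths and
    false on all deeper depths.
    """
    if not pred(0):
        return 0
    # Exponential phase: double d as long as the condition still holds.
    lo, d = 0, 1                 # invariant: pred(lo) is true
    while d <= m and pred(d):
        lo, d = d, 2 * d
    hi = min(d, m + 1)           # pred(hi) is false, or hi is past the leaf
    # Binary phase: narrow (lo, hi] down to the first failing depth.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if pred(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Setting of Figure 6: the leaf has depth 18, depths below 15 are shaded.
print(first_false(lambda d: d < 15, 18))
```

With these inputs, the doubling phase probes the depths 1, 2, 4, 8, and 16, and the binary phase then works inside the range [8 . . 16], which mirrors the figure. The number of evaluated depths is logarithmic in the returned depth rather than in the path length.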

Related Substring Compression Query Problems
As far as the author is aware, the substring compression query problem has only been studied for LZSS. However, Lifshits [38] mentioned that it is also feasible to think about the substring compression query problem in the context of straight-line programs (SLPs): Given an SLP of size g representing T, we can construct an SLP of size O(g) for T[I].

Here, we consider answering substring compression queries with the LZ78 factorization (which is actually also an SLP [42] (Section VI.A.1)), i.e., the goal is to compress the substring T[I] with LZ78. Let z_78[I] denote the number of LZ78 factors of the string T[I]. When the text is given as an SLP of size g, we can first transform this SLP into an SLP of T[I] in O(g) time, and then apply the algorithm of Bannai et al. [43] on this SLP to compute the LZ78 factorization in O(g + z_78[I] lg z_78[I]) time. Let us consider from now on that T is given in its plain form as a string with n lg σ bits. A possible way is to first apply a solution for computing an LZ77 substring compression query, then transform the LZ77-compressed substring into an SLP of size O(z_SS[I] lg |I|) in O(z_SS[I] lg |I|) time by a transformation due to Rytter [44], and finally apply the aforementioned algorithm of Bannai et al. [43]. The fastest LZ78 factorization algorithms [4,45] can answer an LZ78 substring compression query in O(|I|) time, independently of the alphabet size. For small alphabet sizes, the running time O(|I| (lg lg |I|)^2 / (log_σ |I| · lg lg lg |I|)) of the LZ78 factorization algorithm of Jansson et al. [46] becomes even sub-linear in |I|. However, for large I and a compressible text T, these approaches are rather slow compared to the solutions for LZSS mentioned above, whose running times are bounded by the number of computed factors and a logarithmic multiplicative factor on the text length.
To obtain similar bounds for LZ78, we could adapt the approach of Bille et al. [36] to preprocess the LZ78 factorization of all suffixes of T, but that would give us a data structure with super-linear preprocessing time (and possibly super-linear space). Here, we borrow the idea from Nakashima et al. [45] to superimpose the suffix tree with the LZ78 trie, and use a data structure for answering nearest marked ancestor queries to find the lowest marked suffix tree node on the path from the root to a leaf. This data structure [47] takes O(n lg n) bits of space, and can answer a nearest marked ancestor query in O(1) amortized time. We are unaware of improvements for this type of query, even though only fringe marked ancestor queries need to be answered, a notion coined by Breslauer and Italiano [48], which is a special case of nearest marked ancestor queries: in the fringe marked ancestor query problem, the root of a tree (here: the suffix tree) is already marked, and we can only mark the children of an already marked node. In what follows, we formally define the LZ78 factorization, and then propose approaches for the LZ78 substring compression query problem based on different suffix tree representations.

LZ78 Factorization
Stipulating that F_0 is the empty string, a factorization F_1 · · · F_z = T is called the LZ78 factorization [2] of T if and only if, for all x ∈ [1 . . z], the factor F_x is the longest prefix of T[|F_1 · · · F_{x−1}| + 1 . .] with F_x = F_y c for some y ∈ [0 . . x − 1] and c ∈ Σ; that is, F_x is the longest possible previous factor F_y extended by the following character T[|F_1 · · · F_x|] of the text. We say that y is the referred index of the factor F_x. A factor is thus determined by its referred index and its last character, which lets us encode the factors in a list of (integer, character)-pairs, as shown in the example of Figure 1, where we simplify the coding of factors with referred index 0 to plain characters (to ease the comparison with the LZSS variants). Figure 7 gives another visualization of the same example with the LZ trie, which represents each factor as a node (the root represents the factor F_0). The node representing the factor F_y has a child representing the factor F_x connected with an edge labeled by a character c ∈ Σ if and only if F_x = F_y c.

An observation of Nakashima et al. [45] (Section 3) is that the LZ trie is a connected subgraph of the suffix trie containing its root. We can therefore simulate the LZ trie by marking nodes in the suffix trie. Since the suffix trie has O(n^2) nodes, we use the suffix tree ST instead of the suffix trie to save space. In ST, however, not every LZ trie node is represented; these implicit LZ trie nodes are on the ST edges between two ST nodes (cf. Figure 8). Since the LZ trie is a connected subgraph of the suffix trie sharing the root node, implicit LZ trie nodes on the same ST edge have the property that they are all consecutive and that they start at the first character of the edge. To represent them, it thus suffices to augment an ST edge with a counter counting the number of its implicit LZ trie nodes. We call this counter an exploration counter, and we write n_v ∈ [0 . . c(e)] for the exploration counter of an edge e = (u, v), which is stored in the lower node v that e connects to. We say that n_v is full if n_v = c(parent(v), v), meaning that v is an explicit LZ78 trie node. Additionally, we call an ST node v an edge witness if n_v becomes incremented during the factorization. We further stipulate that the root of ST is an edge witness whose exploration counter is always full. Then, all edge witnesses form a connected subgraph of ST sharing the root node. We give an example in Figure 9.
However, since we do not know the shape of the LZ trie in advance, we also do not know which nodes will become edge witnesses. For the time being, we augment each node with an exploration counter, spending O(n lg n) bits in total. As in Section 3, we assume that our text T has length n and ends with a special symbol $ smaller than all other characters appearing in T.

Figure 7. The LZ78 factorization and its LZ trie for the text T = aaababaaaba. The xth factor is the concatenation of the edge labels of the path from the root to the node labeled with x. Its referred index is the label of its parent.

Figure 8. (Left) ST with the LZ trie (cf. Figure 7) computed on T = aaababaaaba$. Blue ST nodes represent the explicit LZ trie nodes, i.e., those nodes that are present in ST. Implicit LZ trie nodes are represented by the small rounded nodes. The edge witnesses are the nodes with the preorder numbers 3, 5, 6, 12, and 16. (Right) cpST of T described in Section 4.8. The label of a node is the list of preorder numbers of the nodes in its respective heavy path. For instance, the heavy path from the root contains the nodes with the preorder numbers 1, 3, 5, and 6.

Figure 9. Excerpt of the suffix tree depicting three edge witnesses. Implicit trie nodes are represented by small rounded nodes, which are shaded if they are LZ trie nodes. The explicit LZ trie nodes u and w are shaded in blue. According to the figure, n_v = 1 and n_w = 2. In particular, the exploration counters of u and w are full.
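To make the definition of the LZ78 factorization and its coding as (referred index, character)-pairs concrete, the following is a minimal sketch with an explicit LZ trie stored in a hash map. It is the textbook trie-based approach and not the suffix-tree-based algorithm developed in this article; the function names and the convention of encoding a trailing factor without a new character as `(y, "")` are assumptions made for illustration.

```python
def lz78_factorize(text):
    """Return the LZ78 coding of text as a list of (referred index, character)."""
    trie = {}        # (index of parent factor, character) -> factor index
    factors = []     # factors[x - 1] is the coding of F_x
    node = 0         # current LZ trie node; 0 is the root (empty factor F_0)
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                     # the match F_y can be extended
        else:
            factors.append((node, ch))       # new factor F_x = F_y ch
            trie[(node, ch)] = len(factors)
            node = 0
    if node != 0:
        # The text ended inside the trie: the last factor equals F_node.
        factors.append((node, ""))
    return factors


def lz78_decode(factors):
    """Inverse of lz78_factorize: rebuild the text from its coding."""
    strings = [""]                           # strings[y] is the factor F_y
    for y, ch in factors:
        strings.append(strings[y] + ch)
    return "".join(strings[1:])


print(lz78_factorize("aaababaaaba"))
# [(0, 'a'), (1, 'a'), (0, 'b'), (1, 'b'), (2, 'a'), (3, 'a')]
```

For the example text T = aaababaaaba, this corresponds to the factors a, aa, b, ab, aaa, and ba. If the text ends with a unique symbol $, the trailing case without a new character cannot occur.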

Linear-Time Computation
Now, we can give our first result of Theorem 2 on the LZ78 substring compression query problem by a simple modification of the LZ78 factorization algorithm presented by Nakashima et al. [45]. This algorithm uses a pointer-based suffix tree, which is augmented by a nearest marked ancestor data structure [47], using altogether O(n lg n) bits of space.
The algorithm works as follows: Suppose that we have computed the factors F_1 · · · F_{x−1} and now want to compute F_x. Since F_x is a prefix of the suffix T[p . .] with p = |F_1 · · · F_{x−1}| + 1, F_x is a prefix of the concatenation of edge labels on the path π from the root to the leaf with suffix number p in the suffix tree. The additional requirement that F_x, excluding its last character, has to coincide with a preceding factor F_y means that F_y = F_x[1 . . |F_x| − 1] is the string label of the lowest LZ trie node on π; this LZ trie node is represented either
• explicitly as an ST node w being the lowest edge witness on π; or
• implicitly by the exploration counter of w.
In either case, w is the edge witness of F_y and determines its length |F_y| = str_depth(parent(w)) + n_w. We create an LZ trie node representing F_x as follows:
• If n_w is not full, we make w the edge witness of F_x, and increment n_w by one.
• Otherwise (n_w is full), we make the child w′ of w on the path π the edge witness of F_x, and set n_w′ ← 1.
It is left to find w, which we can do by traversing π from the root until reaching an edge e = (u, v) whose exploration counter n_v is less than the length c(e) of its label, where either u or v is w. However, a linear scan of π for finding w would result in O(z) time per factor. Here, the fringe marked ancestor queries come into the picture, which allow us to find a lowest edge witness in amortized constant time: by marking all edge witnesses, querying the lowest marked ancestor of the leaf with suffix number p yields either u or v. This gives us O(1) amortized time per LZ78 factor, and concludes the LZ78 factorization algorithm of Nakashima et al. [45] (Theorem 3).
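The two cases for creating an LZ trie node can be sketched on a single root-to-leaf path in isolation. This is a simplified illustration under stated assumptions and not the algorithm of Nakashima et al.: the path is given by two hypothetical lists, `edge_len[i]` for the label length c(e) of the edge entering the ith node below the root and `counter[i]` for its exploration counter, and the lowest edge witness is found by the linear scan that the marked ancestor query replaces.

```python
def add_lz_trie_node(edge_len, counter):
    """Create one LZ trie node on a root-to-leaf path; return the factor length.

    edge_len[i]: c(e) of the edge entering the i-th node below the root
    counter[i]:  exploration counter n_v of that node (updated in place)
    """
    # Linear scan for the lowest edge witness w; -1 stands for the root,
    # whose exploration counter is considered full.
    w = -1
    while w + 1 < len(counter) and counter[w + 1] > 0:
        w += 1
    if w >= 0 and counter[w] < edge_len[w]:
        # n_w is not full: w stays the edge witness, one more implicit node.
        counter[w] += 1
        return sum(edge_len[:w]) + counter[w]   # str_depth(parent(w)) + n_w
    if w + 1 == len(counter):
        raise ValueError("the path is already fully explored")
    # n_w is full: the child of w on the path becomes the edge witness.
    counter[w + 1] = 1
    return sum(edge_len[:w + 1]) + 1            # str_depth(w) + 1


edge_len = [2, 3, 1]
counter = [0, 0, 0]
print([add_lz_trie_node(edge_len, counter) for _ in range(3)], counter)
```

In an actual suffix tree, the counters are shared by all root-to-leaf paths passing through a node, so consecutive factors usually update counters on different paths; the sketch only shows the local update rule.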

Outline
In what follows, we want to study variants of this algorithm that use more lightweight data structures at the expense of additional running time. All LZ78 factorization algorithms presented here stick to the following general framework, which we call a pass: For each leaf λ whose suffix number is the starting position of a factor F, locate the lowest edge witness w on the path from the root to λ and create a new LZ trie node by incrementing either n_w or the exploration counter of its child on the path towards λ as described in Section 4.3. Since w determines the length of the factor F, we know the suffix number of the leaf that starts with the next factor.
After a pass, we know the LZ trie topology due to the exploration counters. In a subsequent pass (Section 4.7), we use this knowledge to associate an edge witness w with the index of the most recent factor having w as its edge witness such that we can identify the referred indices with this association. However, before that, we reduce the space (Section 4.5) and subsequently show how to perform a pass within the reduced working space (Section 4.6). Finally, we accelerate a root-to-leaf traversal for long factors in Section 4.8.

Space-Efficient Computation
In what follows, we give trade-offs for less space but slightly larger time bounds by using SST and CST. To get below O(n lg n) bits of space, we need to get rid of: (a) the O(n lg n)-bits marked ancestor data structure; and (b) the O(n lg n) bits for the exploration counters. For the latter (b), Fischer et al. [4] (Section 4.1) presented a data structure representing the exploration counters within O(n) bits on top of the suffix tree. For the former (a), we use level ancestor queries to simulate a fringe marked ancestor query: to this end, we mark all edge witnesses in a bit vector B_E of length 2n such that B_E[j] = 1 if and only if the ST node with preorder rank j is an edge witness (remember that the number of nodes in ST is at most 2n). Suppose now that we want to compute the factor F_x. For that, we visit the leaf λ with suffix number b(I) + |F_1 · · · F_{x−1}|. As in Section 4.3, we want to find the lowest edge witness on the path from the root to λ, which we find with a fringe marked ancestor query. Here, we answer this query by scanning the path from the root to λ until reaching the lowest marked node in B_E. We can traverse linearly from the root to this node by querying d → level_anc(λ, d) for each depth d ≥ 0. However, we then visit O(|F_x|) nodes for computing the factor F_x, or O(|I|) nodes in total. To improve this bound, we can again apply exponential search (cf. Figure 6). To see why that can be done, let (v_0, . . . , v_m) be the path from the root v_0 to λ = v_m. If each node v_d (for each depth d ∈ [0 . . m]) is represented by its preorder number, then B_E[v_0] · · · B_E[v_m] = 1^{k+1} 0^{m−k} if the lowest edge witness has depth k, which is the smallest k ≤ |F_x| such that str_depth(level_anc(λ, k)) ≥ |F_x| − 1. Although we do not know k in advance, we can find the rightmost '1' in this bit sequence with exponential search.

Navigation in Small Space
To complete our algorithm for SST and CST, it is left to study how to access the leaves when issuing the level ancestor queries. While the LZ78 factorization algorithms of Fischer et al. [4] used the fact that they can scan the leaves linearly in suffix number order to simulate the scan of the text in text order within their O(n) time budget, we want to accelerate this algorithm by visiting only the leaves whose suffix numbers match the starting positions of the factors. With the SST, we can select the leaf λ with suffix number i ∈ [1 . . n] in O(1/ε) time since we have access to ISA[i] returning the leaf-rank of λ.
With the CST, we can visit the leaf with the subsequent suffix number with next_leaf in constant time, but may need O(n) time to visit an arbitrary leaf. Here, the idea is to store a sampling of ISA within O(n lg σ) bits of space during a precomputation step. We can produce the values of this sampling by iterating over next_leaf such that we obtain an array that stores in its ith entry the leaf-rank of the leaf with suffix number i · log_σ n.
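The sampling idea can be sketched as follows. The sketch is illustrative only: it builds a plain suffix array by sorting and tabulates next_leaf explicitly, whereas a CST would provide next_leaf without storing SA or ISA. All function names are hypothetical, suffix numbers are 0-based here (the text uses 1-based positions), and the sampling rate `step` is a free parameter standing in for log_σ n. Reaching an arbitrary leaf then takes one sample lookup plus fewer than `step` calls of next_leaf.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes; for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def next_leaf_table(sa):
    """nxt[r] = leaf-rank of the suffix following the one with leaf-rank r."""
    n = len(sa)
    isa = [0] * n
    for rank, suffix in enumerate(sa):
        isa[suffix] = rank
    return [isa[s + 1] if s + 1 < n else -1 for s in sa]


def build_isa_sample(sa, nxt, step):
    """sample[k] = leaf-rank of the leaf with suffix number k * step."""
    rank = sa.index(0)            # leaf of the whole text
    sample = []
    for i in range(len(sa)):      # visit leaves in suffix number order
        if i % step == 0:
            sample.append(rank)
        rank = nxt[rank]
    return sample


def leaf_rank(i, sample, nxt, step):
    """Leaf-rank of suffix number i: one sample lookup, then next_leaf steps."""
    rank = sample[i // step]
    for _ in range(i % step):
        rank = nxt[rank]
    return rank


text = "aaababaaaba$"
sa = suffix_array(text)
nxt = next_leaf_table(sa)
sample = build_isa_sample(sa, nxt, 3)
print(sa[leaf_rank(7, sample, nxt, 3)])  # the suffix number found for i = 7
```

The choice of `step` trades space for time: a larger value shrinks the sample array but lengthens the walk over next_leaf.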

LZ78 Coding
Finally, to obtain the LZ78 coding, we need to compute the referred indices. In a classic LZ78 trie, we would augment each trie node with the index of its corresponding factor. Here, we additionally need a trick for the implicitly represented LZ trie nodes: For them, we can now leverage the edge witnesses by augmenting each of them with the factor index of the currently lowest LZ trie node created on its ingoing edge. Fortunately, we know all nodes that become edge witnesses thanks to B_E (cf. Section 4.5) marking the preorder numbers of all edge witnesses. We now enhance B_E with a rank-support such that we can give each edge witness a rank within [1 . . z_78[I]]. Therefore, we can maintain the most recent factor indices corresponding to each edge witness in an array W of z_78[I] lg z_78[I] bits. We again conduct a pass as described in Section 4.4, but this time we use W to write out the referred indices (see [4] (Section 4.2.1, Pass (b)) for a detailed description on how to read the referred indices from W). By doing so, we finally obtain Theorem 2. For an overview, we present the obtained complexity bounds in Table 1.

Centroid-Path Decomposed Suffix Tree
If the length ℓ ≤ z_78[I] of the longest factor is so large that lg ℓ = ω(lg lg n), then we can speed up the exponential search of Section 4.5 by searching in the centroid-path decomposed suffix tree cpST. The centroid path decomposition [49] of the suffix tree is defined as follows: For each internal node, we call its child whose subtree is the largest among all its siblings (ties are broken arbitrarily if there are multiple such children) a heavy node, while we call all other children light nodes. Additionally, we make the root and all leaves light nodes (here we differ from the standard definition because we need a one-to-one relationship between leaves in the original tree and in the path-decomposed one). A heavy path is a path from a light node u to the parent of a leaf containing, except for u, only heavy nodes. There is a one-to-one relationship between light nodes and heavy paths. Since heavy paths do not overlap, we can contract all heavy paths to single nodes and thus form cpST (see [49] (Section 4.2) for details and Figure 8 for an example). The centroid path decomposition is helpful because the number of light nodes on a path from the root to a leaf is O(lg n). This can be seen by the fact that the subtree size of a light node is at most the subtree size of its heavy sibling, and hence less than half of the subtree size of its parent; thus, when visiting a light node during a top-down traversal in ST, we at least halve the number of ST nodes we can visit from then on. Consequently, a root-to-leaf path in cpST has O(lg n) nodes.
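The following sketch marks the light nodes of an arbitrary rooted tree under the convention above (root and all leaves are light) and counts the light nodes on the deepest root-to-leaf path in terms of light nodes. It operates on a tree given as hypothetical child lists rather than on a suffix tree, and the function names are chosen for illustration. Since each light node corresponds to one heavy path, the size of the returned set equals the number of cpST nodes.

```python
def light_nodes(children, root=0):
    """Return the set of light nodes; children[v] lists the children of v."""
    order = []                        # every parent precedes its children
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):         # accumulate subtree sizes bottom-up
        for c in children[v]:
            size[v] += size[c]
    light = {root}                    # the root is light by convention
    for v in order:
        if not children[v]:
            light.add(v)              # all leaves are light by convention
            continue
        heavy = max(children[v], key=lambda c: size[c])   # ties: first child
        light.update(c for c in children[v] if c != heavy)
    return light


def max_light_on_path(children, root=0):
    """Largest number of light nodes on any root-to-leaf path."""
    light = light_nodes(children, root)
    best = 0
    stack = [(root, 1)]               # the root itself counts as light
    while stack:
        v, count = stack.pop()
        if not children[v]:
            best = max(best, count)
        for c in children[v]:
            stack.append((c, count + (c in light)))
    return best


# Complete binary tree with 15 nodes: node i has the children 2i+1 and 2i+2.
tree = [[2 * i + 1, 2 * i + 2] if i < 7 else [] for i in range(15)]
print(len(light_nodes(tree)), max_light_on_path(tree))  # 12 4
```

For a tree with n nodes, the count is bounded by ⌊lg n⌋ + 2 under this convention: at most ⌊lg n⌋ light nodes below the root by the halving argument, plus the root and possibly one leaf that was made light only by convention.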
For that to be of use, we need a connection between ST and cpST: observe that the number of leaves and their respective order is the same in both trees, such that we can map leaves by their leaf-ranks in constant time. If we mark the light nodes in the suffix tree in a bit vector B_L, then the rank of a light node v in B_L is the preorder number of the node in cpST representing the heavy path whose highest node is v. To stay within our space budget, we represent the tree topology of cpST with a BP sequence (which we briefly introduced in Section 2). First, we mark all light nodes in B_L by an Euler tour, where we query the ST topology for the subtree size rooted at an arbitrary node in constant time. Next, we perform a depth-first search traversal on the suffix tree while producing the BP sequence of cpST. For that, we use a stack to store the light node ancestors of the currently visited node. Since a node has O(lg n) light nodes as ancestors, the stack uses O(lg^2 n) bits of space. Finally, we endow B_L with a select-support such that we can map a node of cpST to its corresponding light node in ST.
Our algorithm conducting a pass works as follows: Suppose that we visit the leaf λ with suffix number b(I) + |F_1| + · · · + |F_{x−1}|. This time, we map λ to the leaf λ′ of cpST having the same leaf-rank as λ in ST. Next, we apply the exponential search with d → level_anc(λ′, d) on cpST to obtain a cpST node representing the heavy path whose highest node is a light node v, i.e., v is the lowest light node on the ST path from the root to the leaf λ that is an edge witness. Since a root-to-leaf path in ST has O(lg n) light nodes, we spend O(min(lg |F_x|, lg lg n)) time to find v.
Finally, it is left to move from v to the lowest edge witness on the path from v to λ in ST. For that, we use a dictionary D that associates a light node with the number of edge witnesses in its heavy path. This number is at most z_78[I] ≤ |I|, and thus it can be stored in lg |I| bits, while a light node can be represented with its preorder number in O(lg n) bits. D has to be dynamic since we do not know in advance which nodes will become edge witnesses; we can make use of one of the dynamic dictionaries given in Table 2, where t_D denotes the time for an operation such as a lookup or an insertion and s_D denotes the dictionary size in bits. Now, suppose that D stores that d nodes in v's heavy path are edge witnesses. Let w be the next light node on the path from v to λ (i.e., w is the highest light node on the path from the root to λ whose exploration counter is still zero).

• If d − 1 is at least the height difference between v and w, then the parent u of w is already an edge witness, and u is a node on the heavy path of v. If the exploration counter of u is full, i.e., n_u = c(parent(u), u), then we increment the exploration counter of w, and hence make w an edge witness and add w to D.
• Otherwise (d − 1 is smaller than this height difference), the node whose exploration counter we want to increment is within the heavy path, and is either the dth or the (d + 1)th descendant of v.
In total, for z := z_78[I], we can improve the z lg z factor in the time bounds to z · min(lg z, t_D + lg lg n), which is z · min(lg z, lg z / lg lg z + lg lg n) when implementing D with the dynamic dictionary of Raman et al. [50], costing s_D = z lg(nz) + o(z) bits of additional working space during a query. More formally:

where z_78[I] is the number of computed LZ78 factors, ε ∈ (0, 1] is a selectable constant, and t_D and s_D are the time and space complexities of a dynamic dictionary associating a lg n-bit integer with a lg |I|-bit value (cf. Table 2). Similar to Theorem 2, we need the read-only text stored for queries if there is a character in the alphabet that does not appear in T.

Table 2. Dynamic dictionary representations usable in our cpST approach (cf. Section 4.8) for D mapping a light node represented in lg n bits to the lowest edge witness within its heavy path represented in lg |I| bits. z := z_78[I] denotes the number of LZ78 factors of T[I], which is an upper bound on the number of edge witnesses. An operation is a lookup or an insertion. We are interested in instances with t_D = o(lg z) (since, otherwise, the approach of Section 4.5 is favorable). ε ∈ (0, 1) is a selectable constant.

Conclusions
We used techniques introduced by Fischer et al. [4], which work on the succinct suffix tree (SST) and the compressed suffix tree (CST), to tackle the non-overlapping LZSS factorization and the LZ78 substring compression query problem. One of the main techniques is the usage of level ancestor queries to traverse a root-to-leaf path. For computing the non-overlapping LZSS factorization, our idea was to merge these techniques with the algorithm of Gusfield [31] working in root-to-leaf traversals. To answer an LZ78 substring compression query, we combined exponential search with the level ancestor queries and could accelerate this by first searching in the centroid path-decomposed suffix tree cpST whenever the factor lengths become large.
We wonder whether we can improve the space bounds for solving the semi-dynamic fringe marked ancestor problem (addressed in Section 4), where updates are restricted to marking a node that is a child of an already marked node; hence, the marked nodes form a connected subgraph of the suffix tree sharing at least the root. Without the need for the O(n) words of the marked ancestor data structure, it becomes interesting to devise algorithms computing the reversed LZSS factorization [52] (see [53] (Chapter 3.6.2)) in low memory.