Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Dominik Köppl

doi:10.3390/a14020044

M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 113-8510, Japan

Algorithms 2021, 14(2), 44; https://doi.org/10.3390/a14020044

This article belongs to the Special Issue Algorithms and Data-Structures for Compressed Computation

Abstract

We present algorithms computing the non-overlapping Lempel–Ziv-77 factorization and the longest previous non-overlapping factor table within small space in linear or near-linear time with the help of modern suffix tree representations fitting into limited space. With similar techniques, we show how to answer substring compression queries for the Lempel–Ziv-78 factorization with a possible logarithmic multiplicative slowdown depending on the used suffix tree representation.

Keywords:

substring compression query; longest previous non-overlapping factor table; application of suffix trees; non-overlapping Lempel–Ziv factorization; lossless compression; Lempel–Ziv-78 factorization

1. Introduction

The Lempel–Ziv-77 (LZ77) [1] and Lempel–Ziv-78 (LZ78) [2] factorizations are some of the most well-studied techniques for lossless data compression. Several variants such as Lempel–Ziv–Storer–Szymanski (LZSS) [3] have been proposed, and nowadays we often perceive the LZSS factorization as the standard variant of the LZ77 factorization. Both are defined as follows: Given a factorization $T = F_{1} \dots F_{z}$ for a string T:

it is the LZSS factorization of T if each factor $F_{x}$ , for $x \in [1 . . z]$ , is either the leftmost occurrence of a character or the longest prefix of $F_{x} \dots F_{z}$ that occurs at least twice in $F_{1} \dots F_{x}$ ; or
it is the classic LZ77 factorization of T if each factor $F_{x}$ , for $x \in [1 . . z - 1]$ , is the shortest prefix of $F_{x} \dots F_{z}$ that has only one occurrence in $F_{1} \dots F_{x}$ (as a suffix). The last factor $F_{z}$ is the suffix $T [1 + | F_{1} \dots F_{z - 1} | . .]$ that may have multiple occurrences in $F_{1} \dots F_{z}$ .

The non-overlapping variation is to restrict, when computing $F_{x}$, all candidate occurrences of $F_{x}$ to end before $F_{x}$ starts. For LZSS, this means that a factor $F_{x}$ must occur at least once in $F_{1} \dots F_{x-1}$. Given a text T of length n whose characters are drawn from an integer alphabet of size $σ = n^{O(1)}$, we want to study the problem of computing the non-overlapping LZSS factorization memory-efficiently with the aid of two suffix tree representations, which were used by Fischer et al. [4] (Section 2.2) to compute the classic LZ77, LZSS, and LZ78 factorizations in linear time within the asymptotic space requirements of the respective suffix tree. In this article, we obtain the non-overlapping LZSS factorization with similar techniques and within the same space bounds:

Theorem 1.

Given a text $T[1..n]$ of length n whose characters are drawn from an integer alphabet with size $σ = n^{O(1)}$, we can compute its non-overlapping LZSS factorization

in $O(ϵ^{-1} n)$ time using $(1 + ϵ) n \lg n + O(n)$ bits (excluding the read-only text T); or
in $O(n \lg^{ϵ} n)$ time using $O(n \lg σ)$ bits,

for a selectable constant $ϵ \in (0, 1]$. We support outputting the factors directly or storing the factors within the (asymptotic) bounds of the working space such that we can retrieve a factor in constant time.

We also show that we can compute the longest previous non-overlapping factor table [5] within the same space and time complexities (Theorem 3) by providing a succinct representation of this table (Lemma 1).

Subsequently, we study the substring compression query problem [6], where the task is to compute the factorization of a given substring of the text in time related to the number of computed factors and possibly a logarithmic dependency on the text length. However, this problem has only been conceived for the LZ77 factorization family. Here, we provide the first non-trivial solutions for LZ78, again with the help of several suffix tree representations:

Theorem 2.

Given a text $T[1..n]$ of length n whose characters are drawn from an integer alphabet with size $σ = n^{O(1)}$, we can compute a data structure on T in $O(n)$ time that computes, given an interval $I \subset [1..n]$, the LZ78 factorization of $T[I]$ in

$O(z_{78 [I]})$ time using $O(n \lg n)$ bits of space;
$O(z_{78 [I]} (\log_{σ} n + \lg z_{78 [I]}))$ time using $O(n \lg σ)$ bits of space; or
$O(ϵ^{-1} z_{78 [I]} \lg z_{78 [I]})$ time using $(1 + ϵ) n \lg n + O(n)$ bits of space,

where $z_{78 [I]}$ is the number of computed LZ78 factors and $ϵ \in (0, 1]$ is a selectable constant. In the last result, we additionally need the $n \lg σ$ bits of space for the read-only text during the queries if some character of the alphabet does not occur in the text (otherwise, we can simulate a text access with the function $head$ as described in [4]).

We can further speed up the last two solutions of Theorem 2 by spending more space (Theorem 4). Figure 1 shows a juxtaposition of all Lempel–Ziv factorizations addressed in this article.

Figure 1. Juxtaposition of the overlapping LZSS factorization, the non-overlapping LZSS factorization, and the LZ78 factorization on the string $T = aaababaaaba$. A factor is visualized by a rounded rectangle. Its coding consists of a mere character if it has no reference; otherwise, its coding consists of its referred position and its length for both LZSS variants or its referred index and its last character for LZ78.
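The three factorizations compared in Figure 1 can be reproduced directly from their definitions. The following is a minimal brute-force sketch in Python, not one of the suffix-tree algorithms of this article; the function names `lzss` and `lz78` are chosen here for illustration, positions are 0-based, and the running time is quadratic or worse.

```python
def lzss(text, overlap):
    """Brute-force LZSS factors of text; overlap=False forbids overlapping copies."""
    factors, i, n = [], 0, len(text)
    while i < n:
        best = 0
        for j in range(i):  # every earlier starting position is a candidate
            limit = n - i if overlap else min(n - i, i - j)
            length = 0
            while length < limit and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        size = max(1, best)  # a fresh factor is a single character
        factors.append(text[i:i + size])
        i += size
    return factors


def lz78(text):
    """Brute-force LZ78 factors: longest previous factor plus one character."""
    seen, factors, i = set(), [], 0
    while i < len(text):
        j = i + 1
        while j < len(text) and text[i:j] in seen:
            j += 1
        seen.add(text[i:j])
        factors.append(text[i:j])
        i = j
    return factors


T = "aaababaaaba"
print("|".join(lzss(T, overlap=True)))   # a|aa|b|aba|aaba
print("|".join(lzss(T, overlap=False)))  # a|a|a|b|ab|aaaba
print("|".join(lz78(T)))                 # a|aa|b|ab|aaa|ba
```

The only difference between the two LZSS variants in this sketch is the bound `limit`: in the non-overlapping case, a copy starting at position j may have at most i - j characters, so that it ends before the factor starts.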

2. Preliminaries

With lg we denote the logarithm $\log_{2}$ to base two. Our computational model is the word RAM model with machine word size $Ω(\lg n)$ for a given input size n. Accessing a word costs $O(1)$ time.

Let T be a text of length n whose characters are drawn from an integer alphabet $Σ = [1..σ]$ with $σ = n^{O(1)}$. Given $X, Y, Z \in Σ^{*}$ with $T = XYZ$, then X, Y, and Z are called a prefix, substring, and suffix of T, respectively. We call $T[i..]$ the ith suffix of T and denote a substring $T[i] \dots T[j]$ with $T[i..j]$.

Given a character $c \in Σ$ and an integer j, the rank query $T.rank_{c}(j)$ counts the occurrences of c in $T[1..j]$ and the select query $T.select_{c}(j)$ gives the position of the jth c in T. We stipulate that $rank_{c}(0) = select_{c}(0) = 0$. If the alphabet is binary, i.e., when T is a bit vector, there are data structures [7,8] that use $o(|T|)$ extra bits of space and can compute $rank$ and $select$ in constant time, respectively. Each of those data structures can be constructed in time linear in $|T|$. We say that a bit vector has a rank-support and a select-support if it is endowed with data structures providing constant time access to $rank$ and $select$, respectively.

From now on, we assume that T ends with a special character $ smaller than all other characters appearing in T. Under this assumption, there is no suffix of T having another suffix of T as a prefix. The suffix trie of T is the trie of all suffixes of T. There is a one-to-one relationship between the suffix trie leaves and the suffixes of T. The suffix tree ST of T is the tree obtained by compacting the suffix trie of T. Similar to the suffix trie, the suffix tree has n leaves, but the number of internal nodes of the suffix tree is at most n because every internal ST node is branching. The string stored in a suffix tree edge e is called the label of e. We define the function $c(e)$ returning, for each edge e, the length of e’s label. The string label of a node v is defined as the concatenation of all edge labels on the path from the root to v; its string depth, denoted by $str_depth(v)$, is the length of its string label. The leaf corresponding to the ith suffix $T[i..]$ is labeled with the suffix number $i \in [1..n]$. We write $sufnum(λ)$ for the suffix number of a leaf λ. The leaf-rank is the preorder rank ($\in [1..n]$) of a leaf among the set of all ST leaves, denoted by $leaf_rank(λ)$ for a leaf λ. For instance, the leftmost leaf in ST has leaf-rank 1, while the rightmost leaf has leaf-rank n. The function $next_leaf(λ)$ returns the leaf whose suffix number is the suffix number of λ incremented by one, or the leaf with suffix number 1 if the suffix number of λ is n.

Reading the suffix numbers stored in the leaves of ST in leaf-rank order gives the suffix array [9]. We denote the suffix array and the inverse suffix array of T by SA and ISA, respectively. The array ISA is defined such that $ISA[SA[i]] = i$ for every $i = 1, \dots, n$. The two arrays SA and ISA have the following relation with the two operations $leaf_rank$ and $sufnum$ on the ST leaves:

For the $ST$ leaf $λ$ with $sufnum (λ) = i$ , we have $leaf_rank (λ) = ISA [i]$ .
For the $ST$ leaf $λ$ with $leaf_rank (λ) = j$ , we have $sufnum (λ) = SA [j]$ .
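To make the relation between SA, ISA, leaf-ranks, and suffix numbers concrete, the following sketch computes both arrays for the text of Figure 2 by plain sorting. This is only an illustration under the assumption that the terminator $ compares smaller than the letters (which holds for ASCII); it is not how the SST or CST obtain these arrays.

```python
def suffix_array(text):
    """1-based suffix array obtained by sorting all suffixes (brute force)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def inverse(sa):
    """ISA with ISA[SA[i]] = i, returned as a list for suffix numbers 1..n."""
    isa = [0] * len(sa)
    for rank, suffix_number in enumerate(sa, start=1):
        isa[suffix_number - 1] = rank
    return isa


T = "ababaaa$"
SA = suffix_array(T)
ISA = inverse(SA)
print(SA)   # [8, 7, 6, 5, 3, 1, 4, 2]
print(ISA)  # [6, 8, 5, 7, 4, 3, 2, 1]
```

Read as statements about the suffix tree, the first list says that the leaf with leaf-rank 1 has suffix number 8 (the suffix $), and the second list says that the leaf with suffix number 1 has leaf-rank 6.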

$LCP[1..n]$ is an array with $LCP[1] := 0$ and $LCP[j]$ being the length of the longest common prefix (LCP) of the lexicographically jth smallest suffix $T[SA[j]..]$ with its lexicographic predecessor $T[SA[j-1]..]$ for $j \in [2..n]$. The permuted LCP array PLCP ([10] [Section 4]) is a permutation of LCP with $PLCP[i] := LCP[ISA[i]]$ for $i \in [1..n]$, and can be stored within $2n + o(n)$ bits of space. The Ψ-function [11] is defined by $SA[i] = SA[Ψ(i)] - 1$ for $i \in [1..n]$ with $SA[i] \neq n$ (and $Ψ(i) = ISA[1]$ for $SA[i] = n$). It can be stored in $O(n \lg σ)$ bits while supporting constant access time [12].
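The definitions of LCP, PLCP, and Ψ can be checked on a small instance. The sketch below evaluates them literally for T = ababaaa$ with plain 1-based Python lists (index 0 is unused padding); the helper name `lcp_len` is an assumption of this sketch, and none of the space bounds stated above apply to these uncompressed lists.

```python
def lcp_len(text, a, b):
    """Length of the longest common prefix of T[a..] and T[b..] (1-based)."""
    k = 0
    while max(a, b) + k <= len(text) and text[a + k - 1] == text[b + k - 1]:
        k += 1
    return k


T = "ababaaa$"
n = len(T)
SA = [0] + sorted(range(1, n + 1), key=lambda i: T[i - 1:])
ISA = [0] * (n + 1)
for rank in range(1, n + 1):
    ISA[SA[rank]] = rank

# LCP[1] = 0; LCP[j] compares the jth smallest suffix with its predecessor
LCP = [0, 0] + [lcp_len(T, SA[j], SA[j - 1]) for j in range(2, n + 1)]
# PLCP lists the same values in text order
PLCP = [0] + [LCP[ISA[i]] for i in range(1, n + 1)]
# Psi maps the rank of T[i..] to the rank of T[i+1..]
PSI = [0] + [ISA[SA[i] + 1] if SA[i] != n else ISA[1] for i in range(1, n + 1)]

print(LCP[1:])   # [0, 0, 1, 2, 1, 3, 0, 2]
print(PLCP[1:])  # [3, 2, 1, 0, 2, 1, 0, 0]
print(PSI[1:])   # [6, 1, 2, 3, 7, 8, 4, 5]
```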

In this article, we focus on the following two suffix tree representations, which are an ensemble of some of the aforementioned data structures:

The succinct suffix tree (SST), using $(1 + ϵ) n lg n + O (n)$ bits of space ([4] [Section 2.2.3]) for a selectable constant $ϵ > 0$ , contains, among others, a $(1 + ϵ) n lg n$ -bits representation of $SA$ and $ISA$ with $O (1 / ϵ)$ access time for each array.
The compressed suffix tree (CST) using $O (n lg σ)$ bits of space [10,13] contains, among others, the $Ψ$ -function.

Both suffix tree representations can be constructed in linear time within their final space requirements (asymptotically) when neglecting the space requirements of the read-only text T. They store the PLCP array and a succinct representation of the suffix tree topology such as a balanced parentheses (BP) [7] sequence. The BP sequence represents a rooted, unlabeled but ordered tree of n nodes by a bit vector of length $2n + o(n)$ bits. Since the suffix tree has at most $2n$ nodes, the BP representation of the ST topology uses at most $4n + o(n)$ bits. For example, the BP sequence of the suffix tree given in Figure 2 is

$\overset{1}{(} \overset{2}{()} \overset{3}{(} \overset{4}{()} \overset{5}{(} \overset{6}{()} \overset{7}{()} \overset{}{)} \overset{8}{(} \overset{9}{()} \overset{10}{()} \overset{}{)} \overset{}{)} \overset{11}{(} \overset{12}{()} \overset{13}{()} \overset{}{)} \overset{}{)}$

where we label the starting of an internal node and the center of a leaf ‘()’ with the respective preorder number on top. The BP sequence can be conceptually constructed by performing a preorder traversal on the tree, writing an opening parenthesis when walking down an edge and writing a closing parenthesis when climbing up an edge. We augment the BP sequence of ST with auxiliary data structures [14] of $o(n)$ bits to support queries such as $parent(v)$ returning the parent of a node v, a level ancestor query $level_anc(λ, d)$ returning the ancestor on depth d of the leaf λ, or $leaf_rank(λ)$, all in constant time. Note that the depth of a node v, i.e., the number of edges from v to the root, is at most $str_depth(v)$.
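The construction of the BP sequence by a preorder traversal can be sketched in a few lines. The snippet below encodes the shape of the suffix tree of Figure 2 as nested Python lists (an ad hoc representation assumed for this sketch, with a leaf being an empty list) and reproduces the sequence stated above together with the preorder numbers of the leaves. It uses plain strings and linear scans, not the o(n)-bit support structures of [14].

```python
def bp_sequence(node):
    """Balanced-parentheses string of an ordered tree given as nested lists."""
    return "(" + "".join(bp_sequence(child) for child in node) + ")"


def leaf_preorder_numbers(bp):
    """Preorder numbers of the leaves, i.e., of the occurrences of '()'."""
    numbers, opened = [], 0
    for pos, paren in enumerate(bp):
        if paren == "(":
            opened += 1                 # preorder number of the node opened here
            if bp[pos + 1] == ")":      # an opening directly followed by a closing
                numbers.append(opened)
    return numbers


LEAF = []
# shape of the suffix tree of T = ababaaa$: 13 nodes, 8 of them leaves
TREE = [LEAF, [LEAF, [LEAF, LEAF], [LEAF, LEAF]], [LEAF, LEAF]]
BP = bp_sequence(TREE)
print(BP)                         # (()(()(()())(()()))(()()))
print(leaf_preorder_numbers(BP))  # [2, 4, 6, 7, 9, 10, 12, 13]
```

The position of a leaf in the second list is its leaf-rank, so selecting the kth occurrence of ‘()’ is exactly the leaf selection used later for evaluating next_leaf.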

Figure 2. (Left) Suffix tree of the text T = ababaaa$ with the witness nodes and the corresponding leaves of the non-overlapping LZSS factorization highlighted in red and in green, respectively. We additionally mark the string ab with an implicit node whose string label is equal to the factor with Type 3. The nodes are labeled by their preorder numbers. The suffix number of each leaf λ is the underlined number drawn in dark yellow below λ. (Right) Non-overlapping LZSS factorization of T.

For our algorithms, we want to simulate a linear scan on the text from its beginning to its end by visiting the leaves in ascending order with respect to their suffix numbers (starting with the leaf with suffix number 1, and ending at the leaf with suffix number n). For that, we iteratively call $next_leaf$. We can compute $next_leaf$ by first computing the leaf-rank of the succeeding leaf $next_leaf(λ)$ of a leaf λ with $leaf_rank(next_leaf(λ)) = Ψ[leaf_rank(λ)]$, and then selecting $next_leaf(λ)$ by its leaf-rank; we can select a leaf by its leaf-rank in constant time due to the BP sequence representation of the suffix tree topology (the BP sequence can be augmented with a rank- and select-support for leaves represented by the empty parentheses ‘()’). Since we can simulate Ψ with SA and ISA, the SST needs $O(1/ϵ)$ time for evaluating $next_leaf$.
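The text-order scan over the leaves can be illustrated as follows. In this sketch, leaves are identified with their leaf-ranks, Ψ is simulated with SA and ISA as the SST does, and all arrays are plain 1-based Python lists built by sorting; the names `psi` and `leaves_in_text_order` are assumptions of the sketch.

```python
T = "ababaaa$"
n = len(T)
SA = [0] + sorted(range(1, n + 1), key=lambda i: T[i - 1:])  # SA[0] is padding
ISA = [0] * (n + 1)
for rank in range(1, n + 1):
    ISA[SA[rank]] = rank


def psi(rank):
    """Psi simulated with one SA access and one ISA access."""
    return ISA[SA[rank] + 1] if SA[rank] != n else ISA[1]


def leaves_in_text_order():
    """Yield the leaf-ranks of the leaves with suffix numbers 1, 2, ..., n."""
    rank = ISA[1]            # leaf with suffix number 1
    for _ in range(n):
        yield rank
        rank = psi(rank)     # leaf_rank(next_leaf(leaf)) = Psi[leaf_rank(leaf)]


visited = list(leaves_in_text_order())
print(visited)                   # [6, 8, 5, 7, 4, 3, 2, 1]
print([SA[r] for r in visited])  # [1, 2, 3, 4, 5, 6, 7, 8]
```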

Finally, a factorization of T of size z partitions T into z substrings $F_{1} \dots F_{z} = T$. Each such substring $F_{x}$ is called a factor. In what follows, we deal with the non-overlapping LZSS factorization in Section 3, and subsequently (in Section 4) with the LZ78 factorization in the special context that we want to compute it on a substring of T after a preprocessing step.

3. Non-Overlapping LZSS

Let $z_{ov}$ and z denote the number of factors of the overlapping LZSS factorization (i.e., the standard LZSS factorization supporting overlaps) and of the non-overlapping LZSS factorization, respectively. Kosolobov and Shur [15] showed that $z_{ov} \leq z \leq z_{ov} \cdot O(\lg(n / (z_{ov} \log_{σ} z_{ov})))$. Although being inferior to the overlapping LZSS factorization with respect to the number of factors, the non-overlapping LZSS factorization is an important tool for finding approximate repetitions [16], periods [17], seeds [18], tandem repeats [19], and other regular structures (cf. the non-overlapping s-factorization in ([20] [Chpt. 8])).

Algorithms computing the non-overlapping LZSS factorization usually compute the longest previous non-overlapping factor table $LPnF[1..n]$, where $LPnF[i]$ stores the maximum length of the LCP of $T[i..]$ with a substring $T[j..i-1]$ over all $j \in [1..i-1]$, which we set to zero if no such substring exists (i.e., $LPnF[1] = 0$). Having LPnF, we can iteratively compute the non-overlapping LZSS factorization because $F_{x} = T[k_{x} .. k_{x} + \max(0, LPnF[k_{x}] - 1)]$ with $k_{x} := \sum_{y=1}^{x-1} |F_{y}| + 1$ for $x \in [1..z]$.
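As a sanity check of this definition and of the formula for $F_{x}$, the following sketch computes LPnF by brute force and derives the factorization from it. It is a cubic-time illustration with function names chosen here (`lpnf_table`, `factorize`), not one of the linear-time algorithms cited below; the table is 1-based with an unused entry at index 0.

```python
def lpnf_table(text):
    """LPnF[1..n] by brute force (index 0 unused)."""
    n = len(text)
    table = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(1, i):
            length = 0
            # extend while the copy T[j..j+length] still ends before position i
            while (j + length < i and i + length <= n
                   and text[j + length - 1] == text[i + length - 1]):
                length += 1
            table[i] = max(table[i], length)
    return table


def factorize(text, table):
    """Non-overlapping LZSS factors derived from LPnF."""
    factors, k = [], 1
    while k <= len(text):
        length = max(0, table[k] - 1) + 1   # |F_x| for the factor starting at k_x
        factors.append(text[k - 1:k - 1 + length])
        k += length
    return factors


T = "ababaaa$"
LPNF = lpnf_table(T)
print(LPNF[1:])                      # [0, 0, 2, 2, 1, 1, 1, 0]
print("|".join(factorize(T, LPNF)))  # a|b|ab|a|a|a|$
```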

We are aware of the algorithms of Crochemore and Tischler [5] and Crochemore et al. [21] computing LPnF in linear time with a linear number of words. There are further practical optimizations [22,23,24] computing LPnF in linear time for constant alphabets. Finally, Ohlebusch and Weber [25] gave a linear time conversion algorithm from the longest previous factor table LPF [26] to LPnF if the leftmost possible referred positions $P[1..n]$ with $T[P[i] .. P[i] + LPF[i] - 1] = T[i .. i + LPF[i] - 1]$ for each text position $i \in [1..n]$ are provided. It seems possible that, instead of overwriting the LPF array with the LPnF array, we could run their algorithm on a $2n$-bits succinct representation of the LPF array supporting sequential scan in constant time ([27] [Corollary 5]) to produce an LPnF array representation within the same space due to the following lemma:

Lemma 1.

$LPnF[j-1] - 1 \leq LPnF[j] \leq n - j$ for $j \in [2..n]$.

Proof.

Assume that $LPnF[j-1] > 0$ (since $LPnF[j] \geq 0$ trivially holds). According to the definition, there exists an occurrence $T[i .. i + LPnF[j-1] - 1]$ of $T[j-1 .. j-1 + LPnF[j-1] - 1]$ with $i + LPnF[j-1] - 1 < j - 1$. Hence, $T[i+1 .. i + LPnF[j-1] - 1] = T[j .. j + LPnF[j-1] - 2]$ and $|T[j .. j + LPnF[j-1] - 2]| = LPnF[j-1] - 1$. Thus, $T[j..]$ has a common prefix with a substring of $T[1..j-1]$ of (at least) length $LPnF[j-1] - 1$, i.e., $LPnF[j-1] - 1 \leq LPnF[j]$. The upper bound follows from the fact that a factor cannot protrude T to the right. □

Consequently, $LPnF[1] + 1, LPnF[2] + 2, \dots, LPnF[n] + n$ is non-decreasing. By storing the differences $LPnF[j] - LPnF[j-1] + 1 \geq 0$ for $j \in [2..n]$ in a unary bit sequence, we can linearly decode LPnF from this unary bit sequence because we know that $LPnF[1] = 0$. Since $LPnF[n] + n \leq n$ by the above lemma (in particular $LPnF[i] \leq LPF[i]$), the sequence has at most $2n$ bits. Obviously, this sequence can be written sequentially from right to left in constant time per LPnF value in reverse order $LPnF[n], \dots, LPnF[1]$ (the algorithm of Ohlebusch and Weber [25] computes LPnF in this order). It is therefore possible to compute LPnF within $O(n)$ bits on top of P and a compressed indexing data structure such as the FM-index [28] of the text: For that purpose, Okanohara and Sadakane [29] proposed an algorithm computing LPF and P with the FM-index in $O(n \lg^{3} n)$ time, which was improved by Prezza and Rosone [30] to $O(n \lg^{2} n)$ time. However, the need for P, using $n \lg n$ bits when stored in a plain array, makes an approach that transforms LPF to LPnF after computing LPF and P rather unattractive. In what follows, we present a different way that directly computes the non-overlapping LZSS factorization or LPnF with near-linear or linear running time, without the need for P.
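The unary coding of the differences permitted by Lemma 1 can be sketched as follows. The bit sequence is modeled as a Python list of 0s and 1s written from left to right, which is a simplification assumed for this sketch (the text above writes it from right to left); each difference d is stored as d ones followed by a terminating zero.

```python
def encode(lpnf):
    """Unary code of LPnF[j] - LPnF[j-1] + 1 for j = 2..n; lpnf[0] is LPnF[1]."""
    bits = []
    for previous, current in zip(lpnf, lpnf[1:]):
        bits.extend([1] * (current - previous + 1))  # non-negative by Lemma 1
        bits.append(0)
    return bits


def decode(bits):
    """Recover LPnF from the unary code, starting from LPnF[1] = 0."""
    values, run = [0], 0
    for bit in bits:
        if bit:
            run += 1
        else:
            values.append(values[-1] + run - 1)
            run = 0
    return values


LPNF = [0, 0, 2, 2, 1, 1, 1, 0]   # LPnF of T = ababaaa$
BITS = encode(LPNF)
print(BITS)               # [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
print(decode(BITS))       # [0, 0, 2, 2, 1, 1, 1, 0]
```

For this input, the code uses 14 bits, which is within the bound of 2n = 16 bits.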

3.1. Setup

Our idea is an adaptation of the LZSS factorization introduced in ([4] [Section 3]). To explain our approach, we first stipulate that T ends with a unique character $ that is smaller than all other characters appearing in T. Next, we distinguish between fresh and referencing factors. We say that a factor is fresh if it is the leftmost occurrence of a character. We call all other factors referencing. A referencing factor $F_{x}$ has a reference pointing to the starting position of its longest previous occurrence (as a tie break, we always select the leftmost such position). We call this starting position the referred position of $F_{x}$. More precisely, the referred position of a factor $F_{x} = T[i .. i + ℓ - 1]$ is the smallest text position j with $j + ℓ \leq i$ and $T[j .. j + ℓ - 1] = T[i .. i + ℓ - 1]$. Compared to the overlapping LZSS factorization, we require here the additional restriction that $j + ℓ \leq i$. This makes the computation of the referred positions more technical: Let j be the referred position of a factor $F := T[i .. i + ℓ - 1]$, and let S be the longest substring starting before i that is a prefix of $T[i..]$. We associate the factor F with one of the following three types:

Type 1: $T[j .. j + ℓ - 1] = S$ (the factor F coincides with the overlapping LZSS factor that would start at $T[i..]$);
Type 2: $T[j .. j + ℓ - 1]$ is shorter than S, but $T[j + ℓ] \neq T[i + ℓ]$ (then there is a suffix tree node that has the string label F); or
Type 3: $T[j + ℓ] = T[i + ℓ]$ and $j + ℓ = i$ (otherwise, the factor F could be extended to the right).

An example is $T = a | b | \overset{3}{ab} | \overset{1}{a} | \overset{2}{a} | \overset{1}{a} | \$$, where the factor borders are symbolized by the vertical bar ∣, and the referencing factors are labeled with their types (fresh factors are not labeled). If F is of Type 3, the suffixes $T[j..]$ and $T[i..]$ share more than ℓ characters such that F is not a string label of any suffix tree node in general, but it is at least a prefix of the string label of a node. This is the case for the third factor ab in the aforementioned example, as can be seen in Figure 2.
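The three types can be told apart by a direct check of their defining conditions. The sketch below does this by brute force for a referencing factor given by its 1-based starting position i and its length; the function name `classify` and its return format (type, referred position) are assumptions of this sketch, and the text is assumed to end with a unique terminator so that all accessed positions exist.

```python
def classify(text, i, length):
    """Type and referred position of the referencing factor T[i..i+length-1]."""
    t = " " + text                  # shift so that t[p] is the character T[p]
    n = len(text)
    factor = t[i:i + length]
    # referred position: leftmost j with j + length <= i and a matching copy
    j = next(p for p in range(1, i - length + 1) if t[p:p + length] == factor)
    # |S|: longest prefix of T[i..] that also starts at a position before i
    longest = 0
    for p in range(1, i):
        k = 0
        while i + k <= n and t[p + k] == t[i + k]:
            k += 1
        longest = max(longest, k)
    if length == longest:
        return 1, j                 # the factor equals S
    if t[j + length] != t[i + length]:
        return 2, j                 # shorter than S, copy and factor diverge
    return 3, j                     # the copy ends directly before position i


T = "ababaaa$"
for start, size in [(3, 2), (5, 1), (6, 1), (7, 1)]:
    print(start, T[start - 1:start - 1 + size], classify(T, start, size))
# 3 ab (3, 1)
# 5 a (1, 1)
# 6 a (2, 1)
# 7 a (1, 1)
```

The printed types 3, 1, 2, 1 agree with the labels of the referencing factors in the example above.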

To find the referred positions, we mark certain nodes as witnesses, which create a connection between corresponding leaves and their referred positions. A leaf is called corresponding if its suffix number is the starting position of a factor. We say that the witness of a fresh factor is the root. For a referencing factor F, the witness of F is the highest node whose string label has F as a prefix; the witness of F determines the referred position of F, which is the smallest suffix number among all leaves in its subtree.

Despite this increased complexity compared to the overlapping LZSS factorization, the non-overlapping factorization can be computed with the suffix tree in $O(n \lg σ)$ time using $O(n \lg n)$ bits of space ([31] [APL16]). Here, we adapt the algorithms of (Fischer et al. [4] [Section 3]) computing the overlapping LZSS factorization to compute the non-overlapping factorization by following the approach of Gusfield [31]. Our goal is to compute the coding of the factors, i.e., the referred position and the length of each factor (cf. Figure 1).

3.2. The Factorization Algorithm

All LZSS factorization algorithms of (Fischer et al. [4] [Section 3]) are divided into passes. A pass consists of visiting suffix tree leaves in text order (i.e., in order of their suffix numbers). On visiting a leaf, they conduct a leaf-to-root traversal. In what follows, we present our modification, which merely consists of a modification of Pass (a) in all LZSS factorization variants of ([4] [Section 3]): In Pass (a), Fischer et al. computed the factor lengths and the witnesses. To maintain the witnesses and lengths in future passes, they marked and stored the preorder numbers of the witnesses and the starting positions of the LZSS factors in two bit vectors $B_{W}$ and $B_{T}$, respectively. In succeeding passes, they computed, based on the factor lengths and the witnesses, the referred positions and with that the final coding. Therefore, it suffices to only change Pass (a) according to our definition of witnesses and factors, while keeping the subsequent passes untouched. In this pass, we do the following:

Pass (a): Create $B_{W}$ and $B_{T}$ to determine the witnesses and the factor lengths, respectively.

The main technique of a pass in [4] is the leaf-to-root traversal. Here, we do the opposite: We traverse from the root to a specific leaf. We perform a root-to-leaf traversal by level ancestor queries such that visiting a node takes constant time. We perform these traversals only for all corresponding leaves since the other leaves are not useful for determining a factor.

Suppose we visit a leaf λ corresponding to a factor F. We already know the starting position of F (i.e., $sufnum(λ)$), but not its length, referred position, or witness w. To detect w, we use the following observation: Given that $j_{u}$ denotes the smallest suffix number among all leaves in the subtree rooted at a node u, w is the highest node that maximizes

$ℓ_{w} := \min(str_depth(w), sufnum(λ) - j_{w}). \quad (1)$

If $j_{w} = sufnum(λ)$, then F is a fresh factor. Otherwise, w determines the length $|F| = ℓ_{w}$ and the referred position $j_{w}$ of F. However, the two functions $v \mapsto str_depth(v)$ and $v \mapsto sufnum(λ) - j_{v}$ are strictly increasing and monotonically decreasing, respectively, when applied to each node v visited when walking down the path from the root to λ. Thus, our goal is to find the lowest node u, where the value $ℓ_{u}$ of Equation (1) still results from $str_depth(u)$, and not from the second argument $sufnum(λ) - j_{u}$. We give a sketch in Figure 3 and study a particular case in Figure 4 for factors of Types 2 and 3.
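Equation (1) can be evaluated without building a suffix tree, which gives a simple way to test an implementation on small inputs. The sketch below relies on the assumption that T ends with a unique terminator: then the string depths of the nodes on the path from the root to the leaf with suffix number i are 0 (the root), every value lcp(T[i..], T[k..]) for k ≠ i, and n - i + 1 (the leaf itself), and $j_{u}$ for a node u of string depth d is the smallest k whose suffix shares at least d characters with T[i..]. The names `lcp` and `witness` are hypothetical, the running time is at least quadratic per leaf, and the sketch reports a fresh factor by a maximum of 0.

```python
def lcp(text, a, b):
    """Length of the longest common prefix of T[a..] and T[b..] (1-based)."""
    k = 0
    while max(a, b) + k <= len(text) and text[a + k - 1] == text[b + k - 1]:
        k += 1
    return k


def witness(text, i):
    """(str_depth(w), j_w, l_w) for the leaf with suffix number i."""
    n = len(text)
    shared = {k: lcp(text, i, k) for k in range(1, n + 1)}
    # string depths of all nodes on the root-to-leaf path, from top to bottom
    depths = sorted({0, n - i + 1} | {shared[k] for k in shared if k != i})
    best = None
    for depth in depths:
        j = min(k for k in shared if shared[k] >= depth)   # j_u
        value = min(depth, i - j)                          # Equation (1)
        if best is None or value > best[2]:                # '>' keeps the highest node
            best = (depth, j, value)
    return best


T = "ababaaa$"
print(witness(T, 3))  # (3, 1, 2): Type 3 factor ab, witness with string label aba
print(witness(T, 6))  # (1, 1, 1): Type 2 factor a, witness with string label a
print(witness(T, 1))  # (0, 1, 0): maximum 0, i.e., a fresh factor
```

For the leaf with suffix number 3, the maximum is attained at the node of string depth 3, although the factor has only length 2; this is the situation of a Type 3 factor whose string label is a proper prefix of the witness's string label.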

Figure 3. Determining the witness of a factor F whose starting position is the suffix number of the leaf λ. Straight arcs symbolize edges, while curly arcs symbolize paths that can visit multiple nodes (which are not visualized). Given $j_{w}$ is the smallest suffix number among all leaves in the subtree rooted at a node w, and that u is the lowest ancestor of λ with $[j_{u} .. j_{u} + str_depth(u) - 1] \cap [sufnum(λ) .. sufnum(λ) + str_depth(u) - 1] = \emptyset$, then either u or its child v is the witness of F (see Section 3.2 for an explanation). The idea behind detecting whether the two intervals are intersecting is that a factor starting at $sufnum(λ)$ of length $str_depth(u)$ would be of Type 1 or Type 2 with referred position $j_{u}$. In fact, if F is of Type 1, then its witness is the lowest ancestor of λ having a leaf with a suffix number smaller than $sufnum(λ)$ in its subtree (this definition coincides with the witnesses of the overlapping LZSS factorization of ([4] [Section 2.3])). It is possible that $j_{u} = j_{v}$, i.e., the leaf with suffix number $j_{u}$ is also in the subtree rooted at v. We can observe this case in Figure 4.

Figure 4. Special case of the setting considered in Figure 3 for factors of Types 2 and 3. Here, we assign u and v the same roles as in Figure 3, but we additionally assume that $j_{v} = j_{u}$ and $sufnum(λ) \in I := [j_{u} + str_depth(u) .. j_{u} + str_depth(v) - 1]$. If $sufnum(λ) = j_{u} + str_depth(u)$, as in the right figure, then the factor F of λ starting at $sufnum(λ)$ is of Type 2, and the witness of F is u, although u is not the lowest ancestor of λ having λ and $j_{u}$ in its subtree. If $sufnum(λ) \in I ∖ \{j_{u} + str_depth(u)\}$, then F is of Type 3 and the witness of F is v; the witness of F is v even if λ and the leaf with suffix number $j_{v}$ are shared by a descendant of v as shown in the left figure.

To achieve our goal, let $I_{v} := [j_{v} .. j_{v} + str_depth(v) - 1]$ and $I_{λ, v} := [sufnum(λ) .. sufnum(λ) + str_depth(v) - 1]$ be two intervals. These two intervals have the property that $T[I_{v}] = T[I_{λ, v}]$. The idea is that $T[I_{λ, v}]$ is a candidate for F with $T[I_{v}]$ being its leftmost occurrence in T. We compute the values of $j_{v}$, $I_{v}$ and $I_{λ, v}$ for every node v on the path from the root to λ until reaching a node v such that the intervals $I_{v}$ and $I_{λ, v}$ overlap (cf. Line 9 in Algorithm 1). Let u be the parent of v. Then, the edge $(u, v)$ determines the factor F: We consider the following two cases that determine whether F is a fresh or referencing factor, and whether the witness and the referred position of F are u and $j_{u}$, or v and $j_{v}$, respectively, in case F is a referencing factor:

If $j_{v} = sufnum(λ)$, there is no leaf in v’s subtree with a suffix number smaller than $sufnum(λ)$.
- If u is the root, then there is no candidate for a referred position available, i.e., F is a fresh factor (cf. Line 13 in Algorithm 1).
- Otherwise, $str_depth(u) > 0$ and $I_{u} \cap I_{λ, u} = \emptyset$ (since v is the highest node on the path from the root to λ for which $I_{v} \cap I_{λ, v} \neq \emptyset$ holds). Hence, the longest substring occurring before $sufnum(λ)$ that is a prefix of $T[sufnum(λ)..]$ has an occurrence in $T[1 .. sufnum(λ) - 1]$ (Type 1). One of those occurrences starts at position $j_{u}$. This means that the referred position is $j_{u}$, and the witness of F is u; the length of F is $str_depth(u)$ (cf. Line 17 in Algorithm 1).
If $j_{v} \neq sufnum(λ)$ (i.e., $j_{v} < sufnum(λ)$), the length of F is in the interval $[str_depth(u) .. str_depth(v) - 1]$. If the factor F refers to the position $j_{v}$, then its length is the minimum of $sufnum(λ) - j_{v}$ and the length of the LCP of the suffixes starting at $j_{v}$ and $sufnum(λ)$. (Note that this LCP can be longer than the string label of v.) Let us denote the value of this minimum by ℓ, which coincides with $ℓ_{v}$ of Equation (1) due to $str_depth(v) \geq sufnum(λ) - j_{v}$, and determines whether F refers to $j_{v}$ or $j_{u}$ (cf. Line 20 in Algorithm 1):
- If $ℓ = str_depth(u)$, then the referred position of F is actually the suffix number of a leaf contained in u’s subtree (Type 2). In this case, the length of F is $|I_{u}| = str_depth(u)$ because $I_{u} \cap I_{λ, u} = \emptyset$. The witness of F is u, and $j_{u}$ is the referred position (cf. Line 21 in Algorithm 1).
- Otherwise, $str_depth(u) < |F| < str_depth(v)$, hence F is not the string label of any suffix tree node (Type 3). The node v is the highest node whose string label has F as a prefix. We conclude that the witness, referred position, and length of F are v, $j_{v}$, and ℓ, respectively (cf. Line 23 in Algorithm 1).

3.3. Complexity Bounds

To determine the value of $j_{v}$, we need to answer a range minimum query (RMQ) on SA. Given an array $A[1..n]$, an RMQ for an interval $I \subset [1..n]$ asks for the index of the minimum value in $A[I]$. To answer an RMQ, we can make use of the following data structure:

Lemma 2 ([32] [Thm 5.8]). Let $A[1..n]$ be an integer array, where accessing an element $A[i]$ takes $t_{A}$ time for $i \in [1..n]$. There exists a data structure of size $2n + o(n)$ bits built on top of A that answers RMQs in constant time. It is constructed in $O(t_{A} n)$ time with $o(n)$ additional bits of working space.
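To experiment with the queries that Lemma 2 supports, a sparse table is a convenient stand-in. Note that this is a different and much larger data structure than the one of Lemma 2: it stores O(n lg n) indices instead of 2n + o(n) bits, but it also answers an RMQ with a constant number of array accesses. The class name `SparseTableRMQ` is chosen for this sketch, and indices are 0-based.

```python
class SparseTableRMQ:
    """Index of the leftmost minimum in values[lo..hi] via a sparse table."""

    def __init__(self, values):
        self.values = values
        n = len(values)
        self.table = [list(range(n))]   # level 0: windows of width 1
        width = 1
        while 2 * width <= n:
            previous, row = self.table[-1], []
            for i in range(n - 2 * width + 1):
                a, b = previous[i], previous[i + width]
                row.append(a if values[a] <= values[b] else b)
            self.table.append(row)      # windows of width 2 * width
            width *= 2

    def query(self, lo, hi):
        """RMQ on the inclusive 0-based interval [lo..hi]."""
        level = (hi - lo + 1).bit_length() - 1
        a = self.table[level][lo]                      # window starting at lo
        b = self.table[level][hi - (1 << level) + 1]   # window ending at hi
        return a if self.values[a] <= self.values[b] else b


SA = [8, 7, 6, 5, 3, 1, 4, 2]       # suffix array of T = ababaaa$
rmq = SparseTableRMQ(SA)
position = rmq.query(1, 5)          # leaves with leaf-ranks 2..6, below the node a
print(position, SA[position])       # 5 1
```

In the usage above, the queried interval covers the leaves in the subtree of the node with string label a, and the returned minimum 1 is the value $j_{v}$ of that node.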

According to Lemma 2, we can construct an RMQ data structure in $O(t_{SA} n)$ time using $2n + o(n)$ bits of space, where $t_{SA}$ is the time for accessing SA. We can access SA in $O(1/ϵ)$ time and in $O(\lg^{ϵ} n)$ time with the SST and CST, respectively, where the last time complexity is due to the following lemma:

Lemma 3 (Grossi and Vitter [11] [Section 3.2]). There is a data structure using $O(ϵ^{-1} n)$ bits that can access SA in $O(\lg^{ϵ} n)$ time, where $ϵ \in (0, 1]$ is a selectable constant.

As shown by Fischer et al. ([33] [Lemma 3]), the operation $str_depth(u)$ for a node u can be computed with SA, LCP, and an RMQ data structure on LCP because the leaf-ranks of the leftmost leaf $λ_{L}$ and rightmost leaf $λ_{R}$ in the subtree rooted at u define the interval $[leaf_rank(λ_{L}) + 1 .. leaf_rank(λ_{R})]$ in SA, and selecting the minimum value in LCP within this interval gives the length of the longest common prefix shared among all leaves in u’s subtree, which is $str_depth(u)$. However, we do not store LCP explicitly, but instead simulate an access of its jth entry for $j \in [2..n]$ by $LCP[j] = PLCP[SA[j]]$. Hence, we can access an entry of LCP in $O(t_{SA})$ time. Consequently, we can build the data structure of Lemma 2 on LCP in $O(t_{SA} n)$ time, which takes $2n + o(n)$ bits of additional space. Equipped with this data structure, we finally can evaluate $str_depth$ in $O(t_{SA})$ time. The total time bounds are composed as follows:

(a): Since the number of visited nodes is at most the factor length of a corresponding leaf $λ$ during a root-to-leaf traversal to $λ$ , and $\sum_{x = 1}^{z} |F_{x}| = n$ , we conclude that the RMQs take $O (n t_{SA})$ time in total.
(b): For each root-to-leaf traversal to a leaf corresponding to a factor F, we stop at an edge $(u, v)$ and compute the length of the LCP of $T [j_{v} . .]$ and $T [sufnum (λ) . .]$ by naïvely comparing $O (|F|)$ characters. In total, the number of compared characters is $O (\sum_{x = 1}^{z} |F_{x}|) = O (n)$ .

Altogether, Pass (a) takes $O(n t_{SA})$ time, since all applied tree navigational operations take constant time. With Lemma 3, we obtain the time and space complexities claimed in Theorem 1.
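The evaluation of str_depth that underlies these bounds can be illustrated on the running example. In the sketch below, an internal node is identified by the leaf-ranks of its leftmost and rightmost leaf, LCP is never stored but simulated by PLCP[SA[j]], and the minimum is found by a linear scan rather than by the constant-time RMQ structure of Lemma 2; the arrays are plain 1-based lists and the helper names are assumptions of this sketch.

```python
def common_prefix(text, a, b):
    """Length of the longest common prefix of T[a..] and T[b..] (1-based)."""
    k = 0
    while max(a, b) + k <= len(text) and text[a + k - 1] == text[b + k - 1]:
        k += 1
    return k


T = "ababaaa$"
n = len(T)
SA = [0] + sorted(range(1, n + 1), key=lambda i: T[i - 1:])
ISA = [0] * (n + 1)
for rank in range(1, n + 1):
    ISA[SA[rank]] = rank
# PLCP[i]: LCP of T[i..] with its lexicographic predecessor (0 for the smallest)
PLCP = [0] + [common_prefix(T, i, SA[ISA[i] - 1]) if ISA[i] > 1 else 0
              for i in range(1, n + 1)]


def lcp_entry(rank):
    """Simulated access LCP[rank] = PLCP[SA[rank]]."""
    return PLCP[SA[rank]]


def str_depth(left_rank, right_rank):
    """String depth of the internal node spanning the given leaf-ranks."""
    return min(lcp_entry(r) for r in range(left_rank + 1, right_rank + 1))


print(str_depth(2, 6))  # 1: node with string label a
print(str_depth(5, 6))  # 3: node with string label aba
print(str_depth(7, 8))  # 2: node with string label ba
print(str_depth(1, 8))  # 0: root
```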

Algorithm 1: Pass (a) of the non-overlapping LZSS algorithm of Section 3. The function $report(w, j, ℓ)$ outputs the referred position j and the length ℓ of the respective referencing factor; marks the witness w and the starting position of the next factor (determined by ℓ) in $B_{W}$ and in $B_{T}$, respectively; and appends the unary value of $depth(w)$ to $B_{L}$ (defined in Section 3.4). $lmost_leaf(v)$ and $rmost_leaf(v)$ return the leftmost and the rightmost leaf of the subtree rooted at v in constant time, respectively. All break statements exit the nested inner loop and jump to Line 25.

3.4. Storing the Factorization

From here on, we have two options: We can either directly output the referred positions and the lengths of the computed factors during Pass (a), or we can store additional information for retrieving the witnesses in a later pass. Such a later pass is interesting when working with the SST, as we can store the factors in the

(1 + ϵ) n lg n + O (n)

bits of working space ([4] [Section 3.3]). There, a later pass overwrites the space occupied by the SST, in particular the suffix array representation, such that later passes no longer can determine witnesses. Although we mark each witness in the bit vector

B_{W}

during Pass (a), there can be multiple nodes marked in

B_{W}

on the path from the root to a leaf corresponding to a factor F. The overlapping LZSS factorization obeys the invariant that the witness of F is the lowest ancestor of

λ

that is marked in

B_{V}

, given that

B_{V}

marks all ancestors of the leaves with a suffix number smaller than

sufnum (λ)

when conducting a leaf-to-root traversal at

λ

during the overlapping LZSS computation ([4] [Section 3]). Due to the existence of factors of Types 2 and 3, this invariant does not hold for the non-overlapping factorization.

For the later passes, we want a data structure that finds the witness w of a factor F based on F’s starting position in constant time. Fortunately, w is determined by the leaf

λ

corresponding to F and w’s depth due to

w = level_anc (λ, depth (w))

. To remember the depth of each witness, we maintain a bit vector

B_{L}

that stores the depth of each witness in unary coding sorted by the suffix number of the respective corresponding leaf. Given that we find the witness w of a leaf

λ

in Pass (a) during the traversal from the root to

λ

, we store the unary code

0^{d} 1

in

B_{L}

, where

d : = depth (w)

. For a leaf corresponding to a fresh factor, we store the unary code 1 in

B_{L}

. Similar to

B_{D}

in ([4] [Sect. 3.4.3 Pass (2)]), we do not need to add a select-support to

B_{L}

, since we process the corresponding leaves always sequentially in text order. Given a corresponding leaf

λ

, we can jump to its witness (or to the root if

λ

corresponds to a fresh factor) with a level ancestor query from

λ

with the depth

B_{L} . {select}_{1} (sufnum (λ) + 1) - B_{L} . {select}_{1} (sufnum (λ)) - 1

. The length of

B_{L}

is at most

n + z

since the depth of a witness is bounded by the length of its corresponding factor and the sum of all factor lengths is n.
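The unary layout of $B_{L}$ can be sketched as follows. This is only an illustration under the assumption that the witness depths are handed over in text order of their factors (depth 0 for a fresh factor); the function names are hypothetical, and the bit vector is a plain Python list rather than a succinct structure.

```python
def encode_witness_depths(depths):
    # Each depth d is written as the unary code 0^d 1.
    bits = []
    for d in depths:
        bits.extend([0] * d)
        bits.append(1)
    return bits

def decode_witness_depths(bits):
    # Sequential scan, mirroring the in-order processing of the
    # corresponding leaves (no select-support needed for this access pattern).
    depths, run = [], 0
    for bit in bits:
        if bit:
            depths.append(run)
            run = 0
        else:
            run += 1
    return depths
```

The encoded length is the sum of all depths plus one terminating bit per factor, which matches the bound of $n + z$ bits stated above.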

3.5. Computing $LPnF$

Finally, we can compute

LPnF

with the same algorithm by visiting all leaves (i.e., not only the corresponding ones). However, we can no longer charge the visited nodes during a root-to-leaf traversal with the length of a factor as in Section 3.3 (a). In fact, such an algorithm may visit

O (n^{2})

nodes since

\sum_{i = 1}^{n} LPnF [i] = O (n^{2})

(and this sum is

Θ (n^{2})

for the string

T = a \dots a

). To reduce the number of nodes to visit, we can make use of Lemma 1: having

LPnF [1 . . i - 1]

computed, we know that

LPnF [i] \geq LPnF [i - 1] - 1

; hence, it suffices to start the root-to-leaf traversal at the lowest node

\tilde{w}

whose string depth is at most

LPnF [i - 1] - 1

. We find this node

\tilde{w}

by a suffix link. A suffix link connects a node with string label

S \in Σ^{+}

to the node with string label

S [2 . .]

or to the root node if

S \in Σ^{1}

. All nodes except the root have a suffix link. However, we do not store suffix links as pointers explicitly, but simulate them with the leaves since we can compute the suffix link of a leaf

λ

with

next_leaf (λ)

: Suppose that we have processed the leaf

λ

with suffix number

i - 1

for computing

LPnF [i - 1]

. In what follows, we first assume that the computed factor starting at

i - 1

is not of Type 3. Then, the witness of

λ

is

λ

’s ancestor u with

str_depth (u) = LPnF [i - 1]

being the computed factor length. First, we select another leaf

λ^{'}

of the subtree rooted at u such that the lowest common ancestor (LCA) of

λ

and

λ^{'}

is u (e.g., we can select the leftmost or rightmost leaf in u’s subtree). Then,

\tilde{λ} : = next_leaf (λ)

is the leaf with suffix number i, and the LCA

\tilde{u}

of

\tilde{λ}

and

next_leaf (λ^{'})

is the node on the path from the root to

\tilde{λ}

with

str_depth (\tilde{u}) = str_depth (u) - 1

. By omitting the nodes from the root to

\tilde{u}

in the traversal to

\tilde{λ}

for computing

LPnF [i]

, we only need to visit at most

LPnF [i] - LPnF [i - 1] + 1

nodes for determining

LPnF [i]

. A telescoping sum with the upper bound of Lemma 1 shows that we visit

O (n)

nodes in total.
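To make the role of Lemma 1 concrete, the following brute-force sketch computes LPnF directly from its definition and lets one check the inequality on small inputs. It assumes that LPnF[i] is the length of the longest prefix of T[i . .] that has an occurrence ending before position i; the function name is hypothetical, indices are 0-based, and the cubic running time is for illustration only.

```python
def lpnf_naive(text):
    n = len(text)
    table = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            length = 0
            # The previous occurrence text[j:j+length] must end before i.
            while (i + length < n and j + length < i
                   and text[j + length] == text[i + length]):
                length += 1
            best = max(best, length)
        table[i] = best
    return table
```

Because every difference LPnF[i] - LPnF[i - 1] + 1 is non-negative by Lemma 1, summing these terms over all i telescopes to LPnF[n] - LPnF[1] + n - 1, which is O(n).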

It is left to deal with the text positions

i - 1

for which we computed a factor of Type 3. Here, the leaf

λ

has a witness v with

LPnF [i - 1] < str_depth (v)

, i.e., the computed factor is implicitly represented on the edge from

u : = parent (v)

to v. We apply the same technique (i.e., taking the suffix link) as for the other types, but apply this technique on u instead of the witness v, such that we end up at a node

\tilde{u}

with

str_depth (\tilde{u}) = str_depth (u) - 1

. We sketch the setting in Figure 5. Now, we additionally need to walk down from

\tilde{u}

towards

\tilde{λ} = next_leaf (λ)

to reach the lowest node

\tilde{v}

with

str_depth (\tilde{v}) \leq LPnF [i - 1] - 1

. There can be at most

c (u, v)

nodes on the path from

\tilde{u}

to

\tilde{v}

. We can refine this number to at most

δ : = LPnF [i - 1] - str_depth (u)

, where

δ

is the number of characters on the edge

(u, v)

contributing to

LPnF [i - 1]

. Nevertheless, these extra

δ

nodes seem to invalidate the

O (n)

bound on the number of visited nodes.

Figure 5. Computing

LPnF [i]

from

LPnF [i - 1]

by simulating a suffix link from u to

\tilde{u}

(cf. Section 3.5). Straight arcs symbolize edges, while curly arcs symbolize paths visiting multiple nodes (which are not visualized). We have

str_depth (u) + δ = LPnF [i - 1]

, and hence

str_depth (\tilde{u}) + δ = LPnF [i - 1] - 1

. We have

δ > 0

if and only if the factor starting at text position

i - 1

is of Type 3. In that case, we additionally walk down from

\tilde{u}

towards

\tilde{λ}

to find the lowest node

\tilde{v}

with

str_depth (\tilde{v}) \leq LPnF [i - 1] - 1 = str_depth (u) + δ - 1

. While u and v are directly connected with an edge, the path from

\tilde{u}

to

\tilde{v}

may contain multiple edges.

To retain our claimed time complexity, we switch from counting nodes to counting characters and use the following charging argument: We charge each edge we traversed by

c (e)

characters, or

δ_{e}

characters if we only traversed

δ_{e} \leq c (e)

characters on an edge. With the above analysis, we again obtain

O (n t_{SA})

time for the algorithm computing the non-overlapping factorization (as well as the non-Type 3

LPnF

values) by spending

O (t_{SA})

time for each charged character (instead of each visited node).

Let us reconsider that the factor of

LPnF [i - 1]

is of Type 3, where we charge the last edge

(u, v)

for computing

LPnF [i - 1]

with

δ

characters. Here, we observe that we actually spend only

O (t_{SA})

time for processing this edge. Hence, we have

δ - 1

characters as a credit left, which we can spend on traversing

O (δ)

descendants of

\tilde{u}

. If the factor starting at i is again of Type 3, we add the remaining credit to the newly gained credit, and recurse.

Regarding Section 3.3 (b), computing the length of the LCP of

T [j_{v} . .]

and

T [sufnum (λ) . .]

naïvely results again in

O (n^{2})

overall running time since we need to compute these lengths for all n positions. Here, instead of computing the length of such an LCP naïvely, we determine it by computing

str_depth (w)

of the LCA of

λ

and the leaf

λ^{'}

with suffix number

j_{v}

in

O (t_{SA})

time. We find

λ^{'}

with the RMQ data structure on

SA

that actually reports the leaf-rank instead of the suffix number

j_{v}

, which we obtain by accessing

SA

. Altogether, we obtain the same time and space bounds for computing the non-overlapping LZSS factorization:

Theorem 3.

We can compute the

2 n

-bits representation of

LPnF

within the same time and space as the non-overlapping LZSS factorization described in Theorem 1.

4. Substring Compression Query Problem

The substring compression query problem [6] is to find the compressed representation of

T [I]

, given a query interval

I \subset [1 . . n]

. Cormode and Muthukrishnan [6] solved this problem for LZSS with a data structure answering the query for

I

in

O (z_{S S [I]} lg n lg lg n)

time, where

z_{S S [I]}

denotes the number of produced LZSS factors of the queried substring

T [I]

. Their data structure uses

O (n {lg}^{ϵ} n)

space, and it can be constructed in

O (n lg n)

time. This result was improved by Keller et al. [34] to

O (z_{S S [I]} lg lg n)

query time for the same space or to

O (z_{S S [I]} {lg}^{ϵ} n)

query time for linear space. They also gave other trade-offs regarding query time and the size of the used data structure for larger data structures.

The main idea of tackling the problem for LZSS (and similarly for the classic LZ77 factorization) is to use a data structure answering interval LCP queries, which are usually answered by two-dimensional range successor/predecessor data structures. Most recently, Matsuda et al. [35] proposed a data structure answering an interval LCP query in

O (n^{ϵ})

time while taking

O (ϵ^{- 1} n (H_{0} (T) + 1))

bits of space, where

H_{0}

denotes the zeroth order empirical entropy. Therefore, they could implicitly answer a substring compression query in

O (z_{S S [I]} n^{ϵ})

time within compressed space. Recently, Bille et al. [36] proposed data structures storing the LZSS-compressed suffixes of T for answering a pattern matching query of an LZSS-compressed pattern P without decompressing P. Their proposed data structures also seem to be capable of answering substring compression queries.

As a warm-up for the more involved techniques for the LZ78 factorization below (cf. Section 4.5), we show that the techniques studied for the non-overlapping LZSS factorization in Section 3 can be adapted to the substring compression query problem under the restriction that the query interval starts at text position 1 (meaning that we query for prefixes instead of arbitrary substrings). Given an interval

I = [1 . . p]

for a text position

p \in [1 . . n]

, the algorithm of Theorem 1 achieves

O (p t_{SA})

time, where

t_{SA}

is the time to access

SA

. We can improve the running time by replacing the linear scan on Line 10 of Algorithm 1 with an exponential search [37]: As long as the condition on Line 9 is true (the condition for walking downwards), we do not increment the depth d by one, but instead double d. Now, when the condition on Line 9 becomes false, we may have overestimated the desired depth (we want the first d for which the condition on Line 9 becomes false). Thus, we need to additionally backtrack by performing a binary search on the interval

[d / 2 . . d]

. If we perform this search for computing a factor of length ℓ, then we double d at most

lg ℓ

times, and visit

O (lg ℓ)

depths during the binary search (see also Figure 6 for a visualization). In total, we obtain

O (z_{S S [1 . . p]} t_{SA} lg ℓ)

time, where ℓ is the length of the longest non-overlapping LZSS factor (here,

z_{S S [1 . . p]}

denotes the number of computed non-overlapping factors). Note that the result is not particularly interesting since we can just store the whole factorization of

T [1 . . n]

, scan for the leftmost factor

F_{x}

that ends at p or after, trim

F_{x}

’s length to end at p, and finally return

F_{1}, \dots, F_{x}

, all in

O (z_{S S [1 . . p]})

time.
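The doubling step followed by a binary search can be sketched generically. The sketch below assumes a monotone predicate on depths (true on a prefix of the path, false afterwards) and is not tied to the concrete condition of Algorithm 1; the function name and the parameter `max_depth` are hypothetical.

```python
def first_failing_depth(holds, max_depth):
    """Smallest depth d in [0 .. max_depth] with holds(d) == False,
    or max_depth + 1 if the predicate holds on the whole path."""
    if not holds(0):
        return 0
    d = 1
    # Exponential phase: double d while the condition still holds.
    while d <= max_depth and holds(d):
        d *= 2
    lo = d // 2                    # holds(lo) is known to be True
    hi = min(d, max_depth + 1)     # holds(hi) is False, or hi is past the leaf
    # Binary phase on [lo .. hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if holds(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

For a leaf of depth 18 whose first failing depth is 15, the exponential phase probes the depths 1, 2, 4, 8, and 16, and the binary phase then searches within [8 . . 16].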

Figure 6. Exponential search on a root-to-leaf path for the first node that does not meet a specific condition. In the setting of the non-overlapping LZSS factorization of Section 4 as well as in the LZ78 factorization of Section 4.5, the path from the root to a leaf

λ

contains a sub-path

π

including the root whose contained nodes all share a common property (for LZSS they meet the condition on Line 10 of Algorithm 1, while for LZ78 they are edge witnesses marked in the bit vector

B_{E}

). We symbolize the path from the root to

λ

as an array, where each node is represented by its depth. The sub-path

π

is visualized by the shaded entries. Here, the leaf

λ

has depth 18, and we want to find the first unshaded node on depth 15. The exponential search and the subsequent binary search in the range

[8 . . 16]

is conducted by following the edges below and above the path array, respectively.

To generalize this algorithm for an interval

I

with

b (I) > 1

, we need to change the definition of

j_{v}

for a node v in Section 3.2 to be the smallest suffix number of at least

b (I)

among the leaves in the subtree rooted at v. However, this additional complexity makes the approach selecting

j_{v}

with an RMQ on

SA

infeasible and leads us back to the interval LCP query problem.

4.1. Related Substring Compression Query Problems

As far as the author is aware, the substring compression query problem has only been studied for LZSS. However, Lifshits [38] mentioned that it is also feasible to think about the substring compression query problem in the context of straight-line programs (SLPs): Given an SLP of size g representing T, we can construct an SLP of size

O (g)

on

T [I]

in

O (g)

time. Actually, we can do better if the SLP is locally consistent. For that, we augment each non-terminal with the number of terminal symbols it expands to (after recursively expanding all non-terminals by their right hand sides). For a grammar such as HSP ([39] [Theorem 3.5]), we can compute the SLP variant of HSP (analogously to the SLP variant of ESP [40]) in

O (lg | I | {lg}^{*} n)

time, or ESP [41] in

O ({lg}^{2} | I | {lg}^{*} n)

time due to ([39] [Lemma 2.11]).

Here, we consider answering substring compression queries with the LZ78 factorization (which is actually also an SLP ([42] [Section VI.A.1])), i.e., the goal is to compress the substring

T [I]

with LZ78. Let

z_{78 [I]}

denote the number of LZ78 factors of the string

T [I]

. When the text is given as an SLP of size g, we can first transform this SLP into an SLP of

T [I]

in

O (g)

time, and then apply the algorithm of Bannai et al. [43] on this SLP to compute the LZ78 factorization in

O (g + z_{78 [I]} lg z_{78 [I]})

time. Let us consider from now on that T is given in its plain form as a string with

n lg σ

bits. A possible way is to apply first a solution for computing an LZ77 substring compression query, and then transform the LZ77-compressed substring into an SLP of size

O (z_{S S [I]} lg | I |)

in

O (z_{S S [I]} lg | I |)

time by a transformation due to Rytter [44], to finally apply the aforementioned algorithm of Bannai et al. [43]. The fastest LZ78 factorization algorithms [4,45] can answer an LZ78 substring compression query in

O (| I |)

time, independently of the alphabet size. For small alphabet sizes, the running time

O (| I | (lg lg | I |)^{2} / ({log}_{σ} | I | lg lg lg | I |))

of the LZ78 factorization algorithm of Jansson et al. [46] becomes even sub-linear in

| I |

. However, for large

I

and a compressible text T, these approaches are rather slow compared to the solutions for LZSS mentioned above, whose running times are bounded by the number of computed factors and a logarithmic multiplicative factor on the text length.

To obtain similar bounds for LZ78, we could adapt the approach of Bille et al. [36] to preprocess the LZ78 factorization of all suffixes of T, but that would give us a data structure with super-linear preprocessing time (and possibly super-linear space). Here, we borrow the idea from Nakashima et al. [45] to superimpose the suffix tree with the LZ78 trie, and use a data structure for answering nearest marked ancestor queries to find the lowest marked suffix tree node on the path from the root to a leaf. This data structure [47] takes

O (n lg n)

bits of space, and can answer a nearest marked ancestor query in

O (1)

amortized time. We are unaware of any improvements for this type of query, even in light of the fact that they only need to answer fringe marked ancestor queries, a notion coined by Breslauer and Italiano [48], which is a special case of nearest marked ancestor queries: in the fringe marked ancestor query problem, the root of a tree (here: the suffix tree) is already marked, and we can only mark the children of an already marked node. In what follows, we formally define the LZ78 factorization, and then propose approaches for the LZ78 substring compression query problem based on different suffix tree representations.

4.2. LZ78 Factorization

Stipulating that

F_{0}

is the empty string, a factorization

F_{1} \dots F_{z} = T

is called the LZ78 factorization [2] of T iff, for all

x \in [1 . . z]

, the factor

F_{x}

is the longest prefix of

T [| F_{1} \dots F_{x - 1} | + 1 . .]

with

F_{x} = F_{y} c

for some

y \in [0 . . x - 1]

and

c \in Σ

, that is,

F_{x}

is the longest possible previous factor

F_{y}

appended by the following character

T [| F_{1} \dots F_{x} |]

in the text. We say that y is the referred index of the factor

F_{x}

. A factor is thus determined by its referred index and its last character, which lets us encode the factors in a list of (integer, character)-pairs, as shown in the example of Figure 1 where we simplify the coding of factors with referred index 0 to plain characters (to ease the comparison with the LZSS variants). Figure 7 gives another visualization of the same example with the LZ trie, which represents each factor as a node (the root represents the factor

F_{0}

). The node representing the factor

F_{y}

has a child representing the factor

F_{x}

connected with an edge labeled by a character

c \in Σ

if and only if

F_{x} = F_{y} c

. An observation of Nakashima et al. [45] (Section 3) is that the LZ trie is a connected subgraph of the suffix trie containing its root. We can therefore simulate the LZ trie by marking nodes in the suffix trie. Since the suffix trie has

O (n^{2})

nodes, we use the suffix tree

ST

instead of the suffix trie to save space. In

ST

, however, not every LZ trie node is represented; these implicit LZ trie nodes are on the

ST

edges between two

ST

nodes (cf. Figure 8). Since the LZ trie is a connected subgraph of the suffix trie sharing the root node, implicit LZ trie nodes on the same

ST

edge have the property that they are all consecutive and that they start at the first character of the edge. To represent them, it thus suffices to augment an

ST

edge with a counter counting the number of its implicit LZ trie nodes. We call this counter an exploration counter, and we write

n_{v} \in [0 . . c (e)]

for the exploration counter of an edge

e = (u, v)

, which is stored in the lower node v that e connects to. Additionally, we call an

ST

node v an edge witness if

n_{v}

becomes incremented during the factorization. We additionally stipulate that the root of

ST

is an edge witness, whose exploration counter is always full. Then, all edge witnesses form a sub-graph of

ST

sharing the root node. We say that

n_{v}

is full if

n_{v} = c (parent (v), v)

, meaning that v is an explicit LZ78 trie node. We give an example in Figure 9.
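The definition of the LZ78 factorization can be illustrated with a plain dictionary-based LZ trie. This is a minimal sketch that does not use the suffix tree superimposition discussed above; the function names are hypothetical, and a trailing factor that coincides with an earlier factor (which cannot happen when the text ends with a unique $) is simply emitted as a repeated pair.

```python
def lz78_factorize(text):
    """Return the list of (referred index, character) pairs."""
    trie = {}        # (index of parent factor, character) -> factor index
    factors = []
    node = 0         # index 0 stands for the empty factor F_0 (the root)
    for pos, char in enumerate(text):
        child = trie.get((node, char))
        if child is not None and pos + 1 < len(text):
            node = child                 # extend the match along the LZ trie
        else:
            factors.append((node, char))
            trie.setdefault((node, char), len(factors))
            node = 0
    return factors

def lz78_decode(factors):
    strings = [""]
    for referred, char in factors:
        strings.append(strings[referred] + char)
    return "".join(strings[1:])
```

For the text aaababaaaba of Figure 7, this sketch yields the six factors a, aa, b, ab, aaa, and ba.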

Figure 7. The LZ78 factorization and its LZ trie for the text

T = aaababaaaba

. The xth factor is the concatenation of the edge labels of the path from the root to the node labeled with x. Its referred index is the label of its parent.

Figure 8. (Left) The suffix tree of T superimposed by the LZ trie (cf. Figure 7) computed on

T = aaababaaaba $

. Blue-colored

ST

nodes represent the explicit LZ trie nodes, i.e., those nodes that are present in

ST

. Implicit LZ trie nodes are represented by the small rounded nodes. The edge witnesses are the nodes with the preorder numbers 3, 5, 6, 12, and 16. (Right)

cpST

of T described in Section 4.8. The label of a node is the list of preorder numbers of the nodes in its respective heavy path. For instance, the heavy path from the root contains the nodes with the preorder numbers 1, 3, 5, and 6.

Figure 9. Excerpt of the suffix tree depicting three edge witnesses. Implicit trie nodes are represented by small rounded nodes (○), which are shaded if they are LZ trie nodes. The explicit LZ trie nodes u and w are shaded in blue. According to the figure,

n_{v} = 1

and

n_{w} = 2

. In particular, the exploration counters of u and w are full.

However, since we do not know the shape of the LZ trie in advance, we also do not know which nodes will become an edge witness. For the time being, we augment each node with an exploration counter, spending

O (n lg n)

bits in total. As in Section 3, we assume that our text T has length n and ends with a special symbol $ smaller than all other characters appearing in T.

4.3. Linear-Time Computation

Now, we can give our first result of Theorem 2 on the LZ78 substring compression query problem by a simple modification of the LZ78 factorization algorithm presented by Nakashima et al. [45]. This algorithm uses a pointer-based suffix tree, which is augmented by a nearest marked ancestor data structure [47], using altogether

O (n lg n)

bits of space.

The algorithm works as follows: Suppose that we have computed the factors

F_{1} \dots F_{x - 1}

and now want to compute

F_{x}

. Since

F_{x}

is a prefix of the suffix

T [p . .]

with

p = | F_{1} \dots F_{x - 1} | + 1

,

F_{x}

is a prefix of the concatenation of edge labels on the path

π

from the root to the leaf with suffix number p in the suffix tree. The additional requirement that

F_{x}

, excluding its last character, has to coincide with a preceding factor

F_{y}

means that

F_{y} = F_{x} [1 . . | F_{x} | - 1]

is the string label of the lowest LZ trie node on

π

; this LZ trie node is represented either

explicitly as an $ST$ node w being the lowest edge witness on $π$ ; or
implicitly by the exploration counter of w.

In either case, w is the edge witness of

F_{y}

and determines its length

| F_{y} | = str_depth (parent (w)) + n_{w}

. We create an LZ trie node representing

F_{x}

as follows:

If $n_{w}$ is not full, we make w the edge witness of $F_{x}$ , and increment $n_{w}$ by one.
Otherwise ( $n_{w}$ is full), we make the child $w^{'}$ of w on the path $π$ the edge witness of $F_{x}$ , and set $n_{w^{'}} \leftarrow 1$ .

It is left to find w, which we can do by traversing

π

from the root until reaching an edge

e = (u, v)

whose exploration counter

n_{v}

is less than the length of its label

c (e)

, where either u or v is w. However, a linear scan of

π

for finding w would result in

O (z)

time per factor. Here, the fringe marked ancestor queries come into the picture, which allow us to find a lowest edge witness in amortized constant time: by marking all edge witnesses, querying the lowest marked ancestor of the leaf with suffix number p yields either u or v. This gives us

O (1)

amortized time per LZ78 factor, and concludes the LZ78 factorization algorithm of Nakashima et al. [45] (Theorem 3).

Finally, to obtain the LZ78 factorization of

T [b (I) . . e (I)]

for a given interval

I

with

I = [b (I) . . e (I)]

, we do not start the computation at

T [1 . .]

, but directly at

T [b (I) . .]

, and terminate when a factor ends at

T [e (I)]

or protrudes

T [I]

to the right. In the latter case, we trim this factor. Hence, we can compute the factorization of

T [I]

in

O (z_{78 [I]})

time with

O (n lg n)

bits of space, in which we can store a pointer-based suffix tree on T.
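The interplay of exploration counters and the query interval can be sketched on a small pointer-based suffix tree. The sketch makes several simplifying assumptions: the suffix tree is built naively in quadratic time, the lowest edge witness is found by a linear scan from the root instead of a marked ancestor query, the interval is 0-based and inclusive, and only factor lengths are reported. All names are hypothetical.

```python
class Node:
    def __init__(self, start, end):
        self.start, self.end = start, end   # edge label is text[start:end]
        self.children = {}
        self.count = 0                      # exploration counter n_v

def build_suffix_tree(text):
    # Naive insertion of all suffixes; text must end with a unique '$'.
    n, root = len(text), Node(0, 0)
    for i in range(n):
        node, k = root, i
        while k < n:
            child = node.children.get(text[k])
            if child is None:
                node.children[text[k]] = Node(k, n)
                break
            edge_len, m = child.end - child.start, 0
            while m < edge_len and text[child.start + m] == text[k + m]:
                m += 1
            if m == edge_len:
                node, k = child, k + m
                continue
            mid = Node(child.start, child.start + m)   # split the edge
            node.children[text[k]] = mid
            child.start += m
            mid.children[text[child.start]] = child
            mid.children[text[k + m]] = Node(k + m, n)
            break
    return root

def lz78_query_lengths(text, root, b, e):
    lengths, touched, p = [], [], b
    while p <= e:
        node, k = root, p
        while True:
            child = node.children[text[k]]
            edge_len = child.end - child.start
            if child.count < edge_len:      # first edge that is not full
                break
            node, k = child, k + edge_len
        child.count += 1                    # create the new LZ trie node
        touched.append(child)
        length = min(k - p + child.count, e - p + 1)   # trim at e
        lengths.append(length)
        p += length
    for v in touched:                       # reset counters for the next query
        v.count = 0
    return lengths
```

Starting directly at position b and resetting the touched counters afterwards lets the same tree serve several queries; the linear scan is the part that the marked ancestor data structure replaces.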

4.4. Outline

In what follows, we want to study variants of this algorithm that use more lightweight data structures at the expense of additional running time. All LZ78 factorization algorithms presented here stick to the following general framework, which we call a pass: For each leaf

λ

whose suffix number is the starting position of a factor F, locate the lowest edge witness w on the path from the root to

λ

and create a new LZ trie node by incrementing either

n_{w}

or the exploration counter of its child on the path towards

λ

as described in Section 4.3. Since w determines the length of the factor F, we know the suffix number of the leaf that starts with the next factor.

After a pass, we know the LZ trie topology due to the exploration counters. In a subsequent pass (Section 4.7), we use this knowledge to associate an edge witness w with the index of the most recent factor having w as its edge witness such that we can identify the referred indices with this association. However, before that, we reduce the space (Section 4.5) and subsequently show how to perform a pass within the reduced working space (Section 4.6). Finally, we accelerate a root-to-leaf traversal for long factors in Section 4.8.

4.5. Space-Efficient Computation

In what follows, we give trade-offs for less space but slightly larger time bounds by using SST and CST. To get below

O (n lg n)

bits of space, we need to get rid of: (a) the

O (n lg n)

-bits marked ancestor data structure; and (b) the

O (n lg n)

bits for the exploration counters. For the latter (b), Fischer et al. [4] (Section 4.1) presented a data structure representing the exploration counters within

O (n)

bits on top of the suffix tree. For the former (a), we use level ancestor queries to simulate a fringe marked ancestor query: to this end, we mark all edge witnesses in a bit vector

B_{E}

of length

2 n

such that

B_{E} [j] = 1

if and only if the

ST

node with preorder rank j is an edge witness (remember that the number of nodes in

ST

is at most

2 n

). Suppose now that we want to compute the factor

F_{x}

. For that, we visit the leaf

λ

with suffix number

b (I) + | F_{1} \dots F_{x - 1} |

. As in Section 4.3, we want to find the lowest edge witness on the path from the root to

λ

, which we find with a fringe marked ancestor query. Here, we answer this query by scanning the path from the root to

λ

until reaching the lowest marked node in

B_{E}

. We can traverse linearly from the root to this node by querying

d \mapsto level_anc (λ, d)

for each depth

d \geq 0

. However, we then visit

O (| F_{x} |)

nodes for computing the factor

F_{x}

, or

O (| I |)

nodes in total. To improve this bound, we can again apply exponential search (cf. Figure 6). To see why that can be done, let

(v_{0}, \dots, v_{m})

be the path from the root

v_{0}

to

λ = v_{m}

. If each node

v_{d}

(for each depth

d \in [0 . . m]

) is represented by its preorder number, then

B_{E} [v_{0}] \dots B_{E} [v_{m}] = 1^{k + 1} 0^{m - k}

if the lowest edge witness has depth k, which is the smallest

k \leq | F_{x} |

such that

str_depth (level_anc (λ, k)) \geq | F_{x} | - 1

. Although we do not know k in advance, we can find the rightmost ‘

1

’ in

B_{E} [v_{0}] \dots B_{E} [v_{m}]

with an exponential search visiting

O (lg k) = O (lg | F_{x} |)

nodes (we evaluate

d \mapsto level_anc (λ, d)

for specific d and check each time whether the returned node is marked in

B_{E}

). Thus, we can determine

| F_{x} |

and

F_{x}

’s edge witness in

O (lg | F_{x} |)

time. Since

| F_{x} | \leq x

, we spend

O (z_{78 [I]} lg z_{78 [I]})

time in total.
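The search for the rightmost '1' in the pattern $1^{k + 1} 0^{m - k}$ can be sketched as follows. The sketch assumes that the root-to-leaf path is available as a list indexed by depth (standing in for level_anc) and that $B_{E}$ is given as a set of marked preorder numbers; both are simplifications, and the names are hypothetical. It also counts the probes to make the logarithmic behavior visible.

```python
def lowest_edge_witness_depth(path, marked):
    """path[d] is the preorder number of level_anc(leaf, d); the root
    path[0] is assumed to be marked. Returns (depth k, number of probes)."""
    probes = 0

    def is_marked(depth):
        nonlocal probes
        probes += 1
        return path[depth] in marked

    m = len(path) - 1
    d = 1
    while d <= m and is_marked(d):      # exponential phase
        d *= 2
    lo, hi = d // 2, min(d, m + 1)      # path[lo] marked; path[hi] unmarked or past the leaf
    while hi - lo > 1:                  # binary phase
        mid = (lo + hi) // 2
        if is_marked(mid):
            lo = mid
        else:
            hi = mid
    return lo, probes
```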

4.6. Navigation in Small Space

To complete our algorithm for SST and CST, it is left to study how to access the leaves when issuing the level ancestor queries. While the LZ78 factorization algorithms of Fischer et al. [4] used the fact that they can scan the leaves linearly in suffix number order to simulate the scan of the text in text order within their

O (n)

time budget, we want to accelerate this algorithm by visiting only the leaves whose suffix numbers match the starting positions of the factors. With the SST, we can select the leaf

λ

with suffix number

i \in [1 . . n]

in

O (1 / ϵ)

time since we have access to

ISA [i]

returning the leaf-rank of

λ

.

With the CST, we can visit the leaf with the subsequent suffix number with

next_leaf

in constant time, but may need

O (n)

time to visit an arbitrary leaf. Here, the idea is to store a sampling of

ISA

within

O (n lg σ)

bits of space during a precomputation step. We can produce the values of this sampling by iterating over

next_leaf

such that we obtain an array that stores in its ith entry the leaf-rank of the leaf with suffix number

i {log}_{σ} n

. Consequently, we can jump to a leaf with suffix number

j \in [1 . . n]

in

O ({log}_{σ} n)

time by jumping to the closest sampled predecessor of j, and subsequently applying

next_leaf

O ({log}_{σ} n)

times to reach the leaf with suffix number j. To sum up, we need

O ({log}_{σ} n)

time to traverse between two corresponding leaves. The total time becomes

O (ϵ^{- 1} z_{78 [I]} lg z_{78 [I]})

and

O (z_{78 [I]} (lg z_{78 [I]} + {log}_{σ} n))

for the SST and the CST, respectively.
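The sampling of ISA for the CST can be sketched as follows. In this illustration, next_leaf is simulated by a precomputed table derived from a naively built suffix array, and the sampling distance is a free parameter `step` (the text uses ${log}_{σ} n$); these are assumptions of the sketch, and the names are hypothetical. Positions are 0-based.

```python
def suffix_array(text):
    # Naive construction, for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_sampled_isa(text, step):
    n = len(text)
    sa = suffix_array(text)
    isa = [0] * n
    for rank, suffix in enumerate(sa):
        isa[suffix] = rank
    # next_leaf[r]: leaf-rank of the leaf whose suffix number is SA[r] + 1.
    next_leaf = [isa[s + 1] if s + 1 < n else -1 for s in sa]
    samples = []
    rank = isa[0]                 # leaf of the first suffix
    for i in range(n):            # one pass over the text via next_leaf
        if i % step == 0:
            samples.append(rank)
        rank = next_leaf[rank]
    return samples, next_leaf

def leaf_rank_of_suffix(j, samples, next_leaf, step):
    rank = samples[j // step]     # closest sampled predecessor of j
    for _ in range(j % step):     # fewer than `step` applications of next_leaf
        rank = next_leaf[rank]
    return rank
```

Only every `step`-th value of ISA is kept, so the jump to an arbitrary leaf costs at most `step` - 1 applications of next_leaf.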

4.7. LZ78 Coding

Finally, to obtain the LZ78 coding, we need to compute the referred indices. In a classic LZ78 trie, we would augment each trie node with the index of its corresponding factor. Here, we additionally need a trick for the implicitly represented LZ trie nodes: For them, we can now leverage the edge witnesses by augmenting each of them with the factor index of the currently lowest LZ trie node created on its ingoing edge. Fortunately, we know all nodes that become edge witnesses thanks to

B_{E}

(cf. Section 4.5) marking the preorder numbers of all edge witnesses. We now enhance

B_{E}

with a rank-support such that we can give each edge witness a rank within

[1 . . z_{78 [I]}]

. Therefore, we can maintain the most recent factor indices corresponding to each edge witness in an array W of

z_{78 [I]} lg z_{78 [I]}

bits. We again conduct a pass as described in Section 4.4, but this time we use W to write out the referred indices (see ([4] [Section 4.2.1 Pass (b)]) for a detailed description on how to read the referred indices from W). By doing so, we finally obtain Theorem 2. For an overview, we present the obtained complexity bounds in Table 1.

Table 1. Complexities for answering an LZ78 substring compression query with different suffix tree representations. The query is on an interval

I \subset [1 . . n]

. A query additionally needs an array of

z_{78 [I]} lg n = O (n lg σ)

bits of space as described in Section 4.7. If there are characters of the alphabet appearing nowhere in the text, we additionally need to keep the text available during a query, which adds

n lg σ

bits to the query space complexity of the SST solution.

4.8. Centroid-Path Decomposed Suffix Tree

If the length

ℓ \leq z_{78 [I]}

of the longest factor is so large that

lg ℓ = ω (lg lg n)

, then we can speed up the exponential search of Section 4.5 by searching in the centroid-path decomposed suffix tree

cpST

. The centroid path decomposition [49] of the suffix tree is defined as follows: For each internal node, we call its child whose subtree is the largest among all its siblings (ties are broken arbitrarily if there are multiple such children) a heavy node, while we call all other children light nodes. Additionally, we make the root and all leaves light nodes (here we differ from the standard definition because we need a one-to-one relationship between leaves in the original tree and in the path-decomposed one). A heavy path is a path from a light node u to the parent of a leaf containing, except for u, only heavy nodes. There is a one-to-one relationship between light nodes and heavy paths. Since heavy paths do not overlap, we can contract all heavy paths to single nodes and thus form

cpST

(see ([49] [Section 4.2]) for details and Figure 8 for an example). The centroid path decomposition is helpful, because the number of light nodes on a path from the root to a leaf is

O (lg n)

, which means that a path from the root to a leaf in

cpST

contains

O (lg n)

nodes. This can be seen by the fact that the subtree size of a light node is at most the subtree size of its heavy sibling, and hence at most half of the subtree size of its parent; thus, when visiting a light node during a top-down traversal in

ST

, we at least halve the number of

ST

nodes we can visit from then on. Consequently, a root-to-leaf path in

cpST

has

O (lg n)

nodes.
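The decomposition can be tried out on a small explicit tree. The sketch below is an illustration under assumptions, not the representation used in this article: the tree is given as plain child lists, and the helper names `subtree_sizes`, `mark_light`, and `light_nodes_per_leaf` are hypothetical. It follows the definition above, in which the root and all leaves are light, and checks the number of light nodes on every root-to-leaf path against the bound ⌊lg n⌋ + 2, where the additive 2 accounts for the root and for a leaf that is the largest child of its parent.

```python
import math


def subtree_sizes(children, root=0):
    """Number of nodes in each subtree; children[v] lists the children of v."""
    size = [1] * len(children)
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    for v in reversed(order):          # every child is handled before its parent
        for c in children[v]:
            size[v] += size[c]
    return size


def mark_light(children, root=0):
    """is_light[v] per the definition in the text: one heavy child per internal
    node (ties broken arbitrarily), while the root and all leaves stay light."""
    size = subtree_sizes(children, root)
    is_light = [True] * len(children)
    for kids in children:
        if kids:
            heavy = max(kids, key=lambda c: size[c])
            if children[heavy]:        # a leaf is never turned into a heavy node
                is_light[heavy] = False
    return is_light


def light_nodes_per_leaf(children, is_light, root=0):
    """Map each leaf to the number of light nodes on its root-to-leaf path."""
    counts, stack = {}, [(root, 1)]    # the root itself is light
    while stack:
        v, seen = stack.pop()
        if not children[v]:
            counts[v] = seen
        for c in children[v]:
            stack.append((c, seen + (1 if is_light[c] else 0)))
    return counts


# Complete binary tree with 15 nodes: node v < 7 has the children 2v+1 and 2v+2.
children = [[2 * v + 1, 2 * v + 2] if v < 7 else [] for v in range(15)]
is_light = mark_light(children)
counts = light_nodes_per_leaf(children, is_light)
bound = math.floor(math.log2(len(children))) + 2
print(max(counts.values()), "<=", bound)
```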

For that to be of use, we need a connection between

ST

and

cpST

: observe that the number of leaves and their respective order is the same in both trees, such that we can map leaves by their leaf-ranks in constant time. If we mark the light nodes in the suffix tree in a bit vector

B_{L}

, then the rank of a light node v in

B_{L}

is the preorder number of the node in

cpST

representing the heavy path whose highest node is v. To stay within our space budget, we represent the tree topology of

cpST

with a BP sequence (which we briefly introduced in Section 2). First, we mark all light nodes in

B_{L}

by an Euler tour, where we query the

ST

topology for the subtree size rooted at an arbitrary node in constant time. Next, we perform a depth-first search traversal on the suffix tree while producing the BP sequence of

cpST

. For that, we use a stack to store the light node ancestors of the currently visited node. Since a node has

O (lg n)

light nodes as ancestors, the stack uses

O ({lg}^{2} n)

bits of space. Finally, we endow

B_{L}

with a select-support such that we can map a node of

cpST

to its corresponding light node in

ST

.
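The construction just described can be sketched on an explicit tree. The code below is a simplified illustration: it assumes the tree is given as child lists together with a precomputed light-node flag per node, it represents the bit vector and the BP sequence as a Python list and string, and it replaces the constant-time rank and select structures by linear scans. All function names are hypothetical.

```python
def cpst_from_tree(children, is_light, root=0):
    """Return (b_l, bp), where b_l[i] = 1 iff the node with preorder rank i is
    light, and bp is the balanced-parentheses sequence of the contracted tree."""
    b_l, bp = [], []
    stack = [(root, False)]
    while stack:
        v, leaving = stack.pop()
        if leaving:
            bp.append(")")               # all nodes below the light node v are done
            continue
        b_l.append(1 if is_light[v] else 0)
        if is_light[v]:
            bp.append("(")               # a light node opens a new cpST node
            stack.append((v, True))
        for c in reversed(children[v]):  # reversed, so children pop in order
            stack.append((c, False))
    return b_l, "".join(bp)


def rank1(bits, i):
    """Number of ones in bits[0..i] (linear-time stand-in for rank-support)."""
    return sum(bits[: i + 1])


def select1(bits, k):
    """Position of the k-th one in bits, k counted from 1 (stand-in for select)."""
    seen = 0
    for i, b in enumerate(bits):
        seen += b
        if b and seen == k:
            return i
    raise ValueError("bits contains fewer than k ones")


# Nodes are numbered by preorder rank (starting at 0); node 1 is the only heavy node.
children = [[1, 4], [2, 3], [], [], [5, 6], [], []]
is_light = [True, False, True, True, True, True, True]
b_l, bp = cpst_from_tree(children, is_light)
print(b_l, bp)
```

In this example, the light node with preorder rank 4 has rank 4 in the bit vector, so it corresponds to the fourth node of the contracted tree in preorder, and a select query with argument 4 maps that node back.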

Our algorithm conducting a pass works as follows: Suppose that we visit the leaf

λ

with suffix number

b (I) + | F_{1} | + \dots + | F_{x - 1} |

. This time, we map

λ

to the leaf

λ^{'}

of

cpST

having the same leaf-rank as

λ

in

ST

. Next, we apply the exponential search with

d \mapsto level_anc (λ^{'}, d)

on

cpST

, to obtain a

cpST

node representing the heavy path whose highest node is a light node v, i.e., v is the lowest light node on the

ST

path from the root to the leaf

λ

that is an edge witness. Since a root-to-leaf path in

ST

has

O (lg n)

light nodes, we spend

O (min (lg | F_{x} |, lg lg n))

time to find v.
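The search pattern itself is independent of the tree representation and can be sketched on depths alone. In the following illustration, the hypothetical callback `holds(d)` stands for the test whether the ancestor at depth d of the leaf (obtained by a level ancestor query) is an edge witness; the sketch assumes that the condition holds on a prefix of the path starting at the root and fails from some depth on. The function name and the depth convention (root at depth 0) are assumptions made for this example.

```python
def first_failing_depth(leaf_depth, holds):
    """Smallest depth d in [1..leaf_depth] for which holds(d) is false, or
    leaf_depth + 1 if the condition holds on the whole path.

    Assumes holds is monotone: true up to some depth, false afterwards.
    """
    lo, hi = 0, 1                      # lo: deepest depth known to satisfy holds
    # Exponential phase: double the probed depth until the condition fails.
    while hi <= leaf_depth and holds(hi):
        lo = hi
        hi *= 2
    hi = min(hi, leaf_depth + 1)       # holds fails at hi, or hi is past the leaf
    # Binary phase: the answer lies in (lo, hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if holds(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Setting of Figure 6: the leaf has depth 18 and the first unshaded node has depth 15.
probes = []

def shaded(d):
    probes.append(d)
    return d < 15

answer = first_failing_depth(18, shaded)
print(answer, probes)
```

The number of probes is logarithmic in the returned depth, which is why searching on the shorter root-to-leaf paths of the contracted tree pays off.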

Finally, it is left to move from v to the lowest edge witness on the path from v to

λ

in

ST

. For that, we use a dictionary

D

that associates a light node with the number of edge witnesses in its heavy path. This number is at most

z_{78 [I]} \leq | I |

, and thus it can be stored in

lg | I |

bits, while a light node can be represented with its preorder number in

O (lg n)

bits.

D

has to be dynamic since we do not know in advance which nodes will become edge witnesses; we can make use of one of the dynamic dictionaries given in Table 2, where

t_{D}

denotes the time for an operation such as a lookup or an insertion and

s_{D}

denotes the dictionary size in bits.

Table 2. Dynamic dictionary representations usable in our

cpST

approach (cf. Section 4.8) for

D

mapping a light node represented in

lg n

bits to the lowest edge witness within its heavy path represented in

lg | I |

bits.

z : = z_{78 [I]}

denotes the number of LZ78 factors of

T [I]

, which is an upper bound on the number of edge witnesses. An operation is a lookup or an insertion. We are interested in instances with

t_{D} = o (lg z)

(since, otherwise, the approach of Section 4.5 is favorable).

ϵ \in (0, 1)

is a selectable constant.

Now, suppose that

D

stores that d nodes in v’s heavy path are edge witnesses. Let w be the next light node on the path from v to

λ

(i.e., w is the highest light node on the path from the root to

λ

whose exploration counter is still zero).

If $d - 1$ is at least the height difference between v and w, then the parent u of w is already an edge witness, and u is a node on the heavy path of v. If the exploration counter of u is full, i.e., $n_{u} = c (parent (u), u)$ , then we increment the exploration counter of w, and hence make w an edge witness and add w to $D$ .
Otherwise ( $d - 1$ is smaller than this height difference), the node whose exploration counter we want to increment is within the heavy path, and is either the dth or $(d + 1)$ th descendant of v.

In total, for

z : = z_{78 [I]}

, we can improve the

z lg z

factor in the time bounds to

z \cdot min (lg z, t_{D} + lg lg n)

, which is

z \cdot min (lg z, lg z / lg lg z + lg lg n)

when implementing

D

with the dynamic dictionary of Raman et al. [50], costing

s_{D} = z lg (n z) + o (z)

bits of additional working space during a query. More formally:

Theorem 4.

Given a text

T [1 . . n]

of length n whose characters are drawn from an alphabet with size

σ = n^{O (1)}

, we can compute a data structure on T in

O (n)

time that computes, given an interval

I \subset [1 . . n]

, the LZ78 factorization of

T [I]

in

$O (z_{78 [I]} ({log}_{σ} n + min (lg z_{78 [I]}, t_{D} + lg lg n)))$ time using $O (n lg σ) + s_{D}$ bits of space, or
$O (ϵ^{- 1} z_{78 [I]} min (lg z_{78 [I]}, t_{D} + lg lg n))$ time using $(1 + ϵ) n lg n + s_{D} + O (n)$ bits of space,

where

z_{78 [I]}

is the number of computed LZ78 factors,

ϵ \in (0, 1]

is a selectable constant, and

t_{D}

and

s_{D}

are the time and space complexities of a dynamic dictionary associating a

lg n

-bit integer with a

lg | I |

-bit value (cf. Table 2). Similar to Theorem 2, we need the read-only text stored for queries if there is a character in the alphabet that does not appear in T.
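As a quick reading aid for the bound, one can instantiate $t_{D}$ with the amortized operation time of Raman et al. [50] from Table 2. The following derivation is only an illustrative check, with $z := z_{78 [I]}$ and under the assumption $\lg z = ω (\lg \lg n)$, which follows from the precondition on the longest factor stated at the beginning of this section:

```latex
\begin{align*}
t_{D} + \lg \lg n &= O\!\left(\frac{\lg z}{\lg \lg z}\right) + \lg \lg n, \\
\frac{\lg z}{\lg \lg z} &= o(\lg z) \quad \text{as } z \to \infty, \\
\lg \lg n &= o(\lg z) \quad \text{by assumption}, \\
\Longrightarrow \quad \min\bigl(\lg z,\ t_{D} + \lg \lg n\bigr) &= o(\lg z).
\end{align*}
```

In this regime, the additive term per factor is asymptotically smaller than the $\lg z$ term of the plain exponential search of Section 4.5.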

5. Conclusions

We used techniques introduced by Fischer et al. [4], which work on the succinct suffix tree (SST) and the compressed suffix tree (CST), to tackle the non-overlapping LZSS factorization and the LZ78 substring compression query problem. One of the main techniques is the usage of level ancestor queries to traverse a root-to-leaf path. For computing the non-overlapping LZSS factorization, our idea was to merge these techniques with the algorithm of Gusfield [31] working in root-to-leaf traversals. To answer an LZ78 substring compression query, we combined exponential search with the level ancestor queries and could accelerate this by first searching in the centroid path-decomposed suffix tree

cpST

whenever the factor lengths become large.

We wonder whether we can improve the space bounds for solving the semi-dynamic fringe marked ancestor problem (addressed in Section 4), where updates are restricted to marking a node that is a child of an already marked node; hence, the marked nodes form a connected subgraph of the suffix tree sharing at least the root. Without the need of the

O (n)

words for the marked ancestor data structure, it becomes interesting to devise algorithms computing the reversed LZSS factorization [52] (see ([53] [Chapter 3.6.2])) in low memory.

Funding

This work was funded by the JSPS KAKENHI Grant Number JP18F18120.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Johannes Fischer for helpful comments on the part of the non-overlapping LZSS factorization (Section 3) as part of the thesis ([53] [Section 3.6.1]). We thank the anonymous reviewers for their insightful remarks on improving the quality of this article. A reviewer discovered that Type 3 factors had been neglected in the time complexity analysis of the

LPnF

computation in an early version of the manuscript.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343.
2. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536.
3. Storer, J.A.; Szymanski, T.G. Data compression via textual substitution. J. ACM 1982, 29, 928–951.
4. Fischer, J.; Tomohiro, I.; Köppl, D.; Sadakane, K. Lempel-Ziv Factorization Powered by Space Efficient Suffix Trees. Algorithmica 2018, 80, 2048–2081.
5. Crochemore, M.; Tischler, G. Computing Longest Previous non-overlapping Factors. Inf. Process. Lett. 2011, 111, 291–295.
6. Cormode, G.; Muthukrishnan, S. Substring compression problems. In Proceedings of the SODA, Vancouver, BC, Canada, 23–25 January 2005; pp. 321–330.
7. Jacobson, G. Space-efficient Static Trees and Graphs. In Proceedings of the FOCS, Research Triangle Park, NC, USA, 30 October–1 November 1989; pp. 549–554.
8. Clark, D.R. Compact Pat Trees. Ph.D. Thesis, University of Waterloo, Waterloo, ON, Canada, 1996.
9. Manber, U.; Myers, E.W. Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 1993, 22, 935–948.
10. Sadakane, K. Compressed Suffix Trees with Full Functionality. Theory Comput. Syst. 2007, 41, 589–607.
11. Grossi, R.; Vitter, J.S. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM J. Comput. 2005, 35, 378–407.
12. Hon, W.; Sadakane, K.; Sung, W. Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. SIAM J. Comput. 2009, 38, 2162–2178.
13. Munro, J.I.; Navarro, G.; Nekrich, Y. Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time. In Proceedings of the SODA, Barcelona, Spain, 16–19 January 2017; pp. 408–424.
14. Navarro, G.; Sadakane, K. Fully Functional Static and Dynamic Succinct Trees. ACM Trans. Algorithms 2014, 10, 16:1–16:39.
15. Kosolobov, D.; Shur, A.M. Comparison of LZ77-type parsings. Inf. Process. Lett. 2019, 141, 25–29.
16. Kolpakov, R.M.; Kucherov, G. Finding approximate repetitions under Hamming distance. Theor. Comput. Sci. 2003, 303, 135–156.
17. Duval, J.; Kolpakov, R.; Kucherov, G.; Lecroq, T.; Lefebvre, A. Linear-time computation of local periods. Theor. Comput. Sci. 2004, 326, 229–240.
18. Kociumaka, T.; Kubica, M.; Radoszewski, J.; Rytter, W.; Walen, T. A linear time algorithm for seeds computation. In Proceedings of the SODA, Kyoto, Japan, 17–19 January 2012; pp. 1095–1112.
19. Butrak, T.; Chairungsee, S. A Linear Time Algorithm for Finding Tandem Repeat in DNA Sequences. In Proceedings of the ICIT, Melbourne, Australia, 13–15 February 2019; pp. 426–429.
20. Lothaire, M. Applied Combinatorics on Words; Encyclopedia of Mathematics and Its Applications, Cambridge University Press: Cambridge, UK, 2005.
21. Crochemore, M.; Iliopoulos, C.S.; Kubica, M.; Rytter, W.; Walen, T. Efficient algorithms for three variants of the LPF table. J. Discret. Algorithms 2012, 11, 51–61.
22. Chairungsee, S.; Butrak, T.; Chareonrak, S.; Charuphanthuset, T. Longest Previous Non-overlapping Factors Computation. In Proceedings of the DEXA, Valencia, Spain, 1–4 September 2015; pp. 5–8.
23. Chairungsee, S.; Crochemore, M. Longest Previous Non-overlapping Factors Table Computation. In Proceedings of the COCOA, LNCS, Shanghai, China, 16–18 December 2017; Volume 10628, pp. 483–491.
24. Chairungsee, S. Efficient Approaches to Compute Longest Previous Non-overlapping Factor Array. Fundam. Inform. 2018, 163, 291–304.
25. Ohlebusch, E.; Weber, P. On the Computation of Longest Previous Non-overlapping Factors. In Proceedings of the SPIRE, LNCS, Segovia, Spain, 7–9 October 2019; Volume 11811, pp. 372–381.
26. Crochemore, M.; Ilie, L. Computing Longest Previous Factor in linear time and applications. Inf. Process. Lett. 2008, 106, 75–80.
27. Bannai, H.; Inenaga, S.; Köppl, D. Computing All Distinct Squares in Linear Time for Integer Alphabets. In Proceedings of the CPM, LIPIcs, Copenhagen, Denmark, 17–19 June 2017; Volume 78, pp. 22:1–22:18.
28. Ferragina, P.; Manzini, G. Opportunistic Data Structures with Applications. In Proceedings of the FOCS, Redondo Beach, CA, USA, 12–14 November 2000; pp. 390–398.
29. Okanohara, D.; Sadakane, K. An Online Algorithm for Finding the Longest Previous Factors. In Proceedings of the ESA, LNCS, Karlsruhe, Germany, 15–17 September 2008; Volume 5193, pp. 696–707.
30. Prezza, N.; Rosone, G. Faster Online Computation of the Succinct Longest Previous Factor Array. In Proceedings of the CiE, LNCS, Fisciano, Italy, 29 June–3 July 2020; Volume 12098, pp. 339–352.
31. Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology; Cambridge University Press: Cambridge, UK, 1997.
32. Fischer, J.; Heun, V. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM J. Comput. 2011, 40, 465–492.
33. Fischer, J.; Mäkinen, V.; Navarro, G. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 2009, 410, 5354–5364.
34. Keller, O.; Kopelowitz, T.; Feibish, S.L.; Lewenstein, M. Generalized substring compression. Theor. Comput. Sci. 2014, 525, 42–54.
35. Matsuda, K.; Sadakane, K.; Starikovskaya, T.; Tateshita, M. Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP. In Proceedings of the CPM, LIPIcs, Aarhus, Denmark, 30 June–2 July 2020; Volume 161, pp. 23:1–23:13.
36. Bille, P.; Gørtz, I.L.; Steiner, T.A. String Indexing with Compressed Patterns. In Proceedings of the STACS, LIPIcs, Montpellier, France, 10–13 March 2020; Volume 154, pp. 10:1–10:13.
37. Bentley, J.L.; Yao, A.C. An Almost Optimal Algorithm for Unbounded Searching. Inf. Process. Lett. 1976, 5, 82–87.
38. Lifshits, Y. Solving Classical String Problems on Compressed Texts. In Proceedings of the Combinatorial and Algorithmic Foundations of Pattern and Association Discovery, Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 14–19 March 2006. Number 06201.
39. Fischer, J.; Tomohiro, I.; Köppl, D. Deterministic Sparse Suffix Sorting in the Restore Model. ACM Trans. Algorithms 2020, 16, 50:1–50:53.
40. Maruyama, S.; Nakahara, M.; Kishiue, N.; Sakamoto, H. ESP-index: A compressed index based on edit-sensitive parsing. J. Discret. Algorithms 2013, 18, 100–112.
41. Cormode, G.; Muthukrishnan, S. The string edit distance matching problem with moves. ACM Trans. Algorithms 2007, 3, 2:1–2:19.
42. Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The smallest grammar problem. IEEE Trans. Inf. Theory 2005, 51, 2554–2576.
43. Bannai, H.; Gawrychowski, P.; Inenaga, S.; Takeda, M. Converting SLP to LZ78 in almost Linear Time. In Proceedings of the CPM, LNCS, Bad Herrenalb, Germany, 17–19 June 2013; Volume 7922, pp. 38–49.
44. Rytter, W. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 2003, 302, 211–222.
45. Nakashima, Y.; Tomohiro, I.; Inenaga, S.; Bannai, H.; Takeda, M. Constructing LZ78 tries and position heaps in linear time for large alphabets. Inf. Process. Lett. 2015, 115, 655–659.
46. Jansson, J.; Sadakane, K.; Sung, W. Linked Dynamic Tries with Applications to LZ-Compression in Sublinear Time and Space. Algorithmica 2015, 71, 969–988.
47. Alstrup, S.; Husfeldt, T.; Rauhe, T. Marked Ancestor Problems. In Proceedings of the FOCS, Palo Alto, CA, USA, 8–11 November 1998; pp. 534–544.
48. Breslauer, D.; Italiano, G.F. Near real-time suffix tree construction via the fringe marked ancestor problem. J. Discret. Algorithms 2013, 18, 32–48.
49. Ferragina, P.; Grossi, R.; Gupta, A.; Shah, R.; Vitter, J.S. On searching compressed string collections cache-obliviously. In Proceedings of the PODS, Vancouver, BC, Canada, 9–11 June 2008; pp. 181–190.
50. Raman, R.; Raman, V.; Rao, S.S. Succinct Dynamic Data Structures. In Proceedings of the WADS, LNCS, Providence, RI, USA, 8–10 August 2001; Volume 2125, pp. 426–437.
51. Arbitman, Y.; Naor, M.; Segev, G. Backyard Cuckoo Hashing: Constant Worst-Case Operations with a Succinct Representation. In Proceedings of the FOCS, Las Vegas, NV, USA, 23–26 October 2010; pp. 787–796.
52. Kolpakov, R.; Kucherov, G. Searching for gapped palindromes. Theor. Comput. Sci. 2009, 410, 5365–5373.
53. Köppl, D. Exploring Regular Structures in Strings. Ph.D. Thesis, TU Dortmund, Dortmund, Germany, 2018.

Figure 1. Juxtaposition of the overlapping LZSS factorization, the non-overlapping LZSS factorization, and the LZ78 factorization on the string

T = aaababaaaba

. A factor is visualized by a rounded rectangle. Its coding consists of a mere character if it has no reference; otherwise, its coding consists of its referred position and its length for both LZSS variants, or of its referred index and its last character for LZ78.

Figure 2. (Left) Suffix tree of the text

T = ababaaa $

with the witness nodes and the corresponding leaves of the non-overlapping LZSS factorization highlighted in red and in green, respectively. We additionally mark the string ab with an implicit node whose string label is equal to the factor of Type 3. The nodes are labeled by their preorder numbers. The suffix number of each leaf

λ

is the underlined number drawn in dark yellow below

λ

. (Right) Non-overlapping LZSS factorization of T.

Figure 3. Determining the witness of a factor F whose starting position is the suffix number of the leaf

λ

. Straight arcs symbolize edges, while curly arcs symbolize paths that can visit multiple nodes (which are not visualized). Given that

j_{w}

is the smallest suffix number among all leaves in the subtree rooted at a node w, and that u is the lowest ancestor of

λ

with

[j_{u} . . j_{u} + str_depth (u) - 1] \cap [sufnum (λ) . . sufnum (λ) + str_depth (u) - 1] = \emptyset

, then either u or its child v is the witness of F (see Section 3.2 for an explanation). The idea behind detecting whether the two intervals are intersecting is that a factor starting at

sufnum (λ)

of length

str_depth (u)

would be of Type 1 or Type 2 with referred position

j_{u}

. In fact, if F is of Type 1, then its witness is the lowest ancestor of

λ

having a leaf with a suffix number smaller than

sufnum (λ)

in its subtree (this definition coincides with the witnesses of the overlapping LZSS factorization of ([4] [Section 2.3])). It is possible that

j_{u} = j_{v}

, i.e., the leaf with suffix number

j_{u}

is also in the subtree rooted at v. We can observe this case in Figure 4.

Figure 4. Special case of the setting considered in Figure 3 for factors of Types 2 and 3. Here, we assign u and v the same roles as in Figure 3, but we additionally assume that

j_{v} = j_{u}

and

sufnum (λ) \in I : = [j_{u} + str_depth (u) . . j_{u} + str_depth (v) - 1]

. If

sufnum (λ) = j_{u} + str_depth (u)

, as in the right figure, then the factor F of

λ

starting at

sufnum (λ)

is of Type 2, and the witness of F is u, although u is not the lowest ancestor of

λ

having

λ

and

j_{u}

in its subtree. If

sufnum (λ) \in I ∖ {j_{u} + str_depth (u)}

, then F is of Type 3 and the witness of F is v; the witness of F is v even if

λ

and the leaf with suffix number

j_{v}

are shared by a descendant of v as shown in the left figure.

Figure 5. Computing

LPnF [i]

from

LPnF [i - 1]

by simulating a suffix link from u to

\tilde{u}

(cf. Section 3.5). Straight arcs symbolize edges, while curly arcs symbolize paths visiting multiple nodes (which are not visualized). We have

str_depth (u) + δ = LPnF [i - 1]

, and hence

str_depth (\tilde{u}) + δ = LPnF [i - 1] - 1

. We have

δ > 0

if and only if the factor starting at text position

i - 1

is of Type 3. In that case, we additionally walk down from

\tilde{u}

towards

\tilde{λ}

to find the lowest node

\tilde{v}

with

str_depth (\tilde{v}) \leq LPnF [i - 1] - 1 = str_depth (u) + δ - 1

. While u and v are directly connected with an edge, the path from

\tilde{u}

to

\tilde{v}

may contain multiple edges.

Figure 6. Exponential search on a root-to-leaf path for the first node that does not meet a specific condition. In the setting of the non-overlapping LZSS factorization of Section 3 as well as in the LZ78 factorization of Section 4.5, the path from the root to a leaf

λ

contains a sub-path

π

including the root whose contained nodes all share a common property (for LZSS they meet the condition on Line 10 of Algorithm 1, while for LZ78 they are edge witnesses marked in the bit vector

B_{E}

). We symbolize the path from the root to

λ

as an array, where each node is represented by its depth. The sub-path

π

is visualized by the shaded entries. Here, the leaf

λ

has depth 18, and we want to find the first unshaded node at depth 15. The exponential search and the subsequent binary search in the range

[8 . . 16]

are conducted by following the edges below and above the path array, respectively.

Figure 7. The LZ78 factorization and its LZ trie for the text

T = aaababaaaba

. The xth factor is the concatenation of the edge labels of the path from the root to the node labeled with x. Its referred index is the label of its parent.

Figure 8. (Left) The suffix tree of T superimposed by the LZ trie (cf. Figure 7) computed on

T = aaababaaaba $

. Blue-colored

ST

nodes represent the explicit LZ trie nodes, i.e., those nodes that are present in

ST

. Implicit LZ trie nodes are represented by the small rounded nodes. The edge witnesses are the nodes with the preorder numbers 3, 5, 6, 12, and 16. (Right)

cpST

of T described in Section 4.8. The label of a node is the list of preorder numbers of the nodes in its respective heavy path. For instance, the heavy path from the root contains the nodes with the preorder numbers 1, 3, 5, and 6.

Figure 9. Excerpt of the suffix tree depicting three edge witnesses. Implicit trie nodes are represented by small rounded nodes (○), which are shaded if they are LZ trie nodes. The explicit LZ trie nodes u and w are shaded in blue. According to the figure,

n_{v} = 1

and

n_{w} = 2

. In particular, the exploration counters of u and w are full.

Table 1. Complexities for answering an LZ78 substring compression query with different suffix tree representations. The query is on an interval

I \subset [1 . . n]

. A query additionally needs an array of

z_{78 [I]} lg n = O (n lg σ)

bits of space as described in Section 4.7. If there are characters of the alphabet appearing nowhere in the text, we additionally need to keep the text available during a query, which adds

n lg σ

bits to the query space complexity of the SST solution.

Construction
Data Structure	Time	Space in Bits
suffix tree [45]	$O (n)$	$O (n lg n)$
SST [4] [Section 2.2.3]	$O (n ϵ^{- 1})$	$n lg σ + (1 + ϵ) n lg n + O (n)$
CST [4] [Section 2.2.2]	$O (n)$	$O (n lg σ)$
Query
Data Structure	Time	Space in Bits
suffix tree [45]	$O (z_{78 [I]})$	$O (n lg n)$
SST [4] [Section 2.2.3]	$O (z_{78 [I]} lg z_{78 [I]} ϵ^{- 1})$	$((1 + ϵ) n + z_{78 [I]}) lg n + O (n)$
CST [4] [Section 2.2.2]	$O (z_{78 [I]} (lg z_{78 [I]} + {log}_{σ} n))$	$O (n lg σ)$

Table 2. Dynamic dictionary representations usable in our

cpST

approach (cf. Section 4.8) for

D

mapping a light node represented in

lg n

bits to the lowest edge witness within its heavy path represented in

lg | I |

bits.

z : = z_{78 [I]}

denotes the number of LZ78 factors of

T [I]

, which is an upper bound on the number of edge witnesses. An operation is a lookup or an insertion. We are interested in instances with

t_{D} = o (lg z)

(since, otherwise, the approach of Section 4.5 is favorable).

ϵ \in (0, 1)

is a selectable constant.

Data Structure $D$	Operation Time $t_{D}$	Space $s_{D}$ in Bits
plain array	$O (1)$	$n lg | I |$
Raman et al. [50]	$O (lg z / lg lg z)$ amortized	$z lg (n | I |) + o (z)$
backyard Cuckoo hashing [51]	$O (lg (1 / ϵ) / ϵ^{2})$ expected	$(1 + ϵ) z lg (n | I |)$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
