Linear-Time Text Compression by Longest-First Substitution

Nakamura, Ryosuke; Inenaga, Shunsuke; Bannai, Hideo; Funamoto, Takashi; Takeda, Masayuki; Shinohara, Ayumi

doi:10.3390/a2041429


by Ryosuke Nakamura ¹, Shunsuke Inenaga ²,*, Hideo Bannai ¹, Takashi Funamoto ¹, Masayuki Takeda ¹ and Ayumi Shinohara ³

¹ Department of Informatics, Kyushu University, 744 Motooka, Fukuoka 819-0395, Japan

² Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Fukuoka 819-0395, Japan

³ Graduate School of Information Sciences, Tohoku University, Aoba 6-6-05, Aramaki, Sendai 980-8579, Japan

* Author to whom correspondence should be addressed.

Algorithms 2009, 2(4), 1429-1448; https://doi.org/10.3390/a2041429

Submission received: 30 September 2009 / Accepted: 20 November 2009 / Published: 25 November 2009

(This article belongs to the Special Issue Data Compression)


Abstract

We consider grammar-based text compression with longest first substitution (LFS), where non-overlapping occurrences of a longest repeating factor of the input text are replaced by a new non-terminal symbol. We present the first linear-time algorithm for LFS. Our algorithm employs a new data structure called sparse lazy suffix trees. We also deal with a more sophisticated version of LFS, called LFS2, that allows better compression. The first linear-time algorithm for LFS2 is also presented.

Keywords:

grammar-based text compression; suffix trees; linear-time algorithms

1. Introduction

Data compression is the task of reducing the description length of data. It not only saves space for data storage but also reduces the time needed for data communication. This paper focuses on text compression, where the data to be compressed are texts (strings). Recent research shows that text compression has a wide range of applications, e.g., pattern matching [1, 2, 3], string similarity computation [4, 5], detecting palindromic/repetitive structures [4, 6], inferring hierarchical structure of natural language texts [7, 8], and analyses of biological sequences [9].

Grammar-based compression [10] is a kind of text compression scheme in which a context-free grammar (CFG) that generates only an input text w is output as a compressed form of w. Since the problem of computing the smallest CFG which generates w is NP-hard [11], many attempts have been made to develop practical algorithms that compute a small CFG which generates w. Examples of grammar-based compression algorithms are LZ78 [12], LZW [13], Sequitur [7], and Bisection [14]. Approximation algorithms for optimal grammar-based compression have also been proposed [15, 16, 17]. The first compression algorithm based on a subclass of context-sensitive grammars was introduced in [18].

Grammar-based compression based on greedy substitutions has been extensively studied. Wolff [19] introduced the concept of most-frequent-first substitution (MFFS), in which a digram (a factor of length 2) that occurs most frequently in the text is recursively replaced by a new non-terminal symbol. He also presented an $O (n^{2})$-time algorithm for it, where n is the input text length. A linear-time algorithm for most-frequent-first substitution, called Re-pair, was later proposed by Larsson and Moffat [20]. Apostolico and Lonardi [21] proposed the concept of largest-area-first substitution, in which a factor of the largest “area” is recursively replaced by a new non-terminal symbol. Here the area of a factor refers to the product of the length of the factor and the number of its non-overlapping occurrences in the input text. It was reported in [22] that compression by largest-area-first substitution outperforms gzip (based on LZ77 [23]) and bzip2 (based on the Burrows-Wheeler Transform [24]) on DNA sequences. However, to the best of our knowledge, no linear-time algorithm for this compression scheme is known.

This paper focuses on another greedy text compression scheme called longest-first substitution (LFS), in which a longest repeating factor of an input text is recursively replaced by a new non-terminal symbol. For example, for input text $w = abaaabbababb \$$, the following grammar

\begin{matrix} S & \to & B aa A B A $; \\ A & \to & abb; \\ B & \to & ab, \end{matrix}

which generates only w is the output of LFS.

In this paper, we propose the first linear-time algorithm for text compression by LFS. A key idea is the use of a new data structure called sparse lazy suffix trees. Moreover, this paper deals with a more sophisticated version of longest-first text compression (named LFS2), where we also consider repeating factors of the right-hand sides of the existing production rules. For the same input text $w = abaaabbababb \$$ as above, we obtain the following grammar:

\begin{matrix} S & \to & B aa A B A $; \\ A & \to & B b; \\ B & \to & ab . \end{matrix}

This method allows better compression since the total grammar size becomes smaller. In this paper, we present the first linear-time algorithm for text compression based on LFS2. Preliminary versions of our paper appeared in [25] and [26].
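To make the comparison concrete, the following Python sketch expands both grammars shown above and counts their sizes. The dictionary encoding of the rules and the helper names `expand` and `grammar_size` are illustrative assumptions and not part of the paper's algorithm; grammar size is taken here to be the total number of symbols on the right-hand sides.

```python
def expand(grammar, symbol="S"):
    """Expand `symbol` into the terminal string it derives."""
    if symbol not in grammar:          # terminal symbols have no rule
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])


def grammar_size(grammar):
    """Total number of symbols on the right-hand sides of all rules."""
    return sum(len(rhs) for rhs in grammar.values())


# Right-hand sides are lists of symbols; the keys S, A, B are non-terminals.
lfs = {"S": list("BaaABA$"), "A": list("abb"), "B": list("ab")}
lfs2 = {"S": list("BaaABA$"), "A": list("Bb"), "B": list("ab")}

print(expand(lfs), grammar_size(lfs))
print(expand(lfs2), grammar_size(lfs2))
```

Both grammars derive the same text, and the LFS2 grammar is one symbol smaller because the rule for A reuses B.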

Related Work

Several algorithms for LFS or LFS2 have already been proposed; however, none of them runs in linear time in the worst case. Bentley and McIlroy [27] proposed an algorithm for LFS, but Nevill-Manning and Witten [8] pointed out that the algorithm does not run in linear time. Nevill-Manning and Witten also claimed that the algorithm can be improved so as to run in linear time, but they gave only a brief sketch of how, which is too short to convey the whole algorithm. Lanctot et al. [28] proposed an algorithm for LFS2 and stated that it runs in linear time, but a careful analysis reveals that it actually takes $O (n^{2})$ time in the worst case for some input string of length n. See the Appendix for our detailed analysis.

2. Preliminaries

2.1. Notations

Let Σ be a finite alphabet of symbols. We assume that Σ is fixed and $| Σ |$ is constant. An element of $Σ^{*}$ is called a string. Strings x, y, and z are said to be a prefix, factor, and suffix of string $w = x y z$, respectively.

The length of a string w is denoted by $| w |$. The empty string is denoted by ε, that is, $| ε | = 0$. Also, we assume that all strings end with a unique symbol $\$ \in Σ$ that does not occur anywhere else in the strings. Let $Σ^{+} = Σ^{*} \setminus \{ε\}$. The i-th symbol of a string w is denoted by $w [i]$ for $1 \leq i \leq | w |$, and the factor of a string w that begins at position i and ends at position j is denoted by $w [i : j]$ for $1 \leq i \leq j \leq | w |$. For convenience, let $w [i : j] = ε$ for $j < i$, and $w [i :] = w [i : | w |]$ for $1 \leq i \leq | w |$. For any strings x, w, let ${BP}_{w} (x)$ denote the set of the beginning positions of all the occurrences of x in w. That is, ${BP}_{w} (x) = \{i \mid x = w [i : i + | x | - 1]\}$.

We say that strings x, y overlap in w if there exist integers i, j such that $x = w [i : i + | x | - 1]$, $y = w [j : j + | y | - 1]$, and $i \leq j \leq i + | x | - 1$ or $j \leq i \leq j + | y | - 1$.

Let $\# {occ}_{w} (x)$ denote the maximum possible number of non-overlapping occurrences of x in w. If $\# {occ}_{w} (x) \geq 2$, then x is said to be repeating in w. We abbreviate a longest repeating factor of w to an LRF of w. Note that there can exist more than one LRF for w.
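The definitions of beginning positions, non-overlapping occurrence counts, and LRFs can be checked by brute force on small inputs. The sketch below is a deliberately naive (polynomial-time, not linear-time) illustration; the function names `bp`, `occ`, and `lrfs` are assumptions introduced here. Counting occurrences greedily from left to right yields the maximum number of non-overlapping occurrences because all occurrences of x have the same length.

```python
def bp(w, x):
    """Beginning positions (1-indexed) of all occurrences of x in w."""
    return [i + 1 for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def occ(w, x):
    """Maximum number of pairwise non-overlapping occurrences of x in w."""
    count, next_free = 0, 1
    for i in bp(w, x):
        if i >= next_free:             # occurrence starts after the last one kept
            count += 1
            next_free = i + len(x)
    return count


def lrfs(w):
    """All longest repeating factors of w, in lexicographic order."""
    for length in range(len(w) // 2, 0, -1):
        factors = {w[i:i + length] for i in range(len(w) - length + 1)}
        found = sorted(x for x in factors if occ(w, x) >= 2)
        if found:
            return found
    return []


print(bp("ababa$", "aba"), occ("ababa$", "aba"))
print(lrfs("ababa$"))
print(lrfs("abaaabbababb$"))
```

In `ababa$` the two occurrences of `aba` overlap, so `aba` is not repeating even though it occurs twice.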

Let Σ and Π be the sets of terminal and non-terminal symbols, respectively, such that $Σ \cap Π = \emptyset$. A context-free grammar G is a formal grammar in which every production rule is of the form $A \to u$, where $A \in Π$ and $u \in {(Σ \cup Π)}^{*}$. Let $u = x B y$ and $v = x β y$ with $x, y, β \in {(Σ \cup Π)}^{*}$ and $B \in Π$. If there exists a production rule $B \to β$ in G, then $v = x β y$ is said to be directly derived from $u = x B y$ by G, and it is denoted by $u \Rightarrow_{G} v$. If there exists a sequence $w_{0}, w_{1}, \dots, w_{n}$ such that $w_{i} \in {(Σ \cup Π)}^{*}$ and

u = w_{0} \Rightarrow_{G} w_{1} \Rightarrow_{G} \dots \Rightarrow_{G} w_{n} = v,

then we say that v is derived from u. The length of a non-terminal symbol A, denoted $| A |$, is the length of the string $z \in Σ^{*}$ that is derived from the production rule $A \to v$. For convenience, we assume that any non-terminal symbol A in G has $| A |$ positions. The size of the production rule $A \to v$ is the number of terminal and non-terminal symbols that v contains.

Figure 1. $STree (w)$ with $w = ababa \$$. Solid arrows represent edges, and dotted arrows are suffix links.

2.2. Data Structures

Our text compression algorithm uses a data structure based on suffix trees [29]. The suffix tree of string w, denoted by $STree (w)$, is defined as follows:

Definition 1 (Suffix Trees) $STree (w)$ is a tree structure such that: (1) every edge is labeled by a non-empty factor of w, (2) every internal node has at least two child nodes, (3) all out-going edge labels of every node begin with mutually distinct symbols, and (4) every suffix of w is spelled out in a path starting from the root node.

Assuming any string w terminates with the unique symbol $\$$ not appearing elsewhere in w, there is a one-to-one correspondence between a suffix of w and a leaf node of $STree (w)$. It is easy to see that the numbers of the nodes and edges of $STree (w)$ are linear in $| w |$. Moreover, by encoding every edge label x of $STree (w)$ with an ordered pair $(i, j)$ of integers such that $x = w [i : j]$, each edge only needs constant space. Therefore, $STree (w)$ can be implemented with a total of $O (| w |)$ space. Also, it is well known that $STree (w)$ can be constructed in $O (| w |)$ time (see, e.g., [29]).

$STree (w)$ for string $w = ababa \$$ is shown in Figure 1. For any node v of $STree (w)$, $str (v)$ denotes the string obtained by concatenating the labels of the edges in the path from the root node to node v. The length of node v, denoted $len (v)$, is defined to be $| str (v) |$. It is an easy application of the Ukkonen algorithm [29] to compute the lengths of all nodes while constructing $STree (w)$. The leaf node ℓ such that $str (ℓ) = w [i :]$ is denoted by ${leaf}_{i}$, and i is said to be the id of the leaf. Every node v of $STree (w)$ except for the root node has a suffix link, denoted by $suf (v)$, such that $suf (v) = v^{'}$ where $str (v^{'})$ is a suffix of $str (v)$ and $len (v^{'}) + 1 = len (v)$. Linear-time suffix tree construction algorithms (e.g., [29]) make extensive use of the suffix links.
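For experimenting with small inputs, a suffix tree satisfying Definition 1 can be built by inserting the suffixes one at a time and splitting edges where needed. The sketch below is a quadratic-time illustration only: it is not the Ukkonen algorithm cited in the text, it stores edge labels as explicit strings rather than index pairs, and it omits suffix links. The names `Node`, `build_suffix_tree`, and `internal_strings` are assumptions introduced here, and the input is assumed to end with a unique terminator.

```python
class Node:
    def __init__(self):
        self.children = {}    # first symbol of edge label -> (label, child node)
        self.leaf_id = None   # 1-indexed suffix start for leaves, None otherwise


def build_suffix_tree(w):
    """Naive suffix tree of w; w must end with a symbol occurring nowhere else."""
    root = Node()
    for i in range(len(w)):
        node, rest = root, w[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf_id = i + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0                                   # length of the common prefix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):                      # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def internal_strings(node, prefix=""):
    """str(v) for every internal node v other than the root."""
    out = []
    for label, child in node.children.values():
        if child.children:
            out.append(prefix + label)
            out.extend(internal_strings(child, prefix + label))
    return out


print(sorted(internal_strings(build_suffix_tree("ababa$"))))
```

For `ababa$` the internal nodes other than the root spell `a`, `aba`, and `ba`; the factor `ab` ends in the middle of an edge and is therefore not a node.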

A sparse suffix tree [30] of $w \in Σ^{*}$ is a kind of suffix tree which represents only a subset of the suffixes of w. The sparse suffix tree of $w \in {(Σ \cup Π)}^{*}$ represents the subset $\{w [i :] \mid w [i] \in Σ\}$ of suffixes of w which begin with a terminal symbol. Let ℓ be the length of the LRFs of w. A reference node of the sparse suffix tree of $w \in {(Σ \cup Π)}^{*}$ is any node v such that $len (v) \geq ℓ + 1$, and there is no node u such that $str (u)$ is a proper prefix of $str (v)$ and $len (u) \geq ℓ + 1$.

Our algorithm uses the following data structure.

Definition 2 (Sparse Lazy Suffix Trees) A sparse lazy suffix tree (SLSTree) of string $w \in {(Σ \cup Π)}^{*}$, denoted by $SLSTree (w)$, is a kind of sparse suffix tree such that: (1) all paths from the root node to the reference nodes coincide with those of the sparse suffix tree of w, and (2) every reference node v stores an ordered triple $\langle min (v), max (v), card (v) \rangle$ such that $min (v) = min {BP}_{w} (str (v))$, $max (v) = max {BP}_{w} (str (v))$, and $card (v) = | {BP}_{w} (str (v)) |$.

$SLSTree (w)$ is called “lazy” since its subtrees that are located below the reference nodes may not coincide with those of the corresponding sparse suffix tree of w. Our algorithms of Section 3 run in linear time by “neglecting” to update these subtrees below the reference nodes.

Proposition 1 For any string $w \in Σ^{*}$, $SLSTree (w)$ can be obtained from $STree (w)$ in $O (| w |)$ time.

Proof. By a standard postorder traversal on $STree (w)$, propagating the id of each leaf node. □

Since $STree (w)$ can be constructed in $O (| w |)$ time [29], we can build $SLSTree (w)$ in a total of $O (| w |)$ time.
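The postorder propagation in the proof of Proposition 1 can be sketched as follows. This is an illustration under simplifying assumptions: the tree is given as nested Python lists whose leaves are integer leaf ids, edge labels are ignored, and triples are computed for every internal node rather than only for reference nodes. The name `annotate` is hypothetical.

```python
def annotate(tree, out):
    """Postorder traversal returning (min, max, card) of the leaf ids below `tree`.

    The triples of internal nodes are appended to `out` in postorder.
    """
    if isinstance(tree, int):                  # a leaf is represented by its id
        return (tree, tree, 1)
    child_triples = [annotate(child, out) for child in tree]
    triple = (min(t[0] for t in child_triples),
              max(t[1] for t in child_triples),
              sum(t[2] for t in child_triples))
    out.append(triple)
    return triple


# Shape of STree(ababa$): node "a" has the child node "aba" (leaves 1 and 3)
# and leaf 5; node "ba" has leaves 2 and 4; leaf 6 hangs directly off the root.
stree = [[[1, 3], 5], [2, 4], 6]
triples = []
root_triple = annotate(stree, triples)
print(triples)
```

Each node is visited once and combines its children's triples in time proportional to its number of children, so the traversal is linear in the size of the tree.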

3. Off-Line Compression by Longest-First Substitution

Given a text string $w \in Σ^{*}$, we here consider a greedy approach to construct a context-free grammar which generates only w. The key is how to select a factor of w to be replaced by a non-terminal symbol from Π. Here, we consider the longest-first-substitution approach where we recursively replace as many LRFs as possible with non-terminal symbols.

Example. Let $w = abaaabbababb \$$. At the beginning, the grammar is of the following simple form $S \to abaaabbababb \$$, where the right-hand side of the production rule consists only of terminal symbols from Σ. Now we focus on the right-hand side of S, which has two LRFs $aba$ and $abb$. Let us here choose $abb$ to be replaced by non-terminal $A \in Π$. We obtain the following grammar: $S \to abaa A ab A \$$; $A \to abb$. The other LRF $aba$ of length 3 is no longer present in the right-hand side of S. Thus we focus on an LRF $ab$ of length 2. Replacing $ab$ by non-terminal $B \in Π$ results in the following grammar: $S \to B aa A B A \$$; $A \to abb$; $B \to ab$. Since the right-hand side of S has no repeating factor longer than 1, we are done.
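The example can be replayed with a brute-force version of longest-first substitution. The sketch below is not the linear-time algorithm of this paper: it rescans the text for every candidate length. Two choices are assumptions made only so that the run matches the example: ties between LRFs of equal length are broken by taking the lexicographically largest one (so `abb` is preferred over `aba`), and substitution stops once no repeating factor of length at least 2 remains. The names `select`, `naive_lfs`, and the non-terminal names `A1`, `A2` are hypothetical.

```python
def select(seq, window):
    """Greedy left-to-right choice of non-overlapping occurrences (0-indexed starts)."""
    starts, i, n = [], 0, len(window)
    while i + n <= len(seq):
        if tuple(seq[i:i + n]) == window:
            starts.append(i)
            i += n
        else:
            i += 1
    return starts


def naive_lfs(text, min_len=2):
    """Return (right-hand side of S, rules) computed by brute-force LFS."""
    seq, terminals, rules = list(text), set(text), {}
    while True:
        best = None
        for length in range(len(seq) // 2, min_len - 1, -1):
            candidates = set()
            for i in range(len(seq) - length + 1):
                window = tuple(seq[i:i + length])
                # Only factors made of terminal symbols are considered.
                if all(s in terminals for s in window) and len(select(seq, window)) >= 2:
                    candidates.add(window)
            if candidates:
                best = max(candidates)         # assumed tie-breaking rule
                break
        if best is None:
            return seq, rules
        name = f"A{len(rules) + 1}"
        rules[name] = "".join(best)
        chosen, out, i = set(select(seq, best)), [], 0
        while i < len(seq):
            if i in chosen:                    # replace this occurrence
                out.append(name)
                i += len(best)
            else:
                out.append(seq[i])
                i += 1
        seq = out


print(naive_lfs("abaaabbababb$"))
```

With `A1` and `A2` playing the roles of A and B, the result corresponds to the grammar derived in the example.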

Let $w_{0} = w$, and let $w_{k}$ denote the string obtained by replacing an LRF of $w_{k - 1}$ with a non-terminal symbol $A_{k}$. $LRF (w_{k - 1})$ denotes the LRF of $w_{k - 1}$ that is replaced by $A_{k}$, namely, we create a new production rule $A_{k} \to LRF (w_{k - 1})$. In the above example, $w_{0} = w = abaaabbababb \$$, $LRF (w_{0}) = abb$, $A_{1} = A$, $w_{1} = abaa A ab A \$$, $LRF (w_{1}) = ab$, $A_{2} = B$, and $w_{2} = B aa A B A \$$.

Due to the property of the longest first approach, we have the following observation.

Observation 1 Let $A_{1}, \dots, A_{k} \in Π$ be the non-terminal symbols which replace $LRF (w_{0}), \dots, LRF (w_{k - 1})$, respectively. For any $1 \leq i \leq k$, the right-hand side of the production rule of $A_{i}$ contains none of $A_{1}, \dots, A_{i - 1}$.

In what follows, we will show our algorithm which outputs a context-free grammar which generates a given string. Our algorithm heavily uses the SLSTree structure.

3.1. How to Find $LRF (w_{k})$ Using $SLSTree (w_{k})$

In this section, we show how to find an LRF of $w_{k}$ from $SLSTree (w_{k})$.

The next lemmas characterize an LRF of $w_{k}$ that is not represented by a node of $SLSTree (w_{k})$.

Lemma 1 If an LRF x of $w_{k}$ is not represented by a node of $SLSTree (w_{k})$, then $max {BP}_{w_{k}} (x) = min {BP}_{w_{k}} (x) + | x |$.

Proof. Let $i = min {BP}_{w_{k}} (x)$ and $j = max {BP}_{w_{k}} (x)$. Since x is a repeating factor of $w_{k}$, $| {BP}_{w_{k}} (x) | \geq 2$, which means that $i \neq j$. If $w_{k} [i + | x |] \neq w_{k} [j + | x |]$, then it contradicts the precondition that x is not represented by a node of $SLSTree (w_{k})$. Hence we have $w_{k} [i + | x |] = w_{k} [j + | x |]$. Moreover, since x is an LRF of $w_{k}$, we have $j \geq i + | x |$. However, if we assume $j > i + | x |$, this contradicts the precondition that x is an LRF of $w_{k}$, since $w_{k} [i + | x |] = w_{k} [j + | x |]$ and we obtain a longer repeating factor $w_{k} [i : i + | x |] = w_{k} [j : j + | x |]$. Hence we have $j = i + | x |$. □

The above lemma implies that an LRF x is not represented by a node of $SLSTree (w_{k})$ only if the first and the last occurrences of x form a square $x x$ in $w_{k}$. For example, see Figure 1, which illustrates $SLSTree (w_{0})$ for $w = ababa \$$. One can see that $ab$ is an LRF of $w_{0}$ but it is not represented by a node of $SLSTree (w_{0})$.
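For plain strings (the case k = 0), Lemma 1 can be sanity-checked by brute force. The sketch below assumes the standard characterization that a repeating factor is an internal node of the suffix tree exactly when it is followed by at least two distinct symbols in the text; the helper names (`bp`, `occ`, `lrfs`, `is_branching`, `check_lemma1`) are introduced here for illustration, and the check only covers short strings over a two-letter alphabet.

```python
from itertools import product


def bp(w, x):
    """Beginning positions (1-indexed) of all occurrences of x in w."""
    return [i + 1 for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def occ(w, x):
    """Maximum number of non-overlapping occurrences of x in w."""
    count, next_free = 0, 1
    for i in bp(w, x):
        if i >= next_free:
            count, next_free = count + 1, i + len(x)
    return count


def lrfs(w):
    """All longest repeating factors of w."""
    for length in range(len(w) // 2, 0, -1):
        factors = {w[i:i + length] for i in range(len(w) - length + 1)}
        found = sorted(x for x in factors if occ(w, x) >= 2)
        if found:
            return found
    return []


def is_branching(w, x):
    """True if occurrences of x are followed by at least two distinct symbols."""
    # A 1-indexed occurrence at i is followed by the symbol at 0-indexed i - 1 + |x|.
    return len({w[i - 1 + len(x)] for i in bp(w, x)}) >= 2


def check_lemma1(w):
    """Every LRF that is not a node must satisfy max BP = min BP + |x|."""
    for x in lrfs(w):
        if not is_branching(w, x):
            positions = bp(w, x)
            if max(positions) != min(positions) + len(x):
                return False
    return True


words = ["".join(t) + "$" for n in range(1, 9) for t in product("ab", repeat=n)]
print(all(check_lemma1(w) for w in words))
print(is_branching("ababa$", "ab"), bp("ababa$", "ab"))
```

For `ababa$`, the LRF `ab` is always followed by `a`, so it is not a node, and its two occurrences at positions 1 and 3 form the square `abab`.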

However, the following lemma guarantees that it is indeed sufficient to consider the strings represented by nodes of $SLSTree (w_{k})$ as candidates for $LRF (w_{k})$.

Lemma 2 Let x be an LRF of $w_{k}$ that is not represented by a node of $SLSTree (w_{k})$. Then, there exists another LRF y of $w_{k}$ that is represented by a node of $SLSTree (w_{k})$ such that $| x | = | y |$. Moreover, x is no longer present in $w_{k + 1}$ after a substitution for y (see also Figure 2).

Proof. Let $i = min {BP}_{w_{k}} (x)$ and $j = max {BP}_{w_{k}} (x)$. It follows from Lemma 1 that $j = i + | x |$. Suppose that x is represented on an edge from some node s to some node t of $STree (w)$. Let $u = str (t)$. Then we have ${BP}_{w_{k}} (x) = {BP}_{w_{k}} (u)$. Let y be the suffix of u of length $| x |$. It is clear that $i + | u | - | y |, j + | u | - | y | \in {BP}_{w_{k}} (y)$. Since $j = i + | x | = i + | y |$, $\# {occ}_{w_{k}} (y) \geq 2$. Thus y is an LRF of $w_{k}$. Since u is represented by node t and $i = min {BP}_{w_{k}} (u)$ and $j = max {BP}_{w_{k}} (u)$, we know that $w_{k} [i + | u |] \neq w_{k} [j + | u |]$. Hence y is represented by a node of $SLSTree (w_{k})$. Since x occurs only within the region $w_{k} [i : j + | u | - 1]$, x does not occur in $w_{k + 1}$ after a substitution for y. □

In the running example of Figure 1, $ba$ is an LRF of $w_{0}$ that is represented by a node of $SLSTree (w_{0})$. After its two occurrences are replaced by a non-terminal symbol $A_{1}$, $ab$, which is an LRF of $w_{0}$ not represented by a node of $SLSTree (w_{0})$, is no longer present in $w_{1} = a A_{1} A_{1} \$$.

Figure 2. Illustration for proof of Lemma 2. Since u is represented by a node of $SLSTree (w_{k})$, we know that $w_{k} [i + | u |] \neq w_{k} [j + | u |]$.

After constructing $SLSTree (w_{0}) = SLSTree (w)$, we create a bin-sorted list of the internal nodes of $SLSTree (w)$ in the decreasing order of their lengths. This can be done in linear time by a standard traversal on $SLSTree (w)$. We remark that a new internal node v may appear in $SLSTree (w_{k})$ for some $k \geq 1$, which did not exist in $SLSTree (w_{k - 1})$. However, we have that $len (v) \leq | LRF (w_{k - 1}) |$. Thus, we can maintain the bin-sorted list by inserting node v in constant time.
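One way to picture the bin-sorted list is as an array of buckets indexed by node length. The sketch below assumes nodes are given as (name, length) pairs and uses hypothetical helper names; it only illustrates why sorting is linear and why a later insertion costs constant time.

```python
def bin_sort_by_length(nodes, max_len):
    """Place each (name, length) pair into the bucket for its length."""
    bins = [[] for _ in range(max_len + 1)]
    for name, length in nodes:
        bins[length].append(name)
    return bins


def decreasing_order(bins):
    """Node names from the longest bucket down to the shortest."""
    return [name for length in range(len(bins) - 1, -1, -1) for name in bins[length]]


# Internal nodes of STree(ababa$) with their lengths (the root has length 0).
bins = bin_sort_by_length([("root", 0), ("a", 1), ("aba", 3), ("ba", 2)], max_len=6)
print(decreasing_order(bins))

# A node of length 2 that appears after a substitution is appended to its
# bucket in constant time, without re-sorting.
bins[2].append("new")
print(decreasing_order(bins))
```

Because a newly created node is never longer than the LRF just processed, it always lands in a bucket that the scan from longest to shortest has not yet passed.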

Given a node s in the bin-sorted list, we can determine whether $str (s)$ is repeating or not by using $SLSTree (w_{k})$, as follows.

Lemma 3 Let s be any node of $SLSTree (w_{k})$ with $len (s) \leq | LRF (w_{k}) |$ and let $s_{1}, \dots, s_{ℓ}$ be the children of s. Then ${BP}_{w_{k}} (str (s))$ is a disjoint union of ${BP}_{w_{k}} (str (s_{1})), \dots, {BP}_{w_{k}} (str (s_{ℓ}))$.

Proof. Clear from the definition of $SLSTree (w_{k})$. □

Lemma 4 For any node s of $SLSTree (w_{k - 1})$ such that $| LRF (w_{k}) | \leq len (s) \leq | LRF (w_{k - 1}) |$, it takes amortized constant time to check whether or not $str (s)$ is an LRF of $w_{k}$.

Proof. Let $s_{1}, \dots, s_{ℓ}$ be the children of s. Then, $str (s)$ is repeating if and only if

max \{ max {BP}_{w_{k - 1}} (s_{i}) \mid 1 \leq i \leq ℓ \} - min \{ min {BP}_{w_{k - 1}} (s_{j}) \mid 1 \leq j \leq ℓ \} \geq len (s).

Note that the values of $min {BP}_{w_{k - 1}} (s_{i})$ and $max {BP}_{w_{k - 1}} (s_{i})$ are stored in node $s_{i}$ and can be referred to in constant time. Since the above inequality is checked at most once for each node s, it takes amortized constant time. □
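The inequality in the proof of Lemma 4 translates directly into code. In the sketch below each child is represented by its stored triple (min, max, card); the function name `is_repeating` and this encoding are assumptions made for illustration.

```python
def is_repeating(child_triples, length):
    """Check max over children of max BP minus min over children of min BP >= len(s)."""
    lo = min(t[0] for t in child_triples)
    hi = max(t[1] for t in child_triples)
    return hi - lo >= length


# Nodes of STree(ababa$), each described by the triples of its children.
print(is_repeating([(2, 2, 1), (4, 4, 1)], 2))   # node "ba": leaves 2 and 4
print(is_repeating([(1, 1, 1), (3, 3, 1)], 3))   # node "aba": leaves 1 and 3
print(is_repeating([(1, 3, 2), (5, 5, 1)], 1))   # node "a": child "aba" and leaf 5
```

The node for `aba` fails the test because its first and last occurrences start only 2 positions apart, which is less than its length 3.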

Suppose we have found an LRF of $w_{k}$ as mentioned above. In the sequel, we show our greedy strategy to select occurrences of the LRF in $w_{k}$ to be replaced with a new non-terminal symbol.

The next lemma is essentially the same as Lemma 2 of Kida et al. [1].

Lemma 5 For any non-repeating factor x of $w_{k}$, ${BP}_{w_{k}} (x)$ forms a single arithmetic progression.

Therefore, for any non-repeating factor x of $w_{k}$, ${BP}_{w_{k}} (x)$ can be expressed by an ordered triple consisting of the minimum element $min {BP}_{w_{k}} (x)$, the maximum element $max {BP}_{w_{k}} (x)$, and the cardinality $| {BP}_{w_{k}} (x) |$, which takes constant space.
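The constant-space encoding can be illustrated by converting a position set to a triple and back. The helper names `to_triple` and `from_triple` are assumptions; the step of the progression is recovered as (max - min) / (card - 1), which is the standard formula for an arithmetic progression with at least two elements.

```python
def bp(w, x):
    """Beginning positions (1-indexed) of all occurrences of x in w."""
    return [i + 1 for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def to_triple(positions):
    """Encode a non-empty arithmetic progression as (min, max, card)."""
    return (min(positions), max(positions), len(positions))


def from_triple(lo, hi, card):
    """Decode (min, max, card) back into the full list of positions."""
    if card == 1:
        return [lo]
    step = (hi - lo) // (card - 1)
    return list(range(lo, hi + 1, step))


# "ababa" occurs at positions 1 and 3 of abababa$; the occurrences overlap,
# so the factor is non-repeating.
positions = bp("abababa$", "ababa")
print(positions, to_triple(positions), from_triple(*to_triple(positions)))

# "aaa" occurs at positions 1, 2, 3 of aaaaa$ and is also non-repeating.
positions = bp("aaaaa$", "aaa")
print(positions, to_triple(positions), from_triple(*to_triple(positions)))
```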

Lemma 6 Let s be any node of $SLSTree (w_{k})$ such that $str (s)$ is an LRF of $w_{k}$, and let $s^{'}$ be any child of s. Then, ${BP}_{w_{k}} (str (s^{'}))$ contains at most two positions corresponding to non-overlapping occurrences of $str (s)$ in $w_{k}$.

Proof. Assume, for the sake of contradiction, that ${BP}_{w_{k}} (str (s^{'}))$ contains three non-overlapping occurrences of $str (s)$, and let them be $i_{1}, i_{2}, i_{3}$ in increasing order. Then we have

i_{3} - (i_{1} + len (s) - 1) \geq i_{3} - i_{2} \geq len (s) \geq 1,

which implies that $w_{k} [i_{1} : i_{1} + len (s)]$ and $w_{k} [i_{3} : i_{3} + len (s)]$ are non-overlapping. Moreover, since $len (s^{'}) > len (s)$, we have $w_{k} [i_{1} : i_{1} + len (s)] = w_{k} [i_{3} : i_{3} + len (s)]$. However, this contradicts the precondition that $str (s)$ is an LRF of $w_{k}$. □

From Lemma 6, each child $s^{'}$ of a node s such that $str (s)$ is an LRF corresponds to at most two non-overlapping occurrences of $str (s)$. Due to Lemma 3, we can greedily select occurrences of $str (s)$ to be replaced by a new non-terminal symbol, by checking all children $s_{1}, \dots, s_{ℓ}$ of node s. According to Lemma 5, it takes amortized constant time to select such occurrences for each node s.

Note that we have to select occurrences of $str (s)$ so that no occurrences of $str (s)$ remain in the text string, and at least two occurrences of $str (s)$ are selected. We remark that we can greedily choose at least $max \{2, \# occ (str (s)) / 2\}$ occurrences.

3.2. How to Update $SLSTree (w_{k}^{i - 1})$ to $SLSTree (w_{k}^{i})$

Let L be the set of the greedily selected occurrences of $LRF (w_{k})$ in $w_{k}$. For any $0 \leq i \leq | L |$, let $w_{k}^{i}$ denote the string obtained after replacing the first i occurrences of $LRF (w_{k})$ with non-terminal symbol $A_{k + 1}$. Namely, $w_{k}^{0} = w_{k}$ and $w_{k}^{| L |} = w_{k + 1}$.

In this section we show how to update $SLSTree (w_{k}^{i - 1})$ to $SLSTree (w_{k}^{i})$. Let p be the beginning position of the i-th occurrence in L. Assume that we have $SLSTree (w_{k}^{i - 1})$, and that we have replaced $w_{k}^{i - 1} [p : p + | LRF (w_{k}) | - 1]$ with non-terminal symbol $A_{k + 1}$ such that $| A_{k + 1} | = | LRF (w_{k}) |$. We now have $w_{k}^{i}$, and we have to update $SLSTree (w_{k}^{i - 1})$ to $SLSTree (w_{k}^{i})$.

A naive way to obtain $SLSTree (w_{k}^{i})$ is to remove all the suffixes of $w_{k}^{i - 1}$ from $SLSTree (w_{k}^{i - 1})$ and insert all the suffixes of $w_{k}^{i}$ into it. However, since only the nodes not longer than $LRF (w_{k})$ are important for our longest-first strategy, only the suffixes $w_{k}^{i - 1} [p - t :]$ such that $1 \leq t \leq | LRF (w_{k}) |$ and $w_{k}^{i - 1} [r] \in Σ$ for any $p - t \leq r < p$ have to be removed from $SLSTree (w_{k}^{i - 1})$, and only the suffixes $w_{k}^{i} [p - t :]$ have to be inserted into the tree (see the light-shaded suffixes of Figure 3).
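The set of suffixes affected by one substitution can be enumerated directly from the two conditions on t. In the sketch below the text is assumed to be stored position by position, with a non-terminal A repeated |A| times because it occupies |A| positions; the function name `suffixes_to_reinsert` and this encoding are illustrative assumptions.

```python
def suffixes_to_reinsert(w, p, lrf_len, terminals):
    """Start positions p - t (1-indexed) of the suffixes that must be removed
    and re-inserted when the occurrence starting at position p is replaced."""
    starts = []
    for t in range(1, lrf_len + 1):
        q = p - t
        # Every position in [p - t, p) must hold a terminal symbol, so stop at
        # the left end of the text or at the first non-terminal encountered.
        if q < 1 or w[q - 1] not in terminals:
            break
        starts.append(q)
    return starts


# The text abaaabbababb$ after the first occurrence of abb (positions 5..7)
# has been replaced by A; the second occurrence of abb starts at p = 10.
w = list("abaa") + ["A"] * 3 + list("ababb$")
print(suffixes_to_reinsert(w, p=10, lrf_len=3, terminals=set("ab$")))
```

Here only the suffixes starting at positions 9 and 8 are affected, since position 7 already holds a non-terminal.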

Lemma 7 For any t, let r be the shortest node of $SLSTree (w_{k}^{i - 1})$ such that $w_{k}^{i} [p - t : p - 1]$ is a prefix of $str (r)$. Assume $p - t = min {BP}_{w_{k}^{i - 1}} (str (r))$.

(1) If $len (r) > | LRF (w_{k}) | + t - 1$, then there exists an edge in $SLSTree (w_{k}^{i})$ from the root node to ${leaf}_{p - t}$ labeled with $w_{k}^{i} [p - t :]$.

(2) If $len (r) \leq | LRF (w_{k}) | + t - 1$, then there exists a node s in $SLSTree (w_{k}^{i})$ such that $str (s) = w_{k}^{i} [p - t : p - 1]$ and s has an edge labeled with $w_{k}^{i} [p :] = A_{k} w_{k}^{i} [p + | A_{k} | :]$ and leading to ${leaf}_{p - t}$.

Proof. Consider Case 1 (see also Figure 4). Since $t \geq 1$, $len (r) > | LRF (w_{k}) |$. Hence $str (r)$ is a non-repeating factor of $w_{k}^{i}$. By Lemma 5, ${BP}_{w_{k}^{i - 1}} (str (r))$ forms a single arithmetic progression. Also, since $len (r) > | LRF (w_{k}) |$, $max {BP}_{w_{k}^{i - 1}} (str (r)) - min {BP}_{w_{k}^{i - 1}} (str (r)) \leq | LRF (w_{k}) |$. Therefore, if $p - t = min {BP}_{w_{k}^{i - 1}} (str (r))$, then ${BP}_{w_{k}^{i}} (w_{k}^{i} [p - t :]) = \{p - t\}$. Hence there exists an edge from the root node to ${leaf}_{p - t}$ labeled with $w_{k}^{i} [p - t :]$ in $SLSTree (w_{k}^{i})$.

Figure 3. $LRF (w_{k})$ at position p of $w_{k}^{i - 1}$ is replaced by non-terminal symbol $A_{k}$ in $w_{k}^{i}$. Every $w_{k}^{i - 1} [p - t :]$ is removed from the tree and every $w_{k}^{i} [p - t :]$ is inserted into the tree (the light-shaded suffixes in the right figure). In addition, every $w_{k}^{i - 1} [p + h :]$ for $1 \leq h \leq | LRF (w_{k}) | - 1$ is removed from the tree (the dark-shaded suffixes in the right figure).

Figure 4. Illustration of Case 1 of Lemma 7.

Consider Case 2 (see also Figure 5). Let $u = w_{k}^{i - 1} [p - t : p - 1] = w_{k}^{i} [p - t : p - 1]$. Then $| u | = t - 1$. Since $len (r) \leq | LRF (w_{k}) | + t - 1$, and since r is not longer than the reference node in the path spelling out $u LRF (w_{k})$ from the root node of $SLSTree (w_{k}^{i})$, there exists at least one integer m such that $m \in {BP}_{w_{k}^{i}} (str (r))$ and $m \notin {BP}_{w_{k}^{i}} (u A_{k})$. Hence there exists a node s in $SLSTree (w_{k}^{i})$ such that $str (s) = u$ and s has an out-going edge labeled with $w_{k}^{i} [p :] = A_{k} w_{k}^{i} [p + | A_{k} | :]$ and leading to ${leaf}_{p - t}$. □

It is not difficult to see that the edge in each case of Lemma 7 does not exist in $SLSTree (w_{k}^{i - 1})$. Hence we create the edge when we update $SLSTree (w_{k}^{i - 1})$ to $SLSTree (w_{k}^{i})$.

The next lemma states how to locate node s of Case 2 of Lemma 7.

Figure 5. Illustration of Case 2 of Lemma 7.

Lemma 8 For each t, we can locate node s such that $str (s) = w_{k}^{i} [p - t : p - 1]$ in amortized constant time.

Proof. Let $x_{p - t}$ be the longest node in the tree such that $str (x_{p - t})$ is a prefix of $w_{k}^{i} [p - t : p - 1]$.

Consider the largest possible t and denote it by $t_{max}$. Since $t_{max} \leq | LRF (w_{k}) |$, the node $x_{p - t_{max}}$ can be found in $O (| LRF (w_{k}) |)$ time by going down the path that spells out $w_{k}^{i} [p - t_{max} : p - 1]$ from the root node (recall that Σ is fixed). Let $z \in Σ^{*}$ be the string such that $str (x_{p - t_{max}}) z = w_{k}^{i} [p - t_{max} : p - 1]$. If $z \neq ε$, then we create a new child node $s_{p - t_{max}}$ of $x_{p - t_{max}}$ such that $str (s_{p - t_{max}}) = w_{k}^{i} [p - t_{max} : p - 1]$. Otherwise, we set $s_{p - t_{max}} = x_{p - t_{max}}$.

Now assume that we have located nodes $x_{p - t}$ and $s_{p - t}$. We can then locate $s_{p - t + 1}$ as follows. Consider node $x_{p - t + 1}$. Note that $str (suf (x_{p - t}))$ is a prefix of $str (x_{p - t + 1})$, and thus we can detect $x_{p - t + 1}$ in $O (| str (x_{p - t + 1}) | - | str (suf (x_{p - t})) |)$ time by using the suffix link. After finding $x_{p - t + 1}$, we can locate or create $s_{p - t + 1}$ in constant time.

The total time cost for detecting $x_{p - t}$ for all $1 \leq t \leq t_{max}$ is linear in

\begin{matrix} & \sum_{t = 2}^{t_{max}} (| str (x_{p - t + 1}) | - | str (suf (x_{p - t})) |) \\ = & | str (x_{p - 1}) | - | str (suf (x_{p - 2})) | \\ + & | str (x_{p - 2}) | - | str (suf (x_{p - 3})) | \\ & \vdots \\ + & | str (x_{p - t_{max} + 1}) | - | str (suf (x_{p - t_{max}})) | \\ = & | str (x_{p - 1}) | - | str (suf (x_{p - t_{max}})) | + t_{max} - 2 \\ = & | str (x_{p - 1}) | - | str (x_{p - t_{max}}) | + t_{max} - 1 \\ \leq & t_{max} \leq | LRF (w_{k}) | . \end{matrix}

Hence we can locate each $s_{p - t}$ in amortized constant time. □

Let v be the reference node in the path from the root to some ${leaf}_{p - t}$. Assume that ${leaf}_{p - t}$ is removed from the subtree of v, and redirected to node s in the same path, such that $str (s) = w_{k}^{i} [p - t : p - 1]$. In order to update $SLSTree (w_{k}^{i - 1})$ to $SLSTree (w_{k}^{i})$, we have to maintain the triple $\langle min (v), max (v), card (v) \rangle$ for node v. One may be concerned that if $p - t$ is neither $min (v)$ nor $max (v)$ and $card (v) \geq 4$ in $SLSTree (w_{k}^{i - 1})$, the occurrences of $str (v)$ in $SLSTree (w_{k}^{i})$ do not form a single arithmetic progression any more. However, we have the following lemma. For any factor y of $w_{k}^{i}$, let ${Dead}_{w_{k}^{i}} (y) = {BP}_{w_{k}^{i - 1}} (y) \setminus {BP}_{w_{k}^{i}} (y)$, namely, ${Dead}_{w_{k}^{i}} (y)$ denotes the occurrences of y in $w_{k}^{i - 1}$ that overlap with the i-th greedily selected occurrence of $LRF (w_{k})$ in $w_{k}$.

Figure 6. Illustration of proof for Lemma 9.

Lemma 9 Let v be any reference node of $SLSTree (w_{k}^{i})$ such that $\# {occ}_{w_{k}^{i}} (str (v)) = 1$. For any integers m, n, if $m, n \in {BP}_{w_{k}^{i}} (str (v))$, then there is no integer r such that $m < r < n$ and $r \in {Dead}_{w_{k}^{i}} (str (v))$ (see Figure 6).

Proof. Assume, for the sake of contradiction, that there exists an integer r such that $r \in {Dead}_{w_{k}^{i}} (str (v))$ and $m < r < n$. Since $r \in {Dead}_{w_{k}^{i}} (str (v))$, there exist integers a, b such that $a \leq r \leq b$, and $b - a + 1 = 2 | LRF (w_{k}) |$. For any integer j such that $a \leq j \leq b$ and $j \in {BP}_{w_{k}^{i - 1}} (str (v))$, we have $j \in {Dead}_{w_{k}^{i}} (str (v))$. Since $m, n \notin {Dead}_{w_{k}^{i}} (str (v))$, $m < a < b < n$. As $str (v)$ is non-repeating, $n < m + len (v) - 1$. Since $m < a < b < m + len (v) - 1$, $w [a : b]$ is a factor of $str (v)$. Therefore, there exist two integers $a^{'}, b^{'}$ such that $w [a^{'} : b^{'}] = w [a : b]$. Since $m < a < b < n < a^{'} < b^{'} < n + len (v) - 1$, $w [a : b]$ is repeating and $| w [a : b] | = b - a + 1 = 2 | LRF (w_{k}) | > | LRF (w_{k}) |$. This contradicts the fact that $LRF (w_{k})$ is an LRF of $w_{k}$. □

Recall that p is the beginning position of the i-th largest greedily selected occurrence of $LRF (w_{k})$ in $w_{k}$. Also, for any $1 \leq t \leq | LRF (w_{k}) |$ such that $w_{k}^{i - 1} [r] \in Σ$ for every $p - t \leq r < p$, we have removed ${leaf}_{p - t}$ from the subtree rooted at the reference node v and have reconnected it to node s such that $str (s) = w_{k}^{i} [p - t : p - 1]$. According to the above lemma, if $min (v) < p - t < max (v)$, ${leaf}_{j}$ for every $p - t \leq j \leq max (v)$ is removed from the subtree of v. After processing ${leaf}_{p - t}$, $max (v)$ is updated to $p - t - d$, where $d = (max (v) - min (v)) / (card (v) - 1)$ is the step of the progression computed from the triple before the update, and $card (v)$ is updated to $(p - t - min (v)) / d$, which is the number of positions remaining in the progression from $min (v)$ to $p - t - d$.
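The triple update can be sketched as follows, under the reading that the positions from p - t up to the old maximum are dropped and the positions from min (v) up to p - t - d remain. The step is computed with the standard arithmetic-progression formula (max - min) / (card - 1); the function names `step` and `drop_from` are assumptions made for illustration.

```python
def step(triple):
    """Common difference of the progression encoded by (min, max, card)."""
    lo, hi, card = triple
    return (hi - lo) // (card - 1) if card > 1 else 0


def drop_from(triple, q):
    """Remove q and every larger position from the progression.

    q must be a member of the progression strictly between min and max.
    """
    lo, hi, card = triple
    d = step(triple)
    assert lo < q < hi and (q - lo) % d == 0
    new_hi = q - d                        # largest surviving position
    new_card = (new_hi - lo) // d + 1     # equals (q - lo) / d
    return (lo, new_hi, new_card)


# Progression 3, 5, 7, 9, 11 encoded as (min, max, card) = (3, 11, 5).
print(drop_from((3, 11, 5), 7))
print(drop_from((3, 11, 5), 5))
```

Dropping from position 7 leaves the positions 3 and 5, and dropping from position 5 leaves only position 3, so the result is again a single arithmetic progression in both cases.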

Notice that ${leaf}_{p + h}$ for every $0 \leq h \leq | LRF (w_{k}) | - 1$ has to be removed from the tree, since $w_{k}^{i} [p + h] \notin Σ$ and therefore this leaf node should not exist in $SLSTree (w_{k}^{i})$ (see the dark-shaded suffixes of Figure 3). Removing each leaf can be done in constant time. Maintaining the information about the triple for the arithmetic progression of the reference nodes can be done in the same way as mentioned above.

The following lemma states how to locate each reference node.

Lemma 10 Let p be the i-th greedily selected occurrence of $LRF (w_{k})$ in $w_{k}$. For any integer ℓ such that $w_{k}^{i - 1} [ℓ] \in Σ$, let $v (ℓ)$ denote the reference node of $SLSTree (w_{k}^{i - 1})$ in the path from the root spelling out suffix $w_{k}^{i - 1} [ℓ :]$. For each j such that $p - | LRF (w_{k}) | \leq j \leq p + | LRF (w_{k}) | - 1$, we can locate the reference node $v (j)$ in amortized constant time.

Figure 7. The left figure illustrates how to find

v (j)

from

v (j - 1)

. The right one illustrates a special case where

v (j) = {leaf}_{j}

. Once

v (j) = {leaf}_{j}

, it holds that

v (k) = {leaf}_{k}

for any

j \leq k \leq p - 1

.

Proof.

Let

ℓ = | LRF (w_{k}) |

. We find

v (p - ℓ)

by spelling out

w_{k}^{i - 1} [p - ℓ :]

from the root in

O (ℓ)

time, since there can be at most

ℓ + 1

nodes in the path from the root to

v (p - ℓ)

.

Suppose we have found

v (j - 1)

. We find

v (j)

as follows. Let

u (j - 1)

be the parent node of

v (j - 1)

. We have

len (u (j - 1)) \leq ℓ

and

len (v (j - 1)) \leq ℓ + 1

. We go to

suf (u (j - 1))

. Since

len (suf (u (j - 1))) + 1 = len (u (j - 1))

, we have

len (suf (u (j - 1))) \leq ℓ + 1

. Thus, we can find

v (j)

by going down the path starting from

suf (u (j - 1))

and spelling out

w_{k}^{i - 1} [j - 1 + len (u (j - 1)) : j - 1 + len (v (j - 1))] = w_{k}^{i - 1} [j + len (suf (u (j - 1))) : j - 1 + len (v (j - 1))]

. (See also the left illustration of Figure 7).

A special case happens when there exists a node s in the path from the root to

{leaf}_{j}

, such that

len (s) = ℓ

and the edge from s in the path starts with some non-terminal symbol

A_{h}

with

h < k

. Namely,

w_{k}^{i} [j + ℓ] = A_{h}

. Due to the property of the longest first approach, we have

| A_{h} | \geq ℓ

. Thus

v (j) = {leaf}_{j}

. Moreover, for any

j \leq k \leq p - 1

,

v (k) = {leaf}_{k}

. (See also the right illustration of Figure 7). It is thus clear that each

v (k)

can be found in constant time. Since

| A_{h} | \geq ℓ = | LRF (w_{k}) |

, the leaves corresponding to

w_{k}^{i - 1} [p + x - 1 :]

with

1 \leq x \leq ℓ

do not exist in

SLSTree (w_{k}^{i - 1})

. □

From the above discussions, we conclude that:

Theorem 1

For any string

w \in Σ^{*}

, the proposed algorithm for text compression by longest first substitution runs in

O (| w |)

time using

O (| w |)

space.

Pseudo-codes of our algorithms are shown in Algorithms 1, 2, and 3.

Algorithm 1: Recursively find longest repeating factors.

Algorithm 2: updateSLSTree

Algorithm 3: getGreedilySelectedOccurrences

3.3. Reducing Grammar Size

In the above sections we considered text compression by longest first substitution, where we construct a context free grammar

G

that generates only a given string w. By Observation 1, for any production rule

A_{k} \to x_{k}

of

G

,

x_{k}

contains only terminal symbols from Σ. In this section, we take the factors of

x_{k}

into consideration for candidates of LRFs, and also replace LRFs appearing in

x_{k}

. This way we can reduce the total size of the grammar. In so doing, we consider an LRF of string

z_{k} = w_{k} $_{0} x_{1} $_{1} \dots x_{k} $_{k}

, where

z_{0} = w_{0} = w

and each

$_{i}

appears nowhere else in

z_{k}

.

Example.

Let

w = w_{0} = z_{0} = abaaabbababb $_{0}

. We replace an LRF

abb

with A, and obtain the following grammar:

S \to abaa A ab A $_{0}

;

A \to abb

. Then,

w_{1} = abaa A ab A $_{0}

and

LRF (z_{0}) = abb

. Now,

z_{1} = abaa A ab A $_{0} abb $_{1}

. We replace an LRF

ab

of

z_{1}

with a non-terminal B, getting

S \to B aa A B A $_{0}

;

A \to B b

;

B \to ab

. Then,

w_{2} = B aa A B A $_{0}

and

LRF (z_{1}) = ab

. Now,

z_{2} = B aa A B A $_{0} B b $_{1} ab $_{2}

. Since there is no LRF of length more than 1 in

z_{2}

, we are done.

We call this method of text compression LFS2.
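To make the LFS2 strategy concrete, the following brute-force sketch applies it to a small string. It is deliberately naive (far from linear time) and is not the algorithm of this paper; it only mirrors the definition. It assumes that candidate factors consist of terminal symbols only, represents terminals as one-character strings and non-terminals as integers, and breaks ties among equally long factors by first detection, so the resulting grammar may differ from the one in the example above (which picks abb first). The names `find_lrf`, `replace_all`, `lfs2` and `expand` are hypothetical.

```python
def find_lrf(seqs):
    """Return a longest terminal-only factor with two non-overlapping
    occurrences in the collection of sequences, or None if its length would be < 2."""
    longest = max(len(s) for s in seqs)
    for length in range(longest, 1, -1):
        first = {}  # factor -> (sequence index, position) of its earliest occurrence
        for si, seq in enumerate(seqs):
            for i in range(len(seq) - length + 1):
                factor = tuple(seq[i:i + length])
                if not all(isinstance(c, str) for c in factor):
                    continue  # factors containing a non-terminal are not candidates
                if factor not in first:
                    first[factor] = (si, i)
                    continue
                sj, j = first[factor]
                if sj != si or i >= j + length:
                    return list(factor)
    return None


def replace_all(seq, factor, symbol):
    """Greedily replace occurrences of factor in seq, scanning left to right."""
    out, i, m = [], 0, len(factor)
    while i < len(seq):
        if seq[i:i + m] == factor:
            out.append(symbol)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out


def lfs2(text):
    """seqs[0] plays the role of w_k; seqs[i] is the right-hand side x_i of rule i."""
    seqs = [list(text)]
    while (factor := find_lrf(seqs)) is not None:
        symbol = len(seqs)  # index of the new rule doubles as its non-terminal
        seqs = [replace_all(s, factor, symbol) for s in seqs]
        seqs.append(factor)
    return seqs


def expand(seq, seqs):
    """Derive the terminal string generated by seq."""
    return "".join(c if isinstance(c, str) else expand(seqs[c], seqs) for c in seq)


grammar = lfs2("abaaabbababb")
for index, rhs in enumerate(grammar):
    print("S" if index == 0 else index, "->", rhs)
print(expand(grammar[0], grammar))
```

Because right-hand sides of earlier rules are rescanned in every round, a shorter LRF found later can also shorten an older rule, which is exactly what distinguishes LFS2 from LFS.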

Theorem 2

Given a string w, the LFS2 strategy compresses w in linear time and space.

Proof.

We modify the algorithm proposed in the previous sections. If we have a generalized SLSTree for set

\{ w_{k}, x_{1} $_{1}, \dots, x_{k} $_{k} \}

of strings, we can find an LRF of

z_{k} = w_{k} x_{1} $_{1} \dots x_{k} $_{k}

. It follows from the property of the longest first substitution strategy that

| x_{i} | \geq | x_{j} |

for any

i < j

. Therefore, any new node inserted into the generalized SLSTree for

\{ w_{k}, x_{1} $_{1}, \dots, x_{k - 1} $_{k - 1} \}

is shorter than the reference nodes of the tree. Thus, using the Ukkonen on-line algorithm [29], we can obtain the generalized SLSTree of

\{ w_{k}, x_{1} $_{1}, \dots, x_{k} $_{k} \}

, by inserting the suffixes of each

x_{k} $_{k}

into the generalized SLSTree of

\{ w_{k}, x_{1} $_{1}, \dots, x_{k - 1} $_{k - 1} \}

in

O (| x_{k} $_{k} |)

time. It is easy to see that the total length of

x_{1} $_{1}, \dots, x_{k} $_{k}, \dots

is

O (| w |)

. □
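One way to see the last claim is the following counting argument. It is a sketch under the assumption that each x_{k} consists of terminal symbols only at the time it is created (as in Observation 1); the symbols T_{k} (the number of terminal symbols in z_{k}) and c_{k} (the number of replaced occurrences of x_{k}, at least 2) are introduced here for illustration only.

```latex
% Replacing c_k >= 2 occurrences removes c_k |x_k| terminals;
% appending x_k $_k puts |x_k| terminals back.
T_k = T_{k-1} - c_k |x_k| + |x_k| \le T_{k-1} - |x_k|
\quad\Longrightarrow\quad
\sum_{i \ge 1} |x_i| \le T_0 \le |w|
\quad\Longrightarrow\quad
\sum_{i \ge 1} |x_i \$_i| = \sum_{i \ge 1} \left( |x_i| + 1 \right) \le 2 |w| .
```

The last inequality uses |x_{i}| \geq 1, so that each separator can be charged to a symbol of x_{i}.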

4. Conclusions and Future Work

This paper introduced a linear-time algorithm to compress a given text by longest-first substitution (LFS). We employed a new data structure called sparse lazy suffix trees in the core of the algorithm.

We also gave a linear-time algorithm for LFS2 that achieves better compression than LFS.

A related open problem is the following: Does there exist a linear time algorithm for text compression by largest-area-first substitution (LAFS)? The algorithm presented in [21] uses minimal augmented suffix trees (MASTrees) [31] which enable us to efficiently find a factor of the largest area. The size of MASTrees is known to be linear in the input size [32], but the state-of-the-art algorithm of [32] to construct MASTrees takes

O (n log n)

time, where n is the input text length. Also, the algorithm of [21] for LAFS reconstructs the MASTree from scratch, every time a factor of the largest area is replaced by a new non-terminal symbol. Would it be possible to update a MASTree or its relaxed version for following substitutions?

Acknowledgments

We would like to thank Matthias Gallé and Pierre Peterlongo for leading us to reference [28].

References

1. Kida, T.; Matsumoto, T.; Shibata, Y.; Takeda, M.; Shinohara, A.; Arikawa, S. Collage system: a unifying framework for compressed pattern matching. Theoretical Computer Science 2003, 298, 253–272.
2. Mäkinen, V.; Ukkonen, E.; Navarro, G. Approximate Matching of Run-Length Compressed Strings. Algorithmica 2003, 35, 347–369.
3. Lifshits, Y. Processing Compressed Texts: A Tractability Border. In Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM’07); Springer-Verlag, 2007; Vol. 4580, Lecture Notes in Computer Science; pp. 228–240.
4. Matsubara, W.; Inenaga, S.; Ishino, A.; Shinohara, A.; Nakamura, T.; Hashimoto, K. Efficient Algorithms to Compute Compressed Longest Common Substrings and Compressed Palindromes. Theoretical Computer Science 2009, 410, 900–913.
5. Hermelin, D.; Landau, G. M.; Landau, S.; Weimann, O. A Unified Algorithm for Accelerating Edit-Distance Computation via Text-Compression. In Proc. 26th International Symposium on Theoretical Aspects of Computer Science (STACS’09); 2009; pp. 529–540.
6. Matsubara, W.; Inenaga, S.; Shinohara, A. Testing Square-Freeness of Strings Compressed by Balanced Straight Line Program. In Proc. 15th Computing: The Australasian Theory Symposium (CATS’09); Australian Computer Society, 2009; Vol. 94, CRPIT; pp. 19–28.
7. Nevill-Manning, C. G.; Witten, I. H. Identifying hierarchical structure in sequences: a linear-time algorithm. J. Artificial Intelligence Research 1997, 7, 67–82.
8. Nevill-Manning, C. G.; Witten, I. H. Online and offline heuristics for inferring hierarchies of repetitions in sequences. Proc. IEEE 2000, 88, 1745–1755.
9. Giancarlo, R.; Scaturro, D.; Utro, F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009, 25, 1575–1586.
10. Kieffer, J. C.; Yang, E.-H. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory 2000, 46, 737–754.
11. Storer, J. NP-completeness Results Concerning Data Compression. Technical Report 234, Department of Electrical Engineering and Computer Science, Princeton University. 1977.
12. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Information Theory 1978, 24, 530–536.
13. Welch, T. A. A Technique for High-Performance Data Compression. IEEE Computer 1984, 17, 8–19.
14. Kieffer, J. C.; Yang, E.-H.; Nelson, G. J.; Cosman, P. C. Universal lossless compression via multilevel pattern matching. IEEE Transactions on Information Theory 2000, 46, 1227–1245.
15. Sakamoto, H. A fully linear-time approximation algorithm for grammar-based compression. Journal of Discrete Algorithms 2005, 3, 416–430.
16. Rytter, W. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 2003, 302, 211–222.
17. Sakamoto, H.; Maruyama, S.; Kida, T.; Shimozono, S. A Space-Saving Approximation Algorithm for Grammar-Based Compression. IEICE Trans. on Information and Systems 2009, E92-D, 158–165.
18. Maruyama, S.; Tanaka, Y.; Sakamoto, H.; Takeda, M. Context-Sensitive Grammar Transform: Compression and Pattern Matching. In Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE’08); Springer-Verlag, 2008; Vol. 5280, Lecture Notes in Computer Science; pp. 27–38.
19. Wolff, J. G. An algorithm for the segmentation for an artificial language analogue. British Journal of Psychology 1975, 66, 79–90.
20. Larsson, N. J.; Moffat, A. Offline Dictionary-Based Compression. In Proc. Data Compression Conference ’99 (DCC’99); IEEE Computer Society, 1999; p. 296.
21. Apostolico, A.; Lonardi, S. Off-Line Compression by Greedy Textual Substitution. Proc. IEEE 2000, 88, 1733–1744.
22. Apostolico, A.; Lonardi, S. Compression of Biological Sequences by Greedy Off-Line Textual Substitution. In Proc. Data Compression Conference ’00 (DCC’00); IEEE Computer Society, 2000; pp. 143–152.
23. Ziv, J.; Lempel, A. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 1977, IT-23, 337–349.
24. Burrows, M.; Wheeler, D. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. 1994.
25. Nakamura, R.; Bannai, H.; Inenaga, S.; Takeda, M. Simple Linear-Time Off-Line Text Compression by Longest-First Substitution. In Proc. Data Compression Conference ’07 (DCC’07); IEEE Computer Society, 2007; pp. 123–132.
26. Inenaga, S.; Funamoto, T.; Takeda, M.; Shinohara, A. Linear-time off-line text compression by longest-first substitution. In Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE’03); Springer-Verlag, 2003; Vol. 2857, Lecture Notes in Computer Science; pp. 137–152.
27. Bentley, J.; McIlroy, D. Data compression using long common strings. In Proc. Data Compression Conference ’99 (DCC’99); IEEE Computer Society, 1999; pp. 287–295.
28. Lanctot, J. K.; Li, M.; Yang, E.-H. Estimating DNA sequence entropy. In Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’00); 2000; pp. 409–418.
29. Ukkonen, E. On-line Construction of Suffix Trees. Algorithmica 1995, 14, 249–260.
30. Kärkkäinen, J.; Ukkonen, E. Sparse Suffix Trees. In Proc. 2nd Annual International Computing and Combinatorics Conference (COCOON’96); Springer-Verlag, 1996; Vol. 1090, Lecture Notes in Computer Science; pp. 219–230.
31. Apostolico, A.; Preparata, F. P. Data structures and algorithms for the string statistics problem. Algorithmica 1996, 15, 481–494.
32. Brødal, G. S.; Lyngsø, R. B.; Östlin, A.; Pedersen, C. N. S. Solving the String Statistics Problem in Time O(n log n). In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP’02); Springer-Verlag, 2002; Vol. 2380, Lecture Notes in Computer Science; pp. 728–739.
33. Lanctot, J. K. Some String Problems in Computational Biology. PhD thesis, University of Waterloo, 2004.

Appendix

In this appendix we show that the algorithm of Lanctot et al. [28] for LFS2 takes

O (n^{2})

time, where n is the length of the input string.

Consider string

w = w_{0} = z_{0} = aaaaaaabbbbbaaaabbbbbcaaaaaaaa $ .

The Lanctot algorithm constructs a suffix tree of w, constructs a bin-sorted list of internal nodes of the tree, and updates the tree in a similar way to our algorithm in Section 3.3. However, a critical difference is that any node v of their tree structure does not store an ordered triple

〈 min (v), max (v), card (v) 〉

such that

min (v) = min {BP}_{w} (str (v))

,

max (v) = max {BP}_{w} (str (v))

, and

card (v) = | {BP}_{w} (str (v)) |

.

See Figure 8 which illustrates the suffix tree of w.

A bin-sorted list of internal nodes of

STree (w)

in decreasing order of their length is as follows:

\begin{matrix} 9 : aaaabbbbb \\ 8 : aaabbbbb \\ 7 : aabbbbb, aaaaaaa \\ 6 : abbbbb, aaaaaa \\ 5 : bbbbb, aaaaa \\ 4 : bbbb, aaaa \\ 3 : bbb, aaa \\ 2 : bb, aa \\ 1 : b, a \end{matrix}
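The list above can be reproduced by brute force, which may help when experimenting with other inputs. The sketch below does not build a suffix tree; it relies on the standard fact that the non-root internal nodes of the suffix tree of w$ correspond to the factors of w that are followed by at least two distinct symbols. The function name `internal_node_bins` is hypothetical, and the quadratic enumeration is for illustration only.

```python
from collections import defaultdict


def internal_node_bins(w):
    """Group the strings of the internal suffix-tree nodes of w$ by length,
    in decreasing order of length."""
    text = w + "$"
    followers = defaultdict(set)  # factor -> set of symbols that follow it
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            followers[text[i:j]].add(text[j])
    bins = defaultdict(list)
    for factor, nxt in followers.items():
        if len(nxt) >= 2:  # branching factor, i.e., an internal node
            bins[len(factor)].append(factor)
    return {length: sorted(factors, reverse=True)
            for length, factors in sorted(bins.items(), reverse=True)}


for length, factors in internal_node_bins("aaaaaaabbbbbaaaabbbbbcaaaaaaaa").items():
    print(length, ":", ", ".join(factors))
```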

In [28], Lanctot et al. do not mention how they find occurrences of each node in the sorted list. Since they do not have an ordered triple

〈 min (v), max (v), card (v) 〉

for each node v, the best possible way is to traverse the subtree of v checking the leaves in the subtree. Now, for the first LRF-candidate

aaaabbbbb

, we get positions 4 and 13 and find out that

LRF (w) = LRF (z_{0}) = aaaabbbbb

. Then we obtain

w_{1} = aaa A A caaaaaaaa $,

where A is a new non-terminal symbol that replaces

LRF (z_{0}) = aaaabbbbb

.

Figure 8.

STree (w)

with

w = aaaaaaabbbbbaaaabbbbbcaaaaaaaa $

.

Now see Figure 9 which illustrates a generalized sparse suffix tree for

z_{1} = aaa A A caaaaaaaa $ aaaabbbbb # .

To find

LRF (z_{1})

, we check the nodes in the list as follows.

Length 8. The generalized suffix tree has no node representing $aaabbbbb$, and hence it is not an LRF.
Length 7. Since node $aaaaaaa$ exists in the generalized suffix tree, we traverse its subtree and find 2 occurrences 23 and 24 in $z_{1}$. However, it is not an LRF of $z_{1}$. The other candidate $aabbbbb$ does not have a corresponding node in the tree, so it is not an LRF, either.
Length 6. Node $aaaaaa$ exists in the generalized suffix tree and we find 3 occurrences 23, 24 and 25 in $z_{1}$ by traversing the tree, but it is not an LRF. The tree has no node corresponding to $abbbbb$, hence it is not an LRF.
Length 5. Node $aaaaa$ exists in the generalized suffix tree and we find 4 occurrences 23, 24, 25 and 26 in $z_{1}$ by traversing the tree, but it is not an LRF. There is no node in the tree corresponding to $bbbbb$.
Length 4. Node $aaaa$ exists in the generalized suffix tree and we find 5 occurrences 23, 24, 25, 26 and 27. Now 23 and 27 are non-overlapping occurrences of $aaaa$, and hence it is an LRF of $z_{1}$.

Figure 9. Generalized sparse suffix tree of

z_{1} = aaa A A caaaaaaaa $ aaaabbbbb #

.

Focus on the above operations where we examined factors of lengths from 7 to 5. The total time cost to find the occurrences for the LRF-candidates of these lengths is proportional to 2 + 3 + 4, but none of them is an LRF of

z_{1}

in the end.

In general, for any input string of the form

w = a^{2 k - 1} b^{k + 1} a^{k} b^{k + 1} c a^{2 k} $,

the time cost of the Lanctot algorithm for finding

LRF (z_{1})

is proportional to

2 + 3 + \dots + k = \frac{(k - 1) (k + 2)}{2} .

Since

k = O (| w |) = O (n)

, the Lanctot algorithm takes

O (n^{2})

time.
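The count above can be checked experimentally. The sketch below builds w for a given k, performs the first substitution, and counts how many occurrences are reported for the candidates a^{m} with k < m < 2k, none of which turns out to be an LRF. It assumes, as in the example with k = 4, that the first replaced factor is a^{k} b^{k + 1}; plain string scanning stands in for the subtree traversal, and the names `build_w`, `occurrences` and `wasted_leaf_visits` are hypothetical.

```python
def build_w(k):
    """w = a^(2k-1) b^(k+1) a^k b^(k+1) c a^(2k), without the end-marker."""
    return ("a" * (2 * k - 1) + "b" * (k + 1) + "a" * k
            + "b" * (k + 1) + "c" + "a" * (2 * k))


def occurrences(text, factor):
    """All (possibly overlapping) beginning positions of factor in text."""
    return [i for i in range(len(text) - len(factor) + 1)
            if text.startswith(factor, i)]


def wasted_leaf_visits(k):
    """Number of occurrences inspected for candidates that are not LRFs of z_1."""
    lrf = "a" * k + "b" * (k + 1)
    w1 = build_w(k).replace(lrf, "A")   # 'A' stands for the new non-terminal
    z1 = w1 + "$" + lrf + "#"
    visited = 0
    for m in range(2 * k - 1, k, -1):
        occ = occurrences(z1, "a" * m)
        assert occ[-1] - occ[0] < m      # every pair of occurrences overlaps
        visited += len(occ)
    return visited


for k in (4, 8, 16, 32):
    print(k, wasted_leaf_visits(k), (k - 1) * (k + 2) // 2)
```

For k = 4 this gives 2 + 3 + 4 = 9 inspected occurrences, in agreement with the walk-through above.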

In his PhD thesis [33], Lanctot modified the algorithm so that all the occurrences of each candidate factor in w are stored in each element of the bin-sorted list (Section 3.1.3, page 55, line 1). However, this clearly requires

O (n^{2})

space. Note that using a suffix array cannot immediately solve this, since the lexicographical ordering of the suffixes can change due to substitution of LRFs, and no efficient methods to edit suffix arrays for such a case are known.

On the contrary, as shown in Section 3, each node v of our data structure stores an ordered triple

〈 min (v), max (v), card (v) 〉

, and our algorithm properly maintains this information when the tree is updated. Using this triple, we can check in amortized constant time whether or not each node in the bin-sorted list is an LRF. Hence the total time cost remains

O (n)

.
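As a small illustration of why the triple suffices, note that a factor has two non-overlapping occurrences exactly when its leftmost and rightmost occurrences do not overlap. The sketch below applies this test to the candidates of the example, using the positions listed above; the function name `has_nonoverlapping_occurrences` is hypothetical and the snippet is not taken from the authors' pseudo-code.

```python
def has_nonoverlapping_occurrences(min_pos, max_pos, length):
    """Constant-time test using only min(v), max(v) and len(v)."""
    return max_pos - min_pos >= length


# (factor, min(v), max(v)) for the candidates examined in z_1
candidates = [
    ("aaaaaaa", 23, 24),
    ("aaaaaa", 23, 25),
    ("aaaaa", 23, 26),
    ("aaaa", 23, 27),
]
for factor, lo, hi in candidates:
    print(factor, has_nonoverlapping_occurrences(lo, hi, len(factor)))
```

No leaf is visited in this test, so rejected candidates cost constant time each instead of time proportional to their number of occurrences.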

© 2009 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

