Approximation Ratios of RePair, LongestMatch and Greedy on Unary Strings

Danny Hucke; Carl Philipp Reh

doi:10.3390/a14020065

Abstract

A grammar-based compressor is an algorithm that receives a word and outputs a context-free grammar that only produces this word. The approximation ratio for a single input word is the size of the grammar produced for this word divided by the size of a smallest grammar for this word. The worst-case approximation ratio of a grammar-based compressor for a given word length is the largest approximation ratio over all input words of that length. In this work, we study the worst-case approximation ratio of the algorithms Greedy, RePair and LongestMatch on unary strings, i.e., strings that only make use of a single symbol. Our main contribution is to show the improved upper bound of

O ({(log n)}^{8} \cdot {(log log n)}^{3})

for the worst-case approximation ratio of Greedy. In addition, we also show the lower bound of

1.34847194 \dots

for the worst-case approximation ratio of Greedy, and that RePair and LongestMatch have a worst-case approximation ratio of

{log}_{2} (3)

.

Keywords:

data compression; grammar-based compression; approximation algorithm; addition chain

1. Introduction

The goal of grammar-based compression is to represent a word w by a small context-free grammar that produces exactly

{w}

. Such a grammar is called a straight-line program (SLP) for w. In the best case, one gets an SLP of size

Θ (log n)

for a word of length n, where the size of an SLP is the total length of all right-hand sides of the rules of the grammar. A grammar-based compressor is an algorithm that produces an SLP for a given word w. There are various grammar-based compressors that can be found in many places in the literature. A well-known example is the classic LZ78-compressor of Lempel and Ziv [1]. Although it was not introduced as a grammar-based compressor, it is straightforward to compute from the LZ78-factorization of w an SLP for w of roughly the same size. Other examples include BISECTION [2] and SEQUITUR [3]. In this work, we study the global grammar-based compressors Greedy [4,5,6], RePair [7] and LongestMatch [8], to which we will also refer to as global algorithms. Global algorithms are important in practice because they show excellent compression results in various fields. For example, Greedy is used in [5] to compress DNA sequences. Among all global algorithms, RePair is probably the most used one. Examples include compressing web graphs [9], searching compressed text [10], suffix array compression [11] and compressing XML [12]. A key concept of global compressors are maximal strings. A maximal string of an SLP

A

is a word that has length at least two and occurs at least twice without overlap as a factor of the right-hand sides of the rules of

A

. Furthermore, no strictly longer word appears at least as many times without overlap as a factor of the right-hand sides of

A

. For an input word w, a global grammar-based compressor starts with the SLP that has a single rule

S \to w

, where S is the start nonterminal of the grammar. The SLP is then recursively updated by choosing a maximal string

γ

of the current SLP and replacing a maximal set of pairwise nonoverlapping occurrences of

γ

by a new nonterminal X. Additionally, a new rule

X \to γ

is introduced. The algorithm stops when the obtained SLP has no maximal string. In the case of Greedy the chosen maximal string minimizes the size of the SLP in each round, while RePair selects in each round a most frequent maximal string, and LongestMatch chooses a longest maximal string. Please note that the Greedy algorithm as originally presented in [4,5,6] is different from the version studied in this work as well as in [13]: The original Greedy algorithm only considers the right-hand side of the start rule for the choice and the replacement of the maximal string. In particular, all other rules do not change after they are introduced.

In [13] the worst-case approximation ratio of grammar-based compressors is studied. For a grammar-based compressor

C

that computes an SLP

C (w)

for a given word w, one defines the approximation ratio of

C

on w as the quotient of the size of

C (w)

and the size

g (w)

of a smallest SLP for w. The approximation ratio

α_{C} (n)

is the maximal approximation ratio of

C

among all words of length n. In [13] the authors provide upper and lower bounds for the approximation ratios of several grammar-based compressors (among them are all compressors mentioned so far), but for none of the compressors the lower and upper bounds match. For LZ78 and BISECTION these gaps were closed in [14]. For all global algorithms the best upper bound on the approximation ratio is

O ({(n / log n)}^{2 / 3})

[13], while the best known lower bounds so far are

Ω (log n / log log n)

for RePair [15],

Ω (log log n)

for LongestMatch and

5 / (3 {log}_{3} (5)) = 1.137 \dots

for Greedy [13]. In general, the achieved bounds “leave a large gap of understanding surrounding the global algorithms" as the authors in [13] conclude.

Unary words have the form

a^{n}

for some symbol a and integer

n \geq 1

. Grammar-based compression on unary words is strongly related to the field of addition chains, which has been studied for decades (see [16] (Chapter 4.6.3) for a survey) and still is an active topic due to the strong connection to public key cryptosystems (see [17] for a review from that point of view). An addition chain for an integer n of size m is a sequence of integers

1 = k_{1}, k_{2}, \dots, k_{m} = n

such that for each d (

2 \leq d \leq m

), there exists

i, j

(

1 \leq i, j < d

) such that

k_{i} + k_{j} = k_{d}

. It is straightforward to compute from an addition chain for an integer n of size m an SLP for

a^{n}

of size

2 m - 2

. Vice versa, an SLP for

a^{n}

of size m yields an addition chain for n of size m. Therefore, grammar-based compressors on unary inputs can also be thought of as addition chain solvers, i.e., algorithms that find a (small) addition chain for a given integer.

The worst-case approximation ratio for global algorithms is difficult to analyze. A good starting point is therefore to analyze them on unary words because of their simplicity. Even though unary words are not interesting to compress, it is still interesting to look at how global algorithms perform on them. The improved upper bound we show for Greedy uses unary words and is the first improvement that happened in 15 years.

We show the worst-case approximation ratio of RePair and LongestMatch for unary words to be

{log}_{2} (3)

. Both algorithms are basically identical to the binary method that produces an addition chain for n by creating powers of two using repeated squaring, and then the integer n is represented as the sum of those powers of two that correspond to a one in the binary representation of n. Based on that information, we show that for any unary input w the produced SLPs of RePair and LongestMatch have size at most

{log}_{2} (3) \cdot g (w)

and we also provide a lower bound.

We improve the upper bound for the approximation ratio of Greedy on unary words to

O ({(log n)}^{8} \cdot {(log log n)}^{3})

. In [18], which is the previous version of this article, the authors showed an upper bound for the approximation ratio of Greedy of

O (n^{1 / 4} / log n)

for unary inputs, by only analyzing the first three rounds. Here, we present a more in-depth analysis that makes use of every round. We can prove that Greedy produces an SLP of size

O ({(log n)}^{9} \cdot {(log log n)}^{3})

on input

a^{n}

, which together with the fact that a smallest SLP for

a^{n}

has size

Ω (log n)

then yields the improved upper bound of

O ({(log n)}^{8} \cdot {(log log n)}^{3})

. To prove the size bound on the SLP produced by Greedy, we distinguish unary and nonunary nonterminals. A nonterminal X is called unary if its right-hand side is of the form

X \to Z^{d}

when it is first introduced. Otherwise, it is nonunary. We bound the total number of occurrences of all unary and nonunary nonterminals in the grammar produced by Greedy separately. For the unary nonterminals, we bound their total number to be

O ({(log n)}^{9})

, while each of them contributes a size of

O (log log n)

, which yields a total size contribution of

O ({(log n)}^{9} \cdot log log n)

. We then bound the number of occurrences of nonunary nonterminals using the already established number of unary nonterminals, which comes out to be

O ({(log n)}^{9} \cdot {(log log n)}^{3})

. Thus, we obtain the desired upper bound on the size of the grammar.

We also show the lower bound of

1.34847194 \dots

for the approximation ratio of Greedy. The key to achieve this bound is the sequence

y_{k} = y_{k - 1}^{2} + 1

with

y_{0} = 2

, which has been studied in [19] (among other sequences), where it is shown that

y_{k} = ⌊ γ^{2^{k}} ⌋

for

γ = 2.258 \dots

. To prove the lower bound, we show that the SLP produced by Greedy on input

a^{y_{k}}

has size

3 \cdot 2^{k} - 1

, while a smallest SLP for

a^{y_{k}}

has size

3 \cdot {log}_{3} (γ) \cdot 2^{k} + o (2^{k})

(this follows from a construction used to prove the lower bound for Greedy in [13]).

This paper is an extended version of our paper published in the proceedings of SPIRE 2019 [18].

Related Work

One of the first appearances of straight-line programs in the literature are [20,21], where they are called word chains (since they generalize addition chains from numbers to words). In [20] it is shown that the function

g (k, n) = max {g (w) ∣ w \in {1, \dots, k}^{n}}

is in

Θ (n / {log}_{k} n)

. Recall that

g (w)

is the size of a smallest SLP for the word w and thus

g (k, n)

measures the worst-case SLP-compression over all words of length n over a k-letter alphabet.

The smallest grammar problem is the problem of computing a smallest SLP for a given input word. It is known from [13,22] that in general no grammar-based compressor can solve the smallest grammar problem in polynomial time unless

P = NP

. Even worse, unless

P = NP

one cannot compute in polynomial time for a given word w an SLP of size at most

\frac{8569}{8568} \cdot g (w)

[13]. One should mention that the constructions to prove these hardness results use alphabets of unbounded size. Although in [13] it is remarked that the construction in [22] works for words over a ternary alphabet, in [23] it is argued that this is not clear at all and a construction for fixed alphabets of size at least 24 is given. However, for grammar-based compression on unary strings as studied in this work (as well as for the problem of computing a smallest addition chain), there is no

NP

-hardness result, so there might be an optimal polynomial-time algorithm even though it is widely believed that there is none.

Other notable systematic investigations of grammar-based compression are provided in [8,24]. However, in [8], grammar-based compressors are used for universal lossless compression (in the information-theoretical sense), it is shown in [24] that the size of so-called irreducible SLPs (that include SLPs produced by global algorithms) can be upper-bounded by the (unnormalized) k-th order empirical entropy of the produced string plus some lower order terms.

2. Preliminaries

For

i, j \in N

we write

[i, j] = {i, i + 1, \dots, j}

for

i \leq j

and

[i, j] = \emptyset

otherwise. For

m, n \in N

we denote by

m div n

the integer division of m and n. We denote by

m mod n

the modulo of m and n, i.e.,

m mod n \in [0, n - 1]

and

m = (m div n) \cdot n + (m mod n) .

If

m / n

or

\frac{m}{n}

is used, then this refers to the standard division over

R

. Please note that

m div n = ⌊ m / n ⌋

and

(m div n) + (m mod n) \geq m / n

.

An alphabet

Σ

is a finite set of symbols. For a word or string

w = a_{1} \dots a_{n}

over

Σ

with

a_{1}, \dots, a_{n} \in Σ

and

n \geq 0

we write

| w | = n

to denote w’s length. The set of

Σ^{*}

consists of all words over

Σ

and

Σ^{+} = Σ^{*} \ {ε}

, where

ε

is the word of length 0. A unary word is a word of the form

a^{n}

with

a \in Σ

and

n \in N

. All other words

w \in Σ^{+}

are called nonunary. For words

w, v \in Σ^{+}

we say that v is a factor of w if there are

x, y \in Σ^{*}

such that

w = x v y

.

A context-free grammar is a tuple

(N, Σ, P, S)

, where N is the finite set of nonterminals,

Σ

is the alphabet with

Σ \cap N = \emptyset

, P is the set of productions of the form

X \to v

, where

X \in N

and

v \in {(Σ \cup N)}^{+}

, and

S \in N

is the start symbol. An SLP is a context-free grammar

A = (N, Σ, P, S)

, where

every $X \in N$ has exactly one production, i.e.,

$| {(X \to v) \in P ∣ v \in {(Σ \cup N)}^{+}} | = 1, and$
the relation ${(A, B) \in N \times N ∣ (A \to v) \in P, B occurs in v}$ is acyclic.

This way, for every

X \in N

there exists a unique word

w \in Σ^{+}

with

X \to^{+} w

. We say that

A

produces w if

S \to^{+} w

. Please note that some authors require SLPs to be in Chomsky Normal Form (CNF), i.e., every production is of the form

A \to B C

, where

B, C \in N

or

A \to a

, where

a \in Σ

. We do not make this assumption here because, in general, grammar-based compressors produce SLPs that are not in CNF. Furthermore, every SLP can easily be transformed into an SLP that is in CNF, produces the same word and has roughly the same size. A grammar-based compressor

C

is an algorithm that given an input

w \in Σ^{+}

outputs an SLP

A

that produces w. The size of an SLP is defined as

| A | = \sum_{(X \to v) \in P} | v |

. For a word

w \in Σ^{+}

we write

g (w)

for the size of a smallest SLP that produces w. The worst-case approximation ratio

α_{C} (k, n)

of

C

is the maximal approximation ratio over all words of length n over an alphabet of size k:

α_{C} (k, n) = max \{\frac{| C (w) |}{g (w)} ∣ w \in {[1, k]}^{n}\} .

For a given SLP

A

, a word

γ

is called a maximal string of

A

if

$| γ | \geq 2$ ,
$γ$ appears at least twice without overlap as a factor of the right-hand sides,
and no strictly longer word appears at least as many times as a factor of the right-hand sides without overlap.

Example 1.

Let

A = ({S, X, Y, Z}, {a, b}, P, S)

such that P contains

$S \to a X X X X b b Y Z Y Z$ ,
$X \to Y b b Y Z Y Z a$ ,
$Y \to b b b Z Z a Z$ , and
$Z \to a b b$ .

The maximal strings of

A

are

b b

,

Y Z

and

b b Y Z Y Z

. The factors

b b

and

Y Z

occur four times on the right-hand sides without overlap and

b b Y Z Y Z

occurs twice without overlap.

A global grammar-based compressor (or simply global algorithm) starts on input w with the SLP

A_{0} = ({S}, Σ, {S \to w}, S)

. In each round

i \geq 1

, the algorithm selects a maximal string

γ

of

A_{i - 1}

and updates

A_{i - 1}

to

A_{i}

by replacing a largest set of pairwise nonoverlapping occurrences of

γ

in

A_{i - 1}

by a new nonterminal X. Additionally, the algorithm introduces the rule

X \to γ

in

A_{i}

. The algorithm stops when no maximal string occurs. Please note that the replacement is not unique, e.g., the word

a^{5}

has a unique maximal string

γ = a a

, which yields SLPs with rules

S \to X X a

,

X \to a a

or

S \to X a X

,

X \to a a

or

S \to a X X

,

X \to a a

. We assume the first variant here, i.e., maximal strings are replaced from left to right. The compressor Greedy that we study in this work chooses a maximal string in each round

i \geq 1

such that the size of

A_{i}

is minimal.

Example 2(Greedy).

Let

w = a a a a a b b a b a b b b a a a b b

. We have

$A_{0} :$: $S \to a a a a a b b a b a b b b a a a b b$ ,
$A_{1} :$: $S \to a a a a X a b X b a a X$ , $X \to a b b$ ,
$A_{2} :$: $S \to Y Y X a b X b Y X$ , $X \to a b b$ , $Y \to a a$ ,
$A_{3} :$: $S \to Y Y X Z X b Y X$ , $X \to Z b$ , $Y \to a a$ , $Z \to a b$ ,
$A_{4} :$: $S \to Y A Z X b A$ , $X \to Z b$ , $Y \to a a$ , $Z \to a b$ , $A \to Y X$ .

Please note that in the first round, instead of the maximal string

a b b

the algorithm could also choose the maximal string

a a a b b

, because both choices yield SLPs of minimal size 15. In the second round, instead of

a a

the algorithm could also choose

a a X

, because both choices yield SLPs of size 14. Finally, the order of the choices

a b

(round 3) and

Y X

(round 4) could be swapped because both choices yield SLPs of unchanged size 14.

The following lemma from [13] provides a lower bound on the size of an SLP for a word of length n.

Lemma 1

([13] (Lemma 1)). For every word

w \in Σ^{+}

of length n, we have

g (w) \geq 3 {log}_{3} (n) - 3

.

3. Upper Bound for `Greedy`

To show our improved upper bound for the approximation ratio of Greedy on unary words, we are first going to prove that the size of the SLP produced by Greedy for the input

a^{n}

is upper-bounded by

O {(log n)}^{9} \cdot {(log log n)}^{3}

.

Proposition 1.

For all n, we have

| Greedy (a^{n}) | \in O ({(log n)}^{9} \cdot {(log log n)}^{3}) .

First, we need to prove several lemmas that are fulfilled for any global algorithm. When we apply specific arguments to Greedy, we draw attention to it. For better readability, we will use

X_{0} = a

, i.e., the input is

X_{0}^{n}

. Furthermore, let

A_{i} = (N_{i}, {X_{0}}, P_{i}, S)

be the SLP obtained by the global algorithm on input

X_{0}^{n}

after i rounds. Please note that until the algorithm stops, we have

| N_{i} \ {S} | = i

since exactly one new nonterminal is introduced in each round. If we quantify over the rounds of the algorithm, we always implicitly mean that the statements hold until the algorithm stops. If i is mentioned without a quantification, then the statement holds for any

A_{i}

constructed after some round i of the algorithm.

Lemma 2.

For every i, there is a fixed order

X_{i} > X_{i - 1} > \dots > X_{1}

of the nonterminals in

N_{i} \ {S}

such that every right-hand side of a rule

(X \to v) \in P_{i}

satisfies

v \in X_{i}^{*} X_{i - 1}^{*} \dots X_{1}^{*} X_{0}^{*} .

Proof.

We prove this property by induction. Initially, the property holds for the SLP

A_{0}

since

N_{0} \ {S} = \emptyset

and the only rule

S \to X_{0}^{n}

satisfies

X_{0}^{n} \in X_{0}^{*}

. Now assume the claim is true for

A_{i}

, i.e., each right-hand side of a rule in

P_{i}

is a word from

X_{i}^{*} X_{i - 1}^{*} \dots X_{1}^{*} X_{0}^{*}

. Please note that any nonempty factor of such a right-hand side is a word from

X_{k}^{*} \dots X_{j + 1}^{*} X_{j}^{+}

for some

i \geq k > j \geq 0

. Therefore, assume the global algorithm chooses a maximal string

γ \in X_{k}^{*} \dots X_{j + 1}^{*} X_{j}^{+}

in round

i + 1

and

(X \to γ) \in P_{i + 1}

is the corresponding new rule. We show that

v \in X_{i}^{*} X_{i - 1}^{*} \dots X^{*} X_{j}^{*} \dots X_{1}^{*} X_{0}^{*}

for all rules

(Y \to v) \in P_{i + 1}

, i.e., the order of the nonterminals after round

i + 1

is obtained by inserting the new nonterminal X directly before

X_{j}

in the previous order. First, this is obviously true for the new rule

X \to γ

as well as for all rules that have not been modified during round

i + 1

. It remains to check the rules

(Y \to v) \in P_{i + 1}

that are obtained from a rule

(Y \to v^{'}) \in P_{i}

by replacing a largest set of pairwise nonoverlapping occurrences of

γ

in

v^{'}

by the new nonterminal X. If

γ = X_{j}^{d}

(

d \geq 2

) is unary and

X_{j}^{ℓ}

is the single maximal

X_{j}

-block that occurs in

v^{'}

, then replacing occurrences of

γ

from left to right yields

X^{ℓ div d} X_{j}^{ℓ mod d}

as the new maximal blocks of X and

X_{j}

in w. It follows that

v \in X_{i}^{*} X_{i - 1}^{*} \dots X^{*} X_{j}^{*} \dots X_{1}^{*} X_{0}^{*}

. If otherwise

γ

is not a unary word, i.e.,

γ \in X_{k}^{+} \dots X_{j + 1}^{*} X_{j}^{+}

for

i \geq k > j \geq 0

, then

v^{'}

has exactly one occurrence of

γ

as a factor. It follows that

v \in X_{i}^{*} X_{i - 1}^{*} \dots X_{k}^{*} X X_{j}^{*} \dots X_{1}^{*} X_{0}^{*}

and thus v satisfies the claim. This finishes the induction. □

In other words, there is at most one maximal block for each symbol on each right-hand side and the order of these blocks is the same for all rules. Similar to the case distinction in the last steps of the proof of Lemma 2, we will distinguish two types of nonterminals. Let

X \to γ

be the introduced rule in some round of the algorithm. If

γ

is unary, then we call X a unary nonterminal. Otherwise, we call X nonunary. We categorize

X_{0}

as a unary nonterminal, although formally

X_{0}

is not a nonterminal. Please note that the type of a nonterminal is decided when it is introduced and does not change later, i.e., even if the right-hand side of a unary nonterminal becomes nonunary during the execution of the algorithm, the type of the nonterminal stays the same. Our strategy to prove Theorem 1 is to bound the total number of occurrences of unary nonterminals and nonunary nonterminals on right-hand sides independently. It follows from Lemma 2 that every factor that occurs more than once on the right-hand side of a single rule is unary. The following lemma is a direct consequence of that fact.

Lemma 3.

Every nonunary nonterminal occurs at most once on the right-hand side of each rule at any time of the algorithm.

Corollary 1.

If a unary nonterminal X is introduced and

X \to Z^{d}

with

d \geq 2

is the corresponding rule for X, then Z is a unary nonterminal.

Next, we bound the number of rules that contain a unary nonterminal X on the right-hand side. For a nonterminal X, including

X_{0}

, let

#_{i} (X)

be the number of rules of

A_{i}

where X occurs on the right-hand side, or more formally

#_{i} (X) = | {(Y \to v) \in P_{i} ∣ X occurs in v} | .

The next two lemmas describe how

#_{i}

evolves depending on the type of the introduced nonterminal.

Lemma 4.

If a nonunary nonterminal X is introduced in some round

i + 1

, then for every

X^{'} \neq X

(including

X_{0}

), we have

#_{i + 1} (X^{'}) \leq #_{i} (X^{'})

.

Proof.

To prove this point, we use that all rules

(Y \to v) \in P_{i}

satisfy

v \in X_{i}^{*} X_{i - 1}^{*} \dots X_{1}^{*} X_{0}^{*}

(Lemma 2). Since X is nonunary, the chosen maximal string

γ

satisfies

γ \in X_{k}^{+} \dots X_{j + 1}^{*} X_{j}^{+}

for

i \geq k > j \geq 0

. If a nonterminal

X^{'}

does not occur in

γ

, then

#_{i + 1} (X^{'}) = #_{i} (X^{'})

. So, assume the nonterminal

X^{'}

occurs in

γ

, i.e.,

X^{'} = X_{t}

for

t \in [j, k]

. Note first that

X_{t}

occurs on the right-hand side of the new rule

(X \to γ) \in P_{i + 1}

. It follows that to prove the claimed result, we must show that for at least one rule

(Y \to v) \in P_{i}

such that

X_{t}

occurs in v, all occurrences of

X_{t}

must disappear, i.e.,

X_{t}

does not occur in

v^{'}

for

(Y \to v^{'}) \in P_{i + 1}

. Let

M = {Y \in N_{i} ∣ (Y \to v) \in P_{i} and γ is a factor of v} \subseteq N_{i}

be the set of nonterminals of

A_{i}

where the corresponding rule is modified in round

i + 1

. Please note that

| M | \geq 2

since a maximal string occurs at least twice on all right-hand sides and

γ

occurs at most once as factor of each rule since X is nonunary (Lemma 3). If

k < t < j

then for all

Y \in M

and

(Y \to v^{'}) \in P_{i + 1}

, the nonterminal

X_{t}

does not occur in

v^{'}

anymore since the complete

X_{t}

-block (among other symbols) has been replaced. This means that

#_{i + 1} (X_{t}) < #_{i} (X_{t})

since one new rule contains

X_{t}

on the right-hand side, while for at least two rules the occurrences of

X_{t}

have been removed in round

i + 1

. If otherwise

t = k

or

t = j

, then the same argument fails since for

Y \in M

and

(Y \to v^{'}) \in P_{i + 1}

, the right-hand side

v^{'}

could still contain

X_{t}

since it is not necessarily true that the complete

X_{t}

-block has been replaced. However, due to the properties of a maximal string, we show that

X_{t}

does not occur in

v^{'}

for at least one

Y \in M

and

(Y \to v^{'}) \in P_{i + 1}

. Towards a contradiction, assume

X_{t}

occurs in

v^{'}

for all

Y \in M

and

(Y \to v^{'}) \in P_{i + 1}

. This means that for all

Y \in M

and

(Y \to v) \in P_{i}

, the length of the maximal

X_{t}

-block that occurs in v is strictly larger than the length of the maximal

X_{t}

-block that occurs in

γ

. If

t = k

, it follows that

X_{k} γ

is a factor of v for all

Y \in M

and

(Y \to v) \in P_{i}

, and symmetrically, if

t = j

then

γ X_{j}

is a factor of v for all

Y \in M

and

(Y \to v) \in P_{i}

. This contradicts the property that no strictly longer string than

γ

occurs at least as often on the right-hand sides of the rules. It follows that in this case

#_{i + 1} (X_{t}) \leq #_{i} (X_{t})

, which finishes the proof. □

Lemma 5.

If a unary nonterminal X is introduced in some round

i + 1

and

X \to Z^{d}

with

d \geq 2

is the corresponding rule, then

#_{i + 1} (X) \leq #_{i} (Z)

and

#_{i + 1} (Z) \leq #_{i} (Z) + 1

.

Proof.

Both points are straightforward: A rule

(Y \to v) \in P_{i + 1}

only contains X on the right-hand side if

Z^{d}

is a factor of

v^{'}

for

(Y \to v^{'}) \in P_{i}

, which shows that

#_{i + 1} (X) \leq #_{i} (Z)

. For the second point, note that if

(Y \to v) \in P_{i}

does not contain Z on the right-hand side, then the same is true for (the unchanged rule)

(Y \to v) \in P_{i + 1}

. The only rule where Z occurs new is the new rule

X \to Z^{d}

, and thus

#_{i + 1} (Z) \leq #_{i} (Z) + 1

. □

So far, we have shown that when a unary nonterminal X is introduced and

X \to Z^{d}

with

d \geq 2

is the corresponding rule, then Z is a unary nonterminal as well (Corollary 1). Furthermore, we argued that introducing a nonunary nonterminal does not increase the number of rules where a unary nonterminal occurs on the right-hand side (Lemma 4). It follows that we can upper-bound the number of unary nonterminals and the number of rules where those nonterminals occur on right-hand sides independently of the nonunary nonterminals.

To do so, we inductively define a binary tree

T_{i}

that describes how the unary nonterminals evolve until

A_{i}

is reached. All nodes in the tree are labeled with

(X, k)

, where X is a unary nonterminal and k is an upper bound on

#_{j} (X)

for some j.

(1): Initially, the tree $T_{0}$ only contains a single node that is labeled with $(X_{0}, 1)$ .
(2): If a rule $X \to Z^{d}$ is introduced in round $i + 1$ for some $d \geq 2$ , then we update $T_{i}$ to $T_{i + 1}$ by adding two children to the unique leaf that is labeled with $(Z, k)$ for some k. The new left child is labeled with $(Z, k + 1)$ and the new right child is labeled with $(X, k)$ as depicted on the left of Figure 1.

Figure 1. On the left, the general pattern that is applied during the construction of $T_{i}$ is illustrated, where the split refers to a rule $X \to Z^{d}$ that has been introduced by the global algorithm. On the right, the tree $T_{3}$ that corresponds to the introduced rules of Example 3 is shown.
(3): If otherwise a nonunary nonterminal is introduced in round $i + 1$ , then $T_{i + 1} = T_{i}$ , i.e., nonunary nonterminals are ignored.

The initial tree

T_{0}

reflects that the only unary nonterminal of

A_{0}

is

X_{0}

and

#_{0} (X_{0}) = 1

. If the tree is modified according to point (2) of the definition, this refers to Lemma 5, where

#_{i + 1} (X) \leq #_{i} (Z)

and

#_{i + 1} (Z) \leq #_{i} (Z) + 1

is shown when a rule

X \to Z^{d}

for

d \geq 2

is introduced.

The level of a node is the length of the path from the node to the root. For a unary nonterminal

X \in N_{i}

, we denote by level_i

(X)

the level of the unique leaf of

T_{i}

that is labeled with

(X, k)

for some k.

Example 3.

Assume that the first three rules introduced by a global algorithm are

X_{2} \to X_{0}^{d_{1}}

in the first round,

X_{1} \to X_{0}^{d_{2}}

in the second round and

X_{3} \to X_{2}^{d_{3}}

in the third round for

d_{1}, d_{2}, d_{3} \geq 2

. The tree

T_{3}

that corresponds to this introduced rules is depicted on the right of Figure 1. The indices for the introduced nonterminals are chosen such that the ordering of the nonterminals in

N_{3} \ {S}

(see Lemma 2) is

X_{3} > X_{2} > X_{1}

, i.e., all right-hand sides of rules are contained in

X_{3}^{*} X_{2}^{*} X_{1}^{*} X_{0}^{*}

. The corresponding SLPs

A_{0}

,

A_{1}

,

A_{2}

and

A_{3}

are depicted next, where we simply use ∗ instead of the exact exponents of the symbols due to better readability.

\begin{array}{l} A_{0} : & S & \to X_{0}^{*} & A_{2} : & S & \to X_{2}^{*} X_{1}^{*} X_{0}^{*} & A_{3} : & S & \to X_{3}^{*} X_{2}^{*} X_{1}^{*} X_{0}^{*} \\ A_{1} : & S & \to X_{2}^{*} X_{0}^{*} & X_{2} & \to X_{1}^{*} X_{0}^{*} & X_{2} & \to X_{1}^{*} X_{0}^{*} \\ X_{2} & \to X_{0}^{*} & X_{1} & \to X_{0}^{*} & X_{1} & \to X_{0}^{*} \\ X_{3} & \to X_{2}^{*} \end{array}

Please note that in this example, we have

#_{3} (X_{0}) \leq 3

,

#_{3} (X_{2}) \leq 2

,

#_{3} (X_{1}) \leq 2

and

#_{3} (X_{3}) \leq 1

, which is exactly the information contained in the second components of the leaf labels in

T_{3}

. Furthermore, we havelevel₃

(X_{i}) = 2

for

i \in [0, 3]

in this example.

The following lemma is a direct consequence of the fact that the maximal k that occurs for some label

(X, k)

is incremented from one level to the next level (as described in Lemma 5).

Lemma 6.

For each node of

T_{i}

at level m that is labeled with

(X, k)

for some unary nonterminal

X \in N_{i}

, we have

k \leq m + 1

. Also, let

X \in N_{i}

be a unary nonterminal. Then we have

#_{i} (X) \leq {level}_{i} (X) + 1

.

So far, we provided information about the number of rules where a unary nonterminal occurs. Next, we move on to the total number of occurrences of a unary nonterminal on all right-hand sides. We denote by

t_{i} (X)

the total number of occurrences of X on right-hand sides of rules in

A_{i}

. We have

#_{i} (X) \leq t_{i} (X)

by the definition of both functions and for a nonunary nonterminal X, we have

#_{i} (X) = t_{i} (X)

due to Lemma 3.

Lemma 7.

Let

X \to γ

be the rule that is introduced in some round

i + 1

and let

M = {Y \in N_{i} ∣ Y occurs in γ}

. We have

(1): $t_{i + 1} (Y) \leq t_{i} (Y)$ for all $Y \in N_{i} = N_{i + 1} \ {X}$ , and
(2): $\sum_{Y \in M} t_{i + 1} (Y) + t_{i + 1} (X) \leq \sum_{Y \in M} t_{i} (Y)$ .

Proof.

Point (1) is straightforward: For

Y \in M

let

Y^{ℓ}

be the maximal Y-block that occurs as a factor in

γ

for some

ℓ \geq 1

. Replacing

γ

on right-hand sides yields that at least two occurrences of

Y^{ℓ}

are eliminated while only

Y^{ℓ}

is added as a part of the new rule

X \to γ

. If otherwise

Y \notin M

, then

t_{i + 1} (Y) = t_{i} (Y)

because the occurrences of Y are not affected by the new rule.

Point (2) is also based on a simple observation. Please note that

\sum_{Y \in M} t_{i} (Y)

describes the part of the SLP

A_{i}

that is affected by the replacement of

γ

in round

i + 1

, and

\sum_{Y \in M} t_{i + 1} (Y) + t_{i + 1} (X)

is the size of that part in

A_{i + 1}

after the occurrences of

γ

are replaced by X plus the new occurrences in the introduced rule. All other parts of

A_{i}

are not affected by the new rule. Now the properties of a maximal string ensure that

| A_{i + 1} | \leq | A_{i} |

. The extreme case where

γ

has length two and occurs only twice without overlap on the right-hand sides of rules in

A_{i}

satisfies

| A_{i + 1} | = | A_{i} |

. All other cases even satisfy

| A_{i + 1} | < | A_{i} |

. Point (2) directly follows. □

Our next goal is to bound

t_{i} (X)

depending on

{level}_{i} (X)

for a unary nonterminal X. To do so, we now apply arguments specific to Greedy. Recall that Greedy selects a maximal string that minimizes the size of the obtained SLP in each round.

Lemma 8.

Let

Z \in N_{i} \cup {X_{0}}

be a unary nonterminal and assume a rule

X \to Z^{d}

for some

d \geq 2

is introduced byGreedyin round

i + 1

. We have

t_{i + 1} (X) + t_{i + 1} (Z) \leq 2 \sqrt{t_{i} (Z)} \sqrt{#_{i} (Z) + 1} + 1 .

Proof.

If a unary nonterminal X and a rule

X \to Z^{d}

with

d \geq 2

are introduced in round

i + 1

, then the choice of d only depends on the maximal Z-blocks occurring on all right-hand sides of rules in

A_{i}

since the remaining part of

A_{i}

does not change. Assume that

#_{i} (Z) = k

and let

ℓ_{1}, \dots, ℓ_{k}

be the lengths of the maximal Z-blocks occurring on right-hand sides of

A_{i}

, i.e.,

\sum_{j = 1}^{k} ℓ_{j} = t_{i} (Z)

. Then Greedy minimizes

t_{i + 1} (X) + t_{i + 1} (Z) = d + \sum_{j = 1}^{k} (ℓ_{j} div d) + (ℓ_{j} mod d)

, where d is the size of the new rule

X \to Z^{d}

and for each

j \in [1, k]

a maximal block

Z^{ℓ_{j}}

on the right-hand side of a rule in

A_{i}

is transformed into

X^{ℓ_{j} div d} Z^{ℓ_{j} mod d}

. Due to the greedy nature of the algorithm, the following equation holds for all

d \geq 1

:

\begin{matrix} t_{i + 1} (X) + t_{i + 1} (Z) & \leq d + \sum_{j = 1}^{k} (ℓ_{j} div d) + (ℓ_{j} mod d) \\ \leq d + \sum_{j = 1}^{k} \frac{ℓ_{j}}{d} + k (d - 1) \\ = d + \frac{t_{i} (Z)}{d} + k (d - 1) . \end{matrix}

Please note that the chosen maximal string has length at least 2, but the upper bound also holds for

d = 1

since in this case we have

t_{i + 1} (X) + t_{i + 1} (Z) \leq t_{i} (Z)

due to Lemma 7 (point (2)). If we apply

d = ⌈\sqrt{t_{i} (Z)} / \sqrt{k + 1}⌉

, we get

\begin{matrix} t_{i + 1} (X) + t_{i + 1} (Z) & \leq ⌈\frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}}⌉ + \frac{t_{i} (Z)}{⌈\frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}}⌉} + k (⌈\frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}}⌉ - 1) \\ \leq \frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}} + 1 + \frac{t_{i} (Z)}{\frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}}} + k (\frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}}) \\ = (k + 1) \frac{\sqrt{t_{i} (Z)}}{\sqrt{k + 1}} + \frac{t_{i} (Z) \cdot \sqrt{k + 1}}{\sqrt{t_{i} (Z)}} + 1 \\ = 2 \sqrt{t_{i} (Z)} \sqrt{k + 1} + 1 . \end{matrix}

Together with

k = #_{i} (Z)

this proves the lemma. □

The following lemma is essential for the proof of Theorem 1 since we bound the total number of occurrences of a unary nonterminal depending on its level.

Lemma 9.

Let

X \in N_{i} \cup {X_{0}}

be a unary nonterminal with

{level}_{i} (X) = m

. We have

t_{i} (X) \leq 2^{2 - 2^{1 - m}} n^{2^{- m}} \prod_{j = 1}^{m} {(m + 2 - j)}^{2^{- j}} .

(1)

Proof.

We prove the lemma by induction on

m = {level}_{i} (X)

and we start with

m = 0

. The only SLP

A_{i}

that contains a unary nonterminal X such that

{level}_{i} (X) = 0

is the initial SLP

A_{0}

and the unary nonterminal is

X = X_{0}

. Please note that the maximal string

γ

chosen by any global algorithm in the first round on input

X_{0}^{n}

trivially satisfies

γ \in X_{0}^{*}

and thus the two unary nonterminals of

A_{1}

have level one. We have

t_{0} (X_{0}) = n

and this is exactly what we obtain when

m = 0

is used on the right side of Equation (1) (the empty product is considered to be 1).

Now assume any unary nonterminal that has level m satisfies the claimed bound and we consider a unary nonterminal X such that

{level}_{i} (X) = m + 1 > 0

for some i. It follows from the definition that there is a leaf node at level

m + 1

in

T_{i}

that is labeled with

(X, k)

for some k. There are two cases that need to be distinguished. Either this leaf is a left child or a right child of its parent node. Assume that

(X, k)

is the label of a right child and let

(Z, k + 1)

be the label of the left sibling of that node. To prove both cases simultaneously, we prove the upper bound for X and for Z, i.e., we use Z to cover the second case where the node is a left child. The parent node of

(Z, k + 1)

and

(X, k)

is labeled with

(Z, k)

(see Figure 1 on the left). Let

i^{'} < i

be the maximal

i^{'}

such that

{level}_{i^{'}} (Z) = m

, i.e.,

X \to Z^{d}

for

d \geq 2

is the introduced rule in round

i^{'} + 1

and

(Z, k)

is the label of a leaf at level m in

T_{i^{'}}

. By induction, we have

t_{i^{'}} (Z) \leq 2^{2 - 2^{1 - m}} n^{2^{- m}} \prod_{j = 1}^{m} {(m + 2 - j)}^{2^{- j}}

. Now by Lemma 8, we have

t_{i^{'} + 1} (X) + t_{i^{'} + 1} (Z) \leq 2 \sqrt{t_{i^{'}} (Z)} \sqrt{#_{i^{'}} (Z) + 1} + 1 .

Together with

#_{i^{'}} (Z) \leq m + 1

(Lemma 6), this yields

\begin{matrix} t_{i^{'} + 1} (X) + t_{i^{'} + 1} (Z) & \leq 2 {(2^{2 - 2^{1 - m}} n^{2^{- m}} \prod_{j = 1}^{m} {(m + 2 - j)}^{2^{- j}})}^{\frac{1}{2}} {(m + 2)}^{\frac{1}{2}} + 1 \\ = 2^{2 - 2^{- m}} n^{2^{- m - 1}} \prod_{j = 1}^{m} {(m + 2 - j)}^{2^{- j - 1}} {(m + 2)}^{2^{- 1}} + 1 \\ = 2^{2 - 2^{- m}} n^{2^{- m - 1}} \prod_{j = 2}^{m + 1} {(m + 2 - j + 1)}^{2^{- j}} {(m + 2)}^{2^{- 1}} + 1 \\ = 2^{2 - 2^{- m}} n^{2^{- m - 1}} \prod_{j = 1}^{m + 1} {(m + 2 - j + 1)}^{2^{- j}} + 1 . \end{matrix}

Using the fact that

t_{i^{'} + 1} (X) \geq 2

(there are at least two nonoverlapping occurrences of a maximal string) and

t_{i^{'} + 1} (Z) \geq 2

(the new rule contains Z at least twice) yields the claimed upper bound on

t_{i^{'} + 1} (X)

and

t_{i^{'} + 1} (Z)

. Finally, this upper bound holds for

i \geq i^{'} + 1

due to Lemma 7 (point (1)). □

Corollary 2.

Let

X \in N_{i} \cup {X_{0}}

be a unary nonterminal with

{level}_{i} (X) = m

. We have

t_{i} (X) \leq 4 n^{2^{- m}} (m + 2) .

Proof.

Please note that

2^{2 - 2^{- m}} \leq 4

for all

m \geq 0

. We upper-bound the right side of Equation (1) (Lemma 9) as follows:

\begin{matrix} 2^{2 - 2^{1 - m}} n^{2^{- m}} \prod_{j = 1}^{m} {(m + 2 - j)}^{2^{- j}} & \leq 4 n^{2^{- m}} \prod_{j = 1}^{m} {(m + 2)}^{2^{- j}} \\ = 4 n^{2^{- m}} {(m + 2)}^{\sum_{j = 1}^{m} 2^{- j}} \\ \leq 4 n^{2^{- m}} (m + 2) \end{matrix}

□

What we achieved so far is to bound the total size

t_{i} (X)

that a unary nonterminal X contributes on right-hand sides of the rules depending on

{level}_{i} (X)

. Next, we bound the size that nonunary nonterminals contribute to

| A_{i} |

depending on the levels of all unary nonterminals. To do so, we need the following definitions. Let

R_{i} (X)

be the number of distinct right neighbors of X (which are not equal to X) on right-hand sides plus the number of occurrences of X as the last symbol of a right-hand side in

A_{i}

, i.e.,

\begin{matrix} R_{i} (X) = & | {A \in N_{i} \cup {X_{0}} ∣ A \neq X, X A occurs on a right - hand side in A_{i}} | \\ + | {(Y \to v X) \in P_{i} ∣ Y \in N_{i}, v \in {(N_{i} \cup {X_{0}})}^{*}} | . \end{matrix}

Let

L_{i} (X)

be the number of distinct left neighbors of X (which are not equal to X) on right-hand sides plus the number of occurrences of X as the first symbol of a right-hand side in

A_{i}

, i.e.,

\begin{matrix} L_{i} (X) = & | {A \in N_{i} \cup {X_{0}} ∣ A \neq X, A X occurs on a right - hand side in A_{i}} | \\ + | {(Y \to X v) \in P_{i} ∣ Y \in N_{i}, v \in {(N_{i} \cup {X_{0}})}^{*}} | . \end{matrix}

Also, let

f_{i} (X) = #_{i} (X) - R_{i} (X)

and

g_{i} (X) = #_{i} (X) - L_{i} (X)

. Please note that

R_{i} (X) \leq #_{i} (X)

and

L_{i} (X) \leq #_{i} (X)

since for each right-hand side of a rule there is at most one right (respectively, left) neighbor

A \neq X

for some occurrence of X due to Lemma 2 and each right-hand side can contain X at most once as the last (respectively, first) symbol. Also,

#_{i} (X) = R_{i} (X)

means that all maximal X-blocks on right-hand sides are either at the end of the right-hand side or are followed by a distinct symbol. Similarly,

#_{i} (X) = L_{i} (X)

means that all maximal X-blocks on right-hand sides are either at the beginning of the right-hand side or are preceded by a distinct symbol. The following lemmas describe how the functions

f_{i} (X)

and

g_{i} (X)

evolve.

Lemma 10.

If

X \in N_{i} \cup {X_{0}}

then

f_{i + 1} (X) \leq f_{i} (X)

. If a nonunary, maximal string

γ = X v

is selected in round

i + 1

for some

v \in {(N_{i} \cup {X_{0}})}^{+}

, then

f_{i + 1} (X) < f_{i} (X)

.

Proof.

Let

Y \to γ

be the introduced rule in round

i + 1

. If X does not occur in

γ

, then it is straightforward to see that

f_{i + 1} (X) \leq f_{i} (X)

since

#_{i + 1} (X) = #_{i} (X)

and

R_{i + 1} (X) \geq R_{i} (X)

. The new nonterminal Y could be a new right neighbor for some occurrences of X, but all occurrences of X which have this new right neighbor Y in

A_{i + 1}

shared the same right neighbor in

A_{i}

(the first symbol of

γ

).

If otherwise X occurs in

γ

, then first assume that

γ = X^{d}

for some

d \geq 2

. Please note that replacing an occurrence of

γ

on the right-hand side of a rule

(Z \to u) \in P_{i}

either removes all occurrences of X on this right-hand side (in case u contains a maximal X-block of length

k \cdot d

for some integer

k \geq 1

) or the right neighbor of the maximal X-block in u does not change in the modified rule

(Z \to u^{'}) \in P_{i + 1}

since occurrences of

γ

are replaced from left to right. It follows that the only way to obtain

R_{i + 1} (X) < R_{i} (X)

is to remove all occurrences of X on a right-hand side, but then

#_{i} (X)

decreases by the same value. Additionally, the new rule

Y \to X^{d}

adds a new right-hand side to

#_{i + 1} (X)

, but since X is the last symbol on this right-hand side it follows that

R_{i + 1} (X)

is incremented as well. Together this yields

f_{i + 1} (X) \leq f_{i} (X)

in this case.

The case remains where

γ

is nonunary and X occurs in

γ

. Here, we have

#_{i + 1} (X) \leq #_{i} (X)

due to Lemma 4 and

γ

occurs at most once on each right-hand side due to Lemma 2. However, again, the only way to reduce

R_{i + 1} (X)

compared to

R_{i} (X)

is to remove all occurrences of X on a right-hand side, but then again

#_{i + 1} (X)

decreases by the same value. This yields

f_{i + 1} (X) \leq f_{i} (X)

.

Assume now that a nonunary, maximal string

γ = X v

for some

v \in {(N_{i} \cup {X_{0}})}^{+}

is selected in round

i + 1

. We show that

f_{i + 1} (X) < f_{i} (X)

. If X only occurs in the new rule in

A_{i + 1}

after occurrences of

γ

are replaced on all right-hand sides in

A_{i}

, i.e., all modified rules do not contain X anymore, then

#_{i + 1} (X) < #_{i} (X)

because at least two right-hand sides do not contain X as a factor anymore while only the new rule

Y \to γ

adds a new right-hand side which contains X to

#_{i + 1} (X)

. Also, we have

R_{i + 1} (X) = R_{i} (X)

in this case and thus

f_{i + 1} (X) < f_{i} (X)

because all rules where the maximal X-block is removed shared the same right neighbor due to the fact that

γ = X v

is the chosen maximal string. However,

X v

still occurs in the new rule of

A_{i + 1}

and thus

R_{i + 1} (X) = R_{i} (X)

. If otherwise at least one of the modified rules still contains X on the right-hand side, then

X Y

is a factor of this right-hand side in

A_{i + 1}

after the replacement of

γ

. It follows that

R_{i + 1} (X) > R_{i} (X)

in this case and thus

f_{i + 1} (X) < f_{i} (X)

because each distinct right neighbor of X in

A_{i}

is still a right neighbor of X in

A_{i + 1}

as argued above, but additionally

X Y

is new since Y is a new nonterminal. □

The same result does not hold for

g_{i} (X)

. In particular,

g_{i + 1} (X) > g_{i} (X)

is possible when a rule

Y \to X^{d}

is introduced in round

i + 1

for some

d \geq 2

due to the assumption that global algorithms replace occurrences of the maximal string from left to right. For example, assume that

A X^{4}

,

B X^{7}

and

C X^{10}

are the maximal X-blocks on right-hand sides of

A_{i}

including distinct left neighbors for each X-block (A, B and C). Therefore, we have

#_{i} (X) = 3

,

L_{i} (X) = 3

and thus

g_{i} (X) = 0

in this example. If now a rule

Y \to X^{3}

is introduced, then this yields

A Y X

,

B Y^{2} X

and

C Y^{3} X

after replacing

X^{3}

. Hence we have

#_{i + 1} (X) = 4

,

L_{i + 1} (X) = 2

and thus

g_{i + 1} (X) = 2

. We show in the following lemma that this is the only case where

g_{i + 1} (X) > g_{i} (X)

occurs.

Lemma 11.

Let

X \in N_{i} \cup {X_{0}}

. If a rule

Y \to γ

is introduced in round

i + 1

such that

γ \notin X^{+}

, then

g_{i + 1} (X) \leq g_{i} (X)

. If a nonunary, maximal string

γ = X v

is selected in round

i + 1

for some

v \in {(N_{i} \cup {X_{0}})}^{+}

, then

g_{i + 1} (X) < g_{i} (X)

Proof.

The arguments are similar to the corresponding cases in Lemma 10. Let

Y \to γ

be the introduced rule in round

i + 1

. If X does not occur in

γ

, then

g_{i + 1} (X) \leq g_{i} (X)

since

#_{i + 1} (X) = #_{i} (X)

and

L_{i + 1} (X) \geq L_{i} (X)

. The new nonterminal Y could be a new left neighbor for some occurrences of X in

A_{i + 1}

, but all these occurrences of X shared the same left neighbor in

A_{i}

(the last symbol of

γ

).

If otherwise

γ

is nonunary and contains X, then we have

#_{i + 1} (X) \leq #_{i} (X)

due to Lemma 4 and

γ

occurs at most once on each right-hand side due to Lemma 2. The only way to obtain

L_{i + 1} (X) < L_{i} (X)

is again to remove all occurrences of X on a right-hand side, but then

#_{i + 1} (X)

decreases by the same value. Please note that due to the assumption that

γ

is nonunary, it is not possible to modify two (or more) rules such that the maximal X-blocks have different left neighbors in

A_{i}

and after the replacement these X-blocks share the same left neighbor in

A_{i + 1}

. This yields

g_{i + 1} (X) \leq g_{i} (X)

.

Assume now that a nonunary, maximal string

γ = v X

is selected in round

i + 1

for some word

v \in {(N_{i} \cup {X_{0}})}^{+}

. We show that

g_{i + 1} (X) < g_{i} (X)

. If X only occurs in the new rule in

A_{i + 1}

, i.e., X does not occur in the modified rules, then we have

#_{i + 1} (X) < #_{i} (X)

because at least two right-hand sides do not contain X as a factor anymore while only the new rule

Y \to γ

adds a new right-hand side which contains X to

#_{i + 1} (X)

. Moreover, we have

L_{i + 1} (X) = L_{i} (X)

in this case because all rules where the maximal X-block is removed shared the same left neighbor since

γ = v X

is the selected nonunary, maximal string. However,

v X

still occurs on the right-hand side of the new rule of

A_{i + 1}

. It follows that

g_{i + 1} (X) < g_{i} (X)

. If otherwise at least one of the modified rules still contains X on the right-hand side, then

Y X

is a factor of this right-hand side in

A_{i + 1}

after the replacement of

γ

. It follows that

L_{i + 1} (X) > L_{i} (X)

and thus

g_{i + 1} (X) < g_{i} (X)

because each distinct left neighbor of X in

A_{i}

is still a left neighbor of X in

A_{i + 1}

for some occurrence of X. Additionally, Y is a new left neighbor. □

In the following proof, we use the notation

U_{i} = {X \in N_{i} \cup {X_{0}} ∣ X is a unary nonterminal}

for all unary nonterminals that appear in

A_{i}

and

M_{i} = N_{i} \ U_{i}

for all nonunary nonterminals that appear in

A_{i}

.

Lemma 12.

We have

\sum_{X \in M_{i}} t_{i} (X) \leq \sum_{X \in U_{i}} ({level}_{i} (X) + 1) \cdot ({level}_{i} (X) + 1 + {({level}_{i} (X) + 1)}^{2}) .

Proof.

Let

s (i) = \sum_{X \in M_{i}} t_{i} (X)

be the total size that all nonunary nonterminals contribute to the size of

A_{i}

. We first bound the number of rounds where the function s increases, i.e., we bound

| {j \in [0, i - 1] ∣ s (j + 1) > s (j)} |

. If a unary nonterminal is introduced in some round

j + 1

, then

s (j + 1) = s (j)

, i.e., we can ignore these rules. So, consider some round

j + 1

where a nonunary nonterminal X is introduced and let

X \to γ

be the introduced rule. Let

M = {Z \in N_{j} ∣ Z occurs in γ}

be the set of nonterminals that occur at least once in

γ

. We first show that if

k : = | M_{j} \cap M | \geq 2

, then

s (j + 1) \leq s (j)

. In other words, if two nonunary nonterminals occur in

γ

, then

s (j + 1) \leq s (j)

. Let r be the number of rules

(Z \to v) \in P_{j}

such that

γ

is a factor of v. Recall that nonunary factors and nonterminals occur at most once on the right-hand side of a single rule (Lemma 3). We have

s (j + 1) - s (j) = k + r - k \cdot r

because the new nonunary nonterminal X occurs now on r right-hand sides,

γ

contains k nonunary nonterminals which occur exactly once in

γ

each, and the replacement of

γ

on right-hand sides deletes these k nonterminals on r right-hand sides. We have

r \geq 2

(due to the properties of a maximal string) which together with

k \geq 2

yields

s (j + 1) - s (j) \leq 0

. Hence we can assume that

k = | M_{j} \cap M | \leq 1

. The maximal string

γ

has length

| γ | \geq 2

and is not unary, so the first and the last symbol of

γ

are different and at least one of them is unary due to our assumption that at most one nonunary nonterminal occurs in

γ

. Let Y be this unary nonterminal and assume that Y is the first symbol, i.e., the nonunary, maximal string is

γ = Y v

for some (nonempty) v. Afterwards we discuss the case where Y is the last symbol, i.e.,

γ = v Y

.

We bound the number of rounds where a nonunary, maximal string

γ = Y v

is selected for some v. Let

j_{0} \leq i

be the round where the unary nonterminal Y has been introduced. We have

f_{j_{0}} (Y) \leq #_{j_{0}} (Y) \leq {level}_{j_{0}} (Y) + 1 \leq {level}_{i} (Y) + 1

due to Lemma 6 and

{level}_{j} (Y) \leq {level}_{i} (Y)

for all

j \leq i

. We also have

f_{j + 1} (Y) \leq f_{j} (Y)

for

j \in [j_{0}, i - 1]

by Lemma 10. In addition to that, if

Y v

is the selected nonunary, maximal string in round

j + 1

for some v, then we have

f_{j + 1} (Y) < f_{j} (Y)

again by Lemma 10. It follows that after at most

{level}_{i} (Y) + 1

many rounds where the chosen maximal string is nonunary and has the form

Y v

for some (nonempty) v, we have

f_{i} (Y) = 0

. In this case, all maximal Y-blocks have distinct right neighbors or occur at the end of a right-hand side. Hence there is no possibility to select a nonunary, maximal string

Y v

anymore.

Now we similarly bound the number of rounds such that a nonunary, maximal string

γ = v Y

is selected for some v. However, care must be taken in this case, because it is possible that

g_{j + 1} (Y) > g_{j} (Y)

when a rule

X^{'} \to Y^{d}

for

d \geq 2

is introduced in round

j + 1

as explained above. Fortunately, rules of this form (the selected maximal string is from

Y^{+}

) are introduced at most

{level}_{i} (Y)

many times up to round i by the definition of

{level}_{i} (Y)

. Let

j_{0} \leq i

be the round where the unary nonterminal Y has been introduced. We have

g_{j} (Y) \leq #_{j} (Y) \leq {level}_{j} (Y) + 1 \leq {level}_{i} (Y) + 1

for each

j \in [j_{0}, i]

due to Lemma 6 and

{level}_{j} (Y) \leq {level}_{i} (Y)

for all

j \leq i

. Furthermore, if the selected maximal string in round

j + 1

(

j \geq j_{0}

) is not from

Y^{+}

, we have

g_{j + 1} (Y) \leq g_{j} (Y)

due to Lemma 11. Moreover, if the maximal string

γ

is nonunary and

γ = v Y

for some (nonempty) v, then

g_{j + 1} (Y) < g_{j} (Y)

. It follows that between two rounds where maximal strings from

Y^{+}

are selected, there are at most

{level}_{i} (Y) + 1

many rounds where a nonunary, maximal string of the form

v Y

is chosen because then

g_{j} (Y) = 0

is reached (for some j) and thus all maximal Y blocks have distinct left neighbors or occur at the beginning of a right-hand side. Hence no nonunary string of the form

v Y

for some v occurs twice on right-hand sides. Since maximal strings from

Y^{+}

are chosen at most

{level}_{i} (Y)

many times up to round i, it follows that the number of rounds where a nonunary, maximal string of the form

v Y

for some v is selected is at most

{({level}_{i} (Y) + 1)}^{2}

.

Furthermore, the maximal increase

max {s (j + 1) - s (j) ∣ j \in [0, i - 1]}

in a single round is at most

{level}_{i} (Y) + 1

, because the new nonunary nonterminal occurs in

A_{j + 1}

on at most

#_{j} (Y) \leq {level}_{j} (Y) + 1 \leq {level}_{i} (Y) + 1

many right-hand sides of rules for any

j \leq i

and the total number of occurrences of all other (nonunary) nonterminals does not increase (Lemma 7, point (1)).

We conclude that for each unary nonterminal Y, at most

{level}_{i} (Y) + 1 + {({level}_{i} (Y) + 1)}^{2}

many rules are introduced such that the nonunary, maximal string

γ

satisfies

γ = Y v

or

γ = v Y

for some v and each of those rules increases the total size that nonunary nonterminals contribute by at most

{level}_{i} (Y) + 1

. In all other cases, we showed that the size that nonunary nonterminals contribute does not increase. □

Now we are able the prove Proposition 1.

Proof of Proposition 1.

Let

A_{f} = Greedy (X_{0}^{n})

be the final SLP obtained by Greedy, i.e., after f rounds the algorithm stops because

A_{f}

has no maximal string. First, we want to bound the level of unary nonterminals occurring in

A_{f}

. Assume there is a unary nonterminal X such that

{level}_{i} (X) = ⌈ log log n ⌉

after some round

i \leq f

of the algorithm. By Corollary 2, we have

t_{i} (X) \leq 4 n^{2^{- log log n}} (log log n + 3) \leq 8 (log log n + 3) .

Consider the unique leaf node

v_{X}

in the tree

T_{i}

which has level

⌈ log log n ⌉

and label

(X, k)

for some k. If in some round

j \in [i + 1, f]

two children with labels

(X, k + 1)

and

(Y, k)

are attached to

v_{X}

, i.e., the introduced rule in round j is

Y \to X^{d}

for some

d \geq 2

, then we have

t_{j} (X) + t_{j} (Y) \leq t_{j - 1} (X) \leq t_{i} (X)

by Lemma 7. To be more specific, if the length of the chosen maximal string

X^{d}

is exactly

d = 2

and this maximal string

X X

occurs exactly twice without overlap in

A_{j - 1}

, then we have

t_{j} (X) + t_{j} (Y) = t_{j - 1} (X)

(and

| A_{j} | = | A_{j - 1} |

). Please note that in this case, there does not exist a maximal string

X^{d}

or

Y^{d}

of

A_{k}

for all

k \in [j, f]

(since Y occurs only twice and

X X

does not occur on right-hand sides of

A_{j}

), i.e., the children of the node

v_{X}

in

T_{k}

are leaves for

k \in [j, f]

. Otherwise, if the maximal string has length

d \geq 3

or occurs at least three times without overlap, then we have

t_{j} (X) + t_{j} (Y) < t_{j - 1} (X)

since

| A_{j} | < | A_{j - 1} |

holds in this setting. This means that when a new branch occurs in the tree

T_{j}

for some j, then the new children of the branching node are either leaves of the final tree

T_{f}

or the corresponding nonterminals contribute strictly less to the size of the current SLP than the nonterminal which corresponds to the parent node did before the branch. We can iterate this argument for the children of the children of

v_{X}

and so on, i.e., if we consider the subtree rooted at

v_{X}

in

T_{f}

, then from level to level the size that the nonterminals contribute decreases until only leaves occur at some level. Since

t_{i} (X) \leq 8 (log log n + 3)

, it follows that the subtree of

T_{f}

rooted at

v_{X}

has depth at most

8 (log log n + 3) + 1

and thus the maximal level of any unary nonterminal in

A_{f}

is bounded by

⌈ log log n ⌉ + 8 (log log n + 3) + 1 \leq 9 log log n + 26 .

Consequently, the number of unary nonterminals (the number of leaves of

T_{f}

) is bounded by

O ({(log n)}^{9})

since

T_{f}

is a binary tree of depth at most

9 log log n + 26

. Furthermore, each unary nonterminal X in

A_{f}

satisfies

t_{f} (X) \leq O (log log n)

. If there is a round

i \leq f

such that

O_{i} (X) = ⌈ log log n ⌉

we obtain

t_{f} (X) \leq t_{i} (X) \leq 8 (log log n + 3)

by Corollary 2 and Lemma 7, point (1). Otherwise, let

m = {level}_{f} (X) < ⌈ log log n ⌉

. Then

t_{f} (X) \leq m + 3 \leq log log n + 3

, because there is at most one nonoverlapping occurrence of

X X

on right-hand sides of

A_{f}

(otherwise there would exist a maximal string of

A_{f}

) and the number of rules where X occurs on the right-hand side is

#_{f} (X) \leq m + 1

by Lemma 6. To be more precise, a single right-hand side of

A_{f}

could have a maximal X-block of length 3 and all other right-hand sides must have at most one occurrence of X since two different right-hand sides where X-blocks of length 2 occur as well as one right-hand side where an X-block of length 4 occurs would contradict the fact that

A_{f}

has no maximal string. It follows that the size which unary nonterminals contribute to

A_{f}

is

O ({(log n)}^{9} \cdot log log n)

. By Lemma 12, we can bound the size that nonunary nonterminals contribute by

O ({(log n)}^{9} \cdot {(log log n)}^{3})

since there are at most

O ({(log n)}^{9})

many unary nonterminals and each has level at most

O (log log n)

as argued above. It follows that

| A_{f} | \leq O ({(log n)}^{9} \cdot {(log log n)}^{3})

, which proves the proposition. □

The following theorem follows directly from Proposition 1 and Lemma 1, where

g (w) \geq Ω (log n)

is shown for words w of length n.

Theorem 1.

For all n, we have

α_{Greedy} (1, n) \leq O ({(log n)}^{8} \cdot {(log log n)}^{3}) .

4. Lower Bound for `Greedy`

We proceed with the lower bound on the approximation ratio of Greedy. The best lower bound [13] (Theorem 11) that was known previously was

α_{Greedy} (k, n) \geq \frac{5}{3 {log}_{3} (5)} = 1.13767699 \dots

for all

k \geq 1

and infinitely many n. This bound was shown using unary words, which we will also use in our proof for the improved lower bound. A key concept for doing this is the sequence

x_{n}

described in the following lemma by [19]:

Lemma 13

([19] (Example 2.2)). Let

x_{n + 1} = x_{n}^{2} + 1

with

x_{0} = 1

and

β = exp (\sum_{i = 1}^{\infty} \frac{1}{2^{i}} log (1 + \frac{1}{x_{i}^{2}})) .

We have

x_{n} = ⌊β^{2^{n}}⌋

.

In this work, we use the shifted sequence

y_{n} = x_{n + 1}

, i.e., we start with

y_{0} = 2

. It follows that

y_{n} = ⌊γ^{2^{n}}⌋

, where

γ = β^{2} = 2.25851845 \dots

. Additionally, we need the following lemma:

Lemma 14.

Let

m \geq 1

be an integer. Let

f_{m} : R_{> 0} \to R

with

f_{m} (x) = x + \frac{m^{2} + 1}{x} .

We have

f_{m} (x) > 2 m

for all

x > 0

.

Proof.

The unique minimum of

f_{m} (x)

is

2 \sqrt{m^{2} + 1}

for

x = \sqrt{m^{2} + 1}

. It follows that

f_{m} (x) \geq 2 \sqrt{m^{2} + 1} > 2 \sqrt{m^{2}} = 2 m

. □

Now we can prove the new lower bound for Greedy:

Theorem 2.

For all

k \geq 1

and infinitely many n, we have

α_{Greedy} (k, n) \geq \frac{1}{{log}_{3} (γ)} = 1.34847194 \dots .

Proof.

Let

Σ = {a}

be a unary alphabet. We define

w_{k} = a^{y_{k}}

. By Lemma 13, we have

| w_{k} | \leq γ^{2^{k}}

. Applying Lemma 1 yields

g (w_{k}) \leq 3 \cdot {log}_{3} (γ) \cdot 2^{k} + o (2^{k}) .

In the remaining proof we show that on input

w_{k}

, Greedy produces an SLP of size

3 \cdot 2^{k} - 1

, which directly implies

α_{Greedy} (1, n) \geq 3 / (3 {log}_{3} (γ))

. We start with the SLP

A_{0}

which has the single rule

S \to a^{y_{k}}

. Consider now the first round of the algorithm, i.e., we need to find a maximal string

a^{x}

of

A_{0}

such that the grammar

A_{1}

with rules

X_{1} \to a^{x}, S \to X_{1}^{y_{k} div x} a^{y_{k} mod x}

has minimal size. We have

| A_{1} | = x + (y_{k} div x) + (y_{k} mod x) \geq x + y_{k} / x

. By the definition of

y_{k}

we have

| A_{1} | \geq x + (y_{k - 1}^{2} + 1) / x

. Applying Lemma 14 yields

| A_{1} | \geq 2 y_{k - 1} + 1

. Please note that for

x = y_{k - 1}

this minimum is achieved, i.e., we can assume that Greedy selects the maximal string

a^{y_{k - 1}}

and

A_{1}

is

X_{1} \to a^{y_{k - 1}}, S \to X_{1}^{y_{k - 1}} a .

Each maximal string of

A_{1}

is either a unary word over X or a unary word over a, i.e., we can analyze the behavior of Greedy on both rules independently. The rule

X_{1} \to a^{y_{k - 1}}

is obviously treated similarly as the initial SLP

A_{0}

, so we continue with analyzing

S \to X_{1}^{y_{k - 1}} a

. However, again, the same arguments as above show that Greedy introduces a rule

X_{3} \to X_{1}^{y_{k - 2}}

which yields

S \to X_{3}^{y_{k - 2}} X_{1} a

as the new start rule. This process can be iterated using the same arguments for the leading unary strings of length

y_{i}

for some

i \in [1, k]

.

The reader might think of this process as a binary tree, where each node is labeled with a rule (the root is labeled with

S \to a^{y_{k}}

) and the children of a node are the two rules obtained by Greedy when the rule has been processed. We assume that the left child represents the rule for the chosen maximal string and the right child represents the parent rule where all occurrences of the maximal string are replaced by the new nonterminal. In Figure 2 this binary tree is depicted for the steps we discussed above.

Figure 2. Three rounds of Greedy on input

a^{y_{k}}

.

Please note that when a rule is processed, the longest common factor of the two new rules has length 1 (the remainder). More generally, after each round there is no word of length at least two that occurs as a factor in two different rules, since a possibly shared remainder has length 1 and otherwise only new nonterminals are introduced. It follows that we can iterate this process independently for each rule until no maximal string occurs. This is the case when each rule starts with a unary string of length

y_{0} = 2

or, in terms of the interpretation as a binary tree, when a full binary tree of height k is produced. Each right branch occurring in this tree adds a new remainder to those remainders that already occur in the parent rule and a left branch introduces a new (smaller) instance of the start problem. We show by induction that at level

i \in [0, k]

of this full binary tree of height k, there is one rule of size

y_{k - i} + i

and

2^{i - j - 1}

many rules of size

y_{k - i} + j

for

j \in [0, i - 1]

. At level 0, this is true since there is only a single rule of size

y_{k} + 0

. Assuming that our claim is true at level

i < k

, we derive from each rule at level i two new rules at level

i + 1

: A right branch yields a rule that starts with a leading unary string of size

y_{k - i - 1}

and adds a new remainder to the parent rule. A left branch yields a rule that contains only a unary string of size

y_{k - i - 1}

. If we first consider the left branches, we derive that each of the

2^{i}

many rules at level i adds a rule of size

y_{k - i - 1}

at level

i + 1

. For the right branches, the single rule of size

y_{k - i} + i

at level i yields a rule of size

y_{k - i - 1} + i + 1

at level

i + 1

. Furthermore, each of the

2^{i - j - 1}

many rules of size

y_{k - i} + j

(

j \in [0, i - 1]

) yields a rule of size

y_{k - i - 1} + j + 1

. When we put everything together, we get that at level

i + 1

there is a single rule of size

y_{k - i - 1} + i + 1

and

2^{i - j}

many rules of size

y_{k - i - 1} + j

for

j \in [0, i]

. That finishes the induction. It follows that the final SLP (which consists of the rules at level k) has a single rule of size

y_{0} + k = 2 + k

and

2^{k - j - 1}

many rules of size

2 + j

for

j = 0, \dots, k - 1

. This gives a total size of

\begin{matrix} 2 + k + \sum_{j = 0}^{k - 1} 2^{k - j - 1} (2 + j) & = 2 + k + 2^{k} \sum_{j = 0}^{k - 1} 2^{- j} + 2^{k} \sum_{j = 0}^{k - 1} 2^{- j - 1} j \\ = 2 + k + 2^{k} (2 - 2^{- k + 1}) + 2^{k} (- 2^{- k} k - 2^{- k} + 1) \\ = 2 + k + 2^{k + 1} - 2 - k - 1 + 2^{k} \\ = 2^{k + 1} + 2^{k} - 1 \\ = 3 \cdot 2^{k} - 1 . \end{matrix}

□

5. `RePair` and `LongestMatch`

In this section, we analyze the global grammar-based compressors RePair and LongestMatch. In each round i, RePair selects a most frequent maximal string of

A_{i - 1}

and LongestMatch selects a longest maximal string of

A_{i - 1}

.

We will abbreviate the approximation ratio

α_{LongestMatch}

by

α_{LM}

for better readability. We will first show that RePair and LongestMatch produce SLPs of equal size for unary inputs

a^{n}

and we show what the exact size of these SLPs depending on n is. We then use this information to obtain our result for

α_{RePair} (1, n)

and

α_{LM} (1, n)

, respectively. Fix an integer

n \geq 2

and consider the binary representation

n = \sum_{i = 0}^{⌊ {log}_{2} n ⌋} b_{i} \cdot 2^{i}

(2)

of n, where

b_{i} \in {0, 1}

for

i \in [0, ⌊ {log}_{2} n ⌋]

. We denote by

ν (n)

the number of 1’s in the binary representation of n, i.e.,

ν (n) = \sum_{i = 0}^{⌊ {log}_{2} n ⌋} b_{i} .

For example, we have

11 = 1 \cdot 2^{3} + 0 \cdot 2^{2} + 1 \cdot 2^{1} + 1 \cdot 2^{0}

and thus

b_{0} = b_{1} = b_{3} = 1

,

b_{2} = 0

and

ν (11) = 3

.

Proposition 2.

For

n \geq 2

, let

A

be the SLP produced by RePair on input

a^{n}

and

B

be the SLP produced by LongestMatch on input

a^{n}

. We have

| A | = | B | = 2 ⌊ {log}_{2} n ⌋ + ν (n) - 1 .

Proof.

If

n = 2

or

n = 3

then

a^{n}

has no maximal string and thus the final SLP of any global algorithm has a single rule

S \to a^{n}

. The reader can easily verify the claimed result for these cases.

We now assume that

n \geq 4

. Let

m = ⌊ {log}_{2} n ⌋ - 1

. We prove the claim for RePair first and for LongestMatch after. On input

a^{n}

, RePair runs for exactly m rounds and creates rules

X_{1} \to a a

and

X_{i} \to X_{i - 1} X_{i - 1}

for

i \in [2, m]

, i.e., the nonterminal

X_{i}

produces the string

a^{2^{i}}

. These rules have a total size of

2 m

. After these steps, the start rule is

S \to X_{m} X_{m} X_{m}^{b_{m}} X_{m - 1}^{b_{m - 1}} \dots X_{1}^{b_{1}} a^{b_{0}},

where the

b_{i}

’s are the coefficients occurring in the binary representation of n, see equation (2). In other words, the symbol a only occurs in the start rule if the least significant bit is

b_{0} = 1

, and the nonterminal

X_{i}

(

i \in [1, m - 1]

) occurs in the start rule if and only if

b_{i} = 1

. Since RePair only replaces words with at least two occurrences, the most significant bit

b_{m + 1} = 1

is represented by

X_{m} X_{m}

. A third

X_{m}

occurs in the start rule if and only if

b_{m} = 1

. The size of the start rule is

2 + \sum_{i = 0}^{m} b_{i}

. It follows that the total size of the SLP produced by RePair on input

a^{n}

is

2 m + 2 + \sum_{i = 0}^{m} b_{i}

, which together with

m = ⌊ {log}_{2} n ⌋ - 1

and

b_{⌊ {log}_{2} n ⌋} = 1

(the most significant bit is always 1) yields the claimed size.

Now we prove the same result for LongestMatch. In the first round, the chosen maximal string is

a^{⌊ n / 2 ⌋}

, which yields rules

X_{1} \to a^{⌊ n / 2 ⌋}

and

S \to X_{1} X_{1} a^{b_{0}}

, i.e., the symbol a occurs in the start rule if and only if n is odd and thus the least significant bit is

b_{0} = 1

. Assuming that

n \geq 8

, this procedure is now repeated for the rule

X_{1} \to a^{⌊ n / 2 ⌋}

(for

n < 8

there is no maximal string, and the algorithm stops after the first round). This yields

X_{2} \to a^{⌊ n / 4 ⌋}

,

X_{1} \to X_{2} X_{2} a^{b_{1}}

and

S \to X_{1} X_{1} a^{b_{0}}

(note that

⌊ (⌊ n / 2 ⌋) / 2 ⌋ = ⌊ n / 4 ⌋

). After

m = ⌊ {log}_{2} n ⌋ - 1

steps, the iteration of this process results in the final SLP with rules

S \to X_{1} X_{1} a^{b_{0}}

,

X_{i} \to X_{i + 1} X_{i + 1} a^{b_{i}}

for

i \in [1, m - 1]

and

X_{m} \to a a a^{b_{m}}

. The size of this SLP is

2 \cdot (m + 1) + \sum_{i = 0}^{m} b_{i}

, which directly implies the claimed result for LongestMatch. □

Using Proposition 2, we prove the matching bounds for

α_{RePair} (1, n)

and

α_{LM} (1, n)

:

Theorem 3.

For all n, we have

α_{RePair} (1, n) = α_{LM} (1, n) \leq {log}_{2} (3)

.

Proof.

As a consequence of Proposition 2, RePair and LongestMatch produce on input

a^{n}

SLPs of size at most

3 {log}_{2} n

, since

ν (n) - 1 \leq {log}_{2} n

. By Lemma 1, we have

g (a^{n}) \geq 3 {log}_{3} n - 3

. The equality

{log}_{2} n / {log}_{3} n = {log}_{2} (3)

finishes the proof. □

Theorem 4.

For infinitely many n, we have

α_{RePair} (1, n) = α_{LM} (1, n) \geq {log}_{2} (3)

.

Proof.

Let

w_{k} = a^{2^{k} - 1}

. We have

2^{k} - 1 = \sum_{i = 0}^{k - 1} 2^{i}

and thus

ν (2^{k} - 1) = k

. By Proposition 2, the size of the SLPs produced by RePair and LongestMatch is

3 k - 3

. By Lemma 1, we have

g (w_{k}) \leq 3 {log}_{3} (2^{k} - 1) + o (log (2^{k} - 1) \leq 3 {log}_{3} (2) \cdot k + o (k) .

The equality

1 / {log}_{3} (2) = {log}_{2} (3)

finishes the proof. □

6. Discussion

Although we improved the upper bound on the approximation ration for Greedy, the lower bound remains quite weak. We might be able to improve it by finding a similar sequence such that Greedy produces larger remainders in each round. However, care must be taken since for larger remainders it is not true anymore that the rules can be analyzed independently because the rules could share factors of length greater than one. Concerning the upper bound, we conjecture that Greedy achieves logarithmic compression for all unary inputs and thus the approximation ratio is constant, but we have not been able to prove this so far. For arbitrary alphabets, a nonconstant lower bound for Greedy as well as an improvement of the upper bound of

O ({(n / log n)}^{2 / 3})

for any global algorithm seems to be natural starting points for future work.

Author Contributions

Conceptualization, D.H.; methodology, D.H.; validation, D.H. and C.P.R.; formal analysis, D.H.; writing—original draft preparation, D.H.; writing—review and editing, C.P.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the DFG research project LO 748/10-2.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536. [Google Scholar] [CrossRef]
Kieffer, J.C.; Yang, E.; Nelson, G.J.; Cosman, P.C. Universal lossless compression via multilevel pattern matching. IEEE Trans. Inf. Theory 2000, 46, 1227–1245. [Google Scholar] [CrossRef]
Nevill-Manning, C.G.; Witten, I.H. Identifying Hierarchical Structure in Sequences: A linear-time algorithm. CoRR 1997, 7, 67–82. [Google Scholar] [CrossRef]
Apostolico, A.; Lonardi, S. Some Theory and Practice of Greedy Off-Line Textual Substitution. In Proceedings of the Data Compression Conference, DCC 1998, IEEE Computer Society, Snowbird, UH, USA, 30 March–1 April 1998; pp. 119–128. [Google Scholar] [CrossRef]
Apostolico, A.; Lonardi, S. Compression of Biological Sequences by Greedy Off-Line Textual Substitution. In Proceedings of the Data Compression Conference, DCC 2000, IEEE Computer Society, Snowbird, UH, USA, 28–30 March 2000; pp. 143–152. [Google Scholar] [CrossRef]
Apostolico, A.; Lonardi, S. Off-line compression by greedy textual substitution. Proc. IEEE 2000, 88, 1733–1744. [Google Scholar] [CrossRef]
Larsson, N.J.; Moffat, A. Offline Dictionary-Based Compression. In Proceedings of the Data Compression Conference, DCC 1999, IEEE Computer Society, Snowbird, UH, USA, 29–31 March 1999; pp. 296–305. [Google Scholar] [CrossRef]
Kieffer, J.C.; Yang, E. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory 2000, 46, 737–754. [Google Scholar] [CrossRef]
Claude, F.; Navarro, G. Fast and Compact Web Graph Representations. ACM Trans. Web 2010, 4, 16:1–16:31. [Google Scholar] [CrossRef]
Kida, T.; Matsumoto, T.; Shibata, Y.; Takeda, M.; Shinohara, A.; Arikawa, S. Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci. 2003, 298, 253–272. [Google Scholar] [CrossRef]
González, R.; Navarro, G. Compressed Text Indexes with Fast Locate. In Proceedings of the Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, ON, Canada, 9–11 July 2007; In Lecture Notes in Computer Science. Ma, B., Zhang, K., Eds.; Springer: Berlin, Germany, 2007; Volume 4580, pp. 216–227. [Google Scholar] [CrossRef]
Lohrey, M.; Maneth, S.; Mennicke, R. XML tree structure compression using RePair. Inf. Syst. 2013, 38, 1150–1167. [Google Scholar] [CrossRef]
Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The smallest grammar problem. IEEE Trans. Inf. Theory 2005, 51, 2554–2576. [Google Scholar] [CrossRef]
Hucke, D.; Lohrey, M.; Reh, C.P. The Smallest Grammar Problem Revisited. In Proceedings of the String Processing and Information Retrieval—23rd International Symposium, SPIRE 2016, Beppu, Japan, 18–20 October 2016; In Lecture Notes in Computer Science. Inenaga, S., Sadakane, K., Sakai, T., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9954, pp. 35–49. [Google Scholar] [CrossRef]
Hucke, D.; Jez, A.; Lohrey, M. Approximation ratio of RePair. arXiv 2017, arXiv:1703.06061. [Google Scholar]
Knuth, D.E. The Art of Computer Programming, Volume II: Seminumerical Algorithms, 3rd ed.; Addison-Wesley: Boston, MA, USA, 1998. [Google Scholar]
Noma, A.M.; Muhammed, A.; Mohamed, M.A.; Zulkarnain, Z.A. A Review on Heuristics for Addition Chain Problem: Towards Efficient Public Key Cryptosystems. J. Comput. Sci. 2017, 13, 275–289. [Google Scholar] [CrossRef]
Hucke, D. Approximation Ratios of RePair, LongestMatch and Greedy on Unary Strings. In Proceedings of the String Processing and Information Retrieval—26th International Symposium, SPIRE 2019, Segovia, Spain, 7–9 October 2019; In Lecture Notes in Computer Science. Brisaboa, N.R., Puglisi, S.J., Eds.; Springer: Berlin, Germany, 2019; Volume 11811, pp. 3–15. [Google Scholar] [CrossRef]
Aho, A.V.; Sloane, N.J.A. Some doubly exponential sequences. Fibonacci Quart. 1973, 11, 429–437. [Google Scholar]
Berstel, J.; Brlek, S. On the Length of Word Chains. Inf. Process. Lett. 1987, 26, 23–28. [Google Scholar] [CrossRef]
Diwan, A.A. A New Combinatorial Complexity Measure for Languages; Tata Institute: Bombay, India, 1986. [Google Scholar]
Storer, J.A.; Szymanski, T.G. Data compression via textual substitution. J. ACM 1982, 29, 928–951. [Google Scholar] [CrossRef]
Casel, K.; Fernau, H.; Gaspers, S.; Gras, B.; Schmid, M.L. On the Complexity of Grammar-Based Compression over Fixed Alphabets. In Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, Rome, Italy, 11–15 July 2016; Chatzigiannakis, I., Mitzenmacher, M., Rabani, Y., Sangiorgi, D., Eds.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik. 2016; Volume 55, p. 122. [Google Scholar] [CrossRef]
Ochoa, C.; Navarro, G. RePair and All Irreducible Grammars are Upper Bounded by High-Order Empirical Entropy. IEEE Trans. Inf. Theory 2019, 65, 3160–3164. [Google Scholar] [CrossRef]

Figure 1. On the left, the general pattern that is applied during the construction of

T_{i}

is illustrated, where the split refers to a rule

X \to Z^{d}

that has been introduced by the global algorithm. On the right, the tree

T_{3}

that corresponds to the introduced rules of Example 3 is shown.

Figure 1. On the left, the general pattern that is applied during the construction of

T_{i}

is illustrated, where the split refers to a rule

X \to Z^{d}

that has been introduced by the global algorithm. On the right, the tree

T_{3}

that corresponds to the introduced rules of Example 3 is shown.

Figure 2. Three rounds of Greedy on input

a^{y_{k}}

.

Figure 2. Three rounds of Greedy on input

a^{y_{k}}

.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.

Approximation Ratios of RePair, LongestMatch and Greedy on Unary Strings †

Abstract

1. Introduction

Related Work

2. Preliminaries

3. Upper Bound for Greedy

4. Lower Bound for Greedy

5. RePair and LongestMatch

6. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Article Access Statistics

Approximation Ratios of RePair, LongestMatch and Greedy on Unary Strings^†

3. Upper Bound for `Greedy`

4. Lower Bound for `Greedy`

5. `RePair` and `LongestMatch`