Article

Lempel-Ziv Parsing for Sequences of Blocks

by Dmitry Kosolobov 1,* and Daniel Valenzuela 2
1 Department of Physical and Mathematical Sciences, Ural Federal University, 620000 Ekaterinburg, Russia
2 Department of Computer Science, University of Helsinki, FI-00014 Helsinki, Finland
* Author to whom correspondence should be addressed.
Algorithms 2021, 14(12), 359; https://doi.org/10.3390/a14120359
Submission received: 10 November 2021 / Revised: 5 December 2021 / Accepted: 9 December 2021 / Published: 10 December 2021
(This article belongs to the Section Analysis of Algorithms and Complexity Theory)

Abstract

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer $b > 1$, the number $z$ of phrases in the LZ77 parsing of a string of length $n$ and the number $z_b$ of phrases in the LZ77 parsing of the same string in which blocks of length $b$ are interpreted as separate letters (e.g., $b = 8$ in the case of bytes) are related as $z_b = O(bz \log\frac{n}{z})$. The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound $z_b = O(bz)$ for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsings produced, for instance, by grammar-based compression methods.

1. Introduction

The Lempel-Ziv parsing (LZ77) [1,2] is one of the central techniques in data compression and string algorithms. Its idea is simple: to compress the data, we parse the data into phrases $f_1 f_2 \cdots f_z$ such that each phrase $f_i$ either is one letter or has an earlier occurrence in $f_1 f_2 \cdots f_i$, and the compressed encoding, instead of storing the phrases explicitly, stores references to these occurrences. (In our investigation, we consider both non-overlapping and overlapping versions of LZ77; see precise definitions below.) Typically, algorithms that produce the parsing interpret the data as a sequence of bytes; however, some algorithms treat the data as a sequence of bits. What is the relationship between these parsings that differ only in the way they view the same input? In this paper, we investigate this question.
Our main result is that, for any integer $b > 1$, the number $z$ of phrases in the LZ77 parsing of a string of length $n$ and the number $z_b$ of phrases in the LZ77 parsing of the same string in which blocks of length $b$ are interpreted as separate letters (e.g., $b = 8$ in the case of bytes) are related as $z_b = O(bz \log\frac{n}{z})$ (a more precise formulation follows). We partially complement this upper bound with a lower bound $z_b = \Omega(bz)$ attained on a series of examples. Further, we prove that a better bound $z_b = O(bz)$, which is tight, holds for the special case when each phrase $f_i$ in the LZ77 parsing has a “phrase-aligned” earlier occurrence, i.e., $f_i = f_j f_{j+1} \cdots f_{j'}$ for some $j \le j' < i$. This special case is particularly interesting in connection with grammar compression [3]: a grammar of size $g$ that produces the string naturally induces such a “phrase-aligned” LZ77 parsing of size $O(g)$ [4].
The present work is a continuation of the line of research focused on the comparison of different efficiently computable measures of compression, mainly centering around the LZ77 parsing, the gold standard in this field (see a discussion of other close measures in [5,6,7]). The results on this topic are too numerous to be listed here exhaustively. We, however, point out a couple of relevant studies and open problems.
The relations between the LZ77 parsing and grammar-based compression, which naturally induces an LZ77 parsing [4], are not sufficiently understood: known upper and lower bounds on their sizes differ by an $O(\log\log n)$ factor [8,9,10]. A better lower bound $z_b = \Omega(bz \log\frac{n}{z})$, which would show that our main result is tight, even only for $b = 2$, would imply that the minimal grammar generating the string attaining this bound has size $\Omega(z \log\frac{n}{z})$, thus removing the $O(\log\log n)$-factor gap. This gives a new approach to attack this problem. However, it is still quite possible that our upper bound $O(bz \log\frac{n}{z})$ can be improved and lowered to $O(bz)$.
The recently introduced LZ77 variant called ReLZ [11] uses a certain preprocessing in the spirit of the so-called relative LZ77 (RLZ) [12]. The efficiency of the obtained parsing was evaluated mostly experimentally, and there are no good upper bounds comparing it to the classical LZ77. Our present result actually stems from that work, from an attempt to find a good upper (or lower) bound for this version of the LZ77 parsing. We believe that the techniques developed in the present paper could help to obtain such bounds.
The paper is organized as follows. The following section provides all necessary definitions and known facts about the LZ77 parsings and related concepts. In Section 3, we formalize the idea of “contracted” blocks of length b and construct a series of examples showing that one of our upper bounds is tight. Then, in Section 4, we prove our upper bound results. We conclude with some open problems in Section 5.

2. LZ77 Parsings

A string $s$ over an alphabet $\Sigma$ is a map $\{1, 2, \ldots, n\} \to \Sigma$, where $n$ is referred to as the length of $s$, denoted by $|s|$. The string of length zero is called the empty string. We write $s[i]$ for the $i$th letter of $s$ and $s[i..j]$ for $s[i] s[i+1] \cdots s[j]$ (which is empty if $i > j$). A string $u$ is a substring of $s$ if $u = s[i..j]$ for some $i$ and $j$; the pair $(i, j)$ is not necessarily unique, and we say that $i$ specifies an occurrence of $u$ in $s$ starting at position $i$. We say that substrings $s[i..j]$ and $s[i'..j']$ overlap if $j \ge i'$ and $i \le j'$. Throughout the text, substrings are sometimes identified with their particular occurrences; we do not emphasize this explicitly if it is clear from the context. We say that a substring $s[i..j]$ has an earlier occurrence if there exist $i'$ and $j'$ such that $s[i'..j'] = s[i..j]$ and $j' < i$ (note that the occurrences do not overlap). A substring $s[1..j]$ (respectively, $s[i..n]$) is a prefix (respectively, suffix) of $s$. For any $i$ and $j$, the set $\{k \in \mathbb{Z} : i \le k \le j\}$ (possibly empty) is denoted by $[i..j]$. All logarithms have base two.
A parsing $s = f_1 f_2 \cdots f_z$ of a given string $s$ is called a Lempel-Ziv (LZ77) parsing if each string $f_i$ (called a phrase) is non-empty and either is one letter or occurs in the string $f_1 f_2 \cdots f_{i-1}$ (i.e., it has an earlier occurrence). The size of the parsing is the number $z$ of phrases. The parsing is called greedy if it is built greedily from left to right by choosing each phrase $f_i$ as the longest substring that starts at a given position and occurs in $f_1 f_2 \cdots f_{i-1}$ (see [2]). It is known that the greedy parsing is, in a sense, optimal, as stated in the following lemma.
Lemma 1
(see [4,10,13]). No LZ77 parsing of a string can have smaller size than the greedy LZ77 parsing.
The defined LZ77 parsing is often called non-overlapping since its non-one-letter phrases have non-overlapping earlier occurrences. An analogously defined overlapping LZ77 parsing is its more popular variant: it is a parsing $f_1 f_2 \cdots f_z$ in which each non-one-letter phrase $f_i$ has at least two occurrences in the string $f_1 f_2 \cdots f_i$ (so that “earlier occurrences” of $f_i$ might overlap $f_i$). Consider, for example, $s = abababc$. The greedy non-overlapping and overlapping LZ77 parsings of $s$ are $a.b.ab.ab.c$ and $a.b.abab.c$, respectively.
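To make the two variants concrete, the following minimal Python sketch (a quadratic-time brute force of our own, for illustration only) computes both greedy parsings and reproduces the example above:

```python
def greedy_lz77(s, overlapping=False):
    # Greedy LZ77: each phrase is a single letter or the longest substring
    # starting at the current position that has an earlier occurrence:
    # fully inside the processed prefix for the non-overlapping variant,
    # or merely starting before the current position for the overlapping one.
    phrases, i, n = [], 0, len(s)
    while i < n:
        ell = 1
        while i + ell < n:
            cand = s[i:i + ell + 1]
            ok = (s.find(cand) < i) if overlapping else (cand in s[:i])
            if not ok:
                break
            ell += 1
        phrases.append(s[i:i + ell])
        i += ell
    return phrases

print(greedy_lz77("abababc"))                    # ['a', 'b', 'ab', 'ab', 'c']
print(greedy_lz77("abababc", overlapping=True))  # ['a', 'b', 'abab', 'c']
```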
In this paper, we mainly discuss the non-overlapping version and, hence, for brevity, often omit the term “non-overlapping”, which is assumed by default. Clearly, every LZ77 parsing is an overlapping LZ77 parsing, but the converse is not necessarily true. Indeed, one can show that the non-overlapping and overlapping LZ77 parsings are not equivalent in terms of size (see [14]). Our results, however, hold for both variants.
An LZ77 parsing $f_1 f_2 \cdots f_z$ is phrase-aligned if each non-one-letter phrase $f_i$ has an earlier occurrence $f_i = f_j f_{j+1} \cdots f_{j'}$ for some $j \le j' < i$. This particular type of parsing is interesting because of its close connections to grammar compression, another popular compression technique.
A grammar is a set of rules of the form $A \to BC$ and $A \to a$ with a designated initial rule, where $a$ denotes a letter and $A, B, C$ denote non-terminals; see [15]. The size of a grammar is the number of rules in it. A straight-line program (SLP) grammar is a grammar that produces exactly one string. The SLP grammars and LZ77 parsings are related as follows.
Lemma 2
(see [4,10]). If a string $s$ is produced by an SLP grammar of size $g$, then there exists a phrase-aligned LZ77 parsing $f_1 f_2 \cdots f_z$ for $s$ of size at most $g$.
By a non-constructive argument [8,10,16], one can show that the converse equivalent reduction, from LZ77 parsings to SLP grammars, is not possible: in some cases, the size of the minimal SLP grammar can be $\Omega(\frac{\log n}{\log\log n})$ times larger than the size of the greedy (i.e., minimal) LZ77 parsing. For completeness, let us show this by repeating here the counting argument essentially used in [10,17].
Consider all SLPs of size $g$ that produce strings over the alphabet $\{a, b\}$ and contain the rules $A \to a$ and $B \to b$. Since each rule $C \to DE$ can be constructed in at most $g^2$ ways by choosing a pair $(D, E)$, there are at most $g (g^2)^{g-2}$ possible configurations of the SLPs ($g^2$ choices of the pairs $(D, E)$ for each of the $g - 2$ remaining rules, and the choice of the initial rule). Therefore, all such SLPs produce at most $g (g^2)^{g-2} \le g^{2g}$ different strings. Further, for given $n$ and $k \in [1..n-1]$, there are exactly $\binom{n}{k}$ strings of length $n$ consisting of $n - k$ letters $a$ and $k$ letters $b$, and each such string has an LZ77 parsing of size $O(k + \log n)$. If each such string can be produced by an SLP of size $g$, then $g^{2g} \ge \binom{n}{k}$ and, hence, $2g \log g \ge \log \binom{n}{k} \ge k \log\frac{n}{k}$. Choosing $k = \log n$, we deduce that $g \log g = \Omega(\log^2 n)$ and, therefore, $g = \Omega(\frac{\log^2 n}{\log\log n})$, which implies that $g = \Omega(k \frac{\log n}{\log\log n})$, an $\Omega(\frac{\log n}{\log\log n})$-fold blow-up in size compared to the LZ77 parsing of size $O(k + \log n) = O(k)$.
Although there is no equivalent reduction from LZ77 parsings to grammars, a slightly weaker reduction, described in the following lemma, still holds.
Lemma 3
(see [4,10,18,19]). If a string $s$ has an (overlapping or non-overlapping) LZ77 parsing of size $z$, then there exists an SLP grammar of size $O(z \log\frac{n}{z})$ producing $s$, where $n = |s|$.

3. Block Contractions and a Lower Bound for Their LZ77 Parsings

Fix an integer $b > 0$. A b-block contraction of a string $s$ is a string $t$ such that $|t| = \lceil |s| / b \rceil$ and, for any $i, j \in [1..|t|]$, we have $t[i] = t[j]$ iff $s[b(i-1)+1 .. bi] = s[b(j-1)+1 .. bj]$. (The string $s$ is padded arbitrarily with new letters so that these substrings are well defined.) The substrings $s[b(i-1)+1 .. bi]$ are called b-blocks, or blocks if $b$ is clear from the context. We say that a substring $s[i..j]$ starts (respectively, ends) on a block boundary if $i \equiv 1 \pmod{b}$ (respectively, $j \equiv 0 \pmod{b}$). A substring $s[i..j]$ is block-aligned if it starts and ends on block boundaries.
For example, consider $s = ababbabc$ and $b = 2$. A b-block contraction of $s$ is $ddef$, where the letter $d$ corresponds to the block $ab$, $e$ corresponds to $ba$, and $f$ to $bc$. The string $ab$ has two block-aligned occurrences, at positions 1 and 3, and one non-block-aligned occurrence, at position 6.
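The contraction itself is straightforward to compute. The sketch below is our own helper (the name and the use of a Unicode private-use area for fresh letters are our choices, so that the result is again an ordinary Python string); it reproduces the example:

```python
def block_contraction(s, b, pad="\x00"):
    # Rename each distinct length-b block to a fresh letter; the last block
    # is padded with a new letter, mirroring the paper's convention.
    if len(s) % b:
        s += pad * (b - len(s) % b)
    ids = {}
    return "".join(
        chr(0xE000 + ids.setdefault(s[i:i + b], len(ids)))
        for i in range(0, len(s), b)
    )

t = block_contraction("ababbabc", 2)
print([ord(x) - 0xE000 for x in t])  # [0, 0, 1, 2], i.e., the string "ddef"
```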
In this paper, we are interested in the comparison of LZ77 parsings for a string and its b-block contraction. The next theorem establishes a lower bound by providing a series of examples. We will then show in the following section that this lower bound is tight for phrase-aligned LZ77 parsings.
Theorem 1.
For any integers $b$ and $z$ such that $1 < b \le z/2$, there exists a string that has a phrase-aligned LZ77 parsing of size $\Theta(z)$ and whose b-block contraction can have only LZ77 parsings of size at least $\Omega(bz)$.
Proof. 
Consider a string $t$ over the alphabet $\{a, b\}$ whose greedy LZ77 parsing is phrase-aligned and has size $z' = z - b$. For $h \in [0..b-1]$, define a morphism $\phi_h$ such that $\phi_h(a) = c^{b-h-1} a c^h$ and $\phi_h(b) = c^{b-h-1} b c^h$, where $c$ is a letter different from $a$ and $b$. Note that $|\phi_h(a)| = |\phi_h(b)| = b$. The example string is $s = \phi_0(t)\, c^b\, \phi_1(t) \phi_2(t) \cdots \phi_{b-2}(t)$. It has a phrase-aligned LZ77 parsing of size $O(z' + b) = O(z)$ constructed as follows. The first occurrences of the substrings $\phi_0(a)$ and $\phi_0(b)$ take $O(b)$ phrases to encode; therefore, $\phi_0(t)$ can be encoded into $O(z' + b)$ phrases by mimicking the parsing of $t$. Further, each substring $\phi_h(t)$, for $h > 0$, occurs at position $h + 1$ and, thus, can be encoded by one phrase referring to this occurrence. The latter referenced occurrence is phrase-aligned provided the prefix $s[1..b]$ and the substring $s[|\phi_0(t)| + 1 .. |\phi_0(t)| + b] = c^b$ were parsed trivially as sequences of $b$ one-letter phrases.
At the same time, any LZ77 parsing of a b-block contraction of the string $s$ must have size at least $(b-1) z'$: the b-block contraction of each substring $\phi_h(t)$ produces a “copy” of $t$ over a different alphabet, but the new “contracted” letters of these “copies” are pairwise distinct; therefore, each contraction must be parsed without references to other substrings, hence occupying at least $z'$ phrases. It remains to note that $(b-1) z' = (b-1)(z - b) \ge (b-1) z / 2 = \Omega(bz)$. □
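The construction from the proof is easy to replay in code. The sketch below reuses the greedy_lz77 and block_contraction helpers from the earlier sketches; the seed string t is an arbitrary repetitive example of ours, and the printed sizes merely illustrate the trend, not exact constants:

```python
def phi(h, b, x, c="c"):
    # the morphism from the proof: a letter x maps to c^(b-h-1) x c^h
    return c * (b - h - 1) + x + c * h

def example_string(t, b):
    # s = phi_0(t) c^b phi_1(t) phi_2(t) ... phi_{b-2}(t)
    s = "".join(phi(0, b, x) for x in t) + "c" * b
    for h in range(1, b - 1):
        s += "".join(phi(h, b, x) for x in t)
    return s

t = "abaababaabaababaababaabaab"  # arbitrary binary seed string
for b in (3, 5, 8):
    s = example_string(t, b)
    z = len(greedy_lz77(s))
    zb = len(greedy_lz77(block_contraction(s, b)))
    print(b, z, zb)  # on this family, zb is expected to grow roughly like b*z
```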

4. Upper Bounds on LZ77 Parsings for Block Contractions

We first consider the case of phrase-aligned LZ77 parsings in the following theorem. As will be seen later, it easily implies our result for arbitrary parsings. The proof of this theorem is quite complicated and long, occupying most of the present section.
Theorem 2.
Suppose that $f_1 f_2 \cdots f_z$ is a phrase-aligned LZ77 parsing of a string $s$, i.e., every non-one-letter phrase $f_i$ has an earlier occurrence $f_i = f_j f_{j+1} \cdots f_{j'}$, for some $j \le j' < i$. Then, for any integer $b > 1$, the size of the greedy LZ77 parsing of a b-block contraction for $s$ is $O(bz)$.
Proof. 
Let $s$ be a given string and $s_b$ be a b-block contraction of $s$. Assume that the alphabets of $s$ and $s_b$ do not intersect. We are to show that there exists some LZ77 parsing of $s_b$ of size $O(bz)$; the statement of the theorem then immediately follows from Lemma 1. The proof is split into five parts, presented in Section 4.1, Section 4.2, Section 4.3, Section 4.4 and Section 4.5.

4.1. Basic Ideas

For simplicity, assume that $|s|$ is a multiple of $b$ (if not, $s$ is padded with at most $b$ new letters). Our construction relies on the following notion: for $h \in [0..b-1]$, the h-shifted block parsing of $s$ is the following parsing of the string $s[h+1 .. |s|-b+h]$:

$$s[h+1 .. |s|-b+h] = t_{h,1} d_{h,1} t_{h,2} d_{h,2} \cdots t_{h,z'-1} d_{h,z'-1} t_{h,z'},$$

where $|d_{h,1}| = |d_{h,2}| = \cdots = |d_{h,z'-1}| = b$, and each phrase $t_{h,i}$ is the longest substring starting at position $p = h + |t_{h,1} d_{h,1} \cdots t_{h,i-1} d_{h,i-1}| + 1$ whose length is a multiple of $b$ and that is entirely contained in the phrase $f_j$ covering the position $p$, i.e., $0 < p - |f_1 f_2 \cdots f_{j-1}| \le |f_j|$ and $p + |t_{h,i}| - 1 \le |f_1 f_2 \cdots f_j|$ (in particular, $t_{h,i}$ might be empty if $f_j$ is too short). The parsing might also be viewed as constructed by a greedy process that, starting from the position $h + 1$, alternately chooses first the longest substring $t_{h,i}$ whose length is a multiple of $b$ that starts at the current position and is “inscribed” in the current phrase $f_j$, and then a phrase $d_{h,i}$ of length exactly $b$, which “bridges” neighboring phrases of $f_1 f_2 \cdots f_z$. It is straightforward that $z' \le 2z$. It is instructive to imagine the h-shifted block parsings (for $h = 0, 1, \ldots, b-1$) written one after another, as in Figure 1. Let us briefly describe the rationale behind the definition.
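The following sketch (our own 0-indexed rendering, assuming $|s|$ is already padded to a multiple of $b$) computes the pieces of the h-shifted block parsing from the phrase lengths of $f_1 f_2 \cdots f_z$; a trailing empty $t$ piece is simply dropped:

```python
import bisect

def shifted_block_parsing(phrase_lens, b, h, n):
    # Returns (kind, start, length) triples with kind in {'t', 'd'}: each 't'
    # piece is a maximal run of whole b-blocks inscribed in a single phrase
    # f_j, and each 'd' piece is the block of b letters emitted after it.
    ends, acc = [], 0
    for length in phrase_lens:
        acc += length
        ends.append(acc)                  # ends[j] = |f_1 ... f_{j+1}|
    out, p, stop = [], h, n - b + h       # parse s[h .. n-b+h-1]
    while p < stop:
        j = bisect.bisect_right(ends, p)  # index of the phrase covering p
        t_len = min((ends[j] - p) // b, (stop - p) // b) * b
        out.append(('t', p, t_len))       # may be empty if f_j is too short
        p += t_len
        if p < stop:
            out.append(('d', p, b))       # the length-b piece that follows
            p += b
    return out

# phrase lengths 1,1,2,2,1,1, e.g., the parsing a.b.ab.ab.c.# of "abababc#"
print(shifted_block_parsing([1, 1, 2, 2, 1, 1], b=2, h=0, n=8))
# [('t', 0, 0), ('d', 0, 2), ('t', 2, 2), ('d', 4, 2)]
```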
In the case $h = 0$, we omit the first index “$h$” in the notation $t_{h,i}, d_{h,i}$ and write the zero-shifted block parsing as $t_1 d_1 t_2 d_2 \cdots d_{z'-1} t_{z'}$. Note that the phrases in this parsing are block-aligned, and each phrase $t_i$ has an earlier occurrence; however, it is not an LZ77 parsing of $s$ since the substrings $d_i$ might not have earlier occurrences. Moreover, the zero-shifted block parsing could have served as a generator of a correct LZ77 parsing for $s_b$ if every substring $t_i$ had a block-aligned earlier occurrence in $s$: then, the required parsing for $s_b$ would have consisted of $2z' - 1$ phrases obtained as the b-block contractions of $t_1, d_1, t_2, d_2, \ldots, d_{z'-1}, t_{z'}$, respectively. (Note that each string $d_i$ has length $b$ and, thus, is contracted into one letter.) Unfortunately, although each substring $t_k$ has earlier occurrences in $s$, it does not necessarily have an occurrence starting and ending on block boundaries. However, each earlier occurrence $s[i..j] = t_k$ is block-aligned in the h-shifted block parsing of $s$ with $h = j \bmod b$. This observation is the primary motivation for the introduction of the h-shifted block parsings, and we use it in the sequel.
Informally, our idea is as follows. Consider a phrase $t'$ whose earlier occurrence $s[i..j] = t'$ does not start and end on block boundaries. The occurrence $s[i..j]$ is block-aligned according to an h-shifted block parsing and, since the initial parsing $f_1 f_2 \cdots f_z$ was phrase-aligned, coincides with the concatenation of some consecutive phrases $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ of this h-shifted block parsing (details are discussed below). The phrase $t'$ will be disassembled in place into smaller block-aligned chunks that have earlier occurrences starting and ending on block boundaries. Thus, a block-aligned copy of the fragment $s[i..j] = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ will appear in place of $t'$ and, if a following phrase $t''$ refers to the same fragment $s[i..j]$, we can instead refer to this reconstructed block-aligned copy without the need to disassemble into chunks again. Suppose that each of the phrases $t_{h,k}, t_{h,k+1}, \ldots, t_{h,k'}$ in the decomposition of $s[i..j]$ has a block-aligned occurrence before the phrase $t'$. Then, the number of chunks in place of $t'$ is $O(k' - k)$: the “reconstruction” proceeds as consecutive “concatenations” of $t_{h,k}, d_{h,k}, t_{h,k+1}, d_{h,k+1}, \ldots, d_{h,k'-1}, t_{h,k'}$, skipping “concatenations” that already occurred before. There are at most $O(z') = O(z)$ such concatenations in total for each $h$. After this, all substrings block-aligned according to the h-shifted block parsing will have occurrences starting and ending on block boundaries according to the string $s$ itself. Over all $h \in [0..b)$, we obtain $O(bz)$ “concatenations” in total. (This informal argument requires further clarifications and details that follow below.) We note, however, that this is only an intuition, and there are many details along the way. For instance, if a phrase $t_{h,l}$ had no block-aligned occurrences before the phrase $t''$, we would have to “reconstruct” $t_{h,l}$ by disassembling it analogously into chunks, thus recursively performing the same process. The counting of “concatenations” during the “reconstructions” has many details too.

4.2. Greedy Phrase-Splitting Procedure

Our construction of an LZ77 parsing for $s_b$ transforms the zero-shifted block parsing of $s$. We consecutively consider the phrases $t_1, d_1, t_2, d_2, \ldots, d_{z'-1}, t_{z'}$ from left to right: for each $d_i$, we emit a one-letter phrase (the b-block contraction of $d_i$) into the resulting parsing for $s_b$; for each $t_i$, we perform a process splitting $t_i$ into chunks whose lengths are multiples of $b$ and each of which either is of length $b$ or has a block-aligned earlier occurrence, and the b-block contractions of these chunks are the new phrases emitted into the resulting parsing for $s_b$. Let us describe the process splitting $t_i$.
Consider a phrase $t_i$ and suppose that the phrases $t_1, d_1, t_2, d_2, \ldots, d_{i-1}$ were already processed, and the new phrases corresponding to them in the parsing of $s_b$ under construction were emitted. If $|t_i| = 0$, we skip this step. If $|t_i| = b$, we simply emit a new one-letter phrase by contracting $t_i$. Suppose that $|t_i| > b$. Denote by $f_\ell$ the phrase into which $t_i$ is “inscribed”, i.e., $0 \le |f_1 f_2 \cdots f_\ell| - |t_1 d_1 \cdots t_{i-1} d_{i-1} t_i| < |f_\ell|$. By the assumption of the theorem, there exists a phrase-aligned earlier occurrence of $f_\ell$, i.e., there exist $j$ and $j'$ such that $j \le j' < \ell$ and $f_\ell = f_j f_{j+1} \cdots f_{j'}$. Thus, there is a copy of the phrase $t_i$ in the substring $f_j f_{j+1} \cdots f_{j'}$. While the phrase $t_i$ itself is block-aligned, its copy does not have to be block-aligned. However, the copy is necessarily block-aligned in the h-shifted block parsing with appropriate $h$, namely with $h = (|f_1 f_2 \cdots f_{j-1}| + |t_1 d_1 t_2 d_2 \cdots d_{i-1}| - |f_1 f_2 \cdots f_{\ell-1}|) \bmod b$; see Figure 2.
Let $t_{h,k}, d_{h,k}, t_{h,k+1}, d_{h,k+1}, \ldots, d_{h,k'-1}, t_{h,k'}$ be the phrases of the h-shifted block parsing that are “inscribed” into the substring $f_j f_{j+1} \cdots f_{j'}$ (the phrase $d_{h,k-1}$ could be “inscribed” too, but we omit it in this case). The key observation for our construction is that the following equality holds (see Figure 2):

$$t_i = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}.$$
Now consider the following parsing:

$$t_1 d_1 t_2 d_2 \cdots t_{i-1} d_{i-1} \cdot t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}, \quad (1)$$

which is the zero-shifted block parsing of $s$ up to the phrase $t_i$ in which the phrase $t_i$ was expanded into the chunks $t_{h,k}, d_{h,k}, t_{h,k+1}, d_{h,k+1}, \ldots, d_{h,k'-1}, t_{h,k'}$. If each of the chunks $t_{h,k}, t_{h,k+1}, \ldots, t_{h,k'}$ from the expansion of $t_i$ in Parsing (1) either is of length at most $b$ or has a block-aligned earlier occurrence, then, in principle, we could have emitted into the resulting parsing of $s_b$ the b-block contractions of $t_{h,k}, d_{h,k}, t_{h,k+1}, d_{h,k+1}, \ldots, d_{h,k'-1}, t_{h,k'}$, generating at most $2(k' - k) + 1$ new phrases for $s_b$ in total. Unfortunately, this simple approach produces too many phrases for $s_b$ in the end. Instead, we greedily unite the chunks $t_{h,k}, d_{h,k}, \ldots$ from left to right so that the resulting united chunks still either have length $b$ or have block-aligned earlier occurrences. Namely, we greedily take the maximum $m \in [k..k']$ such that the substring $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,m-1} t_{h,m}$ either has length $b$ or has a block-aligned earlier occurrence (i.e., a block-aligned occurrence in the prefix $t_1 d_1 t_2 d_2 \cdots t_{i-1} d_{i-1}$); then, the b-block contraction of the string $q_1 = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,m-1} t_{h,m}$ is emitted as a new phrase into the parsing of $s_b$ under construction. If $m = k'$, then the whole string $t_i$ was “split” into one chunk $q_1$ and we are done. Otherwise (if $m < k'$), we further emit the next (one-letter) phrase, the b-block contraction of $d_{h,m}$, and proceed taking the maximum $m' \in [m+1..k']$ such that the substring $q_2 = t_{h,m+1} d_{h,m+1} t_{h,m+2} d_{h,m+2} \cdots d_{h,m'-1} t_{h,m'}$ either has length $b$ or has a block-aligned earlier occurrence (i.e., a block-aligned occurrence in the prefix $t_1 d_1 t_2 d_2 \cdots t_{i-1} d_{i-1} \cdot q_1 d_{h,m}$). Similarly, we emit the contraction of $q_2$ and the one-letter contraction of $d_{h,m'}$ into the resulting parsing of $s_b$. The greedy procedure then analogously continues reading $t_i$ from left to right until the whole string $t_i$ is decomposed into chunks $q_1, d_{h,m}, q_2, d_{h,m'}, \ldots$, and all the corresponding new phrases for $s_b$ are emitted.
The described process works correctly if each of the chunks $t_{h,k}, t_{h,k+1}, \ldots, t_{h,k'}$ in the expansion of $t_i$ in Parsing (1) either has a block-aligned earlier occurrence or is of length at most $b$. What if this is not the case? Suppose that $t_{h,k''}$ is such a “bad” chunk in Parsing (1): it does not have a block-aligned earlier occurrence and $|t_{h,k''}| > b$. In a certain sense, we arrive at a problem analogous to the one we had with the phrase $t_i$: we have to split $t_{h,k''}$ (here, we are talking about the occurrence of $t_{h,k''}$ from the expansion of $t_i$ in Parsing (1)) into a number of chunks whose lengths are multiples of $b$ and each of which either is of length $b$ or has a block-aligned earlier occurrence; we will then emit the b-block contractions of these chunks into the resulting parsing for $s_b$. We solve this problem recursively by the same procedure, which can be roughly sketched as follows. The phrase $t_{h,k''}$ from the h-shifted block parsing of $s$ (not $t_{h,k''}$ from the expansion of $t_i$ in Parsing (1)!) was “inscribed” into a phrase $f_{\ell_0}$ (just as, analogously, $t_i$ was “inscribed” into $f_\ell$), which, by the assumption of the theorem, has a phrase-aligned earlier occurrence $f_{\ell_0} = f_{j_0} f_{j_0+1} \cdots f_{j_0'}$, for some $j_0 \le j_0' < \ell_0$ (just as, similarly, $f_\ell$ had an occurrence $f_\ell = f_j f_{j+1} \cdots f_{j'}$). There is, thus, a copy of $t_{h,k''}$ in the substring $f_{j_0} f_{j_0+1} \cdots f_{j_0'}$. This copy is not necessarily block-aligned according to either the zero-shifted or the h-shifted block parsing. However, the copy is block-aligned with respect to the $h_0$-shifted block parsing for appropriate $h_0$ (just as the copy of $t_i$ in $f_j f_{j+1} \cdots f_{j'}$ was block-aligned according to the h-shifted block parsing). We then consider the fragment $t_{h_0,k_0} d_{h_0,k_0} t_{h_0,k_0+1} d_{h_0,k_0+1} \cdots d_{h_0,k_0'-1} t_{h_0,k_0'}$ of the $h_0$-shifted block parsing “inscribed” into the substring $f_{j_0} f_{j_0+1} \cdots f_{j_0'}$ and conclude that this fragment is equal to $t_{h,k''}$ (just as, analogously, the fragment $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ “inscribed” into $f_j f_{j+1} \cdots f_{j'}$ was equal to $t_i$). We proceed by recursively decomposing $t_{h,k''}$ into chunks, greedily uniting $t_{h_0,k_0}, d_{h_0,k_0}, t_{h_0,k_0+1}, d_{h_0,k_0+1}, \ldots, d_{h_0,k_0'-1}, t_{h_0,k_0'}$ by analogy with the procedure for $t_i$. This recursion, in turn, can itself arrive at an analogous splitting problem for another chunk (say, $t_{h_0,k_0''}$), which, in the same way, is solved recursively.

4.3. Formalized Recursive Phrase-Splitting Procedure

We formalize the described recursive procedure as a function $\mathrm{parse}(k, k', h, p)$, which is called with parameters $k, k', h, p$ such that the substring $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ from the h-shifted block parsing of $s$ occurs before the position $p$ and its copy occurs at position $p$, i.e., $h + |t_{h,1} d_{h,1} t_{h,2} d_{h,2} \cdots d_{h,k'-1} t_{h,k'}| < p$ and $s[p..p'] = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$, for an appropriate $p' > p$. In order to process $t_i$, we call $\mathrm{parse}(k, k', h, |t_1 d_1 t_2 d_2 \cdots t_{i-1} d_{i-1}| + 1)$, where $k$, $k'$, and $h$ are determined as explained above. The function emits phrases of the resulting parsing of $s_b$ from left to right and works as follows:
1.
If the substring $t_{h,k}$ starting at position $p$ does not have a block-aligned earlier occurrence and $|t_{h,k}| > b$, then, as was described above, we find numbers $k_0, k_0', h_0$ such that $t_{h,k} = t_{h_0,k_0} d_{h_0,k_0} t_{h_0,k_0+1} d_{h_0,k_0+1} \cdots d_{h_0,k_0'-1} t_{h_0,k_0'}$, and we first recursively call $\mathrm{parse}(k_0, k_0', h_0, p)$, thus processing $t_{h,k}$; then we emit a one-letter phrase by contracting $d_{h,k}$ and, finally, call $\mathrm{parse}(k+1, k', h, p + |t_{h,k} d_{h,k}|)$, ending the procedure afterwards;
2.
Otherwise, we compute the maximal number $m \in [k..k']$ such that the substring $q_1 = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,m-1} t_{h,m}$ starting at position $p$ either has length at most $b$ or has a block-aligned earlier occurrence;
3.
We emit the b-block contraction of $q_1$ unless $q_1$ is empty (which happens if $m = k$ and $t_{h,k}$ is empty);
4.
If $m = k'$, we exit; otherwise, we emit a one-letter phrase by contracting $d_{h,m}$ and recursively call $\mathrm{parse}(m+1, k', h, p + |q_1 d_{h,m}|)$.
In the sequel, we refer to the steps of the splitting procedure as items 1, 2, 3, and 4.
It is easy to see that the parsing of $s_b$ generated after processing $t_1, d_1, t_2, d_2, \ldots, d_{z'-1}, t_{z'}$ is a correct LZ77 parsing. It remains to estimate its size.

4.4. Basic Analysis of the Number of Produced Phrases

Suppose that every time a new phrase is emitted by the described procedure, we “pay” one point. We are to show that the total payment is $O(bz')$, which is $O(bz)$. For the “payment”, it will suffice to reserve a budget of $7bz'$ points. In this part of the proof, we explain how $4bz'$ points are spent (4 points per phrase $t_{h,k}$; details follow); the remaining $3bz'$ points will be discussed in the final part of the proof.
The described process reads the string $s$ from left to right, emitting new phrases for the parsing of $s_b$ and, during the run, advances a pointer $p$, the last parameter of the function $\mathrm{parse}$, from 1 to $n$. Along the way, block-aligned occurrences of some fragments from h-shifted block parsings appear. The key observation is that once a block-aligned occurrence of a fragment $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ has appeared in the string $s$ at a position $p$, any subsequent block-aligned substring $t_{h,a} d_{h,a} t_{h,a+1} d_{h,a+1} \cdots d_{h,a'-1} t_{h,a'}$ with $k \le a \le a' \le k'$ that appears during the processing of phrases after the position $p$ can be united into one chunk, and there is no need to analyze it by splitting it into smaller chunks again.
For each phrase $t_{h,k}$ with $k \in [1..z']$ and $h \in [0..b-1]$, we reserve four points (hence, $4bz'$ points in total). These points are spent when the string $t_{h,k}$ is first encountered in the function $\mathrm{parse}$ (the spending scheme is detailed below), namely when the function is invoked as $\mathrm{parse}(k_0, k_0', h_0, p)$ with parameters such that $t_{h,k} = t_{h_0,k_0} d_{h_0,k_0} t_{h_0,k_0+1} d_{h_0,k_0+1} \cdots d_{h_0,k_0'-1} t_{h_0,k_0'}$ and $t_{h,k}$ occurs at the position $p$ (the position is necessarily block-aligned; therefore, this is a block-aligned occurrence of $t_{h,k}$). We note that this case includes the initial calls $\mathrm{parse}(k, k', h, p)$ that process phrases $t_i$: it is the particular case where $h = 0$. We call such an invocation of $\mathrm{parse}$ for the phrase $t_{h,k}$ a generating call. The generating call “generates” a block-aligned occurrence of $t_{h,k}$ in the string $s$. For each phrase $t_{h,k}$, at most one generating call might happen: the only place where a generating call could have happened twice is in item 1 of the description of the function $\mathrm{parse}$, but any subsequent occurrence of $t_{h,k}$ will have a block-aligned earlier occurrence and, hence, the condition in item 1 cannot be satisfied. Therefore, there are at most $bz'$ generating calls in total, and we spend at most $4bz'$ points on all of them.
Let us consider the work of a generating call. For simplicity, let it be a call for a phrase $t_i$ as in the description of $\mathrm{parse}(k, k', h, p)$ above, i.e., we have $t_i = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$, $h + |t_{h,1} d_{h,1} t_{h,2} d_{h,2} \cdots d_{h,k'-1} t_{h,k'}| < p$, and $s[p..p'] = t_i$, for an appropriate $p' > p$ (the analysis for phrases $t_{h',i}$ from other $h'$-shifted block parsings is analogous to the analysis for $t_i$). The work of $\mathrm{parse}$ can be essentially viewed as follows. The function $\mathrm{parse}$ splits $t_i$ greedily into chunks of two types:
(a)
chunks of the form $t_{h,a} d_{h,a} t_{h,a+1} d_{h,a+1} \cdots d_{h,a'-1} t_{h,a'}$ with $k \le a \le a' \le k'$ that either have length at most $b$ or have block-aligned earlier occurrences (such chunks are built in item 2 of the description of the function $\mathrm{parse}$, where they are denoted $q_1$);
(b)
chunks $t_{h,a}$ on which a generating call is invoked in item 1, which further recursively splits $t_{h,a}$ into “subchunks” (recall that, according to item 1, this is the case when $t_{h,a}$ did not have block-aligned earlier occurrences before and $|t_{h,a}| > b$).
Such a possible splitting of $t_i$ is schematically depicted in Figure 3, where there is only one recursive generating call (for $t_{h,k+1}$), and there are four chunks of type (a). The b-block contractions of the chunks of type (a) and the contractions of the short phrases $d_{h,a-1}, d_{h,a'}$ surrounding them are emitted as new phrases for the parsing of $s_b$. We spend the whole four-point budget allocated for $t_i$ by paying two points for the leftmost chunk of type (a) and the phrase $d_{h,a'}$ that follows it, and by paying two points for the rightmost chunk of type (a) and the phrase $d_{h,a-1}$ that precedes it (for the example from Figure 3, these are the chunks $t_{h,k}$ and $t_{h,k+6} d_{h,k+6} t_{h,k+7}$, and the short phrases $d_{h,k}$ and $d_{h,k+5}$).
For each chunk $t_{h,a}$ of type (b), the corresponding generating call recursively produces “subchunks” splitting $t_{h,a}$. We delegate the payment for these subchunks to the budget allocated for $t_{h,a}$. It remains to somehow pay for the other chunks of type (a) between the leftmost and rightmost chunks from the decomposition of $t_i$. To this end, we introduce a separate common budget of $3bz'$ points.

4.5. Total Number of Type (a) Chunks

We estimate the total number of chunks of type (a) by connecting them to a common combinatorial structure defined as follows. The structure is called a segment system: it starts with $z'$ unit integer segments $[1..1], [2..2], \ldots, [z'..z']$, and we can perform merge queries, which, given two neighboring segments $[i..j]$ and $[j+1..k]$ with $i \le j < k$, create a new united segment $[i..k]$ and remove the old segments $[i..j]$ and $[j+1..k]$. Clearly, after $z' - 1$ merge queries, everything will be fused into one segment $[1..z']$. (A small implementation sketch follows the invariant below.) Fix $h \in [1..b-1]$. We maintain a segment system associated with $h$, and we associate with each phrase $t_{h,k}$ the segment $[k..k]$. During the work of the function $\mathrm{parse}$, the following invariant is maintained:
  • if a segment $[k..k']$ such that $k' > k$ belongs to the current segment system, then there is a block-aligned occurrence of the string $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ before the currently processed position $p$.
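A minimal union-find sketch of this structure (the class and method names are our own):

```python
class SegmentSystem:
    # Unit segments [1..1], ..., [zp..zp]; merge_right(k) fuses the segment
    # containing k with its right neighbor: one "merge query" of the proof.
    def __init__(self, zp):
        self.parent = list(range(zp + 1))  # parent[i] == i iff i is a left endpoint
        self.right = {k: k for k in range(1, zp + 1)}
        self.zp = zp

    def find(self, k):  # left endpoint of the segment containing k
        while self.parent[k] != k:
            self.parent[k] = self.parent[self.parent[k]]  # path halving
            k = self.parent[k]
        return k

    def segment(self, k):
        lo = self.find(k)
        return lo, self.right[lo]

    def merge_right(self, k):
        lo, hi = self.segment(k)
        assert hi < self.zp, "no right neighbor to merge with"
        lo2, hi2 = self.segment(hi + 1)
        self.parent[lo2] = lo              # [lo..hi] + [hi+1..hi2] -> [lo..hi2]
        self.right[lo] = hi2
        del self.right[lo2]

ss = SegmentSystem(5)
ss.merge_right(2); ss.merge_right(2)       # [2..2] -> [2..3] -> [2..4]
print(ss.segment(4))                       # (2, 4)
```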
Observe that at most $(b-1) z' \le bz'$ merge queries can be performed in total on all $b - 1$ introduced segment systems. We will spend 3 points per query from our remaining budget of $3bz'$ points. The segment systems, their merge queries, and the chunks of type (a) are related as follows.
Let us temporarily alter the algorithm $\mathrm{parse}$ as follows: item 2 first chooses the number $m \in [k..k']$ as in the original version (as the maximal $m \in [k..k']$ such that $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,m-1} t_{h,m}$ can serve as a new chunk), but then we take the segment $[a..a']$ containing $k$ from the segment system, i.e., $a \le k \le a'$, and we assign $m := \min\{m, a'\}$. Thus, possibly, $m$ in the altered version will be smaller than in the original version. Such a choice of $m$ is correct too since, due to our invariant, the altered condition still guarantees that the substring $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,m-1} t_{h,m}$ is either of length $b$ or has a block-aligned earlier occurrence. It is obvious that the case $a < k$ in the altered function $\mathrm{parse}$ can occur only for the leftmost chunk in the decomposition of $t_i$ into chunks. Further, the number of chunks of type (a) obtained for $t_i$ by the altered procedure is not less than the number of chunks of type (a) constructed by the original greedy version. It follows straightforwardly from the description that each chunk of type (a) in the altered version that is not the leftmost or rightmost chunk corresponds to a segment from the segment system associated with $h$, i.e., if $t_{h,a} d_{h,a} t_{h,a+1} d_{h,a+1} \cdots d_{h,a'-1} t_{h,a'}$ is such a chunk, then $[a..a']$ belongs to the system. Similarly, the invariant implies that the unit segments $[a..a]$ corresponding to chunks $t_{h,a}$ of type (b) belong to the system too. For instance, if the example of Figure 3 was obtained in this altered way, the segments $[a..k], [k+1..k+1], [k+2..k+4], [k+5..k+5], [k+6..a']$ would belong to the segment system, for some $a \le k$ and $a' \ge k+7$. We then perform merge queries uniting the segments corresponding to all chunks generated for $t_i$, except the leftmost and rightmost chunks. (For the example in Figure 3, we unite all subsegments of $[k+1..k+5]$, performing two merge queries in total.) The number of merge queries for $t_i$ is equal to the number of chunks between the leftmost and rightmost chunks of the decomposition of $t_i$, minus one. The invariant of the segment system is maintained: the substring of the h-shifted block parsing that corresponds to the new united segment now has a block-aligned occurrence, which is precisely its occurrence in the substring $t_i$.
Suppose that we maintain in a similar way $b$ segment systems for all $h \in [0..b-1]$, performing analogous merge queries in every generating call. Then, the total number of merge queries over all $h \in [0..b-1]$ and all chunks during the work of all calls to the function $\mathrm{parse}$ does not exceed $bz'$ (at most $z'$ queries for each system). Therefore, it suffices to spend the budget of $3bz'$ points on all phrases corresponding to all chunks of type (a) as follows. We pay three points for each merge query: two for the emitted phrases corresponding to the two merged chunks and one for a phrase $d_{h,a}$ separating the chunks. □
As a corollary of Theorem 2, we obtain our main upper bound.
Theorem 3.
For any integer $b > 0$, the size $z_b$ of the greedy LZ77 parsing of a b-block contraction of a string of length $n$ is $O(bz \log\frac{n}{z})$, where $z$ is the size of the greedy (overlapping or non-overlapping) LZ77 parsing of this string.
Proof. 
We apply Lemma 3 to obtain an SLP grammar of size $O(z \log\frac{n}{z})$, which, by Lemma 2, yields a phrase-aligned LZ77 parsing of size $O(z \log\frac{n}{z})$; then, using Theorem 2, we obtain the bound $O(bz \log\frac{n}{z})$ on the size of the greedy LZ77 parsing of the b-block contraction. □
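As a quick sanity check of the bound, one can compare $z_b$ against $bz \log\frac{n}{z}$ empirically, reusing the greedy_lz77 and block_contraction sketches from the earlier sections (the test string is an arbitrary choice of ours):

```python
import math

s = ("abracadabra_" * 40)[:400]    # an arbitrary repetitive test string
z = len(greedy_lz77(s))
for b in (2, 4, 8):
    zb = len(greedy_lz77(block_contraction(s, b)))
    bound = b * z * max(1.0, math.log2(len(s) / z))
    print(b, z, zb, round(bound))  # zb should stay well below the bound here
```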

5. Conclusions

Given the results obtained in this paper, the main open problem is to determine whether the upper bound from Theorem 3 is tight. To this end, one has either to improve the upper bound $z_b = O(bz \log\frac{n}{z})$ or to provide a more elaborate series of examples improving the lower bound $z_b = \Omega(bz)$ from Section 3 (obviously, the examples must deal with non-phrase-aligned parsings). We point out, however, that the tightness of the bound from Theorem 3 would necessarily imply the tightness of the currently best upper bound $g = O(z \log\frac{n}{z})$ [4,18] from Lemma 3 that relates the size $g$ of the minimal grammar generating the string to the size $z$ of the LZ77 parsing of the string (the best lower bound to date is $g = \Omega(\frac{z \log n}{\log\log n})$ [8,10]). Indeed, for a constant $b > 1$, if there exists a string whose LZ77 parsing has size $z$ and whose b-block contraction admits only LZ77 parsings of size at least $\Omega(z \log\frac{n}{z})$, then the minimal grammar of such a string must have size at least $g = \Omega(z \log\frac{n}{z})$: by Lemma 2, the string has a phrase-aligned LZ77 parsing of size at most $g$ and, thus, by Theorem 2, its b-block contraction has an LZ77 parsing of size $O(bg)$, which is $O(g)$ as $b$ is constant.
The present work stems from the paper [11], where the following LZ77-like parsing was considered: we first parse the input string $s$ into phrases $p_1 p_2 \cdots p_k$ such that each phrase $p_i$ either is one letter or has an earlier occurrence in the prefix $s[1..\ell]$, for a fixed $\ell > 0$; then, we treat every phrase $p_i$ as a separate letter and construct the greedy LZ77 parsing of the obtained “contracted” string of length $k$. The resulting parsing naturally induces an LZ77 parsing for $s$. The motivation for this variant is the considerable reduction in space during the construction of the parsing, which allowed us to compress very large chunks of data using this two-pass scheme. The problem investigated in the present paper can be interpreted as the special case when all phrases $p_i$ have the same length $b$ (at least those phrases that are located outside of the prefix $s[1..\ell]$). The following question can be posed: can the techniques developed here be adapted to this more general case in order to find an upper bound for the LZ77 parsing from [11] in terms of the optimal greedy LZ77 parsing for $s$?
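For concreteness, here is a much-simplified sketch of such a two-pass scheme (our own simplification, not the actual ReLZ algorithm of [11]: the first pass allows any occurrence inside the fixed prefix, ignoring the “earlier occurrence” subtlety, and greedy_lz77 from the sketch in Section 2 is assumed to be in scope):

```python
def two_pass_parse(s, ell):
    # Pass 1: split s into phrases that are single letters or substrings
    # of the fixed prefix s[:ell] (a crude stand-in for the RLZ phase).
    prefix, phrases, i = s[:ell], [], 0
    while i < len(s):
        length = 1
        while i + length < len(s) and s[i:i + length + 1] in prefix:
            length += 1
        phrases.append(s[i:i + length])
        i += length
    # Pass 2: rename the phrases to fresh letters and run greedy LZ77 on top.
    ids = {}
    contracted = "".join(
        chr(0xE000 + ids.setdefault(ph, len(ids))) for ph in phrases
    )
    return greedy_lz77(contracted)

print(len(two_pass_parse("abracadabra_" * 30, ell=24)))
```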

Author Contributions

All authors have contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Russian Federation (Ural Mathematical Center project No. 075-02-2021-1387).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank the anonymous referees for their comments that helped to improve the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ziv, J.; Lempel, A. On the complexity of finite sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
  2. Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343.
  3. Kieffer, J.C.; Yang, E.H. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory 2000, 46, 737–754.
  4. Rytter, W. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 2003, 302, 211–222.
  5. Kempa, D.; Prezza, N. At the roots of dictionary compression: String attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), Los Angeles, CA, USA, 25–29 June 2018; pp. 827–840.
  6. Kociumaka, T.; Navarro, G.; Prezza, N. Towards a definitive measure of repetitiveness. In Proceedings of the 14th Latin American Symposium on Theoretical Informatics (LATIN), São Paulo, Brazil, 5–8 January 2021; LNCS; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12118, pp. 207–219.
  7. Navarro, G.; Urbina, C. On Stricter Reachable Repetitiveness Measures. arXiv 2021, arXiv:2105.13595.
  8. Bannai, H.; Hirayama, M.; Hucke, D.; Inenaga, S.; Jeż, A.; Lohrey, M.; Reh, C. The smallest grammar problem revisited. IEEE Trans. Inf. Theory 2020, 67, 317–328.
  9. Bille, P.; Gagie, T.; Gørtz, I.L.; Prezza, N. A separation between RLSLPs and LZ77. J. Discret. Algorithms 2018, 50, 36–39.
  10. Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The smallest grammar problem. IEEE Trans. Inf. Theory 2005, 51, 2554–2576.
  11. Kosolobov, D.; Valenzuela, D.; Navarro, G.; Puglisi, S.J. Lempel-Ziv-like parsing in small space. Algorithmica 2020, 82, 3195–3215.
  12. Kuruppu, S.; Puglisi, S.J.; Zobel, J. Relative Lempel–Ziv compression of genomes for large-scale storage and retrieval. In Proceedings of the SPIRE 2010, Los Cabos, Mexico, 11–13 October 2010; LNCS; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6393, pp. 201–206.
  13. Storer, J.A.; Szymanski, T.G. Data compression via textual substitution. J. ACM 1982, 29, 928–951.
  14. Kosolobov, D.; Shur, A.M. Comparison of LZ77-type parsings. Inf. Process. Lett. 2019, 141, 25–29.
  15. Okhotin, A. Underlying principles and recurring ideas of formal grammars. In Proceedings of the 12th International Conference on Language and Automata Theory and Applications (LATA), Ramat Gan, Israel, 9–11 April 2018; LNCS; Springer: Berlin/Heidelberg, Germany, 2018; Volume 10792, pp. 36–59.
  16. Hucke, D.; Lohrey, M.; Reh, C.P. The smallest grammar problem revisited. In Proceedings of the 23rd International Symposium on String Processing and Information Retrieval (SPIRE), Beppu, Japan, 18–20 October 2016; LNCS; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9954, pp. 35–49.
  17. Pippenger, N. On the evaluation of powers and monomials. SIAM J. Comput. 1980, 9, 230–250.
  18. Gawrychowski, P. Pattern matching in Lempel-Ziv compressed strings: Fast, simple, and deterministic. In Proceedings of the 19th Annual European Symposium on Algorithms (ESA), Saarbrücken, Germany, 5–9 September 2011; LNCS; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6942, pp. 421–432.
  19. Jeż, A. A really simple approximation of smallest grammar. Theor. Comput. Sci. 2016, 616, 141–150.
Figure 1. The h-shifted block parsings for $h = 0, 1, 2, 3, 4$. Gray and blue rectangles denote, respectively, phrases $d_{h,i}$ and $t_{h,i}$. Here, we have $b = 5$. Note that some phrases $t_{h,i}$ are empty and, thus, not depicted.
Figure 2. A phrase-aligned copy $f_j f_{j+1} \cdots f_{j'}$ of $f_\ell$ and the corresponding copy $t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$ of $t_i$ in the h-shifted block parsing. Gray and blue rectangles denote, respectively, phrases $d_{h,i}$ and $t_{h,i}$. Here, we have $j' = j + 2$ and $k' = k + 2$.
Figure 3. The decomposition from a generating call for $t_i = t_{h,k} d_{h,k} t_{h,k+1} d_{h,k+1} \cdots d_{h,k'-1} t_{h,k'}$. Here, $t_{h,k+1}$ is in light gray, meaning that we invoke a generating call for it; for the other parts, $t_i$ is split into four chunks $t_{h,k}$; $t_{h,k+2} d_{h,k+2} t_{h,k+3} d_{h,k+3} t_{h,k+4}$; $t_{h,k+5}$; and $t_{h,k+6} d_{h,k+6} t_{h,k+7}$, each of which has a block-aligned earlier occurrence.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

