Article

A Universal Random Coding Ensemble for Sample-Wise Lossy Compression

The Viterbi Faculty of Electrical and Computer Engineering, Technion—Israel Institute of Technology, Technion City, Haifa 3200003, Israel
Entropy 2023, 25(8), 1199; https://doi.org/10.3390/e25081199
Submission received: 16 July 2023 / Revised: 6 August 2023 / Accepted: 10 August 2023 / Published: 11 August 2023
(This article belongs to the Collection Feature Papers in Information Theory)

Abstract

We propose a universal ensemble for the random selection of rate–distortion codes, which is asymptotically optimal in a sample-wise sense. According to this ensemble, each reproduction vector, $\hat{\boldsymbol{x}}$, is selected independently at random under the probability distribution that is proportional to $2^{-LZ(\hat{\boldsymbol{x}})}$, where $LZ(\hat{\boldsymbol{x}})$ is the code length of $\hat{\boldsymbol{x}}$ pertaining to the 1978 version of the Lempel–Ziv (LZ) algorithm. We show that, with high probability, the resulting codebook gives rise to an asymptotically optimal variable-rate lossy compression scheme under an arbitrary distortion measure, in the sense that a matching converse theorem also holds. According to the converse theorem, even if the decoder knew the $\ell$-th order type of the source vector in advance ($\ell$ being a large but fixed positive integer), the performance of the above-mentioned code could not have been improved essentially for the vast majority of codewords pertaining to source vectors in the same type. Finally, we present a discussion of our results, which includes, among other things, a clear indication that our coding scheme outperforms the one that selects the reproduction vector with the shortest LZ code length among all vectors that are within the allowed distortion from the source vector.

1. Introduction

We revisit the well-known problem of lossy source coding for finite-alphabet sequences with respect to a certain distortion measure ([1]; [2], Chapter 10; [3], Chapter 9; [4]; [5], Chapters 7 and 8). More concretely, our focus is on d-semifaithful codes, namely, variable–rate codes that meet a certain distortion constraint for every source sequence (and not only in expectation). As is very well known [1], the rate–distortion function quantifies the minimum achievable expected coding rate for a given memoryless source and distortion measure.
During the past several decades, many research efforts were motivated by the fact that the source statistics are rarely (if ever) known in practice, and were, therefore, directed to the quest for universal coding schemes, namely, coding schemes that do not depend on the unknown statistics but, nevertheless, approach the lower bound (i.e., the entropy in lossless compression, or the rate–distortion function in the lossy case) asymptotically, as the block length grows without bound. We next provide a very brief (and non-comprehensive) review of some of the relevant earlier work.
In lossless compression, the theory of universal source coding is very well developed and mature. Davisson’s work [6] concerning universal coding redundancies established the concepts of weak universality and strong universality (vanishing maximin and minimax redundancies, respectively), and characterized the connection to the capacity of the ‘channel’ defined by the family of conditional distributions of the data to be compressed given the index (or parameter) of the source in the class [7,8,9]. For many of the frequently encountered parametric classes of sources, the minimum achievable redundancy of universal codes is well known to be dominated by $\frac{k\log n}{2n}$, where k is the number of degrees of freedom of the parameter and n is the block length [10,11,12,13]. A central idea that arises from Davisson’s theory is to construct a Shannon code pertaining to the probability distribution of the data vector with respect to a mixture (with a certain prior function) of all sources in the class. Rissanen, the inventor of the minimum description length (MDL) principle [14], proved in [15] a converse to a coding theorem, which asserts that, asymptotically, no universal code can achieve redundancy below $(1-\epsilon)\frac{k\log n}{2n}$, with the possible exception of sources from a subset of the parameter space whose volume tends to zero as $n\to\infty$, for every positive $\epsilon$. Merhav and Feder [16] generalized this result to more general classes of sources, with the term $\frac{k\log n}{2n}$ replaced by the capacity of the above-mentioned ‘channel’. Further developments, including more refined redundancy analyses, have been carried out in later studies.
In the wider realm of universal lossy compression, the theory is, unfortunately, not as sharp and well developed as in the lossless setting. We confine our attention, in this work, to d-semifaithful codes [17], namely, codes that satisfy the distortion requirement with probability one. Zhang, Yang, and Wei [18] proved that, unlike in lossless compression, in the lossy case, even if the source statistics are known perfectly, it is impossible to achieve redundancy below $\frac{\log n}{2n}$, but $\frac{\log n}{n}$ is achievable. Not knowing the source comes at the price of enlarging the multiplicative constant in front of $\frac{\log n}{n}$. Indeed, Yu and Speed [19] established weak universality with a constant that grows with the cardinalities of the alphabets of the source and the reconstruction [20]. Ornstein and Shields [17] considered universal d-semifaithful coding for stationary and ergodic sources with respect to the Hamming distortion measure, and established convergence with probability one to the rate–distortion function. Kontoyiannis [21] reported several interesting findings. The first is a certain central limit theorem (CLT), with an $O(1/\sqrt{n})$ redundancy term, whose coefficient is described as a limiting Gaussian random variable with some constant variance. The second is the so-called law of the iterated logarithm (LIL), with redundancy proportional to $\sqrt{\frac{\log\log n}{n}}$ infinitely often with probability one. One of the counter-intuitive conclusions from [21] is that universality is costless under those CLT and LIL criteria. In [22], many of the findings are based on the observation that optimal compression can be characterized in terms of the negative logarithm of the probability of a sphere of radius $nD$ around the source vector with respect to the distortion measure, where D is the allowed per-letter distortion. In the same article, the authors also proposed the ensemble of random coding with respect to a probability distribution given by a mixture of all distributions in a certain class. In two recent articles, Mahmood and Wagner [23,24] studied d-semifaithful codes that are strongly universal with respect to both the source and the distortion function. The redundancy rates in [23] behave like $\frac{\log n}{n}$ with different multiplicative constants. Other interesting results concerning a special distortion measure appear in [25].
A parallel line of research work on universal lossless and lossy compression, which was pioneered by Ziv, pertains to the individual sequence approach. According to this approach, there are no assumptions on the statistical properties of the source. The source sequence to be compressed is considered an arbitrary deterministic (individual) sequence, but limitations are imposed on the encoder and/or the decoder to be implementable by finite-state machines. This includes, first and foremost, the celebrated Lempel–Ziv (LZ) algorithm [26,27,28], as well as further developments that extend the scope to lossy compression with and without side information [29,30], as well as to joint source–channel coding [31,32]. In the lossless case, the article [33] provides an individual-sequence analogue of the above-mentioned result due to Rissanen, where the expression $\frac{k\log n}{2n}$ continues to designate the best achievable redundancy, but the main term of the compression ratio there is the empirical entropy of the source vector instead of the ordinary entropy of the probabilistic setting. The converse bound of [33] applies to the vast majority of source sequences within each type, and the vast majority of types (in analogy to the vast majority of the parameter space in Rissanen’s framework). In a way, this kind of converse result still contains some flavor of the probabilistic setting, because arguing that the number of exceptional typical sequences is relatively small is actually equivalent to imposing a uniform distribution across the type and asserting that the induced probability of violating the bound is small. The achievability result of [33], on the other hand, holds pointwise for every sequence. A similar comment applies to [34], where asymptotically pointwise lossy compression was established with respect to first-order statistics (i.e., “memoryless” statistics) with an emphasis on distortion universality, similarly to [23,24].
A similar mix of the individual-sequence setting and the probabilistic setting applies to this paper as well, in the context of universal rate–distortion coding, but here, just like in [34], there is no limitation to finite-state encoders/decoders as in [33]. In particular, our converse theorem asserts that, given an arbitrary variable-rate code and an arbitrary distortion function within a certain wide class, the vast majority of reproduction vectors that represent source sequences of a given type (of any fixed order) must have a code length that is essentially at least as large as the negative logarithm of the probability of a ball with normalized radius D (D being the allowed per-letter distortion) centered at the given source sequence. The probability of this ball is taken with respect to a universal distribution that is proportional to $2^{-LZ(\hat{\boldsymbol{x}})}$, where $LZ(\hat{\boldsymbol{x}})$ is the LZ code length of the reproduction vector, $\hat{\boldsymbol{x}}$.
The emphasis on the word “majority” in the previous paragraph (which was also mentioned earlier) calls for an explanation: it should be understood that in the absence of limitations on the encoding memory resources (such as the finite-state machine model described above), there cannot exist any non-trivial lower bound that applies to each and every given individual sequence. The reason is simple: given a particular individual source sequence, one can always tailor an encoder that compresses that sequence to a single bit (even losslessly), for example, by assigning the bit ‘0’ to be the compressed representation of the given sequence and by appending the bit ‘1’ as a header to the uncompressed binary representation of any other source sequence. For the given individual sequence, the compression ratio would then be 1 / n , which tends to zero as n grows without bound (of course, this encoder would be very poor for many other source sequences). It is, therefore, unavoidable that any non-trivial lower bound must apply collectively over some set of source sequences, allowing at least some minority of exceptional sequences that may violate the bound.
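To make this argument concrete, here is a minimal sketch of such a tailored encoder, assuming a binary source alphabet; the function names and the particular favored sequence are ours, purely for illustration:

```python
# Toy illustration of the argument above (binary source alphabet assumed; names are ours).
def tailored_encode(x, x_star):
    """Compress the favored sequence x_star to the single bit '0'; any other sequence is
    sent uncompressed with a '1' flag prepended, so its length is n + 1 bits."""
    return "0" if x == x_star else "1" + x

def tailored_decode(bits, x_star):
    return x_star if bits == "0" else bits[1:]

x_star = "0110100110010110"
assert tailored_decode(tailored_encode(x_star, x_star), x_star) == x_star
assert tailored_decode(tailored_encode("1111000011110000", x_star), x_star) == "1111000011110000"
```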
We also present a matching achievability result, asserting that for every source sequence, this code length is essentially achievable by random coding, using a universal ensemble of codes, which is defined by independent random selection, where each codeword is drawn under the above-described universal probability distribution.
While the achievability result in [34] was pointwise as well, it was tailored to a memoryless structure, in the sense that it was given in terms of the rate–distortion function of the first-order empirical distribution, which is blind to any empirical dependencies and repetitive patterns within the source sequence. In this paper, we both extend the scope to general individual sequences beyond the memoryless statistics and extend the allowable class of distortion measures. In terms of the technical aspects, the proof of the achievability result is similar to a parallel proof in [34], but the novelty lies considerably more in the converse theorem and its proof, which is very different from the one in [34] because here there is no information-theoretic single-letter expression for the best achievable asymptotic performance.
Our main novel messages in this work can be summarized as follows.
1.
We essentially derive an asymptotic formula for the rate–distortion function of an individual sequence. It is given by the normalized negative logarithm of the universal probability of a sphere of radius n D (in the distortion metric sense) around the given source sequence.
2.
We establish the universality of the LZ probability distribution, not only in the context of lossless compression, but also for lossy compression, in the role of a universal distribution for independent random selection of a rate–distortion codebook.
3.
The expression of this rate–distortion function is intimately related to a converse result due to Kontoyiannis and Zhang [22] (in terms of the probability of a sphere), but it is more explicit in a certain sense, that will be discussed in Section 5.
4.
We provide some intuitive insight regarding this rate–distortion function by demonstrating the analogy to the ordinary rate–distortion function (see Section 5).
5.
We show that our achievability scheme outperforms the (seemingly) natural extension of LZ compression to the lossy case, where one seeks, within a sphere of radius n D , the reproduction vector whose LZ code length is minimal (see Section 5).
The outline of this paper is as follows. In Section 2, we establish the notation conventions, define a few terms and quantities, and provide some background. In Section 3, we present the converse theorem and its proof. In Section 4, we present the achievability theorem and prove it. Finally, in Section 5, we summarize the paper and discuss our results.

2. Notation, Definitions, and Background

Throughout the paper, random variables will be denoted by capital letters, specific values they may take will be denoted by the corresponding lower case letters, and their alphabets will be denoted by calligraphic letters. Random vectors and their realizations will be denoted, respectively, by capital letters and the corresponding lower case letters, both in a bold face font. Their alphabets will be superscripted by their dimensions. The source vector of length n, $(x_1,\ldots,x_n)$, with components, $x_i$, $i=1,\ldots,n$, from a finite alphabet, $\mathcal{X}$, will be denoted by $\boldsymbol{x}$. The set of all such n-vectors will be denoted by $\mathcal{X}^n$, which is the nth-order Cartesian power of $\mathcal{X}$. Likewise, a reproduction vector of length n, $(\hat{x}_1,\ldots,\hat{x}_n)$, with components, $\hat{x}_i$, $i=1,\ldots,n$, from a finite alphabet, $\hat{\mathcal{X}}$, will be denoted by $\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n$. We denote the cardinalities of $\mathcal{X}$ and $\hat{\mathcal{X}}$ by J and K, respectively.
For $i\le j$, the notation $x_i^j$ will be used to denote the substring $(x_i,x_{i+1},\ldots,x_j)$. Probability distributions will be denoted by the letter P or Q with possible subscripts, depending on the context. The probability of an event $\mathcal{E}$ will be denoted by $\Pr\{\mathcal{E}\}$, and the expectation operator with respect to a probability distribution P will be denoted by $\boldsymbol{E}\{\cdot\}$. For two positive sequences, $a_n$ and $b_n$, the notation $a_n\doteq b_n$ will stand for equality in the exponential scale, that is, $\lim_{n\to\infty}\frac{1}{n}\log\frac{a_n}{b_n}=0$. Similarly, $a_n\stackrel{\cdot}{\le}b_n$ means that $\limsup_{n\to\infty}\frac{1}{n}\log\frac{a_n}{b_n}\le0$, and so on. The indicator function of an event $\mathcal{E}$ will be denoted by $\mathcal{I}\{\mathcal{E}\}$. The notation $[x]_+$ will stand for $\max\{0,x\}$. The logarithmic function, $\log x$, will be understood to be defined to the base 2. Logarithms to the base e will be denoted by $\ln$.
Let $\ell$ be a positive integer that divides n. The $\ell$th-order empirical distribution of $\boldsymbol{x}\in\mathcal{X}^n$, which will be denoted by $\hat{P}_{\boldsymbol{x}}^\ell$, is the vector of relative frequencies $\{\hat{P}_{\boldsymbol{x}}^\ell(\boldsymbol{a}),\ \boldsymbol{a}\in\mathcal{X}^\ell\}$, where
$$\hat{P}_{\boldsymbol{x}}^\ell(\boldsymbol{a})=\frac{\ell}{n}\sum_{i=0}^{n/\ell-1}\mathcal{I}\left\{x_{i\ell+1}^{(i+1)\ell}=\boldsymbol{a}\right\}. \tag{1}$$
The set of all $\ell$th-order empirical distributions of sequences in $\mathcal{X}^n$ will be denoted by $\mathcal{P}_n^\ell$. For $P\in\mathcal{P}_n^\ell$, the type class $\{\boldsymbol{x}\in\mathcal{X}^n:\ \hat{P}_{\boldsymbol{x}}^\ell=P\}$ will be denoted by $T_n^\ell(P)$. Likewise, $T_n^\ell(Q)$ will denote $\{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n:\ \hat{P}_{\hat{\boldsymbol{x}}}^\ell=Q\}$, where $\hat{P}_{\hat{\boldsymbol{x}}}^\ell$ is the $\ell$th-order empirical distribution of $\hat{\boldsymbol{x}}$. Finally, $\hat{P}_{\boldsymbol{x}\hat{\boldsymbol{x}}}^\ell$ will denote the $\ell$th-order joint empirical distribution of $(\boldsymbol{x},\hat{\boldsymbol{x}})$, i.e.,
$$\hat{P}_{\boldsymbol{x}\hat{\boldsymbol{x}}}^\ell(\boldsymbol{a},\boldsymbol{b})=\frac{\ell}{n}\sum_{i=0}^{n/\ell-1}\mathcal{I}\left\{x_{i\ell+1}^{(i+1)\ell}=\boldsymbol{a},\ \hat{x}_{i\ell+1}^{(i+1)\ell}=\boldsymbol{b}\right\},\quad(\boldsymbol{a},\boldsymbol{b})\in\mathcal{X}^\ell\times\hat{\mathcal{X}}^\ell. \tag{2}$$
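As a quick numerical illustration of definition (1) (the joint version (2) is entirely analogous), here is a small sketch; the helper name and the example sequence are ours:

```python
from collections import Counter

def block_empirical_dist(x, ell):
    """ell-th order empirical distribution of x over non-overlapping blocks of length ell,
    as in Equation (1); requires that ell divide n = len(x)."""
    n = len(x)
    assert n % ell == 0, "ell must divide n"
    blocks = [tuple(x[i * ell:(i + 1) * ell]) for i in range(n // ell)]
    return {a: count * ell / n for a, count in Counter(blocks).items()}

# Example: the 2nd-order empirical distribution of 'abbaabab'.
print(block_empirical_dist("abbaabab", ell=2))   # {('a','b'): 0.75, ('b','a'): 0.25}
```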
For a given positive integer n, a distortion function, d, is a function from X n × X ^ n into R + . In the two main parts of this paper, different assumptions will be imposed on the distortion function.
1.
For the achievability theorem, the distortion function can be completely arbitrary.
2.
For the converse theorem, we assume that $d(\boldsymbol{x},\hat{\boldsymbol{x}})$ depends on $\boldsymbol{x}$ and $\hat{\boldsymbol{x}}$ only via their first-order joint empirical distribution, $\hat{P}_{\boldsymbol{x}\hat{\boldsymbol{x}}}^1$, and that, for a given such distribution, it grows linearly in n, that is, $d(\boldsymbol{x},\hat{\boldsymbol{x}})=n\rho(\hat{P}_{\boldsymbol{x}\hat{\boldsymbol{x}}}^1)$, where the function $\rho$ is independent of n.
Regarding item 2, additive distortion measures, which obviously comply with the requirement, are given by linear functionals of $\hat{P}_{\boldsymbol{x}\hat{\boldsymbol{x}}}^1$. However, arbitrary non-linear functionals are allowed here as well.
A rate–distortion block code of length n is a mapping, $\phi_n:\mathcal{X}^n\to\mathcal{B}_n$, $\mathcal{B}_n\subseteq\{0,1\}^*$, that maps the space of source vectors of length n, $\mathcal{X}^n$, into a set, $\mathcal{B}_n$, of variable-length compressed bit strings. The decoder is a mapping $\psi_n:\mathcal{B}_n\to\mathcal{C}_n\subseteq\hat{\mathcal{X}}^n$ that maps the set of compressed variable-length binary strings into a reproduction codebook, $\mathcal{C}_n$. A block code is called d-semifaithful if $d(\boldsymbol{x},\psi_n(\phi_n(\boldsymbol{x})))\le nD$ for every $\boldsymbol{x}\in\mathcal{X}^n$. The code length for $\boldsymbol{x}$, denoted $L(\boldsymbol{x})$, is the number of bits of $\phi_n(\boldsymbol{x})$. Since $L(\boldsymbol{x})$ depends on $\boldsymbol{x}$ only via $\phi_n(\boldsymbol{x})$, we will sometimes also denote it by $L(\phi_n(\boldsymbol{x}))$ or by $L(\hat{\boldsymbol{x}})$ ($\hat{\boldsymbol{x}}$ being the reproduction vector pertaining to $\phi_n(\boldsymbol{x})$), with a slight abuse of notation. For the converse theorem, we assume that the correspondence between $\mathcal{B}_n$ and $\mathcal{C}_n$ is one-to-one. For the achievability theorem, we consider prefix-free codes. Accordingly, the encoder can equivalently be presented as a cascade of a reproduction encoder (also known as a vector quantizer), which maps $\mathcal{X}^n$ into $\mathcal{C}_n$, followed by an entropy coder, which maps $\mathcal{C}_n$ into $\mathcal{B}_n$ with no additional loss of information.
For the purpose of presenting both the converse theorem and the achievability theorem, we need to recall a few terms and facts concerning the 1978 version of the LZ algorithm (also known as the LZ78 algorithm) [27]. The incremental parsing procedure of the LZ78 algorithm is a procedure of sequentially parsing a vector $\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n$ such that each new phrase is the shortest string that has not been encountered before as a parsed phrase, with the possible exception of the last phrase, which might be incomplete. For example, the incremental parsing of the vector $\hat{\boldsymbol{x}}=(abbabaabbaaabaa)$ is $a,b,ba,baa,bb,aa,ab,aa$. Let $c(\hat{\boldsymbol{x}})$ denote the number of phrases in $\hat{\boldsymbol{x}}$ resulting from the incremental parsing procedure. Let $LZ(\hat{\boldsymbol{x}})$ denote the length of the LZ78 binary compressed code for $\hat{\boldsymbol{x}}$. According to ([27], Theorem 2),
$$\begin{aligned}
LZ(\hat{\boldsymbol{x}})&\le[c(\hat{\boldsymbol{x}})+1]\log\{2K[c(\hat{\boldsymbol{x}})+1]\}\\
&=c(\hat{\boldsymbol{x}})\log[c(\hat{\boldsymbol{x}})+1]+c(\hat{\boldsymbol{x}})\log(2K)+\log\{2K[c(\hat{\boldsymbol{x}})+1]\}\\
&=c(\hat{\boldsymbol{x}})\log c(\hat{\boldsymbol{x}})+c(\hat{\boldsymbol{x}})\log\left(1+\frac{1}{c(\hat{\boldsymbol{x}})}\right)+c(\hat{\boldsymbol{x}})\log(2K)+\log\{2K[c(\hat{\boldsymbol{x}})+1]\}\\
&\le c(\hat{\boldsymbol{x}})\log c(\hat{\boldsymbol{x}})+\log e+\frac{n(\log K)\log(2K)}{(1-\epsilon_n)\log n}+\log[2K(n+1)]\\
&\triangleq c(\hat{\boldsymbol{x}})\log c(\hat{\boldsymbol{x}})+n\cdot\epsilon(n),
\end{aligned} \tag{3}$$
where we remember that K is the cardinality of $\hat{\mathcal{X}}$, and where $\epsilon(n)$ clearly tends to zero as $n\to\infty$, at the rate of $1/\log n$. We next define a universal probability distribution (see also [35,36]):
$$U(\hat{\boldsymbol{x}})=\frac{2^{-LZ(\hat{\boldsymbol{x}})}}{\sum_{\hat{\boldsymbol{x}}'\in\hat{\mathcal{X}}^n}2^{-LZ(\hat{\boldsymbol{x}}')}},\quad\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n. \tag{4}$$
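To make these objects concrete, the following small sketch implements the incremental parsing and a plausible LZ78 bit count; the exact per-phrase accounting (a pointer to a previous phrase plus one fresh symbol) is our own assumption, and the analysis in this paper relies only on the upper bound (3):

```python
import math

def lz78_phrases(xhat):
    """LZ78 incremental parsing: each new phrase is the shortest string not parsed before;
    the last phrase may be incomplete (a repeat), as in the example in the text."""
    seen, phrases, cur = set(), [], ""
    for symbol in xhat:
        cur += symbol
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:                         # possibly incomplete last phrase
        phrases.append(cur)
    return phrases

def lz78_code_length(xhat, K):
    """A plausible LZ78 bit count (our assumption): the j-th phrase is described by a pointer
    to one of the j-1 previous phrases plus one new symbol from the K-ary alphabet."""
    c = len(lz78_phrases(xhat))
    pointer_bits = sum(math.ceil(math.log2(j)) for j in range(2, c + 1))
    return pointer_bits + c * math.ceil(math.log2(K))

# The example from the text: parsing 'abbabaabbaaabaa' into a,b,ba,baa,bb,aa,ab,aa.
print(lz78_phrases("abbabaabbaaabaa"))
print(lz78_code_length("abbabaabbaaabaa", K=2))
```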
Finally, we define the D-sphere around $\boldsymbol{x}$ as
$$\mathcal{S}(\boldsymbol{x},D)=\{\hat{\boldsymbol{x}}:\ d(\boldsymbol{x},\hat{\boldsymbol{x}})\le nD\}, \tag{5}$$
and
$$U[\mathcal{S}(\boldsymbol{x},D)]=\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}U(\hat{\boldsymbol{x}}). \tag{6}$$
For later use, we also define
$$\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)=\{\boldsymbol{x}:\ d(\boldsymbol{x},\hat{\boldsymbol{x}})\le nD\}. \tag{7}$$
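For very small n, the universal distribution (4) and the sphere probability (6) can be evaluated by brute force. The sketch below reuses lz78_code_length from the previous snippet and uses Hamming distortion purely as an illustrative stand-in (the theory allows an arbitrary distortion function):

```python
from itertools import product

def universal_prob(n, alphabet):
    """Brute-force version of the universal distribution U of (4), over all K^n reproduction
    vectors of length n (feasible only for tiny n; illustration only)."""
    K = len(alphabet)
    weights = {xh: 2.0 ** (-lz78_code_length(xh, K))
               for xh in ("".join(t) for t in product(alphabet, repeat=n))}
    total = sum(weights.values())
    return {xh: w / total for xh, w in weights.items()}

def sphere_prob(x, D, U, dist):
    """U[S(x,D)] of (6): total universal mass of reproductions within distortion n*D of x."""
    n = len(x)
    return sum(p for xh, p in U.items() if dist(x, xh) <= n * D)

hamming = lambda a, b: sum(u != v for u, v in zip(a, b))   # illustrative distortion measure
U = universal_prob(6, "ab")
print(sphere_prob("ababab", 1 / 3, U, hamming))
```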
Our purpose is to derive upper and lower bounds on the smallest achievable code length, $L(\boldsymbol{x})$, for d-semifaithful block codes of length n and individual sequences, $\{\boldsymbol{x}\}$, from a given $\ell$th-order type class, $T_n^\ell(P)$. As will be seen shortly, in both the converse and the achievability theorems, the main term of the bound on the length function will be $-\log(U[\mathcal{S}(\boldsymbol{x},D)])$.

3. The Converse Theorem

The following converse theorem asserts that even if the type class of the source vector were known to the decoder ahead of time, the code length could not be much smaller than $-\log(U[\mathcal{S}(\boldsymbol{x},D)])$ for the vast majority of the codewords pertaining to that type.
Theorem 1.
Let $\ell$ be a positive integer that divides n and let $\hat{P}$ be an arbitrary empirical distribution pertaining to a certain type class, $T_n^\ell(\hat{P})$, of source sequences in $\mathcal{X}^n$. Let d be a distortion function that depends on $(\boldsymbol{x},\hat{\boldsymbol{x}})$ only via $\hat{P}_{\boldsymbol{x}\hat{\boldsymbol{x}}}^1$. Then, for every d-semifaithful variable-length block code, with one-to-one correspondence between $\mathcal{B}_n$ and $\mathcal{C}_n$, and for every $\epsilon>0$, the following lower bound applies to a fraction of at least $(1-2n^{-\epsilon})$ of the codewords, $\{\phi_n(\boldsymbol{x}),\ \boldsymbol{x}\in T_n^\ell(\hat{P})\}$:
$$L(\phi_n(\boldsymbol{x}))\ge-\log(U[\mathcal{S}(\boldsymbol{x},D)])-n\Delta_n(\ell)-\epsilon\log n, \tag{8}$$
where $\Delta_n(\ell)$ has the property $\lim_{n\to\infty}\Delta_n(\ell)=1/\ell$.
As a technical note, observe that $\Delta_n(\ell)$ can be made small when $\ell$ is chosen to be large, as $\Delta_n(\ell)$ behaves like $1/\ell$ for fixed $\ell$ and large n. This suggests that the theorem is meaningful mainly when $\ell$ is appreciably large, which is not surprising, because the larger $\ell$ is, the better one can exploit empirical dependencies within the source sequence. Enlarging $\ell$ also increases the asymptotic lower bound and, thus, makes it tighter. An alternative approach would be to let $\ell=\ell_n$ grow with n, but sufficiently slowly that $\Delta_n(\ell_n)\to0$ as $n\to\infty$.
The remaining part of this section is devoted to the proof of Theorem 1. The proof is based on two main steps whose aim is to bypass the need for the traditional information-theoretic expression of the rate–distortion function. The first step is based on a simple counting argument, displayed in Equations (9)–(14) below, which leads to the identity (15) and, more specifically, to (16). Equation (16) is pivotal to the proof because it establishes exact equality between the sphere-covering ratio (on the left-hand side) and a quantity (on the right-hand side) that can be interpreted as the reciprocal of the probability of falling into an $nD$-sphere around $\boldsymbol{x}$ under random coding with respect to the uniform distribution over the type class, $T_n^\ell(Q)$. The sphere-covering ratio on the left-hand side belongs to the converse, and the expression on the right-hand side is strongly related to achievability (see Theorem 2 below), as it directly supports random coding. Therefore, these identities create the basic link between the converse and the direct theorems. The second step is to replace the uniform distribution over $T_n^\ell(Q)$ by the universal LZ distribution, $U(\hat{\boldsymbol{x}})\propto2^{-LZ(\hat{\boldsymbol{x}})}$, and thereby relax the dependence upon P (or Q), as well as on the parameter $\ell$, without sacrificing the asymptotic performance. This proof is entirely different from the one in [34].
Proof. 
We first establish a relationship that will be used later on. For two given type classes, $T_n^\ell(P)\subseteq\mathcal{X}^n$ and $T_n^\ell(Q)\subseteq\hat{\mathcal{X}}^n$, consider the quantity
$$N(D)=\sum_{\boldsymbol{x},\hat{\boldsymbol{x}}}\mathcal{I}\{\boldsymbol{x}\in T_n^\ell(P),\ \hat{\boldsymbol{x}}\in T_n^\ell(Q),\ d(\boldsymbol{x},\hat{\boldsymbol{x}})\le nD\}. \tag{9}$$
We can evaluate $N(D)$ in two ways. The first is as follows:
$$N(D)=\sum_{\boldsymbol{x}\in T_n^\ell(P)}\left|T_n^\ell(Q)\cap\mathcal{S}(\boldsymbol{x},D)\right| \tag{10}$$
$$=\left|T_n^\ell(P)\right|\cdot\left|T_n^\ell(Q)\cap\mathcal{S}(\boldsymbol{x},D)\right|, \tag{11}$$
where the second equality holds since $|T_n^\ell(Q)\cap\mathcal{S}(\boldsymbol{x},D)|$ is the same for all $\boldsymbol{x}\in T_n^\ell(P)$, due to the permutation-invariance assumption on the distortion function. By the same token, we can also express $N(D)$ in the following manner:
$$N(D)=\sum_{\hat{\boldsymbol{x}}\in T_n^\ell(Q)}\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right| \tag{12}$$
$$=\left|T_n^\ell(Q)\right|\cdot\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right|, \tag{13}$$
which follows from the same consideration by symmetry. It follows then that
$$\left|T_n^\ell(P)\right|\cdot\left|T_n^\ell(Q)\cap\mathcal{S}(\boldsymbol{x},D)\right|=\left|T_n^\ell(Q)\right|\cdot\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right|, \tag{14}$$
or, equivalently,
$$\frac{\left|T_n^\ell(P)\right|}{\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right|}=\frac{\left|T_n^\ell(Q)\right|}{\left|T_n^\ell(Q)\cap\mathcal{S}(\boldsymbol{x},D)\right|}. \tag{15}$$
Now, let $Q^*$ be the type of $\hat{\boldsymbol{x}}$ that maximizes $|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)|$. Then, the last equation implies that
$$\frac{\left|T_n^\ell(P)\right|}{\max_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right|}=\frac{\left|T_n^\ell(Q^*)\right|}{\left|T_n^\ell(Q^*)\cap\mathcal{S}(\boldsymbol{x},D)\right|},\quad\forall\ \boldsymbol{x}\in T_n^\ell(P). \tag{16}$$
This relationship will be used shortly.
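The identities (14)–(16) can be verified numerically on toy examples. A rough sketch with $\ell=1$ and Hamming distortion (both purely illustrative choices; the helper names are ours):

```python
from itertools import product
from collections import Counter

def first_order_type_class(n, alphabet, P):
    """All sequences of length n whose first-order empirical distribution equals P."""
    return [s for s in product(alphabet, repeat=n)
            if {a: c / n for a, c in Counter(s).items()} == P]

def check_counting_identity(n, D, P, Q, alphabet):
    """Check |T(P)|*|T(Q) ∩ S(x,D)| == |T(Q)|*|T(P) ∩ Shat(xhat,D)|, i.e., Equation (14)."""
    hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
    TP = first_order_type_class(n, alphabet, P)
    TQ = first_order_type_class(n, alphabet, Q)
    x, xhat = TP[0], TQ[0]                 # any representatives, by permutation invariance
    lhs = len(TP) * sum(hamming(x, y) <= n * D for y in TQ)
    rhs = len(TQ) * sum(hamming(z, xhat) <= n * D for z in TP)
    return lhs, rhs

print(check_counting_identity(4, 0.25, {"a": 0.5, "b": 0.5}, {"a": 0.75, "b": 0.25}, "ab"))
```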
Let $P\in\mathcal{P}_n^\ell$ be given. Any d-semifaithful code must fully cover the type class $T_n^\ell(P)$ with spheres of radius $nD$ (henceforth referred to as D-spheres) centered at the various codewords. Let $\hat{\boldsymbol{x}}_1,\ldots,\hat{\boldsymbol{x}}_M\in\hat{\mathcal{X}}^n$ be M codewords. The number of members of $T_n^\ell(P)$ that are covered by $\hat{\boldsymbol{x}}_1,\ldots,\hat{\boldsymbol{x}}_M$ is upper-bounded as follows.
$$G=\left|\bigcup_{i=1}^{M}T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}}_i,D)\right|\le\sum_{i=1}^{M}\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}}_i,D)\right|\le M\cdot\max_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right|, \tag{17}$$
and so, the necessary condition for complete covering, which is $G\ge|T_n^\ell(P)|$, amounts to
$$M\ge\frac{\left|T_n^\ell(P)\right|}{\max_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}\left|T_n^\ell(P)\cap\hat{\mathcal{S}}(\hat{\boldsymbol{x}},D)\right|}=\frac{\left|T_n^\ell(Q^*)\right|}{\left|T_n^\ell(Q^*)\cap\mathcal{S}(\boldsymbol{x},D)\right|}\triangleq M_0, \tag{18}$$
where the equality is due to (16). Consider now a variable-length code with a codebook of size M. Let $L(\hat{\boldsymbol{x}})$ denote the length (in bits) of the compressed binary string that represents $\hat{\boldsymbol{x}}$. The number of codewords with $L(\hat{\boldsymbol{x}})\le\log M-\epsilon\log n$ is upper-bounded as follows:
$$\left|\{\hat{\boldsymbol{x}}\in\mathcal{C}_n:\ L(\hat{\boldsymbol{x}})\le\log M-\epsilon\log n\}\right|=\sum_{k=1}^{\log M-\epsilon\log n}\left|\{\hat{\boldsymbol{x}}:\ L(\hat{\boldsymbol{x}})=k\}\right|\le\sum_{k=1}^{\log M-\epsilon\log n}2^k\le2^{\log M-\epsilon\log n+1}-1<2n^{-\epsilon}M, \tag{19}$$
where in the first inequality we have used the assumed one-to-one property of the mapping between the reproduction codewords and their variable-length compressed binary representations. It follows then that for at least $M(1-2n^{-\epsilon})$ out of the M codewords in $\mathcal{C}_n$ (that is, the vast majority of codewords), we have
$$L(\phi_n(\boldsymbol{x}))\ge\log M-\epsilon\log n\ge\log M_0-\epsilon\log n=-\log\frac{\left|T_n^\ell(Q^*)\cap\mathcal{S}(\boldsymbol{x},D)\right|}{\left|T_n^\ell(Q^*)\right|}-\epsilon\log n=-\log\left(\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}U_{Q^*}(\hat{\boldsymbol{x}})\right)-\epsilon\log n, \tag{20}$$
where $U_{Q^*}$ is the uniform probability distribution across the type class $T_n^\ell(Q^*)$, i.e.,
$$U_{Q^*}(\hat{\boldsymbol{x}})=\begin{cases}\dfrac{1}{|T_n^\ell(Q^*)|},&\hat{\boldsymbol{x}}\in T_n^\ell(Q^*)\\[4pt]0,&\text{elsewhere.}\end{cases} \tag{21}$$
We now argue that for every $\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n$,
$$U_{Q^*}(\hat{\boldsymbol{x}})\le\exp_2\{-LZ(\hat{\boldsymbol{x}})+n\Delta_n(\ell)\}. \tag{22}$$
For $\hat{\boldsymbol{x}}\notin T_n^\ell(Q^*)$, this is trivial as the left-hand side is equal to zero. For $\hat{\boldsymbol{x}}\in T_n^\ell(Q^*)$, we have the following consideration. In ([37], Equation (30)), it is shown (with somewhat different notation) that a certain quantity, called the (unnormalized) s-state complexity [27] of $\hat{\boldsymbol{x}}$, $n\rho_s(\hat{\boldsymbol{x}})$, for $s=S(\ell)=(K^{\ell+1}-1)/(K-1)$, is upper-bounded by
$$n\rho_s(\hat{\boldsymbol{x}})\le\frac{n\left[H_{Q^*}(\hat{X}^\ell)+1\right]}{\ell}, \tag{23}$$
where
$$H_{Q^*}(\hat{X}^\ell)=-\sum_{\boldsymbol{b}\in\hat{\mathcal{X}}^\ell}Q^*(\boldsymbol{b})\log Q^*(\boldsymbol{b}). \tag{24}$$
On the other hand, $n\rho_s(\hat{\boldsymbol{x}})$ is lower-bounded (see [37], Equation (32)) by
$$n\rho_s(\hat{\boldsymbol{x}})\ge c(\hat{\boldsymbol{x}})\log c(\hat{\boldsymbol{x}})-\frac{n\log(4S^2(\ell))\log K}{(1-\epsilon_n)\log n}-S^2(\ell)\log(4S^2(\ell)), \tag{25}$$
where $\epsilon_n\to0$, and so,
$$\frac{n\left[H_{Q^*}(\hat{X}^\ell)+1\right]}{\ell}\ge c(\hat{\boldsymbol{x}})\log c(\hat{\boldsymbol{x}})-\frac{n\log(4S^2(\ell))\log K}{(1-\epsilon_n)\log n}-S^2(\ell)\log(4S^2(\ell)). \tag{26}$$
Combining this with the inequality ([38], p. 17, Lemma 2.3),
$$\left|T_n^\ell(Q^*)\right|\ge\left(\frac{n}{\ell}+1\right)^{-K^\ell}\cdot2^{nH_{Q^*}(\hat{X}^\ell)/\ell}, \tag{27}$$
we have
$$\begin{aligned}\log\left|T_n^\ell(Q^*)\right|&\ge c(\hat{\boldsymbol{x}})\log c(\hat{\boldsymbol{x}})-n\delta_n(\ell)-K^\ell\log\left(\frac{n}{\ell}+1\right)\\&\ge LZ(\hat{\boldsymbol{x}})-n\epsilon(n)-n\delta_n(\ell)-K^\ell\log\left(\frac{n}{\ell}+1\right)\\&\triangleq LZ(\hat{\boldsymbol{x}})-n\Delta_n(\ell),\end{aligned} \tag{28}$$
where
$$\delta_n(\ell)=\frac{\log[4S^2(\ell)]\log K}{(1-\epsilon_n)\log n}+\frac{S^2(\ell)\log[4S^2(\ell)]}{n}+\frac{1}{\ell}, \tag{29}$$
and where the second inequality in (28) follows from (3). The last line of (28) is equivalent to (22). It follows then that for at least $M(1-2n^{-\epsilon})$ out of the M codewords in $\mathcal{C}_n$,
$$\begin{aligned}L(\phi_n(\boldsymbol{x}))&\ge-\log\left(\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}2^{-LZ(\hat{\boldsymbol{x}})}\right)-n\Delta_n(\ell)-\epsilon\log n\\&=-\log\left(\frac{\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}2^{-LZ(\hat{\boldsymbol{x}})}}{\sum_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}2^{-LZ(\hat{\boldsymbol{x}})}}\right)-\log\left(\sum_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}2^{-LZ(\hat{\boldsymbol{x}})}\right)-n\Delta_n(\ell)-\epsilon\log n\\&\ge-\log(U[\mathcal{S}(\boldsymbol{x},D)])-n\Delta_n(\ell)-\epsilon\log n,\end{aligned} \tag{30}$$
where in the last step we have applied Kraft’s inequality to the LZ code-length function. This completes the proof of Theorem 1.   □

4. The Achievability Theorem

The lower bound of Theorem 1 naturally suggests achievability using the universal distribution, U, for the random selection of the various codewords. The basic idea is quite standard and simple: the quantity $U[\mathcal{S}(\boldsymbol{x},D)]$ is the probability that a single randomly chosen reproduction vector, drawn under U, falls within distance $nD$ from the source vector, $\boldsymbol{x}$. If all reproduction vectors are drawn independently under U, then the typical number of such random selections needed before one sees the first one in $\mathcal{S}(\boldsymbol{x},D)$ is of the exponential order of $1/U[\mathcal{S}(\boldsymbol{x},D)]$. Given that the codebook is revealed to both the encoder and decoder once it has been selected, the encoder merely needs to transmit the index of the first such reproduction vector within the codebook, and the description length of that index can be made essentially as small as $\log\{1/U[\mathcal{S}(\boldsymbol{x},D)]\}=-\log(U[\mathcal{S}(\boldsymbol{x},D)])$. We use this simple idea to prove achievability for an arbitrary distortion measure. The proof is similar to the parallel proof in [34], and it is presented here mainly for completeness.
The achievability theorem is the following.
Theorem 2.
Let $d:\mathcal{X}^n\times\hat{\mathcal{X}}^n\to\mathbb{R}^+$ be an arbitrary distortion function. Then, for every $\epsilon>0$, there exists a sequence of d-semifaithful, variable-length block codes of block length n, such that for every $\boldsymbol{x}\in\mathcal{X}^n$, the code length for $\boldsymbol{x}$ is upper-bounded by
$$L(\boldsymbol{x})\le-\log(U[\mathcal{S}(\boldsymbol{x},D)])+(2+\epsilon)\log n+c+\delta_n, \tag{31}$$
where $c>0$ is a constant and $\delta_n=O\left(nJ^ne^{-n^{1+\epsilon}}\right)$.
Proof. 
The proof is based on the following simple, well-known fact: given a source vector $\boldsymbol{x}\in\mathcal{X}^n$ and a codebook, $\mathcal{C}_n$, let $I(\boldsymbol{x})$ denote the index, i, of the first vector, $\hat{\boldsymbol{x}}_i$, such that $d(\boldsymbol{x},\hat{\boldsymbol{x}}_i)\le nD$, namely, $\hat{\boldsymbol{x}}_i\in\mathcal{S}(\boldsymbol{x},D)$. If all reproduction vectors are drawn independently under U, then, for every positive integer N:
$$\Pr\{I(\boldsymbol{x})>N\}=(1-U[\mathcal{S}(\boldsymbol{x},D)])^N=\exp\{N\ln(1-U[\mathcal{S}(\boldsymbol{x},D)])\}\le\exp\{-N\cdot U[\mathcal{S}(\boldsymbol{x},D)]\}, \tag{32}$$
and so, if $N=N_n=e^{\lambda_n}/U[\mathcal{S}(\boldsymbol{x},D)]$, for some arbitrary positive sequence, $\{\lambda_n\}$, that tends to infinity, then
$$\Pr\{I(\boldsymbol{x})>N_n\}\le\exp\{-e^{\lambda_n}\}. \tag{33}$$
This fact will be used a few times in this section.   □
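A small simulation of this fact, reusing universal_prob and sphere_prob from the sketch in Section 2 and using Hamming distortion again as an illustrative stand-in:

```python
import random

def draw_codeword(U, rng=random):
    """Draw one reproduction vector according to the universal distribution U."""
    vectors, probs = zip(*U.items())
    return rng.choices(vectors, weights=probs, k=1)[0]

def first_hit_index(x, D, U, dist, max_draws=100_000, rng=random):
    """I(x): index of the first independently drawn codeword that lands in S(x,D);
    returns max_draws if no hit occurs (the failure convention used later in the proof)."""
    n = len(x)
    for i in range(1, max_draws + 1):
        if dist(x, draw_codeword(U, rng)) <= n * D:
            return i
    return max_draws

# I(x) is geometric with success probability U[S(x,D)], so its mean is about 1/U[S(x,D)]
# and Pr{I(x) > N} = (1 - U[S(x,D)])^N, exactly as in (32).
x, D = "ababab", 1 / 3
hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
U = universal_prob(len(x), "ab")
samples = [first_hit_index(x, D, U, hamming) for _ in range(300)]
print(sum(samples) / len(samples), 1 / sphere_prob(x, D, U, hamming))
```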
For later use, we also need the following uniform lower bound to $U[\mathcal{S}(\boldsymbol{x},D)]$: for a given $\boldsymbol{x}$, let $\hat{\boldsymbol{x}}_0\in\hat{\mathcal{X}}^n$ denote an arbitrary reproduction vector within $\mathcal{S}(\boldsymbol{x},D)$. Then,
$$U[\mathcal{S}(\boldsymbol{x},D)]\ge U(\hat{\boldsymbol{x}}_0) \tag{34}$$
$$=\frac{2^{-LZ(\hat{\boldsymbol{x}}_0)}}{\sum_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}2^{-LZ(\hat{\boldsymbol{x}})}} \tag{35}$$
$$\ge2^{-LZ(\hat{\boldsymbol{x}}_0)}. \tag{36}$$
Next, observe that $LZ(\hat{\boldsymbol{x}}_0)$ is maximized by the K-ary extension of the counting sequence ([27], p. 532), which is defined as follows: for $i=1,2,\ldots,m$ (m being a positive integer), let $u(i)$ denote the K-ary string of length $iK^i$ that lists, say in lexicographic order, all the $K^i$ words from $\hat{\mathcal{X}}^i$, and let $\hat{\boldsymbol{x}}_0=(u(1)u(2)\cdots u(m))$, whose length is
$$n=\sum_{i=1}^{m}iK^i=K\cdot\sum_{i=1}^{m}iK^{i-1}=K\cdot\frac{\partial}{\partial K}\sum_{i=1}^{m}K^i=K\cdot\frac{\partial}{\partial K}\left[\frac{K^{m+1}-K}{K-1}\right]=\frac{K}{(K-1)^2}\left[mK^{m+1}-(m+1)K^m+1\right]. \tag{37}$$
The LZ incremental parsing of $\hat{\boldsymbol{x}}_0$, which consists exactly of all the words listed within $u(1),u(2),\ldots,u(m)$, yields
$$c(\hat{\boldsymbol{x}}_0)=\sum_{i=1}^{m}K^i=\frac{K^{m+1}-K}{K-1}, \tag{38}$$
and so, considering Equation (3), it follows that $LZ(\hat{\boldsymbol{x}}_0)\le(1+\epsilon_n)n\log K$ for some $\epsilon_n\to0$ as $n\to\infty$. It follows then that
$$U[\mathcal{S}(\boldsymbol{x},D)]\ge2^{-n(1+\epsilon_n)\log K}. \tag{39}$$
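The counting sequence and the two identities (37) and (38) can be checked directly; a small sketch reusing lz78_phrases from Section 2 (the helper name is ours):

```python
from itertools import product

def counting_sequence(alphabet, m):
    """x̂0 = (u(1) u(2) ... u(m)): all K-ary words of length 1, then all of length 2, etc.,
    each group listed in lexicographic order."""
    return "".join("".join(w) for i in range(1, m + 1) for w in product(alphabet, repeat=i))

K, m = 2, 3
xhat0 = counting_sequence("ab", m)
print(len(xhat0), sum(i * K ** i for i in range(1, m + 1)))            # n of Equation (37)
print(len(lz78_phrases(xhat0)), sum(K ** i for i in range(1, m + 1)))  # c(x̂0) of Equation (38)
```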
Remark 1.
For an alternative to this upper bound on the LZ code length, one can slightly modify the LZ algorithm as follows: if $LZ(\hat{\boldsymbol{x}})\le n\log K$, use the LZ algorithm as usual; otherwise, send $\hat{\boldsymbol{x}}$ uncompressed using $n\log K$ bits. To distinguish between the two modes of operation, append a flag bit that indicates whether or not the data are LZ-compressed. The modified code length would then be $LZ'(\hat{\boldsymbol{x}})=\min\{LZ(\hat{\boldsymbol{x}}),n\log K\}+1$. Now, replace $LZ(\hat{\boldsymbol{x}})$ by $LZ'(\hat{\boldsymbol{x}})$ in all places throughout this paper, including the definition of U.
Consider now an independent random selection of all reproduction vectors to form a codebook, $\mathcal{C}_n$, of size $M=A^n$ (with $A>K$) codewords, $\hat{\boldsymbol{x}}_1,\hat{\boldsymbol{x}}_2,\ldots,\hat{\boldsymbol{x}}_M$, drawn according to U. Once the codebook $\mathcal{C}_n=\{\hat{\boldsymbol{x}}_1,\hat{\boldsymbol{x}}_2,\ldots,\hat{\boldsymbol{x}}_M\}$ has been drawn, it is revealed to both the encoder and the decoder. Consider next the following encoder. As defined before, let $I(\boldsymbol{x})$ be the index of the first codeword that falls within $\mathcal{S}(\boldsymbol{x},D)$, but now with the small modification that if none of the $A^n$ codewords falls in $\mathcal{S}(\boldsymbol{x},D)$, then we define $I(\boldsymbol{x})=A^n$ anyway (and then the encoding fails). Next, we define the following probability distribution over the positive integers $1,2,\ldots,A^n$:
$$u[i]=\frac{1/i}{\sum_{k=1}^{A^n}1/k},\quad i=1,2,\ldots,A^n. \tag{40}$$
Given $\boldsymbol{x}$, the encoder finds $I(\boldsymbol{x})$ and encodes it using a variable-rate lossless code with the length function (in bits, and ignoring the integer length constraint)
$$L(\boldsymbol{x})=-\log u[I(\boldsymbol{x})]=\log I(\boldsymbol{x})+\log\left(\sum_{k=1}^{A^n}\frac{1}{k}\right)\le\log I(\boldsymbol{x})+\log(\ln A^n+1)=\log I(\boldsymbol{x})+\log(n\ln A+1)\le\log I(\boldsymbol{x})+\log n+c, \tag{41}$$
where $c=\log(\ln A+1)$. It follows that the expected codeword length for $\boldsymbol{x}\in\mathcal{X}^n$ (with respect to the randomness of the code) is upper-bounded by
$$\begin{aligned}\boldsymbol{E}\{L(\boldsymbol{x})\}&\le\boldsymbol{E}\{\log I(\boldsymbol{x})\}+\log n+c\\&\le\log\boldsymbol{E}\{I(\boldsymbol{x})\}+\log n+c\\&=\log\left(\sum_{k=1}^{A^n}k\cdot(1-U[\mathcal{S}(\boldsymbol{x},D)])^{k-1}\cdot U[\mathcal{S}(\boldsymbol{x},D)]+A^n\cdot(1-U[\mathcal{S}(\boldsymbol{x},D)])^{A^n}\right)+\log n+c\\&=\log\left(\sum_{k=1}^{\infty}\min\{k,A^n\}\cdot(1-U[\mathcal{S}(\boldsymbol{x},D)])^{k-1}\cdot U[\mathcal{S}(\boldsymbol{x},D)]\right)+\log n+c\\&\le\log\left(\sum_{k=1}^{\infty}k\cdot(1-U[\mathcal{S}(\boldsymbol{x},D)])^{k-1}\cdot U[\mathcal{S}(\boldsymbol{x},D)]\right)+\log n+c\\&=\log\frac{1}{U[\mathcal{S}(\boldsymbol{x},D)]}+\log n+c,\end{aligned} \tag{42}$$
and we denote
$$L^+(\boldsymbol{x})=\log\frac{1}{U[\mathcal{S}(\boldsymbol{x},D)]}+\log n+c. \tag{43}$$
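As a small sanity check on the index code of (40) and (41), a tiny numerical sketch (names are ours; the ideal code lengths ignore the integer-length constraint, as in the text):

```python
import math

def index_code_length(i, M):
    """Ideal code length, in bits, of index i under u[i] ∝ 1/i on {1, ..., M} (Eq. (40))."""
    harmonic = sum(1.0 / k for k in range(1, M + 1))
    return math.log2(i) + math.log2(harmonic)

# Check the bound L(x) <= log I(x) + log n + c of (41), with c = log(ln A + 1).
n, A, i = 8, 3, 5
c = math.log2(math.log(A) + 1)
print(index_code_length(i, A ** n), math.log2(i) + math.log2(n) + c)
```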
Consider now the quantity
$$E_n=\boldsymbol{E}\left\{\max\left(\max_{\boldsymbol{x}\in\mathcal{X}^n}\mathcal{I}\{d(\boldsymbol{x},\hat{\boldsymbol{X}})>nD\},\ \max_{\boldsymbol{x}\in\mathcal{X}^n}\left[L(\boldsymbol{x})-L^+(\boldsymbol{x})-(1+\epsilon)\log n\right]_+\right)\right\}, \tag{44}$$
where $\hat{\boldsymbol{X}}$ denotes the reproduction vector of $\boldsymbol{x}$ under the randomly selected code, and the expectation is with respect to the randomness of the code $\mathcal{C}_n$. If $E_n$ can be upper-bounded by $\delta_n$, which tends to zero as $n\to\infty$, this will imply that there must exist a code for which both
$$\max_{\boldsymbol{x}\in\mathcal{X}^n}\mathcal{I}\{d(\boldsymbol{x},\hat{\boldsymbol{x}})>nD\}\le\delta_n \tag{45}$$
and
$$\max_{\boldsymbol{x}\in\mathcal{X}^n}\left[L(\boldsymbol{x})-L^+(\boldsymbol{x})-(1+\epsilon)\log n\right]_+\le\delta_n \tag{46}$$
hold at the same time. Observe that since the left-hand side of (45) is either zero or one, if we know that it is less than $\delta_n\to0$ for some codebook $\mathcal{C}_n$, then it must vanish as soon as n is large enough that $\delta_n<1$; namely, $d(\boldsymbol{x},\hat{\boldsymbol{x}})\le nD$ for all $\boldsymbol{x}$, in other words, the code is d-semifaithful. Furthermore, by (46), for the same codebook, we must have
$$L(\boldsymbol{x})\le L^+(\boldsymbol{x})+(1+\epsilon)\log n+\delta_n\quad\forall\ \boldsymbol{x}\in\mathcal{X}^n, \tag{47}$$
and δ n adds a negligible redundancy term.
To prove that E n 0 , we first use the simple fact that the maximum of two non-negative numbers is upper-bounded by their sum, i.e.,
$$E_n\le\boldsymbol{E}\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\mathcal{I}\{d(\boldsymbol{x},\hat{\boldsymbol{X}})>nD\}\right\}+\boldsymbol{E}\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\left[L(\boldsymbol{x})-L^+(\boldsymbol{x})-(1+\epsilon)\log n\right]_+\right\}, \tag{48}$$
and, therefore, it is sufficient to prove that each one of these terms tends to zero. As for the first term, we have:
$$\begin{aligned}\boldsymbol{E}\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\mathcal{I}\{d(\boldsymbol{x},\hat{\boldsymbol{X}})>nD\}\right\}&\le\boldsymbol{E}\left\{\sum_{\boldsymbol{x}\in\mathcal{X}^n}\mathcal{I}\{d(\boldsymbol{x},\hat{\boldsymbol{X}})>nD\}\right\}\\&=\sum_{\boldsymbol{x}\in\mathcal{X}^n}\boldsymbol{E}\left\{\mathcal{I}\{d(\boldsymbol{x},\hat{\boldsymbol{X}})>nD\}\right\}\\&=\sum_{\boldsymbol{x}\in\mathcal{X}^n}\Pr\{d(\boldsymbol{x},\hat{\boldsymbol{X}})>nD\}\\&=\sum_{\boldsymbol{x}\in\mathcal{X}^n}(1-U[\mathcal{S}(\boldsymbol{x},D)])^{A^n}\\&\le\sum_{\boldsymbol{x}\in\mathcal{X}^n}\exp\{-A^n\,U[\mathcal{S}(\boldsymbol{x},D)]\}\\&\stackrel{(a)}{\le}\sum_{\boldsymbol{x}\in\mathcal{X}^n}\exp\left\{-e^{n[\ln A-(1+\epsilon_n)\ln K]}\right\}\\&=J^n\exp\left\{-e^{n[\ln A-(1+\epsilon_n)\ln K]}\right\},\end{aligned} \tag{49}$$
where in (a) we have used (39). This quantity decays double-exponentially fast as $n\to\infty$, since we have assumed $A>K$.
As for the second term of (48), we have:
$$\begin{aligned}\boldsymbol{E}\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\left[L(\boldsymbol{x})-L^+(\boldsymbol{x})-(1+\epsilon)\log n\right]_+\right\}&\stackrel{(a)}{\le}\boldsymbol{E}\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\left[\log I(\boldsymbol{x})-\log\frac{1}{U[\mathcal{S}(\boldsymbol{x},D)]}-(1+\epsilon)\log n\right]_+\right\}\\&=\int_0^\infty\Pr\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\left[\log I(\boldsymbol{x})-\log\frac{1}{U[\mathcal{S}(\boldsymbol{x},D)]}-(1+\epsilon)\log n\right]_+\ge s\right\}\mathrm{d}s\\&=\int_0^{n\log A}\Pr\left\{\max_{\boldsymbol{x}\in\mathcal{X}^n}\left[\log I(\boldsymbol{x})-\log\frac{1}{U[\mathcal{S}(\boldsymbol{x},D)]}-(1+\epsilon)\log n\right]\ge s\right\}\mathrm{d}s\\&=\int_0^{n\log A}\Pr\left\{\bigcup_{\boldsymbol{x}\in\mathcal{X}^n}\left\{I(\boldsymbol{x})\ge\frac{2^{(1+\epsilon)\log n+s}}{U[\mathcal{S}(\boldsymbol{x},D)]}\right\}\right\}\mathrm{d}s\\&\le\sum_{\boldsymbol{x}\in\mathcal{X}^n}\int_0^{n\log A}\Pr\left\{I(\boldsymbol{x})\ge\frac{2^{(1+\epsilon)\log n+s}}{U[\mathcal{S}(\boldsymbol{x},D)]}\right\}\mathrm{d}s\\&\stackrel{(b)}{\le}\sum_{\boldsymbol{x}\in\mathcal{X}^n}\int_0^{n\log A}\exp\{-2^s\,n^{1+\epsilon}\}\,\mathrm{d}s\\&\le J^n\cdot(n\log A)\cdot\exp\{-n^{1+\epsilon}\},\end{aligned} \tag{50}$$
where in (a) we have used (41) and (43), and in (b) we have used (33). The right-most side of this chain of inequalities clearly decays as well when n grows without bound. This completes the proof.

5. Summary and Discussion

By deriving asymptotically matching upper and lower bounds, we have established the quantity $-\frac{1}{n}\log(U[\mathcal{S}(\boldsymbol{x},D)])$ as having the significance of an empirical rate–distortion function for individual sequences. While this quantity is not easy to calculate for large n, the operative meaning of our results is that we propose a universal ensemble for rate–distortion coding. According to this ensemble, the codewords are drawn independently under the probability distribution that is proportional to $2^{-LZ(\hat{\boldsymbol{x}})}$.
There are several observations, insights, and perspectives that should be addressed.
Relation to earlier converse bounds. The converse bound is given in terms of the probability of a sphere of radius $nD$ around the source vector $\boldsymbol{x}$, under the universal distribution, U, defined in (4). This is intimately related to a converse result due to Kontoyiannis and Zhang ([22], Theorem 1, part i), which states that for any d-semifaithful code, there exists a probability distribution Q on $\hat{\mathcal{X}}^n$ such that $L(\boldsymbol{x})\ge-\log(Q[\mathcal{S}(\boldsymbol{x},D)])$ for all $\boldsymbol{x}$ (see also [21]). Here, upon giving up any claims on a minority of the codewords pertaining to a given type class, we derived a lower bound of essentially the same form, with the benefit of specifying a concrete choice of the distribution Q: we propose Q = U, the universal distribution (unlike the distribution in [22] (Section III.A), which is proportional to $2^{-L(\hat{\boldsymbol{x}})}$ across the codebook).
Interpretation of the main term of the bound. First, observe that in the special case of lossless compression (D = 0), the expression $-\log(U[\mathcal{S}(\boldsymbol{x},D)])$ boils down to $LZ(\boldsymbol{x})$, as expected. In the lossy case, since $LZ(\hat{\boldsymbol{x}})$ is essentially bounded by a linear function of n (see (39)), we can approximate the main term as follows:
$$-\log(U[\mathcal{S}(\boldsymbol{x},D)])\approx-\log\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}2^{-LZ(\hat{\boldsymbol{x}})}=-\log\left(\sum_{L\ge1}2^{-L}\cdot\left|\{\hat{\boldsymbol{x}}:\ LZ(\hat{\boldsymbol{x}})=L\}\cap\mathcal{S}(\boldsymbol{x},D)\right|\right)\approx\min_{L\ge1}\left\{L-\log\left|\{\hat{\boldsymbol{x}}:\ LZ(\hat{\boldsymbol{x}})=L\}\cap\mathcal{S}(\boldsymbol{x},D)\right|\right\}. \tag{51}$$
This expression, when normalized by n, can be viewed as a certain extension of the rate–distortion function, from the memoryless case to the general case, in the following sense. For a memoryless source P, the rate–distortion function has the following representation, which is parallel to (51):
$$R(D)=\min_{P_{\hat{X}}}\left[H(\hat{X})-\max_{\{P_{X|\hat{X}}:\ \boldsymbol{E}d(X,\hat{X})\le D,\ P_X=P\}}H(\hat{X}|X)\right], \tag{52}$$
where the maximum over the empty set is understood to be $-\infty$. Indeed, if we replace U by the uniform distribution across the first-order type pertaining to the optimal $P_{\hat{X}}$, this is the corresponding single-letter expression of $-\log(P_{\hat{X}}[\mathcal{S}(\boldsymbol{x},D)])$ that is obtained using the method in [38].
Comparing to the LZ description length of the most compressible x ^ S ( x , D ) . Since our achievable bound involves LZ compression, it is interesting to compare it to the conceptually simple coding scheme that encodes x by the vector x ^ that minimizes L Z ( x ^ ) within S ( x , D ) . Consider the following chain of equalities and inequalities:
$$\min_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}LZ(\hat{\boldsymbol{x}})=-\log\max_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}2^{-LZ(\hat{\boldsymbol{x}})}\ge-\log\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}2^{-LZ(\hat{\boldsymbol{x}})}\ge-\log\frac{\sum_{\hat{\boldsymbol{x}}\in\mathcal{S}(\boldsymbol{x},D)}2^{-LZ(\hat{\boldsymbol{x}})}}{\sum_{\hat{\boldsymbol{x}}\in\hat{\mathcal{X}}^n}2^{-LZ(\hat{\boldsymbol{x}})}}=-\log(U[\mathcal{S}(\boldsymbol{x},D)]), \tag{53}$$
which means that the performance of our proposed scheme is never worse (and conceivably often much better) than that of selecting the vector x ^ with the smallest L Z ( x ^ ) among all reproduction vectors in S ( x , D ) . The reason for the superiority of the proposed scheme is that it takes advantage of the fact that x ^ cannot be any vector in X ^ n , but it must be a member of the codebook, C n , i.e., one of the possible outputs of a vector quantizer. On the other hand, in view of [27], min x ^ S ( x , D ) L Z ( x ^ ) is essentially achievable upon compressing the output of a certain reproduction encoder (or vector quantizer) using a finite-state encoder, but a finite-state machine does not have enough memory resources to take advantage of the fact that vectors outside C n cannot be encountered by the encoder. Another interesting comparison between the two schemes is in terms of computational complexity. While in our scheme, the encoder has to carry out typically about 1 / U [ S ( x , D ) ] distortion calculations before finding the first x ^ S ( x , D ) , in the alternative scheme the number of calculations is | S ( x , D ) | . The former is a decreasing function of D, whereas the latter is an increasing function of D. Therefore, in terms of computational complexity, the preference between the two schemes might depend on D. Specifically, for an additive distortion measure, it is easy to see that
$$\frac{1}{U[\mathcal{S}(\boldsymbol{x},D)]}\stackrel{\cdot}{\le}\exp_2\{nR(D,\hat{P}_{\boldsymbol{x}}^1)\}, \tag{54}$$
and, by the method of types [38]:
$$|\mathcal{S}(\boldsymbol{x},D)|\doteq\exp_2\{nE(D,\hat{P}_{\boldsymbol{x}}^1)\}\triangleq\exp_2\left[n\max\left\{H(\hat{X}|X):\ \boldsymbol{E}d(X,\hat{X})\le D,\ P_X=\hat{P}_{\boldsymbol{x}}^1\right\}\right]. \tag{55}$$
Therefore, whenever D is large enough such that $R(D,\hat{P}_{\boldsymbol{x}}^1)<E(D,\hat{P}_{\boldsymbol{x}}^1)$, it is guaranteed that the coding scheme proposed here is computationally less demanding than the alternative scheme of minimizing $LZ(\hat{\boldsymbol{x}})$ across $\mathcal{S}(\boldsymbol{x},D)$.
Implementation of the random coding distribution. The universal random coding distribution is not difficult to implement. One way to achieve this is by feeding the LZ decoder with a sequence of purely random bits (fair coin tosses) until we have obtained n symbols at the decoder output. The details can be found in [36].
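A rough sketch of that idea follows. The specific pointer/new-symbol bit layout used below is our own assumption (the precise construction is described in [36]), and the output is simply truncated to n symbols:

```python
import math, random

def sample_reproduction(n, alphabet, rng=random):
    """Feed an LZ78-style decoder fair coin flips until it has produced n output symbols.
    At each step the decoder reads a random pointer to one of the phrases parsed so far
    (including the empty phrase) and one random fresh symbol; this is a rough way of
    drawing a reproduction vector approximately according to U."""
    K = len(alphabet)
    symbol_bits = math.ceil(math.log2(K))
    dictionary = [""]                      # entry 0 is the empty phrase
    out = ""
    while len(out) < n:
        ptr_bits = math.ceil(math.log2(len(dictionary)))
        ptr = rng.getrandbits(ptr_bits) % len(dictionary) if ptr_bits else 0
        sym = alphabet[rng.getrandbits(symbol_bits) % K]
        phrase = dictionary[ptr] + sym
        dictionary.append(phrase)
        out += phrase
    return out[:n]

print(sample_reproduction(16, "ab"))
```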
Universality with respect to the distortion measure. As mentioned in the Introduction, in [23,24,34], there are results on the existence of rate–distortion codes that are universal, not only in terms of the source, but also in the sense of the distortion measure. Since the proof of our achievability scheme is very similar to that of [34], it is possible to extend the achievability proof here too, so as to make our code distortion-universal for a wide class of distortion measures. This can be carried out by redefining E n to include the maximization of both terms over a dense grid of distortion functions, as was performed in [34]. We opted not to include this in the present paper since it is straightforward given the results we already have here and in [34].

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Berger, T. Rate Distortion Theory—A Mathematical Basis for Data Compression; Prentice-Hall Inc.: Englewood Cliffs, NJ, USA, 1971.
2. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006.
3. Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: New York, NY, USA, 1968.
4. Gray, R.M. Source Coding Theory; Kluwer Academic Publishers: Boston, MA, USA, 1990.
5. Viterbi, A.J.; Omura, J.K. Principles of Digital Communication and Coding; McGraw-Hill Inc.: New York, NY, USA, 1979.
6. Davisson, L.D. Universal noiseless coding. IEEE Trans. Inf. Theory 1973, 19, 783–795.
7. Gallager, R.G. Source Coding with Side Information and Universal Coding; LIDS-P-937; Massachusetts Institute of Technology: Cambridge, MA, USA, 1976.
8. Ryabko, B. Coding of a source with unknown but ordered probabilities. Probl. Inf. Transm. 1979, 15, 134–138.
9. Davisson, L.D.; Leon-Garcia, A. A source matching approach to finding minimax codes. IEEE Trans. Inf. Theory 1980, 26, 166–174.
10. Krichevsky, R.E.; Trofimov, R.K. The performance of universal encoding. IEEE Trans. Inf. Theory 1981, 27, 199–207.
11. Shtar'kov, Y.M. Universal sequential coding of single messages. Probl. Inf. Transm. 1987, 23, 175–186.
12. Barron, A.R.; Rissanen, J.; Yu, B. The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 1998, 44, 2734–2760.
13. Yang, Y.; Barron, A.R. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999, 27, 1564–1599.
14. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
15. Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 1984, 30, 629–636.
16. Merhav, N.; Feder, M. A strong version of the redundancy–capacity theorem of universal coding. IEEE Trans. Inf. Theory 1995, 41, 714–722.
17. Ornstein, D.S.; Shields, P.C. Universal almost sure data compression. Ann. Probab. 1990, 18, 441–452.
18. Zhang, Z.; Yang, E.-H.; Wei, V. The redundancy of source coding with a fidelity criterion. I. Known statistics. IEEE Trans. Inf. Theory 1997, 43, 71–91.
19. Yu, B.; Speed, T. A rate of convergence result for a universal d-semifaithful code. IEEE Trans. Inf. Theory 1993, 39, 813–820.
20. Silva, J.F.; Piantanida, P. On universal d-semifaithful coding for memoryless sources with infinite alphabets. IEEE Trans. Inf. Theory 2022, 68, 2782–2800.
21. Kontoyiannis, I. Pointwise redundancy in lossy data compression and universal lossy data compression. IEEE Trans. Inf. Theory 2000, 46, 136–152.
22. Kontoyiannis, I.; Zhang, J. Arbitrary source models and Bayesian codebooks in rate-distortion theory. IEEE Trans. Inf. Theory 2002, 48, 2276–2290.
23. Mahmood, A.; Wagner, A.B. Lossy compression with universal distortion. IEEE Trans. Inf. Theory 2023, 69, 3525–3543.
24. Mahmood, A.; Wagner, A.B. Minimax rate-distortion. arXiv 2022, arXiv:2202.04481.
25. Sholomov, L.A. Measure of information in fuzzy and partially defined data. Dokl. Math. 2006, 74, 775–779.
26. Ziv, J. Coding theorems for individual sequences. IEEE Trans. Inf. Theory 1978, 24, 405–412.
27. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536.
28. Potapov, V.N. Redundancy estimates for the Lempel-Ziv algorithm of data compression. Discret. Appl. Math. 2004, 135, 245–254.
29. Merhav, N.; Ziv, J. On the Wyner-Ziv problem for individual sequences. IEEE Trans. Inf. Theory 2006, 52, 867–873.
30. Ziv, J. Fixed-rate encoding of individual sequences with side information. IEEE Trans. Inf. Theory 1984, 30, 348–352.
31. Merhav, N. Finite-state source-channel coding for individual source sequences with source side information at the decoder. IEEE Trans. Inf. Theory 2022, 68, 1532–1544.
32. Ziv, J. Distortion-rate theory for individual sequences. IEEE Trans. Inf. Theory 1980, 26, 137–143.
33. Weinberger, M.J.; Merhav, N.; Feder, M. Optimal sequential probability assignment for individual sequences. IEEE Trans. Inf. Theory 1994, 40, 384–396.
34. Merhav, N. D-semifaithful codes that are universal over both memoryless sources and distortion measures. IEEE Trans. Inf. Theory 2023, 69, 4746–4757.
35. Cohen, A.; Merhav, N. Universal randomized guessing subjected to distortion. IEEE Trans. Inf. Theory 2022, 68, 7714–7734.
36. Merhav, N.; Cohen, A. Universal randomized guessing with application to asynchronous decentralized brute–force attacks. IEEE Trans. Inf. Theory 2020, 66, 114–129.
37. Merhav, N. Guessing individual sequences: Generating randomized guesses using finite-state machines. IEEE Trans. Inf. Theory 2020, 66, 2912–2920.
38. Csiszár, I.; Körner, J. Information Theory—Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011.
