A Universal Random Coding Ensemble for Sample-Wise Lossy Compression

We propose a universal ensemble for the random selection of rate–distortion codes, which is asymptotically optimal in a sample-wise sense. According to this ensemble, each reproduction vector, x̂, is selected independently at random under a probability distribution proportional to 2^{−LZ(x̂)}, where LZ(x̂) is the code length of x̂ pertaining to the 1978 version of the Lempel–Ziv (LZ) algorithm. We show that, with high probability, the resulting codebook gives rise to an asymptotically optimal variable-rate lossy compression scheme under an arbitrary distortion measure, in the sense that a matching converse theorem also holds. According to the converse theorem, even if the decoder knew the ℓ-th order type of the source vector in advance (ℓ being a large but fixed positive integer), the performance of the above-mentioned code could not have been improved essentially, for the vast majority of codewords pertaining to source vectors in the same type. Finally, we present a discussion of our results, which includes, among other things, a clear indication that our coding scheme outperforms the one that selects the reproduction vector with the shortest LZ code length among all vectors within the allowed distortion from the source vector.


Introduction
We revisit the well-known problem of lossy source coding of finite-alphabet sequences with respect to (w.r.t.) a certain distortion measure [2], [4, Chap. 10], [7, Chap. 9], [9], [27, Chaps. 7, 8]. More concretely, our focus is on d-semifaithful codes, namely, variable-rate codes that meet a certain distortion constraint for every source sequence (and not only in expectation). As is very well known [2], the rate-distortion function quantifies the minimum achievable expected coding rate for a given memoryless source and distortion measure.
During the past several decades, many research efforts were motivated by the fact that the source statistics are rarely (if ever) known in practice, and were therefore directed to the quest for universal coding schemes, namely, coding schemes that do not depend on the unknown statistics but nevertheless approach the lower bounds (i.e., the entropy, in lossless compression, or the rate-distortion function, in the lossy case) asymptotically, as the block length grows without bound. We next provide a very brief (and non-comprehensive) review of some of the relevant earlier works.
In lossless compression, the theory of universal source coding is very well developed and mature.
Davisson's work [6] concerning universal-coding redundancies has established the concepts of weak universality and strong universality (vanishing maximin and minimax redundancies, respectively), and has characterized the connection to the capacity of the 'channel' defined by the family of conditional distributions of the data to be compressed given the index (or parameter) of the source in the class [8]. For many of the frequently encountered parametric classes of sources, the minimum achievable redundancy of universal codes is well known to be dominated by (k log n)/(2n), where k is the number of degrees of freedom of the parameter and n is the block length. A central idea that arises from Davisson's theory is to construct a Shannon code pertaining to the probability distribution of the data vector w.r.t. a mixture (with a certain prior function) of all sources in the class.
Rissanen, who was the inventor of the minimum description length (MDL) principle, proved in [25] a converse coding theorem, which asserts that, asymptotically, no universal code can achieve redundancy below (1 − ε)·(k log n)/(2n), with the possible exception of sources from a subset of the parameter space whose volume tends to zero as n → ∞, for every positive ε. Merhav and Feder [21] have generalized this result to more general classes of sources, with the term (k log n)/(2n) replaced by the capacity of the above-mentioned 'channel'. Further developments, including more refined redundancy analyses, have been carried out in later studies.
In the wider realm of universal lossy compression, the theory is, unfortunately, not as sharp and well-developed as in the lossless setting. We confine our attention, in this work, to d-semifaithful codes [24], namely, codes that satisfy the distortion requirement with probability one. Zhang, Yang and Wei [29] have proved that, unlike in lossless compression, in the lossy case, even if the source statistics are known perfectly, it is impossible to achieve redundancy below (log n)/(2n) (see also [15]), but (log n)/n is achievable. Not knowing the source comes at the price of enlarging the multiplicative constant in front of (log n)/n. Indeed, Yu and Speed [28] have established weak universality with a constant that grows with the cardinalities of the alphabets of the source and the reconstruction [26]. Ornstein and Shields [24] have considered universal d-semifaithful coding for stationary and ergodic sources w.r.t. the Hamming distortion measure, and established convergence with probability one to the rate-distortion function. Kontoyiannis [10] has reported several interesting findings. The first is a certain central limit theorem (CLT), with an O(1/√n) redundancy term whose coefficient is a limiting Gaussian random variable with some constant variance. The second is a so-called law of the iterated logarithm (LIL), with redundancy proportional to (log log n)/n infinitely often with probability one. One of the counter-intuitive conclusions from [10] is that universality incurs essentially no price under these performance measures. In [11], many of the findings are based on the observation that optimal compression can be characterized in terms of the negative logarithm of the probability of a sphere of radius nD around the source vector w.r.t. the distortion measure, where D is the allowed per-letter distortion. In the same article, the authors also proposed the ensemble of random coding w.r.t. a probability distribution given by a mixture of all distributions in a certain class. In two recent articles, Mahmood and Wagner [12], [13] have studied d-semifaithful codes that are strongly universal w.r.t. both the source and the distortion function. The redundancy rates in [12] behave like (log n)/n, with different multiplicative constants.
A parallel line of research on universal lossless and lossy compression, pioneered by Ziv, pertains to the individual-sequence approach. According to this approach, no assumptions are made on the statistical properties of the source. The source sequence to be compressed is considered an arbitrary deterministic (individual) sequence, but limitations are imposed on the encoder and/or the decoder, which are required to be implementable by finite-state machines. This line of work includes, first and foremost, the celebrated Lempel-Ziv (LZ) algorithm [30], [33], as well as further developments that extend the scope to lossy compression with and without side information [22], [32], and to joint source-channel coding [16], [18], [19], [31]. In the lossless case, the article [23] provides an individual-sequence analogue of the above-mentioned result due to Rissanen, where the expression (k log n)/(2n) continues to designate the best achievable redundancy, but the main term of the compression ratio there is the empirical entropy of the source vector, instead of the ordinary entropy of the probabilistic setting. The converse bound of [23] applies to the vast majority of source sequences within each type, and to the vast majority of types (in analogy to the vast majority of the parameter space in Rissanen's framework). In a way, this kind of converse result still carries some flavor of the probabilistic setting, because arguing that the number of exceptional typical sequences is relatively small is actually equivalent to imposing a uniform distribution across the type and asserting that the induced probability of violating the bound is small. A similar comment applies, of course, to the exclusion of a minority of the types. The achievability result of [23], on the other hand, holds pointwise, for every sequence. A similar comment applies to [14], where asymptotically pointwise lossy compression was established with respect to first-order statistics (i.e., "memoryless" statistics), with an emphasis on distortion-universality, similarly as in [12] and [13].
A similar kind of mix between the probabilistic setting and the individual-sequence setting is adopted in this paper as well, in the context of universal rate-distortion coding, but here, just like in [?], there is no limitation to finite-state encoders/decoders as in [23]. In particular, our converse theorem asserts that, given an arbitrary variable-rate code and an arbitrary distortion function within a certain wide class, the majority of reproduction vectors that represent source sequences of a given type (of any fixed order) must have a code length that is essentially at least as large as the negative logarithm of the probability of a ball with normalized radius D (D being the allowed per-letter distortion), centered at the given source sequence. The probability of this ball is taken w.r.t. a universal distribution that is proportional to 2^{−LZ(x̂)}, where LZ(x̂) is the code length of the LZ encoding of the reproduction vector, x̂. On the other hand, we also present a matching achievability result, asserting that for every source sequence, this code length is essentially achievable by random coding, using a universal ensemble of codes, which is defined by independent random selection, where each codeword is drawn under the above-described universal probability distribution.
While the achievability result in [14] was pointwise as well, it was tailored to a memoryless structure, in the sense that it was given in terms of the rate-distortion function of the first-order empirical distribution, which is blind to any empirical dependencies and repetitive patterns within the source sequence. In this paper, we both extend the scope to general individual sequences beyond memoryless statistics and extend the allowable class of distortion measures. In terms of the technical aspects, the proof of the achievability result is very similar to the parallel proof in [14], but the novelty lies considerably more in the converse theorem and its proof.
The outline of this paper is as follows. In Section 2, we establish the notation conventions, define a few terms and quantities, and provide some background. In Section 3, we present the converse theorem and its proof. In Section 4, we present the achievability theorem and prove it.
Finally, in Section 5, we summarize the paper and discuss our results.

Notation, Definitions and Background
Throughout the paper, random variables will be denoted by capital letters, specific values they may take will be denoted by the corresponding lower case letters, and their alphabets will be denoted by calligraphic letters. Random vectors and their realizations will be denoted, respectively, by capital letters and the corresponding lower case letters, both in the bold face font. Their alphabets will be superscripted by their dimensions. The source vector of length n, (x_1, …, x_n), with components, x_i, i = 1, …, n, from a finite alphabet, X, will be denoted by x. The set of all such n-vectors will be denoted by X^n, which is the n-th order Cartesian power of X. Likewise, a reproduction vector of length n, (x̂_1, …, x̂_n), with components, x̂_i, i = 1, …, n, from a finite alphabet, X̂, will be denoted by x̂ ∈ X̂^n. We denote the cardinalities of X and X̂ by J and K, respectively.
For i ≤ j, the notation x_i^j will be used to denote the substring (x_i, x_{i+1}, …, x_j). Probability distributions will be denoted by the letters P or Q with possible subscripts, depending on the context. The probability of an event E will be denoted by Pr{E}, and the expectation operator w.r.t. a probability distribution P will be denoted by E{·}. For two positive sequences, a_n and b_n, the notation a_n ≐ b_n will stand for equality in the exponential scale, that is, lim_{n→∞} (1/n) log(a_n/b_n) = 0; similarly, a_n ≤̇ b_n means that limsup_{n→∞} (1/n) log(a_n/b_n) ≤ 0, and so on. The indicator function of an event E will be denoted by I{E}. The notation [x]_+ will stand for max{0, x}. The logarithmic function, log x, will be understood to be defined to the base 2. Logarithms to the base e will be denoted by ln.
Let ℓ be a positive integer that divides n. The ℓ-th order empirical distribution of x ∈ X^n, which will be denoted by P^ℓ_x, is the vector of relative frequencies of the various strings of length ℓ within x. The set of all ℓ-th order empirical distributions of sequences in X^n will be denoted by P^ℓ_n. For P^ℓ ∈ P^ℓ_n, the type class, {x ∈ X^n : P^ℓ_x = P^ℓ}, will be denoted by T_n(P^ℓ). Likewise, T_n(Q^ℓ) will denote {x̂ ∈ X̂^n : P^ℓ_x̂ = Q^ℓ}, where P^ℓ_x̂ is the ℓ-th order empirical distribution of x̂. Finally, P^ℓ_{x x̂} will denote the ℓ-th order joint empirical distribution of (x, x̂), defined analogously from the pair sequence. For a given positive integer n, a distortion function, d, is a function from X^n × X̂^n into [0, ∞). In the two main parts of this paper, different assumptions will be imposed on the distortion function.
1. For the achievability theorem, the distortion function can be completely arbitrary.
2. For the converse theorem, we assume that d(x, x̂) depends on x and x̂ only via their first-order joint empirical distribution, P^1_{x x̂}, and that for a given such distribution, it grows linearly in n, that is, d(x, x̂) = n·ρ(P^1_{x x̂}), where the function ρ is independent of n.
Regarding item 2, additive distortion measures, which obviously comply with the requirement, are given by linear functionals of P^1_{x x̂}. However, here, arbitrary non-linear functionals are allowed as well.
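As a small illustration of item 2, the first-order joint empirical distribution is easy to compute, and the Hamming distortion is a linear functional ρ of it; the function names below are ours, chosen only for this sketch:

```python
from collections import Counter

def joint_type(x, xhat):
    """First-order joint empirical distribution P^1_{x,x̂} of a pair of strings."""
    assert len(x) == len(xhat)
    n = len(x)
    counts = Counter(zip(x, xhat))
    return {pair: c / n for pair, c in counts.items()}

def hamming_d(x, xhat):
    """An additive (Hamming) distortion: d(x, x̂) = n·ρ(P^1_{x,x̂}),
    with ρ linear in the joint type; non-linear ρ's are also admissible."""
    P = joint_type(x, xhat)
    rho = sum(p for (a, b), p in P.items() if a != b)
    return len(x) * rho

print(hamming_d("abab", "abba"))
```

Replacing the linear ρ by, say, its square would still satisfy item 2, since d would still depend on (x, x̂) only through P^1_{x x̂} and scale linearly with n.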
A rate-distortion block code of length n is a mapping, φ n : X n → B n , B n ⊂ {0, 1} * , that maps the space of source vectors of length n, X n , into a set, B n , of variable-length compressed bit strings.
The decoder is a mapping ψ_n : B_n → C_n ⊆ X̂^n that maps the set of compressed variable-length binary strings into a reproduction codebook, C_n. A block code is called d-semifaithful if, for every x ∈ X^n, d(x, ψ_n(φ_n(x))) ≤ nD. The code length for x, denoted L(x), is the number of bits of φ_n(x). Since L(x) depends on x only via φ_n(x), we will sometimes also denote it by L(φ_n(x)) or by L(x̂) (x̂ being the reproduction vector pertaining to φ_n(x)), with a slight abuse of notation.
For the converse theorem, we assume that the correspondence between B_n and C_n is one-to-one. For the achievability theorem, we consider prefix-free codes. Accordingly, the encoder can equivalently be presented as a cascade of a reproduction encoder (a.k.a. vector quantizer), which maps X^n into C_n, followed by an entropy coder, which maps C_n into B_n with no additional loss of information.
For the purpose of presenting both the converse theorem and the achievability theorem, we need to recall a few terms and facts concerning the 1978 version of the LZ algorithm (a.k.a. the LZ78 algorithm) [33]. The incremental parsing procedure of the LZ78 algorithm is a procedure of sequentially parsing a vector, x̂ ∈ X̂^n, such that each new phrase is the shortest string that has not been encountered before as a parsed phrase, with the possible exception of the last phrase, which might be incomplete. For example, the incremental parsing of the vector x̂ = abbabaabbaaabaa is a, b, ba, baa, bb, aa, ab, aa. Let c(x̂) denote the number of phrases in x̂ resulting from the incremental parsing procedure. Let LZ(x̂) denote the length of the LZ78 binary compressed code for x̂.
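The parsing rule just described can be sketched in a few lines (a minimal illustration of the phrase-splitting step only, not the full LZ78 encoder):

```python
def lz78_parse(x):
    """Incrementally parse x: each new phrase is the shortest string not yet
    in the phrase dictionary; the last phrase may be an incomplete repeat."""
    phrases, seen, i = [], set(), 0
    while i < len(x):
        j = i + 1
        # extend the current phrase until it is new (or the string ends)
        while x[i:j] in seen and j <= len(x):
            j += 1
        phrases.append(x[i:j])
        seen.add(x[i:j])
        i = j
    return phrases

# The example from the text:
print(lz78_parse("abbabaabbaaabaa"))  # a, b, ba, baa, bb, aa, ab, aa
```

Note that the final phrase, aa, repeats an earlier phrase because the string ends before it can be extended, matching the "possible exception of the last phrase" in the text.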
According to [33, Theorem 2], the LZ78 code length satisfies
    LZ(x̂) ≤ [c(x̂) + 1]·log{2K[c(x̂) + 1]} ≤ n·(1 + ε(n))·log K,
where we remind that K is the cardinality of X̂, and where ε(n) clearly tends to zero as n → ∞, at the rate of 1/log n. We next define a universal probability distribution (see also [3], [20]):
    U(x̂) = 2^{−LZ(x̂)} / Σ_{x̂′ ∈ X̂^n} 2^{−LZ(x̂′)},   x̂ ∈ X̂^n,
where, by Kraft's inequality, the normalizing sum does not exceed unity, and so, U(x̂) ≥ 2^{−LZ(x̂)}. Finally, we define the D-sphere around x as
    S(x, D) = {x̂ ∈ X̂^n : d(x, x̂) ≤ nD},
and its probability under the universal distribution,
    U[S(x, D)] = Σ_{x̂ ∈ S(x, D)} U(x̂).
For later use, we also define
    Ŝ(x̂, D) = {x ∈ X^n : d(x, x̂) ≤ nD}.
Our purpose is to derive upper and lower bounds on the smallest achievable code length, L(x), for d-semifaithful block codes of length n and individual sequences, {x}, from a given ℓ-th order type class, T_n(P^ℓ). As will be seen shortly, in both the converse and the achievability theorems, the main term of the bound on the length function will be − log(U[S(x, D)]).
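As a toy illustration of U, one can enumerate a small X̂^n and weight each string by an approximate LZ78 code length; the particular length formula below (a pointer into the current dictionary plus one fresh symbol per phrase) is a common textbook proxy, not the exact length function of [33]:

```python
import math
from itertools import product

def lz78_parse(x):
    """Incremental parsing; the last phrase may be an incomplete repeat."""
    phrases, seen, i = [], set(), 0
    while i < len(x):
        j = i + 1
        while x[i:j] in seen and j <= len(x):
            j += 1
        phrases.append(x[i:j]); seen.add(x[i:j]); i = j
    return phrases

def lz_len(x, K):
    """Proxy LZ78 code length in bits: the i-th phrase costs a dictionary
    pointer (ceil(log2 i) bits) plus one fresh K-ary symbol."""
    c = len(lz78_parse(x))
    return sum(math.ceil(math.log2(i + 1)) + math.ceil(math.log2(K))
               for i in range(c))

alphabet, n = "ab", 6
K = len(alphabet)
strings = ["".join(s) for s in product(alphabet, repeat=n)]
weights = {s: 2.0 ** (-lz_len(s, K)) for s in strings}
Z = sum(weights.values())                  # the normalizing sum over X̂^n
U = {s: w / Z for s, w in weights.items()}

# U favors LZ-compressible strings, e.g. the constant string over a mixed one:
print(U["aaaaaa"] > U["abbaba"])
```

The point of the sketch is only qualitative: strings with few parsed phrases get exponentially larger probability under U.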

The Converse Theorem
The following converse theorem asserts that even if the type class of the source vector was known to the decoder ahead of time, the code length could not be much smaller than − log(U [S(x, D)]) for the vast majority of the codewords pertaining to that type.
Theorem 1 Let ℓ be a positive integer that divides n and let P^ℓ be an arbitrary empirical distribution pertaining to a certain type class, T_n(P^ℓ), of source sequences in X^n. Let d be a distortion function that depends on (x, x̂) only via P^1_{x x̂}. Then, for every d-semifaithful variable-length block code with a one-to-one correspondence between B_n and C_n, and for every ε > 0, the following lower bound applies to a fraction of at least (1 − 2n^{−ε}) of the codewords, {φ_n(x), x ∈ T_n(P^ℓ)}:
    L(x̂) ≥ − log(U[S(x, D)]) − ε log n − n·∆_n(ℓ),
where ∆_n(ℓ) has the property lim_{n→∞} ∆_n(ℓ) = 1/ℓ.
As a technical note, observe that ∆_n(ℓ) can be made small only when ℓ is chosen large, as ∆_n(ℓ) behaves like 1/ℓ for fixed ℓ and large n. This suggests that the theorem is meaningful mainly when ℓ is appreciably large, which is not surprising, because the larger ℓ is, the better one can exploit empirical dependencies within the source sequence.
The remaining part of this section is devoted to the proof of Theorem 1.
Proof. We first establish a relationship that will be used later on. For two given types, T_n(P^ℓ) ⊂ X^n and T_n(Q^ℓ) ⊂ X̂^n, consider the quantity
    N(D) = |{(x, x̂) ∈ T_n(P^ℓ) × T_n(Q^ℓ) : d(x, x̂) ≤ nD}|.
We can evaluate N(D) in two ways. The first is as follows:
    N(D) = Σ_{x ∈ T_n(P^ℓ)} |T_n(Q^ℓ) ∩ S(x, D)| = |T_n(P^ℓ)| · |T_n(Q^ℓ) ∩ S(x, D)|,
where the second equality is since |T_n(Q^ℓ) ∩ S(x, D)| is the same for all x ∈ T_n(P^ℓ), due to the permutation-invariance of the distortion function (item 2 of Section 2). By the same token, we can also express N(D) in the following manner:
    N(D) = Σ_{x̂ ∈ T_n(Q^ℓ)} |T_n(P^ℓ) ∩ Ŝ(x̂, D)| = |T_n(Q^ℓ)| · |T_n(P^ℓ) ∩ Ŝ(x̂, D)|,
which follows from the same consideration by symmetry. It follows then that
    |T_n(P^ℓ)| · |T_n(Q^ℓ) ∩ S(x, D)| = |T_n(Q^ℓ)| · |T_n(P^ℓ) ∩ Ŝ(x̂, D)|,
or, equivalently,
    |T_n(P^ℓ)| / |T_n(P^ℓ) ∩ Ŝ(x̂, D)| = |T_n(Q^ℓ)| / |T_n(Q^ℓ) ∩ S(x, D)|.
Now, let Q^{ℓ*} be the type of x̂ that maximizes |T_n(P^ℓ) ∩ Ŝ(x̂, D)|. Then, the last equation implies that
    |T_n(P^ℓ)| / max_{x̂} |T_n(P^ℓ) ∩ Ŝ(x̂, D)| = |T_n(Q^{ℓ*})| / |T_n(Q^{ℓ*}) ∩ S(x, D)|,   x ∈ T_n(P^ℓ).
This relationship will be used shortly.
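The double-counting identity for N(D) is easy to verify numerically on a toy example (first-order types, Hamming distortion; all names below are ours):

```python
from itertools import product
from collections import Counter

def type_class(alphabet, n, counts):
    """All strings of length n whose first-order symbol counts equal `counts`."""
    return [s for s in product(alphabet, repeat=n)
            if Counter(s) == Counter(counts)]

hamming = lambda x, y: sum(a != b for a, b in zip(x, y))

n, nD = 4, 1
TP = type_class("ab", n, {"a": 2, "b": 2})   # source type class T(P)
TQ = type_class("ab", n, {"a": 3, "b": 1})   # reproduction type class T(Q)

# Count the pairs within distortion nD (the quantity N(D)) in two ways:
N = sum(1 for x in TP for xh in TQ if hamming(x, xh) <= nD)
x0, xh0 = TP[0], TQ[0]
sphere   = sum(1 for xh in TQ if hamming(x0, xh) <= nD)   # |T(Q) ∩ S(x, D)|
sphere_r = sum(1 for x in TP if hamming(x, xh0) <= nD)    # |T(P) ∩ Ŝ(x̂, D)|
assert N == len(TP) * sphere == len(TQ) * sphere_r
```

The assertion holds for any representatives x0 and x̂0, precisely because the sphere-intersection sizes are invariant within each type class.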
Let P^ℓ ∈ P^ℓ_n be given. Any d-semifaithful code must fully cover the type class T_n(P^ℓ) with spheres of radius nD (henceforth referred to as D-spheres), centered at the various codewords. Let x̂_1, …, x̂_M ∈ X̂^n be M codewords. The number, G, of members of T_n(P^ℓ) that are covered by x̂_1, …, x̂_M is upper bounded as follows:
    G ≤ Σ_{i=1}^{M} |T_n(P^ℓ) ∩ Ŝ(x̂_i, D)| ≤ M · max_{x̂ ∈ X̂^n} |T_n(P^ℓ) ∩ Ŝ(x̂, D)|,
and so, the necessary condition for complete covering, which is G ≥ |T_n(P^ℓ)|, amounts to
    M ≥ |T_n(P^ℓ)| / max_{x̂} |T_n(P^ℓ) ∩ Ŝ(x̂, D)| = |T_n(Q^{ℓ*})| / |T_n(Q^{ℓ*}) ∩ S(x, D)|,
where the second step is by the relationship established at the beginning of the proof, with x being any member of T_n(P^ℓ). Consider now a variable-length code with a codebook of size M.
Let L(x̂) denote the length (in bits) of the compressed binary string that represents x̂. The number of codewords with L(x̂) ≤ log M − ε log n is upper bounded as follows:
    |{i : L(x̂_i) ≤ log M − ε log n}| ≤ Σ_{k=0}^{⌊log M − ε log n⌋} 2^k < 2^{log M − ε log n + 1} = 2M·n^{−ε},
where in the first inequality we have used the assumed one-to-one property of the mapping between the reproduction codewords and their variable-length compressed binary representations. It follows then that for at least M(1 − 2n^{−ε}) out of the M codewords in C_n (that is, the vast majority of codewords), we have
    L(x̂) > log M − ε log n ≥ − log(U_{Q*}[S(x, D)]) − ε log n,
where the second inequality is by the covering condition above, and where U_{Q*} is the uniform probability distribution across the type class T_n(Q^{ℓ*}), i.e.,
    U_{Q*}(x̂) = 1/|T_n(Q^{ℓ*})| for x̂ ∈ T_n(Q^{ℓ*}),  and  U_{Q*}(x̂) = 0 elsewhere.
We now argue that for every x̂ ∈ X̂^n,
    U_{Q*}(x̂) ≤ 2^{n∆_n(ℓ)} · 2^{−LZ(x̂)}.
For x̂ ∉ T_n(Q^{ℓ*}), this is trivial, as the left-hand side is equal to zero. For x̂ ∈ T_n(Q^{ℓ*}), we have the following consideration: combining eqs. (30) and (32) of [17], which upper bound LZ(x̂) in terms of the ℓ-th order empirical entropy of x̂, together with the type-class size inequality of [5, p. 17], which lower bounds log|T_n(Q^{ℓ*})| in terms of the very same empirical entropy, one obtains
    LZ(x̂) ≤ log|T_n(Q^{ℓ*})| + n∆_n(ℓ),
where ∆_n(ℓ) collects the residual terms (one of the steps along the way also uses the LZ code-length bound quoted in Section 2), and has the property stated in the theorem. The last inequality is equivalent to the claimed one.
It follows then that for at least M(1 − 2n^{−ε}) of the codewords,
    L(x̂) ≥ − log(U_{Q*}[S(x, D)]) − ε log n ≥ − log(U[S(x, D)]) − ε log n − n∆_n(ℓ),
where in the last step we have applied Kraft's inequality to the LZ code-length function, which guarantees U(x̂) ≥ 2^{−LZ(x̂)}, and hence U_{Q*}[S(x, D)] ≤ 2^{n∆_n(ℓ)}·U[S(x, D)]. This completes the proof of Theorem 1.

The Achievability Theorem
The lower bound of Theorem 1 naturally suggests achievability using the universal distribution, U, for the random selection of the various codewords. The basic idea is quite standard and simple: the quantity U[S(x, D)] is the probability that a single randomly chosen reproduction vector, drawn under U, falls within distance nD from the source vector, x. If all reproduction vectors are drawn independently under U, then the typical number of such random selections that it takes before one sees the first one in S(x, D) is of the exponential order of 1/U[S(x, D)]. Given that the codebook is revealed to both the encoder and the decoder once it has been selected, the encoder merely needs to transmit the index of the first such reproduction vector within the codebook, and the description length of that index can be made essentially as small as log{1/U[S(x, D)]} = − log(U[S(x, D)]).
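In code, the encoder side of this idea can be sketched as follows; this is a hedged toy sketch, in which the sampler, the codebook size, and the Hamming example are our illustrative choices rather than the paper's construction (in the scheme of this paper, the sampler would draw from the LZ-based universal distribution U):

```python
import random

def build_codebook(sampler, M, seed=0):
    """Draw M reproduction vectors i.i.d.; shared by encoder and decoder."""
    rng = random.Random(seed)
    return [sampler(rng) for _ in range(M)]

def encode(x, codebook, d, nD):
    """Return the index (1-based, as in the text) of the first codeword
    within distortion nD of x; None signals an encoding failure."""
    for i, xhat in enumerate(codebook, start=1):
        if d(x, xhat) <= nD:
            return i
    return None

# Toy instance: uniform sampler over {a,b}^4, Hamming distortion, D = 0.25.
n, D = 4, 0.25
sampler = lambda rng: "".join(rng.choice("ab") for _ in range(n))
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
codebook = build_codebook(sampler, M=200)
i = encode("abab", codebook, hamming, n * D)
assert i is not None and hamming("abab", codebook[i - 1]) <= n * D
```

The decoder, holding the same seeded codebook, simply looks up the transmitted index; only the index needs to be entropy-coded.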
We use this simple idea to prove achievability for an arbitrary distortion measure.The proof is very similar to the parallel proof in [14], and it is presented here mainly for completeness.
The achievability theorem is the following.
Theorem 2 Let d : X^n × X̂^n → [0, ∞) be an arbitrary distortion function. Then, for every ε > 0, there exists a sequence of d-semifaithful, variable-length block codes of block length n, such that for every x ∈ X^n, the code length for x is upper bounded by
    L(x) ≤ − log(U[S(x, D)]) + (2 + ε) log n + c + δ_n,
where c > 0 is a constant and δ_n = O(n·J^n·e^{−n^{1+ε}}).
Proof. The proof is based on the following simple, well-known fact. Given a source vector x ∈ X^n and a codebook, C_n, let I(x) denote the index, i, of the first vector, x̂_i, such that d(x, x̂_i) ≤ nD, namely, x̂_i ∈ S(x, D). If all reproduction vectors are drawn independently under U, then, for every positive integer, N,
    Pr{I(x) > N} = (1 − U[S(x, D)])^N ≤ exp{−N·U[S(x, D)]},
and so, if
    N = λ_n / U[S(x, D)]
for some arbitrary positive sequence, {λ_n}, that tends to infinity, then Pr{I(x) > N} ≤ e^{−λ_n}. This fact will be used a few times in this section.
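The geometric tail bound above is elementary; a quick numerical sanity check (with p playing the role of U[S(x, D)]):

```python
import math

def tail_exact(p, N):
    """Pr{I(x) > N} when each codeword independently lands in the sphere
    with probability p: all of the first N draws miss."""
    return (1.0 - p) ** N

# (1 - p)^N <= e^{-Np}, since 1 - p <= e^{-p} for all p in [0, 1]:
for p in (0.5, 0.1, 0.01):
    for N in (1, 10, 100):
        assert tail_exact(p, N) <= math.exp(-N * p)
```

With N = λ_n/p, the exponential upper bound becomes e^{−λ_n}, independently of p, which is exactly how the bound is deployed in the proof.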
For later use, we also need the following uniform lower bound on U[S(x, D)]. For a given x, let x̂_0 ∈ X̂^n denote an arbitrary reproduction vector within S(x, D). Then,
    U[S(x, D)] ≥ U(x̂_0) ≥ 2^{−LZ(x̂_0)}.
Next, observe that LZ(x̂_0) is maximized by the K-ary extension of the counting sequence [33, p. 532], which is defined as follows: for i = 1, 2, …, m (m being a positive integer), let u(i) denote the K-ary string of length i·K^i that lists, say in lexicographic order, all the K^i words from X̂^i, and let x̂_0 = (u(1), u(2), …, u(m)). The LZ incremental parsing of x̂_0, which is exactly the concatenated list of all these words, yields c(x̂_0) = Σ_{i=1}^{m} K^i phrases, and so, considering the code-length bound quoted in Section 2, it follows that
    LZ(x̂_0) ≤ (1 + ε_n)·n·log K
for some ε_n → 0 as n → ∞. It follows then that
    U[S(x, D)] ≥ 2^{−(1+ε_n)·n·log K}   for every x ∈ X^n.
Consider now an independent random selection of all reproduction vectors to form a codebook, C_n = {x̂_1, …, x̂_{A_n}}, where A_n = A^n for some constant A > K. Once C_n has been drawn, it is revealed to both the encoder and the decoder. Consider next the following encoder. As defined before, let I(x) be the index of the first codeword that falls within S(x, D), but now with the small modification that if none of the A_n codewords falls in S(x, D), then we define I(x) = A_n nevertheless (and then the encoding fails). Next, we define the following probability distribution over the positive integers, 1, 2, …, A_n:
    P(i) = (1/i) / Σ_{j=1}^{A_n} (1/j),   i = 1, 2, …, A_n.
Given x, the encoder finds I(x) and encodes it using a variable-rate lossless code with the length function (in bits, and ignoring the integer length constraint)
    L(x) = − log P(I(x)) ≤ log I(x) + log(n ln A + 1) ≤ log I(x) + log n + c,
where c = log(ln A + 1), and where we have used the facts Σ_{j=1}^{A_n} 1/j ≤ ln A_n + 1 = n ln A + 1 and n ln A + 1 ≤ n(ln A + 1). It follows that the expected codeword length for x ∈ X^n (w.r.t. the randomness of the code) is upper bounded by
    E{L(x)} ≤ E{log I(x)} + log n + c ≤ log E{I(x)} + log n + c,
by Jensen's inequality, and we denote the target length
    L_0(x) = − log(U[S(x, D)]) + (1 + ε) log n + log n + c.
Consider now the quantity
    E_n = E{ max_{x ∈ X^n} max( I{d(x, x̂_{I(x)}) > nD}, L(x) − L_0(x) ) },
where the expectation is w.r.t. the randomness of the code, C_n. If E_n can be upper bounded by δ_n, which tends to zero as n → ∞, this will imply that there must exist a code for which both d(x, x̂_{I(x)}) ≤ nD for all x ∈ X^n and L(x) ≤ L_0(x) + δ_n for all x ∈ X^n, at the same time. Observe that since the failure indicator is either zero or one, if we know that it must be less than δ_n → 0 for some codebook, C_n, it means that it must vanish as soon as n is large enough such that δ_n < 1; namely, d(x, x̂_{I(x)}) ≤ nD for all x, in other words, the code is d-semifaithful. Also, for the same codebook, the excess length beyond the target length is at most δ_n, which adds only a negligible redundancy term.
To prove that E_n → 0, we first use the simple fact that the maximum of two non-negative numbers is upper bounded by their sum, so that E_n is upper bounded by the sum of a failure-probability term and an expected-excess-length term, and therefore it is sufficient to prove that each one of these terms tends to zero. As for the first term, we have:
    Σ_{x ∈ X^n} Pr{d(x, x̂_{I(x)}) > nD} ≤ J^n · max_x (1 − U[S(x, D)])^{A^n} ≤ J^n · exp{−A^n · 2^{−(1+ε_n)·n·log K}} = J^n · exp{−(A/K^{1+ε_n})^n},
where we have used the uniform lower bound on U[S(x, D)] derived above. This quantity decays double-exponentially rapidly as n → ∞, since we have assumed A > K.
As for the second term, the excess length is positive only when I(x) exceeds n^{1+ε}/U[S(x, D)], and so, using the length function of the encoder, the geometric tail bound established at the beginning of the proof (with λ_n = n^{1+ε}), and the uniform lower bound on U[S(x, D)], the expected-excess-length term is upper bounded by
    J^n · (n log A + log n + c) · e^{−n^{1+ε}}.
The right-most side of this chain of inequalities clearly decays as well when n grows without bound. This completes the proof.

Summary and Discussion
By deriving asymptotically matching upper and lower bounds, we have established the quantity − log(U[S(x, D)]) as having the significance of an empirical rate-distortion function for individual sequences. While this quantity is not easy to calculate for large n, the operative meaning of our results is that we propose a universal ensemble for rate-distortion coding. According to this ensemble, the codewords are drawn independently under the probability distribution that is proportional to 2^{−LZ(x̂)}.
There are several observations, insights and perspectives that should be addressed.
Relation to earlier converse bounds. The converse bound is given in terms of the probability of a sphere of radius nD around the source vector x, under the universal distribution, U, defined in Section 2. This is intimately related to a converse result due to Kontoyiannis and Zhang [11, Theorem 1, part i)], which states that for any d-semifaithful code, there exists a probability distribution Q on X̂^n such that L(x) ≥ − log(Q[S(x, D)]) for all x (see also [10]). Here, upon giving up any claims on a minority of the codewords pertaining to a given type class, we derived a lower bound of essentially the same form, with the benefit of specifying a concrete choice of the distribution Q; i.e., we propose Q = U, the universal distribution (unlike the distribution in [11, Section III.A], which is proportional to 2^{−L(x̂)} across the codebook).
Interpretation of the main term of the bound. Since 2^{−LZ(x̂)} obeys Kraft's inequality, and since highly compressible sequences have LZ code lengths that are sub-linear in n, the normalization constant of U lies between 2^{−o(n)} and 1, and we can approximate the main term as
    − log(U[S(x, D)]) ≈ − log( Σ_{x̂ ∈ S(x, D)} 2^{−LZ(x̂)} ),
up to sub-linear terms. This expression, when normalized by n, can be viewed as a certain extension of the rate-distortion function from the memoryless case to the general case, in the following sense: for a memoryless source P, the rate-distortion function R(D, P) admits a parallel representation as the normalized negative log-probability of S(x, D) under the optimal output distribution. Indeed, if we replace U by the uniform distribution across the first-order type pertaining to the optimal output marginal P_X̂, the corresponding single-letter expression of − log(P_X̂[S(x, D)]) is obtained using the method of types [5], where the maximum over an empty set in the underlying type-counting expression is understood to be −∞.
Comparing to the LZ description length of the most compressible x̂ ∈ S(x, D). Since our achievable bound involves LZ compression, it is interesting to compare it to the conceptually simple coding scheme that encodes x by the vector x̂ that minimizes LZ(x̂) within S(x, D). Consider the following chain of inequalities:
    U[S(x, D)] ≥ Σ_{x̂ ∈ S(x, D)} 2^{−LZ(x̂)} ≥ 2^{−min_{x̂ ∈ S(x, D)} LZ(x̂)},
where the first inequality is because the normalizing constant of U does not exceed unity, by Kraft's inequality; and hence
    − log(U[S(x, D)]) ≤ min_{x̂ ∈ S(x, D)} LZ(x̂),
which means that the performance of our proposed scheme is never worse (and conceivably, often much better) than that of selecting the vector x̂ with the smallest LZ(x̂) among all reproduction vectors in S(x, D). The reason for the superiority of the proposed scheme is that it takes advantage of the fact that x̂ cannot be any vector in X̂^n, but must be a member of the codebook, C_n, i.e., one of the possible outputs of a vector quantizer. On the other hand, in view of [33], min_{x̂ ∈ S(x, D)} LZ(x̂) is essentially achievable upon compressing the output of a certain reproduction encoder (or vector quantizer) using a finite-state encoder, but a finite-state machine does not have enough memory resources to take advantage of the fact that vectors outside C_n cannot be encountered by the encoder. Another interesting comparison between the two schemes is in terms of computational complexity. While in our scheme the encoder typically has to carry out about 1/U[S(x, D)] distortion calculations before finding the first x̂ ∈ S(x, D), in the alternative scheme the number of calculations is |S(x, D)|. The former is a decreasing function of D, whereas the latter is an increasing function of D. Therefore, in terms of computational complexity, the preference between the two schemes might depend on D. Specifically, for an additive distortion measure, it is easy to see that
    1/U[S(x, D)] ≤̇ 2^{n·R(D, P^1_x)}  and  |S(x, D)| ≐ 2^{n·E(D, P^1_x)},
where R(D, P^1_x) is the rate-distortion function of the memoryless source defined by the first-order empirical distribution of x, and E(D, P^1_x) is the exponential growth rate of the sphere size. Therefore, whenever D is large enough such that R(D, P^1_x) < E(D, P^1_x), it is guaranteed that the coding scheme proposed here is computationally less demanding than the alternative scheme of minimizing LZ(x̂) across S(x, D).
Implementation of the random coding distribution. The universal random coding distribution is not difficult to implement. One way to do this is by feeding the LZ decoder with a sequence of purely random bits (fair coin tosses) until n symbols have been obtained at the decoder output. The details can be found in [20].
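A sketch of this mechanism, assuming a simple phrase-index-plus-fresh-symbol LZ78 codeword format; the exact implementation details are those of [20], not of this sketch:

```python
import math
import random

def sample_reproduction(n, alphabet, seed=None):
    """Feed i.i.d. fair bits to an LZ78-style decoder until it has emitted n
    symbols; LZ-compressible strings are the likeliest outputs, mimicking U."""
    rng = random.Random(seed)
    bits = lambda k: sum(rng.getrandbits(1) << j for j in range(k))
    K = len(alphabet)
    sym_bits = math.ceil(math.log2(K))
    dictionary = [""]                      # entry 0 is the empty phrase
    out = ""
    while len(out) < n:
        # read enough random bits to address the current dictionary
        # (the modulo introduces a slight bias; good enough for a sketch)
        idx = bits(math.ceil(math.log2(len(dictionary)))) % len(dictionary)
        # read one fresh symbol and extend the chosen phrase
        phrase = dictionary[idx] + alphabet[bits(sym_bits) % K]
        dictionary.append(phrase)
        out += phrase
    return out[:n]

x = sample_reproduction(16, "ab", seed=3)
assert len(x) == 16 and set(x) <= {"a", "b"}
```

Short codewords decode to long, repetitive outputs, so outputs with small LZ code length are reached by proportionally more random-bit prefixes, which is the essence of sampling from a 2^{−LZ(x̂)}-proportional distribution.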
Universality w.r.t. the distortion measure. As mentioned in the Introduction, in [12], [13] and [14], there are results on the existence of rate-distortion codes that are universal not only w.r.t. the source, but also w.r.t. the distortion measure. Since the proof of our achievability scheme is very similar to that of [14], it is possible to extend the achievability proof here too, so as to make our code distortion-universal for a wide class of distortion measures. This can be carried out by redefining E_n to include maximization of both terms over a dense grid of distortion functions, as was done in [14]. We opted not to include this extension in the present paper, since it is straightforward given the results already available here and in [14].