1. Introduction
It is well-known that the presence of correlated side information can potentially offer dramatic benefits for data compression [1,2]. Important applications where such side information is naturally present include the compression of genomic data [3,4], file and software management [5,6], and image and video compression [7,8].
In practice, the most common approach to the design of effective compression methods with side information is based on generalisations of the Lempel-Ziv family of algorithms [9,10,11,12,13]. A different approach based on grammar-based codes was developed in [14], turbo codes were applied in [15], and a generalised version of context-tree weighting was used in [16].
In this work, we examine the fundamental limits of the best possible performance that can be achieved in such problems. Let $(\boldsymbol{X},\boldsymbol{Y}) = \{(X_n, Y_n);\, n \ge 1\}$ be a source-side information pair; $\boldsymbol{X}$ is the source to be compressed, and $\boldsymbol{Y}$ is the associated side information process, which is assumed to be available to both the encoder and the decoder. Under appropriate conditions, the best average rate that can be achieved asymptotically [2] is the conditional entropy rate,
$$H(\boldsymbol{X}|\boldsymbol{Y}) = \lim_{n\to\infty} \frac{1}{n} H(X_1^n | Y_1^n),$$
where $X_1^n = (X_1, \ldots, X_n)$, $Y_1^n = (Y_1, \ldots, Y_n)$, and $H(X_1^n|Y_1^n)$ denotes the conditional entropy of $X_1^n$ given $Y_1^n$; precise definitions will be given in Section 2.
Our main goal is to derive sharp asymptotic expressions for the optimum compression rate (with side information available to both the encoder and decoder), not only in expectation but also with probability 1. In addition to the best first-order performance, we also determine the best rate at which this performance can be achieved, as a function of the length of the data being compressed. Furthermore, we consider an idealised version of a Lempel-Ziv compression algorithm, and we show that it can achieve asymptotically optimal first- and second-order performance, universally over a broad class of stationary and ergodic source-side information pairs.
Specifically, we establish the following. In Section 2.1 we describe the theoretically optimal one-to-one compressor $f_n^*$, for arbitrary source-side information pairs $(\boldsymbol{X},\boldsymbol{Y})$. In Section 2.2 we prove our first result, stating that the description lengths $\ell(f_n^*(X_1^n|Y_1^n))$ can be well-approximated, with probability one, by the conditional information density $-\log P(X_1^n|Y_1^n)$. Theorem 2 states that for any jointly stationary and ergodic source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, the best asymptotically achievable compression rate is $H(\boldsymbol{X}|\boldsymbol{Y})$ bits/symbol, with probability 1. This generalises Kieffer's corresponding result [17] to the case of compression with side information.
Furthermore, in Section 2.4 we show that there is a sequence of random variables $\{Z_n;\, n \ge 1\}$ such that the description lengths $\ell(f_n(X_1^n|Y_1^n))$ of any sequence of compressors $\{f_n\}$ satisfy a "one-sided" central limit theorem (CLT): eventually, with probability 1,
$$\ell\big(f_n(X_1^n|Y_1^n)\big) \ge nH(\boldsymbol{X}|\boldsymbol{Y}) + \sqrt{n}\, Z_n - o(\sqrt{n}), \qquad (1)$$
where the $Z_n$ converge to a $N\big(0, \sigma^2(\boldsymbol{X}|\boldsymbol{Y})\big)$ distribution, and the term $o(\sqrt{n})$ is negligible compared to $\sqrt{n}$. The lower bound (1) is established in Theorem 3, where it is also shown to be asymptotically achievable. This means that the rate obtained by any sequence of compressors has inevitable fluctuations around the conditional entropy rate, and that the size of these fluctuations is quantified by the conditional varentropy rate,
$$\sigma^2(\boldsymbol{X}|\boldsymbol{Y}) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{Var}\big( -\log P(X_1^n|Y_1^n) \big).$$
This generalises the minimal coding variance of [18]. The bound (1) holds for a broad class of source-side information pairs, including all Markov chains with positive transition probabilities. Under the same conditions, a corresponding "one-sided" law of the iterated logarithm (LIL) is established in Theorem 4, which gives a precise description of the inevitable almost-sure fluctuations of the description lengths above $nH(\boldsymbol{X}|\boldsymbol{Y})$, for any sequence of compressors.
The proofs of all the results in Section 2.3 and Section 2.4 are based, in part, on analogous asymptotics for the conditional information density $-\log P(X_1^n|Y_1^n)$. These are established in Section 2.5, where we state and prove a corresponding CLT and LIL for $-\log P(X_1^n|Y_1^n)$. These results, in turn, follow from the almost sure invariance principle for $-\log P(X_1^n|Y_1^n)$, proved in Appendix A. Theorem A1, which is of independent interest, generalises the invariance principle established for the (unconditional) information density $-\log P(X_1^n)$ by Philipp and Stout [19]. In fact, Theorem A1, along with the identification of the conditions under which it holds (Assumption 1 in Section 2.4), are the more novel contributions of this work.
In a different direction, Nomura and Han [20] establish finer coding theorems for the Slepian-Wolf problem, where the side information is available only to the decoder. There, they obtain general second-order asymptotics for the best achievable rate region, under an excess-rate probability constraint.
Section 3 is devoted to universal compression. We consider a simple, idealised version of Lempel-Ziv coding with side information. As in the case of Lempel-Ziv compression without side information [21,22], the performance of this scheme is determined by the asymptotics of a family of conditional recurrence times $R_n(\boldsymbol{X}|\boldsymbol{Y})$. Under appropriate, general conditions on the source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, in Theorem 8 we show that the ideal description lengths $\log R_n(\boldsymbol{X}|\boldsymbol{Y})$ can be well-approximated by the conditional information density $-\log P(X_1^n|Y_1^n)$. Combining this with our earlier results on the conditional information density, in Corollary 1 and Theorem 9 we show that the compression rate of this scheme converges to $H(\boldsymbol{X}|\boldsymbol{Y})$ with probability 1, and that it is universally second-order optimal. The results of this section generalise the corresponding asymptotics without side information established in [23,24].
The proofs of the more technical results needed in Section 2 and Section 3 are given in the appendices.
2. Pointwise Asymptotics
In this section, we derive general, fine asymptotic bounds for the description lengths of arbitrary compressors with side information, as well as corresponding achievability results.
2.1. Preliminaries
Let $\boldsymbol{X} = \{X_n;\, n \ge 1\}$ be an arbitrary source to be compressed, and let $\boldsymbol{Y} = \{Y_n;\, n \ge 1\}$ be an associated side information process. We let $\mathcal{X}$, $\mathcal{Y}$ denote their finite alphabets, respectively, and we refer to the joint process $(\boldsymbol{X},\boldsymbol{Y}) = \{(X_n, Y_n);\, n \ge 1\}$ as a source-side information pair.
Let $x_1^n = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ be a source string, and let $y_1^n = (y_1, y_2, \ldots, y_n) \in \mathcal{Y}^n$ be an associated side information string which is available to both the encoder and decoder. A fixed-to-variable one-to-one compressor with side information, of blocklength $n$, is a collection of functions $f_n = \{f_n(\cdot\,|\,y_1^n);\; y_1^n \in \mathcal{Y}^n\}$, where each $f_n(\cdot\,|\,y_1^n)$ takes values in the set of all finite-length binary strings,
$$\{0,1\}^* = \bigcup_{k \ge 0} \{0,1\}^k = \{\emptyset, 0, 1, 00, 01, 10, 11, 000, \ldots\},$$
with the convention that $\{0,1\}^0$ consists of just the empty string $\emptyset$ of length zero. For each $y_1^n \in \mathcal{Y}^n$, we assume that $f_n(\cdot\,|\,y_1^n)$ is a one-to-one function from $\mathcal{X}^n$ to $\{0,1\}^*$, so that the compressed binary string $f_n(x_1^n|y_1^n)$ is always correctly decodable.
The main figure of merit in lossless compression is, of course, the description length,
$$\ell\big(f_n(x_1^n | y_1^n)\big) \text{ bits},$$
where, throughout, $\ell(s)$ denotes the length, in bits, of a binary string $s$. Under quite general optimality criteria, the optimal compressor $f_n^*$ is easy to describe explicitly; see [25] for an extensive discussion. For $1 \le i \le j$, we use the shorthand notation $x_i^j$ for the string $(x_i, x_{i+1}, \ldots, x_j)$, and similarly $X_i^j$ for the corresponding collection of random variables $(X_i, X_{i+1}, \ldots, X_j)$.
Definition 1 (The optimal compressor $f_n^*$). For each side information string $y_1^n \in \mathcal{Y}^n$, $f_n^*(\cdot\,|\,y_1^n)$ is the optimal compressor for the distribution $P_{X_1^n|Y_1^n}(\cdot\,|\,y_1^n)$, namely the compressor that orders the strings $x_1^n \in \mathcal{X}^n$ in order of decreasing probability $P(x_1^n|y_1^n)$, and assigns them codewords from $\{0,1\}^*$ in lexicographic order.
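To make Definition 1 concrete, here is a minimal Python sketch. The names `codewords` and `optimal_codebook`, the generic `cond_prob` argument, and the tie-breaking by Python's stable sort are our own illustrative choices, not part of the paper:

```python
from itertools import product

def codewords():
    """Yield all finite binary strings in lexicographic order:
    '', '0', '1', '00', '01', '10', '11', '000', ..."""
    length = 0
    while True:
        for bits in product("01", repeat=length):
            yield "".join(bits)
        length += 1

def optimal_codebook(cond_prob, alphabet, n, y):
    """Codebook of f_n^*(. | y_1^n): sort all source strings x_1^n by
    decreasing conditional probability P(x_1^n | y_1^n) and pair them
    with the codewords '', '0', '1', ... in that order."""
    strings = sorted(product(alphabet, repeat=n),
                     key=lambda x: -cond_prob(x, y))
    return dict(zip(strings, codewords()))
```

Since the codebook enumerates all of $\mathcal{X}^n$, this construction is exponential in $n$ and purely illustrative.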
2.2. The Conditional Information Density
Definition 2 (Conditional information density). For an arbitrary source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, the conditional information density of blocklength $n$ is the random variable:
$$-\log P_{X_1^n|Y_1^n}(X_1^n | Y_1^n).$$
[Throughout the paper, 'log' denotes '$\log_2$', the logarithm taken to base 2, and all familiar information-theoretic quantities are expressed in bits.]
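As a quick illustration of Definition 2, the following sketch (ours, with hypothetical names and a toy pmf) evaluates the conditional information density in the simplest, memoryless case, where $P(x_1^n|y_1^n)$ factorises as a product of single-letter conditional probabilities:

```python
import math

def cond_information_density(x, y, p_joint, p_y):
    """-log2 P(x_1^n | y_1^n) for a memoryless pair (X, Y), where the
    conditional probability factorises as the product of
    P(x_i | y_i) = p_joint[(x_i, y_i)] / p_y[y_i]."""
    return -sum(math.log2(p_joint[(a, b)] / p_y[b]) for a, b in zip(x, y))

# Example: a binary pair with P(X = Y) = 0.8 given either value of Y.
p_joint = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
p_y = {0: 0.5, 1: 0.5}
print(cond_information_density((0, 1, 1), (0, 1, 0), p_joint, p_y))  # ~2.97
```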
The starting point is the following almost sure (a.s.) approximation result relating the description lengths $\ell(f_n(X_1^n|Y_1^n))$ of an arbitrary sequence of compressors $\{f_n\}$ to the conditional information density $-\log P(X_1^n|Y_1^n)$ of an arbitrary source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$. When it causes no confusion, we drop the subscripts for PMFs and conditional PMFs, e.g., simply writing $P(x_1^n|y_1^n)$ for $P_{X_1^n|Y_1^n}(x_1^n|y_1^n)$ as in the definition above. Recall the definition of the optimal compressors $f_n^*$ from Section 2.1.
Theorem 1. For any source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, and any sequence $\{\tau_n\}$ that grows faster than logarithmically, i.e., such that $\tau_n / \log n \to \infty$ as $n \to \infty$, we have:
- (a) For any sequence of compressors with side information $\{f_n\}$:
$$\liminf_{n\to\infty} \frac{\ell\big(f_n(X_1^n|Y_1^n)\big) + \log P(X_1^n|Y_1^n)}{\tau_n} \ge 0, \quad \text{a.s.}$$
- (b) The optimal compressors $\{f_n^*\}$ achieve the above bound with equality.
Proof. Fix $\varepsilon > 0$ arbitrary and let $k_n = \varepsilon \tau_n$. Applying the general converse in ([25], Theorem 3.3) with $X_1^n$ in place of $X$ and $k_n$ in place of $k$, gives,
$$P\Big( \ell\big(f_n(X_1^n|Y_1^n)\big) < -\log P(X_1^n|Y_1^n) - \varepsilon\tau_n \Big) \le 2^{-\varepsilon\tau_n},$$
which is summable in $n$, since $\tau_n / \log n \to \infty$. Therefore, by the Borel-Cantelli lemma we have that, eventually, a.s.,
$$\ell\big(f_n(X_1^n|Y_1^n)\big) \ge -\log P(X_1^n|Y_1^n) - \varepsilon\tau_n.$$
Since $\varepsilon > 0$ was arbitrary, this implies (a). Part (b) follows from (a), together with the fact that $\ell(f_n^*(X_1^n|Y_1^n)) \le -\log P(X_1^n|Y_1^n)$, a.s., by the general achievability result in ([25], Theorem 3.1). □
2.3. First-Order Asymptotics
For any source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, the conditional entropy rate is defined as:
$$H(\boldsymbol{X}|\boldsymbol{Y}) = \limsup_{n\to\infty} \frac{1}{n} H(X_1^n | Y_1^n).$$
Throughout, $H(Z)$ and $H(Z|W)$ denote the discrete entropy of $Z$ and the conditional entropy of $Z$ given $W$, in bits. If $(\boldsymbol{X},\boldsymbol{Y})$ are jointly stationary, then the above limsup is in fact a limit, and it is equal to $H(\boldsymbol{X},\boldsymbol{Y}) - H(\boldsymbol{Y})$, where $H(\boldsymbol{X},\boldsymbol{Y})$ and $H(\boldsymbol{Y})$ are the entropy rates of $(\boldsymbol{X},\boldsymbol{Y})$ and of $\boldsymbol{Y}$, respectively [2]. Moreover, if $(\boldsymbol{X},\boldsymbol{Y})$ are also jointly ergodic, then by applying the Shannon-McMillan-Breiman theorem [2] to $\boldsymbol{Y}$ and to the pair $(\boldsymbol{X},\boldsymbol{Y})$, we obtain its conditional version:
$$-\frac{1}{n} \log P(X_1^n | Y_1^n) \to H(\boldsymbol{X}|\boldsymbol{Y}), \quad \text{a.s.} \qquad (2)$$
The next result states that the conditional entropy rate is the best asymptotically achievable compression rate, not only in expectation but also with probability 1. It is a consequence of Theorem 1 with $\tau_n = \log^2 n$, combined with (2).
Theorem 2. Suppose $(\boldsymbol{X},\boldsymbol{Y})$ is a jointly stationary and ergodic source-side information pair with conditional entropy rate $H(\boldsymbol{X}|\boldsymbol{Y})$.
- (a) For any sequence of compressors with side information $\{f_n\}$:
$$\liminf_{n\to\infty} \frac{1}{n} \ell\big(f_n(X_1^n|Y_1^n)\big) \ge H(\boldsymbol{X}|\boldsymbol{Y}), \quad \text{a.s.}$$
- (b) The optimal compressors $\{f_n^*\}$ achieve the above bound with equality.
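As an illustrative numerical check of Theorem 2 (ours, not part of the original text), the following sketch simulates a memoryless pair and compares the empirical per-symbol conditional information density, which by Theorem 1 tracks the optimal description length to within $o(n)$ bits, against $H(\boldsymbol{X}|\boldsymbol{Y})$:

```python
import math, random

# A hypothetical memoryless pair: Y ~ Bern(1/2), X = Y flipped w.p. 0.2.
p_joint = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
p_y = {0: 0.5, 1: 0.5}
H_cond = -sum(p * math.log2(p / p_y[b]) for (a, b), p in p_joint.items())

random.seed(1)
n = 100_000
pairs = list(p_joint)
sample = random.choices(pairs, weights=[p_joint[k] for k in pairs], k=n)

# Per-symbol conditional information density of the sampled pair.
density = -sum(math.log2(p_joint[(a, b)] / p_y[b]) for a, b in sample)
print(f"H(X|Y) = {H_cond:.4f}, empirical rate = {density / n:.4f} bits/symbol")
```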
2.4. Finer Asymptotics
The refinements of Theorem 2 presented in this section will be derived as consequences of the general approximation results in Theorem 1, combined with corresponding refined asymptotics for the conditional information density $-\log P(X_1^n|Y_1^n)$. For clarity of exposition, these are stated separately, in Section 2.5 below.
The results of this section will be established for a class of jointly stationary and ergodic source-side information pairs $(\boldsymbol{X},\boldsymbol{Y})$ that includes all Markov chains with positive transition probabilities. The relevant conditions, in their most general form, will be given in terms of the following mixing coefficients.
Definition 3. Suppose $\boldsymbol{Z} = \{Z_n\}$ is a stationary process on a finite alphabet $\mathcal{Z}$. For any pair of indices $i \le j$, let $\mathcal{F}_i^j$ denote the σ-algebra generated by $Z_i^j$. For $d \ge 1$, define:
$$\alpha(d) = \sup\big\{ |P(A \cap B) - P(A)P(B)| \,:\, A \in \mathcal{F}_{-\infty}^{0},\ B \in \mathcal{F}_{d}^{\infty} \big\},$$
$$\gamma(d) = \max_{z \in \mathcal{Z}}\, E\,\big| P(Z_0 = z \mid \mathcal{F}_{-d}^{-1}) - P(Z_0 = z \mid \mathcal{F}_{-\infty}^{-1}) \big|.$$
Note that if $\boldsymbol{Z}$ is an ergodic Markov chain of order $k$, then $\alpha(d)$ decays exponentially fast [26], and $\gamma(d) = 0$ for all $d \ge k$. Moreover, if $(\boldsymbol{X},\boldsymbol{Y})$ is a Markov chain with all positive transition probabilities, then the coefficients $\gamma(d)$ of the marginal process $\boldsymbol{Y}$ also decay exponentially fast; cf. ([27], Lemma 2.1).
Throughout this section we will assume that the following conditions hold:
Assumption 1. The source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$ is stationary and satisfies one of the following three conditions:
(i) $(\boldsymbol{X},\boldsymbol{Y})$ is a Markov chain with all positive transition probabilities; or
(ii) $(\boldsymbol{X},\boldsymbol{Y})$ as well as $\boldsymbol{Y}$ are $k$th order, irreducible and aperiodic Markov chains; or
(iii) $(\boldsymbol{X},\boldsymbol{Y})$ is jointly ergodic and satisfies the following mixing conditions, for the coefficients of Definition 3 computed both for the pair $(\boldsymbol{X},\boldsymbol{Y})$ and for $\boldsymbol{Y}$:
$$\alpha(d) = O(d^{-336}) \quad \text{and} \quad \gamma(d) = O(d^{-48}). \qquad (3)$$
[Our source-side information pairs $(\boldsymbol{X},\boldsymbol{Y})$ are only defined for times $n \ge 1$, whereas the coefficients $\alpha(d)$ and $\gamma(d)$ are defined for two-sided sequences $\{Z_n;\, n \in \mathbb{Z}\}$. However, this does not impose an additional restriction, since any one-sided stationary process can be extended to a two-sided one by the Kolmogorov extension theorem [28].]
In view of the discussion following Definition 3, each of conditions (i) and (ii) implies the mixing bounds in (3). Therefore, all results stated under Assumption 1 will be proved under the weakest set of conditions, namely that the bounds in (3) hold.
Definition 4. For a source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, the conditional varentropy rate is:
$$\sigma^2(\boldsymbol{X}|\boldsymbol{Y}) = \limsup_{n\to\infty} \frac{1}{n}\, \mathrm{Var}\big( -\log P(X_1^n|Y_1^n) \big). \qquad (4)$$
Under the above assumptions, the limsup in (4) is in fact a limit. Lemma 1 is proved in Appendix A.
Lemma 1. Under Assumption 1, the conditional varentropy rate is:
$$\sigma^2(\boldsymbol{X}|\boldsymbol{Y}) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{Var}\big( -\log P(X_1^n|Y_1^n) \big).$$
Our first main result in this section is a "one-sided" central limit theorem (CLT), which states that the description lengths $\ell(f_n(X_1^n|Y_1^n))$ of an arbitrary sequence of compressors with side information, $\{f_n\}$, are asymptotically at best Gaussian, with variance $\sigma^2(\boldsymbol{X}|\boldsymbol{Y})$. Recall the optimal compressors $\{f_n^*\}$ described in Section 2.1.
Theorem 3 (CLT for codelengths).
Suppose $(\boldsymbol{X},\boldsymbol{Y})$ satisfy Assumption 1, and let $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ denote the conditional varentropy rate (4). Then there exists a sequence of random variables $\{Z_n;\, n \ge 1\}$ such that:
- (a) For any sequence of compressors with side information, $\{f_n\}$, we have, eventually, with probability 1:
$$\ell\big(f_n(X_1^n|Y_1^n)\big) \ge nH(\boldsymbol{X}|\boldsymbol{Y}) + \sqrt{n}\, Z_n - o(\sqrt{n}), \qquad (5)$$
where $Z_n \to N(0, \sigma^2)$ in distribution, as $n \to \infty$.
- (b) The optimal compressors $\{f_n^*\}$ achieve the lower bound in (5) with equality.
Proof. Letting $Z_n = \frac{1}{\sqrt{n}}\big( -\log P(X_1^n|Y_1^n) - nH(\boldsymbol{X}|\boldsymbol{Y}) \big)$ and taking $\tau_n = \log^2 n$, both results follow by combining the approximation results of Theorem 1 with the corresponding CLT for the conditional information density in Theorem 5. □
Our next result is in the form of a "one-sided" law of the iterated logarithm (LIL), which states that, with probability 1, the description lengths of any compressor with side information will have inevitable fluctuations of order $\sqrt{2\sigma^2 n \ln\ln n}$ bits around $nH(\boldsymbol{X}|\boldsymbol{Y})$; throughout, 'ln' denotes the natural logarithm to base $e$.
Theorem 4 (LIL for codelengths).
Suppose $(\boldsymbol{X},\boldsymbol{Y})$ satisfy Assumption 1, and let $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ denote the conditional varentropy rate (4). Then:
- (a) For any sequence of compressors with side information, $\{f_n\}$, we have, with probability 1:
$$\limsup_{n\to\infty} \frac{\ell\big(f_n(X_1^n|Y_1^n)\big) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{2n \ln\ln n}} \ge \sigma, \qquad (6)$$
$$\liminf_{n\to\infty} \frac{\ell\big(f_n(X_1^n|Y_1^n)\big) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{2n \ln\ln n}} \ge -\sigma. \qquad (7)$$
- (b) The optimal compressors $\{f_n^*\}$ achieve the lower bounds in (6) and (7) with equality.
Proof. Taking $\tau_n = \log^2 n$, the results of the theorem again follow by combining the approximation results of Theorem 1 with the corresponding LIL for the conditional information density in Theorem 6. □
Remark 1. Although the results in Theorems 3 and 4 are stated for one-to-one compressors $\{f_n\}$, they remain valid for the class of prefix-free compressors. Since prefix-free codes are certainly one-to-one, the converse bounds in Theorems 3 and 4 are valid as stated, while for the achievability results it suffices to consider prefix-free compressors with description lengths $\lceil -\log P(x_1^n|y_1^n) \rceil$, and then apply Theorem 5.
Theorem 3 says that the compression rate of any sequence of compressors will have at best Gaussian fluctuations around $H(\boldsymbol{X}|\boldsymbol{Y})$, and similarly Theorem 4 says that, with probability 1, the description lengths will have inevitable fluctuations of approximately $\sigma\sqrt{2n \ln\ln n}$ bits around $nH(\boldsymbol{X}|\boldsymbol{Y})$.
As both of these vanish when $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ is zero, we note that if the source-side information pair is memoryless, so that the pairs $(X_n, Y_n)$ are independent and identically distributed, then the conditional varentropy rate reduces to,
$$\sigma^2(\boldsymbol{X}|\boldsymbol{Y}) = \mathrm{Var}\big( -\log P(X_1|Y_1) \big),$$
which is equal to zero if and only if, for each $y \in \mathcal{Y}$, the conditional distribution of $X_1$ given $\{Y_1 = y\}$ is uniform on a subset $S_y \subseteq \mathcal{X}$, where all the $S_y$ have the same cardinality.
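The memoryless formula above is easy to evaluate exactly. The sketch below (our own illustration, with hypothetical names and toy pmfs) computes $\mathrm{Var}(-\log P(X_1|Y_1))$ from a joint pmf and confirms that it vanishes precisely in the conditionally uniform case:

```python
import math

def memoryless_varentropy(p_joint, p_y):
    """Conditional varentropy rate of a memoryless pair:
    Var(-log2 P(X_1 | Y_1)), computed exactly from the joint pmf."""
    ilog = {k: -math.log2(p / p_y[k[1]]) for k, p in p_joint.items()}
    mean = sum(p_joint[k] * ilog[k] for k in p_joint)
    return sum(p_joint[k] * ilog[k] ** 2 for k in p_joint) - mean ** 2

noisy = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
# Conditionally uniform on sets of equal size: varentropy rate is zero.
flat = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
p_y = {0: 0.5, 1: 0.5}
print(memoryless_varentropy(noisy, p_y))   # 0.64
print(memoryless_varentropy(flat, p_y))    # 0.0
```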
In the more general case, when both the pair process $(\boldsymbol{X},\boldsymbol{Y})$ and the side information $\boldsymbol{Y}$ are Markov chains, necessary and sufficient conditions for $\sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ to be zero were recently established in [25]. In analogy with the source dispersion for the problem of lossless compression without side information [29,30], for an arbitrary source-side information pair the conditional dispersion was recently defined [25] as,
$$D(\boldsymbol{X}|\boldsymbol{Y}) = \lim_{n\to\infty} \frac{1}{n}\, \mathrm{Var}\Big( \ell\big(f_n^*(X_1^n|Y_1^n)\big) \Big).$$
There, it was shown that when both the pair $(\boldsymbol{X},\boldsymbol{Y})$ and $\boldsymbol{Y}$ itself are irreducible and aperiodic Markov chains, the conditional dispersion coincides with the conditional varentropy rate:
$$D(\boldsymbol{X}|\boldsymbol{Y}) = \sigma^2(\boldsymbol{X}|\boldsymbol{Y}).$$
2.5. Asymptotics of the Conditional Information Density
Here we show that the conditional information density itself, $-\log P(X_1^n|Y_1^n)$, satisfies a CLT and an LIL. The next two theorems are consequences of the almost sure invariance principle established in Theorem A1, in Appendix A.
Theorem 5 (CLT for the conditional information density).
Suppose $(\boldsymbol{X},\boldsymbol{Y})$ satisfy Assumption 1, and let $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ denote the conditional varentropy rate (4). Then, as $n \to \infty$:
$$\frac{-\log P(X_1^n|Y_1^n) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{n}} \longrightarrow N(0, \sigma^2), \quad \text{in distribution.} \qquad (8)$$
Proof. The conditions (3) imply that, as $n \to \infty$, $E[-\log P(X_1^n, Y_1^n)] = nH(\boldsymbol{X},\boldsymbol{Y}) + O(1)$ and $E[-\log P(Y_1^n)] = nH(\boldsymbol{Y}) + O(1)$, cf. [19]; therefore also $E[-\log P(X_1^n|Y_1^n)] = nH(\boldsymbol{X}|\boldsymbol{Y}) + O(1)$, so it suffices to show that, as $n \to \infty$,
$$\frac{-\log P(X_1^n|Y_1^n) - E\big[ -\log P(X_1^n|Y_1^n) \big]}{\sqrt{n}} \longrightarrow N(0, \sigma^2), \quad \text{in distribution.} \qquad (9)$$
Let $D[0,\infty)$ denote the space of cadlag (right-continuous with left-hand limits) functions from $[0,\infty)$ to $\mathbb{R}$, and define, for each $t \ge 0$, the partial-sum path $S(t)$ as in Theorem A1 in Appendix A. For all $n \ge 1$, define $B_n = \{B_n(t) = S(nt)/\sqrt{n};\, t \ge 0\}$. Then Theorem A1 implies that, as $n \to \infty$,
$$B_n \Longrightarrow \sigma W, \quad \text{weakly in } D[0,\infty),$$
where $W = \{W(t);\, t \ge 0\}$ is a standard Brownian motion; see, e.g., ([19], Theorem E, p. 4). In particular, this implies that $B_n(1) = S(n)/\sqrt{n} \to N(0, \sigma^2)$ in distribution, which is exactly (9). □
Theorem 6 (LIL for the conditional information density).
Suppose $(\boldsymbol{X},\boldsymbol{Y})$ satisfy Assumption 1, and let $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ denote the conditional varentropy rate (4). Then, with probability 1:
$$\limsup_{n\to\infty} \frac{-\log P(X_1^n|Y_1^n) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{2n \ln\ln n}} = \sigma, \qquad (10)$$
$$\liminf_{n\to\infty} \frac{-\log P(X_1^n|Y_1^n) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{2n \ln\ln n}} = -\sigma. \qquad (11)$$
Proof. As in the proof of (8), it suffices to prove (10) with $-\log P(X_1^n|Y_1^n) - E[-\log P(X_1^n|Y_1^n)]$ in place of $-\log P(X_1^n|Y_1^n) - nH(\boldsymbol{X}|\boldsymbol{Y})$. However, this is immediate from Theorem A1, since, for a standard Brownian motion $\{W(t)\}$,
$$\limsup_{t\to\infty} \frac{W(t)}{\sqrt{2t \ln\ln t}} = 1, \quad \text{a.s.};$$
see, e.g., ([31], Theorem 11.18). And similarly for (11). □
3. Idealised LZ Compression with Side Information
Consider the following idealised version of Lempel-Ziv-like compression with side information. For a given source-side information pair $(\boldsymbol{X},\boldsymbol{Y})$, the encoder and decoder both have access to the infinite past $(X_{-\infty}^0, Y_{-\infty}^0)$ and to the current side information $Y_1^n$. The encoder describes $X_1^n$ to the decoder as follows. First she searches for the first appearance of $(X_1^n, Y_1^n)$ in the past, that is, for the first $j \ge 1$ such that $(X_{-j+1}^{-j+n}, Y_{-j+1}^{-j+n}) = (X_1^n, Y_1^n)$. Then she counts how many times $Y_1^n$ appears in $\boldsymbol{Y}$ between locations $-j+1$ and $0$, namely, how many indices $1 \le i \le j$ there are such that $Y_{-i+1}^{-i+n} = Y_1^n$. Say there are $j^*$ such indices. She describes $X_1^n$ to the decoder by telling him to look at the $j^*$th position where $Y_1^n$ appears in the past of $\boldsymbol{Y}$, and read off the corresponding $\boldsymbol{X}$ string.
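To make the encoding and decoding steps concrete, here is a runnable Python sketch of the scheme just described; the truncation of the infinite past to a finite window of $m$ symbols (so the search can fail and return `None`), the array conventions, and the function names are our own simplifications:

```python
def conditional_recurrence(x, y, n):
    """Encoder: x and y each hold the past plus the current length-n
    block in their last n positions.  Scan shifts j = 1, 2, ...; the
    window at shift j covers times -j+1, ..., -j+n.  Return the number
    of y-matches seen up to and including the first joint (x, y) match:
    the conditional recurrence time R_n(x | y)."""
    xb, yb = x[-n:], y[-n:]
    m = len(x) - n                  # number of past symbols
    rank = 0
    for j in range(1, m + 1):
        lo = m - j                  # array index of time -j+1
        if y[lo:lo + n] == yb:
            rank += 1
            if x[lo:lo + n] == xb:
                return rank
    return None                     # no joint match within the window

def lz_decode(x_past, y, n, rank):
    """Decoder: knows the past of X, all of Y up to time n (the side
    information block is in the last n entries of y), and the rank.
    Finds the rank-th appearance of y_1^n and copies the aligned X
    symbols, filling overlapping copies left to right, LZ77-style."""
    yb = y[-n:]
    m = len(y) - n
    seen = 0
    for j in range(1, m + 1):
        lo = m - j
        if y[lo:lo + n] == yb:
            seen += 1
            if seen == rank:
                out = list(x_past)          # times -m+1, ..., 0
                for _ in range(n):          # decode x_1, ..., x_n in order
                    out.append(out[-j])
                return out[m:]
    return None
```

In the sketch, matching windows are allowed to overlap the current block (shifts $j < n$), in line with the recurrence-time definitions below; the decoder handles such overlapping copies by filling them in left to right.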
This description takes $\log R_n(\boldsymbol{X}|\boldsymbol{Y}) + O\big(\log\log R_n(\boldsymbol{X}|\boldsymbol{Y})\big)$ bits, where $R_n(\boldsymbol{X}|\boldsymbol{Y}) = j^*$ is the conditional recurrence time defined formally in Definition 5 below, and, as it turns out, the resulting compression rate is asymptotically optimal: as $n \to \infty$, with probability 1,
$$\frac{1}{n} \log R_n(\boldsymbol{X}|\boldsymbol{Y}) \to H(\boldsymbol{X}|\boldsymbol{Y}). \qquad (12)$$
Moreover, it is second-order optimal, in that it achieves equality in the CLT and LIL bounds given in Theorems 3 and 4 of Section 2.
Our purpose in this section is to make these statements precise. We will prove (12), as well as its CLT and LIL refinements, generalising the corresponding results for recurrence times without side information in [24].
The use of recurrence times in understanding the Lempel-Ziv (LZ) family of algorithms was introduced by Willems [21] and Wyner and Ziv [22,32]. In terms of practical methods for compression with side information, Subrahmanya and Berger [9] proposed a side information analog of the sliding window LZ algorithm [33], and Uyematsu and Kuzuoka [10] proposed a side information version of the incremental parsing LZ algorithm [34]. The Subrahmanya-Berger algorithm was shown to be asymptotically optimal in [12,13]. Different types of LZ-like algorithms for compression with side information were also considered in [11].
Throughout this section, we assume that $(\boldsymbol{X},\boldsymbol{Y})$ is a jointly stationary and ergodic source-side information pair, with values in the finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. We use bold lower-case letters $\boldsymbol{x}, \boldsymbol{y}$ without subscripts to denote infinite realizations of $\boldsymbol{X}, \boldsymbol{Y}$, and the corresponding bold capital letters without subscripts to denote the entire processes, $\boldsymbol{X} = \{X_n;\, n \in \mathbb{Z}\}$ and $\boldsymbol{Y} = \{Y_n;\, n \in \mathbb{Z}\}$.
The main quantities of interest are the recurrence times defined next.
Definition 5 (Recurrence times).
For a realization $\boldsymbol{x}$ of the process $\boldsymbol{X}$, and $n \ge 1$, define the repeated recurrence times of $x_1^n$, recursively, as:
$$R_n^{(1)}(\boldsymbol{x}) = \inf\big\{ j \ge 1 : x_{-j+1}^{-j+n} = x_1^n \big\}, \qquad R_n^{(i+1)}(\boldsymbol{x}) = \inf\big\{ j > R_n^{(i)}(\boldsymbol{x}) : x_{-j+1}^{-j+n} = x_1^n \big\}, \quad i \ge 1.$$
For a realization $(\boldsymbol{x},\boldsymbol{y})$ of the pair $(\boldsymbol{X},\boldsymbol{Y})$ and $n \ge 1$, the joint recurrence time of $(x_1^n, y_1^n)$ is defined as,
$$R_n(\boldsymbol{x},\boldsymbol{y}) = \inf\big\{ j \ge 1 : (x_{-j+1}^{-j+n}, y_{-j+1}^{-j+n}) = (x_1^n, y_1^n) \big\},$$
and the conditional recurrence time of $x_1^n$ among the appearances of $y_1^n$ is:
$$R_n(\boldsymbol{x}|\boldsymbol{y}) = i, \quad \text{where } i \text{ is the unique index such that } R_n^{(i)}(\boldsymbol{y}) = R_n(\boldsymbol{x},\boldsymbol{y}).$$
of among the appearances is: An important tool in the asymptotic analysis of recurrence times is Kac’s Theorem [
35]. Its conditional version in Theorem 7 was first established in [
12] using Kakutani’s induced transformation [
36,
37].
Theorem 7 (Conditional Kac's theorem [12]).
Suppose $(\boldsymbol{X},\boldsymbol{Y})$ is a jointly stationary and ergodic source-side information pair. For any pair of strings $x_1^n \in \mathcal{X}^n$, $y_1^n \in \mathcal{Y}^n$:
$$E\big[ R_n(\boldsymbol{X}|\boldsymbol{Y}) \,\big|\, X_1^n = x_1^n,\, Y_1^n = y_1^n \big] = \frac{1}{P(x_1^n | y_1^n)}.$$
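Before moving on, here is a small Monte Carlo sanity check of Theorem 7 (our own illustration, reusing the `conditional_recurrence` sketch above, for a memoryless pair and blocklength $n = 2$):

```python
import random

# Reuses conditional_recurrence from the sketch above; memoryless pair.
p_joint = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
pairs, probs = list(p_joint), list(p_joint.values())

random.seed(3)
n, m, trials = 2, 4000, 500
xb, yb = [0, 1], [0, 1]                  # fixed blocks x_1^2, y_1^2
total = 0
for _ in range(trials):
    past = random.choices(pairs, weights=probs, k=m)
    xp = [a for a, b in past]
    yp = [b for a, b in past]
    r = conditional_recurrence(xp + xb, yp + yb, n)
    total += r                           # m is large enough that r is found

p_cond = (0.4 / 0.5) * (0.4 / 0.5)       # P(x_1^2 | y_1^2) = 0.8 * 0.8
print(f"E[R_n | blocks] ~= {total / trials:.3f}, 1/P = {1 / p_cond:.3f}")
```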
The following result states that we can asymptotically approximate $\log R_n(\boldsymbol{X}|\boldsymbol{Y})$ by the conditional information density $-\log P(X_1^n|Y_1^n)$, not just in expectation as in Kac's theorem, but also with probability 1. Its proof is given in Appendix B.
Theorem 8. Suppose $(\boldsymbol{X},\boldsymbol{Y})$ is a jointly stationary and ergodic source-side information pair. For any sequence of non-negative real numbers $\{c_n\}$ such that $\sum_n 2^{-c_n} < \infty$, we have:
$$\big| \log R_n(\boldsymbol{X}|\boldsymbol{Y}) + \log P(X_1^n|Y_1^n) \big| \le c_n, \quad \text{eventually, a.s.}$$
Next we state the main consequences of Theorem 8 that we will need. Recall the definition of the coefficients $\alpha(d)$ and $\gamma(d)$ from Section 2.4. Corollary 1 is proved in Appendix B.
Corollary 1. Suppose $(\boldsymbol{X},\boldsymbol{Y})$ are jointly stationary and ergodic.
- (a) If, in addition, $\sum_d \alpha(d) < \infty$ and $\sum_d \gamma(d) < \infty$, then for any $\varepsilon > 0$:
$$\big| \log R_n(\boldsymbol{X}|\boldsymbol{Y}) + \log P(X_1^n|Y_1^n) \big| \le (1+\varepsilon) \log n, \quad \text{eventually, a.s.}$$
- (b) In the general jointly ergodic case, we have:
$$\frac{1}{n} \big| \log R_n(\boldsymbol{X}|\boldsymbol{Y}) + \log P(X_1^n|Y_1^n) \big| \to 0, \quad \text{a.s.}$$
From part (b), combined with the Shannon-McMillan-Breiman theorem as in (2), we obtain the result (12) promised at the beginning of this section:
$$\frac{1}{n} \log R_n(\boldsymbol{X}|\boldsymbol{Y}) \to H(\boldsymbol{X}|\boldsymbol{Y}), \quad \text{a.s.}$$
This was first established in [12]. At this point, however, we have already done the work required to obtain much finer asymptotic results for the conditional recurrence time.
For any pair of infinite realizations $(\boldsymbol{x},\boldsymbol{y})$ of $(\boldsymbol{X},\boldsymbol{Y})$, let $\{R(t);\, t \ge 0\}$ be the continuous-time path defined as the piecewise constant extension of the conditional recurrence times:
$$R(t) = \log R_{\lfloor t \rfloor}(\boldsymbol{x}|\boldsymbol{y}), \quad t \ge 1; \qquad R(t) = 0, \quad 0 \le t < 1.$$
The following theorem is a direct consequence of Corollary 1(a), combined with Theorem A1 in Appendix A. Recall Assumption 1 from Section 2.4.
Theorem 9. Suppose $(\boldsymbol{X},\boldsymbol{Y})$ satisfy Assumption 1, and let $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ denote the conditional varentropy rate. Then $\{R(t)\}$ can be redefined on a richer probability space that contains a standard Brownian motion $\{B(t);\, t \ge 0\}$, such that, for any $\lambda$ smaller than a constant $\lambda_0 > 0$ specified in Theorem A1:
$$R(t) - tH(\boldsymbol{X}|\boldsymbol{Y}) = \sigma B(t) + O\big(t^{1/2 - \lambda}\big), \quad \text{a.s.}$$
Two immediate consequences of Theorem 9 are the following:
Theorem 10 (CLT and LIL for the conditional recurrence times).
Suppose $(\boldsymbol{X},\boldsymbol{Y})$ satisfy Assumption 1, and let $\sigma^2 = \sigma^2(\boldsymbol{X}|\boldsymbol{Y})$ denote the conditional varentropy rate. Then, as $n \to \infty$:
$$\frac{\log R_n(\boldsymbol{X}|\boldsymbol{Y}) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{n}} \longrightarrow N(0, \sigma^2), \quad \text{in distribution};$$
and, with probability 1:
$$\limsup_{n\to\infty} \frac{\log R_n(\boldsymbol{X}|\boldsymbol{Y}) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{2n \ln\ln n}} = \sigma, \qquad \liminf_{n\to\infty} \frac{\log R_n(\boldsymbol{X}|\boldsymbol{Y}) - nH(\boldsymbol{X}|\boldsymbol{Y})}{\sqrt{2n \ln\ln n}} = -\sigma.$$