Universality of Logarithmic Loss in Successive Refinement

No, Albert

doi:10.3390/e21020158

Open AccessArticle

Universality of Logarithmic Loss in Successive Refinement^†

by

Albert No

^†

Department of Electronic and Electrical Engineering, Hongik University, Seoul 04066, Korea

^†

This paper is an extended version of our paper published in the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015.

Entropy 2019, 21(2), 158; https://doi.org/10.3390/e21020158

Submission received: 19 December 2018 / Revised: 1 February 2019 / Accepted: 7 February 2019 / Published: 8 February 2019

(This article belongs to the Special Issue Multiuser Information Theory II)

Download

Browse Figure

Versions Notes

Abstract

:

We establish an universal property of logarithmic loss in the successive refinement problem. If the first decoder operates under logarithmic loss, we show that any discrete memoryless source is successively refinable under an arbitrary distortion criterion for the second decoder. Based on this result, we propose a low-complexity lossy compression algorithm for any discrete memoryless source.

Keywords:

logarithmic loss; rate-distortion; successive refinability

1. Introduction

In the lossy compression problem, logarithmic loss is a criterion allowing a “soft” reconstruction of the source, a departure from the classical setting of a deterministic reconstruction. In this setting, the reconstruction alphabet is the set of probability distributions over the source alphabet. More precisely, let x be the source symbol from the source alphabet

X

, and

q (\cdot)

be the reconstruction symbol which is the probability measure on

X

. Then the logarithmic loss is given by

\begin{matrix} ℓ (x, q) = log \frac{1}{q (x)} . \end{matrix}

Clearly, if the reconstruction

q (\cdot)

has a small probability on the true source symbol x, the amount of loss will be large.

Although logarithmic loss plays a crucial role in the theory of learning and prediction, relatively little work has been done in the context of lossy compression, notwithstanding the two-encoder multi-terminal source coding problem under logarithmic loss [1,2], or recent work on the single-shot approach to lossy source coding under logarithmic loss [3]. Note that lossy compression under logarithmic loss is closely related to the information bottleneck method [4,5,6]. In this paper, we focus on universal properties of logarithmic loss in the context of successive refinement.

Successive refinement is a network lossy compression problem where one encoder wishes to describe the source to two decoders [7,8]. Instead of having two separate coding schemes, the successive refinement encoder designs a code for the decoder with a weaker link, and sends extra information to the second decoder on top of the message of the first decoder. In general, successive refinement coding cannot do as well as two separate encoding schemes optimized for the respective decoders. However, if we can achieve the point-to-point optimum rates using successive refinement coding, we say the source is successively refinable.

Although necessary and sufficient conditions of successive refinability is known [7,8], proving (or disproving) successive refinability of the source is not a simple task. Equitz and Cover [7] found a discrete source that is not successively refinable using Gerrish problem [9]. Chow and Berger found a continuous source that is not successively refinable using Gaussian mixture [10]. Lastras and Berger showed that all sources are nearly successively refinable [11]. However, still only a few sources are known to be successively refinable. In this paper, we show that any discrete memoryless source is successively refinable as long as the weaker link employs logarithmic loss, regardless of the distortion criterion used for the stronger link.

In the second part of the paper, we show that this result can be useful to design a lossy compression algorithm with low complexity. Recently, the idea of successive refinement is applied to reduce the complexity of point-to-point lossy compression algorithm. Venkataramanan et al. proposed a new lossy compression for Gaussian source where the codewords are linear combination of sub-codewords [12]. No and Weissman also proposed a low-complexity lossy compression algorithm for Gaussian source using extreme value theory [13]. Both algorithms are successively describing source and achieve low complexity. Roughly speaking, successive refinement algorithm provides a smaller size of codebook. For example, the naive random coding scheme has a codebook of size

e^{n R}

when the blocklength is n and the rate is R. On the other hand, if we can design a successive refinement scheme with half rate in the weaker link, then the size of codebook is

e^{n R / 2}

each. Thus, the overall codebook size is

2 e^{n R / 2}

. The above idea can be generalized to successive refinement scheme with L decoders [12,14]

The universal property of logarithmic loss in successive refinement implies that, for any point-to-point lossy compression of discrete memoryless source, we can insert a virtual intermediate decoder (weaker link) under logarithmic loss without losing any rates at the actual decoder (stronger link). As we discussed, this property allows us to design a lossy compression algorithm with low-complexity for any discrete source and distortion pair. Note that previous works only focused on specific source and distortion pair such as binary source with Hamming distortion.

The remainder of the paper is organized as follows. In Section 2, we revisit some of the known results pertaining to logarithmic loss. Section 3 is dedicated to successive refinement under logarithmic loss in the weaker link. In Section 4, we propose a low complexity compression scheme that can be applied to any discrete lossy compression problem. Finally, we conclude in Section 5.

Notation:

X^{n}

denotes an n-dimensional random vector

(X_{1}, X_{2}, \dots, X_{n})

while

x^{n}

denotes a specific possible realization of the random vector

X^{n}

.

X

denotes a support of random variable X. Also, Q denotes a random probability mass function while q denotes a specific probability mass function. We use natural logarithm and nats instead of bits.

2. Preliminaries

2.1. Successive Refinability

In this section, we review the successive refinement problem with two decoders. Let the source

X^{n}

be i.i.d. random vector with distribution

p_{X}

. The encoder wants to describe

X^{n}

to two decoders by sending a pair of messages

(m_{1}, m_{2})

where

1 \leq m_{i} \leq M_{i}

for

i \in {1, 2}

. The first decoder reconstructs

{\hat{X}}_{1}^{n} (m_{1}) \in {\hat{X}}_{1}^{n}

based only on the first message

m_{1}

. The second decoder reconstructs

{\hat{X}}_{2}^{n} (m_{1}, m_{2}) \in {\hat{X}}_{2}^{n}

based on both

m_{1}

and

m_{2}

. The setting is described in Figure 1.

Let

d_{i} (\cdot, \cdot) : X \times {\hat{X}}_{i} \to [0, \infty)

be a distortion measure for i-th decoder. The rates of code

(R_{1}, R_{2})

are simply defined as

\begin{matrix} R_{1} = & \frac{1}{n} log M_{1} \\ R_{2} = & \frac{1}{n} log M_{1} M_{2} . \end{matrix}

An

(n, R_{1}, R_{2}, D_{1}, D_{2}, ϵ)

-successive refinement code is a coding scheme with block length n and excess distortion probability

ϵ

where rates are

(R_{1}, R_{2})

and target distortions are

(D_{1}, D_{2})

. Since we have two decoders, the excess distortion probability is defined by

\Pr [d_{i} (X^{n}, {\hat{X}}_{i}^{n}) > D_{i} for some i]

.

Definition 1.

A rate-distortion tuple

(R_{1}, R_{2}, D_{1}, D_{2})

is said to be achievable if there is a family of

(n, R_{1}^{(n)}, R_{2}^{(n)}

,

D_{1}, D_{2}, ϵ^{(n)})

-successive refinement code where

\begin{matrix} lim_{n \to \infty} R_{i}^{(n)} = R_{i} f o r a l l i, \\ lim_{n \to \infty} ϵ^{(n)} = 0 . \end{matrix}

For some special cases, both decoders can achieve the point-to-point optimum rates simultaneously.

Definition 2.

Let

R_{i} (D_{i})

denote the rate-distortion function of the i-th decoder for

i \in {1, 2}

. If the rate-distortion tuple

(R_{1} (D_{1}), R_{2} (D_{2}), D_{1}, D_{2})

is achievable, then we say the source is successively refinable at

(D_{1}, D_{2})

. If the source is successively refinable at

(D_{1}, D_{2})

for all

D_{1}, D_{2}

, then we say the source is successively refinable.

The following theorem provides a necessary and sufficient condition of successive refinable sources.

Theorem 1 ([7,8]).

A source is successively refinable at

(D_{1}, D_{2})

if and only if there exists a conditional distribution

p_{{\hat{X}}_{1}, {\hat{X}}_{2} | X}

such that

X - {\hat{X}}_{2} - {\hat{X}}_{1}

forms a Markov chain and

\begin{matrix} R_{i} (D_{i}) & = I (X; {\hat{X}}_{i}) \\ E [d_{i} (X, {\hat{X}}_{i})] & \leq D_{i} \end{matrix}

for

i \in {1, 2}

.

Note that the above results of successive refinability can easily be generalized to the case of k decoders.

2.2. Logarithmic Loss

Let

X

be a set of discrete source symbols (

| X | < \infty

), and

M (X)

be the set of probability measures on

X

. Logarithmic loss

ℓ : X \times M (X) \to [0, \infty]

is defined by

\begin{matrix} ℓ (x, q) = log \frac{1}{q (x)} \end{matrix}

for

x \in X

and

q \in M (X)

. Logarithmic loss between n-tuples is defined by

\begin{matrix} ℓ_{n} (x^{n}, q^{n}) = \frac{1}{n} \sum_{i = 1}^{n} log \frac{1}{q_{i} (x_{i})}, \end{matrix}

i.e., the symbol-by-symbol extension of the single letter loss.

Let

X^{n}

be the discrete memoryless source with distribution

p_{X}

. Consider the lossy compression problem under logarithmic loss where the reconstruction alphabet is

M (X)

. The rate-distortion function is given by

\begin{matrix} R (D) = & inf_{p_{Q | X} : E [ℓ (X, Q)] \leq D} I (X; Q) \\ = & H (X) - D . \end{matrix}

The following lemma provides a property of the rate-distortion function achieving conditional distribution.

Lemma 1.

The rate-distortion function achieving conditional distribution

p_{Q^{⋆} | X}

satisfies

\begin{matrix} p_{X | Q^{⋆}} (\cdot | q) = & q \end{matrix}

(1)

\begin{matrix} H (X | Q^{⋆}) = & D \end{matrix}

(2)

for

p_{Q^{⋆}}

almost every

q \in M (X)

. Conversely, if

p_{Q | X}

satisfies (1) and (2), then it is a rate-distortion function achieving conditional distribution, i.e.,

\begin{matrix} I (X; Q) = & R (D) = H (X) - D \\ E [ℓ (X, Q)] = & D . \end{matrix}

The key idea is that we can replace Q by

p_{X | Q} (\cdot | Q)

, and have lower rate and distortion, i.e.,

\begin{matrix} I (X; Q) \geq & I (X; p_{X | Q} (\cdot | Q)) \\ E [ℓ (X, Q)] \geq & E [ℓ (X, p_{X | Q} (\cdot | Q)], \end{matrix}

which directly implies (1).

Interestingly, since the rate-distortion function in this case is a straight line, a simple time sharing scheme achieves the optimal rate-distortion trade-off. More precisely, the encoder losslessly compresses only the first

\frac{H (X) - D}{H (X)}

fraction of the source sequence components. Then, the decoder perfectly recovers those losslessly compressed components and uses

p_{X}

as its reconstruction for the remaining part. The resulting scheme obviously achieves distortion D with rate

H (X) - D

.

Furthermore, this simple scheme directly implies successive refinability of the source. For

D_{1} > D_{2}

, suppose the encoder losslessly compresses the first

\frac{H (X) - D_{2}}{H (X)}

fraction of the source. Then, the first decoder can perfectly reconstruct

\frac{H (X) - D_{1}}{H (X)}

fraction of the source with the message of rate

H (X) - D_{1}

and distortion

D_{1}

while the second decoder can achieve distortion

D_{2}

with rate

H (X) - D_{2}

. Since both decoders can achieve the best rate-distortion pair, it follows that any discrete memoryless source under logarithmic loss is successively refinable.

We can formally prove successive refinability of discrete memoryless source under logarithmic loss using Theorem 1. I.e., by finding random probability mass functions

Q_{1}, Q_{2} \in M (X)

that satisfy

\begin{matrix} I (X; Q_{1}) = & H (X) - D_{1}, E [ℓ (X, Q_{1})] = D_{1}, \end{matrix}

(3)

\begin{matrix} I (X; Q_{2}) = & H (X) - D_{2}, E [ℓ (X, Q_{2})] = D_{2}, \end{matrix}

(4)

where

X - Q_{2} - Q_{1}

forms a Markov chain.

Let

e_{x}

be a deterministic probability mass function (pmf) in

M (X)

that has a unit mass at x. In other words,

\begin{matrix} e_{x} (\tilde{x}) = \{\begin{matrix} 1 & if \tilde{x} = x \\ 0 & otherwise . \end{matrix} \end{matrix}

Then, consider random pmfs

Q_{1}, Q_{2} \in {e_{x} : x \in X} \cup {p_{X}}

. Since the support of

Q_{1}

and

Q_{2}

is finite, we can define the following conditional pmfs.

\begin{matrix} p_{Q_{2} | X} (q_{2} | x) = & \{\begin{matrix} \frac{H (X) - D_{2}}{H (X)} & if q_{2} = e_{x} \\ \frac{D_{2}}{H (X)} & if q_{2} = p_{X} \\ 0 & otherwise \end{matrix} \\ p_{Q_{1} | Q_{2}} (q_{1} | q_{2}) = & \{\begin{matrix} \frac{H (X) - D_{1}}{H (X) - D_{2}} & if q_{1} = q_{2} = e_{x} for some x \\ \frac{D_{1} - D_{2}}{H (X) - D_{2}} & if q_{1} = p_{X} and q_{2} = e_{x} for some x \\ 1 & if q_{1} = q_{2} = p_{X} \\ 0 & otherwise . \end{matrix} \end{matrix}

It is not hard to show that the above conditional pmfs satisfies (3) and (4).

3. Successive Refinability

Main Results

Consider the successive refinement problem with a discrete memoryless source as described in Section 2.1. Specifically, we are interested in the case where the first decoder is under logarithmic loss and the second decoder is under some arbitrary distortion measure

d (\cdot, \cdot)

. We only have a following benign assumption that if

d (x, {\hat{x}}_{1}) = d (x, {\hat{x}}_{2})

for all x, then

{\hat{x}}_{1} = {\hat{x}}_{2}

. This is not a hard restriction since if

{\hat{x}}_{1}

and

{\hat{x}}_{2}

have the same distortion values for all x, then there is no reason to have both reconstruction symbols.

The following theorem shows that any discrete memoryless source is successive refinable as long as the weaker link is under logarithmic loss. This implies an universal property of logarithmic loss in the context of successive refinement.

Theorem 2.

Let the source be arbitrary discrete memoryless. Suppose the distortion criterion of the first decoder is logarithmic loss while that of the second decoder is an arbitrary distortion criterion

d : X \times \hat{X} \to [0, \infty]

. Then the source is successively refinable.

Proof.

The source is successively refinable at

(D_{1}, D_{2})

if and only if there exists a

X - \hat{X} - Q

such that

\begin{matrix} I (X; Q) = & R_{1} (D_{1}), E [ℓ (X, Q)] \leq D_{1} \\ I (X; \hat{X}) = & R_{2} (D_{2}), E [d (X, \hat{X})] \leq D_{2} . \end{matrix}

Let

p_{{\hat{X}}^{⋆} | X}

be the conditional distribution for the second decoder that achieves the informational rate-distortion function

R_{2} (D_{2})

. i.e.,

\begin{matrix} I (X; {\hat{X}}^{⋆}) = R_{2} (D_{2}), E [d (X, {\hat{X}}^{⋆})] = D_{2} . \end{matrix}

Since the weaker link is under logarithmic loss, we have

R_{1} (D_{1}) \leq R_{2} (D_{2})

. This implies that

H (X) - D_{1} \leq H (X) - H (X | {\hat{X}}^{⋆})

. Thus, we can assume

H (X | {\hat{X}}^{⋆}) \leq D_{1}

throughout the proof. For simplicity, we further have a benign assumption that there is no

\hat{x}

such that

p_{X} (x) = p_{X | {\hat{X}}^{⋆}} (x | \hat{x})

for all x. (See Remark 1 for the case where such

\hat{x}

exists.)

Without loss of generality, suppose

\hat{X} = {0, 1, \dots, s - 1}

. Consider a random variable

Y \in Y = {0, 1, \dots, s}

with the following pmf for some

0 \leq ϵ \leq 1

:

\begin{matrix} p_{Y} (y) = \{\begin{matrix} (1 - ϵ) p_{{\hat{X}}^{⋆}} (y) & if y \leq s - 1 \\ ϵ & if y = s . \end{matrix} \end{matrix}

The conditional distribution is given by

\begin{matrix} p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) = \{\begin{matrix} 1 & if \hat{x} = y \leq s - 1 \\ 0 & if \hat{x} \neq y \leq s - 1 \\ p_{{\hat{X}}^{⋆}} (\hat{x}) & if y = s . \end{matrix} \end{matrix}

The joint distribution of

X, {\hat{X}}^{⋆}, Y

is given by

\begin{matrix} p_{X, {\hat{X}}^{⋆}, Y} (x, \hat{x}, y) = p_{X, {\hat{X}}^{⋆}} (x, \hat{x}) p_{Y | {\hat{X}}^{⋆}} (y | \hat{x}) . \end{matrix}

It is clear that

H (X | Y) = H (X | {\hat{X}}^{⋆})

if

ϵ = 0

and

H (X | Y) = H (X)

if

ϵ = 1

. Since

H (X | {\hat{X}}^{⋆}) \leq D_{1}

, there exists an

0 \leq ϵ \leq 1

such that

H (X | Y) = D_{1}

.

We are now ready to define the Markov chain. Let

Q = p_{X | Y} (\cdot | Y)

and

q^{(y)} = p_{X | Y} (\cdot | y)

for all

y \in Y

. The following lemma implies that there is a one-to-one mapping between q and y.

Lemma 2.

If

p_{X | Y} (x | y_{1}) = p_{X | Y} (x | y_{2})

for all

x \in X

, then

y_{1} = y_{2}

.

The proof of lemma is given in Appendix A. Since

Q = p_{X | Y} (\cdot | Y)

is a one-to-one mapping, we have

\begin{matrix} I (X; Q) = & I (X; Y) = H (X) - D_{1} = R_{1} (D_{1}) . \end{matrix}

Also, we have

\begin{matrix} E [ℓ (X, Q)] = & E [log \frac{1}{p_{X | Y} (X | Y)}] = H (X | Y) = D_{1} . \end{matrix}

Furthermore,

X - {\hat{X}}^{⋆} - Q

forms a Markov chain since

X - {\hat{X}}^{⋆} - Y

forms a Markov chain. This concludes the proof. □

The key idea of the theorem is that (1) is the only loose required condition for the rate-distortion function achieving conditional distribution. Thus, for any distortion criterion in the second stage, we are able to choose an appropriate distribution

p_{X, \hat{X}, Q}

that satisfies both (1) and the condition for successive refinability.

Remark 1.

The assumption

p_{X | {\hat{X}}^{⋆}} (\cdot | \hat{x}) \neq p_{X} (\cdot)

for all

\hat{x}

is not necessary. Appendix B shows another joint distribution

p_{X, {\hat{X}}^{⋆}, Y}

that satisfies conditions for successive refinability when the above assumption does not hold.

The distribution in the above proof is one simple example that has a single parameter ϵ, but we can always find other distributions that satisfy the condition for successive refinability. In the next section, we propose totally different distribution that achieves a Markov chain

X - {\hat{X}}^{⋆} - Y

with

H (X | Y) = D_{1}

. This implies that the above proof does not rely on the assumption.

Remark 2.

In the proof, we used random variable Y to define

Q = p_{X | Y} (\cdot | Y)

. On the other hand, if the joint distribution

p_{X, {\hat{X}}^{⋆}, Q}

satisfies conditions of successive refinability, there exists a random variable Y where

X - {\hat{X}}^{⋆} - Y

forms a Markov chain and

Q = p_{X | Y} (\cdot | Y)

. This is simply because we can set

Y = Q

, which implies

p_{X | Y} (\cdot | Y) = p_{X | Q} (\cdot | Q) = Q

.

Theorem 2 can be generalized to successive refinement problem with K intermediate decoders. Consider random variables

Y_{k} \in Y

for

1 \leq k \leq K

such that

X - {\hat{X}}^{⋆} - Y_{K} - \dots - Y_{1}

forms a Markov chain and the joint distribution of

X, {\hat{X}}^{⋆}, Y_{1}, \dots, Y_{K}

is given by

\begin{matrix} p_{X, {\hat{X}}^{⋆}, Y_{1}, \dots, Y_{K}} (x, \hat{x}, y_{1}, \dots, y_{K}) \\ = p_{X, {\hat{X}}^{⋆}} (x, \hat{x}) p_{Y_{1} | {\hat{X}}^{⋆}} (y_{1} | \hat{x}) \prod_{k = 1}^{K - 1} p_{Y_{k + 1} | Y_{k}} (y_{k + 1} | y_{k}) \end{matrix}

where

H (X | Y_{k}) = D_{k}

. Similar to the proof of Theorem 2, we can show that

Q_{k} = p_{X | Y_{k}} (\cdot | Y_{k})

for all

1 \leq k \leq K

satisfy the condition for successive refinability (where posterior distributions

p_{X | Y_{k}} (\cdot | y_{k})

should be distinct for all

y_{k} \in Y

to guarantee one-to-one correspondence). Thus, we can conclude that any discrete memoryless source with K intermediate decoders is successively refinable as long as all the intermediate decoders are under logarithmic loss.

4. Toward Lossy Compression with Low Complexity

As we mentioned in Remark 1, the choice of joint distribution

p_{X, {\hat{X}}^{⋆}, Q}

in the proof of Theorem 2 is not unique. In this section, we propose another joint distribution

p_{X, {\hat{X}}^{⋆}, Q}

that satisfies the conditions for successive refinability. It naturally suggests a new lossy compression algorithm which we will discuss in Section 4.3.

4.1. Rate-Distortion Achieving Joint Distribution: Small $D_{1}$

Recall that

H (X | {\hat{X}}^{⋆}) \leq D_{1}

. We first consider the case where

D_{1}

is not too large so that

D_{1}

is close to

H (X | {\hat{X}}^{⋆})

. We will clarify this later. For simplicity, we further assume that

p_{{\hat{X}}^{⋆}} (0) \geq \dots \geq p_{{\hat{x}}^{⋆}} (s - 1)

. Consider a random variable

Z_{ϵ}^{(s)} \in \hat{X}

with the following pmf for some

0 \leq ϵ \leq (s - 1) {min}_{\hat{x}} p_{{\hat{X}}^{⋆}} (\hat{x})

\begin{matrix} p_{Z_{ϵ}^{(s)}} (z) = \{\begin{matrix} 1 - ϵ & if z = 0 \\ \frac{ϵ}{s - 1} & if 1 \leq z \leq s - 1 . \end{matrix} \end{matrix}

If it is clear from context, we simply use

Z \equiv Z_{ϵ}^{(s)}

for the sake of notation. We further define a random variable Y that is independent to Z such that

{\hat{X}}^{⋆} = Y \oplus_{s} Z

, where

\oplus_{s}

denotes a sum modulo s. This can be achieved by following pmf and conditional pmf.

\begin{matrix} p_{Y} (y) = & \frac{p_{{\hat{X}}^{⋆}} (y) - \frac{ϵ}{s - 1}}{1 - \frac{s}{s - 1} ϵ} \\ p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) = & \{\begin{matrix} 1 - ϵ & if \hat{x} = y \\ \frac{ϵ}{s - 1} & if \hat{x} \neq y . \end{matrix} \end{matrix}

(5)

If

ϵ = 0

, we have

H (X | Y) = H (X | {\hat{X}}^{⋆})

. Also, it is clear that

H (X | Y)

increases as

ϵ

increases. Since we assume that

D_{1}

is not too large, there exists

0 \leq ϵ \leq (s - 1) min p_{{\hat{X}}^{⋆}} (\hat{x})

such that

H (X | Y) = D_{1}

. We will discuss about the case of general

D_{1}

in Section 4.2. The joint distribution of

X, {\hat{X}}^{⋆}, Y

is given by

\begin{matrix} p_{X, {\hat{X}}^{⋆}, Y} (x, \hat{x}, y) = p_{X, {\hat{X}}^{⋆}} (x, \hat{x}) p_{Y | {\hat{X}}^{⋆}} (y | \hat{x}) . \end{matrix}

We are now ready to define the Markov chain. Let

Q = p_{X | Y} (\cdot | Y)

and

q^{(y)} = p_{X | Y} (\cdot | y)

for all

y \in Y

where

Y = \hat{X} = {0, 1, \dots s - 1}

. For simplicity, we assume that

p_{X | Y} (\cdot | y_{1})

and

p_{X | Y} (\cdot | y_{2})

are not the same for all

y_{1} \neq y_{2}

. Since

Q = p_{X | Y} (\cdot | Y)

is a one-to-one mapping, we have

\begin{matrix} I (X; Q) = & I (X; Y) = H (X) - D_{1} = R_{1} (D_{1}) . \end{matrix}

Also, we have

\begin{matrix} E [ℓ (X, Q)] = & E [log \frac{1}{p_{X | Y} (X | Y)}] = H (X | Y) = D_{1} . \end{matrix}

Furthermore,

X - {\hat{X}}^{⋆} - Q

forms a Markov chain since

X - {\hat{X}}^{⋆} - Y

forms a Markov chain. Thus, the above construction of joint distribution

p_{X, {\hat{X}}^{⋆}, Q}

satisfies the conditions for successive refinability.

4.2. Rate-Distortion Achieving Joint Distribution: General $D_{1}$

The joint distribution in the previous section only works for small

D_{1}

. It is because

ϵ

has a natural upper-bound from (5) which is

ϵ \leq (s - 1) min p_{{\hat{X}}^{⋆}} (\hat{x})

. In this section, we generalize the proof in the previous section for general

D_{1}

. The key observation is that if we pick the maximum

ϵ = (s - 1) min p_{{\hat{X}}^{⋆}} (\hat{x})

, then

p_{Y} (s - 1) = 0

. This implies that we can focus on the smaller set of reconstruction alphabet

Y = {0, 1, \dots s - 2}

.

Let

U_{s} = {\hat{X}}^{⋆}

, and define random variables

{U_{k} : 1 \leq k \leq s - 1}

recursively. More precisely, we define the random variable

U_{k - 1}

based on

U_{k}

for

2 \leq k \leq s

.

\begin{matrix} U_{k} = & U_{k - 1} \oplus_{k} Z_{ϵ_{k}}^{(k)} \\ p_{Z_{ϵ_{k}}^{(k)}} (z) = & \{\begin{matrix} 1 - ϵ_{k} & if z = 0 \\ \frac{ϵ_{k}}{k - 1} & if 1 \leq z \leq k - 1 \end{matrix} \end{matrix}

where

\begin{matrix} ϵ_{k} = (k - 1) min_{u} p_{U_{k}} (u) . \end{matrix}

Similar to the definition of Y, we assume

U_{k - 1}

and

Z_{ϵ_{k}}^{(k)}

are independent, and

\oplus_{k}

denotes modulo k sum. Each time step, the alphabet size of

U_{k}

decreases by one. Thus, we have

0 \leq U_{k} \leq k - 1

, and therefore

U_{1} = 0

with probability 1. Furthermore, we have

\begin{matrix} H (X | U_{s}) \leq H (X | U_{s - 1}) \leq \dots \leq H (X | U_{1}) = H (X) . \end{matrix}

For

H (X | {\hat{X}}^{⋆}) \leq D_{1} < H (x)

, there exists k such that

H (X | U_{k}) > D_{1} \geq H (X | U_{k - 1})

. Thus, there exists Y that satisfies

H (X | Y) = D_{1}

and

U_{k} = Y \oplus_{k} Z_{ϵ}^{(k)}

for some

0 \leq ϵ \leq ϵ_{k}

. This implies that

\begin{matrix} {\hat{X}}^{⋆} = Z_{ϵ_{s}}^{(s)} \oplus_{s} [Z_{ϵ_{s - 1}}^{(s - 1)} \oplus_{s - 1} \dots (Z_{ϵ_{k + 1}}^{(k + 1)} \oplus_{k + 1} (Z_{ϵ}^{(k)} \oplus_{k} Y))] . \end{matrix}

Similar to the previous section, we assume that

p_{X | Y} (\cdot | y_{1}) \neq p_{X | Y} (\cdot | y_{2})

if

y_{1} \neq y_{2}

. Then, we can set

Q = p_{X | Y} (\cdot | Y)

which satisfies the conditions for successive refinability.

4.3. Iterative Lossy Compression Algorithm

The joint distribution from the previous section naturally suggests a simple successive refinement scheme. Consider the lossy compression problem where the source is i.i.d.

\sim p_{X}

and the distortion measure is

d : X \times \hat{X} \to [0, \infty)

. Let D be the target distortion, and

R > R (D)

be the rate of the scheme where

R (D)

is the rate-distortion function. Let

p_{X, {\hat{X}}^{⋆}}

be the rate-distortion achieving distribution.

For block length n, we propose a new lossy compression scheme that mimics successive refinement with

s - 1

decoders. Similar to the previous section, let

ϵ_{k} = (k - 1) {min}_{u} p_{U_{k}} (u)

and

\begin{matrix} {\hat{X}}^{⋆} = U_{s} = & U_{s - 1} \oplus_{s} Z_{ϵ_{s}}^{(s)} \\ U_{s - 1} = & U_{s - 2} \oplus_{s - 1} Z_{ϵ_{s - 1}}^{(s - 1)} \\ ⋮ \\ U_{2} = & U_{1} \oplus_{2} Z_{ϵ_{2}}^{(2)} . \end{matrix}

We further let

R_{k - 1} > I (X; U_{k}) - I (X; U_{k - 1})

for

2 \leq k \leq s

that satisfy

R = \sum_{k = 2}^{s} R_{k - 1}

. Now, we are ready to describe our coding scheme. Generate a sub-codebook

C_{1} = {z^{n} (1, m) : 1 \leq m \leq e^{R_{1}}}

where each sequence is generated according to

Z^{n} \sim

i.i.d.

p_{Z_{ϵ_{2}}^{(2)}}

for all m. Similarly, generate sub-codebooks

C_{k} = {z^{n} (k, m) : 1 \leq m \leq e^{n R_{k}}}

for

2 \leq k \leq s - 1

where each sequence is generated according to

Z^{n} \sim

i.i.d.

p_{Z_{ϵ_{k + 1}}^{(k + 1)}}

for all m.

Upon observing

x^{n} \in X^{n}

, the encoder finds

m_{1} \in C_{1}

that minimizes

d_{1} (x^{n}, z^{n} (1, m_{1}))

where the distortion measure

d_{1} (\cdot, \cdot)

is defined as follows.

\begin{matrix} d_{1} (x^{n}, z^{n}) = \frac{1}{n} \sum_{i = 1}^{n} log \frac{1}{p_{X | U_{2}} (x_{i} | z_{i})} . \end{matrix}

Note that

d_{1} (x, z)

is simply the logarithmic loss between x and

p_{X | U_{2}} (\cdot | z)

.

Similarly, for

2 \leq k \leq s - 1

, the encoder iteratively finds

m_{k} \in C_{k}

that minimizes

d_{k} (x^{n}, [[z^{n} (1, m_{1}) \oplus_{3} \dots \oplus_{k} z^{n} (k - 1, m_{k - 1})] \oplus_{k + 1} z^{n} (k, m_{k})])

where

\begin{matrix} d_{k} (x^{n}, z^{n}) = \frac{1}{n} \sum_{i = 1}^{n} log \frac{1}{p_{X | U_{k + 1}} (x_{i} | z_{i})} . \end{matrix}

Upon receiving

m_{1}, m_{2}, \dots, m_{s - 1}

, the decoder reconstructs

\begin{matrix} {\hat{x}}^{n} = [[z^{n} (1, m_{1}) \oplus_{3} z^{n} (2, m_{2})] \oplus \dots \oplus_{s} z^{n} (s - 1, m_{s - 1})] . \end{matrix}

Suppose

R_{1} \approx R_{2} \approx \dots \approx R_{s - 1} \approx \frac{R}{s - 1}

, and

L = s - 1

. Similar to [12,14], this scheme has two main advantages compare to naive random coding scheme. First, the number of codewords in the proposed scheme is

L \cdot e^{n R / L}

, while the naive scheme requires

e^{n R}

codewords. Also, in each iteration, the encoder finds the best codeword among

e^{n R / L}

sub-codewords. Thus, the overall complexity is

L \cdot e^{n R / L}

as well. On the other hand, the naive scheme requires

e^{n R}

complexity.

Remark 3.

The proposed scheme constructs

{\hat{X}}^{n}

from binary sequences. The reconstruction after each stage can be viewed as

\begin{matrix} u_{k}^{n} (m_{1}, \dots m_{k - 1}) = [[z^{n} (1, m_{1}) \oplus_{3} \dots] \oplus_{k} z^{n} (k - 1, m_{k - 1})] \end{matrix}

where

0 \leq u_{k} \leq k - 1

. Thus, the decoder starts from binary sequence

u_{2}^{n} (m_{1})

, and the alphabet size increases by 1 at each iteration. After

(s - 1)

-th iteration, it reaches the final reconstruction

{\hat{X}}^{n}

where the size of alphabet is s.

5. Conclusions

To conclude our discussion, we summarize our main contributions. In the context of successive refinement problem, we showed another universal property of logarithmic loss that any discrete memoryless source is successively refinable as long as the intermediate decoders operate under logarithmic loss. We applied the result to the point-to-point lossy compression problem and proposed a lossy compression scheme with lower complexity.

Funding

This work was supported by 2017 Hongik University Research Fund.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Lemma 2

For

y \leq s - 1

,

\begin{matrix} p_{X | Y} (x | y) = & \sum_{\hat{x}} p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) \\ = & p_{X | {\hat{X}}^{⋆}} (x | y) . \end{matrix}

On the other hand, for

y = s

,

\begin{matrix} p_{X | Y} (x | y) = & \sum_{\hat{x}} p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) \\ = & \sum_{\hat{x}} p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) p_{{\hat{X}}^{⋆}} (\hat{x}) \\ = & p_{X} (x) . \end{matrix}

Let

j_{X} (x, D)

be d-tilted information [15]:

\begin{matrix} j_{X} (x, D) = log \frac{1}{E [exp {λ^{⋆} D - λ^{⋆} d (x, {\hat{X}}^{⋆})}]} . \end{matrix}

where

λ^{⋆} = - R^{'} (D)

and the expectation is with respect to marginal distribution

p_{{\hat{X}}^{⋆}}

. Csiszár [16] showed that for

p_{{\hat{X}}^{⋆}}

-almost every

\hat{x}

,

\begin{matrix} j_{X} (x, D) = & log \frac{p_{X, {\hat{X}}^{⋆}} (x, \hat{x})}{p_{X} (x) p_{{\hat{X}}^{⋆}} (\hat{x})} + λ^{⋆} d (x, \hat{x}) - λ^{⋆} D \\ = & log \frac{p_{X | {\hat{X}}^{⋆}} (x | \hat{x})}{p_{X} (x)} + λ^{⋆} d (x, \hat{x}) - λ^{⋆} D . \end{matrix}

If

p_{X | {\hat{X}}^{⋆}} (x | {\hat{x}}_{1}) = p_{X | {\hat{X}}^{⋆}} (x | {\hat{x}}_{2})

for all x, it implies that

d (x, {\hat{x}}_{1}) = d (x, {\hat{x}}_{2})

for all x which contradicts our assumption. On the other hand, if

p_{X | {\hat{X}}^{⋆}} (x | {\hat{x}}_{1}) = p_{X} (x)

for all x, it also contradicts our assumption. Thus,

p_{X | Y} (\cdot | y)

are different from each other for all

0 \leq y \leq s

.

Appendix B. Proof of the Special Case of Theorem 2

Similar to the main proof of Theorem 2, we assume

\hat{X} = Y = {0, 1, \dots, s - 1}

. Suppose there exists

\hat{x}

such that

p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) = p_{X} (x)

for all x. Without loss of generality, we assume

\hat{x} = 0

, i.e.,

p_{X | {\hat{X}}^{⋆}} (x | 0) = p_{X} (x)

for all x.

Consider a random variable

Y \in Y

with the following conditional pmf for some

0 \leq ϵ \leq 1

:

\begin{matrix} p_{Y | {\hat{X}}^{⋆}} (y | \hat{x}) = \{\begin{matrix} 1 & if \hat{x} = y = 0 \\ ϵ & if \hat{x} \neq 0 and y = 0 \\ 1 - ϵ & if \hat{x} = y \neq 0 . \\ 0 & otherwise . \end{matrix} \end{matrix}

It is clear that

H (X | Y) = H (X | {\hat{X}}^{⋆})

if

ϵ = 0

and

H (X | Y) = H (X)

if

ϵ = 1

. Since

H (X | {\hat{X}}^{⋆}) \leq D_{1}

, there exists an

0 \leq ϵ \leq 1

such that

H (X | Y) = D_{1}

. We also have

Q = p_{X | Y} (\cdot | Y)

and

q^{(y)} = p_{X | Y} (\cdot | y)

for all

y \in Y

. The following lemma implies the one-to-one mapping between q and y.

Lemma A1.

If

p_{X | Y} (x | y_{1}) = p_{X | Y} (x | y_{2})

for all

x \in X

, then

y_{1} = y_{2}

.

Proof.

If

y = 0

, the conditional distribution

p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y)

is given by

\begin{matrix} p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) = \{\begin{matrix} \frac{p_{{\hat{X}}^{⋆}} (0)}{(1 - ϵ) p_{{\hat{X}}^{⋆}} (0) + ϵ} & if \hat{x} = 0 \\ \frac{ϵ \cdot p_{{\hat{X}}^{⋆}} (\hat{x})}{(1 - ϵ) p_{{\hat{X}}^{⋆}} (0) + ϵ} & if \hat{x} \neq 0 . \end{matrix} \end{matrix}

Then,

\begin{matrix} p_{X | Y} (x | y) = & \sum_{\hat{x}} p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) p_{{\hat{X}}^{⋆} | Y} (\hat{x} | 0) \\ = & p_{X | {\hat{X}}^{⋆}} (x | 0) \frac{p_{{\hat{X}}^{⋆}} (0)}{(1 - ϵ) p_{{\hat{X}}^{⋆}} (0) + ϵ} + \sum_{\hat{x} \neq 0} p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) \frac{ϵ \cdot p_{{\hat{X}}^{⋆}} (\hat{x})}{(1 - ϵ) p_{{\hat{X}}^{⋆}} (0) + ϵ} \\ = & p_{X | {\hat{X}}^{⋆}} (x | 0) \frac{(1 - ϵ) \cdot p_{{\hat{X}}^{⋆}} (0)}{(1 - ϵ) p_{{\hat{X}}^{⋆}} (0) + ϵ} + p_{X} (x) \frac{ϵ}{(1 - ϵ) p_{{\hat{X}}^{⋆}} (0) + ϵ} \\ = & p_{X | {\hat{X}}^{⋆}} (x | 0) \end{matrix}

where the last equality is because

p_{X | {\hat{X}}^{⋆}} (x | 0) = p_{X} (x)

for all x. In other words,

p_{X | Y} (x | 0) = p_{X | {\hat{X}}^{⋆}} (x | 0)

.

On the other hand, if

y \neq 0

, the conditional distribution

p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y)

is given by

\begin{matrix} p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) = \{\begin{matrix} 1 & if \hat{x} = y \\ 0 & otherwise . \end{matrix} \end{matrix}

Then,

\begin{matrix} p_{X | Y} (x | y) = & \sum_{\hat{x}} p_{X | {\hat{X}}^{⋆}} (x | \hat{x}) p_{{\hat{X}}^{⋆} | Y} (\hat{x} | y) \\ = & p_{X | {\hat{X}}^{⋆}} (x | y) . \end{matrix}

As we have seen in Appendix A,

p_{X | {\hat{X}}^{⋆}} (\cdot | {\hat{x}}_{1})

cannot be equal to

p_{X | {\hat{X}}^{⋆}} (\cdot | {\hat{x}}_{2})

if

{\hat{x}}_{1} \neq {\hat{x}}_{2}

. Since

p_{X | Y} (x | y) = p_{X | {\hat{X}}^{⋆}} (x | y)

for all x, we can say that

p_{X | Y} (x | y_{1}) = p_{X | Y} (x | y_{2})

for all x implies

y_{1} = y_{2}

. □

The remaining part of the proof is exactly the same as the main proof.

References

Courtade, T.A.; Wesel, R.D. Multiterminal source coding with an entropy-based distortion measure. In Proceedings of the 2011 IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 31 July–5 August 2011; pp. 2040–2044. [Google Scholar]
Courtade, T.; Weissman, T. Multiterminal Source Coding Under Logarithmic Loss. IEEE Trans. Inf. Theory 2014, 60, 740–761. [Google Scholar] [CrossRef]
Shkel, Y.Y.; Verdú, S. A single-shot approach to lossy source coding under logarithmic loss. IEEE Trans. Inf. Theory 2018, 64, 129–147. [Google Scholar] [CrossRef]
Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377. [Google Scholar]
Harremoës, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–570. [Google Scholar]
Gilad-Bachrach, R.; Navot, A.; Tishby, N. An information theoretic tradeoff between complexity and accuracy. In Learning Theory and Kernel Machines; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609. [Google Scholar]
Equitz, W.H.; Cover, T.M. Successive refinement of information. IEEE Trans. Inf. Theory 1991, 37, 269–275. [Google Scholar] [CrossRef]
Koshelev, V. Hierarchical coding of discrete sources. Probl. Peredachi Inf. 1980, 16, 31–49. [Google Scholar]
Gerrish, A.M. Estimation of Information Rates. Ph.D. Thesis, Yale University, New Haven, CT, USA, 1963. [Google Scholar]
Chow, J.; Berger, T. Failure of successive refinement for symmetric Gaussian mixtures. IEEE Trans. Inf. Theory 1997, 43, 350–352. [Google Scholar] [CrossRef]
Lastras, L.; Berger, T. All sources are nearly successively refinable. IEEE Trans. Inf. Theory 2001, 47, 918–926. [Google Scholar] [CrossRef]
Venkataramanan, R.; Sarkar, T.; Tatikonda, S. Lossy Compression via Sparse Linear Regression: Computationally Efficient Encoding and Decoding. IEEE Trans. Inf. Theory 2014, 60, 3265–3278. [Google Scholar] [CrossRef]
No, A.; Weissman, T. Rateless lossy compression via the extremes. IEEE Trans. Inf. Theory 2016, 62, 5484–5495. [Google Scholar] [CrossRef] [PubMed]
No, A.; Ingber, A.; Weissman, T. Strong successive refinability and rate-distortion-complexity tradeoff. IEEE Trans. Inf. Theory 2016, 62, 3618–3635. [Google Scholar] [CrossRef]
Kostina, V.; Verdú, S. Fixed-length lossy compression in the finite blocklength regime. IEEE Trans. Inf. Theory 2012, 58, 3309–3338. [Google Scholar] [CrossRef]
Csiszár, I. On an extremum problem of information theory. Stud. Sci. Math. Hung. 1974, 9, 57–71. [Google Scholar]

Figure 1. Successive Refinement.

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

No, A. Universality of Logarithmic Loss in Successive Refinement. Entropy 2019, 21, 158. https://doi.org/10.3390/e21020158

AMA Style

No A. Universality of Logarithmic Loss in Successive Refinement. Entropy. 2019; 21(2):158. https://doi.org/10.3390/e21020158

Chicago/Turabian Style

No, Albert. 2019. "Universality of Logarithmic Loss in Successive Refinement" Entropy 21, no. 2: 158. https://doi.org/10.3390/e21020158

APA Style

No, A. (2019). Universality of Logarithmic Loss in Successive Refinement. Entropy, 21(2), 158. https://doi.org/10.3390/e21020158

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Universality of Logarithmic Loss in Successive Refinement^†

Abstract

1. Introduction

2. Preliminaries

2.1. Successive Refinability

2.2. Logarithmic Loss

3. Successive Refinability

Main Results

4. Toward Lossy Compression with Low Complexity

4.1. Rate-Distortion Achieving Joint Distribution: Small $D_{1}$

4.2. Rate-Distortion Achieving Joint Distribution: General $D_{1}$

4.3. Iterative Lossy Compression Algorithm

5. Conclusions

Funding

Conflicts of Interest

Appendix A. Proof of Lemma 2

Appendix B. Proof of the Special Case of Theorem 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Universality of Logarithmic Loss in Successive Refinement †

Abstract

1. Introduction

2. Preliminaries

2.1. Successive Refinability

2.2. Logarithmic Loss

3. Successive Refinability

Main Results

4. Toward Lossy Compression with Low Complexity

4.1. Rate-Distortion Achieving Joint Distribution: Small D 1

4.2. Rate-Distortion Achieving Joint Distribution: General D 1

4.3. Iterative Lossy Compression Algorithm

5. Conclusions

Funding

Conflicts of Interest

Appendix A. Proof of Lemma 2

Appendix B. Proof of the Special Case of Theorem 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Universality of Logarithmic Loss in Successive Refinement^†

4.1. Rate-Distortion Achieving Joint Distribution: Small $D_{1}$

4.2. Rate-Distortion Achieving Joint Distribution: General $D_{1}$