Abstract
In this work we relax the usual separability assumption made in the rate-distortion literature and propose $f$-separable distortion measures, which are well suited to model non-linear penalties. The main insight behind $f$-separable distortion measures is to define an $n$-letter distortion measure to be an $f$-mean of single-letter distortions. We prove a rate-distortion coding theorem for stationary ergodic sources with $f$-separable distortion measures, and provide some illustrative examples of the resulting rate-distortion functions. Finally, we discuss connections between $f$-separable distortion measures and the subadditive distortion measures previously proposed in the literature.
1. Introduction
Rate-distortion theory, a branch of information theory that studies models for lossy data compression, was introduced by Claude Shannon in [1]. The approach of [1] is to model the information source with a distribution $P_X$ on an alphabet $\mathcal{X}$, a reconstruction alphabet $\hat{\mathcal{X}}$, and a distortion measure $\mathsf{d}\colon \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$. When the information source produces a sequence of $n$ realizations, the source is defined on $\mathcal{X}^n$ with reconstruction alphabet $\hat{\mathcal{X}}^n$, where $\mathcal{X}^n$ and $\hat{\mathcal{X}}^n$ are the $n$-fold Cartesian products of $\mathcal{X}$ and $\hat{\mathcal{X}}$. In that case, [1] extended the notion of a single-letter distortion measure to the $n$-letter distortion measure $\mathsf{d}_n$ by taking an arithmetic average of single-letter distortions,

$$\mathsf{d}_n(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^{n} \mathsf{d}(x_i, \hat{x}_i). \quad (1)$$
Distortion measures that satisfy (1) are referred to as separable (also additive, per-letter, or averaging); the separability assumption has been ubiquitous throughout the rate-distortion literature ever since its inception in [1].
On the one hand, the separability assumption is quite natural and allows for a tractable characterization of the fundamental trade-off between the rate of compression and the average distortion. For example, in the case when $\mathbf{X}$ is a stationary and memoryless source, the rate-distortion function, which captures this trade-off, admits a simple characterization:

$$R(d) = \min_{P_{\hat{X}|X}\colon\, \mathbb{E}[\mathsf{d}(X, \hat{X})] \le d} I(X; \hat{X}).$$
On the other hand, the separability assumption is very restrictive as it only models distortion penalties that are linear functions of the per-letter distortions in the source reproduction. Real-world distortion measures, however, may be highly non-linear; it is desirable to have a theory that also accommodates non-linear distortion measures. To this end, we propose the following definition:
Definition 1 ($f$-separable distortion measure).
Let $f$ be a continuous, increasing function on $[0, \infty)$. An $n$-letter distortion measure $\mathsf{d}^f_n$ is $f$-separable with respect to a single-letter distortion $\mathsf{d}$ if it can be written as

$$\mathsf{d}^f_n(x^n, \hat{x}^n) = f^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} f\big( \mathsf{d}(x_i, \hat{x}_i) \big) \right).$$

For $f(x) = x$ this is the classical separable distortion setup. By selecting $f$ appropriately, it is possible to model a large class of non-linear distortion measures; see Figure 1 for illustrative examples.
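To make Definition 1 concrete, the following minimal Python sketch (our own illustration; the function names are ours, not from the paper) evaluates an $f$-separable distortion for two choices of $f$, using the Hamming single-letter distortion as in Figure 1.

```python
import numpy as np

def f_separable_distortion(x, y, d, f, f_inv):
    # f-separable n-letter distortion of Definition 1:
    # the f-mean of the single-letter distortions d(x_i, y_i).
    per_letter = np.array([d(xi, yi) for xi, yi in zip(x, y)])
    return f_inv(np.mean(f(per_letter)))

hamming = lambda a, b: float(a != b)

x = np.zeros(100, dtype=int)      # a source block of 100 bits
y = x.copy(); y[:10] = 1          # a reconstruction with 10 errors

# f(t) = t recovers the separable (arithmetic-mean) distortion ...
print(f_separable_distortion(x, y, hamming, lambda t: t, lambda t: t))    # 0.1
# ... while f(t) = t**2 penalizes errors super-linearly.
print(f_separable_distortion(x, y, hamming, lambda t: t**2, np.sqrt))     # ~0.316
```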
Figure 1.
The number of reconstruction errors for an information source with 100 bits vs. the penalty assessed by $f$-separable distortion measures based on the Hamming single-letter distortion. The plot for $f(x) = x$ corresponds to the separable distortion. The $f$-separable assumption accommodates all of the other plots, and many more, with the appropriate choice of the function $f$.
In this work, we characterize the rate-distortion function for stationary and ergodic information sources with $f$-separable distortion measures. In the special case of memoryless and stationary sources we obtain the following intuitive result:

$$R_f(d) = \min_{P_{\hat{X}|X}\colon\, \mathbb{E}[f(\mathsf{d}(X, \hat{X}))] \le f(d)} I(X; \hat{X}).$$
A pleasing implication of this result is that much of rate-distortion theory (e.g., the Blahut-Arimoto algorithm) developed since [1] can be leveraged to work under the far more general $f$-separable assumption.
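To illustrate, here is a minimal sketch (our own; the parameter values are illustrative) of how the standard parametric Blahut-Arimoto iteration can be pointed at the modified single-letter distortion $f \circ \mathsf{d}$ to trace out $R_f(d)$; reading the curve at distortion level $f(d)$ is justified by Theorem 1 below.

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=500):
    # Parametric Blahut-Arimoto: returns one (D, R) point (R in bits)
    # on the rate-distortion curve for source p_x and distortion
    # matrix dist[x, y], at Lagrange parameter s < 0.
    q_y = np.full(dist.shape[1], 1.0 / dist.shape[1])   # output marginal
    expsd = np.exp(s * dist)
    for _ in range(n_iter):
        Q = q_y * expsd                   # optimal test channel, unnormalized
        Q /= Q.sum(axis=1, keepdims=True)
        q_y = p_x @ Q                     # update output marginal
    D = float(np.sum(p_x[:, None] * Q * dist))
    ratio = np.where(Q > 0, Q / q_y, 1.0)
    R = float(np.sum(p_x[:, None] * Q * np.log2(ratio)))
    return D, R

# Bernoulli(0.2) source, Hamming distortion, f(t) = t**2. Since Hamming
# takes only the values 0 and 1 (both fixed by this f), the matrix
# f(d(x, y)) is unchanged; only the readout moves from d to f(d) = d**2.
f = lambda t: t ** 2
p_x = np.array([0.8, 0.2])
fd_matrix = f(np.array([[0.0, 1.0], [1.0, 0.0]]))
for s in (-8.0, -4.0, -2.0, -1.0):
    D, R = blahut_arimoto(p_x, fd_matrix, s)
    print(f"d = {np.sqrt(D):.3f}, R_f(d) = {R:.4f} bits")   # d = f^{-1}(D)
```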
The rest of this paper is structured as follows. The remainder of Section 1 overviews related work: Section 1.1 provides the intuition behind Definition 1, Section 1.2 reviews related work in other compression problems, and Section 1.3 connects $f$-separable distortion measures with sub-additive distortion measures. Section 2 formally sets up the problem and demonstrates why convexity of the rate-distortion function does not always hold under the $f$-separable assumption. Section 3 presents our main result, Theorem 1, as well as some illustrative examples. Additional discussion about the problem formulation and sub-additive distortion measures is given in Section 4. We conclude the paper in Section 5.
1.1. Generalized $f$-Mean and Rényi Entropy
To understand the intuition behind Definition 1, consider aggregating $n$ numbers $d_1, \dots, d_n$ by defining a sequence of functions (indexed by $n$)

$$M_n(d_1, \dots, d_n) = f^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} f(d_i) \right), \quad (5)$$

where $f$ is a continuous, increasing function on $[0, \infty)$, $d_i \in [0, \infty)$, and $n = 1, 2, \dots$. It is easy to see that (5) satisfies the following properties:
- $M_n$ is continuous and monotonically increasing in each $d_i$,
- $M_n$ is a symmetric function of $d_1, \dots, d_n$,
- If $d_i = d$ for all $i$, then $M_n(d_1, \dots, d_n) = d$,
- For any $m < n$,

$$M_n(d_1, \dots, d_n) = M_n\big(\underbrace{\mu, \dots, \mu}_{m \text{ times}}, d_{m+1}, \dots, d_n\big), \qquad \mu = M_m(d_1, \dots, d_m).$$
Moreover, it is shown in [2] that any sequence of functions that satisfies these properties must have the form of Equation (5) for some continuous, increasing $f$. The function $M_n$ is referred to as the "Kolmogorov mean", "quasi-arithmetic mean", or "generalized $f$-mean". The most prominent examples are the geometric mean, obtained with $f(x) = \log x$, and the root-mean-square, obtained with $f(x) = x^2$.
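A quick numerical sanity check (ours, not from the paper) that Equation (5) recovers these two classical means:

```python
import numpy as np

def f_mean(values, f, f_inv):
    # generalized f-mean of Equation (5)
    return f_inv(np.mean(f(np.asarray(values, dtype=float))))

v = [1.0, 2.0, 8.0]
print(f_mean(v, np.log, np.exp))        # geometric mean: (1*2*8)**(1/3) ~ 2.52
print(f_mean(v, np.square, np.sqrt))    # root-mean-square: sqrt(23) ~ 4.80
```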
The main insight behind Definition 1 is to define an $n$-letter distortion measure to be an $f$-mean of single-letter distortions. The $f$-separable distortion measures include all $n$-letter distortion measures that satisfy the above properties, with the last property saying that the non-linear "shape" of the distortion measure (cf. Figure 1) is independent of $n$.
Finally, we note that Rényi also arrived at his well-known family of entropies [3] by taking an $f$-mean of the information random variable:

$$H(X) = f^{-1}\left( \sum_{x} P_X(x)\, f\big( \imath_X(x) \big) \right),$$

where the information at $x$ is

$$\imath_X(x) = \log \frac{1}{P_X(x)}.$$
Rényi [3] limited his consideration to functions of the form $f(x) = 2^{(1-\alpha)x}$ in order to ensure that entropy is additive for independent random variables; with this choice, the $f$-mean of the information evaluates to the familiar Rényi entropy of order $\alpha$, $H_\alpha(X) = \frac{1}{1-\alpha} \log_2 \sum_x P_X(x)^\alpha$.
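The following small check (our own) confirms that the $f$-mean of the information with $f(x) = 2^{(1-\alpha)x}$ is exactly the order-$\alpha$ Rényi entropy:

```python
import numpy as np

alpha = 0.5
p = np.array([0.5, 0.25, 0.125, 0.125])
info = np.log2(1.0 / p)                    # information i_X(x), in bits

f = lambda t: 2.0 ** ((1 - alpha) * t)
f_inv = lambda u: np.log2(u) / (1 - alpha)

print(f_inv(np.sum(p * f(info))))                  # f-mean of the information
print(np.log2(np.sum(p ** alpha)) / (1 - alpha))   # Renyi entropy of order alpha
```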
1.2. Compression with Non-Linear Cost
Source coding with a non-linear cost has already been explored in the variable-length lossless compression setting. Let $\ell(x)$ denote the length of the encoding of $x$ by a given variable-length code. Campbell [4,5] proposed minimizing a cost function of the form

$$f^{-1}\big( \mathbb{E}\big[ f\big( \ell(X) \big) \big] \big) \quad (9)$$
instead of the usual expected length. The main result of [4,5] is that for

$$f(x) = 2^{t x}, \qquad t > 0, \quad (10)$$

the fundamental limit of such a setup is the Rényi entropy of order $\frac{1}{1+t}$. For more general $f$, this problem was handled by Kieffer [6], who showed that (9) has a fundamental limit for a large class of functions $f$. That limit is the Rényi entropy of order $\frac{1}{1+t}$ with $t$ given by the exponential growth rate of $f$,

$$t = \lim_{x \to \infty} \frac{1}{x} \log f(x).$$
More recently, a number of works [7,8,9] studied related source coding paradigms, such as guessing and task encoding. These works also focused on the exponential functions given in (10); in [7,8] Rényi entropy is shown to be a fundamental limit yet again.
1.3. Sub-Additive Distortion Measures
A notable departure from the separability assumption in rate-distortion theory is the sub-additive distortion measure discussed in [10]. Namely, a distortion measure is sub-additive if

$$(m + n)\, \mathsf{d}_{m+n}\big(x^{m+n}, \hat{x}^{m+n}\big) \le m\, \mathsf{d}_m\big(x^m, \hat{x}^m\big) + n\, \mathsf{d}_n\big(x_{m+1}^{m+n}, \hat{x}_{m+1}^{m+n}\big).$$
In the present setting, an $f$-separable distortion measure is sub-additive if $f$ is concave: by Definition 1 and Jensen's inequality,

$$f\big( \mathsf{d}^f_{m+n} \big) = \frac{m\, f\big(\mathsf{d}^f_m\big) + n\, f\big(\mathsf{d}^f_n\big)}{m + n} \le f\!\left( \frac{m\, \mathsf{d}^f_m + n\, \mathsf{d}^f_n}{m + n} \right)$$

(arguments suppressed for brevity), and applying the increasing function $f^{-1}$ to both sides yields sub-additivity.
Thus, the results for sub-additive distortion measures, such as the convexity of the rate-distortion function, are applicable to $f$-separable distortion measures when $f$ is concave.
2. Preliminaries
Let $X$ be a random variable defined on an alphabet $\mathcal{X}$ with distribution $P_X$, with reconstruction alphabet $\hat{\mathcal{X}}$, and a distortion measure $\mathsf{d}\colon \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$. Let $\{1, \dots, M\}$ be the message set.
Definition 2 (Lossy source code).
A lossy source code is a pair of mappings,

$$\mathsf{e}\colon \mathcal{X} \to \{1, \dots, M\} \qquad \text{and} \qquad \mathsf{c}\colon \{1, \dots, M\} \to \hat{\mathcal{X}}.$$

A lossy source code $(\mathsf{e}, \mathsf{c})$ is an $(M, d)$-lossy source code on $(\mathcal{X}, \hat{\mathcal{X}}, \mathsf{d})$ if

$$\mathbb{E}\big[ \mathsf{d}\big(X, \mathsf{c}(\mathsf{e}(X))\big) \big] \le d. \quad (16)$$

A lossy source code $(\mathsf{e}, \mathsf{c})$ is an $(M, d, \epsilon)$-lossy source code on $(\mathcal{X}, \hat{\mathcal{X}}, \mathsf{d})$ if

$$\mathbb{P}\big[ \mathsf{d}\big(X, \mathsf{c}(\mathsf{e}(X))\big) > d \big] \le \epsilon.$$
Definition 3.
An information source $\mathbf{X}$ is a stochastic process

$$\mathbf{X} = (X_1, X_2, \dots), \qquad X_i \in \mathcal{X}.$$
If $(\mathsf{e}, \mathsf{c})$ is an $(M, d)$-lossy source code for $X^n$ on $(\mathcal{X}^n, \hat{\mathcal{X}}^n, \mathsf{d}_n)$, we say $(\mathsf{e}, \mathsf{c})$ is an $(n, M, d)$-lossy source code. Likewise, an $(M, d, \epsilon)$-lossy source code for $X^n$ on $(\mathcal{X}^n, \hat{\mathcal{X}}^n, \mathsf{d}_n)$ is an $(n, M, d, \epsilon)$-lossy source code.
2.1. Rate-Distortion Function (Average Distortion)
Definition 4.
Let a sequence of distortion measures $\{\mathsf{d}_n\}_{n=1}^{\infty}$ be given. The rate-distortion pair $(R, d)$ is achievable if there exists a sequence of $(n, M_n, d)$-lossy source codes such that

$$\limsup_{n \to \infty} \frac{1}{n} \log M_n \le R.$$
Our main object of study is the following rate-distortion function with respect to $f$-separable distortion measures.
Definition 5.
Let $\{\mathsf{d}^f_n\}_{n=1}^{\infty}$ be a sequence of $f$-separable distortion measures. Then,

$$R_f(d) = \inf\big\{ R\colon (R, d) \text{ is achievable} \big\}.$$

If $f$ is the identity, then we omit the subscript and simply write $R(d)$.
2.2. Rate-Distortion Function (Excess Distortion)
It is useful to consider the rate-distortion function for $f$-separable distortion measures under the excess distortion paradigm as well.
Definition 6.
Let a sequence of distortion measures $\{\mathsf{d}_n\}_{n=1}^{\infty}$ be given. The rate-distortion pair $(R, d)$ is (excess distortion) achievable if for any $\epsilon > 0$ there exists a sequence of $(n, M_n, d, \epsilon)$-lossy source codes such that

$$\limsup_{n \to \infty} \frac{1}{n} \log M_n \le R.$$
Definition 7.
Let $\{\mathsf{d}^f_n\}_{n=1}^{\infty}$ be a sequence of $f$-separable distortion measures. Then,

$$R^{\mathrm{e}}_f(d) = \inf\big\{ R\colon (R, d) \text{ is (excess distortion) achievable} \big\}.$$
Characterizing the $f$-separable rate-distortion function is particularly simple under the excess distortion paradigm, as shown in the following lemma.
Lemma 1.
Let the single-letter distortion $\mathsf{d}$ and an increasing, continuous function $f$ be given. Then,

$$R^{\mathrm{e}}_f(d) = \tilde{R}^{\mathrm{e}}\big( f(d) \big),$$

where $\tilde{R}^{\mathrm{e}}(\cdot)$ is computed with respect to the single-letter distortion $\tilde{\mathsf{d}} = f \circ \mathsf{d}$.
Proof.
Let $\{\mathsf{d}^f_n\}$ be a sequence of $f$-separable distortions based on $\mathsf{d}$, and let $\{\tilde{\mathsf{d}}_n\}$ be a sequence of separable distortion measures based on $\tilde{\mathsf{d}} = f \circ \mathsf{d}$.
Since $f$ is increasing and continuous at $d$, for any $\epsilon > 0$ there exists $\delta > 0$ such that

$$f(d + \epsilon) \ge f(d) + \delta. \quad (22)$$

The reverse is also true by continuity of $f^{-1}$: for any $\delta > 0$ there exists $\epsilon > 0$ such that (22) is satisfied.
Any source code $(\mathsf{e}, \mathsf{c})$ is an $(n, M, d, \epsilon)$-lossy code under the $f$-separable distortion $\mathsf{d}^f_n$ if and only if $(\mathsf{e}, \mathsf{c})$ is also an $(n, M, f(d), \epsilon)$-lossy code under the separable distortion $\tilde{\mathsf{d}}_n$. Indeed,

$$\mathbb{P}\big[ \mathsf{d}^f_n(X^n, \hat{X}^n) > d \big] = \mathbb{P}\big[ \tilde{\mathsf{d}}_n(X^n, \hat{X}^n) > f(d) \big],$$

where $\hat{X}^n = \mathsf{c}(\mathsf{e}(X^n))$. It follows that $(R, d)$ is (excess distortion) achievable with respect to $\{\mathsf{d}^f_n\}$ if and only if $(R, f(d))$ is (excess distortion) achievable with respect to $\{\tilde{\mathsf{d}}_n\}$. The lemma statement follows from this observation and Definition 6. ☐
2.3. $f$-Separable Rate-Distortion Functions and Convexity
While it is a well-established result in rate-distortion theory that all separable rate-distortion functions are convex ([11], Lemma 10.4.1), this need not hold for $f$-separable rate-distortion functions.
The convexity argument for separable distortion measures is based on the idea of time sharing; that is, suppose there exists an $(n_1, M_1, d_1)$-lossy source code of blocklength $n_1$ and an $(n_2, M_2, d_2)$-lossy source code of blocklength $n_2$. Then, there exists an $(n, M, d)$-lossy source code of blocklength $n = n_1 + n_2$ with $M = M_1 M_2$ and $d = \frac{n_1 d_1 + n_2 d_2}{n}$: such a code is just a concatenation of the two codes over blocklengths $n_1$ and $n_2$. The distortion $d$ is achievable since

$$\mathsf{d}_n(x^n, \hat{x}^n) = \frac{n_1}{n}\, \mathsf{d}_{n_1}\big(x^{n_1}, \hat{x}^{n_1}\big) + \frac{n_2}{n}\, \mathsf{d}_{n_2}\big(x_{n_1+1}^{n}, \hat{x}_{n_1+1}^{n}\big),$$

and, letting $\alpha = n_1 / n$,

$$\frac{1}{n} \log M_1 M_2 = \alpha\, \frac{1}{n_1} \log M_1 + (1 - \alpha)\, \frac{1}{n_2} \log M_2.$$

Time sharing between the two schemes gives

$$R\big(\alpha d_1 + (1 - \alpha) d_2\big) \le \alpha R(d_1) + (1 - \alpha) R(d_2).$$
However, this bound on the distortion need not hold for $f$-separable distortions. Consider $f(x) = x^2$, which is strictly convex, and suppose

$$\mathsf{d}^f_{n_1}\big(x^{n_1}, \hat{x}^{n_1}\big) = d_1 \quad \text{and} \quad \mathsf{d}^f_{n_2}\big(x_{n_1+1}^{n}, \hat{x}_{n_1+1}^{n}\big) = d_2, \qquad d_1 \neq d_2.$$

We can write

$$\mathsf{d}^f_n(x^n, \hat{x}^n) = \sqrt{\frac{n_1 d_1^2 + n_2 d_2^2}{n}} > \frac{n_1 d_1 + n_2 d_2}{n},$$

where the strict inequality follows from the strict convexity of $f$. Thus, concatenating the two schemes together does not guarantee that the distortion assigned by the $f$-separable distortion measure is bounded by $d$.
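A short numerical check (ours) with $f(x) = x^2$: concatenating a block with distortion $d_1 = 0$ and an equally long block with $d_2 = 1$ yields an $f$-separable distortion of $\sqrt{1/2} \approx 0.707$, strictly larger than the time-sharing average $0.5$.

```python
import numpy as np

d1, d2 = 0.0, 1.0   # f-separable distortions of the two constituent blocks
n1, n2 = 50, 50     # blocklengths of the two concatenated codes
rms = np.sqrt((n1 * d1**2 + n2 * d2**2) / (n1 + n2))   # f(x) = x**2
avg = (n1 * d1 + n2 * d2) / (n1 + n2)                  # time-sharing bound
print(rms, ">", avg)   # 0.7071... > 0.5
```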
3. Main Result
In this section we make the following standard assumptions; see [12].
- $\mathbf{X}$ is a stationary and ergodic source.
- The single-letter distortion function $\mathsf{d}$ and the continuous and increasing function $f$ are such that

$$\mathbb{E}\big[ f\big( \mathsf{d}(X_1, \hat{y}) \big) \big] < \infty \quad \text{for some } \hat{y} \in \hat{\mathcal{X}}.$$
- For each $\epsilon > 0$, there exists a countable subset $\{\hat{y}_1, \hat{y}_2, \dots\}$ of $\hat{\mathcal{X}}$ and a countable measurable partition $\{E_1, E_2, \dots\}$ of $\mathcal{X}$ such that $\mathsf{d}(x, \hat{y}_i) \le \epsilon$ for each $x \in E_i$, and

$$\sum_{i=1}^{\infty} \mathbb{P}[X_1 \in E_i] \log \frac{1}{\mathbb{P}[X_1 \in E_i]} < \infty.$$
Theorem 1.
Under the stated assumptions, the rate-distortion function is given by

$$R_f(d) = \tilde{R}\big( f(d) \big), \quad (34)$$

where

$$\tilde{R}(\delta) = \lim_{n \to \infty} \min_{P_{\hat{X}^n | X^n}\colon\, \mathbb{E}[\tilde{\mathsf{d}}_n(X^n, \hat{X}^n)] \le \delta} \frac{1}{n} I\big(X^n; \hat{X}^n\big) \quad (35)$$

is the rate-distortion function computed with respect to the separable distortion measure $\tilde{\mathsf{d}}_n$ based on the single-letter distortion $\tilde{\mathsf{d}} = f \circ \mathsf{d}$.
For stationary memoryless sources, (34) particularizes to

$$R_f(d) = \min_{P_{\hat{X}|X}\colon\, \mathbb{E}[f(\mathsf{d}(X, \hat{X}))] \le f(d)} I(X; \hat{X}). \quad (36)$$
Proof.
Equations (35) and (36) are widely known in the literature (see, for example, [10,11,13]); it remains to show (34). Under the stated assumptions,

$$R_f(d) \stackrel{(a)}{\le} R^{\mathrm{e}}_f(d) \stackrel{(b)}{=} \tilde{R}^{\mathrm{e}}\big( f(d) \big) \stackrel{(c)}{=} \tilde{R}\big( f(d) \big),$$

where (a) follows from assumption (2) and Theorem A1 in Appendix A, (b) is shown in Lemma 1, and (c) is due to [14] (see also ([13], Theorem 5.9.1)). The other direction,

$$R_f(d) \ge \tilde{R}\big( f(d) \big),$$

is a consequence of the strong converse by Kieffer [12]; see Lemma A1 in Appendix A. ☐
An immediate application of Theorem 1 gives the $f$-separable rate-distortion function for several well-known binary memoryless sources (BMS).
Example 1 (BMS, Hamming distortion).
Let $\mathbf{X}$ be the binary memoryless source. That is, $\mathcal{X} = \hat{\mathcal{X}} = \{0, 1\}$, $X_i$ is a Bernoulli($p$) random variable with $p \le 1/2$, and $\mathsf{d}(x, \hat{x}) = 1\{x \neq \hat{x}\}$ is the usual Hamming distortion measure. Then, for any continuous increasing $f$ and any $d$ such that

$$\theta = \frac{f(d) - f(0)}{f(1) - f(0)} \in [0, p],$$

we have

$$R_f(d) = h(p) - h(\theta),$$

where

$$h(x) = -x \log x - (1 - x) \log(1 - x)$$

is the binary entropy function. The result follows from a series of obvious equalities,

$$R_f(d) = \tilde{R}\big( f(d) \big) = R_{\mathrm{H}}(\theta) = h(p) - h(\theta),$$

where $R_{\mathrm{H}}$ denotes the rate-distortion function for the Hamming distortion: the transformed distortion $\tilde{\mathsf{d}} = f \circ \mathsf{d}$ takes only the values $f(0)$ and $f(1)$, and shifting and scaling a distortion measure simply shifts and scales the distortion axis of its rate-distortion function.
The rate-distortion function given in Example 1 is plotted in Figure 2 for different functions $f$. The simple derivation in Example 1 could be applied to any source for which the single-letter distortion measure can take on only two values, as is shown in the next example.
Figure 2.
$R_f(d)$ for the binary memoryless source of Example 1, for several choices of $f$. Compare these to the $f$-separable distortion measures plotted for the binary source with Hamming distortion in Figure 1.
Example 2 (BMS, Erasure distortion).
Let $\mathbf{X}$ be the binary memoryless source and let the reconstruction alphabet have the erasure option. That is, $\mathcal{X} = \{0, 1\}$, $\hat{\mathcal{X}} = \{0, 1, \mathrm{e}\}$, and $X_i$ is a Bernoulli($1/2$) random variable. Let $\mathsf{d}$ be the usual erasure distortion measure:

$$\mathsf{d}(x, \hat{x}) = \begin{cases} 0, & \hat{x} = x, \\ 1, & \hat{x} = \mathrm{e}, \\ \infty, & \hat{x} = 1 - x. \end{cases}$$

The separable rate-distortion function for the erasure distortion is given by

$$R(d) = 1 - d \quad \text{bits}, \qquad 0 \le d \le 1;$$

see ([11], Problem 10.7). Then, for any continuous increasing $f$,

$$R_f(d) = 1 - \frac{f(d) - f(0)}{f(1) - f(0)}.$$
The rate-distortion function given in Example 2 is plotted in Figure 3 for different functions $f$. Observe that for concave $f$ (i.e., a subadditive distortion measure) the resulting rate-distortion function is convex, which is consistent with [10]. However, for $f$ that are not concave, the rate-distortion function is not always convex. Unlike in the conventional separable setting, the rate-distortion function for an $f$-separable distortion measure is not convex in general.
Figure 3.
$R_f(d)$ for the binary memoryless source of Example 2, with the erasure per-letter distortion.
Having a closed-form analytic expression for a separable rate-distortion function does not always mean that we can easily derive such an expression for the $f$-separable rate-distortion function with the same per-letter distortion. For example, consider the Gaussian source with the mean-square-error (MSE) per-letter distortion. According to Theorem 1, letting $f(x) = \sqrt{x}$ recovers the Gaussian source with the absolute-value per-letter distortion. This setting, and variations on it, is a difficult problem in general [15]. However, we can recover the $f$-separable rate-distortion function whenever the per-letter distortion composed with $f$ reconstructs the MSE distortion; see Figure 4.
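For instance, with the absolute-value per-letter distortion $\mathsf{d}(x, \hat{x}) = |x - \hat{x}|$ and $f(t) = t^2$, the composition $f \circ \mathsf{d}$ is exactly the MSE, so Theorem 1 gives $R_f(d) = \frac{1}{2} \log_2(\sigma^2 / d^2)$ bits for $0 < d \le \sigma$. A minimal evaluation (our own sketch):

```python
import numpy as np

def R_f_gauss_abs(d, sigma=1.0):
    # Absolute-value per-letter distortion with f(t) = t**2: the
    # composition f(|x - y|) is the squared error, so Theorem 1
    # reduces to the Gaussian MSE formula evaluated at f(d) = d**2.
    return max(0.0, 0.5 * np.log2(sigma**2 / d**2))

for d in (0.1, 0.25, 0.5, 1.0):
    print(d, R_f_gauss_abs(d))
```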
Figure 4.
$R_f(d)$ for the Gaussian memoryless source with mean zero and unit variance.
Theorem 1 shows that for well-behaved stationary ergodic sources, $R_f(d)$ admits a simple characterization. According to Lemma 1, the same characterization holds for the excess distortion paradigm without the stationarity and ergodicity assumptions. The next example shows that, in general, $R_f(d) \neq \tilde{R}(f(d))$ within the average distortion paradigm. Thus, assumption (1) is necessary for Theorem 1 to hold.
Example 3 (Mixed Source).
Fix $\lambda \in (0, 1)$ and let the source $\mathbf{X}$ be a mixture of two i.i.d. sources,

$$P_{X^n}(x^n) = \lambda \prod_{i=1}^{n} P_1(x_i) + (1 - \lambda) \prod_{i=1}^{n} P_2(x_i). \quad (45)$$

We can alternatively express $X^n$ as

$$X^n = \begin{cases} A^n, & Z = 1, \\ B^n, & Z = 0, \end{cases}$$

where $Z$ is a Bernoulli($\lambda$) random variable independent of $(A^n, B^n)$, and $A^n$ and $B^n$ are i.i.d. according to $P_1$ and $P_2$, respectively. Then, the rate-distortion function for the mixture source (45) and continuous increasing $f$ is given in Lemma A2 in Appendix B. Namely,

$$R_f(d) = \min_{(d_1, d_2)\colon\, \lambda d_1 + (1 - \lambda) d_2 \le d} \max\big\{ R_{1,f}(d_1),\, R_{2,f}(d_2) \big\},$$

where $R_{1,f}$ and $R_{2,f}$ are the rate-distortion functions for the discrete memoryless sources given by $P_1$ and $P_2$, respectively. Likewise,

$$\tilde{R}\big( f(d) \big) = \min_{(\delta_1, \delta_2)\colon\, \lambda \delta_1 + (1 - \lambda) \delta_2 \le f(d)} \max\big\{ \tilde{R}_1(\delta_1),\, \tilde{R}_2(\delta_2) \big\},$$

where $\tilde{R}_1$ and $\tilde{R}_2$ are computed with respect to $\tilde{\mathsf{d}} = f \circ \mathsf{d}$; the two minimizations average the distortion budget on different scales, so the two sides need not coincide.
Figure 5.
Mixed binary source of Example 3; three examples of $f$-separable rate-distortion functions are given. For $f(x) = x$, the relation $R_f(d) = \tilde{R}(f(d))$ follows immediately. When $f$ is not the identity, $R_f(d) \neq \tilde{R}(f(d))$ in general for non-ergodic sources.
4. Discussion
4.1. Sub-Additive Distortion Measures
Recall that an $f$-separable distortion measure is sub-additive if $f$ is concave (cf. Section 1.3). Clearly, not all $f$-separable distortion measures are sub-additive, and not all sub-additive distortion measures are $f$-separable. An exemplar of a sub-additive distortion measure (which is not $f$-separable), given in ([10], Chapter 5.2), is

$$\rho_n(x^n, \hat{x}^n) = \frac{1}{n} \left( \sum_{i=1}^{n} \mathsf{d}(x_i, \hat{x}_i)^q \right)^{1/q}, \qquad q \ge 1. \quad (49)$$

The sub-additivity of (49) follows from the Minkowski inequality. Comparing (49) to a sub-additive, $f$-separable distortion measure given by $f(x) = x^q$ with $0 < q \le 1$,

$$\mathsf{d}^f_n(x^n, \hat{x}^n) = \left( \frac{1}{n} \sum_{i=1}^{n} \mathsf{d}(x_i, \hat{x}_i)^q \right)^{1/q}, \quad (50)$$

we see that the discrepancy between (49) and (50) has to do not only with the different ranges of $q$, but also with the scaling factor as a function of $n$.
Consider a binary source with Hamming distortion, and let $k = \sum_{i=1}^{n} 1\{x_i \neq \hat{x}_i\}$ denote the number of reconstruction errors. Rewriting (49) we obtain

$$\rho_n(x^n, \hat{x}^n) = \frac{k^{1/q}}{n} = n^{1/q - 1} \left( \frac{k}{n} \right)^{1/q}, \quad (51)$$

and

$$\mathsf{d}^f_n(x^n, \hat{x}^n) = \left( \frac{k}{n} \right)^{1/q}. \quad (52)$$

For $q > 1$, the factor $n^{1/q - 1}$ in (51) vanishes as $n$ grows: in the binary example, the limiting distortion of (49) is zero even when the reconstruction of $x^n$ gets every single symbol wrong. It is easy to observe that example (49) is similarly degenerate in many cases of interest. The distortion measure given by (50), on the other hand, is an example of a non-trivial sub-additive distortion measure, as can be seen in Figure 2 and Figure 3 for concave choices of $f$.
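A short numerical illustration (ours) of this degeneracy, with $q = 2$ and every symbol reconstructed incorrectly ($k = n$):

```python
q = 2.0
for n in (10, 100, 1000, 10000):
    k = n                                        # every symbol is wrong
    rho_gray = n**(1/q - 1) * (k / n)**(1/q)     # Equation (51): tends to 0
    d_fsep = (k / n)**(1/q)                      # Equation (52): stays at 1
    print(n, rho_gray, d_fsep)
```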
4.2. A Consequence of Theorem 1
In light of the discussion in Section 1.1, an alert reader may consider modifying (16) to

$$f^{-1}\Big( \mathbb{E}\big[ f\big( \mathsf{d}\big(X, \mathsf{c}(\mathsf{e}(X))\big) \big) \big] \Big) \le d, \quad (55)$$

that is, replacing the expectation in (16) by an $f$-mean, and studying the lossy source codes under this new paradigm. Call the corresponding rate-distortion function $R'_f(d)$ and assume that the $n$-letter distortion measures are separable. Thus, at block length $n$ the constraint (55) is

$$f^{-1}\left( \mathbb{E}\left[ f\left( \frac{1}{n} \sum_{i=1}^{n} \mathsf{d}(X_i, \hat{X}_i) \right) \right] \right) \le d,$$

where $\hat{X}^n = \mathsf{c}(\mathsf{e}(X^n))$. This is equivalent to the following constraints:

$$\mathbb{E}\left[ f\left( \frac{1}{n} \sum_{i=1}^{n} \mathsf{d}(X_i, \hat{X}_i) \right) \right] \le f(d) \qquad \Longleftrightarrow \qquad \mathbb{E}\big[ \hat{\mathsf{d}}_n(X^n, \hat{X}^n) \big] \le f(d),$$

where $\hat{\mathsf{d}}_n(x^n, \hat{x}^n) = f\big( \frac{1}{n} \sum_{i=1}^{n} \mathsf{d}(x_i, \hat{x}_i) \big)$ is an $f^{-1}$-separable distortion measure based on the single-letter distortion $f \circ \mathsf{d}$. Putting these observations together with Theorem 1 (applied with $f^{-1}$ in place of $f$ and $f \circ \mathsf{d}$ in place of $\mathsf{d}$) yields

$$R'_f(d) = R\big( f^{-1}\big( f(d) \big) \big) = R(d).$$

A consequence of Theorem 1 is thus that the rate-distortion function remains unchanged under this new paradigm.
5. Conclusions
This paper proposes $f$-separable distortion measures as a good model for non-linear distortion penalties. The rate-distortion function for $f$-separable distortion measures is characterized in terms of the separable rate-distortion function with respect to a new single-letter distortion measure, $f \circ \mathsf{d}$. This characterization is straightforward for the excess distortion paradigm, as seen in Lemma 1. The proof is more involved for the average distortion paradigm, as seen in Theorem 1. An important implication of Theorem 1 is that many prominent results in the rate-distortion literature (e.g., the Blahut-Arimoto algorithm) can be leveraged to work for $f$-separable distortion measures.
Finally, we mention that a similar generalization is well suited for channels with non-linear costs. That is, we say that $\mathsf{b}_n$ is an $f$-separable cost function if it can be written as

$$\mathsf{b}_n(x^n) = f^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} f\big( \mathsf{b}(x_i) \big) \right).$$

With this generalization we can state the following result, whose proof is outside the scope of this special issue.
Theorem 2 (Channels with cost).
The capacity of a stationary memoryless channel given by $P_{Y|X}$ and an $f$-separable cost function based on the single-letter cost function $\mathsf{b}$ is

$$C(\beta) = \max_{P_X\colon\, \mathbb{E}[f(\mathsf{b}(X))] \le f(\beta)} I(X; Y).$$
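As an illustration (our own sketch, with hypothetical parameter choices), Theorem 2 can be evaluated for a binary symmetric channel with input cost $\mathsf{b}(0) = 0$, $\mathsf{b}(1) = 1$ by a grid search over input distributions, replacing the usual constraint $\mathbb{E}[\mathsf{b}(X)] \le \beta$ with $\mathbb{E}[f(\mathsf{b}(X))] \le f(\beta)$:

```python
import numpy as np

def h(x):
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def capacity_f_cost(delta, beta, f, grid=10001):
    # BSC(delta) with input cost b(0) = 0, b(1) = 1. Maximize
    # I(X;Y) = h(pi*(1-delta) + (1-pi)*delta) - h(delta) over the
    # input probability pi, subject to the f-mean cost constraint.
    best = 0.0
    for pi in np.linspace(0.0, 0.5, grid):
        if pi * f(1.0) + (1 - pi) * f(0.0) <= f(beta):
            y1 = pi * (1 - delta) + (1 - pi) * delta
            best = max(best, h(y1) - h(delta))
    return best

print(capacity_f_cost(0.1, 0.25, lambda t: t))      # linear cost baseline
print(capacity_f_cost(0.1, 0.25, lambda t: t**2))   # f(t) = t**2 tightens it
```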
Author Contributions
Authors had equal contributions in the paper. All authors have read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Lemmas for Theorem 1
Theorem A1 can be distilled from several proofs in the literature. We state it here, with proof, for completeness; it is given in its present form in [16]. The condition of Theorem A1 applies when the source satisfies assumptions (1)–(3) in Section 3. This is a consequence of the ergodic theorem and the continuity of $f$.
Theorem A1.
Suppose that the source $\mathbf{X}$ and distortion measures $\{\mathsf{d}_n\}$ are such that for any $\epsilon > 0$ there exist $d_\epsilon < \infty$ and a sequence of reconstruction words $\hat{y}^{(n)} \in \hat{\mathcal{X}}^n$ such that, for all sufficiently large $n$,

$$\mathbb{E}\Big[ \mathsf{d}_n\big(X^n, \hat{y}^{(n)}\big)\, 1\big\{ \mathsf{d}_n\big(X^n, \hat{y}^{(n)}\big) > d_\epsilon \big\} \Big] \le \epsilon.$$
If a rate-distortion pair is achievable under the excess distortion criterion, it is achievable under the average distortion criterion.
Proof.
Choose $\epsilon > 0$, and let $d_\epsilon$ and $\hat{y}^{(n)}$ be as in the statement of the theorem. Suppose there is a code $(\mathsf{e}, \mathsf{c})$ with $M$ codewords that achieves

$$\mathbb{P}\big[ \mathsf{d}_n\big(X^n, \mathsf{c}(\mathsf{e}(X^n))\big) > d \big] \le \epsilon_n.$$

We construct a new code $(\mathsf{e}', \mathsf{c}')$ with $M + 1$ codewords:

$$\mathsf{e}'(x^n) = \begin{cases} \mathsf{e}(x^n), & \mathsf{d}_n\big(x^n, \mathsf{c}(\mathsf{e}(x^n))\big) \le d, \\ M + 1, & \text{otherwise}, \end{cases} \qquad \mathsf{c}'(m) = \begin{cases} \mathsf{c}(m), & m \le M, \\ \hat{y}^{(n)}, & m = M + 1. \end{cases}$$

For brevity, denote

$$\mathcal{E} = \big\{ \mathsf{d}_n\big(X^n, \mathsf{c}(\mathsf{e}(X^n))\big) > d \big\}.$$

Then,

$$\mathbb{E}\big[ \mathsf{d}_n\big(X^n, \mathsf{c}'(\mathsf{e}'(X^n))\big) \big] \le d\, \mathbb{P}[\mathcal{E}^{\mathrm{c}}] + \mathbb{E}\big[ \mathsf{d}_n\big(X^n, \hat{y}^{(n)}\big)\, 1\{\mathcal{E}\} \big] \le d + d_\epsilon\, \epsilon_n + \epsilon,$$

where the last inequality uses the condition of the theorem. Since $\epsilon_n \to 0$ and $\epsilon$ is arbitrary, the average distortion of the new codes approaches $d$ while the rate is asymptotically unchanged.
☐
The following theorem is shown in ([12], Theorem 1).
Theorem A2 (Kieffer).
Let $\mathbf{X}$ be an information source satisfying conditions (1)–(3) in Section 3, with $f$ being the identity. Let $\{\mathsf{d}_n\}$ be separable. Given an arbitrary sequence of $(n, M_n, d, \epsilon_n)$-lossy source codes, if

$$\limsup_{n \to \infty} \frac{1}{n} \log M_n < R(d),$$

then

$$\lim_{n \to \infty} \epsilon_n = 1.$$
An important implication of Theorem A2 for $f$-separable rate-distortion functions is given in the following lemma.
Lemma A1.
Let $\mathbf{X}$ be an information source satisfying conditions (1)–(3) in Section 3. Then,

$$R_f(d) \ge \tilde{R}\big( f(d) \big).$$
Proof.
If $\tilde{R}(f(d)) = 0$, we are done. Suppose $\tilde{R}(f(d)) > 0$. Assume there exists a sequence of $(n, M_n, d_n)$-lossy source codes (under the $f$-separable distortion) with

$$\limsup_{n \to \infty} \frac{1}{n} \log M_n < \tilde{R}\big( f(d) \big)$$

and

$$\limsup_{n \to \infty} d_n \le d. \quad (A18)$$

Since $\tilde{R}$ is continuous and decreasing, there exists some $\delta > 0$ such that

$$\limsup_{n \to \infty} \frac{1}{n} \log M_n < \tilde{R}\big( f(d) + \delta \big).$$

For every $n$, the lossy source code is also an $(n, M_n, d', \epsilon_n)$-lossy source code, with $d' = f^{-1}\big( f(d) + \delta \big)$ and some $\epsilon_n$, under the $f$-separable distortion $\mathsf{d}^f_n$. It is also an $(n, M_n, f(d) + \delta, \epsilon_n)$-lossy source code with respect to the separable distortion $\tilde{\mathsf{d}}_n$. We can therefore apply Theorem A2 to obtain

$$\lim_{n \to \infty} \epsilon_n = \lim_{n \to \infty} \mathbb{P}\big[ \tilde{\mathsf{d}}_n(X^n, \hat{X}^n) > f(d) + \delta \big] = 1.$$

Thus,

$$d_n \ge \mathbb{E}\big[ \mathsf{d}^f_n(X^n, \hat{X}^n) \big] \ge d'\, \mathbb{P}\big[ \mathsf{d}^f_n(X^n, \hat{X}^n) > d' \big] > d, \quad (A22)$$

where (A22) holds for all sufficiently large $n$, since $d' > d$ and the probability tends to one. The result follows since we obtained a contradiction with (A18). ☐
Appendix B. Rate-Distortion Function for a Mixed Source
Lemma A2.
The rate-distortion function with respect to the $f$-separable distortion for the mixture source (45) is given by

$$R_f(d) = \min_{(d_1, d_2)\colon\, \lambda d_1 + (1 - \lambda) d_2 \le d} \max\big\{ R_{1,f}(d_1),\, R_{2,f}(d_2) \big\},$$

where $R_{1,f}$ and $R_{2,f}$ are the rate-distortion functions with respect to the $f$-separable distortion for the stationary memoryless sources given by $P_1$ and $P_2$, respectively.
Proof.
Observe that

$$\min_{\lambda d_1 + (1 - \lambda) d_2 \le d} \max\big\{ M_1^*(n, d_1),\, M_2^*(n, d_2) \big\} \le M^*(n, d) \le \min_{\lambda d_1 + (1 - \lambda) d_2 \le d} \Big( M_1^*(n, d_1) + M_2^*(n, d_2) \Big), \quad (A25)$$

where

$$M^*(n, d) = \min\big\{ M\colon \text{an } (n, M, d)\text{-lossy source code for } \mathbf{X} \text{ exists} \big\},$$

and $M_1^*(n, d)$ and $M_2^*(n, d)$ are the analogous non-asymptotic limits for $P_1$ and $P_2$, respectively. Indeed, the upper bound follows by designing optimal codes for $P_1$ and $P_2$ separately, and then combining the two codebooks into a single code whose average distortion is at most $\lambda d_1 + (1 - \lambda) d_2 \le d$. The lower bound follows by the following argument. Fix an $(n, M, d)$-lossy source code ($f$-separable distortion), $(\mathsf{e}, \mathsf{c})$. Define

$$d_1 = \mathbb{E}\big[ \mathsf{d}^f_n\big(X^n, \mathsf{c}(\mathsf{e}(X^n))\big) \,\big|\, Z = 1 \big] \qquad \text{and} \qquad d_2 = \mathbb{E}\big[ \mathsf{d}^f_n\big(X^n, \mathsf{c}(\mathsf{e}(X^n))\big) \,\big|\, Z = 0 \big].$$

Clearly, $\lambda d_1 + (1 - \lambda) d_2 \le d$. It also follows that

$$M \ge M_1^*(n, d_1),$$

since $(\mathsf{e}, \mathsf{c})$ is an $(n, M, d_1)$-lossy source code ($f$-separable distortion) for $P_1$. Likewise,

$$M \ge M_2^*(n, d_2),$$

which proves the lower bound. Taking logarithms, normalizing by $n$, and letting $n \to \infty$, the result follows directly from (A25). ☐
References
- Shannon, C.E. Coding Theorems for a Discrete Source with a Fidelity Criterion. Institute of Radio Engineers, International Convention Record, Vol. 7, 1959. Reprinted in Claude E. Shannon: Collected Papers; Wiley-IEEE Press: Hoboken, NJ, USA, 1993.
- Tikhomirov, V. On the Notion of Mean. In Selected Works of A. N. Kolmogorov; Tikhomirov, V., Ed.; Springer: Dordrecht, The Netherlands, 1991; Volume 25, pp. 144–146.
- Rényi, A. On Measures of Entropy and Information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Oakland, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
- Campbell, L. A coding theorem and Rényi's entropy. Inf. Control 1965, 8, 423–429.
- Campbell, L. Definition of entropy by means of a coding problem. Z. Wahrscheinlichkeitstheorie Verwandte Gebiete 1966, 6, 113–118.
- Kieffer, J. Variable-length source coding with a cost depending only on the code word length. Inf. Control 1979, 41, 136–146.
- Arikan, E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 1996, 42, 99–105.
- Bunte, C.; Lapidoth, A. Encoding Tasks and Rényi Entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076.
- Arikan, E.; Merhav, N. Guessing subject to distortion. IEEE Trans. Inf. Theory 1998, 44, 1041–1056.
- Gray, R.M. Entropy and Information Theory, 2nd ed.; Springer: Berlin, Germany, 2011.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
- Kieffer, J. Strong converses in source coding relative to a fidelity criterion. IEEE Trans. Inf. Theory 1991, 37, 257–262.
- Han, T.S. Information-Spectrum Methods in Information Theory; Springer: Berlin, Germany, 2003.
- Steinberg, Y.; Verdú, S. Simulation of random processes and rate-distortion theory. IEEE Trans. Inf. Theory 1996, 42, 63–86.
- Dytso, A.; Bustin, R.; Poor, H.V.; Shitz, S.S. On additive channels with generalized Gaussian noise. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 426–430.
- Verdú, S. ELE528: Information Theory Lecture Notes; Princeton University: Princeton, NJ, USA, 2015.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).