Abstract
We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss when estimating a random variable from an observed feature vector and the minimum expected loss when estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless, and we show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a δ-lossless transformation and give sufficient conditions for a given transformation to be universally δ-lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottlenecks, and deep learning are also surveyed.
1. Introduction
We consider the standard setting of statistical inference, where Y is a real random variable, having a range , which is to be estimated (predicted) from a random observation (feature) vector X, taking values in . Given a measurable predictor and measurable loss function , the loss incurred is . The minimum expected risk in predicting Y from the random vector X is
where the infimum is over all measurable f.
Suppose that the tasks of collecting data and making the prediction are separated in time or in space. For example, the separation in time happens when the data are collected first and the statistical modeling and analysis are performed much later. Separation in space can be due, for example, to collecting data at a remote location and making predictions centrally. Such situations are modeled using a transformation , so that the prediction regarding Y is made from the transformed observation , instead of X. An important example of such a transformation is quantization, in which case is a discrete random variable. Clearly, one always has . The difference is sometimes referred to in the literature as excess risk. A part of this paper is concerned with transformations for which the excess risk is zero, no matter the underlying loss function ℓ. Such transformations are universally lossless, in the sense that they can be chosen before the cost function ℓ for the underlying problem is known. More formally, we can state the following definition.
Definition 1
(lossless transformation). For a fixed joint distribution of Y and X, a (measurable) transformation is called universally lossless if for any loss function we have
An important special transformation is feature selection. Formally, for the observation (feature) vector and , consider the -dimensional vector . Typically, the dimension of is significantly smaller than d, the dimension of X. If we have
for all loss functions ℓ, then the feature selector is universally lossless. For fixed loss ℓ, the performance of any statistical inference method is sensitive to the dimension of the feature vector. Therefore, dimension reduction is crucial before choosing or constructing an inference method. If is universally lossless, then the complement feature subvector is irrelevant. It is an open research problem how to efficiently search a universally lossless with minimum size . Since, typically, the distribution of the pair X and Y is not known and must be inferred from data, any such search algorithm needs a procedure for testing for the universal losslessness property of a feature selector.
In the first part of this paper, we give a necessary and sufficient condition for a given transformation T to be universally lossless and then construct a partitioning-based statistic for testing this condition if independent and identically distributed training data are available. With the null hypothesis being that a given transformation is universally lossless, the test is shown to be strongly consistent, in the sense that it almost surely (a.s.) makes finitely many Type I and II errors.
In many situations, requiring that a transformation T be universally lossless is too demanding. The next definition relaxes this requirement.
Definition 2
(δ-lossless transformation). For a fixed joint distribution of Y and X, and , a transformation is called universally δ-lossless with respect to a class of loss functions , if we have
In the second part of this paper, we derive bounds on the excess minimum risk in terms of the mutual information difference under various assumptions about ℓ. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be δ-lossless with respect to fairly general classes of loss functions ℓ. Applications to classification, nonparametric regression, portfolio strategies, the information bottleneck method, and deep learning are also reviewed.
- Relationship with prior work
Our first result, Theorem 1, which shows that a transformation is universally lossless if and only if it is a sufficient statistic, is likely known, but we could not find it in this explicit form in the literature (a closely related result is the classical Rao–Blackwell theorem of mathematical statistics, e.g., Schervish ([1], Theorem 3.22)). Due to this result, testing with independent data whether or not a given transformation is universally lossless turns into a test for conditional independence. Our test in Theorem 2 is based on the main results in Györfi and Walk [2], but our construction is more general and we also correct an error in the proof of ([2], Theorem 1). Apart from [2], most of the results in the literature on testing for conditional independence are for real-valued random variables and/or assume certain special distribution types, typically the existence of a joint probability density function. Such assumptions exclude problems where Y is discrete and X is continuous, as is typical in classification, or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution free and its convergence properties are also (almost) distribution free. A more detailed review of related work is given in Section 2.1.
The main result in Section 3 is Theorem 3, which bounds the excess risk in terms of the square root of the mutual information difference . There is a history of such bounds, possibly starting with Xu and Raginsky [3], where the generalization error of a learning algorithm was upper bounded using a constant times the square root of the mutual information between the hypothesis and the training data (see also the references in [3,4]). This result has since been extended in various forms, mostly concentrating on providing information-theoretic bounds for the generalization capabilities of learning algorithms, instead of looking at the excess risk; see, e.g., Raginsky et al. [5], Lugosi and Neu [6], Jose and Simeone [7], and the references therein, just to mention a few of these works. The most relevant recent work relating to our bounds in Section 3 seems to be Xu and Raginsky [4], where, among other things, information-theoretic bounds were developed on the excess risk in a Bayesian learning framework; see also Hafez-Kolahi et al. [8]. The bounds in [4] are not on the excess risk ; they involve training data, but their forms are similar to ours. It appears that our Theorem 3 gives a bound that holds uniformly for a larger class of loss functions ℓ and joint distributions of Y and X; however, in [4], several other bounds are presented that are tighter and/or allow more general distributions, for specific fixed loss functions.
- Organization
This paper is organized as follows. In Section 2, we characterize universally lossless transformations and introduce a novel strongly consistent test for the property of universal losslessness. In Section 3, information-theoretic bounds on the excess minimum risk are developed and are used to characterize the -losslessness property of transformations. Section 4 surveys connections with, and applications to, specific prediction problems, as well as the information bottleneck method in deep learning. The somewhat lengthy proof of the strong consistency of the test in Theorem 2 is given in Section 5. Concluding remarks are given in Section 6.
2. Testing the Universal Losslessness Property
In this section, we first give a characterization of universally lossless transformations for a given distribution of the pair . In practice, the distribution of may not be known, but a sequence of independent and identically distributed (i.i.d.) copies of may be available. For this case, we construct a procedure to test if a given transformation is universally lossless and prove that, under mild conditions, the test is strongly consistent.
2.1. Universally Lossless Transformations
Based on Definition 1, we introduce the null hypothesis
A transformation (statistic) is called sufficient if the random variables Y, , X form a Markov chain in this order, denoted by (see, e.g., Definition 3.8 and Theorem 3.9 in Polyanskiy and Wu [9]).
For a binary valued Y, Theorems 32.5 and 32.6 from Devroye et al. [10] imply that the statistic is universally lossless if, and only if, it is sufficient. The following theorem extends this property to general Y. This result is likely known, but we could not find it in the given form.
Theorem 1.
The transformation T is universally lossless if, and only if, is a Markov chain.
Proof.
Assume first that is a Markov chain. This is equivalent to having almost surely (a.s.) for any measurable . Then we have
Since always holds, we obtain for all ℓ, so is universally lossless.
Now, assume that the Markov chain condition does not hold. Then, there exists a measurable with and with , such that
Let , where is the indicator function of event E, and define the binary valued as . Then, the Markov chain condition does not hold. For this special case, Theorems 32.5 and 32.6 in [10] show that there exists a loss function such that . Finally, letting , we have
which shows that is not universally lossless. □
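For discrete distributions, the Markov chain condition of Theorem 1 can be verified directly: T is sufficient exactly when the conditional distribution of Y given X = x is constant on every level set of T. The following minimal sketch illustrates this (the toy conditional distribution and the maps `T_merge`, `T_const` are illustrative assumptions, not taken from the text):

```python
def is_sufficient(p_y_given_x, T):
    """Check the Markov chain Y -- T(X) -- X for a discrete X: the statistic
    T is sufficient iff x -> P(Y = . | X = x) is constant on every level set
    {x : T(x) = t}."""
    by_level = {}
    for x, cond in p_y_given_x.items():
        t = T(x)
        if t not in by_level:
            by_level[t] = cond
        elif any(abs(by_level[t].get(y, 0.0) - cond.get(y, 0.0)) > 1e-12
                 for y in set(by_level[t]) | set(cond)):
            return False
    return True

# Toy example (an assumption): P(Y | X = x) depends only on whether x > 0,
# so merging x = 1 and x = 2 is lossless, while merging everything is not.
p = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}, 2: {0: 0.3, 1: 0.7}}
T_merge = lambda x: min(x, 1)
T_const = lambda x: 0
```

Here `is_sufficient(p, T_merge)` holds, while `is_sufficient(p, T_const)` fails, matching the dichotomy of Theorem 1 for this toy case.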
2.2. A Strongly Consistent Test
Theorem 1 implies an equivalent form of the losslessness null hypothesis defined by (1)
or equivalently, holds if and only if X and Y are conditionally independent given :
for arbitrary Borel sets . Furthermore, we consider the general case where the alternative hypothesis is the complement of : .
Now, assume that the joint distribution of is not known but instead a sample of independent and identically distributed (i.i.d.) random vectors having a common distribution of is given, where and . The goal is to test the hypothesis of conditional independence based on these data. In fact, our goal is to provide a strongly consistent test; i.e., a test that, with a probability of one, only makes finitely many Type I and II errors.
For testing conditional independence, most of the results in the literature used real-valued . Based on kernel density estimation, Cai et al. [11] introduced a test statistic and calculated its limit distribution under the null hypothesis. In Neykov et al. [12], a gap was introduced between the null and alternative hypotheses. This gap was characterized by the total variation distance, which decreased with increasing n. Under certain smoothness conditions, minimax bounds were derived. According to Shah and Peters [13], a regularity condition such as our Lipschitz condition (5) below cannot be omitted if a test for conditional independence is to be consistent. This is a consequence of their no-free-lunch theorem, which states that, under general conditions, if the bound on the error probability under the null hypothesis is non-asymptotic, then under the alternative hypothesis the rate of convergence of the error probability can be arbitrarily slow, a well-known phenomenon in nonparametric statistics. We note that these cited results, and indeed most of the results in the literature on testing for conditional independence, were for real-valued random variables and/or assumed certain special distribution types, typically the existence of a joint probability density function or that both X and Y are discrete, as in [12]. As we remarked earlier, such assumptions exclude problems where Y is discrete and X is continuous (typical in classification) or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution-free and its convergence properties are almost distribution-free (we do assume a mild Lipschitz-type condition; see the upcoming Condition 1).
In our hypothesis testing setup, the alternative hypothesis, , is the complement to the null hypothesis, ; therefore, there is no separation gap between the hypotheses. Dembo and Peres [14] and Nobel [15] characterized hypothesis pairs that admitted strongly consistent tests; i.e., tests that with a probability of one only make finitely many Type I and II errors. This property is called discernibility. As an illustration of the intricate nature of the discernibility concept, Dembo and Peres [14] demonstrated an exotic example, where the null hypothesis is that the mean of a random variable is rational, while the alternative hypothesis is that this mean minus is rational. (See also Cover [16] and Kulkarni and Zeitouni [17].) The discernibility property shows up in Biau and Györfi [18] (testing homogeneity), Devroye and Lugosi [19] (classification of densities), Gretton and Györfi [20] (testing independence), Morvai and Weiss [21] and Nobel [15] (classification of stationary processes), among others.
In the remainder of this section, under mild conditions for the distribution of , we study discernibility in the context of lossless transformations for statistical inference with general risk. We will make strong use of the multivariate-partitioning-based test of Györfi and Walk [2].
Let denote the joint distribution of and similarly for any marginal distribution of ; e.g., denotes the distribution of the pair . As in Györfi and Walk [2], introduce the following empirical distributions:
and
for Borel sets , , and .
For the sake of simplicity, assume that X, Y, and are bounded. Otherwise, we apply a componentwise, one-to-one scaling in the interval . Obviously, the losslessness null hypothesis is invariant under such a scaling. Let
be the finite cubic partitions of the ranges of X, Y, and Z, with all the cubes having common side lengths (thus, is proportional to ). As in [2], we define the test statistic
Our test rejects if
and accepts it if , where the threshold is set to
where the constant satisfies
In this setup, the distribution of is arbitrary; its components can be discrete, absolutely continuous, a mixture of the two, or even singularly continuous. It is important to note that to construct this test, there is no need to know the type of distribution.
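The precise form of the statistic and of the data-dependent threshold follows Györfi and Walk [2]. Purely to illustrate the mechanics, the following sketch computes an L1-type partition statistic that compares the empirical joint cell masses with the product predicted by conditional independence (the function name, the cell width h, and the two toy samples are assumptions; the rejection threshold of the actual test is not reproduced here):

```python
from collections import Counter

def partition_stat(sample, h):
    """L1-type partition statistic for testing conditional independence of
    X and Y given Z: sum over cells A x B x C of
    |nu_n(AxBxC) - nu_n(AxC) * nu_n(BxC) / mu_n(C)|."""
    cell = lambda v: int(v / h)  # cubic cells of side h on [0, 1)
    n = len(sample)
    abc = Counter((cell(x), cell(y), cell(z)) for x, y, z in sample)
    ac = Counter((cell(x), cell(z)) for x, y, z in sample)
    bc = Counter((cell(y), cell(z)) for x, y, z in sample)
    c = Counter(cell(z) for x, y, z in sample)
    stat = sum(abs(cnt / n - ac[(a, cz)] * bc[(b, cz)] / (n * c[cz]))
               for (a, b, cz), cnt in abc.items())
    # cells with zero joint count still contribute their product term
    for (a, cz1), na in ac.items():
        for (b, cz2), nb in bc.items():
            if cz1 == cz2 and (a, b, cz1) not in abc:
                stat += na * nb / (n * c[cz1])
    return stat

# Within each Z-cell the (X, Y) points of the first sample factorize exactly,
# so the statistic is zero; the second sample couples X and Y inside each cell.
indep = [(x, y, z) for z in (0.25, 0.75) for x in (0.2, 0.8) for y in (0.2, 0.8)]
dep = [(x, x, z) for z in (0.25, 0.75) for x in (0.2, 0.8)]
```

On these samples, `partition_stat(indep, 0.5)` is exactly zero while `partition_stat(dep, 0.5)` is strictly positive, mirroring the behavior of the test under the null and alternative hypotheses.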
We assume that the joint distribution of X, Y, and satisfies the following assumption.
Condition 1.
Let be the density of the conditional distribution with respect to the distribution as a dominating measure and introduce the notation
Assume that for some , satisfies the condition
for all n.
We note that the ordinary Lipschitz condition
implies (5). This latter condition is equivalent to
where denotes the total variation distance between distributions P and Q. In Neykov, Balakrishnan, and Wasserman [12], condition (6) is called the Null TV Lipschitz condition.
The next theorem is an adaptation and extension of the results in Györfi and Walk [2] to this particular problem of lossless transformation. In [2], it was assumed that the sequence of partitions is nested, while we make no such assumption. The proof, in which an error made in [2] is also corrected, is relegated to Section 5.
Theorem 2.
Suppose that X, Y, and are bounded and Condition 1 holds for all n. If the sequence satisfies
and
then we have the following:
- (a)
- Under the losslessness null hypothesis , we have for all , and therefore, because by (8) and (9), after a random sample size, the test produces no error with a probability of one.
- (b)
- Under the alternative hypothesis , thus, with a probability of one, after a random sample size, the test produces no error.
Remark 1.
- (i)
- The choice with satisfies both conditions (7) and (8).
- (ii)
- Note that using (4), is of order . Since we have
this means that is of order .
An important special transformation is given by the feature selection defined in the Introduction. Theorem 2 demonstrates the possibility of universally lossless dimension reduction for any multivariate feature vector. Note that in the setup of feature selection, the partition can be the nested version of and so the calculation of the test statistic is easier.
3. Universally δ-Lossless Transformations
Here, we develop bounds on the excess minimum risk, in terms of mutual information under various assumptions about the loss function. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be universally δ-lossless with respect to fairly general classes of loss functions ℓ.
3.1. Preliminaries on Mutual Information
Let denote the joint distribution of the pair and let denote the product of the marginal distributions of X and Y, respectively. The mutual information between X and Y, denoted by , is defined as
where
is the Kullback–Leibler (KL) divergence between probability distributions P and Q (here, means that P is absolutely continuous with respect to Q with the Radon–Nikodym derivative ). Thus, is always nonnegative and if and only if X and Y are independent (note that is possible). In this definition and throughout the paper, log denotes the natural logarithm.
For random variables U and V (both taking values in finite-dimensional Euclidean spaces), let denote the conditional distribution of U, given V. Furthermore, let denote the stochastic kernel (regular conditional probability) induced by . Thus, in particular, for each measurable set A.
Given another random variable Z, the conditional mutual information is defined as
The integral above can also be denoted by and is called the conditional KL divergence. One can define
so that
From this definition it is clear that if and only if X and Y are conditionally independent given Z, i.e., if and only if (or equivalently, if and only if ).
Another way of expressing is
One can see that in a similar way to can be expressed as
Properties of mutual information and conditional mutual information, their connections to the KL divergence, and identities involving these information measures are detailed in, e.g., Cover and Thomas ([22], Chapter 2) and Polyanskiy and Wu ([9], Chapter 3).
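For discrete distributions, the information quantities of this subsection can be computed directly from their defining formulas. The sketch below (helper names and the toy Markov-chain example are assumptions) also checks numerically that I(X;Y|Z) vanishes for a Markov chain and that, for Z = T(X), the difference I(X;Y) − I(Z;Y) vanishes with it:

```python
from math import log

def mutual_information(pxy):
    """I(X;Y) = D(P_XY || P_X x P_Y) in nats, for a joint pmf {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

def conditional_mi(pxyz):
    """I(X;Y|Z) = sum_z P(Z=z) I(X;Y | Z=z); zero iff X and Y are
    conditionally independent given Z."""
    pz = {}
    for (x, y, z), p in pxyz.items():
        pz[z] = pz.get(z, 0.0) + p
    total = 0.0
    for z0, w in pz.items():
        slice_xy = {(x, y): p / w for (x, y, z), p in pxyz.items() if z == z0}
        total += w * mutual_information(slice_xy)
    return total

# Toy Markov chain (an assumption): X uniform on {0,1,2}, Z = min(X, 1),
# and Y depends on X only through Z, so I(X;Y|Z) = 0 and I(X;Y) = I(Z;Y).
pxyz = {}
for x in (0, 1, 2):
    z = min(x, 1)
    q = 0.1 if z == 0 else 0.7      # P(Y=1 | Z=z)
    pxyz[(x, 1, z)] = q / 3
    pxyz[(x, 0, z)] = (1 - q) / 3

pxy, pzy = {}, {}
for (x, y, z), p in pxyz.items():
    pxy[(x, y)] = pxy.get((x, y), 0.0) + p
    pzy[(z, y)] = pzy.get((z, y), 0.0) + p
```

For this distribution, `conditional_mi(pxyz)` is (numerically) zero and `mutual_information(pxy)` equals `mutual_information(pzy)`, illustrating the equivalence stated after the definition of conditional mutual information.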
3.2. Mutual Information Bounds and δ-Lossless Transformations
A real random variable U with finite expectation is said to be -sub-Gaussian for some if
Furthermore, we say that U is conditionally -sub-Gaussian given another random variable V if we have a.s.
The following result gives a quantitative upper bound on the excess minimum risk in terms of the mutual information difference under certain, not too restrictive, conditions. Note that always holds.
Given ϵ > 0, we call an estimator ϵ-optimal if .
Theorem 3.
Let be a measurable transformation and assume that for any , there exists an ϵ-optimal estimator of Y from X, such that is conditionally -sub-Gaussian given for every , i.e.,
for all and , where satisfies . Then, one has
Remark 2.
- (i)
- In case , we interpret the right hand side of (15) as ∞. With this interpretation, the bound always holds.
- (ii)
- We show in Section 4.2 that the sub-Gaussian condition (14) holds for the regression problem with squared error if , where N is independent noise having a zero mean and finite fourth moment , and the regression function is bounded. In particular, the bound in the theorem holds if N is normal with zero mean and m is bounded. We note that Theorem 6 and Corollary 3 in Xu and Raginsky [4] give bounds similar to (15), in the somewhat different context of Bayesian learning. However, the conditions there exclude, e.g., regression models in the form if for some .
- (iii)
- Although hidden in the notation, depends on the loss function ℓ. Thus, the upper bound (15) is the product of two terms, the second of which,is independent of the loss function.
- (iv)
- The bound in the theorem is not tight in general. In Section 4.3, an example is given in the context of portfolio selection, where the excess risk can be upper bounded by the difference .
- (v)
- The proof of Theorem 3 and those of its corollaries go through virtually without change if we replace with any -valued random variable Z, such that . Under the conditions of the theorem, we then have
In fact, Theorem 3 and its corollaries hold for general random variables Y, X, and Z taking values in complete and separable metric (Polish) spaces , , and , respectively, if .
The proof of Theorem 3 is based on a slight generalization of Raginsky et al. ([5], Lemma 10.2), which we state next. In the lemma, U and V are arbitrary abstract random variables defined on the same probability space and taking values in spaces and , respectively; and are independent copies of U and V (so that ); and is a measurable function.
Lemma 1.
Assume that is -sub-Gaussian for all , where . Then,
Proof.
We essentially copy the proof of ([5], Lemma 10.2), where it was assumed that does not depend on u. With this restriction, the sub-Gaussian condition (14) in Theorem 3 would have to hold with uniformly over y. This condition would exclude regression models with independent sub-Gaussian noise and, a fortiori, models with independent noise that do not possess finite absolute moments of all orders, while our Theorem 3 can also be applied in such cases (see Section 4.2).
We make use of the Donsker–Varadhan variational representation of the relative entropy ([23] Corollary 4.15), which states that
where the supremum is over all measurable , such that . Applying this with , , and , we obtain
where the second inequality follows from the assumption that is -sub-Gaussian. Maximizing the right-hand side of (16) over gives, after rearrangement,
Since and are independent, , and we obtain
where (18) follows from Jensen’s inequality, (19) follows from (17), in (20) we used the Cauchy–Schwarz inequality, and the last equality follows from (11). □
Proof of Theorem 3.
Let and be random variables, such that , , , and and are conditionally independent given . Thus, the joint distribution of the triple is .
We apply Lemma 1 with , , and . Note that, using the conditions of the theorem, we can choose an ϵ-optimal , such that for every y, is conditionally -sub-Gaussian given . Consider and as regular (unconditional) expectations taken with respect to and , respectively, and consider as regular mutual information between random variables with the distribution . Since and are conditionally independent given , Lemma 1 yields
Recalling that and have the same distribution, and applying Jensen’s inequality and the Cauchy–Schwarz inequality as in (18) and (20), we obtain
On the one hand, we have
where the first equality follows from Theorem 1 with the conditional independence of and given , and the second follows, since and have the same distribution by construction. On the other hand, . Thus, (22) and (23) imply
which proves the upper bound in (15), since is arbitrary. By expanding in two different ways using the chain rule for mutual information (e.g., Cover and Thomas ([22], Thm. 2.5.2)), and using the conditional independence of Y and given X, one obtains , which shows the equality in (15). □
We state two corollaries for special cases. In the first, we assume that ℓ is uniformly bounded, i.e., . For any , let denote the collection of all loss functions ℓ with . Recall the notion of a universally δ-lossless transformation from Definition 2.
Corollary 1.
Suppose the loss function ℓ is bounded. Then, for any measurable , we have
Therefore, whenever
the transformation T is universally δ-lossless for the family , i.e., for all ℓ with .
Remark 3.
- (i)
- The bound of the theorem can be used to give an estimation-theoretic motivation of the information bottleneck (IB) problem; see Section 4.4.
- (ii)
- Let . For bounded ℓ, the inequality
was proven in Makhdoumi et al. ([24], Theorem 1) for discrete alphabets, to solve the so-called privacy funnel problem. This inequality follows from (15) by setting to be constant there.
- (iii)
- A simple self-contained proof of (24) (see below) was provided by Or Ordentlich and communicated to the second author by Shlomo Shamai [25], in response to an early version of this manuscript. The bound in (24) seems to have first appeared in published form in Hafez-Kolahi et al. ([26], Lemma 1), where the proof was attributed to Xu and Raginsky [27].
Proof of Corollary 1.
If ℓ is uniformly bounded, then for any one has for all y and x. Then Hoeffding’s lemma (e.g., Boucheron et al. ([23], Lemma 2.2)) implies that for all y, is conditionally -sub-Gaussian with given . Since an ϵ-optimal estimator exists for any , and is conditionally -sub-Gaussian given by the preceding argument, (24) follows from Theorem 3. The second statement follows directly from (24) and the fact that for all .
The following alternative argument by Or Ordentlich [25] is based on Pinsker’s inequality for the total variation distance in terms of the KL divergence (see, e.g., ([9], Theorem 7.9)). For bounded ℓ, this gives a direct proof of an analogue of the key inequality (22) in the proof of Theorem 3. This argument avoids Lemma 1 and the machinery introduced by the sub-Gaussian assumption.
Using the same notation as in the proof of Theorem 3 and letting and , we have
The rest of the proof proceeds exactly as in Theorem 3. □
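Corollary 1 can be sanity-checked numerically. For the 0–1 loss (B = 1), Hoeffding's lemma gives σ = 1/2, and the sub-Gaussian argument yields the bound √((I(X;Y) − I(T(X);Y))/2) on the excess Bayes risk. The sketch below evaluates both sides for an assumed toy distribution (the posterior values and the merging map T are illustrative, not from the text):

```python
from math import log, sqrt

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * log(p) - (1 - p) * log(1 - p)

# Toy model (an assumption): X uniform on {0,1,2}, P(Y=1|X=x) given by post,
# and T merges x = 1 and x = 2 into a single point.
post = [0.1, 0.4, 0.9]
post_z1 = (post[1] + post[2]) / 2            # P(Y=1 | T(X) = 1)

# Bayes risks under the 0-1 loss (so B = 1).
risk_x = sum(min(q, 1 - q) for q in post) / 3
risk_z = (min(post[0], 1 - post[0]) + 2 * min(post_z1, 1 - post_z1)) / 3
excess = risk_z - risk_x

# I(X;Y) - I(T(X);Y) = H(Y|T(X)) - H(Y|X), since H(Y) is common to both.
delta_i = (h2(post[0]) + 2 * h2(post_z1)) / 3 - sum(h2(q) for q in post) / 3

bound = sqrt(delta_i / 2)                    # B * sqrt(delta_i / 2) with B = 1
```

For this distribution the excess Bayes risk is about 0.067, comfortably below the information-theoretic bound of about 0.22; as Remark 2(iv) notes, the bound is not tight in general.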
In the second corollary, we do not require that ℓ be bounded but assume that an optimal estimator from X to Y exists, such that is conditionally -sub-Gaussian given , where .
Corollary 2.
Assume that an optimal estimator of Y from X exists, i.e., the measurable function satisfies . Furthermore, suppose that the sub-Gaussian condition of Theorem 3 holds with (i.e., (14) holds for ). Then,
Proof.
The corollary immediately follows from Theorem 3, since an optimal is ϵ-optimal for all .
For the next corollary, let denote the collection of all loss functions ℓ, such that
for some function with .
Corollary 3.
If T is a transformation such that
then T is universally δ-lossless for the family .
Proof.
Since is a.s. upper bounded by for any , using Hoeffding’s lemma ([23], Lemma 2.2), we have that is conditionally -sub-Gaussian given . Thus, from Corollary 2, for all , we have
if . □
The next corollary generalizes, and gives a much simplified proof of, a result of Faragó and Györfi [28]; see also Devroye, Györfi, and Lugosi ([10], Theorem 32.3). This result states for binary classification (Y is 0-1-valued and ) that if a sequence of functions is such that in probability as , then as .
Corollary 4.
Assume that a sequence of transformations is such that in distribution (i.e., weakly) as . Then, for any bounded loss function ℓ,
Note that this corollary and its proof still hold without any changes if X takes values in an arbitrary complete separable metric space. For example, in the setup of function classification, X may take values in an function space for , and is a truncated series expansion or a quantizer. Interestingly, here the asymptotic losslessness property is guaranteed, even in the case where the sequence of transformations and the loss function ℓ are not matched at all.
Proof.
If in distribution, then clearly in distribution. Thus, the lower semicontinuity of mutual information with respect to convergence in distribution (see, e.g., Polyanskiy and Wu ([9], Equation (4.28))) implies
Since for all n, we obtain
Combined with Corollary 1 (with T replaced with ), this gives
□
4. Applications
4.1. Classification
For classification, is the finite set and the cost is the loss
In this setup, the risk of estimator f is the error probability . With the notation
the optimal estimator is the Bayes decision
and the minimum risk is the Bayes error probability
If stands for the Bayes error probability of the transformed observation vector , then (24) with yields the upper bound
see also ([4], Corollary 2) for a similar bound in the context of Bayesian learning.
As a special case, the feature selector is lossless if
Györfi and Walk [29] studied the corresponding hypothesis testing problem. Using a k-nearest-neighbor (k-NN) estimate of the excess Bayes error probability , they introduced a test statistic and accepted hypothesis (28) if the test statistic was less than a threshold. Under certain mild conditions, the strong consistency of this test was proven.
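For a discrete toy distribution, losslessness of a feature selector can be checked exactly by comparing Bayes error probabilities, as in condition (28). A minimal sketch (the joint distribution, in which Y depends only on the first feature, is an assumption for illustration):

```python
def bayes_error(joint):
    """Bayes error probability 1 - sum_x max_y p(x, y) for a discrete joint
    pmf {(x, y): prob}; x may be a tuple of features."""
    by_x = {}
    for (x, y), p in joint.items():
        d = by_x.setdefault(x, {})
        d[y] = d.get(y, 0.0) + p
    return 1.0 - sum(max(d.values()) for d in by_x.values())

def select_features(joint, select):
    """Joint pmf of (T(X), Y) for a feature selector T."""
    out = {}
    for (x, y), p in joint.items():
        k = (select(x), y)
        out[k] = out.get(k, 0.0) + p
    return out

# Toy joint distribution (an assumption): X = (X1, X2) uniform on {0,1}^2
# and P(Y=1 | X) = 0.8 if X1 = 1, else 0.2 -- so Y ignores X2 entirely.
joint = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        p1 = 0.8 if x1 == 1 else 0.2
        joint[((x1, x2), 1)] = 0.25 * p1
        joint[((x1, x2), 0)] = 0.25 * (1 - p1)

err_full = bayes_error(joint)
err_x1 = bayes_error(select_features(joint, lambda x: x[0]))  # lossless
err_x2 = bayes_error(select_features(joint, lambda x: x[1]))  # lossy
```

Here keeping only X1 leaves the Bayes error unchanged (0.2), so that selector satisfies (28), while keeping only X2 degrades it to 0.5.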
4.2. Nonparametric Regression
For the nonparametric regression problem, the cost is the squared loss
and the best statistical inference is the regression function
(here, we assume ). Then, the minimum risk is the residual variance
If and denote the residual variances for the observation vectors X and , respectively, then
Note that the excess residual variance does not depend on the distribution of the residual .
Next, we show that the conditions of Corollary 2 hold with for the important case
where N is a zero-mean noise variable that is independent of X and satisfies , and m is bounded as for all x. For this model, we have
Thus, is a nonnegative random variable a.s. bounded by , which implies via Hoeffding’s lemma (e.g., ([23], Lemma 2.2)) that it is -sub-Gaussian given with . We have
thus, the conditions of Corollary 2 hold and we obtain
Again, the feature selection is called lossless, when holds. As a test statistic, Devroye et al. [30] introduced a 1-NN estimate of and proved the strong consistency of the corresponding test.
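The observation that the excess residual variance does not involve the noise distribution can be illustrated exactly for a discrete toy model, since for Y = m(X) + N with independent zero-mean N the excess equals E[(m(X) − E[m(X)|T(X)])²] (the regression function m, the quantizer T, and the uniform X below are assumptions):

```python
# Y = m(X) + N with N independent of X: L*(X) = Var(N), and the excess
# residual variance L*(T(X)) - L*(X) = E[(m(X) - E[m(X)|T(X)])^2],
# i.e., the noise term cancels out of the difference.
xs = [0, 1, 2, 3]                  # X uniform on {0, 1, 2, 3} (toy example)
px = 0.25
m = lambda x: float(x)             # regression function (assumed)
T = lambda x: x // 2               # 2-level quantizer (assumed)

# Conditional means E[m(X) | T(X) = t]
cond_mean = {}
for t in set(T(x) for x in xs):
    group = [m(x) for x in xs if T(x) == t]
    cond_mean[t] = sum(group) / len(group)

excess = sum(px * (m(x) - cond_mean[T(x)]) ** 2 for x in xs)
```

For this quantizer the excess residual variance is exactly 0.25, regardless of the noise variance added on top of m(X).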
4.3. Portfolio Selection
The next example is related to the negative of the log-loss or log-utility; see Algoet and Cover [31], Barron and Cover [32], Chapters 6 and 16 in Cover and Thomas [22], Györfi et al. [33].
Consider a market consisting of assets. The evolution of the market in time is represented by a sequence of (random) price vectors with
where the jth component of denotes the price of the jth asset in the nth trading period. Let us transform the sequence of price vectors into the sequence of return (relative price) vectors , defined as
where
Constantly rebalanced portfolio selection is a multi-period investment strategy, where at the beginning of each trading period the investor redistributes the wealth among the assets. The investor is allowed to diversify their capital at the beginning of each trading period according to a portfolio vector . The jth component of b denotes the proportion of the investor’s capital invested in asset j. Here, we assume that the portfolio vector b has nonnegative components with . The simplex of possible portfolio vectors is denoted by .
Let denote the investor’s initial capital. Then, at the beginning of the first trading period, is invested into asset j, and this results in return , and therefore at the end of the first trading period the investor’s wealth becomes
where denotes the standard inner product in . For the second trading period, is the new initial capital
By induction, for the trading period n, the initial capital is , and therefore
The asymptotic average growth rate of this portfolio selection strategy is
assuming a limit exists.
If the market process is memoryless, i.e., it is a sequence of i.i.d. random return vectors, then the strong law of large numbers implies that the best constantly rebalanced portfolio (BCRP) is the log-optimal portfolio:
while the best asymptotic average growth rate is
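For a small i.i.d. market, the log-optimal constantly rebalanced portfolio maximizing E[log⟨b, X⟩] over the simplex can be found by direct search; a two-asset sketch (the toy return distribution is an assumption for illustration):

```python
from math import log

# Toy i.i.d. two-asset market (an assumption): the return vector equals
# (2.0, 0.5) or (0.5, 2.0), each with probability 1/2.
outcomes = [((2.0, 0.5), 0.5), ((0.5, 2.0), 0.5)]

def growth_rate(b):
    """Asymptotic growth rate W(b) = E[log <b, X>] of the constantly
    rebalanced portfolio (b, 1 - b)."""
    return sum(p * log(b * r[0] + (1 - b) * r[1]) for r, p in outcomes)

# Log-optimal (best constantly rebalanced) portfolio via grid search on [0, 1].
grid = [i / 1000 for i in range(1001)]
b_star = max(grid, key=growth_rate)
w_star = growth_rate(b_star)
```

By symmetry the optimum is the evenly split portfolio b = 1/2 with growth rate log(1.25) per period, whereas holding either single asset (b = 0 or b = 1) gives zero asymptotic growth.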
Barron and Cover [32] extended this setup to portfolio selection with side information. Assume that are valued side information vectors, such that are i.i.d. and in each round n the portfolio vector may depend on . The strong law of large numbers yields
Therefore, the log-optimal portfolio has the form
and the best asymptotic average growth rate is
Barron and Cover ([32], Thm. 2) proved that
The next theorem generalizes this result by upper bounding the loss in the best asymptotic growth rate when, instead of X, only degraded side information is available.
Theorem 4.
For any measurable ,
assuming the terms on the right hand side are finite.
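The Barron–Cover-type bound that Theorem 4 generalizes can be checked numerically in a small discrete example. The sketch below uses a hypothetical market (invented for illustration) with binary side information X: it computes the growth-rate gain from observing X and verifies that the gain is at most I(X; R).

```python
import numpy as np

# Hypothetical discrete market: a risk-free asset (return 1) and a risky
# asset returning 2 or 1/2.  Binary side information X is uniform on
# {0, 1}, with P(risky return = 2 | X = 0) = 0.7 and
# P(risky return = 2 | X = 1) = 0.3, so the marginal is P(2) = 0.5.
p0, p1 = 0.7, 0.3
grid = np.linspace(0.0, 1.0, 100_001)  # weight w on the risky asset

def growth(w, p):
    """Expected log return E[log((1 - w) + w * R)] when P(R = 2) = p."""
    return p * np.log(1 + w) + (1 - p) * np.log(1 - w / 2)

# Best growth rate without side information (marginal P(R = 2) = 0.5).
W = growth(grid, 0.5).max()

# Best growth rate with side information: optimize separately for each X.
W_side = 0.5 * growth(grid, p0).max() + 0.5 * growth(grid, p1).max()

def H(p):  # binary entropy in nats
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# I(X; R) = H(R) - H(R | X), in nats.
I_XR = H(0.5) - 0.5 * (H(p0) + H(p1))

print(f"gain from side information: {W_side - W:.4f} nats")  # ~0.0797
print(f"I(X; R)                   : {I_XR:.4f} nats")        # ~0.0823
```

The gain (about 0.0797 nats per period) indeed stays below I(X; R) ≈ 0.0823 nats, and in this particular example the bound is nearly tight.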
Remark 4.
- (i)
- As in Theorem 3, the difference in the upper bound is equal to , a quantity that is always nonnegative but may be equal to ∞. In this case, we interpret the right hand side as ∞.
- (ii)
- There is a correspondence between this setup of portfolio selection and the setup in the previous sections. In particular, Y from the previous sections is equal to R with a range , and the inference is taking values in . Then, the loss is . If we assume that for all , then , and so Corollary 1 implies . Note that, from the point of view of application, (30) is a mild condition. For example, for NYSE daily data ; see Györfi et al. [34].
Proof.
Let be a generic copy of the . Writing out explicitly the dependence of on , we have
and from (11) we have
Thus, the bound in (29) can be written as
Furthermore, letting , we have
Since is a Markov chain, , and we obtain
Applying (31) with replaced with and replaced with with z fixed, we can bound the expression in parentheses as
and therefore
where (32) follows from the alternative expression (12) of the conditional mutual information.
As in the proof of Theorem 3, the conditional independence of R and given X implies
which completes the proof. □
4.4. Information Bottleneck
Let X and Y be random variables as in Section 2. When , the joint distribution of the triple is determined (for fixed ) by the conditional distribution (transition kernel) as . The information bottleneck (IB) framework can be formulated as the study of the constrained optimization problem
for a given , where the maximization is over all transition kernels .
The IB framework was originally proposed by Tishby et al. [35]. Its solution is a transition kernel , interpreted as a stochastic transformation, that “encodes” X into a “compressed” representation Z preserving the relevant information about Y through maximizing , while compressing X by requiring that . The intuition behind this framework is that maximizing makes the representation Z retain the predictive power of X with respect to Y, while the requirement keeps Z concise.
Note that, in case X is discrete and has finite entropy , setting , or setting formally in the general case, the constraint becomes vacuous and (assuming the alphabet of Z is sufficiently large) the resulting Z will achieve the upper bound , so that , i.e., . Thus, the solution to (33) can be considered as a stochastically relaxed version of a minimal sufficient statistic for X in predicting Y (see Goldfeld and Polyanskiy ([36], Section II.C) for more on this interpretation). Recent tutorials on the IB problem include Asoodeh and Calmon [37] and Zaidi et al. [38].
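For small finite alphabets, the Lagrangian form of the IB problem (33) can be solved by the self-consistent alternating iterations of Tishby et al. [35]. A minimal sketch follows; the joint distribution and alphabet sizes are arbitrary choices for illustration, and note that the trade-off parameter beta here multiplies the relevance term, so larger beta means less compression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint distribution p(x, y) on small finite alphabets.
nx, ny, nz = 4, 3, 4
p_xy = rng.random((nx, ny))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

def mutual_information(p_joint):
    """Mutual information (in nats) between the coordinates of a joint pmf."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log((p_joint / (pa * pb))[mask])))

def ib_iterate(beta, iters=200):
    """Self-consistent IB iterations for the stochastic encoder p(z|x)."""
    p_z_given_x = rng.random((nx, nz))
    p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_z = p_x @ p_z_given_x                          # marginal of Z
        p_y_given_z = (p_z_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_z /= p_z[:, None]
        # D(p(y|x) || p(y|z)) for every pair (x, z)
        kl = np.sum(p_y_given_x[:, None, :] *
                    np.log(p_y_given_x[:, None, :] / p_y_given_z[None, :, :]),
                    axis=2)
        p_z_given_x = p_z[None, :] * np.exp(-beta * kl) + 1e-12  # keep > 0
        p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    p_xz = p_z_given_x * p_x[:, None]
    p_zy = p_xz.T @ p_y_given_x   # joint of (Z, Y) via the chain Y - X - Z
    return mutual_information(p_xz), mutual_information(p_zy)

ixy = mutual_information(p_xy)
results = {beta: ib_iterate(beta) for beta in (0.5, 2.0, 20.0)}
for beta, (i_xz, i_zy) in results.items():
    print(f"beta={beta:5.1f}:  I(X;Z)={i_xz:.4f}  I(Z;Y)={i_zy:.4f}  "
          f"(I(X;Y)={ixy:.4f})")
```

Sweeping beta traces out the IB trade-off: at small beta the encoder collapses toward I(X;Z) ≈ 0, while at large beta it retains nearly all of I(X;Y), consistent with the minimal-sufficient-statistic interpretation above.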
Theorem 3 and its corollaries can be used to motivate the IB principle from an estimation-theoretic viewpoint. Let
be the value function for (33) and a resulting optimal Z (assuming such a maximizer exists). From the remark after Theorem 3, we know that the bounds given in the theorem and in its corollaries remain valid if we replace with a random variable Z, such that . Then, for example, Corollary 1 implies that
for all ℓ such that .
Thus, the IB paradigm minimizes, under the complexity constraint ≤, an upper bound on the difference that universally holds for all loss functions ℓ with . The resulting will then have guaranteed performance in predicting Y with respect to all sufficiently bounded loss functions. This gives a novel operational interpretation of the IB framework that seems to have been overlooked in the literature.
4.5. Deep Learning
The IB paradigm can also serve as a learning objective in deep neural networks (DNNs). Here the Lagrangian relaxation of (33) is considered. In particular, letting X denote the input and the output of the last hidden layer of the DNN, where is the collection of network parameters (weights), the objective is to maximize
over for a given . The parameter controls the trade-off between how informative is about Y, measured by , and how much is “compressed,” measured by . Clearly, larger values of correspond to smaller values of and thus to more compression. Here, is either a deterministic function of X in the form of , where represents the deterministic DNN, or it is produced by a stochastic kernel , parameterized by the network parameters . The latter is achieved by injecting independent noise into the network’s intermediate layers.
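The effect of noise injection on the compression term can be seen in a toy computation. Below, a deterministic encoder Z = f(X), for which I(X;Z) equals H(Z), is mixed with uniform noise, turning it into a stochastic kernel with a strictly smaller I(X;Z). The encoder f and the noise model are hypothetical choices for illustration, not the paper's construction.

```python
import numpy as np

def mi_from_kernel(p_x, kernel):
    """I(X;Z) in nats for input pmf p_x and channel kernel[x, z] = p(z|x)."""
    p_xz = p_x[:, None] * kernel
    p_z = p_xz.sum(axis=0)
    mask = p_xz > 0
    ratio = p_xz / (p_x[:, None] * p_z[None, :])
    return float(np.sum(p_xz[mask] * np.log(ratio[mask])))

nx, nz = 8, 4
p_x = np.full(nx, 1 / nx)                # X uniform on {0, ..., 7}
f = np.arange(nx) % nz                   # deterministic encoder f(x) = x mod 4
det_kernel = np.eye(nz)[f]               # p(z|x) = 1{z = f(x)}

mis = []
for eps in (0.0, 0.3, 1.0):
    # With probability eps, replace f(x) by a uniformly random symbol.
    noisy = (1 - eps) * det_kernel + eps / nz
    mis.append(mi_from_kernel(p_x, noisy))
    print(f"eps={eps:.1f}:  I(X;Z) = {mis[-1]:.4f} nats")
```

At eps = 0 the encoder is deterministic and I(X;Z) = H(Z) = log 4; at eps = 1 the representation is pure noise and I(X;Z) = 0, with intermediate noise levels interpolating between the two.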
In addition to the motivation explained in the previous section, the IB framework for DNNs can be thought of as a regularization method that results in improved generalization capabilities for a network trained on data using stochastic gradient-based methods; see, e.g., Tishby and Zaslavsky [39], Shwartz-Ziv and Tishby [40], Alemi et al. [41], as well as many other references in the excellent survey article by Goldfeld and Polyanskiy [36] and the special issue [42] on information bottleneck and deep learning.
As in the previous section, our Theorem 1 and its corollaries can serve as a (partial) justification for setting (34) as a learning objective. Assume that after training with a given , the obtained has (true) mutual information with Y (typically, this will not be the optimal solution, since maximizing (34) is not feasible and in practice only a proxy lower bound is optimized during training; see, e.g., Alemi et al. [41]). Then, by Corollary 1, the obtained network has a guaranteed predictive performance
for all loss functions ℓ with , where
5. Proof of Theorem 2
Proof of Theorem 2.
- (a)
- The bounds given in the proof of Theorem 1 in [2] imply , where and . Using large deviation inequalities from Beirlant et al. [43] and from Biau and Györfi [18], Györfi and Walk [2] proved that for all , and , . We note that the bounds on the probabilities for were proven in [2] without either assuming the null hypothesis or using the condition that the partitions are nested. Under the null hypothesis, Györfi and Walk [2] claimed that . As Neykov et al. [12] observed, this was incorrect. In order to resolve the gap, we show that under and condition (5), and under the null hypothesis, the last term in (35) is , i.e., if n is large enough. The null hypothesis implies that . Thus, . Let and be as in Condition 1. Then, , where in the last step we used condition (5). The inequalities (35) and (36) imply that if . Since is proportional to , condition (8) on implies , and thus, by the Borel–Cantelli lemma, after a random sample size the test makes no error with probability one.
- (b)
- This proof is a refinement of the proof of Corollary 1 in [2], in which we avoid the condition used there that the sequences of partitions and are nested. According to the proof of Part (a) (see the remark after (35)), we obtain that . To simplify the notation, let , , and . Let be the expected total variation distance between and : , where the supremum is taken over all Borel subsets F of . It suffices to prove that, using the condition , . One has that , where . In [2], it was shown that the condition implies if the sequence of partitions is nested. In order to avoid this nestedness condition, we introduce the density of the conditional distribution with respect to the distribution of as a dominating measure, and similarly let be the density of the conditional distribution with respect to , i.e., . Then, , and therefore the term on the right-hand side of (37) converges to zero as long as , which follows from the standard technique for bounding the bias of the partitioning regression estimate of the regression function ; see Theorem 4.2 in [44]. The terms in (38) and (39) can be dealt with analogously. Thus, . For fixed z, implies ; see Abou-Jaoude [45] and Csiszár [46]. Therefore, the dominated convergence theorem yields
□
Note that condition (5) is not used at all in the proof of Part (b).
6. Concluding Remarks
We studied the excess minimum risk in statistical inference and, under mild conditions, gave a strongly consistent procedure for testing from data whether a given transformation of an observed feature vector results in zero excess minimum risk for all loss functions. It is an open research problem whether a strong universal test exists, i.e., a test that is strongly consistent without any conditions on the transformation and on the underlying distribution. We also developed information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. The bounds were not stated in the most general form possible: the observed quantities were restricted to take values in Euclidean spaces, and we did not allow transformations that are random functions of the observation; both restrictions could be relaxed. The bounds could also be sharpened in specific cases, but in their present form they are already useful. For example, they give an additional theoretical motivation for applying the information bottleneck approach in deep learning.
Author Contributions
Conceptualization, L.G., T.L. and H.W.; Methodology, L.G. and T.L.; Validation, H.W.; Formal analysis, T.L.; Investigation, H.W.; Writing—original draft, L.G.; Writing—review & editing, L.G., T.L. and H.W. All three authors contributed equally to the published work. All authors have read and agreed to the published version of the manuscript.
Funding
The research of László Györfi has been supported by the National Research, Development and Innovation Fund of Hungary under the 2019-1.1.1-PIACI-KFI-2019-00018 funding scheme. Tamás Linder’s research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
Institutional Review Board Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
T. Linder would like to thank O. Ordentlich and S. Shamai for their helpful comments on an earlier version of this manuscript and for pointing out relevant literature.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Schervish, M.J. Theory of Statistics; Springer Series in Statistics; Springer: New York, NY, USA, 1995. [Google Scholar]
- Györfi, L.; Walk, H. Strongly consistent nonparametric tests of conditional independence. Stat. Probab. Lett. 2012, 82, 1145–1150. [Google Scholar] [CrossRef]
- Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Adv. Neural Inf. Process. Syst. 2017, 30, 2521–2530. [Google Scholar]
- Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. IEEE Trans. Inf. Theory 2022, 68, 7935–7955. [Google Scholar] [CrossRef]
- Raginsky, M.; Rakhlin, A.; Xu, A. Information-Theoretic Stability and Generalization. In Information-Theoretic Methods in Data Science; Rodrigues, M., Eldar, Y., Eds.; Cambridge University Press: Cambridge, UK, 2021; pp. 302–329. [Google Scholar] [CrossRef]
- Lugosi, G.; Neu, G. Generalization bounds via convex analysis. In Proceedings of the 34th Annual Conference on Learning Theory (COLT), London, UK, 2–5 July 2022; pp. 3524–3546. [Google Scholar]
- Jose, S.T.; Simeone, O. Information-theoretic generalization bounds for meta-learning and applications. Entropy 2021, 23, 126. [Google Scholar] [CrossRef]
- Hafez-Kolahi, H.; Moniri, B.; Kasaei, S. Information-theoretic analysis of minimax excess risk. IEEE Trans. Inf. Theory 2023, 69, 4659–4674. [Google Scholar] [CrossRef]
- Polyanskiy, Y.; Wu, Y. Information Theory: From Coding to Learning; Cambridge University Press: Cambridge, UK, 2022; Forthcoming; Available online: https://people.lids.mit.edu/yp/homepage/data/itbook-export.pdf (accessed on 5 July 2023).
- Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996. [Google Scholar]
- Cai, Z.; Li, R.; Zhang, Y. A distribution free conditional independence test with applications to causal discovery. J. Mach. Learn. Res. 2022, 23, 1–41. [Google Scholar]
- Neykov, M.; Balakrishnan, S.; Wasserman, L. Minimax optimal conditional independence testing. Ann. Stat. 2021, 49, 2151–2177. [Google Scholar] [CrossRef]
- Shah, R.D.; Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Ann. Stat. 2020, 48, 1514–1538. [Google Scholar] [CrossRef]
- Dembo, A.; Peres, Y. A topological criterion for hypothesis testing. Ann. Stat. 1994, 22, 106–117. [Google Scholar] [CrossRef]
- Nobel, A.B. Hypothesis testing for families of ergodic processes. Bernoulli 2006, 12, 251–269. [Google Scholar] [CrossRef]
- Cover, T. On determining the irrationality of the mean of a random variable. Ann. Stat. 1973, 1, 862–871. [Google Scholar] [CrossRef]
- Kulkarni, S.R.; Zeitouni, O. Can one decide the type of the mean from the empirical distribution? Stat. Probab. Lett. 1991, 12, 323–327. [Google Scholar] [CrossRef]
- Biau, G.; Györfi, L. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Trans. Inf. Theory 2005, 51, 3965–3973. [Google Scholar] [CrossRef]
- Devroye, L.; Lugosi, G. Almost sure classification of densities. J. Nonparametr. Stat. 2002, 14, 675–698. [Google Scholar] [CrossRef]
- Gretton, A.; Györfi, L. Consistent nonparametric tests of independence. J. Mach. Learn. Res. 2010, 11, 1391–1423. [Google Scholar]
- Morvai, G.; Weiss, B. On universal algorithms for classifying and predicting stationary processes. Probab. Surv. 2021, 18, 77–131. [Google Scholar] [CrossRef]
- Cover, T.; Thomas, J. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
- Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
- Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the information bottleneck to the privacy funnel. In Proceedings of the 2014 IEEE Information Theory Workshop (ITW), Hobart, TAS, Australia, 2–5 November 2014; pp. 501–505. [Google Scholar]
- Ordentlich, O.; (School of Computer Science and Engineering, Hebrew University of Jerusalem, Jerusalem, Israel); Shamai, S.; (Department of Electrical Engineering, Technion, Haifa, Israel). Personal communication, July 2020.
- Hafez-Kolahi, H.; Moniri, B.; Kasaei, S.; Baghshah, M.S. Rate-distortion analysis of minimum excess risk in Bayesian learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 3998–4007. [Google Scholar]
- Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. arXiv 2020, arXiv:2012.14868. [Google Scholar]
- Faragó, T.; Györfi, L. On the continuity of error distortion function for multiple-hypotheses decisions. IEEE Trans. Inf. Theory 1975, IT-21, 458–560. [Google Scholar] [CrossRef]
- Györfi, L.; Walk, H. Detecting Ineffective Features for Pattern Recognition. Oberwolfach Preprint. 2017. Available online: http://publications.mfo.de/handle/mfo/1314 (accessed on 15 July 2023).
- Devroye, L.; Györfi, L.; Lugosi, G.; Walk, H. A nearest neighbor estimate of the residual variance. Electron. J. Stat. 2018, 12, 1752–1778. [Google Scholar] [CrossRef]
- Algoet, P.; Cover, T.M. Asymptotic optimality and asymptotic equipartition properties of log-optimum investments. Ann. Probab. 1988, 16, 876–898. [Google Scholar] [CrossRef]
- Barron, A.R.; Cover, T.M. A bound on the financial value of information. IEEE Trans. Inf. Theory 1988, 34, 1097–1100. [Google Scholar] [CrossRef]
- Györfi, L.; Ottucsák, G.; Urbán, A. Empirical log-optimal portfolio selections: A survey. In Machine Learning for Financial Engineering; Györfi, L., Ottucsák, G., Walk, H., Eds.; Imperial College Press: London, UK, 2012; pp. 81–118. [Google Scholar]
- Györfi, L.; Ottucsák, G.; Walk, H. The growth optimal investment strategy is secure, too. In Optimal Financial Decision Making under Uncertainty; Consigli, G., Kuhn, D., Brandimarte, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; pp. 201–223. [Google Scholar]
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 1999; pp. 368–377. [Google Scholar]
- Goldfeld, Z.; Polyanskiy, Y. The Information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38. [Google Scholar] [CrossRef]
- Asoodeh, S.; Calmon, F.P. Bottleneck problems: An information and estimation-theoretic view. Entropy 2020, 22, 1325. [Google Scholar] [CrossRef] [PubMed]
- Zaidi, A.; Aguerri, I.E.; Shamai, S. On the information bottleneck problems: Models, connections, applications and information theoretic views. Entropy 2020, 22, 151. [Google Scholar] [CrossRef] [PubMed]
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5. [Google Scholar]
- Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017; pp. 368–377. [Google Scholar]
- Geiger, B.C.; Kubin, G. Information bottleneck: Theory and applications in deep learning, Editorial for special issue on “Information Bottleneck: Theory and Applications in Deep Learning”. Entropy 2020, 22, 1408. [Google Scholar] [CrossRef]
- Beirlant, J.; Devroye, L.; Györfi, L.; Vajda, I. Large deviations of divergence measures on partitions. J. Stat. Plan. Inference 2001, 93, 1–16. [Google Scholar] [CrossRef]
- Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002. [Google Scholar]
- Abou-Jaoude, S. Conditions nécessaires et suffisantes de convergence L1 en probabilité de l’histogramme pour une densité. Ann. L’Institut Henri Poincaré 1976, 12, 213–231. [Google Scholar]
- Csiszár, I. Generalized entropy and quantization problems. In Proceedings of the Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Prague, Czech Republic, 19–25 September 1971; Academia: Prague, Czech Republic, 1973; pp. 159–174. [Google Scholar]
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).