Lossless Transformations and Excess Risk Bounds in Statistical Inference

We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss when estimating a random variable from an observed feature vector and the minimum expected loss when estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless, and we show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that uniformly hold over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a δ-lossless transformation and give sufficient conditions for a given transformation to be universally δ-lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottlenecks, and deep learning are also surveyed.


Introduction
We consider the standard setting of statistical inference, where Y is a real random variable, having range Y ⊂ R, that is to be estimated (predicted) from a random observation (feature) vector X taking values in R^d. Given a measurable predictor f : R^d → Y and a measurable loss function ℓ : Y × Y → R_+, the loss incurred is ℓ(Y, f(X)). The minimum expected risk in predicting Y from the random vector X is

L*_ℓ(Y|X) = inf_f E[ℓ(Y, f(X))],

where the infimum is over all measurable f. Suppose that the tasks of collecting data and making the prediction are separated in time or in space. For example, separation in time happens when the data are collected first and the statistical modeling and analysis are carried out much later. Separation in space can be due, for example, to collecting data at a remote location and making predictions centrally. Such situations are modeled by a transformation T : R^d → R^{d′}, so that the prediction regarding Y is made from the transformed observation T(X) instead of X. An important example of such a transformation is quantization, in which case T(X) is a discrete random variable. Clearly, one always has L*_ℓ(Y|T(X)) ≥ L*_ℓ(Y|X). The difference L*_ℓ(Y|T(X)) − L*_ℓ(Y|X) is sometimes referred to in the literature as excess risk. A part of this paper is concerned with transformations for which the excess risk is zero no matter what the underlying loss function ℓ is. Such transformations are universally lossless in the sense that they can be chosen before the cost function ℓ for the underlying problem is known. More formally, we make the following definition.
Definition 1 (lossless transformation). For a fixed joint distribution of Y and X, a (measurable) transformation T : R^d → R^{d′} is called universally lossless if for any loss function ℓ : Y × Y → R_+ we have L*_ℓ(Y|T(X)) = L*_ℓ(Y|X).

An important special transformation is feature selection. Formally, for the observation (feature) vector X = (X^(1), ..., X^(d)) and S ⊂ {1, ..., d}, consider the |S|-dimensional vector X_S = (X^(i), i ∈ S). Typically, the dimension |S| of X_S is significantly smaller than d, the dimension of X. If L*_ℓ(Y|X_S) = L*_ℓ(Y|X) for all loss functions ℓ, then the feature selector X → X_S is universally lossless. For fixed loss ℓ, the performance of any statistical inference method is sensitive to the dimension of the feature vector. Therefore, dimension reduction is crucial before choosing or constructing such a method. If X_S is universally lossless, then the complementary feature subvector X_{S^c} is irrelevant. It is an open research problem how to efficiently search for a universally lossless X_S with minimum size |S|. Since typically the distribution of the pair X and Y is not known and must be inferred from data, any such search algorithm needs a procedure for testing the universal losslessness property of a feature selector.
In the first part of this paper we give a necessary and sufficient condition for a given transformation T to be universally lossless and then construct a partitioning-based statistic for testing this condition if independent and identically distributed training data are available. With the null hypothesis being that a given transformation is universally lossless, the test is shown to be strongly consistent in the sense that almost surely (a.s.) it makes finitely many Type I and Type II errors.
In many situations, requiring that a transformation T be universally lossless may be too demanding. The next definition relaxes this requirement.
Definition 2 (δ-lossless transformation). For a fixed joint distribution of Y and X, and δ > 0, a transformation T : R^d → R^{d′} is called universally δ-lossless with respect to a class of loss functions L if

L*_ℓ(Y|T(X)) − L*_ℓ(Y|X) ≤ δ for all ℓ ∈ L.

In the second part of the paper we derive bounds on the excess minimum risk L*_ℓ(Y|T(X)) − L*_ℓ(Y|X) in terms of the mutual information difference I(Y; X) − I(Y; T(X)) under various assumptions on ℓ. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be δ-lossless with respect to fairly general classes of loss functions ℓ. Applications to classification, nonparametric regression, portfolio strategies, the information bottleneck method, and deep learning are also reviewed.

Relationship with prior work:
Our first result, Theorem 1, which shows that a transformation is universally lossless if and only if it is a sufficient statistic, is likely known, but we could not find it in this explicit form in the literature (a closely related result is the classical Rao-Blackwell theorem of mathematical statistics, e.g., Schervish [38, Theorem 3.22]). Due to this result, testing from independent data whether or not a given transformation is universally lossless turns into a test for conditional independence. Our test in Theorem 2 is based on the main results in Györfi and Walk [24], but our construction is more general and we also correct an error in the proof of [24, Theorem 1]. Apart from [24], most of the results in the literature for testing conditional independence are for real-valued random variables and/or assume certain special distribution types, typically the existence of a joint probability density function. Such assumptions exclude problems where Y is discrete and X is continuous, as is typical in classification, or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution free and its convergence properties are also (almost) distribution free. A more detailed review of related work is given in Section 2.1.
The main result in Section 3 is Theorem 3, which bounds the excess risk in terms of the square root of the mutual information difference I(Y; X) − I(Y; T(X)). There is a history of such bounds starting possibly with Xu and Raginsky [43], where the generalization error of a learning algorithm was upper bounded by a constant times the square root of the mutual information between the hypothesis and the training data (see also the references in [43] and [45]). This result has since been extended in various forms, mostly concentrating on providing information-theoretic bounds on the generalization capabilities of learning algorithms instead of looking at the excess risk; see, e.g., Raginsky et al. [37], Lugosi and Neu [30], Jose and Simeone [28], and the references therein, just to mention a few of these works. The most relevant recent work relating to our bounds in Section 3 seems to be Xu and Raginsky [45], where, among other things, information-theoretic bounds were developed on the excess risk in a Bayesian learning framework; see also Hafez-Kolahi et al. [26]. The bounds in [45] are not on the excess risk L*_ℓ(Y|T(X)) − L*_ℓ(Y|X); they involve training data, but their forms are similar to ours. It appears that our Theorem 3 gives a bound that holds uniformly for a larger class of loss functions ℓ and joint distributions of Y and X; however, in [45] several other bounds are presented that are tighter and/or allow more general distributions, for specific, fixed, loss functions.

Organization:
The paper is organized as follows. In Section 2 we characterize universally lossless transformations and introduce a novel strongly consistent test for the universal losslessness property. In Section 3 information-theoretic bounds on the excess minimum risk are developed and are used to characterize the δ-losslessness property of transformations. Section 4 surveys connections with and applications to specific prediction problems as well as the information bottleneck method in deep learning. The somewhat lengthy proof of the strong consistency of the test in Theorem 2 is given in Section 5. Concluding remarks are given in Section 6.

Testing the universal losslessness property
In this section, we first give a characterization of universally lossless transformations for a given distribution of the pair (X, Y). In practice, the distribution of (X, Y) may not be known, but a sequence of independent and identically distributed (i.i.d.) copies of (X, Y) may be available. For this case, we construct a procedure to test if a given transformation is universally lossless and prove that, under mild conditions, the test is strongly consistent.

Universally lossless transformations
Based on Definition 1, introduce the null hypothesis

H_0 : L*_ℓ(Y|T(X)) = L*_ℓ(Y|X) for all loss functions ℓ. (1)

A transformation (statistic) T(X) is called sufficient if the random variables Y, T(X), X form a Markov chain in this order, denoted by Y → T(X) → X (see, e.g., Definition 3.8 and Theorem 3.9 in Polyanskiy and Wu [36]).
For binary-valued Y, Theorems 32.5 and 32.6 in Devroye et al. [14] imply that the statistic T(X) is universally lossless if and only if it is sufficient. The following theorem extends this property to general Y. The result is likely known, but we could not find it in the given form.

Theorem 1. The transformation T is universally lossless if and only if T(X) is sufficient, i.e., Y → T(X) → X.
Proof. Assume first that Y → T(X) → X is a Markov chain. This is equivalent to having P(Y ∈ A | X, T(X)) = P(Y ∈ A | T(X)) almost surely (a.s.) for any measurable A ⊂ Y. Then L*_ℓ(Y|X) = L*_ℓ(Y|T(X)) for all ℓ, so T(X) is universally lossless. Now assume that the Markov chain condition Y → T(X) → X does not hold. Then there exist measurable A ⊂ Y with 0 < P(Y ∈ A) < 1 and B ⊂ R^d with P(X ∈ B) > 0 such that P(Y ∈ A | X, T(X)) ≠ P(Y ∈ A | T(X)) on the event {X ∈ B}. Let h(y) = I_{y∈A}, where I_E is the indicator function of the event E, and define the binary-valued Ŷ as Ŷ = h(Y). Then the Markov chain condition Ŷ → T(X) → X does not hold. For this special case, Theorems 32.5 and 32.6 in [14] show that there exists a loss function ℓ̄ : {0,1} × {0,1} → R_+ such that L*_ℓ̄(Ŷ|T(X)) > L*_ℓ̄(Ŷ|X). Finally, letting ℓ(y, y′) = ℓ̄(h(y), h(y′)), we have L*_ℓ(Y|T(X)) > L*_ℓ(Y|X), which shows that T(X) is not universally lossless.

A strongly consistent test
Theorem 1 implies an equivalent form of the losslessness null hypothesis defined by (1): H_0 holds if and only if Y → T(X) → X, or equivalently, if and only if X and Y are conditionally independent given T(X):

P(X ∈ B, Y ∈ A | T(X)) = P(X ∈ B | T(X)) P(Y ∈ A | T(X)) a.s.

for arbitrary Borel sets A, B. Furthermore, we consider the general case where the alternative hypothesis H_1 is the complement of H_0: H_1 = H_0^c. Now assume that the joint distribution of (X, Y, T(X)) is not known, but instead a sample of independent and identically distributed (i.i.d.) random vectors (X_1, Y_1, Z_1), ..., (X_n, Y_n, Z_n) is given whose common distribution is that of (X, Y, Z), where Z_i = T(X_i) and Z = T(X). The goal is to test the hypothesis H_0 of conditional independence based on this data.
For testing conditional independence, most of the results in the literature concern real-valued X, Y, Z. Based on kernel density estimation, Cai et al. [9] introduced a test statistic and calculated its limit distribution under the null hypothesis. In Neykov et al. [33] a gap is introduced between the null and alternative hypotheses. This gap is characterized by the total variation distance, which decreases with increasing n. Under some smoothness conditions, minimax bounds were derived. According to Shah and Peters [39], a regularity condition such as our Lipschitz condition (5) below cannot be omitted if a test for conditional independence is to be consistent. This is a consequence of their No-Free-Lunch Theorem, which states, under general conditions, that if the bound on the error probability under the null hypothesis is non-asymptotic, then under the alternative hypothesis the rate of convergence of the error probability can be arbitrarily slow, a well-known phenomenon in nonparametric statistics. We note that these cited results, and indeed most of the results in the literature on testing conditional independence, are for real-valued random variables and/or assume certain special distribution types, typically the existence of a joint probability density function or that both X and Y are discrete, as in [33]. As we remarked earlier, such assumptions exclude problems where Y is discrete and X is continuous (typical in classification), or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution free and its convergence properties are almost distribution free (we do assume a mild Lipschitz-type condition; see the upcoming Condition 1). In our hypothesis testing setup, the alternative hypothesis H_1 is the complement of the null hypothesis H_0; therefore there is no separation gap between the hypotheses. Dembo and Peres [13] and Nobel [34] characterized hypothesis pairs that admit strongly consistent tests, i.e., tests that, with probability one, make only finitely many Type I and Type II errors. This property is called discernibility. As an illustration of the intricate nature of the discernibility concept, Dembo and Peres [13] demonstrated an exotic example where the null hypothesis is that the mean of a random variable is rational, while the alternative hypothesis is that this mean minus √2 is rational. (See also Cover [10] and Kulkarni and Zeitouni [29].) The discernibility property shows up in Biau and Györfi [7] (testing homogeneity), Devroye and Lugosi [15] (classification of densities), Gretton and Györfi [20] (testing independence), Morvai and Weiss [32] and Nobel [34] (classification of stationary processes), among others.
In the rest of this section, under mild conditions on the distribution of (X, Y), we study discernibility in the context of lossless transformations for statistical inference with general risk. We will make strong use of the multivariate partitioning-based test of Györfi and Walk [24].
Let P_XYZ denote the joint distribution of (X, Y, Z), and similarly for any marginal distribution of (X, Y, Z); e.g., P_XZ denotes the distribution of the pair (X, Z). As in Györfi and Walk [24], introduce the empirical distributions of the data; e.g., for Borel sets A, B, C,

ν_n(A × B × C) = (1/n) Σ_{i=1}^n I_{{X_i ∈ A, Y_i ∈ B, Z_i ∈ C}},

with the empirical marginals defined analogously. For the sake of simplicity, assume that X, Y and Z = T(X) are bounded. Otherwise, we apply a componentwise, one-to-one scaling into the interval [0, 1]. Obviously, the losslessness null hypothesis H_0 is invariant under such scaling. Let P_n, Q_n and R_n be finite cubic partitions of the ranges of X, Y and Z, respectively, with all the cubes having common side length h_n (thus h_n is proportional to 1/m′_n, the reciprocal of the number of cells per coordinate). As in [24], we define the test statistic L_n as an L_1-type distance, over the cells of these partitions, between the empirical joint distribution of (X, Y, Z) and the empirical distribution corresponding to conditional independence. Our test rejects H_0 if L_n ≥ t_n and accepts it if L_n < t_n, where the threshold t_n is set as in (4) below, for a constant c_1 satisfying the condition given there. In this setup the distribution of (X, Y) is arbitrary; its components can be discrete, absolutely continuous, a mixture of the two, or even singularly continuous. It is important to note that for constructing this test, there is no need to know the type of the distribution.
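The construction above can be sketched in code. The statistic below is an illustrative L_1-type conditional-independence statistic over cubic cells of side h, in the spirit of L_n; the exact display defining L_n may differ, so treat the cell indexing and normalization here as assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def cell(v, h):
    """Index of the cubic cell of side h containing v (data scaled to [0, 1])."""
    m = int(np.ceil(1.0 / h))  # number of cells per coordinate
    return tuple(np.minimum((np.atleast_1d(v) / h).astype(int), m - 1))

def ci_statistic(x, y, z, h):
    """L1 distance between the empirical distribution of (X, Y, Z) and the
    empirical distribution built from conditional independence of X and Y
    given Z, summed over partition cells (an illustrative stand-in for L_n)."""
    n = len(y)
    joint = defaultdict(int)
    for xi, yi, zi in zip(x, y, z):
        joint[(cell(xi, h), cell(yi, h), cell(zi, h))] += 1
    # Empirical marginal cell counts for (X, Z), (Y, Z) and Z.
    xz, yz, zc = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b, c), k in joint.items():
        xz[(a, c)] += k
        yz[(b, c)] += k
        zc[c] += k
    stat = 0.0
    for c, nc in zc.items():
        for (a, c2), kx in xz.items():
            if c2 != c:
                continue
            for (b, c3), ky in yz.items():
                if c3 != c:
                    continue
                prod = (kx / n) * (ky / n) / (nc / n)
                stat += abs(joint.get((a, b, c), 0) / n - prod)
    return stat
```

When Y depends on X only through Z the statistic is (near) zero, while extra dependence of Y on X inflates it; the test would compare it against a vanishing threshold t_n.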
We assume that the joint distribution of X, Y, and Z = T(X) satisfies the following condition.
Condition 1. Let p(·|z) be the density of the conditional distribution P_{X|Z=z} = P(X ∈ · | Z = z) with respect to the distribution P_X as a dominating measure. Assume that for some C* > 0, p(x|z) satisfies the Lipschitz-type condition (5), with constant C*, for all n.
We note that the ordinary Lipschitz condition implies (5). The latter condition is equivalent to a Lipschitz condition in total variation, (6), where d_TV(P, Q) denotes the total variation distance between distributions P and Q. In Neykov, Balakrishnan and Wasserman [33], condition (6) is called the Null TV Lipschitz condition.
The next theorem is an adaptation and extension of the results in Györfi and Walk [24] to this particular problem of lossless transformation. In [24] it was assumed that the sequence of partitions {P_n, Q_n, R_n} is nested, while we make no such assumption. The proof, in which an error made in [24] is also corrected, is relegated to Section 5.
Theorem 2. Suppose that X, Y and Z = T(X) are bounded and Condition 1 holds for all n. If the sequence h_n satisfies conditions (7) and (8), then the following hold:

(a) Under the losslessness null hypothesis H_0, the error probability P(L_n ≥ t_n) satisfies the bound (9) for all n ≥ e^{C*}, and therefore, by (8) and (9) and the Borel-Cantelli lemma, after a random sample size the test makes no error with probability one.

(b) Under the alternative hypothesis H_1, the statistic L_n a.s. eventually exceeds the threshold t_n, and so, with probability one, after a random sample size the test makes no error.

Remarks: (i) The choice h_n = n^{−δ} with 0 < δ < 1/(d + 1 + d′) satisfies both conditions (7) and (8).
(ii) Note that by (4), the threshold t_n is proportional to the constant c_1.

An important special transformation is given by the feature selection X_S defined in the Introduction. Theorem 2 demonstrates the possibility of universally lossless dimension reduction for any multivariate feature vector. Note that in the setup of feature selection, the partition P_n can be the nested version of R_n, and so the calculation of the test statistic L_n is easier.

Universally δ-lossless transformations
Here we develop bounds on the excess minimum risk in terms of mutual information under various assumptions on the loss function. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be universally δ-lossless with respect to fairly general classes of loss functions ℓ.

Preliminaries on mutual information
Let P_XY denote the joint distribution of the pair (X, Y) and let P_X P_Y denote the product of the marginal distributions of X and Y. The mutual information between X and Y, denoted by I(X; Y), is defined as

I(X; Y) = D(P_XY ‖ P_X P_Y),

where

D(P ‖ Q) = ∫ log (dP/dQ) dP if P ≪ Q, and D(P ‖ Q) = ∞ otherwise,

is the Kullback-Leibler (KL) divergence between probability distributions P and Q (here P ≪ Q means that P is absolutely continuous with respect to Q with Radon-Nikodym derivative dP/dQ). Thus I(X; Y) is always nonnegative, and I(X; Y) = 0 if and only if X and Y are independent (note that I(X; Y) = ∞ is possible). In this definition and throughout the paper, log denotes the natural logarithm.
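The definitions above can be made concrete numerically in the discrete case. The minimal sketch below computes D(P ‖ Q) and I(X; Y) in nats (matching the natural logarithm used in the text) for pmfs stored as plain dictionaries, an illustrative representation rather than the paper's notation:

```python
import math
from collections import Counter

def kl(p, q):
    """D(P || Q) in nats for discrete pmfs given as dicts; requires P << Q."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def mutual_information(pxy):
    """I(X; Y) = D(P_XY || P_X P_Y) for a joint pmf over pairs (x, y)."""
    px, py = Counter(), Counter()
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    prod = {(x, y): px[x] * py[y] for (x, y) in pxy}
    return kl(pxy, prod)
```

For an independent pair the function returns 0, and for X = Y uniform on {0, 1} it returns log 2, in line with the stated properties of I(X; Y).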
For random variables U and V (both taking values in finite-dimensional Euclidean spaces), let P_{U|V} denote the conditional distribution of U given V. Furthermore, let P_{U|V=v} denote the stochastic kernel (regular conditional probability) induced by P_{U|V}. Thus, in particular,

I(U; V) = ∫ D(P_{U|V=v} ‖ P_U) P_V(dv) = ∫ D(P_{V|U=u} ‖ P_V) P_U(du). (11)

Given another random variable Z, the conditional mutual information I(X; Y|Z) is defined as the KL divergence between the conditional joint distribution of (X, Y) given Z and the product of the conditional marginals, averaged over the distribution of Z. One can check that, in a similar way, I(X; Y|Z) can be expressed as

I(X; Y|Z) = ∫ D(P_{Y|X=x, Z=z} ‖ P_{Y|Z=z}) P_{XZ}(dx, dz). (12)

Properties of mutual information and conditional mutual information, their connections to the KL divergence, and identities involving these information measures are detailed in, e.g., Cover and Thomas.

Mutual information bounds and δ-lossless transformations
A real random variable U with finite expectation is said to be σ²-subgaussian for some σ² > 0 if

E[e^{λ(U − E[U])}] ≤ e^{λ²σ²/2} for all λ ∈ R.

Furthermore, we say that U is conditionally σ²-subgaussian given another random variable V if

E[e^{λ(U − E[U|V])} | V] ≤ e^{λ²σ²/2} a.s. for all λ ∈ R.
The following result gives a quantitative upper bound on the excess minimum risk in terms of the mutual information difference I(Y; X) − I(Y; T(X)) under certain, not too restrictive, conditions. Note that since Y → X → T(X) is always a Markov chain, the chain rule gives I(Y; X) − I(Y; T(X)) = I(Y; X|T(X)) ≥ 0.

Theorem 3. Let T : R^d → R^{d′} be a measurable transformation and assume that for any ǫ > 0 there exists an ǫ-optimal estimator f′ of Y from X such that ℓ(y, f′(X)) is conditionally σ²(y)-subgaussian given T(X) for every y ∈ Y, i.e.,

E[e^{λ(ℓ(y, f′(X)) − E[ℓ(y, f′(X)) | T(X)])} | T(X)] ≤ e^{λ²σ²(y)/2} a.s. (14)

for all λ ∈ R and y ∈ Y, where E[σ²(Y)] < ∞. Then one has

L*_ℓ(Y|T(X)) − L*_ℓ(Y|X) ≤ √(2 E[σ²(Y)]) · √(I(Y; X) − I(Y; T(X))). (15)

Remarks: (i) In case I(Y; X|T(X)) = ∞, we interpret the right-hand side of (15) as ∞. With this interpretation, the bound always holds.
(ii) We show in Section 4.2 that the subgaussian condition (14) holds for the regression problem with squared error ℓ(y, y′) = (y − y′)² if Y = m(X) + N, where N is independent noise having zero mean and finite fourth moment E[N⁴] < ∞, and the regression function m(x) = E[Y|X = x] is bounded. In particular, the bound in the theorem holds if N is normal with zero mean and m is bounded.
(iii) Although hidden in the notation, E[σ²(Y)] depends on the loss function ℓ. Thus the upper bound (15) is the product of two terms, the second of which, involving only I(Y; X) − I(Y; T(X)), is independent of the loss function.
(iv) The bound in the theorem is not tight in general. In Section 4.3 an example is given in the context of portfolio selection, where the excess risk can be upper bounded by the difference I(Y; X) − I(Y; T(X)).
(v) The proof of Theorem 3 and those of its corollaries go through virtually without change if we replace T(X) with any R^{d′}-valued random variable Z such that Y → X → Z. Under the conditions of the theorem, we then have

L*_ℓ(Y|Z) − L*_ℓ(Y|X) ≤ √(2 E[σ²(Y)]) · √(I(Y; X) − I(Y; Z)).

In fact, Theorem 3 and its corollaries hold for general random variables Y, X, and Z taking values in complete and separable metric (Polish) spaces Y, X, and Z, respectively, if Y → X → Z.
The proof of Theorem 3 is based on a slight generalization of Raginsky et al. [37, Lemma 10.2], which we state next. In the lemma, U and V are arbitrary abstract random variables defined on the same probability space and taking values in spaces U and V, respectively, Ū and V̄ are independent copies of U and V (so that P_{ŪV̄} = P_U P_V), and h : U × V → R is a measurable function.

Lemma 1. Suppose that for every u ∈ U, h(u, V̄) is σ²(u)-subgaussian, where E[σ²(U)] < ∞. Then

|E[h(U, V)] − E[h(Ū, V̄)]| ≤ √(2 E[σ²(U)]) · √(I(U; V)).
Proof. We essentially copy the proof of [37, Lemma 10.2], where it was assumed that σ²(u) does not depend on u. With this restriction, the subgaussian condition (14) in Theorem 3 would have to hold with σ²(y) ≤ σ² uniformly over y. This condition would exclude regression models with independent subgaussian noise and, a fortiori, models with independent noise that does not possess finite absolute moments of all orders, while our Theorem 3 can also be applied in such cases (see Section 4.2). We make use of the Donsker-Varadhan variational representation of the relative entropy [8, Corollary 4.15], which states that

D(P ‖ Q) = sup_F ( ∫ F dP − log ∫ e^F dQ ),

where the supremum is over all measurable F : V → R such that ∫ e^F dQ < ∞. Applying this with P = P_{V|U=u}, Q = P_V, and F = λh(u, V), we obtain (16), where the second inequality follows from the assumption that h(u, V̄) is σ²(u)-subgaussian. Maximizing the right-hand side of (16) over λ ∈ R gives, after rearrangement, (17). Since Ū and V̄ are independent, we obtain the chain of bounds (18)-(21), where (18) follows from Jensen's inequality, (19) follows from (17), in (20) we used the Cauchy-Schwarz inequality, and the last equality follows from (11).
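The Donsker-Varadhan representation used in the proof can be checked numerically in the discrete case. In the sketch below (an illustration, not part of the proof), the choice F = log(dP/dQ) attains D(P ‖ Q), while any other F gives a smaller value of the variational objective; the two-point distributions are hypothetical examples:

```python
import math

def kl(p, q):
    """D(P || Q) in nats for discrete pmfs given as dicts."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def dv_objective(p, q, f):
    """Donsker-Varadhan objective E_P[F] - log E_Q[exp(F)]."""
    ep = sum(p[x] * f[x] for x in p)
    lq = math.log(sum(q[x] * math.exp(f[x]) for x in q))
    return ep - lq

p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
f_opt = {x: math.log(p[x] / q[x]) for x in p}   # the maximizing F = log(dP/dQ)
f_sub = {0: 1.0, 1: -1.0}                       # an arbitrary suboptimal F
```

Here dv_objective(p, q, f_opt) equals kl(p, q), and dv_objective(p, q, f_sub) is strictly smaller, matching the supremum formula.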
Proof of Theorem 3. Let Ȳ and X̄ be random variables such that P_{Ȳ|T(X̄)} = P_{Y|T(X)}, P_{X̄|T(X̄)} = P_{X|T(X)}, P_{T(X̄)} = P_{T(X)}, and Ȳ and X̄ are conditionally independent given T(X̄). Thus the joint distribution of the triple (Ȳ, X̄, T(X̄)) is P_{Ȳ X̄ T(X̄)} = P_{Y|T(X)} P_{X|T(X)} P_{T(X)}. We apply Lemma 1 with U = Y, V = X, and h(y, x) = ℓ(y, f′(x)). Note that by the conditions of the theorem we can choose an ǫ-optimal f′ such that for every y, ℓ(y, f′(X)) is conditionally σ²(y)-subgaussian given T(X). We view the conditional expectations given T(X) = z and T(X̄) = z as regular (unconditional) expectations taken with respect to P_{YX|T(X)=z} and P_{Ȳ X̄|T(X̄)=z}, respectively, and consider I(Y; X|T(X) = z) as the ordinary mutual information between random variables with joint distribution P_{YX|T(X)=z}. Since Ȳ and X̄ are conditionally independent given T(X̄) = z, Lemma 1 yields the corresponding conditional bound for each z. Recalling that T(X̄) and T(X) have the same distribution and applying Jensen's inequality and the Cauchy-Schwarz inequality as in (18) and (20), we obtain (22). On the one hand, we have

L*_ℓ(Ȳ|X̄) = L*_ℓ(Ȳ|T(X̄)) = L*_ℓ(Y|T(X)),

where the first equality follows from Theorem 1 by the conditional independence of Ȳ and X̄ given T(X̄), and the second follows since (Ȳ, T(X̄)) and (Y, T(X)) have the same distribution by construction. On the other hand, (22) and (23) imply (15).

We state two corollaries for special cases. In the first, we assume that ℓ is uniformly bounded, i.e., ℓ_∞ = sup_{y,y′∈Y} ℓ(y, y′) < ∞. For any c > 0, let L(c) denote the collection of all loss functions ℓ with ℓ_∞ ≤ c. Recall the notion of a universally δ-lossless transformation from Definition 2.
Corollary 1. Suppose the loss function ℓ is bounded. Then for any measurable T : R^d → R^{d′}, we have

L*_ℓ(Y|T(X)) − L*_ℓ(Y|X) ≤ ℓ_∞ √((I(Y; X) − I(Y; T(X)))/2). (24)

Therefore, the transformation T is universally δ-lossless for the family L(c) whenever I(Y; X) − I(Y; T(X)) ≤ 2δ²/c².

Remarks:
(i) The bound of the theorem can be used to give an estimation-theoretic motivation of the information bottleneck (IB) problem; see Section 4.4.
(ii) For bounded ℓ, the corresponding inequality bounding L*_ℓ(Y) − L*_ℓ(Y|X) in terms of I(Y; X) was proved in Makhdoumi et al. [31, Theorem 1] for discrete alphabets to motivate the so-called privacy funnel problem. This inequality follows from (15) by setting Z = T(X) to be constant there.
(iii) A simple self-contained proof of (24) (see below) was provided by Or Ordentlich and communicated to the second author by Shlomo Shamai [35], in response to an early version of this manuscript.

Proof. Since ℓ is bounded, for every y the random variable ℓ(y, f′(X)) takes values in an interval of length at most ℓ_∞, so by Hoeffding's lemma it is conditionally ℓ_∞²/4-subgaussian given T(X). Since an ǫ-optimal estimator f′ exists for any ǫ > 0 and ℓ(y, f′(X)) is conditionally σ²-subgaussian given T(X) by the preceding argument, with σ² = ℓ_∞²/4, (24) follows from Theorem 3. The second statement follows directly from (24) and the fact that ℓ_∞ ≤ c for all ℓ ∈ L(c).
The following alternative argument by Or Ordentlich [35] is based on Pinsker's inequality, which bounds the total variation distance in terms of the KL divergence (see, e.g., [36, Theorem 7.9]). For bounded ℓ, it gives a direct proof of an analogue of the key inequality (22) in the proof of Theorem 3. This argument avoids Lemma 1 and the machinery introduced by the subgaussian assumption.
Using the same notation as in the proof of Theorem 3 and letting P = P_{YXZ} and Q = P_{Ȳ X̄ T(X̄)}, Pinsker's inequality bounds the difference of the corresponding expected losses by ℓ_∞ √(2 D(P ‖ Q)), yielding the analogue of (22). The rest of the proof proceeds exactly as in Theorem 3.
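Pinsker's inequality invoked above is easy to verify numerically in the discrete case; the sketch below uses hypothetical two-point distributions and checks d_TV(P, Q) ≤ √(D(P ‖ Q)/2):

```python
import math

def kl(p, q):
    """D(P || Q) in nats for discrete pmfs given as dicts."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def total_variation(p, q):
    """d_TV(P, Q) = (1/2) * sum_x |p(x) - q(x)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
# Pinsker's inequality: d_TV(P, Q) <= sqrt(D(P || Q) / 2).
```

For these distributions d_TV = 0.2 while √(D/2) ≈ 0.203, so the inequality is nearly tight here; for a loss bounded by ℓ_∞, this also controls |E_P[ℓ] − E_Q[ℓ]| ≤ 2 ℓ_∞ d_TV(P, Q).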
In the second corollary we do not require that ℓ be bounded, but assume that an optimal estimator f * ℓ from X to Y exists such that ℓ(y, f * ℓ (X)) is conditionally σ 2 (y)-subgaussian given T (X), where E[σ 2 (Y )] < ∞.

Corollary 2. Assume that an optimal estimator f*_ℓ of Y from X exists, i.e., a measurable function f*_ℓ achieving L*_ℓ(Y|X) = E[ℓ(Y, f*_ℓ(X))]. Suppose furthermore that the subgaussian condition of Theorem 3 holds with f′ = f*_ℓ (i.e., (14) holds with f*_ℓ in place of f′). Then the bound (15) holds.

Proof. The corollary immediately follows from Theorem 3, since an optimal f*_ℓ is ǫ-optimal for all ǫ > 0.
For the next corollary, let L(c) denote the collection of all loss functions ℓ such that an optimal estimator f*_ℓ exists and, for some function g_ℓ : Y → R_+ with E[g_ℓ(Y)²] ≤ c², we have ℓ(y, f*_ℓ(X)) ≤ g_ℓ(y) a.s. for every y ∈ Y.

Corollary 3. If I(Y; X) − I(Y; T(X)) ≤ 2δ²/c², then T is universally δ-lossless for the family L(c).
Proof. Since ℓ(y, f*_ℓ(X)) is a.s. upper bounded by g_ℓ(y) for any ℓ ∈ L(c), Hoeffding's lemma [8, Lemma 2.2] implies that ℓ(y, f*_ℓ(X)) is conditionally g_ℓ(y)²/4-subgaussian given T(X). Thus from Corollary 2, for all ℓ ∈ L(c), we have

L*_ℓ(Y|T(X)) − L*_ℓ(Y|X) ≤ √(E[g_ℓ(Y)²]/2) · √(I(Y; X) − I(Y; T(X))) ≤ c √((I(Y; X) − I(Y; T(X)))/2) ≤ δ.

The next corollary generalizes, and gives a much simplified proof of, a result of Faragó and Györfi [17]; see also Devroye, Györfi and Lugosi [14, Theorem 32.3]. This result states, for binary classification (Y is 0-1 valued and ℓ(y, y′) = I_{y≠y′}), that if a sequence of transformations T_n converges to the identity in a suitable sense, then the corresponding minimum risks converge to that of the untransformed observation.

Corollary 4. Assume that a sequence of transformations T_n : R^d → R^d is such that T_n(X) → X in distribution (i.e., P_{T_n(X)} → P_X weakly) as n → ∞. Then for any bounded loss function ℓ,

lim_{n→∞} L*_ℓ(Y|T_n(X)) = L*_ℓ(Y|X).

Note that this corollary and its proof still hold without any change if X takes values in an arbitrary complete separable metric space. For example, in the setup of function classification, X may take values in an L_p function space for 1 ≤ p < ∞, and T_n is a truncated series expansion or a quantizer. Interestingly, here the asymptotic losslessness property is guaranteed even in the case where the sequence of transformations T_n and the loss function ℓ are not matched at all.
Proof. If T_n(X) → X in distribution, then clearly (Y, T_n(X)) → (Y, X) in distribution. Thus the lower semicontinuity of mutual information with respect to convergence in distribution (see, e.g., Polyanskiy and Wu [36, Eq. 4.28]) implies lim inf_{n→∞} I(Y; T_n(X)) ≥ I(Y; X), so that I(Y; X) − I(Y; T_n(X)) → 0. Combined with Corollary 1 (with T replaced with T_n), this gives the claim.

Classification. For classification, Y is the finite set {1, ..., M} and the cost is the 0-1 loss ℓ(y, y′) = I_{y≠y′}. In this setup the risk of an estimator f is the error probability P(Y ≠ f(X)). With the notation P_j(x) = P(Y = j | X = x), the optimal estimator is the Bayes decision f*(x) = arg max_j P_j(x), and the minimum risk is the Bayes error probability L*(X) = P(Y ≠ f*(X)). If L*(T(X)) stands for the Bayes error probability for the transformed observation vector T(X), then (24) with ℓ_∞ = 1 yields the upper bound

L*(T(X)) − L*(X) ≤ √((I(Y; X) − I(Y; T(X)))/2); (27)

see also [45, Corollary 2] for a similar bound in the context of Bayesian learning. As a special case, the feature selector X → X_S is lossless if

L*(X_S) = L*(X). (28)

Györfi and Walk [25] studied the corresponding hypothesis testing problem. Using a k-nearest-neighbor (k-NN) estimate of the excess Bayes error probability L*(X_S) − L*(X), they introduced a test statistic and accepted the hypothesis (28) if the test statistic is less than a threshold. Under some mild conditions, the strong consistency of this test was proved.
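The classification bound (27) can be sanity-checked on a small discrete example, where the Bayes errors and mutual informations are exact finite sums. The joint pmf and the merging map T below are hypothetical, and the square-root form of the bound is taken as in (27):

```python
import math
from collections import defaultdict

def bayes_error(pxy):
    """Bayes error probability: sum over x of [P(X = x) - max_y P(X = x, Y = y)]."""
    tot, best = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        tot[x] += p
        best[x] = max(best[x], p)
    return sum(tot[x] - best[x] for x in tot)

def mutual_info(pxy):
    """I(X; Y) in nats for a discrete joint pmf {(x, y): p}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

def transform(pxy, t):
    """Joint pmf of (T(X), Y) for a deterministic map t."""
    out = defaultdict(float)
    for (x, y), p in pxy.items():
        out[(t(x), y)] += p
    return dict(out)

# Hypothetical example: X in {0, 1, 2}, Y in {0, 1}; T merges x = 0 and x = 1.
pxy = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3, (2, 0): 0.05, (2, 1): 0.15}
pty = transform(pxy, lambda x: 0 if x < 2 else 1)
excess = bayes_error(pty) - bayes_error(pxy)
bound = math.sqrt((mutual_info(pxy) - mutual_info(pty)) / 2)
```

Here the excess Bayes error is 0.2 while the information-theoretic bound evaluates to about 0.229, so (27) holds with some slack.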

Nonparametric regression.
For nonparametric regression the cost is the squared loss ℓ(y, y′) = (y − y′)², y, y′ ∈ R, and the best statistical inference is the regression function m(x) = E[Y | X = x] (here we assume E[Y²] < ∞). Then the minimum risk is the residual variance L*(X) = E[(Y − m(X))²]. If L*(X) and L*(T(X)) denote the residual variances for the observation vectors X and T(X), respectively, then

L*(T(X)) − L*(X) = E[(m(X) − E[Y | T(X)])²].

Note that the excess residual variance L*(T(X)) − L*(X) does not depend on the distribution of the residual Y − m(X).
Next we show that the conditions of Corollary 2 hold for the model

Y = m(X) + N,

where N is a zero-mean noise variable that is independent of X and satisfies E[N⁴] < ∞, and m is bounded as |m(x)| ≤ K for all x. For this model we have f*(x) = m(x) and

ℓ(y, f*(X)) = (y − m(X))² ≤ (|y| + K)².

Thus ℓ(y, f*(X)) is a nonnegative random variable a.s. bounded by (|y| + K)², which implies via Hoeffding's lemma (e.g., [8, Lemma 2.2]) that it is conditionally σ²(y)-subgaussian given T(X) with σ²(y) = (|y| + K)⁴/4. We have E[σ²(Y)] = E[(|Y| + K)⁴]/4 < ∞ since E[N⁴] < ∞ and m is bounded, so the conditions of Corollary 2 hold and we obtain

L*(T(X)) − L*(X) ≤ √(E[(|Y| + K)⁴]/2) · √(I(Y; X) − I(Y; T(X))).

Again, the feature selection X_S is called lossless when L*(X) = L*(X_S) holds. As a test statistic, Devroye et al. [16] introduced a 1-NN estimate of L*(X_S) − L*(X) and proved the strong consistency of the corresponding test.
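The excess residual variance identity L*(T(X)) − L*(X) = E[(m(X) − E[Y | T(X)])²] can be verified exactly on a small discrete model; the model below, with m(x) = x and T(x) = ⌊x/2⌋, is a hypothetical example:

```python
from itertools import product

# X uniform on {0, 1, 2, 3}, m(x) = x, noise N uniform on {-1, +1}, Y = m(X) + N.
points = [(x, n, 0.125) for x, n in product([0, 1, 2, 3], [-1, 1])]

def cond_mean_y(group):
    """E[Y | group(X)] as a dict keyed by the group label."""
    num, den = {}, {}
    for x, n, p in points:
        g = group(x)
        num[g] = num.get(g, 0.0) + p * (x + n)
        den[g] = den.get(g, 0.0) + p
    return {g: num[g] / den[g] for g in num}

def residual_variance(group):
    """E[(Y - E[Y | group(X)])^2] for the observation group(X)."""
    m = cond_mean_y(group)
    return sum(p * (x + n - m[group(x)]) ** 2 for x, n, p in points)

l_x = residual_variance(lambda x: x)        # full observation: Var(N) = 1
l_t = residual_variance(lambda x: x // 2)   # T merges {0, 1} and {2, 3}
```

Here l_x = 1 and l_t = 1.25, and the excess 0.25 equals E[(m(X) − E[Y | T(X)])²], since E[Y | T(X)] takes the values 0.5 and 2.5, each at distance 0.5 from m(X).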
Portfolio selection. If the market process {R_i} is memoryless, i.e., it is a sequence of i.i.d. random return vectors, then the strong law of large numbers implies that the best constantly rebalanced portfolio (BCRP) is the log-optimal portfolio

b* = arg max_{b∈Δ_{d_a}} E[log⟨b, R_1⟩],

while the best asymptotic average growth rate is

W* = E[log⟨b*, R_1⟩].

Barron and Cover [5] extended this setup to portfolio selection with side information. Assume that X_1, X_2, ... are R^d-valued side information vectors such that (R_1, X_1), (R_2, X_2), ... are i.i.d., and in each round n the portfolio vector may depend on X_n. By the strong law of large numbers, the log-optimal portfolio has the form

b*(X_n) = arg max_{b∈Δ_{d_a}} E[log⟨b, R_n⟩ | X_n],

and the best asymptotic average growth rate is

W*(X) = E[log⟨b*(X_1), R_1⟩].

Barron and Cover [5, Thm. 2] proved that

W*(X) − W* ≤ I(R_1; X_1). (29)

The next theorem generalizes this result by upper bounding the loss of the best asymptotic growth rate when, instead of X, only the degraded side information T(X) is available.
Theorem 4. For any measurable T : R^d → R^{d′},

W*(X) − W*(T(X)) ≤ I(R_1; X_1) − I(R_1; T(X_1)),

assuming the terms on the right-hand side are finite.

Remarks: (i) As in Theorem 3, the difference I(R_1; X_1) − I(R_1; T(X_1)) in the upper bound is equal to I(R_1; X_1|T(X_1)), a quantity that is always nonnegative but may be equal to ∞. In this case, we interpret the right-hand side as ∞.
(ii) There is a correspondence between this setup of portfolio selection and the setup of the previous sections. In particular, Y of the previous sections is equal to R with range R_+^{d_a}, and the inference is b(X) taking values in Δ_{d_a}. Then the loss is −log⟨b(X), R⟩. If we assume that for all j = 1, ..., d_a,

|log R^(j)| ≤ c_max a.s., (30)

then |log⟨b, R⟩| ≤ c_max for every b ∈ Δ_{d_a}, so the loss has range at most 2c_max, and Corollary 1 implies

W*(X) − W*(T(X)) ≤ c_max √(2 (I(R_1; X_1) − I(R_1; T(X_1)))).

Note that from the point of view of applications, (30) is a mild condition. For example, for NYSE daily data c_max ≤ 0.3; see Györfi et al. [23].
Proof. Let (R, X) be a generic copy of (R_i, X_i). Writing out explicitly the dependence of W* on P_R and using (11), the bound W*(X) − W* ≤ I(R_1; X_1) in (29) can be written as an integrated KL divergence. Furthermore, letting Z = T(X), since R → X → Z is a Markov chain, we have P_{R|X=x} = P_{R|X=x,Z=z}, and we obtain the claimed bound by applying (31), where (32) follows from the alternative expression (12) of the conditional mutual information.
As in the proof of Theorem 3, the conditional independence of R and Z = T(X) given X implies I(R; X|Z) = I(R; X) − I(R; T(X)), which completes the proof.
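The Barron-Cover bound (29) behind Theorem 4 can be checked numerically in a small discrete example. The market and side-information distributions below are hypothetical (a risky asset whose up-probability depends on a binary side-information variable X, plus cash):

```python
import numpy as np

# Two assets: risky (gross return 1.5 or 0.6) and cash (gross return 1).
# Side information X in {0, 1}: P(R = 1.5 | X=0) = 0.8, P(R = 1.5 | X=1) = 0.2.
p_x = np.array([0.5, 0.5])
p_up_given_x = np.array([0.8, 0.2])

ts = np.linspace(0.0, 1.0, 1001)

def growth(p_up, t):
    # E[log(t * R + (1 - t))] when R = 1.5 w.p. p_up and 0.6 otherwise
    return p_up * np.log(1 + 0.5 * t) + (1 - p_up) * np.log(1 - 0.4 * t)

# W*: no side information (marginal up-probability)
p_up = float(p_x @ p_up_given_x)
W_no_side = max(growth(p_up, t) for t in ts)

# W*(X): the log-optimal portfolio may depend on X
W_side = sum(p_x[x] * max(growth(p_up_given_x[x], t) for t in ts)
             for x in (0, 1))

# I(R_1; X_1) in nats: H(R) - H(R | X), with R binary here
h = lambda p: -p * np.log(p) - (1 - p) * np.log(1 - p)
I_RX = h(p_up) - float(p_x @ h(p_up_given_x))
```

In this example W*(X) − W* ≈ 0.105 nats while I(R_1; X_1) ≈ 0.193 nats, so the financial value of the side information is indeed dominated by the mutual information, as (29) asserts.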

Information bottleneck.
Let X and Y be random variables as in Section 2. When Y → X → Z, the joint distribution P_{YXZ} of the triple (Y, X, Z) is determined (for fixed P_{YX}) by the conditional distribution (transition kernel) P_{Z|X} as P_{YXZ} = P_{YX} P_{Z|X}. The information bottleneck (IB) framework can be formulated as the study of the constrained optimization problem: maximize I(Y; Z) subject to I(X; Z) ≤ α (33) for a given α > 0, where the maximization is over all transition kernels P_{Z|X}. Originally proposed by Tishby et al. [41], the solution of the IB problem is a transition kernel P_{Z|X}, interpreted as a stochastic transformation, that "encodes" X into a "compressed" representation Z that preserves relevant information about Y. The intuition behind this framework is that by maximizing I(Y; Z), the representation Z will retain the predictive power of X with respect to Y, while the requirement I(X; Z) ≤ α makes the representation Z concise.
Note that in case X is discrete and has finite entropy H(X), setting α = H(X) (or formally setting α = ∞ in the general case), the constraint I(X; Z) ≤ α becomes vacuous and (assuming the alphabet of Z is sufficiently large) the resulting Z will achieve the upper bound I(Y; Z) = I(Y; X), so that I(Y; X|Z) = I(Y; X) − I(Y; Z) = 0, i.e., Y → Z → X. Thus the solution to (33) can be considered a stochastically relaxed version of a minimal sufficient statistic for X in predicting Y (see Goldfeld and Polyanskiy [19, Section II.C] for more on this interpretation). Recent tutorials on the IB problem include Asoodeh and Calmon [4] and Zaidi et al. [46].
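For intuition, the constrained problem (33) can be solved by brute force in a tiny discrete example. The sketch below restricts attention to deterministic encoders Z = g(X), which is in general suboptimal (the IB optimum may need a stochastic kernel); the joint pmf and the budget α are hypothetical:

```python
import itertools
import numpy as np

def mi(pab):
    """Mutual information (nats) of a joint pmf given as a 2-D array."""
    pa = pab.sum(1, keepdims=True)
    pb = pab.sum(0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa @ pb)[m])).sum())

# Joint pmf p(x, y): X takes 4 values, Y takes 2, and Y mostly follows
# the "high bit" of X (the numbers are hypothetical, chosen for illustration)
pxy = np.array([[0.20, 0.05],
                [0.18, 0.07],
                [0.05, 0.20],
                [0.02, 0.23]])

alpha = 0.7          # budget of about one bit (in nats) of information about X
best_val, best_g = -1.0, None
for g in itertools.product(range(4), repeat=4):   # deterministic encoders g
    pzy = np.zeros((4, 2))
    pzx = np.zeros((4, 4))
    for x in range(4):
        pzy[g[x]] += pxy[x]
        pzx[g[x], x] = pxy[x].sum()
    if mi(pzx) <= alpha and mi(pzy) > best_val:   # feasible and better
        best_val, best_g = mi(pzy), g
```

The winning encoder merges {x = 0, 1} and {x = 2, 3} into two cells, i.e., it keeps exactly the "high bit" of X: with I(X; Z) ≈ ln 2 it retains I(Y; Z) ≈ 0.21 of the total I(Y; X) ≈ 0.22 nats.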
Theorem 3 and its corollaries can be used to motivate the IB principle from an estimation-theoretic viewpoint. Let I(α) = sup{I(Y; Z) : Y → X → Z, I(X; Z) ≤ α} be the value function for (33), and let Z_α be a resulting optimal Z (assuming such a maximizer exists).
From the remark after Theorem 3 we know that the bounds given in the theorem and in its corollaries remain valid if we replace T (X) with a random variable Z such that Y → X → Z.
Then, for example, Corollary 1 implies an upper bound on L*_ℓ(Y|Z_α) − L*_ℓ(Y|X) for all ℓ such that ℓ_∞ ≤ √(2c). Thus the IB paradigm minimizes, under the complexity constraint I(X; Z) ≤ α, an upper bound on the difference L*_ℓ(Y|Z) − L*_ℓ(Y|X) that holds universally for all loss functions ℓ with ℓ_∞ ≤ √(2c). The resulting Z_α will then have guaranteed performance in predicting Y with respect to all sufficiently bounded loss functions. This gives a novel operational interpretation of the IB framework that seems to have been overlooked in the literature.

Deep learning
The IB paradigm can also serve as a learning objective in deep neural networks (DNNs). Here the Lagrangian relaxation of (33) is considered. In particular, letting X denote the input and Z_θ the output of the last hidden layer of the DNN, where θ ∈ Θ ⊂ R^K is the collection of network parameters (weights), the objective is to maximize I(Y; Z_θ) − β I(X; Z_θ) (34) over θ ∈ Θ for a given β > 0. The parameter β controls the trade-off between how informative Z_θ is about Y, measured by I(Y; Z_θ), and how much Z_θ is "compressed," measured by I(X; Z_θ).
Clearly, larger values of β correspond to smaller values of I(X; Z θ ) and thus to more compression.
Here Z_θ is either a deterministic function of X of the form Z_θ = T_θ(X), where T_θ : R^d → R^{d'} represents the deterministic DNN, or it is produced by a stochastic kernel P^θ_{Z|X} parametrized by the network parameters θ ∈ Θ. The latter is achieved by injecting independent noise into the network's intermediate layers.
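As a toy illustration of the trade-off in (34), one can consider a one-parameter family of stochastic "encoders" obtained by injecting binary noise: X ~ Bernoulli(1/2), Y is X flipped with probability ε, and Z_θ is X flipped with probability θ. All names and numbers below are ours; for this family both mutual informations are available in closed form:

```python
import numpy as np

h = lambda p: -p * np.log(p) - (1 - p) * np.log(1 - p)   # binary entropy, nats

def ib_objective(theta, beta, eps=0.1):
    """I(Y; Z_theta) - beta * I(X; Z_theta) for X ~ Bern(1/2),
    Y = X flipped w.p. eps, Z_theta = X flipped w.p. theta."""
    q = eps * (1 - theta) + (1 - eps) * theta   # effective Y-Z flip probability
    I_yz = np.log(2) - h(q)
    I_xz = np.log(2) - h(theta)
    return I_yz - beta * I_xz

thetas = np.linspace(0.01, 0.5, 500)
best_small = thetas[np.argmax([ib_objective(t, beta=0.1) for t in thetas])]
best_large = thetas[np.argmax([ib_objective(t, beta=5.0) for t in thetas])]
# a larger beta favors a noisier encoder (theta closer to 1/2),
# i.e., more compression of X, at the price of information about Y
```

Scanning θ shows the expected behavior: for small β the optimizer keeps θ near 0 (Z_θ ≈ X, no compression), while for large β it pushes θ toward 1/2, destroying information about X (and hence about Y).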
In addition to the motivation explained in the previous section, the IB framework for DNNs can be thought of as a regularization method that results in improved generalization capabilities for a network trained on data via stochastic-gradient-based methods; see, e.g., Tishby and Zaslavsky [42], Shwartz-Ziv and Tishby [40], and Alemi et al. [2], as well as many other references in the excellent survey article of Goldfeld and Polyanskiy [19] and the special issue [18] on information bottleneck and deep learning.
As in the previous section, our Theorem 1 and its corollaries can serve as a (partial) justification for setting (34) as a learning objective. Assume that after training with a given β > 0, the obtained Z_θ(β) has (true) mutual information I(Y; Z_θ(β)) with Y (typically, this will not be the optimal solution, since maximizing (34) is not feasible and in practice only a proxy lower bound is optimized during training; see, e.g., Alemi et al. [2]). Then by Corollary 1 the obtained network has guaranteed predictive performance.

Proof of Theorem 2
Proof of (a). The bounds given in the proof of Theorem 1 in [24] imply L_n ≤ J_{n,1} + J_{n,2} + J_{n,3} + J_{n,4} + J_{n,5}; see Abou-Jaoude [1] and Csiszár [12]. Therefore, the dominated convergence theorem yields the claim. Note that in the proof of part (b), condition (5) is not used at all.

Concluding remarks
We studied the excess minimum risk in statistical inference and, under mild conditions, gave a strongly consistent procedure to test from data whether a given transformation of the observed feature vector results in zero excess minimum risk for all loss functions. It is an open research problem whether a strong universal test exists, i.e., a test that is strongly consistent without any condition on the transformation and on the underlying distribution. We also developed information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. The bounds have not been stated in their most general possible form, in that the observed quantities were restricted to take values in Euclidean spaces and we did not allow transformations that are random functions of the observation; both of these restrictions can be relaxed. The bounds can likely be sharpened in specific cases, but in their present form they are already useful. For example, they give additional theoretical motivation for applying the information bottleneck approach in deep learning.
which proves the upper bound in (15), since ε > 0 is arbitrary. By expanding I(Y; X|Z) in two different ways using the chain rule for mutual information (e.g., Cover and Thomas [11, Thm. 2.5.2]), and using the conditional independence of Y and T(X) given X, one obtains I(Y; X|T(X)) = I(Y; X) − I(Y; T(X)), which shows the equality in (15).
The conditional mutual information I(X; Y|Z) is defined as I(X; Y|Z) = ∫ D(P_{XY|Z=z} ∥ P_{X|Z=z} P_{Y|Z=z}) P_Z(dz). The integral above is also denoted by D(P_{YX|Z} ∥ P_{Y|Z} P_{X|Z} | P_Z) and is called the conditional KL divergence. From the definition it is clear that I(X; Y|Z) = 0 if and only if X and Y are conditionally independent given Z, i.e., if and only if Y → Z → X (or equivalently, if and only if X → Z → Y).
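The definition reduces to a finite sum for discrete variables. The following sketch (distributions hypothetical) verifies that I(X; Y|Z) vanishes under the Markov chain Y → Z → X and becomes positive once conditional independence is broken:

```python
import numpy as np

def cond_mi(p):
    """I(X; Y | Z) in nats from a joint pmf array p[x, y, z]."""
    total = 0.0
    for z in range(p.shape[2]):
        pz = p[:, :, z].sum()
        if pz == 0:
            continue
        pxy = p[:, :, z] / pz                       # P_{XY|Z=z}
        px = pxy.sum(1, keepdims=True)
        py = pxy.sum(0, keepdims=True)
        m = pxy > 0
        total += pz * (pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum()
    return float(total)

# Markov chain Y -> Z -> X: X and Y are conditionally independent given Z,
# so p[x, y, z] = P(Z=z) P(X=x|Z=z) P(Y=y|Z=z)  (numbers hypothetical)
pz = np.array([0.4, 0.6])
px_z = np.array([[0.9, 0.1], [0.3, 0.7]])   # rows: P(X = . | Z = z)
py_z = np.array([[0.2, 0.8], [0.5, 0.5]])   # rows: P(Y = . | Z = z)
p = np.einsum('z,zx,zy->xyz', pz, px_z, py_z)

# break conditional independence inside the slice Z = 0
p2 = p.copy()
p2[0, 0, 0] += 0.005; p2[1, 1, 0] += 0.005
p2[0, 1, 0] -= 0.005; p2[1, 0, 0] -= 0.005
```

Evaluating `cond_mi(p)` gives (numerically) zero, while `cond_mi(p2)` is strictly positive, matching the characterization of conditional independence above.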
This inequality seems to have first appeared in published form in Hafez-Kolahi et al. [27, Lemma 1], where the proof was attributed to Xu and Raginsky [44].

Proof of Corollary 1. If ℓ is uniformly bounded, then for any f : R^d → Y one has ℓ(y, f(x)) ∈ [0, ℓ_∞] for all y and x. Then Hoeffding's lemma (e.g., Boucheron et al. [8, Lemma 2.2]) implies that for all y, ℓ(y, f(X)) is conditionally σ²-subgaussian with σ² = ℓ_∞²/4.

With W*(P_{R|X=x}) replaced by W*(P_{R|X=x,Z=z}) and W* replaced by W*(P_{R|Z=z}), with z fixed, we can bound the expression in parentheses as ∫ W*(P_{R|X=x,Z=z}) P_{X|Z=z}(dx) − W*(P_{R|Z=z}) ≤ ∫ D(P_{R|X=x,Z=z} ∥ P_{R|Z=z}) P_{X|Z=z}(dx), and therefore W*(X) − W*(Z) ≤ ∫∫ D(P_{R|X=x,Z=z} ∥ P_{R|Z=z}) P_{X|Z=z}(dx) P_Z(dz).