Next Article in Journal
Deep-Learning-Based Classification of Cyclic-Alternating-Pattern Sleep Phases
Next Article in Special Issue
Variable-Length Resolvability for General Sources and Channels
Previous Article in Journal
Complexity Synchronization of Organ Networks
Previous Article in Special Issue
On Optimal and Quantum Code Construction from Cyclic Codes over FqPQ with Applications
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Lossless Transformations and Excess Risk Bounds in Statistical Inference

1
Department of Computer Science and Information Theory, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
2
Department of Mathematics and Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada
3
Fachbereich Mathematik, Universität Stuttgart, 70569 Stuttgart, Germany
*
Author to whom correspondence should be addressed.
Entropy 2023, 25(10), 1394; https://doi.org/10.3390/e25101394
Submission received: 2 August 2023 / Revised: 26 September 2023 / Accepted: 26 September 2023 / Published: 28 September 2023
(This article belongs to the Special Issue Advances in Information and Coding Theory II)

Abstract

:
We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss when estimating a random variable from an observed feature vector and the minimum expected loss when estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless, and we show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that uniformly hold over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a δ -lossless transformation and give sufficient conditions for a given transformation to be universally δ -lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottlenecks, and deep learning are also surveyed.

1. Introduction

We consider the standard setting of statistical inference, where Y is a real random variable, having a range Y R , which is to be estimated (predicted) from a random observation (feature) vector X, taking values in R d . Given a measurable predictor f : R d Y and measurable loss function : Y × Y R + , the loss incurred is ( Y , f ( X ) ) . The minimum expected risk in predicting Y from the random vector X is
L ( Y | X ) = inf f : R d Y E [ ( Y , f ( X ) ) ] ,
where the infimum is over all measurable f.
Suppose that the tasks of collecting data and making the prediction are separated in time or in space. For example, the separation in time happens when the data are collected first and the statistical modeling and analysis are performed much later. Separation in space can be due, for example, to collecting data at a remote location and making predictions centrally. Such situations are modeled using a transformation T : R d R d , so that the prediction regarding Y is made from the transformed observation T ( X ) , instead of X. An important example for such a transformation is quantization, in which case T ( X ) is a discrete random variable. Clearly, one always has L ( Y | T ( X ) ) L ( Y | X ) . The difference L ( Y | T ( X ) ) L ( Y | X ) is sometimes referred to in the literature as excess risk. A part of this paper is concerned with transformations for which the excess risk is zero, no matter the underlying loss function . Such transformations are universally lossless, in the sense that they can be chosen before the cost function for the underlying problem is known. More formally, we can state the following definition.
Definition 1
(lossless transformation). For a fixed joint distribution of Y and X, a (measurable) transformation T : R d R d is called universally lossless if for any loss function : Y × Y R + we have
L ( Y | T ( X ) ) = L ( Y | X ) .
An important special transformation is feature selection. Formally, for the observation (feature) vector X = ( X ( 1 ) , , X ( d ) ) and S { 1 , , d } , consider the | S | -dimensional vector X S = ( X ( i ) , i S ) . Typically, the dimension | S | of X S is significantly smaller than d, the dimension of X. If we have
L ( Y | X S ) = L ( Y | X ) ,
for all loss functions , then the feature selector X X S is universally lossless. For fixed loss , the performance of any statistical inference method is sensitive to the dimension of the feature vector. Therefore, dimension reduction is crucial before choosing or constructing an inference method. If X S is universally lossless, then the complement feature subvector X S c is irrelevant. It is an open research problem how to efficiently search a universally lossless X S with minimum size | S | . Since, typically, the distribution of the pair X and Y is not known and must be inferred from data, any such search algorithm needs a procedure for testing for the universal losslessness property of a feature selector.
In the first part of this paper, we give a necessary and sufficient condition for a given transformation T to be universally lossless and then construct a partitioning-based statistic for testing this condition if independent and identically distributed training data are available. With the null hypothesis being that a given transformation is universally lossless, the test is shown to be strongly consistent, in the sense that it almost surely (a.s.) makes finitely many Type I and II errors.
In many situations, requiring that a transformation T is universally lossless is too demanding. The next definition relaxes this requirement.
Definition 2
( δ -lossless transformation). For a fixed joint distribution of Y and X, and δ > 0 , a transformation T : R d R d is called universally δ -lossless with respect to a class of loss functions L , if we have
L ( Y | T ( X ) ) L ( Y | X ) δ f o r a l l L .
In the second part of this paper, we derive bounds on the excess minimum risk L ( Y | T ( X ) ) L ( Y | X ) in terms of the mutual information difference I ( Y ; X ) I ( Y ; T ( X ) ) under various assumptions about . With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be δ -lossless with respect to fairly general classes of loss functions . Applications to classification, nonparametric regression, portfolio strategies, the information bottleneck method, and deep learning are also reviewed.
  • Relationship with prior work
Our first result, Theorem 1, which shows that a transformation is universally lossless if and only if it is a sufficient statistic, is likely known, but we could not find it in this explicit form in the literature (a closely related result is the classical Rao–Blackwell theorem of mathematical statistics, e.g., Schervish ([1], Theorem 3.22)). Due to this result, testing with independent data whether or not a given transformation is universally lossless turns into a test for conditional independence. Our test in Theorem 2 is based on the main results in Györfi and Walk [2], but our construction is more general and we also correct an error in the proof of ([2], Theorem 1). Apart from [2], most of the results in the literature of testing for conditional independence are for real-valued random variables and/or assume certain special distribution types, typically the existence of a joint probability density function. Such assumptions exclude problems where Y is discrete and X is continuous, as is typical in classification, or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution free and its convergence properties are also (almost) distribution free. A more detailed review of related work is given in Section 2.1.
The main result in Section 3 is Theorem 3, which bounds the excess risk in terms of the square root of the mutual information difference I ( Y ; X ) I ( Y ; T ( X ) ) . There is a history of such bounds, possibly starting with Xu and Raginsky [3], where the generalization error of a learning algorithm was upper bounded using constant times the square root of the mutual information between the hypothesis and the training data (see also the references in [3,4]). This result has since been extended in various forms, mostly concentrating on providing information-theoretic bounds for the generalization capabilities of learning algorithms, instead of looking at the excess risk; see, e.g., Raginsky et al. [5], Lugosi and Neu [6], Jose and Simeone [7], and the references therein, just to mention a few of these works. The most relevant recent work relating to our bounds in Section 3 seems to be Xu and Raginsky [4], where, among other things, information-theoretic bounds were developed on the excess risk in a Bayesian learning framework; see also Hafez-Kolahi et al. [8]. The bounds in [4] are not on the excess risk L ( Y | T ( X ) ) L ( Y | X ) ; they involve training data, but their forms are similar to ours. It appears that our Theorem 3 gives a bound that holds uniformly for a larger class of loss functions and joint distributions of Y and X; however, in [4], several other bounds are presented that are tighter and/or allow more general distributions, for specific fixed loss functions.
  • Organization
This paper is organized as follows. In Section 2, we characterize universally lossless transformations and introduce a novel strongly consistent test for the property of universal losslessness. In Section 3, information-theoretic bounds on the excess minimum risk are developed and are used to characterize the δ -losslessness property of transformations. Section 4 surveys connections with, and applications to, specific prediction problems, as well as the information bottleneck method in deep learning. The somewhat lengthy proof of the strong consistency of the test in Theorem 2 is given in Section 5. Concluding remarks are given in Section 6.

2. Testing the Universal Losslessness Property

In this section, we first give a characterization of universally lossless transformations for a given distribution of the pair ( X , Y ) . In practice, the distribution of ( X , Y ) may not be known, but a sequence of independent and identically distributed (i.i.d.) copies of ( X , Y ) may be available. For this case, we construct a procedure to test if a given transformation is universally lossless and prove that, under mild conditions, the test is strongly consistent.

2.1. Universally Lossless Transformations

Based on Definition 1, we introduce the null hypothesis
H 0 = { T : the   transformation T is   universally   lossless . }
A transformation (statistic) T ( X ) is called sufficient if the random variables Y, T ( X ) , X form a Markov chain in this order, denoted by Y T ( X ) X (see, e.g., Definition 3.8 and Theorem 3.9 in Polyanskiy and Wu [9]).
For a binary valued Y, Theorems 32.5 and 32.6 from Devroye et al. [10] imply that the statistic T ( X ) is universally lossless if, and only if, it is sufficient. The following theorem extends this property to general Y. This result is likely known, but we could not find it in the given form.
Theorem 1.
The transformation T is universally lossless if, and only if, Y T ( X ) X is a Markov chain.
Proof. 
Assume first that Y T ( X ) X is a Markov chain. This is equivalent to having P ( Y A | X , T ( X ) ) = P ( Y A | T ( X ) ) almost surely (a.s.) for any measurable A Y . Then we have
L ( Y | X , T ( X ) ) = E inf y Y E [ ( Y , y ) | X , T ( X ) ] = E inf y Y E [ ( Y , y ) | T ( X ) ] = L ( Y | T ( X ) ) .
Since L ( Y | X , T ( X ) ) L ( Y | X ) L ( Y | T ( X ) ) always holds, we obtain L ( Y | T ( X ) ) = L ( Y | X ) for all , so T ( X ) is universally lossless.
Now, assume that the Markov chain condition Y T ( X ) X does not hold. Then, there exists a measurable A Y with 0 < P ( Y A ) < 1 and B R d with P ( X B ) > 0 , such that
P ( Y A | X , T ( X ) ) P ( Y A | T ( X ) ) if X B .
Let h ( y ) = I y A , where I E is the indicator function of event E, and define the binary valued Y ^ as Y ^ = h ( Y ) . Then, the Markov chain condition Y ^ T ( X ) X does not hold. For this special case, Theorems 32.5 and 32.6 in [10] show that there a loss function ^ : { 0 , 1 } 2 R + exists, such that L ^ ( Y ^ | T ( X ) ) > L ^ ( Y ^ | X ) . Finally, letting ( y , y ) = ^ ( h ( y ) , h ( y ) ) , we have
L ( Y | T ( X ) ) = L ^ ( Y ^ | T ( X ) ) > L ^ ( Y ^ | X ) = L ( Y | X ) ,
which shows that T ( X ) is not universally lossless. □

2.2. A Strongly Consistent Test

Theorem 1 implies an equivalent form of the losslessness null hypothesis defined by (1)
H 0 = { T : Y T ( X ) X is   a   Markov   chain } ,
or equivalently, H 0 holds if and only if X and Y are conditionally independent given T ( X ) :
H 0 : P X A , Y B T ( X ) = P X A T ( X ) P Y B T ( X ) a . s .
for arbitrary Borel sets A , B . Furthermore, we consider the general case where the alternative hypothesis H 1 is the complement of H 0 : H 1 = H 0 c .
Now, assume that the joint distribution of ( X , Y , T ( X ) ) is not known but instead a sample of independent and identically distributed (i.i.d.) random vectors ( X 1 , Y 1 , Z 1 ) , , ( X n , Y n , Z n ) having a common distribution of ( X , Y , Z ) is given, where Z i = T ( X i ) and Z = T ( X ) .The goal is to test the hypothesis H 0 of conditional independence based on these data. In fact, our goal is to provide a strongly consistent test; i.e., a test that, with a probability of one, only makes finitely many Type I and II errors.
For testing conditional independence, most of the results in the literature used real valued X , Y , Z . Based on kernel density estimation, Cai et al. [11] introduced a test statistic and under the null hypothesis calculated its limit distribution. In Neykov et al. [12], a gap was introduced between the null and alternative hypotheses. This gap was characterized by the total variation distance, which decreased with increasing n. Under certain smoothness conditions, minimax bounds were derived. According to Shah and Peters [13], a regularity condition such as our Lipschitz condition (5) below cannot be omitted if a test for conditional independence is to be consistent. This is a consequence of their no-free lunch theorem that states that, under general conditions, if with the null hypothesis the bound on the error probability is non-asymptotic, then under the alternative hypothesis the rate of convergence of the error probability can be arbitrarily slow, which is a well-known phenomenon in nonparametric statistics. We note that these cited results, and indeed most of the results in the literature when testing for conditional independence, were for real-valued random variables and/or assumed certain special distribution types, typically the existence of a joint probability density function or that both X and Y are discrete, as in [12]. As we remarked earlier, such assumptions exclude problems where Y is discrete and X is continuous (typical in classification) or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution-free and its convergence properties are almost distribution-free (we do assume a mild Lipschitz-type condition; see the upcoming Condition 1).
In our hypotheses testing setup, the alternative hypothesis, H 1 , is the complement to the null hypothesis, H 0 ; therefore, there is no separation gap between the hypotheses. Dembo and Peres [14] and Nobel [15] characterized hypothesis pairs that admitted strongly consistent tests; i.e., tests that with a probability of one only make finitely many Type I and II errors. This property is called discernibility. As an illustration of the intricate nature of the discernibility concept, Dembo and Peres [14] demonstrated an exotic example, where the null hypothesis is that the mean of a random variable is rational, while the alternative hypothesis is that this mean minus 2 is rational. (See also Cover [16] and Kulkarni and Zeitouni [17].) The discernibility property shows up in Biau and Györfi [18] (testing homogeneity), Devroye and Lugosi [19] (classification of densities), Gretton and Györfi [20] (testing independence), Morvai and Weiss [21] and Nobel [15] (classification of stationary processes), among others.
In the remainder of this section, under mild conditions for the distribution of ( X , Y ) , we study discernibility in the context of lossless transformations for statistical inference with general risk. We will make strong use of the multivariate-partitioning-based test of Györfi and Walk [2].
Let P X Y Z denote the joint distribution of ( X , Y , Z ) and similarly for any marginal distribution of ( X , Y , Z ) ; e.g., P X Z denotes the distribution of the pair ( X , Z ) . As in Györfi and Walk [2], introduce the following empirical distributions:
P X Y Z n ( A , B , C ) = # { i : ( X i , Y i , Z i ) A × B × C , i = 1 , , n } n , P X Z n ( A , C ) = # { i : ( X i , Z i ) A × C , i = 1 , , n } n , P Y Z n ( B , C ) = # { i : ( Y i , Z i ) B × C , i = 1 , , n } n ,
and
P Z n ( C ) = # { i : Z i C , i = 1 , , n } n ,
for Borel sets A R d , B R , and C R d .
For the sake of simplicity, assume that X, Y, and Z = T ( X ) are bounded. Otherwise, we apply a componentwise, one-to-one scaling in the interval [ 0 , 1 ] . Obviously, the losslessness null hypothesis H 0 is invariant under such a scaling. Let
P n = { A n , 1 , , A n , m n } , Q n = { B n , 1 , , B n , m n } , R n = { C n , 1 , , C n , m n }
be the finite cubic partitions of the ranges X, Y, and Z, with all the cubes having common side lengths h n (thus, h n is proportional to 1 / m n ). As in [2], we define the test statistic
L n = A P n , B Q n , C R n P X Y Z n ( A , B , C ) P X Z n ( A , C ) P Y Z n ( B , C ) P Z n ( C ) .
Our test rejects H 0 if
L n t n ,
and accepts it if L n < t n , where the threshold t n is set to
t n = c 1 m n m n m n n + m n m n n + m n m n n + m n n + ( log n ) h n ,
where the constant c 1 satisfies
c 1 > 2 log 2 1.177 .
In this setup, the distribution of ( X , Y ) is arbitrary; its components can be discrete or absolutely continuous, or a mixture of the two or even singularly continuous. It is important to note that to construct this test, there is no need to know the type of distribution.
We assume that the joint distribution of X, Y, and Z = T ( X ) satisfies the following assumption.
Condition 1.
Let p ( · | z ) be the density of the conditional distribution P X | Z = z = P ( X · | Z = z ) with respect to the distribution P X as a dominating measure and introduce the notation
C n ( z ) = C n , j i f z C n , j .
Assume that for some C > 0 , p ( x | z ) satisfies the condition
| p ( x | z ) C n ( z ) p ( x | z ) P Z ( d z ) P Z ( C n ( z ) ) | P X ( d x ) P Z ( d z ) C h n ,
for all n.
We note that the ordinary Lipschitz condition
| p ( x | z ) p ( x | z ) | P X ( d x ) C d z z for all z , z R d
implies (5). This latter condition is equivalent to
d T V P X | Z = z , P X | Z = z C 2 d z z for all z , z R d ,
where d T V ( P , Q ) denotes the total variation distance between distributions P and Q. In Neykov, Balakrishnan, and Wasserman [12], condition (6) is called the Null TV Lipschitz condition.
The next theorem is an adaptation and extension of the results in Györfi and Walk [2] to this particular problem of lossless transformation. In [2], it was assumed that the sequence of partitions { P n , Q n , R n } is nested, while we make no such assumption. The proof, in which an error made in [2] is also corrected, is relegated to Section 5.
Theorem 2.
Suppose that X, Y, and Z = T ( X ) are bounded and Condition 1 holds for all n. If the sequence h n satisfies
lim n n h n d + 1 + d =
and
lim n h n d log n = 0 ,
then we have the following:
(a)
Under the losslessness null hypothesis H 0 , we have for all n e C ,
P ( L n t n ) 4 e ( c 1 2 / 2 log 2 ) m n ,
and therefore, because n = 1 P ( L n t n ) < by (8) and (9), after a random sample size, the test produces no error with a probability of one.
(b)
Under the alternative hypothesis H 1 = H 0 c ,
lim inf n L n > 0 a . s . ,
thus, with a probability of one, after a random sample size, the test produces no error.
Remark 1.
(i)
The choice h n = n δ with 0 < δ < 1 / ( d + 1 + d ) satisfies both conditions (7) and (8).
(ii)
Note that using (4), t n is of order c 1 m n m n m n n + ( log n ) h n . Since we have
m n = O ( 1 / h n d ) , m n = O ( 1 / h n ) , m n = O ( 1 / h n d ) ,
this means that t n is of order 1 / ( n h n d + 1 + d ) + ( log n ) h n .
An important special transformation is given by the feature selection X S defined in the Introduction. Theorem 2 demonstrates the possibility of universally lossless dimension reduction for any multivariate feature vector. Note that in the setup of feature selection, the partition P n can be the nested version of R n and so the calculation of the test statistic L n is easier.

3. Universally δ -Lossless Transformations

Here, we develop bounds on the excess minimum risk, in terms of mutual information under various assumptions about the loss function. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be universally δ -lossless with respect to fairly general classes of loss functions .

3.1. Preliminaries on Mutual Information

Let P X Y denote the joint distribution of the pair ( X , Y ) and let P X P Y denote the product of the marginal distributions of X and Y, respectively. The mutual information between X and Y, denoted by I ( X ; Y ) , is defined as
I ( X ; Y ) = D ( P X Y P X P Y ) ,
where
D ( P Q ) = d P d Q log d P d Q d Q if P Q otherwise ,
is the Kullback–Leibler (KL) divergence between probability distributions P and Q (here, P Q means that P is absolutely continuous with respect to Q with the Radon–Nikodym derivative d P d Q ). Thus, I ( X ; Y ) is always nonnegative and I ( X ; Y ) = 0 if and only if X and Y are independent (note that I ( X ; Y ) = is possible). In this definition and throughout the paper, log denotes the natural logarithm.
For random variables U and V (both taking values in finite-dimensional Euclidean spaces), let P U | V denote the conditional distribution of U, given V. Furthermore, let P U | V = v denote the stochastic kernel (regular conditional probability) induced by P U | V . Thus, in particular, P U | V = v ( A ) = P ( U A | V = v ) for each measurable set A.
Given another random variable Z, the conditional mutual information I ( X ; Y | Z ) is defined as
I ( X ; Y | Z ) = D ( P X Y | Z = z P X | Z = z P Y | Z = z ) P Z ( d z ) .
The integral above can also be denoted by D ( P Y X | Z P Y | Z P X | Z | P Z ) and is called the conditional KL divergence. One can define
I ( X ; Y | Z = z ) = D ( P X Y | Z = z P X | Z = z P Y | Z = z )
so that
I ( X ; Y | Z ) = I ( X ; Y | Z = z ) P Z ( d z ) .
From this definition it is clear that I ( X ; Y | Z ) = 0 if and only if X and Y are conditionally independent given Z, i.e., if and only if Y Z X (or equivalently, if and only if X Z Y ).
Another way of expressing I ( X ; Y ) is
I ( X ; Y ) = D ( P Y | X = x P Y ) P X ( d x ) .
One can see that in a similar way to I ( X ; Y | Z ) can be expressed as
I ( X ; Y | Z ) = D ( P Y | X = x , Z = z P Y | Z = z ) P X | Z = z ( d x ) P Z ( d z ) .
Properties of mutual information and conditional mutual information, their connections to the KL divergence, and identities involving these information measures are detailed in, e.g., Cover and Thomas ([22], Chapter 2) and Polyanskiy and Wu ([9], Chapter 3).

3.2. Mutual Information Bounds and δ -Lossless Transformations

A real random variable U with finite expectation is said to be σ 2 -sub-Gaussian for some σ 2 > 0 if
log E e λ ( U E [ U ] ) σ 2 λ 2 2 for all λ R .
Furthermore, we say that U is conditionally σ 2 -sub-Gaussian given another random variable V if we have a.s.
log E e λ ( U E [ U | V ] ) | V σ 2 λ 2 2 for all λ R .
The following result gives a quantitative upper bound on the excess minimum risk L ( Y | T ( X ) ) L ( Y | X ) in terms of the mutual information difference I ( Y ; X ) I ( Y ; T ( X ) ) under certain, not too restrictive, conditions. Note that L ( Y | T ( X ) ) L ( Y | X ) 0 always holds.
Given ϵ > 0 , we call an estimator f : R d Y ϵ -optimal if E [ ( Y , f ( X ) ) ] < L ( Y | X ) + ϵ .
Theorem 3.
Let T : R d R d be a measurable transformation and assume that for any ϵ > 0 , there exists an ϵ-optimal estimator f of Y from X, such that ( y , f ( X ) ) ) is conditionally σ 2 ( y ) -sub-Gaussian given T ( X ) for every y Y , i.e.,
log E e λ ( y , f ( X ) ) E [ ( y , f ( X ) ) | T ( X ) ] T ( X ) σ 2 ( y ) λ 2 2 a . s
for all λ R and y R , where σ 2 : R R + satisfies E [ σ 2 ( Y ) ] < . Then, one has
L ( Y | T ( X ) ) L ( Y | X ) 2 E [ σ 2 ( Y ) ] I ( Y ; X | T ( X ) ) = 2 E [ σ 2 ( Y ) ] I ( Y ; X ) I ( Y ; T ( X ) ) .
Remark 2.
(i)
In case I ( Y ; X | T ( X ) ) = , we interpret the right hand side of (15) as ∞. With this interpretation, the bound always holds.
(ii)
We show in Section 4.2 that the sub-Gaussian condition (14) holds for the regression problem with squared error ( y , y ) = ( y y ) 2 if Y = m ( X ) + N , where N is independent noise having a zero mean and finite fourth moment E [ N 4 ] < , and the regression function m ( x ) = E [ Y | X = x ] is bounded. In particular, the bound in the theorem holds if N is normal with zero mean and m is bounded.
We note that Theorem 6 and Corollary 3 in Xu and Raginsky [4] give bounds similar to (15), in the somewhat different context of Bayesian learning. However, the conditions there exclude, e.g., regression models in the form Y = m ( X ) + N if E [ | N | δ ] = for some δ > 0 .
(iii)
Although hidden in the notation, E [ σ 2 ( Y ) ] depends on the loss function ℓ. Thus, the upper bound (15) is the product of two terms, the second of which,
I ( Y ; X ) I ( Y ; T ( X ) ) ,
is independent of the loss function.
(iv)
The bound in the theorem is not tight in general. In Section 4.3, an example is given in the context of portfolio selection, where the excess risk can be upper bounded by the difference I ( Y ; X ) I ( Y ; T ( X ) ) .
(v)
The proof of Theorem 3 and those of its corollaries go through virtually without change if we replace T ( X ) with any R d -valued random variable Z, such that Y X Z . Under the conditions of the theorem, we then have
L ( Y | Z ) L ( Y | X ) 2 E [ σ 2 ( Y ) ] I ( Y ; X | Z ) = 2 E [ σ 2 ( Y ) ] I ( Y ; X ) I ( Y ; Z ) .
In fact, Theorem 3 and its corollaries hold for general random variables Y, X, and Z taking values in complete and separable metric (Polish) spaces Y , X , and Z , respectively, if Y X Z .
The proof of Theorem 3 is based on a slight generalization of Raginsky et al. ([5], Lemma 10.2), which we state next. In the lemma, U and V are arbitrary abstract random variables defined for the same probability space and taking values in spaces U and V , respectively; U ¯ and V ¯ are independent copies of U and V (so that P U ¯ V ¯ = P U P V ); and h : U × V R is a measurable function.
Lemma 1.
Assume that h ( u , V ) is σ 2 ( u ) -sub-Gaussian for all u U , where E [ σ 2 ( U ) ] < . Then,
| E [ h ( U , V ) ] E [ h ( U ¯ , V ¯ ) ] | 2 E [ σ 2 ( U ) ] I ( U ; V ) .
Proof. 
We essentially copy the proof of ([5], Lemma 10.2), where it was assumed that σ 2 ( u ) does not depend on u. With this restriction, the sub-Gaussian condition (14) in Theorem 3 would have to hold with σ 2 ( y ) σ 2 uniformly over y. This condition would exclude regression models with independent sub-Gaussian noise and, a fortiori, models with independent noise that do not possess finite absolute moments of all orders, while our Theorem 2 can also be applied in such cases (see Section 4.2).
We make use of the Donsker–Varadhan variational representation of the relative entropy ([23] Corollary 4.15), which states that
D ( P Q ) = sup F F d P log e F d Q ,
where the supremum is over all measurable F : Ω R , such that e F d Q < . Applying this with P = P V | U = u , Q = P V , and F = λ h ( u , V ) , we obtain
D ( P V | U = u P V ) E [ λ h ( u , V ) | U = u ] log E [ e λ h ( u , V ) ] λ E [ h ( u , V ) | U = u ] E [ λ h ( u , V ) ] λ 2 σ 2 ( u ) 2 ,
where the second inequality follows from the assumption that h ( u , V ) is σ 2 ( u ) -sub-Gaussian. Maximizing the right-hand side of (16) over λ R gives, after rearrangement,
| E [ h ( u , V ) | U = u ] E [ h ( u , V ) ] | 2 σ 2 ( u ) D ( P V | U = u P V ) .
Since U ¯ and V ¯ are independent, E [ h ( u , V ) ] = E [ h ( U ¯ , V ¯ ) | U ¯ = u ] , and we obtain
| E [ h ( U , V ) ] E [ h ( U ¯ , V ¯ ) ] | = | E [ h ( U , V ) | U = u ] E [ h ( U ¯ , V ¯ ) | U ¯ = u ] P U ( d u ) | = | E [ h ( u , V ) | U = u ] h ( u , V ) P U ( d u ) | | E [ h ( u , V ) | U = u ] h ( u , V ) | P U ( d u )
2 σ 2 ( u ) D ( P V | U = u P V ) P U ( d u )
2 σ 2 ( u ) P U ( d u ) D ( P V | U = u P V ) P U ( d u )
= 2 E [ σ 2 ( U ) ] I ( U ; V ) ,
where (18) follows from Jensen’s inequality, (19) follows from (17), in (20) we used the Cauchy–Schwarz inequality, and the last equality follows from (11). □
Proof of Theorem 3.
Let Y ¯ and X ¯ be random variables, such that P Y ¯ | T ( X ¯ ) = P Y | T ( X ) , P X ¯ | T ( X ¯ ) = P X | T ( X ) , P T ( X ¯ ) = P T ( X ) , and Y ¯ and X ¯ are conditionally independent given T ( X ¯ ) . Thus, the joint distribution of the triple ( Y ¯ , X ¯ , T ( X ¯ ) ) is P Y ¯ X ¯ T ( X ¯ ) = P Y | T ( X ) P X | T ( X ) P T ( X ) .
We apply Lemma 1 with U = Y , V = X , and h ( u , v ) = ( y , f ( x ) ) . Note that, using the conditions of the theorem, we can choose an ϵ -optimal f , such that for every y, ( y , f ( X ) ) is conditionally σ 2 ( y ) -sub-Gaussian given T ( X ) . Consider E [ ( Y , f ( X ) ) | T ( X ) = z ] and E [ ( Y ¯ , f ( X ¯ ) ) , | T ( X ¯ ) = z ] as regular (unconditional) expectations taken with respect to P Y X | T ( X ) = z and P Y ¯ X ¯ | T ( X ¯ ) = z respectively, and consider I ( Y ; X | T ( X ) = z ) as regular mutual information between random variables with the distribution P Y X | T ( X ) = z . Since Y ¯ and X ¯ are conditionally independent given T ( X ¯ ) = z , Lemma 1 yields
| E [ ( Y , f ( X ) ) | T ( X ) = z ] E [ ( Y ¯ , f ( Y ¯ ) ) | T ( X ¯ ) = z ] | 2 E [ σ 2 ( Y ) | T ( X ) = z ] I ( X ; Y | T ( X ) = z ) .
Recalling that T ( X ¯ ) and T ( X ) have the same distribution, and applying Jensen’s inequality and the Cauchy–Schwarz inequality as in (18) and (20), we obtain
| E [ ( Y , f ( X ) ) ] E [ ( Y ¯ , f ( X ¯ ) ) ] | | E [ ( Y , f ( X ) ) | T ( X ) = z ] E [ ( Y ¯ , f ( X ¯ ) ) | T ( X ¯ ) = z ] | P T ( X ) ( d z ) 2 E [ σ 2 ( Y ) | T ( X ) = z ] I ( Y ; X | T ( X ) = z ) P T ( X ) ( d z ) 2 E [ σ 2 ( Y ) | T ( X ) = z ] P T ( X ) ( d z ) I ( Y ; X | T ( X ) = z ) P T ( X ) ( d z ) = 2 E [ σ 2 ( Y ) ] I ( Y ; X | T ( X ) ) .
On the one hand, we have
E ( Y ¯ ; f ( X ¯ ) ) L ( Y ¯ | X ¯ ) = L ( Y ¯ | T ( X ¯ ) ) = L ( Y | T ( X ) ) ,
where the first equality follows from Theorem 1 with the conditional independence of Y ¯ and X ¯ given T ( X ¯ ) , and the second follows, since ( Y ¯ , T ( X ¯ ) ) and ( Y , T ( X ) ) have the same distribution by construction. On the other hand, L ( Y | X ) E [ ( Y , f ( X ) ) ] ϵ . Thus, (22) and (23) imply
0 L ( Y | T ( X ) ) L ( Y | X ) E [ ( Y ¯ , f ( X ¯ ) ) ] E [ ( Y , f ( X ) ) ] + ϵ 2 E [ σ 2 ( Y ) ] I ( Y ; X | T ( X ) ) + ϵ ,
which proves the upper bound in (15), since ϵ > 0 is arbitrary. By expanding I ( Y ; X | Z ) in two different ways using the chain rule for mutual information (e.g., Cover and Thomas ([22], Thm. 2.5.2)), and using the conditional independence of Y and T ( X ) given X, one obtains I ( Y ; X | T ( X ) ) = I ( Y ; X ) I ( Y ; T ( X ) ) , which shows the equality in (15). □
We state two corollaries for special cases. In the first, we assume that is uniformly bounded, i.e., = sup y , y Y ( y , y ) < . For any c > 0 , let L ( c ) denote the collection of all loss functions with c . Recall the notion of a universally δ -lossless transformation from Definition 2.
Corollary 1.
Suppose the loss function ℓ is bounded. Then, for any measurable T : R d R d , we have
L ( Y | T ( X ) ) L ( Y | X ) 2 I ( Y ; X ) I ( Y ; T ( X ) ) .
Therefore, whenever
I ( Y ; X ) I ( Y ; T ( X ) ) 2 δ 2 c 2 ,
the transformation T is universally δ-lossless for the family L ( c ) , i.e., L ( Y | T ( X ) ) L ( Y | X ) δ for all ℓ with c .
Remark 3.
(i)
The bound of the theorem can be used to give an estimation-theoretic motivation of the information bottleneck (IB) problem; see Section 4.4.
(ii)
Let L ( Y ) = L ( Y | ) = inf y Y E [ ( Y , y ) ] . For bounded ℓ, the inequality
L ( Y ) L ( Y | X ) 2 2 I ( Y ; X )
was proven in Makhdoumi et al. ([24], Theorem 1) for discrete alphabets, to solve the so-called privacy funnel problem.This inequality follows from (15) by setting Z = T ( X ) to be constant there.
(iii)
A simple self-contained proof of (24) (see below) was provided by Or Ordentlich and communicated to the second author by Shlomo Shamai [25], in response to an early version of this manuscript. The bound in (24) seems to have first appeared in published form in Hafez-Kolahi et al. ([26], Lemma 1), where the proof was attributed to Xu and Raginsky [27].
Proof of Corollary 1. 
If is uniformly bounded, then for any f : R d Y one has ( y , f ( x ) ) [ 0 , ] for all y and x. Then Hoeffding’s lemma (e.g., Boucheron et al. ([23], Lemma 2.2)) implies that for all y, ( y , f ( X ) ) is conditionally σ 2 -sub-Gaussian with σ 2 = 2 4 given T ( X ) . Since an ϵ -optimal estimator f exists for any ϵ > 0 and ( y , f ( X ) ) is conditionally σ 2 -sub-Gaussian, given T ( X ) using the preceding argument, (24) follows from Theorem 3. The second statement follows directly from (24) and the fact that c for all L ( c ) .
The following alternative argument by Or Ordentlich [25] is based on Pinsker’s inequality for the total variation distance in terms of the KL divergence (see, e.g., ([9], Theorem 7.9)). For bounded , this gives a direct proof of an analogue of the key inequality (22) in the proof of Theorem 3. This argument avoids Lemma 1 and the machinery introduced by the sub-Gaussian assumption.
Using the same notation as in the proof of Theorem 3 and letting P = P Y X Z and Q = P Y ¯ X ¯ T ( Y ¯ ) , we have
E ( Y , f ( X ) ) E ( Y ¯ , f ( X ¯ ) ) | = ( y , f ( x ) ) d P ( y , f ( x ) ) d Q d TV ( P , Q ) 2 D ( P Q ) ( by Pinsker s inequality ) = 2 D ( P Y X | T ( X ) P Z P Y | T ( X ) P X | T ( X ) P T ( X ) ) = 2 D ( P Y X | T ( X ) P Y | T ( X ) P X | T ( X ) | P T ( X ) ) = 2 I ( X , Y | T ( X ) ) .
The rest of the proof proceeds exactly as in Theorem 3. □
In the second corollary, we do not require that be bounded but assume that an optimal estimator f from X to Y exists, such that ( y , f ( X ) ) is conditionally σ 2 ( y ) -sub-Gaussian given T ( X ) , where E [ σ 2 ( Y ) ] < .
Corollary 2.
Assume that an optimal estimator f of Y from X exists, i.e., the measurable function f satisfies E [ ( Y , f ( X ) ) ] = L ( Y | X ) . Furthermore, suppose that the sub-Gaussian condition of Theorem 3 holds with f = f (i.e., (14) holds for f = f ) . Then,
L ( Y | T ( X ) ) L ( Y | X ) 2 E [ σ 2 ( Y ) ] I ( Y ; X ) I ( Y ; T ( X ) ) .
Proof. 
The corollary immediately follows from Theorem 3, since an optimal f is ϵ -optimal for all ϵ > 0 . □
For the next corollary, let L ^ ( c ) denote the collection of all loss functions , such that
( y , f ( X ) ) g ( y ) a . s .
for some function g : Y R + with E g 2 ( Y ) c 2 .
Corollary 3.
If T is a transformation such that
I ( Y ; X ) I ( Y ; T ( X ) ) 2 δ 2 c 2 ,
then T is universally δ-lossless for the family L ^ ( c ) .
Proof. 
Since ( y , f ( X ) ) is a.s. upper bounded by g ( y ) for any L ^ ( c ) , using Hoeffding’s lemma ([23], Lemma 2.2), we have that ( y , f ( X ) ) is conditionally g 2 ( y ) 4 -sub-Gaussian given T ( X ) . Thus, from Corollary 2, for all L ^ ( c ) , we have
L ( Y | T ( X ) ) L ( Y | X ) 1 2 E [ g 2 ( Y ) ] I ( Y ; X ) I ( Y ; T ( X ) ) c 2 2 I ( Y ; X ) I ( Y ; T ( X ) ) δ
if I ( Y ; X ) I ( Y ; T ( X ) ) 2 δ 2 c 2 . □
The next corollary generalizes and gives a much simplified proof of Faragó and Györfi [28], see also Devroye, Györfi, and Lugosi ([10], Theorem. 32.3). This result states for binary classification (Y is 0-1-valued and ( y , y ) = I y y ) that if a sequence of functions T n : R d R d is such that X T n ( X ) 0 in probability as n , then L ( Y | T n ( X ) ) L ( Y | X ) as n .
Corollary 4.
Assume that a sequence of transformations T n : R d R d is such that T n ( X ) X in distribution (i.e., P T n ( X ) P X weakly) as n . Then, for any bounded loss function ℓ,
lim n L ( Y | T n ( X ) ) = L ( Y | X ) .
Note that this corollary and its proof still hold without any changes if X takes values in an arbitrary complete separable metric space. For example, in the setup of function classification, X may take values in an L p function space for 1 p < , and T n is a truncated series expansion or a quantizer. Interestingly, here the asymptotic losslessness property is guaranteed, even in the case where the sequence of transformations T n and the loss function are not matched at all.
Proof. 
If T n ( X ) X in distribution, then clearly ( Y , T n ( X ) ) ( Y , X ) in distribution. Thus, the lower semicontinuity of mutual information with respect to convergence in distribution (see, e.g., Polyanskiy and Wu ([9], Equation (4.28))) implies
lim inf n I ( Y ; T n ( X ) ) I ( Y ; X ) .
Since I ( Y ; T n ( X ) ) I ( Y ; X ) for all n, we obtain
lim n I ( Y ; T n ( X ) ) = I ( Y ; X ) .
Combined with Corollary 1 (with T replaced with T n ), this gives
0 lim n L ( Y | T n ( X ) ) L ( Y | X ) lim n 2 I ( Y ; X ) I ( Y ; T n ( X ) ) = 0 .

4. Applications

4.1. Classification

For classification, Y is the finite set { 1 , , M } and the cost is the 0 1 loss
( y , y ) = I y y .
In this setup, the risk of estimator f is the error probability P ( Y f ( X ) ) . With the notation
P y ( x ) = P ( Y = y | X = x ) ,
the optimal estimator is the Bayes decision
f ( x ) = arg max y Y P y ( x ) ,
and the minimum risk is the Bayes error probability
L ( X ) = 1 E max y Y P y ( X ) .
If L ( T ( X ) ) = 1 E max y E [ P y ( X ) T ( X ) ] stands for the Bayes error probability of the transformed observation vector T ( X ) , then (24) with = 1 yields the upper bound
L ( T ( X ) ) L ( X ) 1 2 I ( Y ; X ) I ( Y ; T ( X ) ) ;
see also ([4], Corollary 2) for a similar bound in the context of Bayesian learning.
As a special case, the feature selector X X S is lossless if
L ( X ) = L ( X S ) .
Györfi and Walk [29] studied the corresponding hypothesis testing problem. Using a k-nearest-neighbor (k-NN) estimate of the excess Bayes error probability L ( X S ) L ( X ) , they introduced a test statistic and accepted the hypothesis (28), if the test statistic is less than a threshold. Under certain mild conditions, the strong consistency of this test has been proven.

4.2. Nonparametric Regression

For the nonparametric regression problem, the cost is the squared loss
( y , y ) = ( y y ) 2 , y , y R ,
and the best statistical inference is the regression function
m ( X ) = E [ Y | X ]
(here, we assume E [ Y 2 ] < ). Then, the minimum risk is the residual variance
L ( Y | X ) = E [ ( Y m ( X ) ) 2 ] .
If L ( X ) = L ( Y | X ) and L ( T ( X ) ) = L ( Y | T ( X ) ) denote the residual variances for the observation vectors X and T ( X ) , respectively, then
L ( T ( X ) ) L ( X ) = E ( Y E [ m ( X ) T ( X ) ] ) 2 E ( Y m ( X ) ) 2 = E ( m ( X ) E [ m ( X ) T ( X ) ] ) 2 .
Note that the excess residual variance L ( T ( X ) ) L ( X ) does not depend on the distribution of the residual Y m ( X ) .
Next, we show that the conditions of Corollary 2 hold with f ( x ) = m ( x ) for the important case
Y = m ( X ) + N ,
where N is a zero-mean noise variable that is independent of X and satisfies E [ N 4 ] < , and m is bounded as | m ( x ) | K for all x. For this model, we have
( y , f ( X ) ) = ( y m ( X ) ) 2 ( | y | + | m ( X ) | ) 2 ( | y | + K ) 2 .
Thus, ( y , f ( X ) ) is a nonnegative random variable a.s. bounded by ( | y | + K ) 2 , which implies via Hoeffding’s lemma (e.g., ([23], Lemma 2.2)) that it is σ 2 ( y ) -sub-Gaussian given T ( X ) with σ 2 ( y ) = ( | y | + K ) 4 4 . We have
E [ σ 2 ( Y ) ] = E ( | Y | + K ) 4 4 E ( | N | + | m ( X ) | + K ) 4 4 E ( | N | + 2 K ) 4 4 8 E | N | 4 + ( 2 K ) 4 4 2 E [ N 4 ] + 32 K 4 < ,
thus, the conditions of Corollary 2 hold and we obtain
L ( Y | T ( X ) ) L ( Y | X ) = E [ ( Y E [ Y | T ( X ) ] ) 2 ] E [ ( Y E [ Y | X ] ) 2 ] 2 E [ N 4 ] + 32 K 4 I ( Y ; X ) I ( Y ; T ( X ) ) .
Again, the feature selection X S is called lossless, when L ( X ) = L ( X S ) holds. As a test statistic, Devroye et al. [30] introduced a 1-NN estimate of L ( X S ) L ( X ) and proved the strong consistency of the corresponding test.

4.3. Portfolio Selection

The next example is related to the negative of the log-loss or log-utility; see Algoet and Cover [31], Barron and Cover [32], Chapters 6 and 16 in Cover and Thomas [22], Györfi et al. [33].
Consider a market consisting of d a assets. The evolution of the market in time is represented by a sequence of (random) price vectors S 1 , S 2 , R + d a with
S n = ( S n ( 1 ) , , S n ( d a ) ) ,
where the jth component S n ( j ) of S n denotes the price of the jth asset in the nth trading period. Let us transform the sequence of price vectors { S n } into the sequence of return (relative price) vectors { R n } , defined as
R n = ( R n ( 1 ) , , R n ( d a ) ) ,
where
R n ( j ) = S n ( j ) S n 1 ( j ) .
Constantly rebalanced portfolio selection is a multi-period investment strategy, where at the beginning of each trading period the investor redistributes the wealth among the assets. The investor is allowed to diversify their capital at the beginning of each trading period according to a portfolio vector b = ( b ( 1 ) , b ( d a ) ) . The jth component b ( j ) of b denotes the proportion of the investor’s capital invested in asset j. Here, we assume that the portfolio vector b has nonnegative components with j = 1 d a b ( j ) = 1 . The simplex of possible portfolio vectors is denoted by Δ d a .
Let S 0 = 1 denote the investor’s initial capital. Then, at the beginning of the first trading period, S 0 b ( j ) is invested into asset j, and this results in return S 0 b ( j ) R 1 ( j ) , and therefore at the end of the first trading period the investor’s wealth becomes
S 1 = S 0 j = 1 d a b ( j ) R 1 ( j ) = b , R 1 ,
where · , · denotes the standard inner product in R d a . For the second trading period, S 1 is the new initial capital
S 2 = S 1 · b , R 2 = b , R 1 · b , R 2 .
By induction, for the trading period n, the initial capital is S n 1 , and therefore
S n = S n 1 b , R n = i = 1 n b , R i .
The asymptotic average growth rate of this portfolio selection strategy is
lim n 1 n log S n = lim n 1 n i = 1 n log b , R i
assuming a limit exists.
If the market process { R i } is memory-less, i.e., it is a sequence of i.i.d. random return vectors, then the strong law of large numbers implies that the best constantly rebalanced portfolio (BCRP) is the log-optimal portfolio:
b = arg max b Δ d a E log b , R 1 ,
while the best asymptotic average growth rate is
W = max b Δ d a E log b , R 1 .
Barron and Cover [32] extended this setup to portfolio selection with side information. Assume that X 1 , X 2 , are R d valued side information vectors, such that ( R 1 , X 1 ) , ( R 2 , X 2 ) , are i.i.d. and in each round n the portfolio vector may depend on X n . The strong law of large numbers yields
lim n 1 n log S n = lim n 1 n i = 1 n log b ( X i ) , R i = E log b ( X 1 ) , R 1 a . s .
Therefore, the log-optimal portfolio has the form
b ( X 1 ) = arg max b Δ d a E log b , R 1 X 1
and the best asymptotic average growth rate is
W ( X ) = E max b Δ d a E log b , R 1 X 1 .
Barron and Cover ([32], Thm. 2) proved that
W ( X ) W I ( R 1 ; X 1 ) .
The next theorem generalizes this result by upper bounding the loss of the best asymptotic growth rate when, instead of X, only degraded side information T ( X ) is available.
Theorem 4.
For any measurable T : R d R d ,
W ( X ) W ( T ( X ) ) I ( R 1 ; X 1 ) I ( R 1 ; T ( X 1 ) )
assuming the terms on the right hand side are finite.
Remark 4.
(i)
As in Theorem 3, the difference I ( R 1 ; X 1 ) I ( R 1 ; T ( X 1 ) ) in the upper bound is equal to I ( R 1 ; X 1 | T ( X 1 ) ) , a quantity that is always nonnegative but may be equal to ∞. In this case, we interpret the right hand side as ∞.
(ii)
There is a correspondence between this setup of portfolio selection and the setup in previous sections. In particular, Y from the previous sections is equal to R with a range R + d a and the inference is b ( X ) taking values in Δ d a . Then, the loss is log b ( X ) , R . If we assume that for all j = 1 , d a ,
| log R ( j ) | c m a x a . s . ,
then
| log b ( X ) , R | c m a x a . s .
and so Corollary 1 implies
W ( X ) W ( T ( X ) ) c m a x 2 I ( R 1 ; X 1 ) I ( R 1 ; T ( X 1 ) ) .
Note that, from the point of view of application, (30) is a mild condition. For example, for NYSE daily data c m a x 0.3 ; see Györfi et al. [34].
Proof. 
Let ( R , X ) be a generic copy of the ( R i , X i ) . Writing out explicitly the dependence of W on P R , we have
W ( X ) = W ( P R | X = x ) P X ( d x )
and from (11) we have
I ( R ; X ) = D ( P R | X = x P R ) P X ( d x ) .
Thus, the bound W ( X ) W I ( R 1 ; X 1 ) in (29) can be written as
W ( P R | X = x ) P X ( d x ) W D ( P R | X = x P R ) P X ( d x ) .
Furthermore, letting Z = T ( X ) , we have
W ( X ) W ( Z ) = W ( P R | X = x ) P X ( d x ) W ( P R | Z = z ) P Z ( d z ) .
Since R X Z is a Markov chain, P R | X = x = P R | X = x , Z = z , and we obtain
W ( X ) W ( Z ) = W ( P R | X = x , Z = z ) P X | Z = z ( d x ) W ( P R | Z = z ) P Z ( d z ) .
Applying (31) with W ( P R | X = x ) replaced with W ( P R | X = x , Z = z ) and W replaced with W ( P R | Z = z ) with z fixed, we can bound the expression in parentheses as
W ( P R | X = x , Z = z ) P X | Z = z ( d x ) W ( P R | Z = z ) D ( P R | X = x , Z = z P R | Z = z ) P X | Z = z ( d x ) ,
and therefore
W ( X ) W ( Z ) D ( P R | X = x , Z = z P R | Z = z ) P X | Z = z ( d x ) P Z ( d z ) = I ( R ; X | Z ) ,
where (32) follows from the alternative expression (12) of the conditional mutual information.
As in the proof Theorem 3, the conditional independence of R and Z = T ( X ) given X implies
I ( R ; X | Z ) = I ( R ; X ) I ( R ; T ( X ) ) ,
which completes the proof. □

4.4. Information Bottleneck

Let X and Y be random variables as in Section 2. When Y X Z , the joint distribution P Y X Z of the triple ( Y , X , Z ) is determined (for fixed P Y X ) by the conditional distribution (transition kernel) P Z | X as P Y X Z = P Y X P Z | X . The information bottleneck (IB) framework can be formulated as the study of the constrained optimization problem
maximize I ( Y ; Z ) subject to I ( X ; Z ) α
for a given α > 0 , where the maximization is over all transition kernels P Z | X .
Originally proposed by Tishby et al. [35], the solution to the IB problem is a transition kernel P Z | X , interpreted as a stochastic transformation, that “encodes” X into a “compressed” representation Z that preserves relevant information about Y through maximizing I ( Y ; Z ) , while compressing X by requiring that I ( X ; Z ) α . The intuition behind this framework is that by maximizing I ( Y ; Z ) , the representation Z will retain the predictive power of X with respect to Y, while the requirement I ( X ; Z ) α makes the representation Z concise.
Note that, in case X is discrete and has finite entropy H ( X ) , setting α = H ( X ) , or setting formally α = in the general case, the constraint I ( X ; Z ) α becomes vacuous and (assuming the alphabet of Z is sufficiently large) the resulting Z will achieve the upper bound I ( Y ; Z ) = I ( Y ; X ) , so that I ( Y ; X | Z ) = I ( Y ; X ) I ( Y ; Z ) = 0 , i.e., Y Z X . Thus, the solution to (33) can be considered as a stochastically relaxed version of a minimal sufficient statistic for X in predicting Y (see Goldfeld and Polyanskiy ([36], Section II.C) for more on this interpretation). Recent tutorials on the IB problem include Asoodeh and Calmon [37] and Zaidi et al. [38].
Theorem 3 and its corollaries can be used to motivate the IB principle from an estimation-theoretic viewpoint. Let
I ( α ) = sup P Z | X : I ( X ; Z ) α I ( Y ; Z )
be the value function for (33) and Z α a resulting optimal Z (assuming such a maximizer exists). From the remark after Theorem 3, we know that the bounds given in the theorem and in its corollaries remain valid if we replace T ( X ) with a random variable Z, such that Y X Z . Then, for example, Corollary 1 implies that
L ( Y | Z α ) L ( Y | X ) c I ( Y ; X ) I ( α )
for all such that 2 c .
Thus, the IB paradigm minimizes, under the complexity constraint I ( X ; Z ) α , an upper bound on the difference L ( Y | Z ) L ( Y | X ) that universally holds for all loss functions with 2 c . The resulting Z α will then have guaranteed performance in predicting Y with respect to all sufficiently bounded loss functions. This gives a novel operational interpretation of the IB framework that seems to have been overlooked in the literature.

4.5. Deep Learning

The IB paradigm can also serve as a learning objective in deep neural networks (DNNs). Here the Lagrangian relaxation of (33) is considered. In particular, letting X denote the input and Z θ the output of the last hidden layer of the DNN, where θ Θ R K is the collection of network parameters (weights), the objective is to maximize
I ( Y ; Z θ ) β I ( X ; Z θ )
over θ Θ for a given β > 0 . The parameter β controls the trade-off between how informative Z θ is about Y, measured by I ( Y ; Z θ ) , and how much Z θ is “compressed,” measured by I ( X ; Z θ ) . Clearly, larger values of β correspond to smaller values of I ( X ; Z θ ) and thus to more compression. Here, Z θ is either a deterministic function of X in the form of Z θ = T θ ( X ) , where T θ : R d R d represents the deterministic DNN, or it is produced by a stochastic kernel P Z | X θ , parameterized by the network parameters θ Θ . The latter is achieved by injecting independent noise into the network’s intermediate layers.
In addition to the motivation explained in the previous section, the IB framework for DNNs can be thought as a regularization method that results in improved generalization capabilities for a network trained on data using stochastic gradient-based methods, see, e.g., Tishby and Zaslavsky [39], Shwartz-Ziv and Tishby [40], Alemi et al. [41], as well as many other references in the excellent survey article Goldfeld and Polyanskiy [36], and the special issue [42] on information bottleneck and deep learning.
As in the previous section, our Theorem 1 and corollaries can serve as a (partial) justification for setting (34) as a learning objective. Assume that after training with a given β > 0 , the obtained Z θ ( β ) has (true) mutual information I ( Y ; Z θ ( β ) ) with Y (typically, this will not be the optimal solution, since maximizing (34) is not feasible and in practice only a proxy lower bound is optimized during training, see, e.g., Alemi et al. [41]). Then, by Corollary 1 the obtained network has a guaranteed predictive performance
L ( Y | Z θ ( β ) ) L ( Y | X ) + c ϵ
for all loss functions with 2 c , where
ϵ = I ( Y ; X ) I ( Y ; Z θ ( β ) ) .

5. Proof of Theorem 2

Proof Theorem 2.
(a)
The bounds given in the proof of Theorem 1 in [2] imply
L n J n , 1 + J n , 2 + J n , 3 + J n , 4 + J n , 5 ,
where
J n , 1 = A P n , B Q n , C R n P X Y Z n ( A , B , C ) P X Y Z ( A , B , C ) , J n , 2 = B Q n , C R n P Y Z ( B , C ) P Y Z n ( B , C ) , J n , 3 = A P n , C R n P X Z ( A , C ) P X Z n ( A , C ) , J n , 4 = C R n P Z n ( C ) P Z ( C ) ,
and
J n , 5 = A P n , B Q n , C R n P X Y Z ( A , B , C ) P X Z ( A , C ) P Y Z ( B , C ) P Z ( C ) .
Using large deviation inequalities from Beirlant et al. [43] and in Biau and Györfi [18], Györfi and Walk [2] proved that for all ε i > 0 , i = 1 , , 4 and δ > 0 ,
P ( L n > ε 1 + ε 2 + ε 3 + ε 4 + δ ) P ( J n , 1 > ε 1 ) + P ( J n , 2 > ε 2 ) + P ( J n , 3 > ε 3 ) + P ( J n , 4 > ε 4 ) + I J n , 5 > δ 2 m n · m n · m n e n ε 1 2 / 2 + 2 m n · m n e n ε 2 2 / 2 + 2 m n · m n e n ε 3 2 / 2 + 2 m n e n ε 4 2 / 2 + I J n , 5 > δ .
We note that the bounds on the probabilities P ( J n , i > ε i ) for i = 1 , , 4 were proven in [2] without either assuming the null hypothesis H 0 or using the condition that the partitions are nested. Under the null hypothesis, Györfi and Walk [2] claimed that
J n , 5 = 0 .
As Neykov et al. [12] observed, this was incorrect. In order to resolve the gap, we show that under lim n h n = 0 and condition (5) and under the null hypothesis, the last term in (35) is o ( 1 ) , i.e.,
I A P n , B Q n , C R n P X Y Z ( A , B , C ) P X Z ( A , C ) P Y Z ( B , C ) P Z ( C ) > δ = 0
if n is large enough. The null hypothesis implies that
P X Y Z ( A , B , C ) P X Z ( A , C ) P Y Z ( B , C ) P Z ( C ) = P ( X A , Y B , Z C ) P X Z ( A , C ) P ( Y B , Z C ) P Z ( C ) = E P ( X A , Y B Z ) I Z C E P ( Y B Z ) I Z C P X Z ( A , C ) P Z ( C ) = E P ( X A Z ) P ( Y B Z ) I Z C E P ( Y B Z ) I Z C P X Z ( A , C ) P Z ( C ) .
Thus,
A P n , B Q n , C R n P X Y Z ( A , B , C ) P X Z ( A , C ) P Y Z ( B , C ) P Z ( C ) A P n , B Q n , C R n E P ( Y B Z ) P ( X A Z ) I Z C I Z C P X Z ( A , C ) P Z ( C ) = A P n , C R n E P ( X A Z ) I Z C I Z C P X Z ( A , C ) P Z ( C ) .
Let p ( · z ) and C n ( z ) be as in Condition 1. Then,
A P n , C R n E P ( X A Z ) I Z C I Z C P X Z ( A , C ) P Z ( C ) = A P n , C R n C A p ( x z ) P X ( d x ) A [ C p ( x z ) P Z ( d z ) ] P X ( d x ) P Z ( C ) P Z ( d z ) p ( x z ) C n ( z ) p ( x z ) P Z ( d z ) P Z ( C n ( z ) ) P Z ( d z ) P X ( d x ) C h n ,
where in the last step we use condition (5). The inequalities (35) and (36) imply that
P L n > c 1 m n m n m n n + m n m n n + m n m n n + m n n + ( log n ) h n 4 e ( c 1 2 / 2 log 2 ) m n + I C h n > ( log n ) h n 4 e ( c 1 2 / 2 log 2 ) m n ,
if n e C . Since m n is proportional to 1 / h n d , condition (8) on h n implies n = 1 P ( L n t n ) < , and thus using the Borel–Cantelli lemma, after a random sample size, the test has no error with a probability of one.
(b)
This proof is a refinement of the proof of Corollary 1 in [2], in which we avoid the condition used there that the sequences of partitions P n and Q n are nested. According to the proof of Part (a) (see the remark after (35)), we obtain that
lim inf n L n lim inf n ( L n J n , 5 ) + lim inf n J n , 5 = lim inf n J n , 5 a . s .
To simplify the notation, let P X Y | z = P X Y | Z = z , P X | z = P X | Z = z , and P Y | z = P Y | Z = z . Let L be the expected total variation distance between P X Y | z and P X | z P Y | z :
L = sup F | P X Y | z ( F ) P X | z P Y | z ( F ) | P Z ( d z ) ,
where the supremum is taken over all Borel subsets F of R d × R d . It suffices to prove that using the condition lim n h n = 0 ,
lim inf n J n , 5 2 L > 0 .
One has that
A P n , B Q n , C R n P X Y Z ( A , B , C ) P X Z ( A , C ) P Y Z ( B , C ) P Z ( C ) A P n , B Q n P X Y | z ( A , B ) P X | z ( A ) P Y | z ( B ) P Z ( d z ) W n ,
where
W n A P n , B Q n C n ( z ) P X Y | z ( A , B ) P Z ( z ) P Z ( C n ( z ) ) P X Y | z ( A , B ) P Z ( d z )
+ A P n C n ( z ) P X | z ( A ) P Z ( d z ) P Z ( C n ( z ) ) P X | z ( A ) P Z ( d z )
+ B Q n C n ( z ) P Y | z ( B ) P Z ( d z ) P Z ( C n ( z ) ) P Y | z ( B ) P Z ( d z )
In [2], it was shown that the condition lim n h n = 0 implies lim n W n = 0 if the sequence of partitions { P n , Q n } n 1 is nested. In order to avoid this nestedness condition, introduce the density p ( x , y | z ) of the conditional distribution P X Y | z with respect to the distribution P X Y of ( X , Y ) as a dominating measure, and similarly let p n ( x , y | z ) be the density of the conditional distribution C n ( z ) P X Y | z ( · , · ) P Z ( d z ) / P Z ( C n ( z ) ) with respect to P X Y , i.e., p n ( x , y | z ) = C n ( z ) p ( x , y | z ) P Z ( d z ) / P Z ( C n ( z ) ) . Then,
A P n , B Q n C n ( z ) P X Y | z ( A , B ) P Z ( d z ) P Z ( C n ( z ) ) P X Y | z ( A , B ) | p n ( x , y | z ) p ( x , y | z ) | P X Y ( d x , d y ) ,
and therefore the term on the right-hand side of (37) will converge to zero, as long as
| p n ( x , y | z ) p ( x , y | z ) | P X Y ( d x , d y ) P Z ( d z ) 0 ,
which follows from lim n h n = 0 using the standard technique of the bias of partitioning regression estimate for the regression function p ( · , · | z ) ; see Theorem 4.2 in [44]. The terms in (38) and (39) can be dealt with analogously. Thus,
$$
\liminf_{n\to\infty}J_{n,5}\ \ge\ \liminf_{n\to\infty}\int\sum_{A\in\mathcal{P}_n,\,B\in\mathcal{Q}_n}\big|P_{XY|z}(A,B)-P_{X|z}(A)\,P_{Y|z}(B)\big|\,P_Z(dz).
$$
For fixed z, lim n h n = 0 implies
$$
\lim_{n\to\infty}\sum_{A\in\mathcal{P}_n,\,B\in\mathcal{Q}_n}\big|P_{XY|z}(A,B)-P_{X|z}(A)\,P_{Y|z}(B)\big|
\ =\ 2\sup_F\big|P_{XY|z}(F)-P_{X|z}\times P_{Y|z}(F)\big|,
$$
see Abou-Jaoude [45] and Csiszár [46]. Therefore, the dominated convergence theorem yields
$$
\lim_{n\to\infty}\int\sum_{A\in\mathcal{P}_n,\,B\in\mathcal{Q}_n}\big|P_{XY|z}(A,B)-P_{X|z}(A)\,P_{Y|z}(B)\big|\,P_Z(dz)
\ =\ 2\int\sup_F\big|P_{XY|z}(F)-P_{X|z}\times P_{Y|z}(F)\big|\,P_Z(dz)\ =\ 2L^*.
$$
Note that condition (5) is not used at all in the proof of Part (b).
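The fact behind the last two displays, namely that the L1 distance over a product partition never exceeds twice the total variation distance and approaches it as the cell size shrinks, can be checked numerically. The following sketch is a simplified one-dimensional analogue, with two Gaussian distributions standing in for $P_{XY|z}$ and $P_{X|z}\times P_{Y|z}$; the function names `norm_cdf` and `partition_l1`, the Gaussian choice, and the truncation to $[-10,10]$ are illustrative assumptions, not part of the proof.

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Gaussian CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def partition_l1(mu1, mu2, h, lo=-10.0, hi=10.0):
    # sum over cells [k*h, (k+1)*h) of |P(cell) - Q(cell)| for P = N(mu1,1), Q = N(mu2,1)
    total, k = 0.0, int(math.floor(lo / h))
    while k * h < hi:
        a, b = k * h, (k + 1) * h
        total += abs((norm_cdf(b, mu1) - norm_cdf(a, mu1))
                     - (norm_cdf(b, mu2) - norm_cdf(a, mu2)))
        k += 1
    return total

mu1, mu2 = 0.0, 1.0
# 2 * sup_F |P(F) - Q(F)| equals the L1 distance of the densities, here 2*(2*Phi(|mu1-mu2|/2) - 1)
two_tv = 2.0 * (2.0 * norm_cdf(abs(mu1 - mu2) / 2.0) - 1.0)
for h in (2.0, 1.0, 0.5, 0.1, 0.01):
    print(f"h = {h:5.2f}   partition L1 = {partition_l1(mu1, mu2, h):.4f}   2*TV = {two_tv:.4f}")
```

As $h$ decreases, the partition sum stays below $2\,\sup_F|P(F)-Q(F)|$ and converges to it, which is the property invoked from [45,46].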

6. Concluding Remarks

We studied the excess minimum risk in statistical inference and, under mild conditions, gave a strongly consistent procedure for testing from data whether a given transformation of an observed feature vector results in zero excess minimum risk for all loss functions. It is an open research problem whether a strong universal test exists, i.e., a test that is strongly consistent without any conditions on the transformation and on the underlying distribution. We also developed information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. The bounds were not stated in the most general form possible: the observed quantities were restricted to take values in Euclidean spaces, and we did not allow transformations that are random functions of the observation; both restrictions could be relaxed. The bounds could also be sharpened in specific cases, but in their present form they are already useful. For example, they give an additional theoretical motivation for applying the information bottleneck approach in deep learning.

Author Contributions

Conceptualization, L.G., T.L. and H.W.; Methodology, L.G. and T.L.; Validation, H.W.; Formal analysis, T.L.; Investigation, H.W.; Writing—original draft, L.G.; Writing—review & editing, L.G., T.L. and H.W. L.G., T.L. and H.W. contributed equally to the published work. All authors have read and agreed to the published version of the manuscript.

Funding

The research of László Györfi has been supported by the National Research, Development and Innovation Fund of Hungary under the 2019-1.1.1-PIACI-KFI-2019-00018 funding scheme. Tamás Linder’s research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

T. Linder would like to thank O. Ordentlich and S. Shamai for their helpful comments on an earlier version of this manuscript and for pointing out relevant literature.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Schervish, M.J. Theory of Statistics; Springer Series in Statistics; Springer: New York, NY, USA, 1995.
2. Györfi, L.; Walk, H. Strongly consistent nonparametric tests of conditional independence. Stat. Probab. Lett. 2012, 82, 1145–1150.
3. Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Adv. Neural Inf. Process. Syst. 2017, 30, 2521–2530.
4. Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. IEEE Trans. Inf. Theory 2022, 68, 7935–7955.
5. Raginsky, M.; Rakhlin, A.; Xu, A. Information-Theoretic Stability and Generalization. In Information-Theoretic Methods in Data Science; Rodrigues, M., Eldar, Y., Eds.; Cambridge University Press: Cambridge, UK, 2021; pp. 302–329.
6. Lugosi, G.; Neu, G. Generalization bounds via convex analysis. In Proceedings of the 34th Annual Conference on Learning Theory (COLT), London, UK, 2–5 July 2022; pp. 3524–3546.
7. Jose, S.T.; Simeone, O. Information-theoretic generalization bounds for meta-learning and applications. Entropy 2021, 23, 126.
8. Hafez-Kolahi, H.; Moniri, B.; Kasaei, S. Information-theoretic analysis of minimax excess risk. IEEE Trans. Inf. Theory 2023, 69, 4659–4674.
9. Polyanskiy, Y.; Wu, Y. Information Theory: From Coding to Learning; Cambridge University Press: Cambridge, UK, 2022; Forthcoming; Available online: https://people.lids.mit.edu/yp/homepage/data/itbook-export.pdf (accessed on 5 July 2023).
10. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996.
11. Cai, Z.; Li, R.; Zhang, Y. A distribution free conditional independence test with applications to causal discovery. J. Mach. Learn. Res. 2022, 23, 1–41.
12. Neykov, M.; Balakrishnan, S.; Wasserman, L. Minimax optimal conditional independence testing. Ann. Stat. 2021, 49, 2151–2177.
13. Shah, R.D.; Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Ann. Stat. 2020, 48, 1514–1538.
14. Dembo, A.; Peres, Y. A topological criterion for hypothesis testing. Ann. Stat. 1994, 22, 106–117.
15. Nobel, A.B. Hypothesis testing for families of ergodic processes. Bernoulli 2006, 12, 251–269.
16. Cover, T. On determining the irrationality of the mean of a random variable. Ann. Stat. 1973, 1, 862–871.
17. Kulkarni, S.R.; Zeitouni, O. Can one decide the type of the mean from the empirical distribution? Stat. Probab. Lett. 1991, 12, 323–327.
18. Biau, G.; Györfi, L. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Trans. Inf. Theory 2005, 51, 3965–3973.
19. Devroye, L.; Lugosi, G. Almost sure classification of densities. J. Nonparametr. Stat. 2002, 14, 675–698.
20. Gretton, A.; Györfi, L. Consistent nonparametric tests of independence. J. Mach. Learn. Res. 2010, 11, 1391–1423.
21. Morvai, G.; Weiss, B. On universal algorithms for classifying and predicting stationary processes. Probab. Surv. 2021, 18, 77–131.
22. Cover, T.; Thomas, J. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006.
23. Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013.
24. Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the information bottleneck to the privacy funnel. In Proceedings of the 2014 IEEE Information Theory Workshop (ITW), Hobart, TAS, Australia, 2–5 November 2014; pp. 501–505.
25. Ordentlich, O.; (School of Computer Science and Engineering, Hebrew University of Jerusalem, Jerusalem, Israel); Shamai, S.; (Department of Electrical Engineering, Technion, Haifa, Israel). Personal communication, July 2020.
26. Hafez-Kolahi, H.; Moniri, B.; Kasaei, S.; Baghshah, M.S. Rate-distortion analysis of minimum excess risk in Bayesian learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 3998–4007.
27. Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. arXiv 2020, arXiv:2012.14868.
28. Faragó, T.; Györfi, L. On the continuity of the error distortion function for multiple-hypotheses decisions. IEEE Trans. Inf. Theory 1975, IT-21, 458–460.
29. Györfi, L.; Walk, H. Detecting Ineffective Features for Pattern Recognition. Oberwolfach Preprint. 2017. Available online: http://publications.mfo.de/handle/mfo/1314 (accessed on 15 July 2023).
30. Devroye, L.; Györfi, L.; Lugosi, G.; Walk, H. A nearest neighbor estimate of the residual variance. Electron. J. Stat. 2018, 12, 1752–1778.
31. Algoet, P.; Cover, T.M. Asymptotic optimality and asymptotic equipartition properties of log-optimum investments. Ann. Probab. 1988, 16, 876–898.
32. Barron, A.R.; Cover, T.M. A bound on the financial value of information. IEEE Trans. Inf. Theory 1988, 34, 1097–1100.
33. Györfi, L.; Ottucsák, G.; Urbán, A. Empirical log-optimal portfolio selections: A survey. In Machine Learning for Financial Engineering; Györfi, L., Ottucsák, G., Walk, H., Eds.; Imperial College Press: London, UK, 2012; pp. 81–118.
34. Györfi, L.; Ottucsák, G.; Walk, H. The growth optimal investment strategy is secure, too. In Optimal Financial Decision Making under Uncertainty; Consigli, G., Kuhn, D., Brandimarte, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; pp. 201–223.
35. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 1999; pp. 368–377.
36. Goldfeld, Z.; Polyanskiy, Y. The information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38.
37. Asoodeh, S.; Calmon, F.P. Bottleneck problems: An information and estimation-theoretic view. Entropy 2020, 22, 1325.
38. Zaidi, A.; Aguerri, I.E.; Shamai, S. On the information bottleneck problems: Models, connections, applications and information theoretic views. Entropy 2020, 22, 151.
39. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5.
40. Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810.
41. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017; pp. 368–377.
42. Geiger, B.C.; Kubin, G. Information bottleneck: Theory and applications in deep learning, Editorial for special issue on “Information Bottleneck: Theory and Applications in Deep Learning”. Entropy 2020, 22, 1408.
43. Beirlant, J.; Devroye, L.; Györfi, L.; Vajda, I. Large deviations of divergence measures on partitions. J. Stat. Plan. Inference 2001, 93, 1–16.
44. Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002.
45. Abou-Jaoude, S. Conditions nécessaires et suffisantes de convergence L1 en probabilité de l’histogramme pour une densité. Ann. L’Institut Henri Poincaré 1976, 12, 213–231.
46. Csiszár, I. Generalized entropy and quantization problems. In Proceedings of the Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Prague, Czech Republic, 19–25 September 1971; Academia: Prague, Czech Republic, 1973; pp. 159–174.