Abstract
We study the excess minimum risk in statistical inference, defined as the difference between the minimum expected loss when estimating a random variable from an observed feature vector and the minimum expected loss when estimating the same random variable from a transformation (statistic) of the feature vector. After characterizing lossless transformations, i.e., transformations for which the excess risk is zero for all loss functions, we construct a partitioning test statistic for the hypothesis that a given transformation is lossless, and we show that for i.i.d. data the test is strongly consistent. More generally, we develop information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. Based on these bounds, we introduce the notion of a δ-lossless transformation and give sufficient conditions for a given transformation to be universally δ-lossless. Applications to classification, nonparametric regression, portfolio strategies, information bottlenecks, and deep learning are also surveyed.
1. Introduction
We consider the standard setting of statistical inference, where Y is a real random variable, having a range , which is to be estimated (predicted) from a random observation (feature) vector X, taking values in . Given a measurable predictor and measurable loss function , the loss incurred is . The minimum expected risk in predicting Y from the random vector X is
where the infimum is over all measurable f.
Suppose that the tasks of collecting data and making the prediction are separated in time or in space. For example, the separation in time happens when the data are collected first and the statistical modeling and analysis are performed much later. Separation in space can be due, for example, to collecting data at a remote location and making predictions centrally. Such situations are modeled using a transformation , so that the prediction regarding Y is made from the transformed observation , instead of X. An important example of such a transformation is quantization, in which case is a discrete random variable. Clearly, one always has . The difference is sometimes referred to in the literature as excess risk. A part of this paper is concerned with transformations for which the excess risk is zero, no matter the underlying loss function ℓ. Such transformations are universally lossless, in the sense that they can be chosen before the cost function ℓ for the underlying problem is known. More formally, we can state the following definition.
Definition 1
(lossless transformation). For a fixed joint distribution of Y and X, a (measurable) transformation is called universally lossless if for any loss function we have
An important special transformation is feature selection. Formally, for the observation (feature) vector and , consider the -dimensional vector . Typically, the dimension of is significantly smaller than d, the dimension of X. If we have
for all loss functions ℓ, then the feature selector is universally lossless. For fixed loss ℓ, the performance of any statistical inference method is sensitive to the dimension of the feature vector. Therefore, dimension reduction is crucial before choosing or constructing an inference method. If is universally lossless, then the complement feature subvector is irrelevant. It is an open research problem how to efficiently search a universally lossless with minimum size . Since, typically, the distribution of the pair X and Y is not known and must be inferred from data, any such search algorithm needs a procedure for testing for the universal losslessness property of a feature selector.
In the first part of this paper, we give a necessary and sufficient condition for a given transformation T to be universally lossless and then construct a partitioning-based statistic for testing this condition if independent and identically distributed training data are available. With the null hypothesis being that a given transformation is universally lossless, the test is shown to be strongly consistent, in the sense that it almost surely (a.s.) makes finitely many Type I and II errors.
In many situations, requiring that a transformation T be universally lossless is too demanding. The next definition relaxes this requirement.
Definition 2
(δ-lossless transformation). For a fixed joint distribution of Y and X, and , a transformation is called universally δ-lossless with respect to a class of loss functions , if we have
In the second part of this paper, we derive bounds on the excess minimum risk in terms of the mutual information difference under various assumptions about ℓ. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be δ-lossless with respect to fairly general classes of loss functions ℓ. Applications to classification, nonparametric regression, portfolio strategies, the information bottleneck method, and deep learning are also reviewed.
- Relationship with prior work
Our first result, Theorem 1, which shows that a transformation is universally lossless if and only if it is a sufficient statistic, is likely known, but we could not find it in this explicit form in the literature (a closely related result is the classical Rao–Blackwell theorem of mathematical statistics, e.g., Schervish ([1], Theorem 3.22)). Due to this result, testing with independent data whether or not a given transformation is universally lossless turns into a test for conditional independence. Our test in Theorem 2 is based on the main results in Györfi and Walk [2], but our construction is more general and we also correct an error in the proof of ([2], Theorem 1). Apart from [2], most of the results in the literature on testing for conditional independence are for real-valued random variables and/or assume certain special distribution types, typically the existence of a joint probability density function. Such assumptions exclude problems where Y is discrete and X is continuous, as is typical in classification, or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution free and its convergence properties are also (almost) distribution free. A more detailed review of related work is given in Section 2.1.
The main result in Section 3 is Theorem 3, which bounds the excess risk in terms of the square root of the mutual information difference . There is a history of such bounds, possibly starting with Xu and Raginsky [3], where the generalization error of a learning algorithm was upper bounded using a constant times the square root of the mutual information between the hypothesis and the training data (see also the references in [3,4]). This result has since been extended in various forms, mostly concentrating on providing information-theoretic bounds for the generalization capabilities of learning algorithms, instead of looking at the excess risk; see, e.g., Raginsky et al. [5], Lugosi and Neu [6], Jose and Simeone [7], and the references therein, just to mention a few of these works. The most relevant recent work relating to our bounds in Section 3 seems to be Xu and Raginsky [4], where, among other things, information-theoretic bounds were developed on the excess risk in a Bayesian learning framework; see also Hafez-Kolahi et al. [8]. The bounds in [4] are not on the excess risk ; they involve training data, but their forms are similar to ours. It appears that our Theorem 3 gives a bound that holds uniformly for a larger class of loss functions ℓ and joint distributions of Y and X; however, in [4], several other bounds are presented that are tighter and/or allow more general distributions, for specific fixed loss functions.
- Organization
This paper is organized as follows. In Section 2, we characterize universally lossless transformations and introduce a novel strongly consistent test for the property of universal losslessness. In Section 3, information-theoretic bounds on the excess minimum risk are developed and are used to characterize the -losslessness property of transformations. Section 4 surveys connections with, and applications to, specific prediction problems, as well as the information bottleneck method in deep learning. The somewhat lengthy proof of the strong consistency of the test in Theorem 2 is given in Section 5. Concluding remarks are given in Section 6.
2. Testing the Universal Losslessness Property
In this section, we first give a characterization of universally lossless transformations for a given distribution of the pair . In practice, the distribution of may not be known, but a sequence of independent and identically distributed (i.i.d.) copies of may be available. For this case, we construct a procedure to test if a given transformation is universally lossless and prove that, under mild conditions, the test is strongly consistent.
2.1. Universally Lossless Transformations
Based on Definition 1, we introduce the null hypothesis
A transformation (statistic) is called sufficient if the random variables Y, , X form a Markov chain in this order, denoted by (see, e.g., Definition 3.8 and Theorem 3.9 in Polyanskiy and Wu [9]).
For a binary valued Y, Theorems 32.5 and 32.6 from Devroye et al. [10] imply that the statistic is universally lossless if, and only if, it is sufficient. The following theorem extends this property to general Y. This result is likely known, but we could not find it in the given form.
Theorem 1.
The transformation T is universally lossless if, and only if, is a Markov chain.
Proof.
Assume first that is a Markov chain. This is equivalent to having almost surely (a.s.) for any measurable . Then we have
Since always holds, we obtain for all ℓ, so is universally lossless.
Now, assume that the Markov chain condition does not hold. Then, there exists a measurable with and with , such that
Let , where is the indicator function of event E, and define the binary valued as . Then, the Markov chain condition does not hold. For this special case, Theorems 32.5 and 32.6 in [10] show that there exists a loss function such that . Finally, letting , we have
which shows that is not universally lossless. □
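For discrete distributions, the Markov chain condition of Theorem 1 can be verified directly: T is sufficient exactly when the conditional distribution of Y given X = x is constant on every level set of T. The following minimal sketch illustrates this (the toy conditional distribution and the maps `T_merge`, `T_const` are illustrative assumptions, not taken from the text):

```python
def is_sufficient(p_y_given_x, T):
    """Check the Markov chain Y -- T(X) -- X for a discrete X: the statistic
    T is sufficient iff x -> P(Y = . | X = x) is constant on every level set
    {x : T(x) = t}."""
    by_level = {}
    for x, cond in p_y_given_x.items():
        t = T(x)
        if t not in by_level:
            by_level[t] = cond
        elif any(abs(by_level[t].get(y, 0.0) - cond.get(y, 0.0)) > 1e-12
                 for y in set(by_level[t]) | set(cond)):
            return False
    return True

# Toy example (an assumption): P(Y | X = x) depends only on whether x > 0,
# so merging x = 1 and x = 2 is lossless, while merging everything is not.
p = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}, 2: {0: 0.3, 1: 0.7}}
T_merge = lambda x: min(x, 1)
T_const = lambda x: 0
```

Here `is_sufficient(p, T_merge)` holds, while `is_sufficient(p, T_const)` fails, matching the dichotomy of Theorem 1 for this toy case.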
2.2. A Strongly Consistent Test
Theorem 1 implies an equivalent form of the losslessness null hypothesis defined by (1)
or equivalently, holds if and only if X and Y are conditionally independent given :
for arbitrary Borel sets . Furthermore, we consider the general case where the alternative hypothesis is the complement of : .
Now, assume that the joint distribution of is not known but instead a sample of independent and identically distributed (i.i.d.) random vectors having a common distribution of is given, where and . The goal is to test the hypothesis of conditional independence based on these data. In fact, our goal is to provide a strongly consistent test; i.e., a test that, with a probability of one, only makes finitely many Type I and II errors.
For testing conditional independence, most of the results in the literature used real-valued . Based on kernel density estimation, Cai et al. [11] introduced a test statistic and calculated its limit distribution under the null hypothesis. In Neykov et al. [12], a gap was introduced between the null and alternative hypotheses. This gap was characterized by the total variation distance, which decreased with increasing n. Under certain smoothness conditions, minimax bounds were derived. According to Shah and Peters [13], a regularity condition such as our Lipschitz condition (5) below cannot be omitted if a test for conditional independence is to be consistent. This is a consequence of their no-free-lunch theorem, which states that, under general conditions, if the bound on the error probability under the null hypothesis is non-asymptotic, then under the alternative hypothesis the rate of convergence of the error probability can be arbitrarily slow, a well-known phenomenon in nonparametric statistics. We note that these cited results, and indeed most of the results in the literature on testing for conditional independence, were for real-valued random variables and/or assumed certain special distribution types, typically the existence of a joint probability density function or that both X and Y are discrete, as in [12]. As we remarked earlier, such assumptions exclude problems where Y is discrete and X is continuous (typical in classification) or problems where the observation X is concentrated on a lower dimensional subspace or manifold. In contrast, our test construction is completely distribution-free and its convergence properties are almost distribution-free (we do assume a mild Lipschitz-type condition; see the upcoming Condition 1).
In our hypothesis testing setup, the alternative hypothesis, , is the complement to the null hypothesis, ; therefore, there is no separation gap between the hypotheses. Dembo and Peres [14] and Nobel [15] characterized hypothesis pairs that admitted strongly consistent tests; i.e., tests that with a probability of one only make finitely many Type I and II errors. This property is called discernibility. As an illustration of the intricate nature of the discernibility concept, Dembo and Peres [14] demonstrated an exotic example, where the null hypothesis is that the mean of a random variable is rational, while the alternative hypothesis is that this mean minus is rational. (See also Cover [16] and Kulkarni and Zeitouni [17].) The discernibility property shows up in Biau and Györfi [18] (testing homogeneity), Devroye and Lugosi [19] (classification of densities), Gretton and Györfi [20] (testing independence), Morvai and Weiss [21] and Nobel [15] (classification of stationary processes), among others.
In the remainder of this section, under mild conditions for the distribution of , we study discernibility in the context of lossless transformations for statistical inference with general risk. We will make strong use of the multivariate-partitioning-based test of Györfi and Walk [2].
Let denote the joint distribution of and similarly for any marginal distribution of ; e.g., denotes the distribution of the pair . As in Györfi and Walk [2], introduce the following empirical distributions:
and
for Borel sets , , and .
For the sake of simplicity, assume that X, Y, and are bounded. Otherwise, we apply a componentwise, one-to-one scaling in the interval . Obviously, the losslessness null hypothesis is invariant under such a scaling. Let
be the finite cubic partitions of the ranges of X, Y, and Z, with all the cubes having common side lengths (thus, is proportional to ). As in [2], we define the test statistic
Our test rejects if
and accepts it if , where the threshold is set to
where the constant satisfies
In this setup, the distribution of is arbitrary; its components can be discrete, absolutely continuous, a mixture of the two, or even singularly continuous. It is important to note that to construct this test, there is no need to know the type of distribution.
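The precise form of the statistic and of the data-dependent threshold follows Györfi and Walk [2]. Purely to illustrate the mechanics, the following sketch computes an L1-type partition statistic that compares the empirical joint cell masses with the product predicted by conditional independence (the function name, the cell width h, and the two toy samples are assumptions; the rejection threshold of the actual test is not reproduced here):

```python
from collections import Counter

def partition_stat(sample, h):
    """L1-type partition statistic for testing conditional independence of
    X and Y given Z: sum over cells A x B x C of
    |nu_n(AxBxC) - nu_n(AxC) * nu_n(BxC) / mu_n(C)|."""
    cell = lambda v: int(v / h)  # cubic cells of side h on [0, 1)
    n = len(sample)
    abc = Counter((cell(x), cell(y), cell(z)) for x, y, z in sample)
    ac = Counter((cell(x), cell(z)) for x, y, z in sample)
    bc = Counter((cell(y), cell(z)) for x, y, z in sample)
    c = Counter(cell(z) for x, y, z in sample)
    stat = sum(abs(cnt / n - ac[(a, cz)] * bc[(b, cz)] / (n * c[cz]))
               for (a, b, cz), cnt in abc.items())
    # cells with zero joint count still contribute their product term
    for (a, cz1), na in ac.items():
        for (b, cz2), nb in bc.items():
            if cz1 == cz2 and (a, b, cz1) not in abc:
                stat += na * nb / (n * c[cz1])
    return stat

# Within each Z-cell the (X, Y) points of the first sample factorize exactly,
# so the statistic is zero; the second sample couples X and Y inside each cell.
indep = [(x, y, z) for z in (0.25, 0.75) for x in (0.2, 0.8) for y in (0.2, 0.8)]
dep = [(x, x, z) for z in (0.25, 0.75) for x in (0.2, 0.8)]
```

On these samples, `partition_stat(indep, 0.5)` is exactly zero while `partition_stat(dep, 0.5)` is strictly positive, mirroring the behavior of the test under the null and alternative hypotheses.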
We assume that the joint distribution of X, Y, and satisfies the following assumption.
Condition 1.
Let be the density of the conditional distribution with respect to the distribution as a dominating measure and introduce the notation
Assume that for some , satisfies the condition
for all n.
We note that the ordinary Lipschitz condition
implies (5). This latter condition is equivalent to
where denotes the total variation distance between distributions P and Q. In Neykov, Balakrishnan, and Wasserman [12], condition (6) is called the Null TV Lipschitz condition.
The next theorem is an adaptation and extension of the results in Györfi and Walk [2] to this particular problem of lossless transformation. In [2], it was assumed that the sequence of partitions is nested, while we make no such assumption. The proof, in which an error made in [2] is also corrected, is relegated to Section 5.
Theorem 2.
Suppose that X, Y, and are bounded and Condition 1 holds for all n. If the sequence satisfies
and
then we have the following:
- (a)
- Under the losslessness null hypothesis , we have for all , and therefore, because by (8) and (9), after a random sample size, the test produces no error with a probability of one.
- (b)
- Under the alternative hypothesis , thus, with a probability of one, after a random sample size, the test produces no error.
Remark 1.
- (i)
- The choice with satisfies both conditions (7) and (8).
- (ii)
- Note that using (4), is of order . Since we have
this means that is of order .
An important special transformation is given by the feature selection defined in the Introduction. Theorem 2 demonstrates the possibility of universally lossless dimension reduction for any multivariate feature vector. Note that in the setup of feature selection, the partition can be the nested version of and so the calculation of the test statistic is easier.
3. Universally δ-Lossless Transformations
Here, we develop bounds on the excess minimum risk, in terms of mutual information under various assumptions about the loss function. With the aid of these bounds, we give information-theoretic sufficient conditions for a transformation T to be universally δ-lossless with respect to fairly general classes of loss functions ℓ.
3.1. Preliminaries on Mutual Information
Let denote the joint distribution of the pair and let denote the product of the marginal distributions of X and Y, respectively. The mutual information between X and Y, denoted by , is defined as
where
is the Kullback–Leibler (KL) divergence between probability distributions P and Q (here, means that P is absolutely continuous with respect to Q with the Radon–Nikodym derivative ). Thus, is always nonnegative and if and only if X and Y are independent (note that is possible). In this definition and throughout the paper, log denotes the natural logarithm.
For random variables U and V (both taking values in finite-dimensional Euclidean spaces), let denote the conditional distribution of U, given V. Furthermore, let denote the stochastic kernel (regular conditional probability) induced by . Thus, in particular, for each measurable set A.
Given another random variable Z, the conditional mutual information is defined as
The integral above can also be denoted by and is called the conditional KL divergence. One can define
so that
From this definition it is clear that if and only if X and Y are conditionally independent given Z, i.e., if and only if (or equivalently, if and only if ).
Another way of expressing is
One can see that in a similar way to can be expressed as
Properties of mutual information and conditional mutual information, their connections to the KL divergence, and identities involving these information measures are detailed in, e.g., Cover and Thomas ([22], Chapter 2) and Polyanskiy and Wu ([9], Chapter 3).
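For discrete distributions, the information quantities of this subsection can be computed directly from their defining formulas. The sketch below (helper names and the toy Markov-chain example are assumptions) also checks numerically that I(X;Y|Z) vanishes for a Markov chain and that, for Z = T(X), the difference I(X;Y) − I(Z;Y) vanishes with it:

```python
from math import log

def mutual_information(pxy):
    """I(X;Y) = D(P_XY || P_X x P_Y) in nats, for a joint pmf {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

def conditional_mi(pxyz):
    """I(X;Y|Z) = sum_z P(Z=z) I(X;Y | Z=z); zero iff X and Y are
    conditionally independent given Z."""
    pz = {}
    for (x, y, z), p in pxyz.items():
        pz[z] = pz.get(z, 0.0) + p
    total = 0.0
    for z0, w in pz.items():
        slice_xy = {(x, y): p / w for (x, y, z), p in pxyz.items() if z == z0}
        total += w * mutual_information(slice_xy)
    return total

# Toy Markov chain (an assumption): X uniform on {0,1,2}, Z = min(X, 1),
# and Y depends on X only through Z, so I(X;Y|Z) = 0 and I(X;Y) = I(Z;Y).
pxyz = {}
for x in (0, 1, 2):
    z = min(x, 1)
    q = 0.1 if z == 0 else 0.7      # P(Y=1 | Z=z)
    pxyz[(x, 1, z)] = q / 3
    pxyz[(x, 0, z)] = (1 - q) / 3

pxy, pzy = {}, {}
for (x, y, z), p in pxyz.items():
    pxy[(x, y)] = pxy.get((x, y), 0.0) + p
    pzy[(z, y)] = pzy.get((z, y), 0.0) + p
```

For this distribution, `conditional_mi(pxyz)` is (numerically) zero and `mutual_information(pxy)` equals `mutual_information(pzy)`, illustrating the equivalence stated after the definition of conditional mutual information.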
3.2. Mutual Information Bounds and δ-Lossless Transformations
A real random variable U with finite expectation is said to be -sub-Gaussian for some if
Furthermore, we say that U is conditionally -sub-Gaussian given another random variable V if we have a.s.
The following result gives a quantitative upper bound on the excess minimum risk in terms of the mutual information difference under certain, not too restrictive, conditions. Note that always holds.
Given ϵ > 0, we call an estimator ϵ-optimal if .
Theorem 3.
Let be a measurable transformation and assume that for any , there exists an ϵ-optimal estimator of Y from X, such that is conditionally -sub-Gaussian given for every , i.e.,
for all and , where satisfies . Then, one has
Remark 2.
- (i)
- In case , we interpret the right hand side of (15) as ∞. With this interpretation, the bound always holds.
- (ii)
- We show in Section 4.2 that the sub-Gaussian condition (14) holds for the regression problem with squared error if , where N is independent noise having a zero mean and finite fourth moment , and the regression function is bounded. In particular, the bound in the theorem holds if N is normal with zero mean and m is bounded. We note that Theorem 6 and Corollary 3 in Xu and Raginsky [4] give bounds similar to (15), in the somewhat different context of Bayesian learning. However, the conditions there exclude, e.g., regression models in the form if for some .
- (iii)
- Although hidden in the notation, depends on the loss function ℓ. Thus, the upper bound (15) is the product of two terms, the second of which,is independent of the loss function.
- (iv)
- The bound in the theorem is not tight in general. In Section 4.3, an example is given in the context of portfolio selection, where the excess risk can be upper bounded by the difference .
- (v)
- The proof of Theorem 3 and those of its corollaries go through virtually without change if we replace with any -valued random variable Z, such that . Under the conditions of the theorem, we then have
In fact, Theorem 3 and its corollaries hold for general random variables Y, X, and Z taking values in complete and separable metric (Polish) spaces , , and , respectively, if .
The proof of Theorem 3 is based on a slight generalization of Raginsky et al. ([5], Lemma 10.2), which we state next. In the lemma, U and V are arbitrary abstract random variables defined on the same probability space and taking values in spaces and , respectively; and are independent copies of U and V (so that ); and is a measurable function.
Lemma 1.
Assume that is -sub-Gaussian for all , where . Then,
Proof.
We essentially copy the proof of ([5], Lemma 10.2), where it was assumed that does not depend on u. With this restriction, the sub-Gaussian condition (14) in Theorem 3 would have to hold with uniformly over y. This condition would exclude regression models with independent sub-Gaussian noise and, a fortiori, models with independent noise that do not possess finite absolute moments of all orders, while our Theorem 3 can also be applied in such cases (see Section 4.2).
We make use of the Donsker–Varadhan variational representation of the relative entropy ([23] Corollary 4.15), which states that
where the supremum is over all measurable , such that . Applying this with , , and , we obtain
where the second inequality follows from the assumption that is -sub-Gaussian. Maximizing the right-hand side of (16) over gives, after rearrangement,
Since and are independent, , and we obtain
where (18) follows from Jensen’s inequality, (19) follows from (17), in (20) we used the Cauchy–Schwarz inequality, and the last equality follows from (11). □
Proof of Theorem 3.
Let and be random variables, such that , , , and and are conditionally independent given . Thus, the joint distribution of the triple is .
We apply Lemma 1 with , , and . Note that, using the conditions of the theorem, we can choose an ϵ-optimal , such that for every y, is conditionally -sub-Gaussian given . Consider and as regular (unconditional) expectations taken with respect to and , respectively, and consider as regular mutual information between random variables with the distribution . Since and are conditionally independent given , Lemma 1 yields
Recalling that and have the same distribution, and applying Jensen’s inequality and the Cauchy–Schwarz inequality as in (18) and (20), we obtain
On the one hand, we have
where the first equality follows from Theorem 1 with the conditional independence of and given , and the second follows, since and have the same distribution by construction. On the other hand, . Thus, (22) and (23) imply
which proves the upper bound in (15), since is arbitrary. By expanding in two different ways using the chain rule for mutual information (e.g., Cover and Thomas ([22], Thm. 2.5.2)), and using the conditional independence of Y and given X, one obtains , which shows the equality in (15). □
We state two corollaries for special cases. In the first, we assume that ℓ is uniformly bounded, i.e., . For any , let denote the collection of all loss functions ℓ with . Recall the notion of a universally δ-lossless transformation from Definition 2.
Corollary 1.
Suppose the loss function ℓ is bounded. Then, for any measurable , we have
Therefore, whenever
the transformation T is universally δ-lossless for the family , i.e., for all ℓ with .
Remark 3.
- (i)
- The bound of the theorem can be used to give an estimation-theoretic motivation of the information bottleneck (IB) problem; see Section 4.4.
- (ii)
- Let . For bounded ℓ, the inequality
was proven in Makhdoumi et al. ([24], Theorem 1) for discrete alphabets, to solve the so-called privacy funnel problem. This inequality follows from (15) by setting to be constant there.
- (iii)
- A simple self-contained proof of (24) (see below) was provided by Or Ordentlich and communicated to the second author by Shlomo Shamai [25], in response to an early version of this manuscript. The bound in (24) seems to have first appeared in published form in Hafez-Kolahi et al. ([26], Lemma 1), where the proof was attributed to Xu and Raginsky [27].
Proof of Corollary 1.
If ℓ is uniformly bounded, then for any one has for all y and x. Then Hoeffding’s lemma (e.g., Boucheron et al. ([23], Lemma 2.2)) implies that for all y, is conditionally -sub-Gaussian with given . Since an ϵ-optimal estimator exists for any , and is conditionally -sub-Gaussian given by the preceding argument, (24) follows from Theorem 3. The second statement follows directly from (24) and the fact that for all .
The following alternative argument by Or Ordentlich [25] is based on Pinsker’s inequality for the total variation distance in terms of the KL divergence (see, e.g., ([9], Theorem 7.9)). For bounded ℓ, this gives a direct proof of an analogue of the key inequality (22) in the proof of Theorem 3. This argument avoids Lemma 1 and the machinery introduced by the sub-Gaussian assumption.
Using the same notation as in the proof of Theorem 3 and letting and , we have
The rest of the proof proceeds exactly as in Theorem 3. □
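Corollary 1 can be sanity-checked numerically. For the 0–1 loss (B = 1), Hoeffding's lemma gives σ = 1/2, and the sub-Gaussian argument yields the bound √((I(X;Y) − I(T(X);Y))/2) on the excess Bayes risk. The sketch below evaluates both sides for an assumed toy distribution (the posterior values and the merging map T are illustrative, not from the text):

```python
from math import log, sqrt

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * log(p) - (1 - p) * log(1 - p)

# Toy model (an assumption): X uniform on {0,1,2}, P(Y=1|X=x) given by post,
# and T merges x = 1 and x = 2 into a single point.
post = [0.1, 0.4, 0.9]
post_z1 = (post[1] + post[2]) / 2            # P(Y=1 | T(X) = 1)

# Bayes risks under the 0-1 loss (so B = 1).
risk_x = sum(min(q, 1 - q) for q in post) / 3
risk_z = (min(post[0], 1 - post[0]) + 2 * min(post_z1, 1 - post_z1)) / 3
excess = risk_z - risk_x

# I(X;Y) - I(T(X);Y) = H(Y|T(X)) - H(Y|X), since H(Y) is common to both.
delta_i = (h2(post[0]) + 2 * h2(post_z1)) / 3 - sum(h2(q) for q in post) / 3

bound = sqrt(delta_i / 2)                    # B * sqrt(delta_i / 2) with B = 1
```

For this distribution the excess Bayes risk is about 0.067, comfortably below the information-theoretic bound of about 0.22; as Remark 2(iv) notes, the bound is not tight in general.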
In the second corollary, we do not require that ℓ be bounded but assume that an optimal estimator from X to Y exists, such that is conditionally -sub-Gaussian given , where .
Corollary 2.
Assume that an optimal estimator of Y from X exists, i.e., the measurable function satisfies . Furthermore, suppose that the sub-Gaussian condition of Theorem 3 holds with (i.e., (14) holds for ). Then,
Proof.
The corollary immediately follows from Theorem 3, since an optimal is ϵ-optimal for all .
For the next corollary, let denote the collection of all loss functions ℓ, such that
for some function with .
Corollary 3.
If T is a transformation such that
then T is universally δ-lossless for the family .
Proof.
Since is a.s. upper bounded by for any , using Hoeffding’s lemma ([23], Lemma 2.2), we have that is conditionally -sub-Gaussian given . Thus, from Corollary 2, for all , we have
if . □
The next corollary generalizes, and gives a much simplified proof of, a result of Faragó and Györfi [28]; see also Devroye, Györfi, and Lugosi ([10], Theorem 32.3). This result states for binary classification (Y is 0-1-valued and ) that if a sequence of functions is such that in probability as , then as .
Corollary 4.
Assume that a sequence of transformations is such that in distribution (i.e., weakly) as . Then, for any bounded loss function ℓ,
Note that this corollary and its proof still hold without any changes if X takes values in an arbitrary complete separable metric space. For example, in the setup of function classification, X may take values in an function space for , and is a truncated series expansion or a quantizer. Interestingly, here the asymptotic losslessness property is guaranteed, even in the case where the sequence of transformations and the loss function ℓ are not matched at all.
Proof.
If in distribution, then clearly in distribution. Thus, the lower semicontinuity of mutual information with respect to convergence in distribution (see, e.g., Polyanskiy and Wu ([9], Equation (4.28))) implies
Since for all n, we obtain
Combined with Corollary 1 (with T replaced with ), this gives
□
4. Applications
4.1. Classification
For classification, is the finite set and the cost is the loss
In this setup, the risk of estimator f is the error probability . With the notation
the optimal estimator is the Bayes decision
and the minimum risk is the Bayes error probability
If stands for the Bayes error probability of the transformed observation vector , then (24) with yields the upper bound
see also ([4], Corollary 2) for a similar bound in the context of Bayesian learning.
As a special case, the feature selector is lossless if
Györfi and Walk [29] studied the corresponding hypothesis testing problem. Using a k-nearest-neighbor (k-NN) estimate of the excess Bayes error probability , they introduced a test statistic and accepted hypothesis (28) if the test statistic was less than a threshold. Under certain mild conditions, the strong consistency of this test was proven.
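For a discrete toy distribution, losslessness of a feature selector can be checked exactly by comparing Bayes error probabilities, as in condition (28). A minimal sketch (the joint distribution, in which Y depends only on the first feature, is an assumption for illustration):

```python
def bayes_error(joint):
    """Bayes error probability 1 - sum_x max_y p(x, y) for a discrete joint
    pmf {(x, y): prob}; x may be a tuple of features."""
    by_x = {}
    for (x, y), p in joint.items():
        d = by_x.setdefault(x, {})
        d[y] = d.get(y, 0.0) + p
    return 1.0 - sum(max(d.values()) for d in by_x.values())

def select_features(joint, select):
    """Joint pmf of (T(X), Y) for a feature selector T."""
    out = {}
    for (x, y), p in joint.items():
        k = (select(x), y)
        out[k] = out.get(k, 0.0) + p
    return out

# Toy joint distribution (an assumption): X = (X1, X2) uniform on {0,1}^2
# and P(Y=1 | X) = 0.8 if X1 = 1, else 0.2 -- so Y ignores X2 entirely.
joint = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        p1 = 0.8 if x1 == 1 else 0.2
        joint[((x1, x2), 1)] = 0.25 * p1
        joint[((x1, x2), 0)] = 0.25 * (1 - p1)

err_full = bayes_error(joint)
err_x1 = bayes_error(select_features(joint, lambda x: x[0]))  # lossless
err_x2 = bayes_error(select_features(joint, lambda x: x[1]))  # lossy
```

Here keeping only X1 leaves the Bayes error unchanged (0.2), so that selector satisfies (28), while keeping only X2 degrades it to 0.5.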
4.2. Nonparametric Regression
For the nonparametric regression problem, the cost is the squared loss
and the best statistical inference is the regression function
(here, we assume ). Then, the minimum risk is the residual variance
If and denote the residual variances for the observation vectors X and , respectively, then
Note that the excess residual variance does not depend on the distribution of the residual .
Next, we show that the conditions of Corollary 2 hold with for the important case
where N is a zero-mean noise variable that is independent of X and satisfies , and m is bounded as for all x. For this model, we have
Thus, is a nonnegative random variable a.s. bounded by , which implies via Hoeffding’s lemma (e.g., ([23], Lemma 2.2)) that it is -sub-Gaussian given with . We have
thus, the conditions of Corollary 2 hold and we obtain
Again, the feature selection is called lossless, when holds. As a test statistic, Devroye et al. [30] introduced a 1-NN estimate of and proved the strong consistency of the corresponding test.
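The observation that the excess residual variance does not involve the noise distribution can be illustrated exactly for a discrete toy model, since for Y = m(X) + N with independent zero-mean N the excess equals E[(m(X) − E[m(X)|T(X)])²] (the regression function m, the quantizer T, and the uniform X below are assumptions):

```python
# Y = m(X) + N with N independent of X: L*(X) = Var(N), and the excess
# residual variance L*(T(X)) - L*(X) = E[(m(X) - E[m(X)|T(X)])^2],
# i.e., the noise term cancels out of the difference.
xs = [0, 1, 2, 3]                  # X uniform on {0, 1, 2, 3} (toy example)
px = 0.25
m = lambda x: float(x)             # regression function (assumed)
T = lambda x: x // 2               # 2-level quantizer (assumed)

# Conditional means E[m(X) | T(X) = t]
cond_mean = {}
for t in set(T(x) for x in xs):
    group = [m(x) for x in xs if T(x) == t]
    cond_mean[t] = sum(group) / len(group)

excess = sum(px * (m(x) - cond_mean[T(x)]) ** 2 for x in xs)
```

For this quantizer the excess residual variance is exactly 0.25, regardless of the noise variance added on top of m(X).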
4.3. Portfolio Selection
The next example is related to the negative of the log-loss or log-utility; see Algoet and Cover [31], Barron and Cover [32], Chapters 6 and 16 in Cover and Thomas [22], Györfi et al. [33].
Consider a market consisting of assets. The evolution of the market in time is represented by a sequence of (random) price vectors with
where the jth component of denotes the price of the jth asset in the nth trading period. Let us transform the sequence of price vectors into the sequence of return (relative price) vectors , defined as
where
Constantly rebalanced portfolio selection is a multi-period investment strategy, where at the beginning of each trading period the investor redistributes the wealth among the assets. The investor is allowed to diversify their capital at the beginning of each trading period according to a portfolio vector . The jth component of b denotes the proportion of the investor’s capital invested in asset j. Here, we assume that the portfolio vector b has nonnegative components with . The simplex of possible portfolio vectors is denoted by .
Let denote the investor’s initial capital. Then, at the beginning of the first trading period, is invested into asset j, and this results in return , and therefore at the end of the first trading period the investor’s wealth becomes
where denotes the standard inner product in . For the second trading period, is the new initial capital
By induction, for the trading period n, the initial capital is , and therefore
The asymptotic average growth rate of this portfolio selection strategy is
assuming a limit exists.
If the market process is memoryless, i.e., it is a sequence of i.i.d. random return vectors, then the strong law of large numbers implies that the best constantly rebalanced portfolio (BCRP) is the log-optimal portfolio:
while the best asymptotic average growth rate is
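For a small i.i.d. market, the log-optimal constantly rebalanced portfolio maximizing E[log⟨b, X⟩] over the simplex can be found by direct search; a two-asset sketch (the toy return distribution is an assumption for illustration):

```python
from math import log

# Toy i.i.d. two-asset market (an assumption): the return vector equals
# (2.0, 0.5) or (0.5, 2.0), each with probability 1/2.
outcomes = [((2.0, 0.5), 0.5), ((0.5, 2.0), 0.5)]

def growth_rate(b):
    """Asymptotic growth rate W(b) = E[log <b, X>] of the constantly
    rebalanced portfolio (b, 1 - b)."""
    return sum(p * log(b * r[0] + (1 - b) * r[1]) for r, p in outcomes)

# Log-optimal (best constantly rebalanced) portfolio via grid search on [0, 1].
grid = [i / 1000 for i in range(1001)]
b_star = max(grid, key=growth_rate)
w_star = growth_rate(b_star)
```

By symmetry the optimum is the evenly split portfolio b = 1/2 with growth rate log(1.25) per period, whereas holding either single asset (b = 0 or b = 1) gives zero asymptotic growth.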
Barron and Cover [32] extended this setup to portfolio selection with side information. Assume that are valued side information vectors, such that are i.i.d. and in each round n the portfolio vector may depend on . The strong law of large numbers yields
Therefore, the log-optimal portfolio has the form
and the best asymptotic average growth rate is
Barron and Cover ([32], Thm. 2) proved that
The next theorem generalizes this result by upper bounding the loss in the best asymptotic growth rate when, instead of X, only degraded side information is available.
Theorem 4.
For any measurable ,
assuming the terms on the right hand side are finite.
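The Barron–Cover-type bound that Theorem 4 generalizes can be checked numerically in a small discrete example. The sketch below uses a hypothetical market (invented for illustration) with binary side information X: it computes the growth-rate gain from observing X and verifies that the gain is at most I(X; R).

```python
import numpy as np

# Hypothetical discrete market: a risk-free asset (return 1) and a risky
# asset returning 2 or 1/2.  Binary side information X is uniform on
# {0, 1}, with P(risky return = 2 | X = 0) = 0.7 and
# P(risky return = 2 | X = 1) = 0.3, so the marginal is P(2) = 0.5.
p0, p1 = 0.7, 0.3
grid = np.linspace(0.0, 1.0, 100_001)  # weight w on the risky asset

def growth(w, p):
    """Expected log return E[log((1 - w) + w * R)] when P(R = 2) = p."""
    return p * np.log(1 + w) + (1 - p) * np.log(1 - w / 2)

# Best growth rate without side information (marginal P(R = 2) = 0.5).
W = growth(grid, 0.5).max()

# Best growth rate with side information: optimize separately for each X.
W_side = 0.5 * growth(grid, p0).max() + 0.5 * growth(grid, p1).max()

def H(p):  # binary entropy in nats
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# I(X; R) = H(R) - H(R | X), in nats.
I_XR = H(0.5) - 0.5 * (H(p0) + H(p1))

print(f"gain from side information: {W_side - W:.4f} nats")  # ~0.0797
print(f"I(X; R)                   : {I_XR:.4f} nats")        # ~0.0823
```

The gain (about 0.0797 nats per period) indeed stays below I(X; R) ≈ 0.0823 nats, and in this particular example the bound is nearly tight.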
Remark 4.
- (i)
- As in Theorem 3, the difference in the upper bound is equal to , a quantity that is always nonnegative but may be equal to ∞. In this case, we interpret the right hand side as ∞.
- (ii)
- There is a correspondence between this setup of portfolio selection and the setup in the previous sections. In particular, Y from the previous sections is equal to R with a range , and the inference is taking values in . Then, the loss is . If we assume that for all , then , and so Corollary 1 implies . Note that, from the point of view of application, (30) is a mild condition. For example, for NYSE daily data ; see Györfi et al. [34].
Proof.
Let be a generic copy of the . Writing out explicitly the dependence of on , we have
and from (11) we have
Thus, the bound in (29) can be written as
Furthermore, letting , we have
Since is a Markov chain, , and we obtain
Applying (31) with replaced with and replaced with with z fixed, we can bound the expression in parentheses as
and therefore
where (32) follows from the alternative expression (12) of the conditional mutual information.
As in the proof of Theorem 3, the conditional independence of R and given X implies
which completes the proof. □
4.4. Information Bottleneck
Let X and Y be random variables as in Section 2. When , the joint distribution of the triple is determined (for fixed ) by the conditional distribution (transition kernel) as . The information bottleneck (IB) framework can be formulated as the study of the constrained optimization problem
for a given , where the maximization is over all transition kernels .
The IB framework was originally proposed by Tishby et al. [35]. Its solution is a transition kernel , interpreted as a stochastic transformation, that “encodes” X into a “compressed” representation Z preserving the relevant information about Y through maximizing , while compressing X by requiring that . The intuition behind this framework is that maximizing makes the representation Z retain the predictive power of X with respect to Y, while the requirement keeps Z concise.
Note that, in case X is discrete and has finite entropy , setting , or setting formally in the general case, the constraint becomes vacuous and (assuming the alphabet of Z is sufficiently large) the resulting Z will achieve the upper bound , so that , i.e., . Thus, the solution to (33) can be considered as a stochastically relaxed version of a minimal sufficient statistic for X in predicting Y (see Goldfeld and Polyanskiy ([36], Section II.C) for more on this interpretation). Recent tutorials on the IB problem include Asoodeh and Calmon [37] and Zaidi et al. [38].
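For small finite alphabets, the Lagrangian form of the IB problem (33) can be solved by the self-consistent alternating iterations of Tishby et al. [35]. A minimal sketch follows; the joint distribution and alphabet sizes are arbitrary choices for illustration, and note that the trade-off parameter beta here multiplies the relevance term, so larger beta means less compression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint distribution p(x, y) on small finite alphabets.
nx, ny, nz = 4, 3, 4
p_xy = rng.random((nx, ny))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

def mutual_information(p_joint):
    """Mutual information (in nats) between the coordinates of a joint pmf."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log((p_joint / (pa * pb))[mask])))

def ib_iterate(beta, iters=200):
    """Self-consistent IB iterations for the stochastic encoder p(z|x)."""
    p_z_given_x = rng.random((nx, nz))
    p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_z = p_x @ p_z_given_x                          # marginal of Z
        p_y_given_z = (p_z_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_z /= p_z[:, None]
        # D(p(y|x) || p(y|z)) for every pair (x, z)
        kl = np.sum(p_y_given_x[:, None, :] *
                    np.log(p_y_given_x[:, None, :] / p_y_given_z[None, :, :]),
                    axis=2)
        p_z_given_x = p_z[None, :] * np.exp(-beta * kl) + 1e-12  # keep > 0
        p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True)
    p_xz = p_z_given_x * p_x[:, None]
    p_zy = p_xz.T @ p_y_given_x   # joint of (Z, Y) via the chain Y - X - Z
    return mutual_information(p_xz), mutual_information(p_zy)

ixy = mutual_information(p_xy)
results = {beta: ib_iterate(beta) for beta in (0.5, 2.0, 20.0)}
for beta, (i_xz, i_zy) in results.items():
    print(f"beta={beta:5.1f}:  I(X;Z)={i_xz:.4f}  I(Z;Y)={i_zy:.4f}  "
          f"(I(X;Y)={ixy:.4f})")
```

Sweeping beta traces out the IB trade-off: at small beta the encoder collapses toward I(X;Z) ≈ 0, while at large beta it retains nearly all of I(X;Y), consistent with the minimal-sufficient-statistic interpretation above.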
Theorem 3 and its corollaries can be used to motivate the IB principle from an estimation-theoretic viewpoint. Let
be the value function for (33) and a resulting optimal Z (assuming such a maximizer exists). From the remark after Theorem 3, we know that the bounds given in the theorem and in its corollaries remain valid if we replace with a random variable Z, such that . Then, for example, Corollary 1 implies that
for all ℓ such that .
Thus, the IB paradigm minimizes, under the complexity constraint ≤, an upper bound on the difference that universally holds for all loss functions ℓ with . The resulting will then have guaranteed performance in predicting Y with respect to all sufficiently bounded loss functions. This gives a novel operational interpretation of the IB framework that seems to have been overlooked in the literature.
4.5. Deep Learning
The IB paradigm can also serve as a learning objective in deep neural networks (DNNs). Here the Lagrangian relaxation of (33) is considered. In particular, letting X denote the input and the output of the last hidden layer of the DNN, where is the collection of network parameters (weights), the objective is to maximize
over for a given . The parameter controls the trade-off between how informative is about Y, measured by , and how much is “compressed,” measured by . Clearly, larger values of correspond to smaller values of and thus to more compression. Here, is either a deterministic function of X in the form of , where represents the deterministic DNN, or it is produced by a stochastic kernel , parameterized by the network parameters . The latter is achieved by injecting independent noise into the network’s intermediate layers.
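The effect of noise injection on the compression term can be seen in a toy computation. Below, a deterministic encoder Z = f(X), for which I(X;Z) equals H(Z), is mixed with uniform noise, turning it into a stochastic kernel with a strictly smaller I(X;Z). The encoder f and the noise model are hypothetical choices for illustration, not the paper's construction.

```python
import numpy as np

def mi_from_kernel(p_x, kernel):
    """I(X;Z) in nats for input pmf p_x and channel kernel[x, z] = p(z|x)."""
    p_xz = p_x[:, None] * kernel
    p_z = p_xz.sum(axis=0)
    mask = p_xz > 0
    ratio = p_xz / (p_x[:, None] * p_z[None, :])
    return float(np.sum(p_xz[mask] * np.log(ratio[mask])))

nx, nz = 8, 4
p_x = np.full(nx, 1 / nx)                # X uniform on {0, ..., 7}
f = np.arange(nx) % nz                   # deterministic encoder f(x) = x mod 4
det_kernel = np.eye(nz)[f]               # p(z|x) = 1{z = f(x)}

mis = []
for eps in (0.0, 0.3, 1.0):
    # With probability eps, replace f(x) by a uniformly random symbol.
    noisy = (1 - eps) * det_kernel + eps / nz
    mis.append(mi_from_kernel(p_x, noisy))
    print(f"eps={eps:.1f}:  I(X;Z) = {mis[-1]:.4f} nats")
```

At eps = 0 the encoder is deterministic and I(X;Z) = H(Z) = log 4; at eps = 1 the representation is pure noise and I(X;Z) = 0, with intermediate noise levels interpolating between the two.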
In addition to the motivation explained in the previous section, the IB framework for DNNs can be thought of as a regularization method that results in improved generalization capabilities for a network trained on data using stochastic gradient-based methods; see, e.g., Tishby and Zaslavsky [39], Shwartz-Ziv and Tishby [40], Alemi et al. [41], as well as many other references in the excellent survey article by Goldfeld and Polyanskiy [36] and the special issue [42] on information bottleneck and deep learning.
As in the previous section, our Theorem 1 and its corollaries can serve as a (partial) justification for setting (34) as a learning objective. Assume that after training with a given , the obtained has (true) mutual information with Y (typically, this will not be the optimal solution, since maximizing (34) is not feasible and in practice only a proxy lower bound is optimized during training; see, e.g., Alemi et al. [41]). Then, by Corollary 1, the obtained network has a guaranteed predictive performance
for all loss functions ℓ with , where
5. Proof of Theorem 2
Proof of Theorem 2.
- (a)
- The bounds given in the proof of Theorem 1 in [2] imply , where and . Using large deviation inequalities from Beirlant et al. [43] and from Biau and Györfi [18], Györfi and Walk [2] proved that for all , and , . We note that the bounds on the probabilities for were proven in [2] without either assuming the null hypothesis or using the condition that the partitions are nested. Under the null hypothesis, Györfi and Walk [2] claimed that . As Neykov et al. [12] observed, this was incorrect. In order to resolve the gap, we show that under and condition (5), and under the null hypothesis, the last term in (35) is , i.e., if n is large enough. The null hypothesis implies that . Thus, . Let and be as in Condition 1. Then, , where in the last step we used condition (5). The inequalities (35) and (36) imply that if . Since is proportional to , condition (8) on implies , and thus, by the Borel–Cantelli lemma, after a random sample size the test makes no error with probability one.
- (b)
- This proof is a refinement of the proof of Corollary 1 in [2], in which we avoid the condition used there that the sequences of partitions and are nested. According to the proof of Part (a) (see the remark after (35)), we obtain that . To simplify the notation, let , , and . Let be the expected total variation distance between and : , where the supremum is taken over all Borel subsets F of . It suffices to prove that, using the condition , . One has that , where . In [2], it was shown that the condition implies if the sequence of partitions is nested. In order to avoid this nestedness condition, we introduce the density of the conditional distribution with respect to the distribution of as a dominating measure, and similarly let be the density of the conditional distribution with respect to , i.e., . Then, , and therefore the term on the right-hand side of (37) converges to zero as long as , which follows from the standard technique for bounding the bias of the partitioning regression estimate of the regression function ; see Theorem 4.2 in [44]. The terms in (38) and (39) can be dealt with analogously. Thus, . For fixed z, implies ; see Abou-Jaoude [45] and Csiszár [46]. Therefore, the dominated convergence theorem yields
□
Note that condition (5) is not used at all in the proof of Part (b).
6. Concluding Remarks
We studied the excess minimum risk in statistical inference and, under mild conditions, gave a strongly consistent procedure for testing from data whether a given transformation of an observed feature vector results in zero excess minimum risk for all loss functions. It is an open research problem whether a strong universal test exists, i.e., a test that is strongly consistent without any conditions on the transformation and on the underlying distribution. We also developed information-theoretic upper bounds on the excess risk that hold uniformly over fairly general classes of loss functions. The bounds were not stated in the most general form possible: the observed quantities were restricted to take values in Euclidean spaces, and we did not allow transformations that are random functions of the observation; both restrictions could be relaxed. The bounds could also be sharpened in specific cases, but in their present form they are already useful. For example, they give an additional theoretical motivation for applying the information bottleneck approach in deep learning.
Author Contributions
Conceptualization, L.G., T.L. and H.W.; Methodology, L.G. and T.L.; Validation, H.W.; Formal analysis, T.L.; Investigation, H.W.; Writing—original draft, L.G.; Writing—review & editing, L.G., T.L. and H.W. All three authors contributed equally to the published work. All authors have read and agreed to the published version of the manuscript.
Funding
The research of László Györfi has been supported by the National Research, Development and Innovation Fund of Hungary under the 2019-1.1.1-PIACI-KFI-2019-00018 funding scheme. Tamás Linder’s research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
Institutional Review Board Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
T. Linder would like to thank O. Ordentlich and S. Shamai for their helpful comments on an earlier version of this manuscript and for pointing out relevant literature.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Schervish, M.J. Theory of Statistics; Springer Series in Statistics; Springer: New York, NY, USA, 1995. [Google Scholar]
- Györfi, L.; Walk, H. Strongly consistent nonparametric tests of conditional independence. Stat. Probab. Lett. 2012, 82, 1145–1150. [Google Scholar] [CrossRef]
- Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Adv. Neural Inf. Process. Syst. 2017, 30, 2521–2530. [Google Scholar]
- Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. IEEE Trans. Inf. Theory 2022, 68, 7935–7955. [Google Scholar] [CrossRef]
- Raginsky, M.; Rakhlin, A.; Xu, A. Information-Theoretic Stability and Generalization. In Information-Theoretic Methods in Data Science; Rodrigues, M., Eldar, Y., Eds.; Cambridge University Press: Cambridge, UK, 2021; pp. 302–329. [Google Scholar] [CrossRef]
- Lugosi, G.; Neu, G. Generalization bounds via convex analysis. In Proceedings of the 34th Annual Conference on Learning Theory (COLT), London, UK, 2–5 July 2022; pp. 3524–3546. [Google Scholar]
- Jose, S.T.; Simeone, O. Information-theoretic generalization bounds for meta-learning and applications. Entropy 2021, 23, 126. [Google Scholar] [CrossRef]
- Hafez-Kolahi, H.; Moniri, B.; Kasaei, S. Information-theoretic analysis of minimax excess risk. IEEE Trans. Inf. Theory 2023, 69, 4659–4674. [Google Scholar] [CrossRef]
- Polyanskiy, Y.; Wu, Y. Information Theory: From Coding to Learning; Cambridge University Press: Cambridge, UK, 2022; Forthcoming; Available online: https://people.lids.mit.edu/yp/homepage/data/itbook-export.pdf (accessed on 5 July 2023).
- Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996. [Google Scholar]
- Cai, Z.; Li, R.; Zhang, Y. A distribution free conditional independence test with applications to causal discovery. J. Mach. Learn. Res. 2022, 23, 1–41. [Google Scholar]
- Neykov, M.; Balakrishnan, S.; Wasserman, L. Minimax optimal conditional independence testing. Ann. Stat. 2021, 49, 2151–2177. [Google Scholar] [CrossRef]
- Shah, R.D.; Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Ann. Stat. 2020, 48, 1514–1538. [Google Scholar] [CrossRef]
- Dembo, A.; Peres, Y. A topological criterion for hypothesis testing. Ann. Stat. 1994, 22, 106–117. [Google Scholar] [CrossRef]
- Nobel, A.B. Hypothesis testing for families of ergodic processes. Bernoulli 2006, 12, 251–269. [Google Scholar] [CrossRef]
- Cover, T. On determining the irrationality of the mean of a random variable. Ann. Stat. 1973, 1, 862–871. [Google Scholar] [CrossRef]
- Kulkarni, S.R.; Zeitouni, O. Can one decide the type of the mean from the empirical distribution? Stat. Probab. Lett. 1991, 12, 323–327. [Google Scholar] [CrossRef]
- Biau, G.; Györfi, L. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Trans. Inf. Theory 2005, 51, 3965–3973. [Google Scholar] [CrossRef]
- Devroye, L.; Lugosi, G. Almost sure classification of densities. J. Nonparametr. Stat. 2002, 14, 675–698. [Google Scholar] [CrossRef]
- Gretton, A.; Györfi, L. Consistent nonparametric tests of independence. J. Mach. Learn. Res. 2010, 11, 1391–1423. [Google Scholar]
- Morvai, G.; Weiss, B. On universal algorithms for classifying and predicting stationary processes. Probab. Surv. 2021, 18, 77–131. [Google Scholar] [CrossRef]
- Cover, T.; Thomas, J. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
- Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
- Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the information bottleneck to the privacy funnel. In Proceedings of the 2014 IEEE Information Theory Workshop (ITW), Hobart, TAS, Australia, 2–5 November 2014; pp. 501–505. [Google Scholar]
- Ordentlich, O.; (School of Computer Science and Engineering, Hebrew University of Jerusalem, Jerusalem, Israel); Shamai, S.; (Department of Electrical Engineering, Technion, Haifa, Israel). Personal communication, July 2020.
- Hafez-Kolahi, H.; Moniri, B.; Kasaei, S.; Baghshah, M.S. Rate-distortion analysis of minimum excess risk in Bayesian learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 3998–4007. [Google Scholar]
- Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. arXiv 2020, arXiv:2012.14868. [Google Scholar]
- Faragó, T.; Györfi, L. On the continuity of error distortion function for multiple-hypotheses decisions. IEEE Trans. Inf. Theory 1975, IT-21, 458–560. [Google Scholar] [CrossRef]
- Györfi, L.; Walk, H. Detecting Ineffective Features for Pattern Recognition. Oberwolfach Preprint. 2017. Available online: http://publications.mfo.de/handle/mfo/1314 (accessed on 15 July 2023).
- Devroye, L.; Györfi, L.; Lugosi, G.; Walk, H. A nearest neighbor estimate of the residual variance. Electron. J. Stat. 2018, 12, 1752–1778. [Google Scholar] [CrossRef]
- Algoet, P.; Cover, T.M. Asymptotic optimality and asymptotic equipartition properties of log-optimum investments. Ann. Probab. 1988, 16, 876–898. [Google Scholar] [CrossRef]
- Barron, A.R.; Cover, T.M. A bound on the financial value of information. IEEE Trans. Inf. Theory 1988, 34, 1097–1100. [Google Scholar] [CrossRef]
- Györfi, L.; Ottucsák, G.; Urbán, A. Empirical log-optimal portfolio selections: A survey. In Machine Learning for Financial Engineering; Györfi, L., Ottucsák, G., Walk, H., Eds.; Imperial College Press: London, UK, 2012; pp. 81–118. [Google Scholar]
- Györfi, L.; Ottucsák, G.; Walk, H. The growth optimal investment strategy is secure, too. In Optimal Financial Decision Making under Uncertainty; Consigli, G., Kuhn, D., Brandimarte, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; pp. 201–223. [Google Scholar]
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 1999; pp. 368–377. [Google Scholar]
- Goldfeld, Z.; Polyanskiy, Y. The Information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38. [Google Scholar] [CrossRef]
- Asoodeh, S.; Calmon, F.P. Bottleneck problems: An information and estimation-theoretic view. Entropy 2020, 22, 1325. [Google Scholar] [CrossRef] [PubMed]
- Zaidi, A.; Aguerri, I.E.; Shamai, S. On the information bottleneck problems: Models, connections, applications and information theoretic views. Entropy 2020, 22, 151. [Google Scholar] [CrossRef] [PubMed]
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5. [Google Scholar]
- Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017; pp. 368–377. [Google Scholar]
- Geiger, B.C.; Kubin, G. Information bottleneck: Theory and applications in deep learning, Editorial for special issue on “Information Bottleneck: Theory and Applications in Deep Learning”. Entropy 2020, 22, 1408. [Google Scholar] [CrossRef]
- Beirlant, J.; Devroye, L.; Györfi, L.; Vajda, I. Large deviations of divergence measures on partitions. J. Stat. Plan. Inference 2001, 93, 1–16. [Google Scholar] [CrossRef]
- Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002. [Google Scholar]
- Abou-Jaoude, S. Conditions nécessaires et suffisantes de convergence L1 en probabilité de l’histogramme pour une densité. Ann. L’Institut Henri Poincaré 1976, 12, 213–231. [Google Scholar]
- Csiszár, I. Generalized entropy and quantization problems. In Proceedings of the Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Prague, Czech Republic, 19–25 September 1971; Academia: Prague, Czech Republic, 1973; pp. 159–174. [Google Scholar]
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).