Article

Towards a Unified Theory of Learning and Information

by
Ibrahim Alabdulmohsin
Google Research, 8002 Zürich, Switzerland
Entropy 2020, 22(4), 438; https://doi.org/10.3390/e22040438
Submission received: 10 February 2020 / Revised: 5 April 2020 / Accepted: 6 April 2020 / Published: 13 April 2020

Abstract:
In this paper, we introduce the notion of “learning capacity” for algorithms that learn from data, which is analogous to the Shannon channel capacity for communication systems. We show how “learning capacity” bridges the gap between statistical learning theory and information theory, and we use it to derive generalization bounds for finite hypothesis spaces, differential privacy, and countable domains, among others. Moreover, we prove that under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule that has a vanishing learning capacity is equivalent to the assertion that the hypothesis space has a finite Vapnik–Chervonenkis (VC) dimension, thus establishing an equivalence relation between two of the most fundamental concepts in statistical learning theory and information theory. In addition, we show how the learning capacity of an algorithm provides important qualitative results, such as the relation between generalization and algorithmic stability, information leakage, and data processing. Finally, we conclude by listing some open problems and suggesting future directions of research.


1. Introduction

1.1. Generalization Risk

A central goal when learning from data is to strike a balance between underfitting and overfitting. Mathematically, this requirement can be translated into an optimization problem with two competing objectives. First, we would like the learning algorithm to produce a hypothesis (i.e., an answer) that performs well on the empirical sample. This goal can be easily achieved by using a rich hypothesis space that can “explain” any observations. Second, we would like to guarantee that the performance of the hypothesis on the empirical data (a.k.a. training error) is a good approximation of its performance with respect to the unknown underlying distribution (a.k.a. test error). This goal can be achieved by limiting the complexity of the hypothesis space. The first condition mitigates underfitting while the latter condition mitigates overfitting.
Formally, suppose we have a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ that receives a sample $s = \{z_1, \ldots, z_m\}$, comprising $m$ i.i.d. observations $z_i \sim p(z)$, and uses $s$ to select a hypothesis $h \in \mathcal{H}$. Let $l$ be a loss function defined on the product space $\mathcal{Z} \times \mathcal{H}$. For instance, $l$ can be the mean-square-error (MSE) in regression or the 0–1 error in classification. Then, the goal of learning from data is to select a hypothesis $h \in \mathcal{H}$ such that its true risk $R(h)$, defined by
$$R(h) = \mathbb{E}_{z \sim p(z)}\big[ l(z, h) \big], \qquad (1)$$
is small. However, this optimization problem is often difficult to solve exactly since the underlying distribution of observations $p(z)$ is seldom known. Instead, because the true risk $R(h)$ can be decomposed into a sum of two terms:
$$R(h) = R_s(h) + \big( R(h) - R_s(h) \big),$$
where $R_s(h) = \mathbb{E}_{z \sim s}[l(z, h)] = (1/m) \sum_{z \in s} l(z, h)$, both terms can be tackled separately. The first term in the equation above corresponds to the empirical risk on the training sample $s$. The second term corresponds to the generalization risk. Hence, by minimizing both terms, one obtains a learning algorithm whose true risk is small.
Minimizing the empirical risk can be achieved using tractable approximations to the empirical risk minimization (ERM) procedure, such as stochastic convex optimization [1,2]. However, the generalization risk is often difficult to deal with directly because the underlying distribution is often unknown. Instead, it is a common practice to bound it analytically. By establishing analytical conditions for generalization, one hopes to design better learning algorithms that both perform well empirically and generalize well to unseen data.
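The decomposition above can be made concrete with a small simulation. The following sketch (illustrative only, not from the paper) takes the empirical mean of Bernoulli observations as the hypothesis and the squared error as the loss, so that the true risk $R(h)$ has a closed form and the two terms can be computed directly; all parameter values are arbitrary choices.

```python
# A minimal, illustrative sketch (not from the paper): the hypothesis is the
# empirical mean of Bernoulli observations and the loss is the squared error
# l(z, h) = (z - h)^2, so the true risk R(h) has the closed form
# phi*(1 - phi) + (phi - h)^2 and the decomposition can be checked directly.
import numpy as np

rng = np.random.default_rng(0)
phi, m, trials = 0.3, 50, 20000

gaps = []
for _ in range(trials):
    s = rng.binomial(1, phi, size=m)                   # training sample
    h = s.mean()                                       # learned hypothesis
    emp_risk = np.mean((s - h) ** 2)                   # R_s(h)
    true_risk = phi * (1 - phi) + (phi - h) ** 2       # R(h)
    gaps.append(true_risk - emp_risk)                  # generalization risk

print("average generalization risk:", np.mean(gaps))  # positive, shrinks ~ 1/m
```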
Several methods have been proposed in the past for bounding the generalization risk of learning algorithms. Some examples of popular approaches include uniform convergence, algorithmic stability, Rademacher and Gaussian complexities, and the PAC–Bayesian framework [3,4,5,6,7].
The proliferation of such bounds can be understood upon noting that the generalization risk of a learning algorithm is influenced by multiple factors, such as the domain Z , the hypothesis space H , and the mapping from Z to H . Hence, one may derive new generalization bounds by imposing conditions on any of such components. For example, the Vapnik–Chervonenkis (VC) theory derives generalization bounds by assuming constraints on H whereas stability bounds, e.g., [6,8,9], are derived by assuming constraints on the mapping from Z to H .
Rather than showing that certain conditions are sufficient for generalization, we will establish in this paper conditions that are both necessary and sufficient. More precisely, we will show that the “uniform” generalization risk of a learning algorithm admits an exact information-theoretic characterization. In particular, it is equal to the total variation distance between the joint distribution of the hypothesis $h$ and a single random training example $\hat z \sim s$, on one hand, and the product of their marginal distributions, on the other hand. Hence, it is analogous to the mutual information between $h$ and $\hat z$. Since uniform generalization is an information-theoretic quantity, information-theoretic tools, such as the data-processing inequality and the chain rules of entropy [10], can be used to analyze the performance of machine learning algorithms. For example, we will illustrate this fact by presenting a simple proof of the classical generalization bound in the finite hypothesis space setting using, solely, information-theoretic inequalities without any reference to the union bound.

1.2. Types of Generalization

Generalization bounds can be stated either in expectation or in probability. Let $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ be some loss function with a bounded range. Then, we have the following definitions:
Definition 1
(Generalization in Expectation). The expected generalization risk of a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ with respect to a loss $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ is defined by:
$$R_{\mathrm{gen}}(\mathcal{L}) = \mathbb{E}_{h}\big[ R(h) \big] - \mathbb{E}_{s, h}\, \mathbb{E}_{\hat z \sim s}\big[ l(\hat z, h) \big],$$
where $R(h)$ is defined in Equation (1), and the expectation is taken over the random choice of $s$ and the internal randomness of $\mathcal{L}$. A learning algorithm $\mathcal{L}$ generalizes in expectation if $R_{\mathrm{gen}}(\mathcal{L}) \to 0$ as $m \to \infty$ for all distributions $p(z)$.
Definition 2
(Generalization in Probability). A learning algorithm $\mathcal{L}$ generalizes in probability if for any $\epsilon > 0$, we have:
$$p\Big( \big| R(h) - \mathbb{E}_{\hat z \sim s}[ l(\hat z, h) ] \big| > \epsilon \Big) \to 0 \quad \text{as } m \to \infty,$$
where the probability is evaluated over the randomness of s and the internal randomness of the learning algorithm.
In general, both types of generalization have been used to analyze machine learning algorithms. For instance, generalization in probability is used in the VC theory to analyze algorithms with finite VC dimensions, such as linear classifiers [3]. Generalization in expectation, on the other hand, was used to analyze learning algorithms, such as the stochastic gradient descent (SGD), differential privacy, and ridge regression [11,12,13,14]. Generalization in expectation is often simpler to analyze, but it provides a weaker performance guarantee.

1.3. Paper Outline

In this paper, a third notion of generalization is introduced, which is called uniform generalization. Uniform generalization also provides generalization bounds in expectation, but it is stronger than the traditional form of generalization in expectation in Definition 1 because it requires that the generalization risk vanishes uniformly in expectation across all bounded parametric loss functions (hence the name). In this paper, a loss function $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ is called “parametric” if it is conditionally independent of the original training sample given the learned hypothesis $h \in \mathcal{H}$.
As mentioned earlier, the uniform generalization risk is equal to an information-theoretic quantity and it yields classical results in statistical learning theory. Perhaps more importantly, and unlike traditional in-expectation guarantees that do not imply concentration, we will show that uniform generalization in expectation implies generalization in probability. Hence, all of the uniform generalization bounds derived in this paper hold both in expectation and with a high probability.
The theory of uniform generalization bridges the gap between information theory and statistical learning theory. For example, we will establish an equivalence relation between the VC dimension, on one hand, and another quantity that is quite analogous to the Shannon channel capacity, on the other hand. Needless to say, the VC dimension and the Shannon channel capacity are arguably the most central concepts in statistical learning theory and information theory, respectively. This connection between the two concepts is obtained via the notion of the “learning capacity” that we introduce in this paper, which is the supremum of the uniform generalization risk across all input distributions. We will compute the learning capacities of many machine learning algorithms and show how they match known bounds on the generalization risk up to logarithmic factors.
In general, the main aim of this work is to bring to light a new information-theoretic approach for analyzing machine learning algorithms. Despite the fact that “uniform generalization” might appear to be a strong condition at first sight, one of the central themes that is emphasized repeatedly throughout this paper is that uniform generalization is, in fact, a natural condition that arises commonly in practice. It is not a condition that machine learning practitioners need to require or enforce! We believe this holds because any learning algorithm is a channel from the space of training samples to the hypothesis space, so its risk of overfitting can be analyzed by studying the properties of this mapping itself. Such an approach yields the uniform generalization bounds that are derived in this paper.
While we strive to introduce foundational results in this work, there are many important questions that remain unanswered. We conclude this paper by listing some of those open problems and suggesting future directions of research.

2. Notation

The notation used in this paper is fairly standard. Important exceptions are listed here. If $x$ is a random variable that takes its values from a finite set $s$ uniformly at random, we write $x \sim s$ to denote such a distribution. If $x$ is a boolean random variable (i.e., a predicate), then $\mathbb{I}\{x\} = 1$ if and only if $x$ is true, otherwise $\mathbb{I}\{x\} = 0$. In general, random variables are denoted with boldface letters $\mathbf{x}$, instances of random variables are denoted with small letters $x$, matrices are denoted with capital letters $X$, and alphabets (i.e., fixed sets) are denoted with the calligraphic typeface $\mathcal{X}$ (except $\mathcal{L}$, which will be reserved for the learning algorithm, and $\mathcal{D}$, which will be reserved for the input distribution, as is customary in the literature).
Throughout this paper, we will always write $\mathcal{Z}$ to denote the space of observations (a.k.a. domain) and write $\mathcal{H}$ to denote the hypothesis space (a.k.a. range). A learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ is formally treated as a stochastic map, where the hypothesis $h \in \mathcal{H}$ can be a deterministic or a randomized function of the training sample $s \in \mathcal{Z}^m$. Given a 0–1 loss function $l: \mathcal{H} \times \mathcal{Z} \to \{0, 1\}$, we will abuse terminology slightly by speaking about the “VC dimension of $\mathcal{H}$” when we actually mean the VC dimension of the loss class $\{ l(\cdot, h) : h \in \mathcal{H} \}$.
In addition, given two probability measures $p$ and $q$ defined on the same space, we will write $\langle p, q \rangle$ to denote the overlapping coefficient between $p$ and $q$. That is, $\langle p, q \rangle = 1 - \|p, q\|_T$, where $\|p, q\|_T = \frac{1}{2} \|p - q\|_1$ is the total variation distance.
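As a quick illustration (not part of the paper), both quantities can be computed for discrete distributions represented as probability vectors; the helper names below are our own.

```python
# A small sketch of the two quantities above for discrete distributions.
import numpy as np

def total_variation(p, q):
    """||p, q||_T = 0.5 * sum_z |p(z) - q(z)|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def overlap(p, q):
    """Overlapping coefficient <p, q> = 1 - ||p, q||_T = sum_z min(p(z), q(z))."""
    return 1.0 - total_variation(p, q)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(total_variation(p, q), overlap(p, q))   # 0.3 and 0.7
```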
Moreover, we will use the order-in-probability notation for real-valued random variables. Here, we adopt the notation used by [15] and [16]. In particular, let $x = x_n$ be a real-valued random variable that depends on some parameter $n \in \mathbb{N}$. Then, we will write $x_n = O_p(f(n))$ if for any $\delta > 0$, there exist absolute constants $C$ and $n_0$ such that for any fixed $n \ge n_0$, the inequality $|x_n| < C |f(n)|$ holds with a probability of, at least, $1 - \delta$. In other words, the ratio $x_n / f(n)$ is stochastically bounded [15]. Similarly, we write $x_n = o_p(f(n))$ if $x_n / f(n)$ converges to zero in probability. As an example, if $x \sim \mathcal{N}(0, I_d)$ is a standard multivariate Gaussian vector, then $\|x\|_2 = O_p(\sqrt{d})$ even though $\|x\|_2$ can be arbitrarily large. Intuitively, the probability of the event $\|x\|_2 \ge d^{\frac{1}{2} + \epsilon}$ when $\epsilon > 0$ goes to zero as $d \to \infty$, so $\|x\|_2$ is effectively of the order $O(\sqrt{d})$.
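The closing example can be checked numerically; the following short simulation (our own illustration, with arbitrary sample sizes) shows that the ratio $\|x\|_2 / \sqrt{d}$ stays close to 1 as $d$ grows.

```python
# Numeric check (illustrative): for x ~ N(0, I_d), ||x||_2 = O_p(sqrt(d)).
import numpy as np

rng = np.random.default_rng(1)
for d in (10, 100, 1000):
    norms = np.linalg.norm(rng.standard_normal((2000, d)), axis=1)
    print(d, norms.mean() / np.sqrt(d))   # ratio stays near 1 as d grows
```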

3. Related Work

A learning algorithm is called consistent if the true risk of its hypothesis $h$ converges to the optimal true risk in $\mathcal{H}$, i.e., $\inf_{h \in \mathcal{H}} R(h)$, as $m \to \infty$ in a distribution-agnostic manner. A learning problem, which is a tuple $(\mathcal{Z}, \mathcal{H}, l)$ with $l$ being a loss function defined on the product space $\mathcal{Z} \times \mathcal{H}$, is called learnable if it admits a consistent learning algorithm. It can be shown that learnability is equivalent to uniform convergence for supervised classification and regression even though uniform convergence is not necessary in the general setting [17].
Unlike learnability, the subject of generalization looks into how representative the empirical risk $R_s(h)$ is of the true risk $R(h)$, as discussed earlier. It can be rightfully considered an extension of the law of large numbers, which is one of the earliest and most important results in probability theory and statistics. However, unlike the law of large numbers, which assumes that observations are independent and identically distributed, the subject of generalization in machine learning addresses the case where the losses $l(z_i, h)$ are no longer i.i.d. because $h$ is selected according to the training sample $s$ and $z_i \in s$.
Similar to learnability, uniform convergence is, by definition, sufficient for generalization but it is not necessary because the learning algorithm might restrict its search space to a smaller subset of H . So, in addition to uniform convergence bounds, several other methods have been introduced for bounding the generalization risk, such as using algorithmic stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis [5,6,7,18,19,20]. Classical concentration of measure inequalities, such as using the union bound, form the building blocks of such rich theories.
In this work, we address the subject of generalization in machine learning from an information-theoretic point of view. We will show that if the hypothesis $h$ conveys “little” information about a random single training example $\hat z \sim s$, then the difference between $\mathbb{E}_{\hat z \sim s}[l(\hat z, h)]$ and $\mathbb{E}_{z \sim p(z)}[l(z, h)]$ will be small with a high probability. The measure of information we use here is given by the notion of variational information $J(\hat z; h)$ between the hypothesis $h$ and a single random training example $\hat z \sim s$. Variational information, also sometimes called T-information [14], is an instance of the class of informativity measures using f-divergences, which can be motivated axiomatically [21,22]. Unlike traditional methods, we will prove that $J(\hat z; h)$ is equal to the “uniform” generalization risk; it is not just an upper bound.
Information-theoretic approaches to analyzing the generalization risk of learning algorithms, such as the one proposed in this paper, have found applications in adaptive data analysis. This includes the work of [12] using the max-information, the work of [23] and [24] using the mutual information, and the work of [14] using the leave-one-out information. One key contribution of our work is to show that one should examine the relationship between the hypothesis and a single random training example, instead of examining the relationship between the hypothesis and the full training sample, as is customary in the literature. The gap between these two approaches is strict. For example, Theorem 8 in Section 5.5 presents an example in which a learning algorithm can have a vanishing uniform generalization risk even when the mutual information between the learned hypothesis and the training sample can be made arbitrarily large.

4. Uniform Generalization

4.1. Preliminary Definitions

In this paper, we consider the general setting of learning introduced by Vapnik [3]. To reiterate, we have an observation space (a.k.a. domain) $\mathcal{Z}$ and a hypothesis space $\mathcal{H}$. Our learning algorithm $\mathcal{L}$ receives a set of $m$ observations $s = \{z_1, \ldots, z_m\} \in \mathcal{Z}^m$ generated i.i.d. from some fixed unknown distribution $p(z)$, and picks a hypothesis $h \in \mathcal{H}$ according to some probability distribution $p(h \mid s)$. In other words, $\mathcal{L}$ is a channel from $s$ to $h$. In this paper, we allow the hypothesis $h$ to be any summary statistic of the training set. It can be an answer to a query, a measure of central tendency, or a mapping from the input space to the output space. In fact, we even allow $h$ to be a subset of the training set itself. In formal terms, $\mathcal{L}$ is a stochastic map between the two random variables $s \in \mathcal{Z}^m$ and $h \in \mathcal{H}$, where the exact interpretation of those random variables is irrelevant. Moreover, we assume that there exists a non-negative bounded loss function $l(z, h) \in [0, 1]$ that is used to measure the fitness of the hypothesis $h \in \mathcal{H}$ on the observation $z \in \mathcal{Z}$.
For any fixed hypothesis $h \in \mathcal{H}$, we define its true risk $R(h)$ by Equation (1) and denote its empirical risk on the training sample by $R_s(h)$. We also define the true and empirical risks of the learning algorithm $\mathcal{L}$ by the corresponding expected risk of its hypothesis:
$$R(\mathcal{L}) = \mathbb{E}_{s}\, \mathbb{E}_{h \sim p(h|s)}\big[ R(h) \big] = \mathbb{E}_{h}\big[ R(h) \big]$$
$$\hat R(\mathcal{L}) = \mathbb{E}_{s}\, \mathbb{E}_{h \sim p(h|s)}\big[ R_s(h) \big] = \mathbb{E}_{s, h}\big[ R_s(h) \big]$$
Finally, the generalization risk of the learning algorithm is defined by:
$$R_{\mathrm{gen}}(\mathcal{L}) \doteq R(\mathcal{L}) - \hat R(\mathcal{L}) \qquad (5)$$
Next, we define uniform generalization:
Definition 3
(Parametric Loss). A loss function $l(\cdot, h): \mathcal{Z} \to [0, 1]$ is called parametric if it is conditionally independent of the training sample given the hypothesis $h \in \mathcal{H}$. That is, it satisfies the Markov chain $s \to h \to l(\cdot, h)$.
Definition 4
(Uniform Generalization). A learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ generalizes uniformly with rate $\epsilon \ge 0$ if for all bounded parametric losses $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$, we have $|R_{\mathrm{gen}}(\mathcal{L})| \le \epsilon$, where $R_{\mathrm{gen}}(\mathcal{L})$ is given in Equation (5).
Informally, Definition 4 states that once a hypothesis $h$ is selected by a learning algorithm $\mathcal{L}$ that achieves uniform generalization, then no “adversary” can post-process the hypothesis in a manner that causes overfitting to occur. Equivalently, uniform generalization implies that the empirical performance of $h$ on the sample $s$ will remain close to its performance with respect to the underlying distribution regardless of how that performance is being measured. For example, the loss function $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ in Equation (5) can be the misclassification error rate as in the traditional classification setting, a cost-sensitive error rate as in fraud detection and medical diagnosis [25], or the Brier score as in probabilistic predictions [26]. The generalization guarantee would hold in any case.

4.2. Variational Information

Given two random variables $x$ and $y$, the variational information between the two random variables is defined to be the total variation distance between the joint distribution $p(x, y)$ and the product of marginals $p(x) \cdot p(y)$. We will denote this by $J(x; y)$. By definition:
$$J(x; y) = \big\| p(x, y),\; p(x) \cdot p(y) \big\|_T = \mathbb{E}_{x}\, \big\| p(y),\; p(y \mid x) \big\|_T$$
Note that $0 \le J(x; y) \le 1$. We describe some of the important properties of variational information in this section. The reader may consult the appendices for detailed proofs.
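For intuition, $J(x; y)$ can be evaluated directly from a joint probability table of two discrete random variables. The sketch below is illustrative; the function name and example tables are our own.

```python
# A minimal sketch: variational information from a 2-D joint probability array.
import numpy as np

def variational_information(joint):
    """J(x; y) = || p(x, y), p(x) p(y) ||_T for a discrete joint distribution."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return 0.5 * np.abs(joint - px * py).sum()

independent = np.outer([0.4, 0.6], [0.5, 0.5])     # J = 0
correlated = np.array([[0.4, 0.1], [0.1, 0.4]])    # J = 0.3
print(variational_information(independent), variational_information(correlated))
```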
Lemma 1
(Data Processing Inequality). If $x \to y \to z$ is a Markov chain, then:
$$J(x; z) \le J(y; z)$$
This data processing inequality holds, in general, for all informativity measures using f-divergences [21,22].
Lemma 2
(Information Cannot Hurt). For any random variables x X , y Y , and z Z , we have:
J ( x ; y ) J ( x ; ( y , z ) )
Proof. 
The proof is in Appendix A. □
Finally, we derive a chain rule for the variational information.
Definition 5
(Conditional Variational Information). The conditional variational information between the two random variables x and y given z is defined by:
$$J(x; y \mid z) = \mathbb{E}_{z}\, \big\| p(x, y \mid z),\; p(x \mid z) \cdot p(y \mid z) \big\|_T,$$
which is analogous to the conditional mutual information in information theory [10].
Theorem 1
(Chain Rule). Let $(h_1, \ldots, h_k)$ be a sequence of random variables. Then, for any random variable $z$, we have: $J\big(z; (h_1, \ldots, h_k)\big) \le \sum_{t=1}^{k} J\big(z; h_t \mid (h_1, \ldots, h_{t-1})\big)$
Proof. 
The proof is in Appendix B. □
Although the chain rule above provides an upper bound, the upper bound is tight in the following sense:
Proposition 1.
For any random variables $x$, $y$, and $z$, we have $\big| J(x; (y, z)) - J(x; z \mid y) \big| \le J(x; y)$ and $\big| J(x; (y, z)) - J(x; y) \big| \le J(x; z \mid y)$.
Proof. 
The proof is in Appendix C. □
In other words, the inequality in the chain rule $J(x; (y, z)) \le J(x; y) + J(x; z \mid y)$ becomes an equality if:
$$\min\big\{ J(x; y),\; J(x; z \mid y) \big\} = 0$$
The chain rule provides a recipe for computing the bias of a composition of hypotheses ( h 1 , , h k ) . Recently, [23] proposed an information budget framework for controlling the bias of estimators by controlling the mutual information between h and the training sample s . The proposed framework rests on the chain rule of mutual information. Here, we note that the argument for the information budget framework also holds when using the variational information due to the chain rule above.

4.3. Equivalence Result

Our first main theorem states that the uniform generalization risk has a precise information-theoretic characterization.
Theorem 2.
Given a fixed constant $0 \le \epsilon \le 1$ and a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ that selects a hypothesis $h \in \mathcal{H}$ according to a training sample $s = \{z_1, \ldots, z_m\}$, where $z_i \sim p(z)$ are i.i.d., $\mathcal{L}$ generalizes uniformly with rate $\epsilon$ if and only if $J(h; \hat z) \le \epsilon$, where $\hat z \sim s$ is a single random training example.
Proof. 
Let $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm that receives a finite set of training examples $s = \{z_1, \ldots, z_m\} \in \mathcal{Z}^m$ drawn i.i.d. from a fixed unknown distribution $p(z)$. Let $h \sim p(h \mid s)$ be the hypothesis chosen by $\mathcal{L}$ (which can be deterministic or randomized) and write $\hat z \sim s$ to denote a random variable that selects its value uniformly at random from the training sample $s$. Clearly, $\hat z$ and $h$ are not independent in general. To simplify notation, we will write $l = l(\cdot, h): \mathcal{Z} \to [0, 1]$ to denote the loss function. Note that $l$ is itself a random variable that satisfies the Markov chain $s \to h \to l$. The claim is that $\mathcal{L}$ generalizes uniformly with rate $\epsilon > 0$ across all parametric loss functions $l$ if and only if $J(h; \hat z) \le \epsilon$.
By the Markov property, we have $p(l \mid h, s) = p(l \mid h)$. By definition, the true and empirical risks of $\mathcal{L}$ are given by:
$$R(\mathcal{L}) = \mathbb{E}_{s, h}\, \mathbb{E}_{l \mid h}\, \mathbb{E}_{z \sim p(z)}\, l(z) = \mathbb{E}_{l}\, \mathbb{E}_{z \sim p(z)}\, l(z) \qquad (6)$$
$$\hat R(\mathcal{L}) = \mathbb{E}_{s}\, \mathbb{E}_{l \mid s}\, \mathbb{E}_{z \sim s}\, l(z) = \mathbb{E}_{l}\, \mathbb{E}_{s \mid l}\, \mathbb{E}_{z \sim s}\, l(z) \qquad (7)$$
Because $\hat z \sim s$ is a random variable whose value is chosen uniformly at random with replacement from the training set $s$, its marginal distribution is $p(z)$. Its conditional distribution given $l$ can be different, however, because both $l$ and $\hat z$ depend on the training set $s$; nevertheless, they are conditionally independent of each other given $s$. By marginalization, we have:
$$p(\hat z \mid l) = \mathbb{E}_{s \mid l}\, p(\hat z \mid s, l) = \mathbb{E}_{s \mid l}\, p(\hat z \mid s)$$
Combining this with Equations (6) and (7) yields $R(\mathcal{L}) = \mathbb{E}_{l}\, \mathbb{E}_{\hat z}\, l(\hat z)$ and $\hat R(\mathcal{L}) = \mathbb{E}_{l}\, \mathbb{E}_{\hat z \mid l}\, l(\hat z)$. Both equations imply that:
$$R(\mathcal{L}) - \hat R(\mathcal{L}) = \mathbb{E}_{l}\, \Big[ \mathbb{E}_{\hat z}\, l(\hat z) - \mathbb{E}_{\hat z \mid l}\, l(\hat z) \Big]$$
Now, we would like to sandwich the right-hand side between upper and lower bounds. To do this, we note that if $p_1(z)$ and $p_2(z)$ are two distributions defined on the same domain $\mathcal{Z}$ and $f: \mathcal{Z} \to [0, 1]$, then:
$$\Big| \mathbb{E}_{z \sim p_1(z)}\, f(z) - \mathbb{E}_{z \sim p_2(z)}\, f(z) \Big| \le \big\| p_1(z),\; p_2(z) \big\|_T,$$
where $\| p_1(z), p_2(z) \|_T$ is the total variation distance. This result can be immediately proven by considering the two regions $\{ z \in \mathcal{Z} : p_1(z) > p_2(z) \}$ and $\{ z \in \mathcal{Z} : p_1(z) < p_2(z) \}$ separately. In addition, it is tight because the inequality holds with equality for the loss function $f(z) = \mathbb{I}\{ p_1(z) > p_2(z) \}$. Consequently:
$$\big| R(\mathcal{L}) - \hat R(\mathcal{L}) \big| \le J(l; \hat z)$$
Finally, from the Markov chain $\hat z \to s \to h \to l$ and the data processing inequality, we have $J(l; \hat z) \le J(h; \hat z)$. Plugging this into the earlier inequality yields the bound:
$$\big| R(\mathcal{L}) - \hat R(\mathcal{L}) \big| \le J(h; \hat z)$$
To prove the converse, define:
$$l(z, h) = \mathbb{I}\Big\{ p(\hat z = z) > p(\hat z = z \mid h) \Big\} = \mathbb{I}\Big\{ p(\hat z = z) > \mathbb{E}_{s \mid h}\big[ p_{\hat z \sim s}(\hat z = z) \big] \Big\}$$
The loss $l(z, h)$ is independent of the training sample given $h$ because $p(\hat z = z \mid h)$ is evaluated by taking the expectation over all training samples conditioned on $h$. Hence, $l(z, h)$ is a 0–1 loss defined on the product space $\mathcal{Z} \times \mathcal{H}$ and satisfies the Markov chain $s \to h \to l$. Given this choice of loss, we have:
$$\big| R(\mathcal{L}) - \hat R(\mathcal{L}) \big| = \mathbb{E}_{h}\, \Big[ \mathbb{E}_{\hat z}\, \mathbb{I}\{ p(\hat z) > p(\hat z \mid h) \} - \mathbb{E}_{\hat z \mid h}\, \mathbb{I}\{ p(\hat z) > p(\hat z \mid h) \} \Big] = \mathbb{E}_{h}\, \big\| p(\hat z),\; p(\hat z \mid h) \big\|_T = J(h; \hat z)$$
Hence, the variational information $J(h; \hat z)$ not only provides an upper bound on the uniform generalization risk, but is also a lower bound on it. Therefore, $J(h; \hat z)$ is equal to the uniform generalization risk. □
Remark 1.
One important observation about Theorem 2 is that the variational information is measured between the hypothesis $h$ and a single training example $\hat z$, which is quite different from previous works that looked into the mutual information with the entire training sample $s$. By considering $\hat z$ rather than $s$, we quantify the uniform generalization risk with equality, and the resulting bound is not vacuous even if the learning algorithm is deterministic. By contrast, $J(s; h)$ may yield vacuous bounds when $\mathcal{L}$ is deterministic and both $\mathcal{Z}$ and $\mathcal{H}$ are uncountable.
For concreteness, we illustrate how to compute the uniform generalization risk (or equivalently the variational information) on two simple examples. Here, $B(k; \phi, n) = \binom{n}{k} \phi^k (1 - \phi)^{n - k}$ is the binomial distribution. The first example is a special case of a more general theorem that will be presented later in Section 5.2.
Example 1.
Suppose that observations $z_i \in \{0, 1\}$ are i.i.d. Bernoulli trials with $p(z_i = 1) = \phi$, and that the hypothesis produced by $\mathcal{L}$ is the empirical average $h = \frac{1}{m} \sum_{i=1}^{m} z_i$. Because $p(h = k/m \mid \hat z = 1) = B(k - 1; \phi, m - 1)$ and $p(h = k/m \mid \hat z = 0) = B(k; \phi, m - 1)$, it can be shown that the uniform generalization risk of this learning algorithm is given by the following quantity, assuming that $\phi m$ is an integer:
$$J(\hat z; h) = \frac{1}{m} \cdot 2\, (1 - \phi)^{(1 - \phi) m}\, \phi^{1 + m \phi}\, (1 + m \phi) \binom{m}{m \phi + 1}$$
This is maximized when $\phi = 1/2$, in which case the uniform generalization risk can be bounded using the Stirling approximation [27] by $1 / \sqrt{2 \pi m}$ up to a first-order term.
Proof. 
First, the probability that we obtain a hypothesis $h = \frac{k}{m}$, where $k \in \{0, 1, \ldots, m\}$, after $m$ Bernoulli trials follows a binomial distribution:
$$p\Big( h = \frac{k}{m} \Big) = \binom{m}{k} \phi^k (1 - \phi)^{m - k}$$
We use the identity:
$$J(\hat z; h) = \sum_{k=0}^{m} p\Big( h = \frac{k}{m} \Big)\, \Big\| p(\hat z),\; p\Big(\hat z \,\Big|\, h = \frac{k}{m}\Big) \Big\|_T$$
However, $p(\hat z)$ is Bernoulli with probability of success $\phi$ while $p(\hat z \mid h = \frac{k}{m})$ is Bernoulli with probability of success $h = \frac{k}{m}$. The total variation distance between the two Bernoulli distributions is given by $|\phi - h|$. So, we obtain:
$$J(\hat z; h) = \sum_{k=0}^{m} \binom{m}{k} \phi^k (1 - \phi)^{m - k}\, \Big| \phi - \frac{k}{m} \Big|$$
This is the mean deviation of $h$, i.e., $J(\hat z; h) = \frac{1}{m} \mathbb{E}\,|X - m\phi|$, where $X = m h$ is a binomial random variable with parameters $(m, \phi)$. Assuming $\phi m$ is an integer, the mean deviation of the binomial random variable is given by de Moivre’s formula:
$$MD = \mathbb{E}\,|X - m\phi| = 2\, (1 - \phi)^{(1 - \phi) m}\, \phi^{1 + m \phi}\, (1 + m \phi) \binom{m}{m \phi + 1}$$
The mean deviation is maximized when $\phi = \frac{1}{2}$. This gives us:
$$J(\hat z; h) = \frac{MD}{m} \approx \frac{1}{2^{m+1}} \binom{m}{m/2 + 1} \approx \frac{1}{\sqrt{2 \pi m}},$$
where in the last step we expanded the binomial coefficient and used Stirling’s approximation [27]. □
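The computation in Example 1 can be verified numerically. The sketch below (illustrative; it assumes the formula as reconstructed above, with the mean deviation divided by $m$) compares the exact sum, de Moivre's closed form, and the $1/\sqrt{2\pi m}$ asymptote at $\phi = 1/2$.

```python
# Numeric check of Example 1: exact mean-deviation sum vs. de Moivre's formula
# vs. the 1/sqrt(2*pi*m) asymptote.
from math import comb, sqrt, pi

def J_empirical_mean(phi, m):
    """J(z_hat; h) = sum_k B(k; phi, m) * |phi - k/m| for h = (1/m) sum z_i."""
    return sum(comb(m, k) * phi**k * (1 - phi)**(m - k) * abs(phi - k / m)
               for k in range(m + 1))

m, phi = 100, 0.5
nu = int(m * phi) + 1                 # de Moivre's formula needs m*phi integer
de_moivre = (2 * (1 - phi)**((1 - phi) * m) * phi**(1 + m * phi)
             * (1 + m * phi) * comb(m, nu)) / m
print(J_empirical_mean(phi, m), de_moivre, 1 / sqrt(2 * pi * m))
```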
Example 2.
Suppose that the domain is $\mathcal{Z} = \{1, 2, 3, \ldots, K\}$ for some $K < \infty$, where $p(z = k) = 1/K$ for all $k \in \mathcal{Z}$. Let the hypothesis space be $\mathcal{H} = \mathcal{Z}$, where $p(h = k)$ is equal to the fraction of times the value $k$ is observed in the training sample $s = \{z_1, \ldots, z_m\}$. For example, if $s = \{1, 3, 2, 1, 1, 3\}$, the hypothesis $h$ is chosen among the set $\{1, 2, 3\}$ with the respective probabilities $\{1/2, 1/6, 1/3\}$. Then, the variational information is given by:
$$J(\hat z; h) = \frac{1}{m} \Big( 1 - \frac{1}{K} \Big)$$
Proof. 
We have by symmetry $p(h = k) = 1/K$ for all $k \in \{1, 2, 3, \ldots, K\}$. Let $\hat z = x$. By Bayes rule, we have:
$$p(\hat z = x \mid h = k) = \frac{p(h = k \mid \hat z = x) \cdot p(\hat z = x)}{p(h = k)} = p(h = k \mid \hat z = x)$$
However, given one observation $\hat z = x$, the probability of selecting a hypothesis $h = k$ depends on two cases:
$$p(h = k \mid \hat z = x) = \begin{cases} q & \text{if } k = x \\ r & \text{if } k \neq x \end{cases}$$
for some values $q \ge 0$ and $r \ge 0$ such that $q + (K - 1)\, r = 1$. To find $q$, we use the definition of $\mathcal{L}$:
$$q = \frac{1}{m} + \frac{1}{K} \cdot \frac{m - 1}{m} = \frac{1}{K} + \frac{1}{m} \Big( 1 - \frac{1}{K} \Big)$$
This holds because $\mathcal{L}$ is equivalent to an algorithm that selects a single observation in the set $s$ uniformly at random. So, to satisfy the condition $q + (K - 1)\, r = 1$, we have:
$$r = \frac{1}{K} - \frac{1}{m K}$$
Now, we are ready to find the desired expression.
$$J(\hat z; h) = \frac{1}{2} \sum_{x \in \mathcal{Z}} p(\hat z = x) \sum_{k \in \mathcal{Z}} \big| p(h = k) - p(h = k \mid \hat z = x) \big| = \frac{1}{2} \sum_{k \in \mathcal{Z}} \big| p(h = k) - p(h = k \mid \hat z = 1) \big| = \frac{1}{2} \Big[ \frac{1}{m} \Big( 1 - \frac{1}{K} \Big) + \frac{K - 1}{m K} \Big] = \frac{1}{m} \Big( 1 - \frac{1}{K} \Big)$$
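As a sanity check (our own, not from the paper), the conditional probabilities $q$ and $r$ derived in the proof can be plugged back into the definition of $J(\hat z; h)$ to confirm the closed form $\frac{1}{m}(1 - \frac{1}{K})$ exactly.

```python
# Exact verification of Example 2 using rational arithmetic.
from fractions import Fraction

def J_example2(K, m):
    q = Fraction(1, m) + Fraction(1, K) * Fraction(m - 1, m)   # p(h = k | z_hat = k)
    r = (1 - q) / (K - 1)                                      # p(h = k | z_hat != k)
    # J = 0.5 * E_{z_hat} sum_k |p(h = k) - p(h = k | z_hat)|, with p(h = k) = 1/K
    return Fraction(1, 2) * (abs(Fraction(1, K) - q) + (K - 1) * abs(Fraction(1, K) - r))

for K, m in [(2, 10), (5, 20), (10, 100)]:
    print(K, m, J_example2(K, m), Fraction(1, m) * (1 - Fraction(1, K)))
```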
Note that the variational information in Example 2 is $\Theta(1/m)$, which is smaller than the variational information in Example 1. This is not a coincidence. The difference between the two examples is related to data processing. Specifically, suppose that $K = 2$ in Example 2 and let $h_2$ be the hypothesis. Let $h_1$ be the hypothesis in Example 1. Then, we have the Markov chain $s \to h_1 \to h_2$ because $h_2$ is Bernoulli with parameter $h_1$.

4.4. Learning Capacity

The variational information depends on the distribution of observations p ( z ) , which is seldom known in practice. To construct a distribution-free bound on the uniform generalization risk, we introduce the following quantity:
Definition 6
(Learning Capacity). The learning capacity of an algorithm L is defined by:
$$C(\mathcal{L}) \doteq \sup_{p(z)} J(\hat z; h),$$
where $h$ and $\hat z$ are as defined in Theorem 2.
The above quantity is analogous to the Shannon channel capacity except that it is measured in the total variation distance. It quantifies the capacity for overfitting in the given learning algorithm. For example, the learning capacity of the algorithm in Example 1 is $1/\sqrt{2 \pi m}$ up to a first-order term, as proved earlier, so its capacity for overfitting is larger than that of the learning algorithm in Example 2.
Theorem 2 reveals that C ( L ) has, at least, three equivalent interpretations:
  • Statistical: The learning capacity $C(\mathcal{L})$ is equal to the supremum of the expected generalization risk $R_{\mathrm{gen}}(\mathcal{L})$ across all input distributions and all bounded parametric losses. This holds by Theorem 2 and Definition 6.
  • Information-Theoretic: The learning capacity $C(\mathcal{L})$ is equal to the amount of information contained in the hypothesis $h$ about the training examples. This holds because $J(\hat z; h) = \mathbb{E}_{h}\, \| p(\hat z),\; p(\hat z \mid h) \|_T$.
  • Algorithmic: The learning capacity $C(\mathcal{L})$ measures the influence of a single training example $\hat z$ on the distribution of the final hypothesis $h$. As such, a learning algorithm has a small learning capacity if and only if it is algorithmically stable. This follows from the fact that $J(\hat z; h) = \mathbb{E}_{\hat z}\, \| p(h),\; p(h \mid \hat z) \|_T$.
Throughout the sequel, we analyze the properties of $C(\mathcal{L})$ and derive upper bounds for it under various conditions, such as in the finite hypothesis space setting and differential privacy.
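As an illustration of Definition 6 (a sketch under our own choices, not from the paper), the learning capacity of the algorithm in Example 1 can be approximated numerically by sweeping the Bernoulli parameter $\phi$, which fully specifies $p(z)$ in that example, and taking the largest value of $J(\hat z; h)$.

```python
# Approximating the learning capacity of Example 1's algorithm by sweeping phi.
import numpy as np
from math import comb, sqrt, pi

def J_empirical_mean(phi, m):
    """Exact J(z_hat; h) when h is the empirical mean of m Bernoulli(phi) trials."""
    return sum(comb(m, k) * phi**k * (1 - phi)**(m - k) * abs(phi - k / m)
               for k in range(m + 1))

m = 100
phis = np.linspace(0.01, 0.99, 99)
capacity = max(J_empirical_mean(p, m) for p in phis)
print(capacity, 1 / sqrt(2 * pi * m))   # the sweep peaks near 1/sqrt(2*pi*m) at phi = 1/2
```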

4.5. The Definition of Hypothesis

In the proof of Theorem 2, the Markov chain $\hat z \to s \to h \to l(\cdot, h)$ is used. Essentially, this states that the loss function $l(\cdot, h): \mathcal{Z} \to [0, 1]$, which is a random variable itself, must be parameterized entirely by the hypothesis $h$ as stated in Definition 3. We list, next, a few examples that highlight this point.
Example 3
(Input Normalization). If the data is normalized prior to training, such as using min-max or z-score normalization, then the normalization parameters are included in the definition of the hypothesis h .
Example 4
(Feature Selection). If the observations $z$ comprise $d$ features and feature selection is implemented prior to training a model $v$ (such as in classification or clustering), then the hypothesis $h$ is the composition $(u, v)$, where $u \in \{0, 1\}^d$ encodes the set of features that have been selected by the feature selection algorithm.
Example 5
(Cross Validation). Hyper-parameter tuning is a common practice in machine learning. This includes choosing the tradeoff parameter C in support vector machine (SVM) [28] or the bandwidth γ in radial basis function (RBF) networks [29]. However, not all hyper-parameters are encoded in the hypothesis h . For instance, the tradeoff constant C is never used during prediction so it is omitted from the definition of h but the bandwidth parameter γ is included if it is selected based on the training sample.
In order to illustrate why the Markov chain z ^ s h l ( · , h ) is important, consider the following simple scenario. Suppose we have a mixture of two Gaussians in R d , one corresponding to the positive class and one corresponding to the negative class. If z-score normalization is applied before training a linear classifier, then the generalization risk might increase with normalization because the final hypothesis now includes more information about the training sample (see Lemma 2). Figure 1 shows this effect when d = 1 . As illustrated in the figure, normalization is often important in order to assign equal weights to all features but it can increase the generalization risk as well.

4.6. Concentration

The notion of uniform generalization in Definition 4 provides in-expectation guarantees. In this section, we show that whereas traditional generalization in expectation does not imply concentration, uniform generalization in expectation implies concentration. In fact, we will use the chain rule in Theorem 1 to derive a Markov-type inequality. After that, we show that the bound is tight.
We begin by showing why a non-uniform generalization in expectation does not imply concentration.
Proposition 2.
There exists a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ and a parametric loss $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ such that the expected generalization risk is $R_{\mathrm{gen}}(\mathcal{L}) = 0$ even though $p\big( |R(h) - R_s(h)| = \frac{1}{2} \big) = 1$, where the probability is evaluated over the randomness of $s$ and the internal randomness of $\mathcal{L}$.
Proof. 
Let $\mathcal{Z} = [0, 1]$ be an instance space with a continuous marginal density $p(z)$ and let $\mathcal{Y} = \{-1, +1\}$ be the target set. Let $h^\ast: \mathcal{Z} \to \{-1, +1\}$ be some fixed predictor such that $p\{ h^\ast(z) = 1 \} = \frac{1}{2}$, where the probability is evaluated over the random choice of $z \in \mathcal{Z}$. In other words, the marginal distribution of the labels predicted by $h^\ast$ is uniform over the set $\{-1, +1\}$. These assumptions are satisfied, for example, if $p(z)$ is uniform in $[0, 1]$ and $h^\ast(z) = \mathbb{I}\{ z < 1/2 \}$.
Next, let the hypothesis space $\mathcal{H}$ be the set of predictors from $\mathcal{Z}$ to $\{-1, +1\}$ that output a label in $\{-1, +1\}$ uniformly at random everywhere in $\mathcal{Z}$ except at a finite number of points. Define the parametric loss by $l(z; h) = \mathbb{I}\{ h(z) \neq h^\ast(z) \}$.
Next, we construct a learning algorithm $\mathcal{L}$ that generalizes perfectly in expectation but does not generalize in probability. The learning algorithm $\mathcal{L}$ simply picks $h \in \{h_0, h_1\}$ at random with equal probability. The two hypotheses are:
$$h_0(z) = \begin{cases} h^\ast(z) & \text{if } z \in s \\ \text{Uniform}\{-1, +1\} & \text{if } z \notin s \end{cases} \qquad\qquad h_1(z) = \begin{cases} -h^\ast(z) & \text{if } z \in s \\ \text{Uniform}\{-1, +1\} & \text{if } z \notin s \end{cases}$$
Because $\mathcal{Z}$ is uncountable and the probability of seeing the same observation $z$ twice is zero, $R(h) = \frac{1}{2}$ for this learning algorithm. Thus:
$$R_{\mathrm{gen}}(\mathcal{L}) = \mathbb{E}_{s, h}\big[ R_s(h) - R(h) \big] = 0$$
However, the empirical risk for any $s$ satisfies $R_s(h) \in \{0, 1\}$ while the true risk always satisfies $R(h) = \frac{1}{2}$, as mentioned earlier. Hence, the statement of the proposition follows. □
There are many ways of seeing why the algorithm in Proposition 2 does not generalize uniformly in expectation. The simplest way is to use the equivalence between uniform generalization and variational information as stated in Theorem 2. Given the hypothesis $h \in \{h_0, h_1\}$ that is learned by the algorithm constructed in the proposition, the conditional distribution of an individual training example $p(\hat z \mid h)$ is uniform over the sample $s$. This follows from the fact that the hypothesis $h$ has to encode the entire sample $s$. However, the probability of seeing the same observation twice is zero (by construction). Hence, $\| p(\hat z),\; p(\hat z \mid h) \|_T = 1$. This shows that $C(\mathcal{L}) = 1$.
The example in Proposition 2 reveals an interesting property of non-uniform generalization. Namely, non-uniform generalization can be sensitive to every bit of information provided by the hypothesis. In the example above, the hypothesis h is encoded by the pair ( s , k ) , where k { 0 , 1 } determines which of the two hypotheses { h 0 , h 1 } is selected. The discrepancy between generalization in expectation and generalization in probability happens because k is added into the hypothesis.
Next, we use the chain rule in Theorem 1 to prove that uniform generalization, on the other hand, is a robust property of learning algorithms. More precisely, if k has a finite domain, then a hypothesis h generalizes uniformly in expectation if and only if the pair ( h , k ) generalizes uniformly in expectation. Hence, adding any finite amount of information (in bits) to a hypothesis cannot alter its uniform generalization property in a significant way.
Theorem 3.
Let $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm whose hypothesis is $h \in \mathcal{H}$. Let $k \in \mathcal{K}$ be a different hypothesis that is obtained from the same sample $s$. If $\hat z \sim s$, then:
$$J\big(\hat z; (h, k)\big) \le \Big( 2 + \frac{|\mathcal{K}|}{2} \Big) \cdot J(\hat z; h) + \frac{1}{2} \sqrt{\frac{\log |\mathcal{K}|}{m}}$$
Proof. 
The proof is in Appendix D. □
We use Theorem 3, next, to prove that uniform generalization in expectation implies generalization in probability. The proof is by contradiction. Suppose we have a hypothesis $h$ that generalizes uniformly in expectation but there exists a parametric loss $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ that does not generalize in probability. We will derive a contradiction from these two assumptions. We show that appending a little information to the hypothesis $h$ allows us to construct a different parametric loss that does not generalize in expectation, by recording whether the empirical risk w.r.t. $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ is greater than, approximately equal to, or less than the true risk w.r.t. the same loss. This is described in, at most, two bits. Knowing this additional information, we can define a new parametric loss that does not generalize in expectation, which contradicts the definition of uniform generalization.
Theorem 4.
Let $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm, whose risk is evaluated using a parametric loss $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$. Then:
$$p\Big( \big| R_s(h) - R(h) \big| \ge t \Big) \le \frac{7}{2 t} \Big( J(\hat z; h) + \sqrt{\frac{\log 3}{49\, m}} \Big),$$
where the probability is evaluated over the random choice of s and the internal randomness of L .
Proof. 
Let $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ be a parametric loss function and write:
$$\kappa(t) = p\Big( \big| R_s(h) - R(h) \big| \ge t \Big)$$
Consider the new pair of hypotheses $(h, k)$, where:
$$k = \begin{cases} +1, & \text{if } R_s(h) \ge R(h) + t \\ -1, & \text{if } R_s(h) \le R(h) - t \\ 0, & \text{otherwise} \end{cases}$$
Then, by Theorem 3, the uniform generalization risk in expectation for the composition of hypotheses $(h, k)$ is bounded by $\frac{7}{2} J(\hat z; h) + \frac{1}{2} \sqrt{\frac{\log 3}{m}}$. This holds uniformly across all parametric loss functions that satisfy the Markov chain $s \to (h, k) \to l(\cdot, (h, k))$. Next, consider the parametric loss:
$$l\big(z, (h, k)\big) = \begin{cases} l(z; h) & \text{if } k = +1 \\ 1 - l(z; h) & \text{if } k = -1 \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$
Note that $l(z, (h, k))$ is parametric with respect to the composition of hypotheses $(h, k)$. Using Equation (12), the generalization risk w.r.t. $l(z, (h, k))$ in expectation is, at least, as large as $t\, \kappa(t)$. Therefore, by Theorems 2 and 3, we have $t\, \kappa(t) \le \frac{7}{2} J(\hat z; h) + \frac{1}{2} \sqrt{\frac{\log 3}{m}}$, which is the statement of the theorem. (Note: The proof assumes that the loss function $l$ has access to the underlying distribution. This assumption is valid because the underlying distribution $p(z)$ is fixed and does not depend on any random outcomes, such as $s$ or $h$.) □
Theorem 4 reveals that uniform generalization is sufficient for concentration to hold. Importantly, the generalization bound depends on the learning algorithm L only via its variational information J ( z ^ ; h ) . Hence, by controlling the uniform generalization risk, one improves the generalization risk of L both in expectation and with a high probability.
The same proof technique used in Theorem 4 also implies the following concentration bound, which is useful when $I(s; h) = o(m)$, where $I(x; y)$ is the Shannon mutual information. The following bound is similar to the bound derived by [23] using properties of sub-Gaussian loss functions.
Proposition 3.
Let $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm, whose risk is evaluated using a parametric loss function $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$. Then:
$$p\Big( \big| R_s(h) - R(h) \big| \ge t \Big) \le \frac{1}{t} \sqrt{\frac{I(s; h) + 2}{2 m}}.$$
Proof. 
The proof is in Appendix E. □
Note that having a vanishing mutual information, i.e., $I(s; h) = o(m)$, which is the setting recently considered in the work of [23], is a strictly stronger condition than uniform generalization. For instance, we will later construct deterministic learning algorithms that generalize uniformly in expectation even though $I(s; h)$ is unbounded (see Theorem 8). By contrast, $I(s; h) = o(m)$ is sufficient for $J(\hat z; h) \to 0$ to hold.
Finally, we note that the concentration bound depends linearly on the variational information $J(\hat z; h)$. Typically, $J(\hat z; h) = O(1/\sqrt{m})$. By contrast, the VC bound provides an exponential decay in $m$ [3,17]. Can the concentration bound in Theorem 4 be improved? The following proposition answers this question in the negative.
Proposition 4.
For any rational $0 < t < 1$, there exist a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$, a distribution $p(z)$, and a parametric loss $l: \mathcal{Z} \times \mathcal{H} \to [0, 1]$ such that:
$$p\Big( \big| R_s(h) - R(h) \big| = t \Big) = \frac{J(\hat z; h)}{t},$$
where the probability is evaluated over the random choice of s and the internal randomness of L .
Proof. 
The proof is in Appendix F. □
Proposition 4 shows that, without making any additional assumptions beyond that of uniform generalization, the concentration bound in Theorem 4 is tight up to constant factors. Essentially, the only difference between the upper and the lower bounds is a vanishing $O(1/\sqrt{m})$ term that is independent of $\mathcal{L}$.

5. Properties of the Learning Capacity

In this section, we derive bounds on the learning capacity under various settings. We also describe some of its important properties.

5.1. Data Processing

The relationship between learning capacity and data processing is presented in Lemma 1. Given the random variables $x$, $y$, and $z$ and the Markov chain $x \to y \to z$, we always have $J(x; z) \le J(x; y)$. Hence, we have a partial order on learning algorithms. This presents us with an important qualitative insight into the design of machine learning algorithms.
Suppose we have two different hypotheses $h_1$ and $h_2$. We will say that $h_2$ contains less information than $h_1$ if the Markov chain $s \to h_1 \to h_2$ holds. For example, if the observations $z_i \in \{0, 1\}$ are Bernoulli trials, then $h_1 \in \mathbb{R}$ can be the empirical average as given in Example 1 while $h_2 \in \{0, 1\}$ can be the label that occurs most often in the training set. Because $h_2 = \mathbb{I}\{ h_1 \ge 1/2 \}$, the hypothesis $h_2$ contains strictly less information about the original training set than $h_1$. Formally, we have $s \to h_1 \to h_2$. In this case, $h_2$ enjoys a better uniform generalization bound because of data processing. Intuitively, we know that such a result should hold because $h_2$ is less dependent on the original training set than $h_1$. Hence, one can improve the uniform generalization bound (or equivalently the learning capacity) of a learning algorithm by post-processing its hypothesis $h$ in a manner that is conditionally independent of the original training set given $h$.
Example 6.
Post-processing hypotheses is a common technique in machine learning. This includes sparsifying the coefficient vector w R d in linear methods, where w j is set to zero if it has a small absolute magnitude. It also includes methods that have been proposed to reduce the number of support vectors in SVM by exploiting linear dependence [30], or some methods for decision tree pruning. By the data processing inequality, such techniques reduce the learning capacity and, as a consequence, mitigate the risk for overfitting.
Needless to mention, better generalization does not immediately translate into a smaller true risk. This is because the empirical risk itself may increase when the hypothesis h is post-processed independently of the original training sample.
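The effect of post-processing can be illustrated numerically. The sketch below (our own construction, not from the paper) compares $J(\hat z; h_1)$ for the empirical mean of Bernoulli observations with $J(\hat z; h_2)$ for the majority label $h_2 = \mathbb{I}\{h_1 \ge 1/2\}$, both computed exactly from the binomial distribution; the second value is never larger, as the data-processing inequality implies.

```python
# Data processing in action: J(z_hat; h2) <= J(z_hat; h1) when h2 coarsens h1.
from math import comb

def binom(k, phi, m):
    return comb(m, k) * phi**k * (1 - phi)**(m - k)

def J_h1(phi, m):            # h1 = empirical mean
    return sum(binom(k, phi, m) * abs(phi - k / m) for k in range(m + 1))

def J_h2(phi, m):            # h2 = I{h1 >= 1/2}, a post-processing of h1
    J = 0.0
    for keep in (lambda k: 2 * k >= m, lambda k: 2 * k < m):
        mass = sum(binom(k, phi, m) for k in range(m + 1) if keep(k))
        if mass > 0:
            p1 = sum(binom(k, phi, m) * k / m for k in range(m + 1) if keep(k)) / mass
            J += mass * abs(phi - p1)   # TV distance between two Bernoulli laws
    return J

phi, m = 0.3, 50
print(J_h1(phi, m), J_h2(phi, m))
```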

5.2. Effective Domain Size

Next, we look into how the size of the domain Z limits the learning capacity. First, we start with the following definition:
Definition 7
(Lazy Learning). A learning algorithm $\mathcal{L}$ is called lazy if the training sample $s \in \mathcal{Z}^m$ can be reconstructed perfectly from the hypothesis $h \in \mathcal{H}$. In other words, $H(s \mid h) = 0$, where $H$ is the Shannon entropy. Equivalently, the mapping from $s$ to $h$ is injective.
One common example of a lazy learner is instance-based learning when h = s . Despite their simple nature, lazy learners are useful in practice. They are useful theoretical tools as well. In particular, because of the fact that H ( s | h ) = 0 and the data processing inequality, the learning capacity of a lazy learner provides an upper bound to the learning capacity of any possible learning algorithm. Therefore, we can relate the learning capacity C ( L ) to the size of the domain Z by determining the learning capacity of lazy learners. Because the size of Z is usually infinite, we introduce the following definition of effective set size.
Definition 8.
In a countable space $\mathcal{Z}$ endowed with a probability mass function $p(z)$, the effective size of $\mathcal{Z}$ w.r.t. $p(z)$ is defined by: $\mathrm{Ess}_{p(z)}(\mathcal{Z}) \doteq 1 + \Big( \sum_{z \in \mathcal{Z}} \sqrt{p(z)\, (1 - p(z))} \Big)^2$.
At one extreme, if $p(z)$ is uniform over a finite alphabet $\mathcal{Z}$, then $\mathrm{Ess}_{p(z)}(\mathcal{Z}) = |\mathcal{Z}|$. At the other extreme, if $p(z)$ is a Kronecker delta distribution, then $\mathrm{Ess}_{p(z)}(\mathcal{Z}) = 1$. As proved next, this notion of effective set size determines the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense. As a result, it allows us to relate the learning capacity to a property of the domain $\mathcal{Z}$.
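The two extremes can be checked with a few lines of code (an illustrative sketch that evaluates Definition 8 as written above); the Bernoulli case below anticipates the de Moivre example discussed after Theorem 5.

```python
# Effective domain size Ess_p(Z) = 1 + (sum_z sqrt(p(z) * (1 - p(z))))^2.
import numpy as np

def effective_size(p):
    p = np.asarray(p, dtype=float)
    return 1.0 + np.sqrt(p * (1.0 - p)).sum() ** 2

K = 10
print(effective_size(np.full(K, 1.0 / K)))   # uniform: Ess = K = 10
print(effective_size([1.0, 0.0, 0.0]))       # point mass: Ess = 1
phi = 0.25
print(effective_size([phi, 1 - phi]))        # Bernoulli: 1 + 4*phi*(1 - phi) = 1.75
```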
Theorem 5.
Let $\mathcal{Z}$ be a countable space endowed with a probability mass function $p(z)$. Let $s$ be a set of $m$ i.i.d. observations $z_i \sim p(z)$. Define $p_s(z)$ to be the empirical probability mass function that results from drawing observations uniformly at random from $s$. Then:
$$\mathbb{E}_{s}\, \big\| p(z),\; p_s(z) \big\|_T = \sqrt{\frac{\mathrm{Ess}_{p(z)}[\mathcal{Z}] - 1}{2 \pi m}} + o(1/\sqrt{m}),$$
where Ess p ( z ) [ Z ] is the effective size of Z (see Definition 8).
Proof. 
The proof is in Appendix G. □
A special case of Theorem 5 was proved by de Moivre in the 1730s, who showed that the empirical mean of i.i.d. Bernoulli trials with a probability of success $\phi$ converges to the true mean with rate $\sqrt{2 \phi (1 - \phi) / (\pi m)}$. This is believed to be the first appearance of the square-root law in statistical inference in the literature [31]. Because the effective domain size of the Bernoulli distribution, according to Definition 8, is given by $1 + 4 \phi (1 - \phi)$, Theorem 5 agrees with, and in fact generalizes, de Moivre’s result.
Corollary 1.
Let $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm whose hypothesis is $h \in \mathcal{H}$. Then, $J(\hat z; h) \le \sqrt{\frac{\mathrm{Ess}_{p(z)}[\mathcal{Z}] - 1}{2 \pi m}} + o(1/\sqrt{m})$. Moreover, the bound is achieved by lazy learners.
Proof. 
Let $\tilde h$ be the hypothesis produced by a lazy learner. The simplest example is if $\tilde h$ is equal to the training sample $s$ itself. Then, we always have the Markov chain $s \to \tilde h \to h$ for any hypothesis $h \in \mathcal{H}$. Therefore, by the data processing inequality, we have $J(\hat z; h) \le J(\hat z; \tilde h)$. By Theorem 5, we have:
$$J(\hat z; \tilde h) = \sqrt{\frac{\mathrm{Ess}_{p(z)}[\mathcal{Z}] - 1}{2 \pi m}} + o(1/\sqrt{m})$$
Hence, the statement of the corollary follows. □
Corollary 2.
For any learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$, we have $C(\mathcal{L}) \le \sqrt{\frac{|\mathcal{Z}| - 1}{2 \pi m}} + o(1/\sqrt{m})$.
Proof. 
The function $f(p) = \sum_{z} \sqrt{p(z)\, (1 - p(z))}$ is both concave over the probability simplex and permutation-invariant. Hence, by symmetry, the maximum effective domain size must be achieved at the uniform distribution $p(z) = 1/|\mathcal{Z}|$, in which case $\mathrm{Ess}_{p(z)}[\mathcal{Z}] = |\mathcal{Z}|$. □

5.3. Finite Hypothesis Space

Next, we look into the role of the size of the hypothesis space. This is formalized by the following theorem.
Theorem 6.
Let $h \in \mathcal{H}$ be the hypothesis produced by a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$. Then:
$$C(\mathcal{L}) \le \sqrt{\frac{H(h)}{2 m}} \le \sqrt{\frac{\log |\mathcal{H}|}{2 m}},$$
where H is the Shannon entropy measured in nats.
Proof. 
If we let $I(x; y)$ be the mutual information between the random variables $x$ and $y$ and let $s = \{z_1, z_2, \ldots, z_m\}$ be the training set, we have:
$$I(s; h) = H(s) - H(s \mid h) = \sum_{i=1}^{m} H(z_i) - \Big[ H(z_1 \mid h) + H(z_2 \mid z_1, h) + \cdots \Big]$$
Because conditioning reduces entropy, i.e., $H(x \mid y) \le H(x)$ for any random variables $x$ and $y$, we have:
$$I(s; h) \ge \sum_{i=1}^{m} \big[ H(z_i) - H(z_i \mid h) \big] \ge m\, \big[ H(\hat z) - H(\hat z \mid h) \big]$$
Therefore:
$$I(\hat z; h) \le \frac{I(s; h)}{m} \qquad (13)$$
Next, we use Pinsker’s inequality [10], which states that for any probability measures $p$ and $q$: $\| p, q \|_T \le \sqrt{\frac{D(p\,\|\,q)}{2}}$, where $\| p, q \|_T$ is the total variation distance and $D(p\,\|\,q)$ is the Kullback–Leibler divergence measured in nats. If we recall that $J(s; h) = \| p(s)\, p(h),\; p(s, h) \|_T$ while the mutual information is $I(s; h) = D\big( p(s, h)\, \|\, p(s)\, p(h) \big)$, we deduce from Pinsker’s inequality and Equation (13):
$$J(\hat z; h) = \big\| p(\hat z)\, p(h),\; p(\hat z, h) \big\|_T \le \sqrt{\frac{I(\hat z; h)}{2}} \le \sqrt{\frac{I(s; h)}{2 m}} \le \sqrt{\frac{H(h)}{2 m}} \le \sqrt{\frac{\log |\mathcal{H}|}{2 m}}.$$
Theorem 6 re-establishes the classical PAC result for the finite hypothesis space setting. However, unlike its typical proofs, the proof presented here is purely information-theoretic and makes no reference to the union bound.
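Theorem 6 can be turned into a rough sample-size estimate: to drive the bound on $C(\mathcal{L})$ below a target $\epsilon$, it suffices to take $m \ge \log|\mathcal{H}| / (2\epsilon^2)$. The snippet below (an illustration with arbitrary numbers) evaluates this requirement.

```python
# Sample size needed so that sqrt(log|H| / (2m)) <= eps, i.e. m >= log|H|/(2*eps^2).
from math import log, ceil, sqrt

def sample_size(card_H, eps):
    return ceil(log(card_H) / (2 * eps ** 2))

for card_H, eps in [(1000, 0.1), (1000, 0.01), (10**6, 0.05)]:
    m = sample_size(card_H, eps)
    print(card_H, eps, m, sqrt(log(card_H) / (2 * m)))   # bound is now at most eps
```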

5.4. Differential Privacy

Randomization reduces the risk for overfitting. One common randomization technique in machine learning is differential privacy [32,33], which addresses the goal of obtaining useful information about the sample s as a whole without revealing a lot of information about any individual observation. Here, we show that differentially-private learning algorithms have small learning capacities.
Definition 9
([33]). A randomized learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ is $(\epsilon, \delta)$ differentially private if for any $O \subseteq \mathcal{H}$ and any two samples $s$ and $s'$ that differ in one observation only, we have:
$$p(h \in O \mid s) \le e^{\epsilon} \cdot p(h \in O \mid s') + \delta$$
Proposition 5.
If a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ is $(\epsilon, \delta)$ differentially private, then: $J(\hat z; h) \le (e^{\epsilon} - 1 + \delta)/2$.
Proof. 
The proof is in Appendix H. □
Not surprisingly, the differential privacy parameters ( ϵ , δ ) control the uniform generalization risk, where small values of ϵ and δ lead to a reduced risk for overfitting.
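Proposition 5 can be read as a privacy-to-generalization conversion. The short sketch below (illustrative values only) evaluates the bound $(e^{\epsilon} - 1 + \delta)/2$ for a few privacy budgets; for small $\epsilon$ it behaves like $\epsilon / 2$.

```python
# Uniform generalization bound implied by (eps, delta)-differential privacy.
from math import exp

def dp_uniform_generalization_bound(eps, delta):
    return (exp(eps) - 1.0 + delta) / 2.0

for eps, delta in [(0.1, 0.0), (0.1, 1e-6), (1.0, 1e-5)]:
    print(eps, delta, dp_uniform_generalization_bound(eps, delta))
# Stronger privacy (smaller eps, delta) gives a smaller capacity for overfitting.
```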

5.5. Empirical Risk Minimization of 0–1 Loss Classes

Empirical risk minimization (ERM) of a stochastic loss is a popular approach for learning from data. It is often regarded as the default strategy to use, due to its simplicity, generality, and statistical efficiency [1,3,13,34]. Given a fixed hypothesis space $\mathcal{H}$, a domain $\mathcal{Z}$, and a loss function $l: \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, the ERM learning rule selects the hypothesis $\hat h_s$ that minimizes the empirical risk:
$$\hat h_s = \arg\min_{h \in \mathcal{H}} L_s(h), \qquad L_s(h) = \frac{1}{|s|} \sum_{z_i \in s} l(z_i, h).$$
By contrast, the true risk minimizer $h^\ast$ is:
$$h^\ast = \arg\min_{h \in \mathcal{H}} L(h), \qquad L(h) = \mathbb{E}_{z \sim p(z)}\big[ l(z, h) \big].$$
Hence, learning via ERM is justified if $L(\hat h_s) \le L(h^\ast) + \epsilon$ for some $\epsilon \ll 1$. If such a condition holds and $\epsilon \to 0$ as the sample size $m$ increases, the ERM learning rule is called consistent.
Uniform generalization is a sufficient condition for the consistency of empirical risk minimization (ERM). To see this, we have by definition:
$$\mathbb{E}_{s}\big[ L_s(\hat h_s) \big] = \mathbb{E}_{s}\Big[ \min_{h \in \mathcal{H}} L_s(h) \Big] \le \min_{h \in \mathcal{H}} \mathbb{E}_{s}\big[ L_s(h) \big] = \min_{h \in \mathcal{H}} L(h) = R(h^\ast),$$
From this, we conclude that:
$$\mathbb{E}_{s}\big[ R(\hat h_s) - R(h^\ast) \big] \le \mathbb{E}_{s}\big[ R(\hat h_s) \big] - \mathbb{E}_{s}\big[ L_s(\hat h_s) \big] \le C(\mathcal{L}),$$
where $C(\mathcal{L})$ is the learning capacity of the empirical risk minimization rule. The last inequality follows from Theorem 2. In addition, because $R(\hat h_s) - R(h^\ast) \ge 0$, we have by the Markov inequality:
$$p_s\Big( R(\hat h_s) - R(h^\ast) \ge t \Big) \le \frac{\mathbb{E}_{s}\big[ R(\hat h_s) - R(h^\ast) \big]}{t} \le \frac{C(\mathcal{L})}{t}$$
Hence, the ERM learning rule is consistent if $C(\mathcal{L}) \to 0$ as $m \to \infty$. Next, we describe when such a condition on $C(\mathcal{L})$ holds for 0–1 loss classes. To do that, we begin with two familiar definitions from statistical learning theory.
Definition 10
(Shattered Set). Given a domain $\mathcal{Z}$, a hypothesis space $\mathcal{H}$, and a 0–1 loss function $l: \mathcal{Z} \times \mathcal{H} \to \{0, 1\}$, a set $\{z_1, \ldots, z_d\}$ is said to be shattered by $\mathcal{H}$ with respect to the function $l$ if for any labeling $I \in \{0, 1\}^d$, there exists a hypothesis $h_I \in \mathcal{H}$ such that $\big( l(z_1, h_I), \ldots, l(z_d, h_I) \big) = I$.
Example 7.
Let $\mathcal{Z} = \mathcal{H} = \mathbb{R}$ and let the loss function be $l(z, h) = \mathbb{I}\{ z - h \ge 0 \}$. Then, any singleton set $\{z\}$ is shattered by $\mathcal{H}$ since we always have the two hypotheses $h_0 = z - 1$ and $h_1 = z + 1$. However, no set of two points in $\mathcal{Z}$ can be shattered by $\mathcal{H}$. By contrast, if the hypothesis is a pair $(h, c) \in \mathbb{R} \times \mathbb{R}$ and the loss function is $l(z, (h, c)) = \mathbb{I}\{ c\, z - h \ge 0 \}$, then any set of two distinct examples $\{z_1, z_2\}$ is shattered by the hypothesis space.
Definition 11
(VC Dimension). The VC dimension of a hypothesis space $\mathcal{H}$ with respect to a domain $\mathcal{Z}$ and a 0–1 loss $l: \mathcal{Z} \times \mathcal{H} \to \{0, 1\}$ is the maximum cardinality of a set of points in $\mathcal{Z}$ that can be shattered by $\mathcal{H}$ with respect to $l$.
The VC dimension is arguably the most fundamental concept in statistical learning theory because it provides a crisp characterization of learnability for 0–1 loss classes. Next, we show that the VC dimension has, in fact, an equivalence characterization with the learning capacity C ( L ) . Specifically, under the Axiom of Choice, an ERM learning rule exists that has a vanishing learning capacity C ( L ) if and only if the 0–1 loss class has a finite VC dimension.
Before we establish this important result, we describe why ERM by itself is not sufficient for uniform generalization to hold even when the hypothesis space has a finite VC dimension.
Proposition 6.
For any sample size $m \ge 1$ and a positive constant $\epsilon > 0$, there exist a hypothesis space $\mathcal{H}$, a domain $\mathcal{Z}$, and a 0–1 loss $l: \mathcal{Z} \times \mathcal{H} \to \{0, 1\}$ such that: (1) $\mathcal{H}$ has a VC dimension $d = 1$, and (2) a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ exists that outputs an empirical risk minimizer $\hat h_s$ with $J(\hat z; \hat h_s) \ge 1 - \epsilon$.
Proof. 
Let Z = X × Y , where X = [ 0 , 1 ] and Y = { + 1 , 1 } and let the loss be l ( x , y , h ) = I { y · ( x h ) 0 } . In other words, the goal is to learn a threshold in the unit interval that separates the positive from the negative examples. Let x X be uniformly distributed in [ 0 , 1 ] and let h be an error-free separator. Then, for any training sample s Z m , the set of all empirical risk minimizers H ^ is:
H ^ = h [ 0 , 1 ] : y i = sign ( x i h ) , i { 1 , , m }
In particular, H ^ is an interval, which has the power of the continuum, so it can be used to encode the entire training sample.
Fix δ > 0 in advance, which can be made arbitrarily small. Then, the probability over the random choice of the sample that | H ^ | < δ can be made arbitrarily small for a sufficiently small δ > 0 , where | H ^ | is the length of the interval.
Let ĥ ∈ Ĥ be a hypothesis that lies at the middle of Ĥ, i.e.,
ĥ = ½ ( max_{x_i ∈ s : y_i = −1} x_i + min_{x_i ∈ s : y_i = +1} x_i )
Let k = 1 + log₂(1/δ). Then, [ ĥ − 2^{−k}, ĥ + 2^{−k} ] ⊆ Ĥ holds with a high probability (which can be made arbitrarily close to 1 for a sufficiently small δ). Let h̃ be a hypothesis whose binary expansion agrees with ĥ in its first k + 1 bits and encodes the entire training sample in the rest of the bits.
Finally, the output of the learning algorithm is h ^ s , which is given by the following rule:
  • If h ˜ is an empirical risk minimizer, then set h ^ s = h ˜
  • Otherwise, set h ^ s = h ^ .
Now, define a different parametric loss l′ : Z × H → [0, 1] to be a function that first uses ĥ_s to decode the training sample s based on the coding method constructed above and, then, assigns the value 1 to an observation z = (x, y) if and only if x appears in the decoded sample. To reiterate, this decoding succeeds with a probability that can be made arbitrarily high for a sufficiently small δ > 0. Clearly, l′ is a loss defined on the product space Z × H and has a bounded range. However, the generalization risk w.r.t. l′ is, at least, equal to the probability that the decoding succeeds, which can be made arbitrarily close to 1. Hence, the statement of the proposition holds. □
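The encoding device used in this proof can be sketched in a few lines, assuming the training sample has already been discretized into a finite bit string; exact rational arithmetic stands in for the real-valued hypothesis, and the helper names and toy payload are illustrative only.

```python
from fractions import Fraction

def encode(h_hat: Fraction, k: int, bits: str) -> Fraction:
    """Return a number agreeing with h_hat on its first k+1 fractional bits,
    with `bits` stored in the positions below them."""
    scale = 2 ** (k + 1)
    truncated = Fraction(int(h_hat * scale), scale)           # keep the first k+1 bits of h_hat
    payload = Fraction(int(bits, 2), 2 ** len(bits)) / scale  # push the payload below bit k+1
    return truncated + payload

def decode(h_tilde: Fraction, k: int, n_bits: int) -> str:
    """Recover the payload stored below the first k+1 fractional bits."""
    scale = 2 ** (k + 1)
    frac = h_tilde * scale - int(h_tilde * scale)
    return format(int(frac * 2 ** n_bits), f"0{n_bits}b")

h_hat = Fraction(73, 128)    # an arbitrary midpoint hypothesis
payload = "10110011"         # a discretized "training sample"
k = 4
h_tilde = encode(h_hat, k, payload)
assert abs(h_tilde - h_hat) <= Fraction(1, 2 ** k)   # h_tilde stays inside [h_hat - 2^-k, h_hat + 2^-k]
assert decode(h_tilde, k, len(payload)) == payload   # the sample is fully recoverable from h_tilde
```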
Proposition 6 shows that one cannot obtain a non-trivial bound on the uniform generalization risk of an ERM learning rule in terms of the VC dimension d and the sample size m without making some additional assumptions. Next, we prove that an ERM learning rule exists that satisfies the uniform generalization property if the hypothesis space has a finite VC dimension. We begin by recalling a fundamental result in modern set theory. A non-empty set Q is said to be well-ordered if Q is endowed with a total order ⪯ such that every non-empty subset of Q contains a least element. The following fundamental result, which was published in 1904, is due to Ernst Zermelo [35].
Theorem 7
(Well-Ordering Theorem). Under the Axiom of Choice, every non-empty set can be well-ordered.
Theorem 8.
Given a hypothesis space H, a domain Z, and a 0–1 loss l : H × Z → {0, 1}, let ⪯ be a well-ordering on H and let L : Z^m → H be the learning rule that outputs the "least" empirical risk minimizer for the training sample s ∈ Z^m according to ⪯. Then, C(L) → 0 as m → ∞ if H has a finite VC dimension. In particular:
C(L) ≤ 3/√m + √( (1 + d·log(2em/d)) / m ),
where d is the VC dimension of H, provided that m ≥ d.
Proof. 
The proof is in Appendix I. □
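For a sense of scale, the upper bound in Theorem 8 can be evaluated numerically; the code below assumes the form 3/√m + √((1 + d·log(2em/d))/m) stated above, and the values of m and d are arbitrary.

```python
import math

def capacity_bound(m: int, d: int) -> float:
    # Theorem 8 bound: C(L) <= 3/sqrt(m) + sqrt((1 + d*log(2em/d)) / m), for m >= d.
    assert m >= d
    return 3 / math.sqrt(m) + math.sqrt((1 + d * math.log(2 * math.e * m / d)) / m)

for m in [100, 10_000, 1_000_000]:
    print(m, round(capacity_bound(m, d=5), 4))   # decays roughly like sqrt(d * log(m) / m)
```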
Next, we prove a converse statement. Before we do this, we present a learning problem that shows why a converse to Theorem 8 is not generally possible without making some additional assumptions. Hence, our converse will be later established for the binary classification setting only.
Example 8
(Subset Learning Problem). Let Z = {1, 2, 3, …, d} be a finite set of positive integers. Let H = 2^Z and define the 0–1 loss of a hypothesis h ∈ H to be l(z, h) = I{ z ∉ h }. Then, the VC dimension is d. However, the learning rule that outputs h = Z is always an ERM learning rule that generalizes uniformly with rate ϵ = 0 regardless of the sample size and the distribution of observations.
The previous example shows that a converse to Theorem 8 is not generally possible without making some additional assumptions. In particular, in the Subset Learning Problem, the VC dimension is not an accurate measure of the complexity of the hypothesis space H because many hypotheses dominate others (i.e., perform better across all distributions of observations). For example, the hypothesis h = {1, 2, 3} dominates h′ = {1} because there is no distribution on observations in which h′ outperforms h. In fact, the hypothesis h = Z dominates all other hypotheses.
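Both claims of Example 8 are easy to verify by brute force for a small d, assuming the loss l(z, h) = I{z ∉ h} as written above.

```python
from itertools import combinations

d = 4
Z = list(range(1, d + 1))
hypotheses = [set(c) for r in range(d + 1) for c in combinations(Z, r)]   # H = 2^Z

def loss(z, h):
    return int(z not in h)    # the 0-1 loss of the Subset Learning Problem

# h = Z incurs zero loss on every observation, so it is an ERM for every sample
# and its output does not depend on the data at all.
assert all(loss(z, set(Z)) == 0 for z in Z)

# The full set Z of size d is shattered, so the VC dimension is d.
realized = {tuple(loss(z, h) for z in Z) for h in hypotheses}
print(len(realized) == 2 ** d)   # True
```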
Consequently, in order to prove a lower bound for all ERM rules, we focus on the standard binary classification setting.
Theorem 9.
In any fixed domain Z = X × Y, let the hypothesis space H be a concept class on X and let l(x, y, h) = I{ y ≠ h(x) } be the misclassification error. Then, any ERM learning rule L w.r.t. l has a learning capacity C(L) that is bounded from below by C(L) ≥ ½ (1 − 1/d)^m, where m is the training sample size and d is the VC dimension of H.
Proof. 
The proof is in Appendix J. □
Using both Theorems 8 and 9, we arrive at the following equivalence characterization of the VC dimension of a concept class with the learning capacity.
Theorem 10.
Given a fixed domain Z = X × Y, let the hypothesis space H be a concept class on X and let l(x, y, h) = I{ y ≠ h(x) } be the misclassification error. Let m be the sample size. Then, the following statements are equivalent under the Axiom of Choice:
  • H admits an ERM learning rule L whose learning capacity C(L) satisfies C(L) → 0 as m → ∞.
  • H has a finite VC dimension.
Proof. 
The lower bound in Theorem 9 holds for all ERM learning rules. Hence, an ERM learning rule that generalizes uniformly with a vanishing rate across all distributions exists only if H has a finite VC dimension. Conversely, under the Axiom of Choice, H can always be well-ordered by Theorem 7, so, by Theorem 8, a finite VC dimension is also sufficient to guarantee the existence of an ERM learning rule that generalizes uniformly with a vanishing rate. □
Theorem 10 presents a characterization of the VC dimension in terms of information theory. According to the theorem, an ERM learning rule can be constructed that does not encode the training sample if and only if the hypothesis space has a finite VC dimension.
Remark 2.
One method of constructing a well-ordering on a hypothesis space H is to use the fact that computers operate with finite precision. Hence, in practice, every hypothesis space is enumerable, so the usual ordering of the integers induces a valid well-ordering on H.
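A minimal sketch of such a tie-breaking ERM rule for an enumerable hypothesis space is given below; the hypothesis grid, loss, and data are illustrative placeholders.

```python
def erm_least(sample, hypotheses, loss):
    """Return the first empirical risk minimizer in the enumeration order,
    which plays the role of the 'least' hypothesis under the well-ordering."""
    def emp_risk(h):
        return sum(loss(z, h) for z in sample) / len(sample)
    best = min(range(len(hypotheses)), key=lambda i: (emp_risk(hypotheses[i]), i))
    return hypotheses[best]

# Example: thresholds on a grid with the 0-1 loss on labeled pairs z = (x, y).
thresholds = [i / 10 for i in range(11)]
data = [(0.2, -1), (0.8, +1)]
zero_one = lambda z, h: int((1 if z[0] - h >= 0 else -1) != z[1])
print(erm_least(data, thresholds, zero_one))   # 0.3: the first threshold with zero empirical risk
```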

6. Concluding Remarks

In this paper, we introduced the notion of “ learning capacity” for algorithms that learn from data, which is analogous to the Shannon capacity of communication channels. Learning capacity is an information-theoretic quantity that measures the contribution of a single training example to the final hypothesis. It has three equivalent interpretations: (1) as a tight upper bound on the uniform generalization risk, (2) as a measure of information leakage, and (3) as a measure of algorithmic stability. Furthermore, by establishing a chain rule for learning capacity, concentration bounds were derived, which revealed that the learning capacity controlled both the expectation of the generalization risk and its variance. Moreover, the relationship between algorithmic stability and data processing revealed that algorithmic stability can be improved by post-processing the learned hypothesis.
Throughout this paper, we provided several bounds on the learning capacity under various settings. For instance, we established a relationship between algorithmic stability and the effective size of the domain of observations, which can be interpreted as a formal justification for dimensionality reduction methods. Moreover, we showed how learning capacity recovered classical bounds, such as in the finite hypothesis space setting, and derived new bounds for other settings as well, such as differential privacy. We also established that, under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule for 0–1 loss classes that had a vanishing learning capacity was equivalent to the assertion that the hypothesis space had a finite Vapnik–Chervonenkis (VC) dimension, thus establishing an equivalence relation between two of the most fundamental concepts in statistical learning theory and information theory.
More generally, the intent of this work is to bring to light a new information-theoretic approach for analyzing machine learning algorithms. Despite the fact that "uniform generalization" might appear to be a strong condition at first sight, one of the central claims of this paper is that uniform generalization is, in fact, a natural condition that arises commonly in practice. It is not a condition to require or enforce! We believe this holds because any learning algorithm is a channel from the space of training samples to the hypothesis space. Because learning is a mapping between two spaces, its risk of overfitting should be determined from the mapping itself (i.e., independently of the choice of the loss function). Such an approach yields the uniform generalization bounds that are derived in this paper.
It is worth highlighting that uniform generalization bounds can be established for many other settings that have not been discussed in this paper, and the framework has already found some promising applications. Using sample compression schemes, one can show that any learnable hypothesis space is also learnable by an algorithm that achieves uniform generalization [36]. Also, generalization bounds for stochastic convex optimization yield information criteria for model selection that can outperform the popular Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC) [37]. More recently, uniform generalization has inspired the development of new approaches to structured regression as well [38].

7. Further Research Directions

Before we conclude, we suggest future directions of research and list some open problems.

7.1. Induced VC Dimension

The variational information J ( z ^ ; h ) provides an upper bound on the generalization risk of the learning algorithm L across all parametric loss classes. This upper bound is achievable by the generalization risk of the binary reconstruction loss:
l(z, h) = I{ p(z ∈ s | h) ≥ p(z ∈ s) },
which assigns the value one to observations z ∈ Z that are more likely to have been present in the training sample s upon knowing h, and assigns zero otherwise. In expectation, the generalization risk of this parametric loss is the worst generalization risk across all parametric loss classes.
Let both p ( z ) and p ( h | z ) be fixed; the first is the distribution of observations while the second is entirely determined by the learning algorithm L . Then, because the loss in Equation (16) is binary, it has a VC dimension, which we will call the induced VC dimension of the learning algorithm L [39]. Note that this induced VC dimension is defined for all learning problems, including regression and clustering, but it is distribution-dependent, which is quite unlike the traditional VC dimension of hypothesis spaces.
There are many open questions related to the induced VC dimension of learning algorithms. For instance, while a finite VC dimension implies a small variational information, when does the converse also hold? Can we obtain a non-trivial bound on the induced VC dimension of a learning algorithm L upon knowing its uniform generalization risk J(ẑ; h)? Along similar lines, suppose that L is an empirical risk minimization (ERM) algorithm for a 0–1 loss class that may or may not use an appropriate tie-breaking rule (in light of what was discussed in Section 5.5). Is there a non-trivial relation between the VC dimension of the 0–1 loss that is being minimized and the induced VC dimension of the ERM learning algorithm?

7.2. Unsupervised Model Selection

Information criteria (such as AIC and BIC) are sometimes used in the unsupervised learning setting for model selection, such as when determining the value of k in the popular k-means algorithm [40]. Given that the notion of uniform generalization is developed in the general setting of learning, should the learning capacity C(L) serve as a model selection criterion in the unsupervised setting? Why or why not?

7.3. Effective Domain Size

The effective size of the domain of a random variable z in Definition 8 satisfies some intuitive properties and violates others. For instance, it reduces to the size of the domain | Z | when the distribution is uniform. Moreover, if z is Bernoulli, the effective domain size is determined by the variance of the Bernoulli distribution. Importantly, this notion is well-motivated because it determines the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense. As a result, it allowed us to relate the learning capacity to a property of the domain Z .
However, such a notion of effective domain size has some surprising properties. For instance, the effective size of the domain of two independent random variables is not equal to the product of the effective sizes of the individual domains! In rate distortion theory, a similar phenomenon is observed. Reference [10] explains this observation by stating that "rectangular grid points (arising from independent descriptions) do not fill up the space efficiently." Can the effective domain size in Definition 8 be motivated using rate distortion theory?
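The non-multiplicativity is easy to check numerically from the asymptotic constant Σ_z √(p_z(1 − p_z)) that appears in the derivation of Appendix G; this constant is used here only as a proxy for the effective domain size of Definition 8, which is not restated in this section.

```python
import math
from itertools import product

def tv_rate_constant(p):
    # Asymptotic constant from Appendix G: E || p, p_s ||_T ~ constant / sqrt(2*pi*m).
    return sum(math.sqrt(q * (1 - q)) for q in p)

p1 = [0.5, 0.5]                                # a fair coin
p2 = [0.5, 0.5]                                # an independent fair coin
joint = [a * b for a, b in product(p1, p2)]    # their product distribution on 4 outcomes

print(tv_rate_constant(p1) * tv_rate_constant(p2))   # 1.0
print(tv_rate_constant(joint))                        # ~1.73: not the product of the two constants
```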

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Lemma 2

Without loss of generality, assume that all domains are enumerable. We have:
J(x; (y, z)) = 1 − Σ_{x,y,z} min{ p(x)·p(y, z), p(x, y, z) } = 1 − Σ_x p(x) Σ_{y,z} min{ p(y, z), p(y, z | x) }
However, the minimum of the sums is always at least the sum of the minimums. That is:
min{ Σ_i α_i, Σ_i β_i } ≥ Σ_i min{ α_i, β_i }
Using the marginalization identity p(x) = Σ_y p(x, y) and the above inequality, we obtain:
J(x; (y, z)) ≥ 1 − Σ_x p(x) Σ_y min{ Σ_z p(y, z), Σ_z p(y, z | x) } = 1 − Σ_x p(x) Σ_y min{ p(y), p(y | x) } = J(x; y)

Appendix B. Proof of Theorem 1

We first prove the inequality for k = 2. By definition:
J ( z ; ( h 1 , h 2 ) ) = | | p ( z , h 1 , h 2 ) , p ( z ) p ( h 1 , h 2 ) | | T
Using the fact that the total variation distance is related to the ℓ₁ distance by ‖P, Q‖_T = ½‖P − Q‖₁, we have:
J(z; (h₁, h₂)) = ½ ‖ p(z, h₁, h₂) − p(z)·p(h₁, h₂) ‖₁ = ½ ‖ p(z, h₁)·p(h₂ | z, h₁) − p(z)·p(h₁)·p(h₂ | h₁) ‖₁ = ½ ‖ [ p(z, h₁) − p(z)·p(h₁) ]·p(h₂ | h₁) + p(z, h₁)·[ p(h₂ | z, h₁) − p(h₂ | h₁) ] ‖₁
Using the triangle inequality:
J(z; (h₁, h₂)) ≤ ½ ‖ [ p(z, h₁) − p(z)·p(h₁) ]·p(h₂ | h₁) ‖₁ + ½ ‖ p(z, h₁)·[ p(h₂ | z, h₁) − p(h₂ | h₁) ] ‖₁
The above inequality is interpreted by expanding the ℓ₁ distance into a sum of absolute values of terms in the product space Z × H₁ × H₂, where h_k ∈ H_k. Next, we bound each term on the right-hand side separately. For the first term, we note that:
½ ‖ [ p(z, h₁) − p(z)·p(h₁) ]·p(h₂ | h₁) ‖₁ = ½ ‖ p(z, h₁) − p(z)·p(h₁) ‖₁ = J(z; h₁)
The equality holds by expanding the ℓ₁ distance and using the fact that Σ_{h₂} p(h₂ | h₁) = 1.
However, the second term can be re-written as:
½ ‖ p(z, h₁)·[ p(h₂ | z, h₁) − p(h₂ | h₁) ] ‖₁ = ½ ‖ p(h₁)·[ p(h₂, z | h₁) − p(z | h₁)·p(h₂ | h₁) ] ‖₁ = E_{h₁} ‖ p(h₂, z | h₁), p(z | h₁)·p(h₂ | h₁) ‖_T = J(z; h₂ | h₁)
Combining Equations (A1) and (A2) yields the inequality:
J(z; (h₁, h₂)) ≤ J(z; h₁) + J(z; h₂ | h₁)
Next, we use Equation (A3) to prove the general statement for all k ≥ 1 by writing:
J(z; (h₁, …, h_k)) ≤ J(z; h_k | (h₁, …, h_{k−1})) + J(z; (h₁, …, h_{k−1}))
Repeating the same inequality on the last term on the right-hand side yields the statement of the theorem.

Appendix C. Proof of Proposition 1

By the triangle inequality:
J(x; z | y) = E_y ‖ p(x | y)·p(z | y), p(x, z | y) ‖_T = E_{x,y} ‖ p(z | y), p(z | x, y) ‖_T ≤ E_{x,y} ‖ p(z | y), p(z) ‖_T + E_{x,y} ‖ p(z), p(z | x, y) ‖_T = E_y ‖ p(z | y), p(z) ‖_T + E_{x,y} ‖ p(z), p(z | x, y) ‖_T = J(y; z) + J(z; (x, y))
Therefore:
J(z; (x, y)) ≥ J(x; z | y) − J(y; z)
Combining this with the following chain rule of Theorem 1:
J(z; (x, y)) ≤ J(x; z | y) + J(y; z)
yields:
| J(z; (x, y)) − J(x; z | y) | ≤ J(y; z)
Or equivalently:
| J(x; (y, z)) − J(x; z | y) | ≤ J(x; y)
To prove the other inequality, we use Lemma 2. We have:
J(x; y) ≤ J(x; (y, z)) ≤ J(x; y) + J(x; z | y),
where the first inequality follows from Lemma 2 and the second inequality follows from the chain rule. Thus, we obtain the desired bound:
| J(x; (y, z)) − J(x; y) | ≤ J(x; z | y)
Both Equations (A4) and (A5) imply that the chain rule is tight. More precisely, the inequality can be made arbitrarily close to an equality when one of the two terms in the upper bound is chosen to be arbitrarily close to zero.

Appendix D. Proof of Theorem 3

We will use the following fact:
Fact 1.
Let f : X → [0, 1] be a function with a bounded range in the interval [0, 1]. Let p₁(x) and p₂(x) be two different probability measures defined on the same space X. Then:
| E_{x∼p₁(x)} f(x) − E_{x∼p₂(x)} f(x) | ≤ ‖ p₁(x), p₂(x) ‖_T
First Setting: We first consider the following scenario. Suppose a learning algorithm L produces a hypothesis h ∈ H from some marginal distribution p(h) independently of the training sample s. Afterwards, L produces a second hypothesis k ∈ K according to p(k | h, s). In other words, k depends on both h and s but the latter two random variables are independent of each other. Under this scenario, we have:
J ( z ^ ; ( h , k ) ) = J ( z ^ ; k | h ) ,
where the equality follows from the chain rule in Theorem 1, the statement of Proposition 1, and the fact that J ( z ^ ; h ) = 0 .
The conditional variational information is written as:
J ( z ^ ; k | h ) = E h | | p ( z ^ ) · p ( k | h ) , p ( z ^ , k | h ) | | T ,
where we used the fact that p ( z ^ | h ) = p ( z ^ ) . By marginalization:
p(k | h) = E_{ẑ | h}[ p(k | ẑ, h) ] = E_{ẑ ∼ p(z)}[ p(k | ẑ, h) ]
Similarly:
p ( z ^ , k | h ) = p ( z ^ | h ) · p ( k | z ^ , h ) = p ( z ^ ) · p ( k | z ^ , h )
Therefore:
J ( z ^ ; k | h ) = E h E z ^ | | E z ^ [ p ( k | z ^ , h ) ] , p ( k | z ^ , h ) | | T
Next, we note that since h is independent of the sample s, the variational information between ẑ ∈ s and k ∈ K can be bounded using Theorem 6. This follows because h is selected independently of the sample s, and, hence, the i.i.d. property of the observations z_i continues to hold. Therefore, we obtain:
E_h E_ẑ ‖ E_ẑ[ p(k | ẑ, h) ], p(k | ẑ, h) ‖_T ≤ √( log|K| / (2m) )
Because p ( k | z ^ , h ) is arbitrary in our derivation, the above bound holds for any distribution of observations p ( z ) , any distribution p ( h ) , and any family of conditional distributions p ( k | z ^ , h ) .
Original Setting: Next, we return to the original setting where both h ∈ H and k ∈ K are chosen according to the training sample s. We have:
J(ẑ; k | h) = E_h ‖ p(ẑ | h)·p(k | h), p(ẑ, k | h) ‖_T = E_{h,ẑ} ‖ p(k | h), p(k | ẑ, h) ‖_T = E_{h,ẑ} ‖ E_{ẑ | h}[ p(k | ẑ, h) ], p(k | ẑ, h) ‖_T ≤ E_{h,ẑ} ‖ E_{ẑ | h}[ p(k | ẑ, h) ], E_ẑ[ p(k | ẑ, h) ] ‖_T + E_{h,ẑ} ‖ E_ẑ[ p(k | ẑ, h) ], p(k | ẑ, h) ‖_T
In the last line, we used the triangle inequality.
Next, we would like to bound the first term. Using the fact that the total variation distance is related to the ℓ₁ distance by ‖p, q‖_T = ½‖p − q‖₁, we have:
E_{h,ẑ} ‖ E_{ẑ | h}[ p(k | ẑ, h) ], E_ẑ[ p(k | ẑ, h) ] ‖_T = E_h ‖ E_{ẑ | h}[ p(k | ẑ, h) ], E_ẑ[ p(k | ẑ, h) ] ‖_T = ½ E_h Σ_{k ∈ K} | E_{ẑ | h}[ p(k = k | ẑ, h) ] − E_ẑ[ p(k = k | ẑ, h) ] | ≤ ½ Σ_{k ∈ K} E_h ‖ p(ẑ | h), p(ẑ) ‖_T = ½ Σ_{k ∈ K} J(ẑ; h) = (|K| / 2)·J(ẑ; h)
Here, the inequality follows from Fact 1.
Next, we bound the second term in Equation (A7). Using Fact 1 and our earlier result in Equation (A6):
E_{h,ẑ} ‖ E_ẑ[ p(k | ẑ, h) ], p(k | ẑ, h) ‖_T ≤ J(ẑ; h) + E_h E_ẑ ‖ E_ẑ[ p(k | ẑ, h) ], p(k | ẑ, h) ‖_T ≤ J(ẑ; h) + √( log|K| / (2m) )
Combining all results in Equations (A7)–(A9):
J(ẑ; k | h) ≤ ( 1 + |K|/2 )·J(ẑ; h) + √( log|K| / (2m) )
This, along with the chain rule, implies the statement of the theorem.

Appendix E. Proof of Proposition 3

Let I ( x ; y ) denote the mutual information between x and y and let H ( x ) denote the Shannon entropy of the random variable x measured in nats (i.e., using natural logarithms). As before, we write s = ( z 1 , , z m ) . We have:
I(s; (h, k)) = H(s) − H(s | h, k) = Σ_{i=1}^{m} H(z_i) − Σ_{i=1}^{m} H(z_i | h, k, z₁, …, z_{i−1}) ≥ Σ_{i=1}^{m} [ H(z_i) − H(z_i | h, k) ] = m·I(ẑ; (h, k))
The second line is the chain rule for entropy and the third line follows from the fact that conditioning reduces entropy. We obtain:
I(ẑ; (h, k)) ≤ I(s; (h, k)) / m
By Pinsker’s inequality:
J(ẑ; (h, k)) ≤ √( I(ẑ; (h, k)) / 2 ) ≤ √( I(s; (h, k)) / (2m) )
Using the chain rule for mutual information:
J(ẑ; (h, k)) ≤ √( I(s; (h, k)) / (2m) ) = √( ( I(s; h) + I(s; k | h) ) / (2m) ) ≤ √( ( I(s; h) + H(k) ) / (2m) ) ≤ √( ( I(s; h) + log|K| ) / (2m) )
The desired bound follows by applying the proof technique of Theorem 4 to the last uniform generalization bound, and using the fact that log 3 < 2.

Appendix F. Proof of Proposition 4

Before we prove the statement of the theorem, we begin with the following lemma:
Lemma A1.
Let the observation space Z be the interval [0, 1], where p(z) is continuous in [0, 1]. Let h ⊆ s with |h| = k be a set of k examples picked at random without replacement from the training sample s. Then J(ẑ; h) = k/m.
Proof. 
First, we note that p(ẑ | h) is a mixture of two distributions: one that is uniform over h with probability k/m, and the original distribution p(z) with probability 1 − k/m. By Jensen's inequality, we have J(ẑ; h) ≤ k/m. Second, let the parametric loss be l(z; h) = I{ z ∈ h }. Then, |R_gen(L)| = k/m. By Theorem 2, we have J(ẑ; h) ≥ |R_gen(L)| = k/m. Both bounds imply the statement of the lemma. □
Now, we prove Proposition 4. Consider the setting where Z = [0, 1] and suppose that the observations z ∈ Z have a continuous marginal distribution. Because t is a rational number, let the sample size m be chosen such that k = t·m is an integer.
Let s = {z₁, …, z_m} be the training set, and let the hypothesis h be given by h = {z₁, …, z_k} with some probability δ > 0 and h = {} otherwise. Here, the k instances z_i ∈ h are picked uniformly at random without replacement from the sample s. To determine the variational information between ẑ and h, we consider the two cases:
  • If h ≠ {}, then ‖ p(ẑ), p(ẑ | h) ‖_T = t, as proved in Lemma A1. This happens with probability δ by design.
  • Otherwise, p ( z ^ | h ) = p ( z ^ ) . Thus: | | p ( z ^ ) , p ( z ^ | h ) | | T = 0 .
So, by combining the two cases above, we deduce that:
J(ẑ; h) = E_h ‖ p(ẑ), p(ẑ | h) ‖_T = t·δ.
Therefore, L generalizes uniformly with the rate t·δ. Next, let the parametric loss be given by l(z; h) = I{ z ∈ h }. With this loss:
p{ | R_s(h) − R(h) | = t } = δ = J(ẑ; h) / t,
which is the statement of the proposition.

Appendix G. Proof of Theorem 5

Because Z is countable, we will assume without loss of generality that Z = {1, 2, 3, …}, and we will write p_z = p(ẑ = z) to denote the marginal distribution of observations. Since all lazy learners are equivalent, we will look into the lazy learner whose hypothesis h is equal to the training sample s itself up to a permutation. Let m_z denote the number of times z ∈ Z was observed in the training sample. Note that p(ẑ = z | h) = p_s(z), and so J(ẑ; h) = E_s ‖ p(z), p_s(z) ‖_T.
We have:
p(h) = p(s) = [ m! / (m₁! m₂! ⋯) ] · p₁^{m₁} p₂^{m₂} ⋯
Using the relation ‖p, q‖_T = ½‖p − q‖₁ for any two probability distributions p and q, we obtain:
E_h ‖ p(ẑ) − p(ẑ | h) ‖₁ = Σ_{k ≥ 1} Σ_{m₁+m₂+⋯=m} [ m! / (m₁! m₂! ⋯) ] · p₁^{m₁} p₂^{m₂} ⋯ · | m_k/m − p_k |
For the inner summation, we write:
Σ_{m₁+m₂+⋯=m} [ m! / (m₁! m₂! ⋯) ] · p₁^{m₁} p₂^{m₂} ⋯ · | m_k/m − p_k | = Σ_{s=0}^{m} [ m! / (s!(m−s)!) ] · p_k^s · | s/m − p_k | · Σ_{m₁+⋯+m_{k−1}+m_{k+1}+⋯=m−s} [ (m−s)! / (m₁! ⋯ m_{k−1}! m_{k+1}! ⋯) ] · p₁^{m₁} ⋯ p_{k−1}^{m_{k−1}} p_{k+1}^{m_{k+1}} ⋯
Using the multinomial series, we simplify the right-hand side into:
Σ_{s=0}^{m} [ m! / (s!(m−s)!) ] · p_k^s (1 − p_k)^{m−s} · | s/m − p_k |
Now, we use De Moivre’s formula for the mean deviation of the binomial random variable (see the proof of Example 1). This gives us:
Σ_{m₁+m₂+⋯=m} [ m! / (m₁! m₂! ⋯) ] · p₁^{m₁} p₂^{m₂} ⋯ · | m_k/m − p_k | = Σ_{s=0}^{m} [ m! / (s!(m−s)!) ] · p_k^s (1 − p_k)^{m−s} · | s/m − p_k | = (2/m) · (1 − p_k)^{(1−p_k)m} · p_k^{1 + m p_k} · m! / [ (p_k m)! · ((1 − p_k)m − 1)! ]
Using Stirling’s approximation to the factorial [17], we obtain the simple asymptotic expression:
Σ_{m₁+m₂+⋯=m} [ m! / (m₁! m₂! ⋯) ] · p₁^{m₁} p₂^{m₂} ⋯ · | m_k/m − p_k | ≈ √( 2 p_k (1 − p_k) / (π m) )
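As a quick numerical sanity check of this asymptotic step, the exact mean absolute deviation of a binomial proportion can be compared with √(2 p_k (1 − p_k) / (π m)); the choices of p and m below are arbitrary.

```python
import math
from math import comb

def mean_abs_dev(m, p):
    """Exact E|S/m - p| for S ~ Binomial(m, p)."""
    return sum(comb(m, s) * p**s * (1 - p)**(m - s) * abs(s / m - p) for s in range(m + 1))

p = 0.3
for m in [10, 100, 1000]:
    print(m, mean_abs_dev(m, p), math.sqrt(2 * p * (1 - p) / (math.pi * m)))
```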
Plugging this into the earlier expression for J ( z ^ ; h ) yields:
J(ẑ; h) ≈ ½ Σ_{k=1,2,3,…} √( 2 p_k (1 − p_k) / (π m) ) = √( ( Ess[Z; p(z)] − 1 ) / (2π m) )
Due to the tightness of the Stirling approximation, the asymptotic expression for the variational information is tight. Because J ( z ^ ; h ) = E s | | p ( z ) , p s ( z ) | | T , we deduce that:
E_s ‖ p(z), p_s(z) ‖_T ≈ √( ( Ess[Z; p(z)] − 1 ) / (2π m) ),
which provides the asymptotic rate of convergence of an empirical probability mass function to the true distribution.
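A short Monte Carlo experiment, with an arbitrary four-letter alphabet, illustrates this rate of convergence by comparing the average empirical total variation distance with the asymptotic expression derived above (written here via the constant Σ_z √(p_z (1 − p_z))).

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary pmf on a 4-letter alphabet
const = np.sqrt(p * (1 - p)).sum()        # asymptotic constant from the derivation above

for m in [50, 500, 5000]:
    tv = [0.5 * np.abs(rng.multinomial(m, p) / m - p).sum() for _ in range(2000)]
    print(m, np.mean(tv), const / np.sqrt(2 * np.pi * m))   # the two columns agree as m grows
```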

Appendix H. Proof of Proposition 5

First, we note that for any two adjacent samples s and s′ and any O ⊆ H, we have in the differential privacy setting:
p(h ∈ O | s) − p(h ∈ O | s′) ≤ (e^ϵ − 1)·p(h ∈ O | s′) + δ
Similarly, we have:
p(h ∈ O | s′) − p(h ∈ O | s) ≤ (1 − e^{−ϵ})·p(h ∈ O | s′) + e^{−ϵ}·δ ≤ e^ϵ·[ (1 − e^{−ϵ})·p(h ∈ O | s′) + e^{−ϵ}·δ ] = (e^ϵ − 1)·p(h ∈ O | s′) + δ
Both results imply that:
| p(h ∈ O | s) − p(h ∈ O | s′) | ≤ (e^ϵ − 1)·p(h ∈ O | s′) + δ ≤ e^ϵ − 1 + δ
We write:
J(ẑ; h) = E_ẑ ‖ p(h | ẑ), p(h) ‖_T = ½ E_ẑ ‖ E_{ẑ′}[ p(h | ẑ′) ] − p(h | ẑ) ‖₁ ≤ ½ E_{ẑ,ẑ′} ‖ p(h | ẑ) − p(h | ẑ′) ‖₁
The last inequality follows by convexity. Next, let s_{m−1} be a sample that contains m − 1 observations drawn i.i.d. from p(z). Then:
J(ẑ; h) ≤ ½ E_{ẑ,ẑ′} ‖ E_{s_{m−1}}[ p(h | ẑ, s_{m−1}) − p(h | ẑ′, s_{m−1}) ] ‖₁ ≤ ½ E_{s,s′} ‖ p(h | s) − p(h | s′) ‖₁,
where s and s′ are two adjacent samples. Finally, we use Equation (A11) to arrive at the statement of the proposition.

Appendix I. Proof of Theorem 8

The proof is similar to the classical VC argument. Given a fixed hypothesis space H, a fixed domain Z, and a 0–1 loss function l : H × Z → {0, 1}, let s = {z₁, …, z_m} be a training sample that comprises m i.i.d. observations. Define the restriction of H to s by:
F_s = { ( l(z₁, h), …, l(z_m, h) ) : h ∈ H }
In other words, F_s is the set of all possible realizations of the 0–1 loss for the elements in s by hypotheses in H. We can introduce an equivalence relation between the elements of H w.r.t. the sample s. Specifically, we say that for h, h′ ∈ H, we have h ≡_s h′ if and only if:
( l(z₁, h), …, l(z_m, h) ) = ( l(z₁, h′), …, l(z_m, h′) )
It is trivial to see that this defines an equivalence relation; i.e., it is reflexive, symmetric, and transitive. Let the set of equivalence classes w.r.t. s be denoted H s . Note that we have a one-to-one correspondence between the members of F s and the members of H s . Moreover, H s is a partitioning of H .
We use the standard twin-sample trick, where s₂ = s ∪ s′ ∈ Z^{2m} and L learns based on s only. For any fixed h ∈ H, let f : H × Z → [0, 1] be an arbitrary loss function, which can be different from the loss l that is optimized during training. A Hoeffding bound for sampling without replacement [41] states that:
p{ | E_{z∼s}[ f(z, h) ] − E_{z∼s₂}[ f(z, h) ] | ≥ ϵ } ≤ 2 exp{ −2ϵ²m }
Hence:
p{ | E_{z∼s}[ f(z, h) ] − E_{z∼s′}[ f(z, h) ] | ≥ ϵ } ≤ p{ | E_{z∼s}[ f(z, h) ] − E_{z∼s₂}[ f(z, h) ] | ≥ ϵ/2 } + p{ | E_{z∼s′}[ f(z, h) ] − E_{z∼s₂}[ f(z, h) ] | ≥ ϵ/2 } ≤ 4 exp{ −(1/2)ϵ²m }
This happens for a hypothesis h ∈ H that is fixed independently of the random split of s₂ into training and ghost samples. When h is selected according to the random split of s₂, we need to employ the union bound.
For any subset H′ ⊆ H, let min(H′) be the least element in H′ according to ⪯. Let H_s be as defined previously and write H_min(s) = { min(H_k) : H_k ∈ H_s }. Then, it is easy to observe that the ERM learning rule defined in Theorem 8 must select one of the hypotheses in H_min(s₂) regardless of the split s₂ = s ∪ s′. This holds because H_{s₂} is a finer partitioning of H than H_s. In other words, every member of H_s is a union of some finite number of members of H_{s₂}. By the well-ordering property, the "least" element among the empirical risk minimizers must be in H_min(s₂).
Hence, there are at most τ_H(2m) possible hypotheses given s₂, where τ_H(·) is the growth function (sometimes referred to as the shattering coefficient), and those hypotheses can be fixed independently of the random splitting of s₂ into a training sample s and a ghost sample s′.
Consequently, we have by the union bound:
p{ sup_{h ∈ H_min(s ∪ s′)} | E_{z∼s′}[ f(z, h) ] − E_{z∼s}[ f(z, h) ] | ≥ ϵ } ≤ 4 τ_H(2m) exp{ −ϵ²m/2 } ≤ 4 (2em/d)^d exp{ −ϵ²m/2 },
where d is the VC dimension of H. Finally, to bound the generalization risk in expectation, we use Lemma A.4 in [13], which implies that if m ≥ d:
E_{s,s′} sup_{h ∈ H_min(s ∪ s′)} | E_{z∼s′}[ f(z, h) ] − E_{z∼s}[ f(z, h) ] | ≤ √(2/m)·( 2 + √( log 2 + d·log(2em/d) ) ) ≤ √(2/m)·( 2 + √( 1 + d·log(2em/d) ) ) ≤ ( 3 + √( 1 + d·log(2em/d) ) ) / √m
Writing h ^ for the least empirical risk minimizer w.r.t. the training sample s :
R_gen(L) = E_s[ E_{z∼s}[ f(z, ĥ) ] − E_{z∼D}[ f(z, ĥ) ] ] ≤ E_s | E_{z∼s}[ f(z, ĥ) ] − E_{z∼D}[ f(z, ĥ) ] | = E_s | E_{z∼s}[ f(z, ĥ) ] − E_{s′} E_{z∼s′}[ f(z, ĥ) ] | ≤ E_{s,s′} | E_{z∼s}[ f(z, ĥ) ] − E_{z∼s′}[ f(z, ĥ) ] | ≤ E_{s,s′} sup_{h ∈ H_min(s ∪ s′)} | E_{z∼s}[ f(z, h) ] − E_{z∼s′}[ f(z, h) ] | ≤ ( 3 + √( 1 + d·log(2em/d) ) ) / √m
Because this bound in expectation holds for any single loss f : H × Z → [0, 1], it holds for the following loss function:
l(z, h) = I{ p(z ∈ s | h) > p(z ∈ s) },
which is a deterministic 0–1 loss function of h that assigns to z ∈ Z the value 1 if and only if our knowledge of h increases the probability that z belongs to the training sample. However, the generalization risk in expectation for the loss l is equal to the variational information J(ĥ; ẑ), as shown in the proof of Theorem 2. Hence, we have the bound stated in the theorem:
J(ĥ; ẑ) ≤ ( 3 + √( 1 + d·log(2em/d) ) ) / √m,
Because this is a distribution-free bound, we have:
C(L) ≤ ( 3 + √( 1 + d·log(2em/d) ) ) / √m

Appendix J. Proof of Theorem 9

Let X_d = {x₁, …, x_d} ⊆ X be a set of d points that are shattered by hypotheses in H. By definition, this implies that for any possible 0–1 labeling I ∈ {0, 1}^d, there exists a hypothesis h_I ∈ H such that ( h_I(x₁), …, h_I(x_d) ) = I.
Given an ERM learning rule L whose hypothesis is denoted ĥ_s, let p(x) be the uniform distribution of instances over X_d and define:
y(x) = arg min_{ỹ ∈ {+1, −1}} p_s{ ĥ_s(x) = ỹ | x ∉ s }
In other words, y(x) is the least probable class that is assigned by L to the instance x when x is unseen in the training sample. Let p(z), with z = (x, y), denote the uniform distribution of instances over X_d with y given by the labeling rule above.
By drawing a training sample s ∈ Z^m of m i.i.d. observations from p(z), our first task is to bound the expected number of distinct values in X_d that are not observed in the training sample. Let:
E_i = I{ x_i ∉ s }
Then, the expected number of distinct values in X_d that are not observed in the training sample s is:
Σ_{i=1}^{d} E[E_i] = Σ_{i=1}^{d} (1 − 1/d)^m = d·(1 − 1/d)^m
Here, we used the linearity of expectation, which holds even when the random variables are not independent. This shows that the expected fraction of instances in X_d that are not seen in the sample s is (1 − 1/d)^m.
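This expectation is easy to confirm by simulation; the values of d, m, and the number of trials below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 20, 30, 20_000

unseen = []
for _ in range(trials):
    draws = rng.integers(0, d, size=m)              # m i.i.d. uniform draws from d points
    unseen.append(1.0 - np.unique(draws).size / d)  # fraction of the d points never drawn

print(np.mean(unseen), (1 - 1 / d) ** m)            # both are approximately 0.215
```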
Next, given an ERM learning rule that outputs an empirical risk minimizer, the training error of this learning algorithm is zero because X_d is shattered by H. However, for any learning rule L, the expected error rate on the unseen examples is at least 1/2 by construction. Therefore, there exists a distribution p(z) in which the generalization risk is at least (1/2)·(1 − 1/d)^m.
By Theorem 2, the learning capacity is an upper bound on the maximum generalization risk across all distributions of observations and all parametric loss functions. Consequently:
C(L) ≥ (1/2)·(1 − 1/d)^m,
which is the statement of the theorem.

References

  1. Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Stochastic Convex Optimization. In Proceedings of the Annual Conference on Learning Theory, Montreal, QC, Canada, 18–21 June 2009. [Google Scholar]
  2. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156. [Google Scholar] [CrossRef] [Green Version]
  3. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Blumer, A.; Ehrenfeucht, A.; Haussler, D.; Warmuth, M.K. Learnability and the Vapnik-Chervonenkis dimension. JACM 1989, 36, 929–965. [Google Scholar] [CrossRef]
  5. McAllester, D. PAC-Bayesian stochastic model selection. Mach. Learn. 2003, 51, 5–21. [Google Scholar] [CrossRef] [Green Version]
  6. Bousquet, O.; Elisseeff, A. Stability and generalization. JMLR 2002, 2, 499–526. [Google Scholar]
  7. Bartlett, P.L.; Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR 2002, 3, 463–482. [Google Scholar]
  8. Kutin, S.; Niyogi, P. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth conference on Uncertainty in Artificial Intelligence (UAI), Edmonton, AB, Canada, 1–4 August 2002. [Google Scholar]
  9. Poggio, T.; Rifkin, R.; Mukherjee, S.; Niyogi, P. General conditions for predictivity in learning theory. Nature 2004, 428, 419–422. [Google Scholar] [CrossRef]
  10. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley & Sons: New York, NY, USA, 1991. [Google Scholar]
  11. Hardt, M.; Recht, B.; Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv 2015, arXiv:1509.01240. [Google Scholar]
  12. Dwork, C.; Feldman, V.; Hardt, M.; Pitassi, T.; Reingold, O.; Roth, A. Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing (STOC), Portland, OR, USA, 14–17 June 2015; pp. 117–126. [Google Scholar]
  13. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: New York, NY, USA, 2014. [Google Scholar]
  14. Raginsky, M.; Rakhlin, A.; Tsao, M.; Wu, Y.; Xu, A. Information-theoretic analysis of stability and bias of learning algorithms. In Proceedings of the 2016 IEEE Information Theory Workshop (ITW), Cambridge, UK, 11–14 September 2016; pp. 26–30. [Google Scholar]
  15. Janson, S. Probability asymptotics: Notes on notation. arXiv 2011, arXiv:1108.3924. [Google Scholar]
  16. Tao, T. Topics in Random Matrix Theory; American Mathematical Society: Providence, RI, USA, 2012. [Google Scholar]
  17. Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Learnability, stability and uniform convergence. JMLR 2010, 11, 2635–2670. [Google Scholar]
  18. Talagrand, M. Majorizing measures: The generic chaining. Ann. Probab. 1996, 24, 1049–1103. [Google Scholar] [CrossRef]
  19. Audibert, J.Y.; Bousquet, O. Combining PAC-Bayesian and generic chaining bounds. JMLR 2007, 8, 863–889. [Google Scholar]
  20. Xu, H.; Mannor, S. Robustness and generalization. Mach. Learn. 2012, 86, 391–423. [Google Scholar] [CrossRef] [Green Version]
  21. Csiszár, I. A Class of Measures of Informativity of Observation Channels. Period. Math. Hung. 1972, 2, 191–213. [Google Scholar] [CrossRef]
  22. Csiszár, I. Axiomatic Characterizations of Information Measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef] [Green Version]
  23. Russo, D.; Zou, J. Controlling Bias in Adaptive Data Analysis Using Information Theory. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, 9–11 May 2016. [Google Scholar]
  24. Bassily, R.; Moran, S.; Nachum, I.; Shafer, J.; Yehudayoff, A. Learners that Use Little Information. PMLR 2018, 83, 25–55. [Google Scholar]
  25. Elkan, C. The foundations of cost-sensitive learning. In Proceedings of the IJCAI, Seattle, WA, USA, 4–10 August 2001. [Google Scholar]
  26. Kull, M.; Flach, P. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2015; pp. 68–85. [Google Scholar]
  27. Robbins, H. A remark on Stirling’s formula. Am. Math. Mon. 1955, 62, 26–29. [Google Scholar] [CrossRef]
  28. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  29. Wang, J.; Chen, Q.; Chen, Y. RBF kernel based support vector machine with universal approximation and its application. ISNN 2004, 3173, 512–517. [Google Scholar]
  30. Downs, T.; Gates, K.E.; Masters, A. Exact simplification of support vector solutions. JMLR 2002, 2, 293–297. [Google Scholar]
  31. Stigler, S.M. The History of Statistics: The Measurement of Uncertainty before 1900; Harvard University Press: Cambridge, MA, USA, 1986. [Google Scholar]
  32. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Theory of Cryptography Conference (TCC 2006), New York, NY, USA, 4–7 March 2006; pp. 265–284. [Google Scholar]
  33. Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Theor. Comput. Sci. 2013, 9, 211–407. [Google Scholar]
  34. Koren, T.; Levy, K. Fast rates for exp-concave empirical risk minimization. In Proceedings of the NIPS 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 1477–1485. [Google Scholar]
  35. Kolmogorov, A.N.; Fomin, S.V. Introductory Real Analysis; Dover Publication, Inc.: New York, NY, USA, 1970. [Google Scholar]
  36. Alabdulmohsin, I.M. An information theoretic route from generalization in expectation to generalization in probability. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
  37. Alabdulmohsin, I. Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization. In Proceedings of the International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 149–158. [Google Scholar]
  38. Pavlovski, M.; Zhou, F.; Arsov, N.; Kocarev, L.; Obradovic, Z. Generalization-Aware Structured Regression towards Balancing Bias and Variance. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018; pp. 2616–2622. [Google Scholar]
  39. Alabdulmohsin, I.M. Algorithmic Stability and Uniform Generalization. In Proceedings of the NIPS 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 19–27. [Google Scholar]
  40. Pelleg, D.; Moore, A.W. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; pp. 727–734. [Google Scholar]
  41. Bardenet, R.; Maillard, O.A. Concentration inequalities for sampling without replacement. Bernoulli 2015, 21, 1361–1385. [Google Scholar] [CrossRef]
Figure 1. This figure corresponds to a classification problem in one dimension in which a classifier is a threshold between positive and negative examples. In this figure, the x axis is the number of training examples while the y-axis is the generalization risk. The red curve (top) corresponds to the difference between training and test accuracy when z-score normalization is applied before learning a classifier. The blue curve (bottom) corresponds to the difference between training and test accuracy when the data is not normalized.