Towards a Unified Theory of Learning and Information

In this paper, we introduce the notion of "learning capacity" for algorithms that learn from data, which is analogous to the Shannon channel capacity for communication systems. We show how "learning capacity" bridges the gap between statistical learning theory and information theory, and we use it to derive generalization bounds for finite hypothesis spaces, differential privacy, and countable domains, among others. Moreover, we prove that under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule that has a vanishing learning capacity is equivalent to the assertion that the hypothesis space has a finite Vapnik–Chervonenkis (VC) dimension, thus establishing an equivalence relation between two of the most fundamental concepts in statistical learning theory and information theory. In addition, we show how the learning capacity of an algorithm provides important qualitative results, such as insights into the relation between generalization and algorithmic stability, information leakage, and data processing. Finally, we conclude by listing some open problems and suggesting future directions of research.


Generalization Risk
A central goal when learning from data is to strike a balance between underfitting and overfitting. Mathematically, this requirement can be translated into an optimization problem with two competing objectives. First, we would like the learning algorithm to produce a hypothesis (i.e., an answer) that performs well on the empirical sample. This goal can be easily achieved by using a rich hypothesis space that can "explain" any observations. Second, we would like to guarantee that the performance of the hypothesis on the empirical data (a.k.a. training error) is a good approximation of its performance with respect to the unknown underlying distribution (a.k.a. test error). This goal can be achieved by limiting the complexity of the hypothesis space. The first condition mitigates underfitting, while the second mitigates overfitting.
Formally, suppose we have a learning algorithm L : Z^m → H that receives a sample s = {z_1, . . . , z_m}, comprising m i.i.d. observations z_i ∼ p(z), and uses s to select a hypothesis h ∈ H. Let l be a loss function defined on the product space Z × H. For instance, l can be the mean squared error (MSE) in regression or the 0-1 error in classification. Then, the goal of learning from data is to select a hypothesis h ∈ H whose true risk R(h), defined by:

R(h) = E_{z∼p(z)} [ l(z, h) ],   (1)

is small, while its empirical risk R_s(h) = (1/m) Σ_{i=1}^m l(z_i, h) remains a faithful estimate of R(h). The latter requirement is formalized by the following two definitions.

Definition 1 (Generalization in Expectation). The expected generalization risk of a learning algorithm L is:

R_gen(L) = E [ R(h) − R_s(h) ],

where R(h) is defined in Equation (1), and the expectation is taken over the random choice of s and the internal randomness of L. A learning algorithm L generalizes in expectation if R_gen(L) → 0 as m → ∞ for all distributions p(z).
Definition 2 (Generalization in Probability). A learning algorithm L generalizes in probability if, for any ε > 0, we have:

lim_{m→∞} p( |R(h) − R_s(h)| ≥ ε ) = 0,

where the probability is evaluated over the randomness of s and the internal randomness of the learning algorithm.
In general, both types of generalization have been used to analyze machine learning algorithms. For instance, generalization in probability is used in the VC theory to analyze algorithms with a finite VC dimension, such as linear classifiers [3]. Generalization in expectation, on the other hand, has been used to analyze learning algorithms such as stochastic gradient descent (SGD), differentially private algorithms, and ridge regression [11][12][13][14]. Generalization in expectation is often simpler to analyze, but it provides a weaker performance guarantee.
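To make Definition 1 concrete, the following toy sketch (our own illustration, not part of the paper's analysis) estimates the expected generalization risk of empirical-mean estimation on Bernoulli data under the bounded squared loss l(z, h) = (z − h)², for which the gap E[R(h) − R_s(h)] has the closed form 2φ(1 − φ)/m:

```python
import random

random.seed(1)
phi, m, trials = 0.3, 20, 50000   # toy Bernoulli(phi) source, sample size m
var = phi * (1 - phi)

gap = 0.0
for _ in range(trials):
    s = [1 if random.random() < phi else 0 for _ in range(m)]
    h = sum(s) / m                            # learned hypothesis: empirical mean
    R_s = sum((z - h) ** 2 for z in s) / m    # empirical risk (training error)
    R = var + (h - phi) ** 2                  # true risk, known in closed form here
    gap += (R - R_s) / trials

print(gap)            # Monte Carlo estimate of R_gen(L)
print(2 * var / m)    # exact value: 2*phi*(1-phi)/m = 0.021
```

The Monte Carlo estimate converges to 2φ(1 − φ)/m, which vanishes as m → ∞, so this algorithm generalizes in expectation in the sense of Definition 1.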

Paper Outline
In this paper, we introduce a third notion of generalization, called uniform generalization. Uniform generalization also provides generalization bounds in expectation, but it is stronger than the traditional form of generalization in expectation in Definition 1 because it requires that the generalization risk vanish uniformly in expectation across all bounded parametric loss functions (hence the name). Throughout this paper, a loss function l : Z × H → [0, 1] is called "parametric" if it is conditionally independent of the original training sample given the learned hypothesis h ∈ H.
As mentioned earlier, the uniform generalization risk is equal to an information-theoretic quantity and it yields classical results in statistical learning theory. Perhaps more importantly, and unlike traditional in-expectation guarantees that do not imply concentration, we will show that uniform generalization in expectation implies generalization in probability. Hence, all of the uniform generalization bounds derived in this paper hold both in expectation and with a high probability.
The theory of uniform generalization bridges the gap between information theory and statistical learning theory. For example, we will establish an equivalence relation between the VC dimension, on one hand, and a quantity that is quite analogous to the Shannon channel capacity, on the other. Both the VC dimension and the Shannon channel capacity are arguably the most central concepts in statistical learning theory and information theory, respectively. This connection between the two concepts is obtained via the notion of "learning capacity" that we introduce in this paper, which is the supremum of the uniform generalization risk across all input distributions. We will compute the learning capacities of many machine learning algorithms and show that they match known bounds on the generalization risk up to logarithmic factors.
In general, the main aim of this work is to bring to light a new information-theoretic approach for analyzing machine learning algorithms. Although "uniform generalization" might appear to be a strong condition at first sight, one of the central themes emphasized repeatedly throughout this paper is that uniform generalization is, in fact, a natural condition that arises commonly in practice; it is not a condition that machine learning practitioners need to impose or enforce. We believe this holds because any learning algorithm is a channel from the space of training samples to the hypothesis space, so its risk of overfitting can be analyzed by studying the properties of this mapping itself. Such an approach yields the uniform generalization bounds that are derived in this paper.
While we strive to introduce foundational results in this work, there are many important questions that remain unanswered. We conclude this paper by listing some of those open problems and suggesting future directions of research.

Notation
The notation used in this paper is fairly standard. Important exceptions are listed here. If x is a random variable that takes its values from a finite set s uniformly at random, we write x ∼ s to denote such a distribution. If x is a boolean random variable (i.e., a predicate), then I{x} = 1 if and only if x is true; otherwise I{x} = 0. In general, random variables are denoted with boldface letters x, instances of random variables are denoted with small letters x, matrices are denoted with capital letters X, and alphabets (i.e., fixed sets) are denoted with calligraphic typeface X (except L, which is reserved for the learning algorithm, and D, which is reserved for the input distribution, as is customary in the literature).
Throughout this paper, we will always write Z to denote the space of observations (a.k.a. domain) and H to denote the hypothesis space (a.k.a. range). A learning algorithm L : Z^m → H is formally treated as a stochastic map, where the hypothesis h ∈ H can be a deterministic or a randomized function of the training sample s ∈ Z^m. Given a 0-1 loss function l : H × Z → {0, 1}, we will abuse terminology slightly by speaking about the "VC dimension of H" when we actually mean the VC dimension of the loss class {l(·, h) : h ∈ H}.
In addition, given two probability measures p and q defined on the same space, we will write ⟨p, q⟩ to denote the overlapping coefficient between p and q. That is, ⟨p, q⟩ = 1 − ||p, q||_T, where ||p, q||_T = (1/2) ||p − q||_1 is the total variation distance. Moreover, we will use the order-in-probability notation for real-valued random variables. Here, we adopt the notation used by [15] and [16]. In particular, let x = x_n be a real-valued random variable that depends on some parameter n ∈ N. Then, we will write x_n = O_p(f(n)) if, for any δ > 0, there exist absolute constants C and n_0 such that for any fixed n ≥ n_0, the inequality |x_n| < C |f(n)| holds with probability at least 1 − δ. In other words, the ratio x_n / f(n) is stochastically bounded [15]. Similarly, we write x_n = Θ_p(f(n)) if both x_n = O_p(f(n)) and f(n) = O_p(x_n) hold.

Related Work
A learning algorithm is called consistent if the true risk of its hypothesis h converges to the optimal true risk in H, i.e., inf_{h∈H} R(h), as m → ∞, in a distribution-agnostic manner. A learning problem, which is a tuple (Z, H, l) with l being a loss function defined on the product space Z × H, is called learnable if it admits a consistent learning algorithm. It can be shown that learnability is equivalent to uniform convergence for supervised classification and regression, even though uniform convergence is not necessary in the general setting [17].
Unlike learnability, the subject of generalization looks into how representative the empirical risk R_s(h) is of the true risk R(h), as discussed earlier. It can be rightfully considered an extension of the law of large numbers, which is one of the earliest and most important results in probability theory and statistics. However, unlike the law of large numbers, which assumes that observations are independent and identically distributed, the subject of generalization in machine learning addresses the case where the losses l(z_i, h) are no longer i.i.d., due to the fact that h is selected according to the training sample s and z_i ∈ s. Similar to learnability, uniform convergence is, by definition, sufficient for generalization, but it is not necessary because the learning algorithm might restrict its search space to a smaller subset of H. So, in addition to uniform convergence bounds, several other methods have been introduced for bounding the generalization risk, such as algorithmic stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis [5][6][7][18][19][20]. Classical concentration of measure inequalities, such as the union bound, form the building blocks of such rich theories.
In this work, we address the subject of generalization in machine learning from an information-theoretic point of view. We will show that if the hypothesis h conveys "little" information about a single random training example ẑ ∼ s, then the difference between E_{ẑ∼s}[l(ẑ, h)] and E_{z∼p(z)}[l(z, h)] will be small with a high probability. The measure of information we use here is given by the notion of variational information J(ẑ; h) between the hypothesis h and a single random training example ẑ ∼ s. Variational information, also sometimes called T-information [14], is an instance of the class of informativity measures based on f-divergences, which can be motivated axiomatically [21,22]. Unlike traditional methods, we will prove that J(ẑ; h) is equal to the "uniform" generalization risk; it is not just an upper bound.
Information-theoretic approaches to analyzing the generalization risk of learning algorithms, such as the one proposed in this paper, have found applications in adaptive data analysis. This includes the work of [12] using the max-information, the work of [23] and [24] using the mutual information, and the work of [14] using the leave-one-out information. One key contribution of our work is to show that one should examine the relationship between the hypothesis and a single random training example, instead of the relationship between the hypothesis and the full training sample, as is customary in the literature. The gap between the two approaches is strict. For example, Theorem 8 in Section 5.5 presents an example in which a learning algorithm has a vanishing uniform generalization risk even though the mutual information between the learned hypothesis and the training sample can be made arbitrarily large.

Preliminary Definitions
In this paper, we consider the general setting of learning introduced by Vapnik [3]. To reiterate, we have an observation space (a.k.a. domain) Z and a hypothesis space H. Our learning algorithm L receives a set of m observations s = {z 1 , . . . , z m } ∈ Z m generated i.i.d. from some fixed unknown distribution p(z), and picks a hypothesis h ∈ H according to some probability distribution p(h | s). In other words, L is a channel from s to h. In this paper, we allow the hypothesis h to be any summary statistic of the training set. It can be an answer to a query, a measure of central tendency, or a mapping from the input space to the output space. In fact, we even allow h to be a subset of the training set itself. In formal terms, L is a stochastic map between the two random variables s ∈ Z m and h ∈ H, where the exact interpretation of those random variables is irrelevant. Moreover, we assume that there exists a non-negative bounded loss function l(z, h) ∈ [0, 1] that is used to measure the fitness of the hypothesis h ∈ H on the observation z ∈ Z.
For any fixed hypothesis h ∈ H, we define its true risk R(h) by Equation (1) and denote its empirical risk on the training sample by R_s(h) = (1/m) Σ_{i=1}^m l(z_i, h). We also define the true and empirical risks of the learning algorithm L by the expected corresponding risks of its hypothesis:

R(L) = E_{s, h} [ R(h) ],   R_s(L) = E_{s, h} [ R_s(h) ].

Finally, the generalization risk of the learning algorithm is defined by:

R_gen(L) = R(L) − R_s(L).

Next, we define uniform generalization:

Definition 3 (Parametric Loss). A loss function l : Z × H → [0, 1] is called parametric if it is conditionally independent of the training sample s given the hypothesis h, i.e., if the Markov chain s → h → l(·, h) holds.

Definition 4 (Uniform Generalization). A learning algorithm L generalizes uniformly with rate ε ≥ 0 if:

sup_l |R_gen(L)| ≤ ε,   (5)

where the supremum is taken over all parametric loss functions l : Z × H → [0, 1].

Informally, Definition 4 states that once a hypothesis h is selected by a learning algorithm L that achieves uniform generalization, no "adversary" can post-process the hypothesis in a manner that causes overfitting to occur. Equivalently, uniform generalization implies that the empirical performance of h on the sample s will remain close to its performance with respect to the underlying distribution, regardless of how that performance is measured. For example, the loss function l : Z × H → [0, 1] in Equation (5) can be the misclassification error rate as in the traditional classification setting, a cost-sensitive error rate as in fraud detection and medical diagnosis [25], or the Brier score as in probabilistic predictions [26]. The generalization guarantee would hold in any case.

Variational Information
Definition 5 (Variational Information). Given two random variables x and y, the variational information between them, denoted J(x; y), is defined as the total variation distance between the joint distribution p(x, y) and the product of marginals p(x) · p(y). By definition:

J(x; y) = || p(x) p(y) , p(x, y) ||_T = E_y || p(x) , p(x | y) ||_T .

Variational information satisfies a chain rule, analogous to the chain rule of Shannon's mutual information:

Theorem 1 (Chain Rule). For any random variables x, y, and z:

J(x; (y, z)) ≤ J(x; y) + J(x; z | y),

where J(x; z | y) is the variational information between x and z evaluated with respect to the conditional distributions given y, averaged over y.

Proof. The proof is in Appendix C.
In other words, the inequality in the chain rule J(x; (y, z)) ≤ J(x; y) + J(x; z | y) becomes an equality if, for instance, x and z are conditionally independent given y, in which case J(x; z | y) = 0. The chain rule provides a recipe for computing the bias of a composition of hypotheses (h_1, . . . , h_k). Recently, [23] proposed an information-budget framework for controlling the bias of estimators by controlling the mutual information between h and the training sample s. The proposed framework rests on the chain rule of mutual information. Here, we note that the argument for the information-budget framework also holds when using the variational information, due to the chain rule above.
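The chain rule can be checked numerically. The following sketch (our own illustration, using hypothetical toy alphabets) draws a random joint distribution p(x, y, z) and verifies J(x; (y, z)) ≤ J(x; y) + J(x; z | y):

```python
import random
from itertools import product

random.seed(0)
NX, NY, NZ = 3, 3, 3  # sizes of toy alphabets for x, y, z

# A random, strictly positive joint distribution p(x, y, z)
w = {t: random.random() + 0.1 for t in product(range(NX), range(NY), range(NZ))}
tot = sum(w.values())
p = {t: v / tot for t, v in w.items()}

def marginal(axes):
    out = {}
    for t, v in p.items():
        key = tuple(t[a] for a in axes)
        out[key] = out.get(key, 0.0) + v
    return out

px, py, pxy, pyz = marginal([0]), marginal([1]), marginal([0, 1]), marginal([1, 2])

# J(x; y): total variation between p(x, y) and p(x) p(y)
J_xy = 0.5 * sum(abs(pxy[(x, y)] - px[(x,)] * py[(y,)])
                 for x in range(NX) for y in range(NY))

# J(x; (y, z)): total variation between p(x, y, z) and p(x) p(y, z)
J_x_yz = 0.5 * sum(abs(p[(x, y, z)] - px[(x,)] * pyz[(y, z)])
                   for x, y, z in product(range(NX), range(NY), range(NZ)))

# J(x; z | y) = E_y [ TV(p(x, z | y), p(x | y) p(z | y)) ]
J_xz_given_y = sum(
    py[(y,)] * 0.5 * sum(abs(p[(x, y, z)] / py[(y,)]
                             - (pxy[(x, y)] / py[(y,)]) * (pyz[(y, z)] / py[(y,)]))
                         for x in range(NX) for z in range(NZ))
    for y in range(NY))

print(J_x_yz <= J_xy + J_xz_given_y + 1e-12)  # True: the chain rule holds
```

The same inequality holds for any joint distribution; the random draw here merely illustrates it on one instance.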

Equivalence Result
Our first main theorem states that the uniform generalization risk has a precise information-theoretic characterization.
Theorem 2. Given a fixed constant 0 ≤ ε ≤ 1 and a learning algorithm L : Z^m → H that selects a hypothesis h ∈ H according to a training sample s = {z_1, . . . , z_m}, where z_i ∼ p(z) are i.i.d., L generalizes uniformly with rate ε if and only if J(h; ẑ) ≤ ε, where ẑ ∼ s is a single random training example.
Proof. Let L : Z^m → H be a learning algorithm that receives a finite set of training examples s = {z_1, . . . , z_m} ∈ Z^m drawn i.i.d. from a fixed unknown distribution p(z). Let h ∼ p(h | s) be the hypothesis chosen by L (which can be deterministic or randomized), and write ẑ ∼ s to denote a random variable that selects its value uniformly at random from the training sample s. Clearly, ẑ and h are not independent in general. To simplify notation, we will write l = l(·, h) : Z → [0, 1] to denote the loss function. Note that l is itself a random variable that satisfies the Markov chain s → h → l. The claim is that L generalizes uniformly with rate ε > 0 across all parametric loss functions l if and only if J(h; ẑ) ≤ ε.
By the Markov property, we have p(l | h, s) = p(l | h). By definition, the true and empirical risks of L are given by:

R(L) = E_l E_{z∼p(z)} [ l(z) ],   (6)
R_s(L) = E_l E_{ẑ∼p(ẑ|l)} [ l(ẑ) ].   (7)

Because ẑ ∼ s is a random variable whose value is chosen uniformly at random with replacement from the training set s, its marginal distribution is p(z). Its conditional distribution given l can be different, however, because both l and ẑ depend on the training set s. However, they are both conditionally independent of each other given s. By marginalization, we have:

p(ẑ) = E_s [ p(ẑ | s) ] = p(z).

Combining this with Equations (6) and (7) yields R(L) = E_l E_ẑ [ l(ẑ) ] and R_s(L) = E_l E_{ẑ|l} [ l(ẑ) ]. Both equations imply that:

R_gen(L) = R(L) − R_s(L) = E_l [ E_{ẑ∼p(ẑ)} l(ẑ) − E_{ẑ∼p(ẑ|l)} l(ẑ) ].

Now, we would like to sandwich the right-hand side between upper and lower bounds. To do this, we note that if p_1(z) and p_2(z) are two distributions defined on the same domain Z and f : Z → [0, 1], then:

| E_{z∼p_1(z)} f(z) − E_{z∼p_2(z)} f(z) | ≤ || p_1(z) , p_2(z) ||_T ,

where ||p_1(z), p_2(z)||_T is the total variation distance. This result can be proven immediately by considering the two regions {z ∈ Z : p_1(z) > p_2(z)} and {z ∈ Z : p_1(z) < p_2(z)} separately. In addition, it is tight because the inequality holds with equality for the loss function f(z) = I{p_1(z) ≥ p_2(z)}. Applying this bound with p_1 = p(ẑ) and p_2 = p(ẑ | l), and using the Markov chain s → h → l with data processing, we obtain:

| R_gen(L) | ≤ E_l || p(ẑ) , p(ẑ | l) ||_T ≤ E_h || p(ẑ) , p(ẑ | h) ||_T = J(h; ẑ).

To prove the converse, define:

l'(z, h) = I{ p(ẑ = z | h) ≥ p(z) }.

The loss l'(z, h) is independent of the training sample given h because p(ẑ = z | h) is evaluated by taking expectation over all the training samples conditioned on h. Hence, l'(z, h) is a 0-1 loss defined on the product space Z × H and satisfies the Markov chain s → h → l'. However, given this choice of loss, we have:

R_s(L) − R(L) = E_h [ E_{ẑ∼p(ẑ|h)} l'(ẑ, h) − E_{z∼p(z)} l'(z, h) ] = E_h || p(ẑ) , p(ẑ | h) ||_T = J(h; ẑ).

Hence, the variational information J(h; ẑ) does not only provide an upper bound on the uniform generalization risk; it is also a lower bound to it. Therefore, J(h; ẑ) is equal to the uniform generalization risk.
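The total variation bound used in the proof above can be verified numerically. A minimal sketch (our own illustration, on a hypothetical finite domain) that also confirms the tightness claim for the indicator loss f(z) = I{p_1(z) ≥ p_2(z)}:

```python
import random

random.seed(0)
n = 10  # size of a toy finite domain Z

def rand_dist(n):
    w = [random.random() for _ in range(n)]
    t = sum(w)
    return [v / t for v in w]

p1, p2 = rand_dist(n), rand_dist(n)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))  # total variation ||p1, p2||_T

def gap(f):
    # |E_{z~p1} f(z) - E_{z~p2} f(z)| for a loss f : Z -> [0, 1]
    return abs(sum(a * v for a, v in zip(p1, f)) - sum(b * v for b, v in zip(p2, f)))

# Any loss with range [0, 1] obeys the inequality ...
for _ in range(1000):
    f = [random.random() for _ in range(n)]
    assert gap(f) <= tv + 1e-12

# ... and the indicator f(z) = I{p1(z) >= p2(z)} attains it with equality
f_star = [1.0 if a >= b else 0.0 for a, b in zip(p1, p2)]
print(abs(gap(f_star) - tv) < 1e-12)  # True
```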

Remark 1.
One important observation about Theorem 2 is that the variational information is measured between the hypothesis h and a single training example ẑ, which is quite different from previous works that looked into the mutual information with the entire training sample s. By considering ẑ rather than s, we quantify the uniform generalization risk with equality, and the resulting bound is not vacuous even if the learning algorithm is deterministic. By contrast, J(s; h) may yield vacuous bounds when L is deterministic and both Z and H are uncountable.
For concreteness, we illustrate how to compute the uniform generalization risk (or equivalently the variational information) in two simple examples. Here, B(k; φ, n) = (n choose k) φ^k (1 − φ)^{n−k} is the binomial distribution. The first example is a special case of a more general theorem that will be presented later in Section 5.2.
Example 1. Suppose that Z = {0, 1} with p(z = 1) = φ, and let L be the learning algorithm that outputs the empirical average h = (1/m) Σ_{i=1}^m z_i of its observations. Then, it can be shown that the uniform generalization risk of this learning algorithm is given by the following quantity, assuming that φ m is an integer:

J(h; ẑ) = E |h − φ| = 2 φ (1 − φ) B(φ m; φ, m).

This is maximized when φ = 1/2, in which case the uniform generalization risk can be bounded using the Stirling approximation [27] by 1/√(2 π m) up to a first-order term.
Proof. First, the probability of obtaining a hypothesis h = k/m, where k ∈ {0, 1, . . . , m}, given that we have m Bernoulli trials, has a binomial distribution:

p(h = k/m) = B(k; φ, m).

We use the identity:

J(h; ẑ) = E_h || p(ẑ) , p(ẑ | h) ||_T .

Here, p(ẑ) is Bernoulli with probability of success φ, while p(ẑ | h = k/m) is Bernoulli with probability of success h = k/m. The total variation distance between the two Bernoulli distributions is given by |φ − h|. So, we obtain:

J(h; ẑ) = E_h |φ − h| = Σ_{k=0}^m B(k; φ, m) |φ − k/m|.

This is the mean deviation. Assuming φ m is an integer, the mean deviation of the binomial random variable is given by de Moivre's formula:

E_h |φ − h| = 2 φ (1 − φ) B(φ m; φ, m).

The mean deviation is maximized when φ = 1/2. This gives us:

J(h; ẑ) = (1/2) (m choose m/2) 2^{−m} ≈ 1/√(2 π m),

where in the last step we expanded the binomial coefficient and used Stirling's approximation [27].
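As a numerical sanity check on the derivation above, the following sketch (our own) computes the mean deviation E|h − φ| exactly and compares it with de Moivre's formula and with the Stirling estimate 1/√(2πm):

```python
from math import comb, sqrt, pi

def B(k, phi, n):
    """Binomial pmf B(k; phi, n)."""
    return comb(n, k) * phi**k * (1 - phi) ** (n - k)

def unif_gen_risk(m, phi):
    """Exact uniform generalization risk E|h - phi| of the empirical average."""
    return sum(B(k, phi, m) * abs(k / m - phi) for k in range(m + 1))

m, phi = 100, 0.5
exact = unif_gen_risk(m, phi)
de_moivre = 2 * phi * (1 - phi) * B(int(phi * m), phi, m)  # de Moivre's formula
stirling = 1 / sqrt(2 * pi * m)                            # first-order estimate

print(exact, de_moivre, stirling)
```

For m = 100, the exact value and de Moivre's formula agree to machine precision, and both are within about 10^-4 of the Stirling estimate.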

Example 2.
Suppose that the domain is Z = {1, 2, 3, . . . , K} for some K < ∞, where p(z = k) = 1/K for all k ∈ Z. Let the hypothesis space be H = Z, where p(h = k) is equal to the fraction of times the value k is observed in the training sample s = {z_1, . . . , z_m}. For example, if s = {1, 3, 2, 1, 1, 3}, the hypothesis h is chosen among the set {1, 2, 3} with the respective probabilities {1/2, 1/6, 1/3}. Then, the variational information is given by:

J(ẑ; h) = E_ẑ || p(h) , p(h | ẑ) ||_T .

By Bayes rule, we have p(h = k) = p(z = k) = 1/K. However, given one observation ẑ = x, the probability of selecting a hypothesis h = k depends on two cases:

p(h = k | ẑ = x) = q if k = x,  and  p(h = k | ẑ = x) = r if k ≠ x,

for some values q ≥ 0 and r ≥ 0 such that q + (K − 1) r = 1. To find q, we use the definition of L:

q = 1/m + (1 − 1/m) (1/K).

This holds because L is equivalent to an algorithm that selects a single observation in the set s uniformly at random; with probability 1/m it selects the same observation as ẑ, and otherwise it selects an independent draw from p(z). So, to satisfy the condition q + (K − 1) r = 1, we have:

r = (1 − 1/m) (1/K).

Now, we are ready to find the desired expression:

J(ẑ; h) = E_ẑ || p(h) , p(h | ẑ) ||_T = (1/2) [ (q − 1/K) + (K − 1) (1/K − r) ] = (K − 1) / (K m).
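The variational information in Example 2 can be computed exactly by brute-force enumeration over all K^m samples; a short sketch (our own, feasible only for small K and m), which agrees with the value (K − 1)/(K m):

```python
from fractions import Fraction
from itertools import product

K, m = 3, 4  # small enough that all K**m samples can be enumerated exactly

joint = {}  # joint[(x, k)] = p(zhat = x, h = k)
for s in product(range(K), repeat=m):
    ps = Fraction(1, K**m)  # p(z) is uniform over Z, so all samples are equally likely
    for x in s:             # zhat: a uniformly random element of s
        for k in s:         # h: an independent uniformly random element of s
            joint[(x, k)] = joint.get((x, k), Fraction(0)) + ps / m**2

# J(zhat; h): total variation between the joint distribution and the product of
# marginals; both marginals are uniform (1/K each) by symmetry
J = Fraction(1, 2) * sum(abs(joint.get((x, k), Fraction(0)) - Fraction(1, K * K))
                         for x in range(K) for k in range(K))

print(J)                       # 1/6
print(Fraction(K - 1, K * m))  # (K - 1)/(K m) = 1/6, matching the closed form
```

Exact rational arithmetic via `Fraction` makes the equality with the closed form exact rather than approximate.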
Note that the variational information in Example 2 is Θ(1/m), which is smaller than the variational information in Example 1. This is not a coincidence. The difference between the two examples is related to data processing. Specifically, suppose that K = 2 in Example 2 and let h_2 be its hypothesis, and let h_1 be the hypothesis in Example 1. Then, we have the Markov chain s → h_1 → h_2 because h_2 is Bernoulli with parameter h_1.

Learning Capacity
The variational information depends on the distribution of observations p(z), which is seldom known in practice. To construct a distribution-free bound on the uniform generalization risk, we introduce the following quantity:

Definition 6 (Learning Capacity). The learning capacity of an algorithm L is defined by:

C(L) = sup_{p(z)} J(h; ẑ),

where h and ẑ are as defined in Theorem 2.
The above quantity is analogous to the Shannon channel capacity except that it is measured in the total variation distance. It quantifies the capacity for overfitting in the given learning algorithm. For example, the learning capacity of the algorithm in Example 1 is 1/ √ 2πm up to a first order term, as proved earlier, so its capacity for overfitting is larger than that of the learning algorithm in Example 2.
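To put the two quantities side by side, a tiny sketch (our own) contrasting the Θ(1/√m) learning capacity of Example 1 with the Θ(1/m) variational information of Example 2 under the uniform distribution (with a hypothetical choice K = 10):

```python
from math import sqrt, pi

K = 10
for m in (10, 100, 1000, 10000):
    cap1 = 1 / sqrt(2 * pi * m)   # Example 1: learning capacity, first order
    j2 = (K - 1) / (K * m)        # Example 2: variational information, uniform p(z)
    print(m, round(cap1, 5), round(j2, 5), round(cap1 / j2, 1))
```

The ratio grows like √m, showing that the empirical-average algorithm of Example 1 has the larger capacity for overfitting.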
Theorem 2 reveals that C(L) has, at least, three equivalent interpretations:

1. Statistical: The learning capacity C(L) is equal to the supremum of the expected generalization risk R_gen(L) across all input distributions and all bounded parametric losses. This holds by Theorem 2 and Definition 6.

2. Algorithmic: The learning capacity C(L) measures the influence of a single training example ẑ on the distribution of the final hypothesis h. As such, a learning algorithm has a small learning capacity if and only if it is algorithmically stable. This follows from the fact that J(ẑ; h) = E_ẑ || p(h) , p(h | ẑ) ||_T.

3. Information-theoretic: The learning capacity C(L) is analogous to the capacity of the channel from the training sample to the hypothesis, with the amount of information transfer measured in the total variation distance rather than in bits.

Throughout the sequel, we analyze the properties of C(L) and derive upper bounds for it under various conditions, such as the finite hypothesis space setting and differential privacy.

The Definition of Hypothesis
In the proof of Theorem 2, the Markov chain ẑ → s → h → l(·, h) is used. Essentially, this states that the loss function l(·, h) : Z → [0, 1], which is a random variable itself, must be parameterized entirely by the hypothesis h, as stated in Definition 3. We list, next, a few examples that highlight this point.

Example 3 (Input Normalization).
If the data is normalized prior to training, such as using min-max or z-score normalization, then the normalization parameters are included in the definition of the hypothesis h.
Example 4 (Feature Selection). If the observations z comprise d features and feature selection is implemented prior to training a model v (such as in classification or clustering), then the hypothesis h is the composition (u, v), where u ∈ {0, 1}^d encodes the set of features that have been selected by the feature selection algorithm.

Example 5 (Cross Validation).
Hyper-parameter tuning is a common practice in machine learning. This includes choosing the tradeoff parameter C in support vector machine (SVM) [28] or the bandwidth γ in radial basis function (RBF) networks [29]. However, not all hyper-parameters are encoded in the hypothesis h. For instance, the tradeoff constant C is never used during prediction so it is omitted from the definition of h but the bandwidth parameter γ is included if it is selected based on the training sample.
In order to illustrate why the Markov chain ẑ → s → h → l(·, h) is important, consider the following simple scenario. Suppose we have a mixture of two Gaussians in R^d, one corresponding to the positive class and one corresponding to the negative class. If z-score normalization is applied before training a linear classifier, then the generalization risk might increase with normalization, because the final hypothesis now includes more information about the training sample (see Lemma 2). Figure 1 shows this effect when d = 1: the red curve (top) corresponds to the difference between training and test accuracy when z-score normalization is applied before learning a classifier, while the blue curve (bottom) corresponds to the same difference when the data is not normalized. As illustrated in the figure, normalization is often important in order to assign equal weights to all features, but it can increase the generalization risk as well.

Concentration
The notion of uniform generalization in Definition 4 provides in-expectation guarantees. In this section, we show that whereas traditional generalization in expectation does not imply concentration, uniform generalization in expectation implies concentration. In fact, we will use the chain rule in Theorem 1 to derive a Markov-type inequality. After that, we show that the bound is tight.
We begin by showing why a non-uniform generalization in expectation does not imply concentration.

Proposition 2.
There exists a learning algorithm L : Z^m → H and a parametric loss l : Z × H → [0, 1] such that the expected generalization risk is R_gen(L) = 0 even though p( |R(h) − R_s(h)| = 1/2 ) = 1, where the probability is evaluated over the randomness of s and the internal randomness of L.
Proof. Let Z be an uncountable domain equipped with a distribution p(z) that assigns zero probability to any individual point, and fix a predictor h' : Z → {−1, +1} such that p( h'(z) = +1 ) = p( h'(z) = −1 ) = 1/2, where the probability is evaluated over the random choice of z ∈ Z. In other words, the marginal distribution of the labels predicted by h' is uniform over the set {−1, +1}. These assumptions are satisfied, for example, when Z = [0, 1], p(z) is the uniform distribution, and h'(z) = +1 if and only if z ≥ 1/2. Next, let the hypothesis space H be the set of predictors from Z to {−1, +1} that output a label in {−1, +1} uniformly at random everywhere in Z except at a finite number of points. Define the parametric loss by l(z; h) = I{ h(z) = h'(z) }.
Next, we construct a learning algorithm L that generalizes perfectly in expectation but does not generalize in probability. The learning algorithm L simply picks h ∈ {h_0, h_1} at random with equal probability. The two hypotheses are:

h_0(z) = h'(z) if z ∈ s, and a uniformly random label in {−1, +1} otherwise;
h_1(z) = −h'(z) if z ∈ s, and a uniformly random label in {−1, +1} otherwise.

Because Z is uncountable, where the probability of seeing the same observation z twice is zero, R(h) = 1/2 for this learning algorithm. Thus:

R_gen(L) = E [ R(h) − R_s(h) ] = 1/2 − (1/2) · 1 − (1/2) · 0 = 0.

However, the empirical risk for any s satisfies R_s(h) ∈ {0, 1}, while the true risk always satisfies R(h) = 1/2, as mentioned earlier. Hence, the statement of the proposition follows.
There are many ways of seeing why the algorithm in Proposition 2 does not generalize uniformly in expectation. The simplest way is to use the equivalence between uniform generalization and variational information as stated in Theorem 2. Given the hypothesis h ∈ {h_0, h_1} that is learned by the algorithm constructed in the proposition, the conditional distribution of an individual training example p(ẑ | h) is uniform over the sample s. This follows from the fact that the hypothesis h has to encode the entire sample s. However, the probability of seeing the same observation twice is zero (by construction). Hence, || p(ẑ) , p(ẑ | h) ||_T = 1. This shows that C(L) = 1.
The example in Proposition 2 reveals an interesting property of non-uniform generalization. Namely, non-uniform generalization can be sensitive to every bit of information provided by the hypothesis. In the example above, the hypothesis h is encoded by the pair (s, k), where k ∈ {0, 1} determines which of the two hypotheses {h 0 , h 1 } is selected. The discrepancy between generalization in expectation and generalization in probability happens because k is added into the hypothesis.
Next, we use the chain rule in Theorem 1 to prove that uniform generalization, on the other hand, is a robust property of learning algorithms. More precisely, if k has a finite domain, then a hypothesis h generalizes uniformly in expectation if and only if the pair (h, k) generalizes uniformly in expectation. Hence, adding any finite amount of information (in bits) to a hypothesis cannot alter its uniform generalization property in a significant way.

Theorem 3. Let L : Z^m → H be a learning algorithm whose hypothesis is h ∈ H. Let k ∈ K be a different hypothesis that is obtained from the same sample s, where |K| < ∞. If ẑ ∼ s, then:

J(ẑ; h) ≤ J(ẑ; (h, k)) ≤ (7/2) J(ẑ; h) + √( log |K| / (2 m) ).

Proof. The proof is in Appendix D.
We use Theorem 3, next, to prove that uniform generalization in expectation implies generalization in probability. The proof is by contradiction. Suppose we have a hypothesis h that generalizes uniformly in expectation, but there exists a parametric loss l : Z × H → [0, 1] that does not generalize in probability. We will derive a contradiction from these two assumptions. Specifically, we append to h a small amount of side information k that records whether the empirical risk w.r.t. l is greater than, approximately equal to, or less than the true risk w.r.t. the same loss; this is described in, at most, two bits. Knowing this additional information, we can define a new parametric loss that does not generalize in expectation, which contradicts the definition of uniform generalization.

Theorem 4. Let L : Z^m → H be a learning algorithm, whose risk is evaluated using a parametric loss l : Z × H → [0, 1]. Then, for any t > 0:

p( |R(h) − R_s(h)| ≥ t ) ≤ (1/t) [ (7/2) J(ẑ; h) + √( log 3 / (2 m) ) ],

where the probability is evaluated over the random choice of s and the internal randomness of L.
Proof. Let l : Z × H → [0, 1] be a parametric loss function and write:

κ(t) = p( |R(h) − R_s(h)| ≥ t ).

Consider the new pair of hypotheses (h, k), where:

k = +1 if R_s(h) − R(h) ≥ t,  k = −1 if R_s(h) − R(h) ≤ −t,  and k = 0 otherwise.

Then, by Theorem 3, the uniform generalization risk in expectation for the composition of hypotheses (h, k) is bounded by (7/2) J(ẑ; h) + √(log 3 / (2m)). This holds uniformly across all parametric loss functions that satisfy the Markov chain s → (h, k) → l(·, (h, k)). Next, consider the parametric loss:

l'(z, (h, k)) = l(z, h) if k = +1,  l'(z, (h, k)) = 1 − l(z, h) if k = −1,  and l'(z, (h, k)) = 1/2 if k = 0.

Note that l'(z, (h, k)) is parametric with respect to the composition of hypotheses (h, k). By construction of k, the generalization risk w.r.t. l'(z, (h, k)) in expectation is, at least, as large as t κ(t). Therefore, by Theorems 2 and 3, we have t κ(t) ≤ (7/2) J(ẑ; h) + √(log 3 / (2m)), which is the statement of the theorem. (Note: the proof assumes that the loss function l' has access to the underlying distribution. This assumption is valid because the underlying distribution p(z) is fixed and does not depend on any random outcomes, such as s or h.)
Theorem 4 reveals that uniform generalization is sufficient for concentration to hold. Importantly, the generalization bound depends on the learning algorithm L only via its variational information J (ẑ; h). Hence, by controlling the uniform generalization risk, one improves the generalization risk of L both in expectation and with a high probability.
The same proof technique used in Theorem 4 also implies the following concentration bound, which is useful when I(h; s) = o(m), where I(x; y) is the Shannon mutual information. The following bound is similar to the bound derived by [23] using properties of sub-Gaussian loss functions.

Proposition 3. Let L : Z^m → H be a learning algorithm, whose risk is evaluated using a parametric loss function l : Z × H → [0, 1]. Then, for any t > 0:

p( |R(h) − R_s(h)| ≥ t ) ≤ (1/t) √( ( I(s; h) + log 3 ) / (2 m) ).

Proof. The proof is in Appendix E.
Note that having a vanishing mutual information, i.e., I(s; h) = o(m), which is the setting recently considered in the work of [23], is a strictly stronger condition than uniform generalization. For instance, we will later construct deterministic learning algorithms that generalize uniformly in expectation even though I(s; h) is unbounded (see Theorem 8).
Proposition 4 provides a matching lower bound, where the probability is, again, evaluated over the random choice of s and the internal randomness of L.
Proof. The proof is in Appendix F.
Proposition 4 shows that, without making any additional assumptions beyond that of uniform generalization, the concentration bound in Theorem 4 is tight up to constant factors. Essentially, the only difference between the upper and the lower bounds is a vanishing O(1/ √ m) term that is independent of L.

Properties of the Learning Capacity
In this section, we derive bounds on the learning capacity under various settings. We also describe some of its important properties.

Data Processing
The relationship between the learning capacity and data processing is presented in Lemma 1: given random variables x, y, and z that form the Markov chain x → y → z, we always have J(x; z) ≤ J(x; y). Hence, data processing induces a partial order on learning algorithms, which yields an important qualitative insight into the design of machine learning algorithms.
Suppose we have two different hypotheses h_1 and h_2. We will say that h_2 contains less information than h_1 if the Markov chain s → h_1 → h_2 holds. For example, if the observations z_i ∈ {0, 1} are Bernoulli trials, then h_1 ∈ R can be the empirical average as given in Example 1, while h_2 ∈ {0, 1} can be the label that occurs most often in the training set. Because h_2 = I{h_1 ≥ 1/2}, the hypothesis h_2 contains strictly less information about the original training set than h_1. Formally, we have s → h_1 → h_2. In this case, h_2 enjoys a better uniform generalization bound because of data processing. Intuitively, we know that such a result should hold because h_2 is less dependent on the original training set than h_1. Hence, one can improve the uniform generalization bound (or, equivalently, the learning capacity) of a learning algorithm by post-processing its hypothesis h in a manner that is conditionally independent of the original training set given h. Example 6. Post-processing hypotheses is a common technique in machine learning. It includes sparsifying the coefficient vector w ∈ R^d in linear methods, where w_j is set to zero if its absolute magnitude is small. It also includes methods that have been proposed to reduce the number of support vectors in SVM by exploiting linear dependence [30], as well as some methods for decision tree pruning. By the data processing inequality, such techniques reduce the learning capacity and, as a consequence, mitigate the risk of overfitting.
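The Bernoulli example above can be checked numerically. The sketch below (a simplified illustration of ours; the parameter choices and function names are not from the text) computes the variational information J(ẑ; h) = ||p(ẑ) p(h), p(ẑ, h)||_T exactly for both the empirical average h_1 and the majority label h_2, and confirms that post-processing cannot increase it:

```python
from math import comb

def tv(p, q):
    # total variation distance between two pmfs given as dicts
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def variational_information(m, phi, hypothesis):
    """Exact J(z_hat; h) for m Bernoulli(phi) trials, where the hypothesis
    is hypothesis(k) applied to the count k of ones in the sample, and
    z_hat is one observation drawn uniformly at random from the sample."""
    joint, marg_h = {}, {}
    for k in range(m + 1):
        pk = comb(m, k) * phi**k * (1 - phi)**(m - k)   # p(count = k)
        h = hypothesis(k)
        marg_h[h] = marg_h.get(h, 0.0) + pk
        joint[(1, h)] = joint.get((1, h), 0.0) + pk * (k / m)   # p(z_hat=1, h)
        joint[(0, h)] = joint.get((0, h), 0.0) + pk * (1 - k / m)
    product = {(z, h): (phi if z else 1 - phi) * marg_h[h]
               for h in marg_h for z in (0, 1)}
    return tv(joint, product)

m, phi = 10, 0.3
j1 = variational_information(m, phi, lambda k: k / m)            # h1: average
j2 = variational_information(m, phi, lambda k: int(k >= m / 2))  # h2: majority
assert j2 <= j1 + 1e-12   # data processing: s -> h1 -> h2 cannot increase J
print(j1, j2)
```

For a lazy learner with m = 1, the same routine returns J = E|k − φ|, matching the total variation between the one-sample empirical distribution and p(z).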
Of course, better generalization does not immediately translate into a smaller true risk: the empirical risk itself may increase when the hypothesis h is post-processed independently of the original training sample.

Effective Domain Size
Next, we look into how the size of the domain Z limits the learning capacity. First, we start with the following definition: Definition 7 (Lazy Learning). A learning algorithm L is called lazy if the training sample s ∈ Z m can be reconstructed perfectly from the hypothesis h ∈ H. In other words, H(s|h) = 0, where H is the Shannon entropy. Equivalently, the mapping from s to h is injective.
One common example of a lazy learner is instance-based learning, where h = s. Despite their simple nature, lazy learners are useful in practice, and they are useful theoretical tools as well. In particular, because H(s|h) = 0, the data processing inequality implies that the learning capacity of a lazy learner upper-bounds the learning capacity of any other learning algorithm. Therefore, we can relate the learning capacity C(L) to the size of the domain Z by determining the learning capacity of lazy learners. Because the size of Z is often infinite, we introduce the following definition of effective set size.

Definition 8.
In a countable space Z endowed with a probability mass function p(z), the effective size of Z w.r.t. p(z) is defined by:

Ess_p(z)(Z) = 1 + ( Σ_{z∈Z} √( p(z) (1 − p(z)) ) )²

At one extreme, if p(z) is uniform over a finite alphabet Z, then Ess_p(z)(Z) = |Z|. At the other extreme, if p(z) is a Kronecker delta distribution, then Ess_p(z)(Z) = 1. As proved next, this notion of effective set size determines the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense. As a result, it allows us to relate the learning capacity to a property of the domain Z. Theorem 5. Let Z be a countable space endowed with a probability mass function p(z). Let s be a set of m i.i.d. observations z_i ∼ p(z). Define p_s(z) to be the empirical probability mass function that results from drawing observations uniformly at random from s. Then:

E_s ||p(z), p_s(z)||_T ∼ √( (Ess_p(z)[Z] − 1) / (2πm) )

where Ess_p(z)[Z] is the effective size of Z (see Definition 8).
Proof. The proof is in Appendix G.
A special case of Theorem 5 was proved by de Moivre in the 1730s, who showed that the empirical mean of i.i.d. Bernoulli trials with a probability of success φ converges to the true mean with rate √(2φ(1 − φ)/(πm)). This is believed to be the first appearance of the square-root law in statistical inference in the literature [31]. Because the effective domain size of the Bernoulli distribution, according to Definition 8, is given by 1 + 4φ(1 − φ), Theorem 5 agrees with, and in fact generalizes, de Moivre's result. Corollary 1. Let L : Z^m → H be a learning algorithm whose hypothesis is h ∈ H. Then, J(ẑ; h) ≤ E_s ||p(z), p_s(z)||_T. Moreover, the bound is achieved by lazy learners.
Proof. Let h̄ be the hypothesis produced by a lazy learner; the simplest example is when h̄ is the training sample s itself. Then, we always have the Markov chain s → h̄ → h for any hypothesis h ∈ H. Therefore, by the data processing inequality, we have J(ẑ; h) ≤ J(ẑ; h̄). By Theorem 5, we have J(ẑ; h̄) = E_s ||p(z), p_s(z)||_T ∼ √((Ess_p(z)[Z] − 1)/(2πm)). Hence, the statement of the corollary follows.
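The convergence rate of Theorem 5 can be checked by simulation. In the sketch below, the closed form used for Ess is our reading of Definition 8 (an assumption on our part; it does return |Z| for the uniform distribution and 1 + 4φ(1 − φ) for a Bernoulli(φ), matching the text), and a Monte Carlo estimate of E_s ||p, p_s||_T is compared against √((Ess − 1)/(2πm)):

```python
import math
import random
from collections import Counter

def effective_size(p):
    # One reading of Definition 8 (an assumption on our part): it yields
    # |Z| for a uniform p and 1 + 4*phi*(1 - phi) for a Bernoulli(phi).
    return 1.0 + sum(math.sqrt(q * (1 - q)) for q in p) ** 2

def expected_tv(p, m, trials=2000, seed=0):
    """Monte Carlo estimate of E_s || p, p_s ||_T over m i.i.d. draws."""
    rng = random.Random(seed)
    support = list(range(len(p)))
    total = 0.0
    for _ in range(trials):
        counts = Counter(rng.choices(support, weights=p, k=m))
        total += 0.5 * sum(abs(counts[z] / m - q) for z, q in enumerate(p))
    return total / trials

p, m = [0.5, 0.3, 0.2], 200
rate = math.sqrt((effective_size(p) - 1) / (2 * math.pi * m))
est = expected_tv(p, m)
print(est, rate)   # the estimate should track the predicted rate closely
```

The agreement is already close at m = 200, consistent with the claim that the rate is asymptotically tight.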

Corollary 2. For any learning algorithm L : Z^m → H over a finite domain Z, the learning capacity satisfies C(L) ≤ √((|Z| − 1)/(2πm)) asymptotically as m → ∞.
Proof. The function f(p) = ∑_z √(p(z)(1 − p(z))) is both concave over the probability simplex and permutation-invariant. Hence, by symmetry, the maximum effective domain size is achieved at the uniform distribution p(z) = 1/|Z|, in which case Ess_p(z)[Z] = |Z|.

Finite Hypothesis Space
Next, we look into the role of the size of the hypothesis space. This is formalized by the following theorem.

Theorem 6.
Let h ∈ H be the hypothesis produced by a learning algorithm L : Z^m → H. Then: J(ẑ; h) ≤ √( H(h) / (2m) ), where H(h) is the Shannon entropy of h measured in nats.
Proof. If we let I(x; y) be the mutual information between the random variables x and y and let s = {z_1, z_2, ..., z_m} be the training set, we have: Because conditioning reduces entropy, i.e., H(x|y) ≤ H(x) for any random variables x and y, we have: Therefore: Next, we use Pinsker's inequality [10], which states that for any probability measures p and q: ||p, q||_T ≤ √( D(p || q) / 2 ), where ||p, q||_T is the total variation distance and D(p || q) is the Kullback-Leibler divergence measured in nats. If we recall that J(s; h) = ||p(s) p(h), p(s, h)||_T while the mutual information is I(s; h) = D(p(s, h) || p(s) p(h)), we deduce the statement of the theorem from Pinsker's inequality and Equation (13). Theorem 6 re-establishes the classical PAC result for the finite hypothesis space setting. However, unlike its typical proofs, the proof presented here is purely information-theoretic and makes no reference to the union bound.
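The Pinsker step in the proof above can be sanity-checked numerically. A small self-contained sketch (the distribution sizes and trial counts are arbitrary choices of ours):

```python
import math
import random

def tv(p, q):
    # total variation distance between two pmfs given as lists
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    # KL divergence in nats; assumes q(z) > 0 wherever p(z) > 0
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def random_dist(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
for _ in range(1000):
    p, q = random_dist(5, rng), random_dist(5, rng)
    # Pinsker's inequality: ||p, q||_T <= sqrt(D(p || q) / 2)
    assert tv(p, q) <= math.sqrt(kl(p, q) / 2) + 1e-12
print("Pinsker's inequality held in all trials")
```

Combined with I(s; h) ≤ H(h) ≤ log |H| for a finite hypothesis space, this is exactly the route the proof takes to the √(log |H|/(2m)) PAC rate.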

Differential Privacy
Randomization reduces the risk of overfitting. One common randomization technique in machine learning is differential privacy [32,33], which addresses the goal of obtaining useful information about the sample s as a whole without revealing a lot of information about any individual observation. Here, we show that differentially-private learning algorithms have small learning capacities. Not surprisingly, the differential privacy parameters (ε, δ) control the uniform generalization risk, where small values of ε and δ lead to a reduced risk of overfitting.

Empirical Risk Minimization of 0-1 Loss Classes
Empirical risk minimization (ERM) of a stochastic loss is a popular approach for learning from data. It is often regarded as the default strategy to use, due to its simplicity, generality, and statistical efficiency [1,3,13,34]. Given a fixed hypothesis space H, a domain Z, and a loss function l : H × Z → R, the ERM learning rule selects the hypothesis ĥ_s that minimizes the empirical risk:

ĥ_s = arg min_{h ∈ H} (1/m) Σ_{i=1}^m l(h, z_i)

By contrast, the true risk minimizer h* is:

h* = arg min_{h ∈ H} E_{z ∼ p(z)} [ l(h, z) ]

Hence, learning via ERM is justified if R(ĥ_s) ≤ R(h*) + ε for some ε ≪ 1. If such a condition holds and ε → 0 as the sample size m increases, the ERM learning rule is called consistent.
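As a concrete (hypothetical) instance of the ERM rule, the sketch below minimizes the empirical 0-1 risk over a finite grid of threshold classifiers, breaking ties in favor of the smallest threshold, i.e., a fixed total order on H of the kind used later in Theorem 8; the grid, seed, and sample size are arbitrary choices of ours:

```python
import random

def erm(sample, thresholds):
    """Return the empirical risk minimizer among threshold classifiers
    h_t(x) = I{x >= t}; ties are broken toward the 'least' threshold,
    i.e., according to a fixed total order on the hypothesis space."""
    def emp_risk(t):
        return sum(1 for x, y in sample if (x >= t) != y) / len(sample)
    return min(thresholds, key=lambda t: (emp_risk(t), t))

rng = random.Random(0)
true_t = 0.6
# labeled sample from an error-free separator at true_t
sample = [(x, x >= true_t) for x in (rng.random() for _ in range(200))]
H = [i / 20 for i in range(21)]   # a finite hypothesis space of thresholds
h_hat = erm(sample, H)
print(h_hat)   # an empirical risk minimizer near the true threshold
```

With 200 samples, the selected threshold lands on (or next to) the true one, while its empirical risk is zero, illustrating both competing objectives from the introduction.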
Uniform generalization is a sufficient condition for the consistency of empirical risk minimization (ERM). To see this, we have by definition: From this, we conclude that: where C(L) is the learning capacity of the empirical risk minimization rule. The last inequality follows from Theorem 2. In addition, because R(ĥ_s) − R(h*) ≥ 0, we have by the Markov inequality: Hence, the ERM learning rule is consistent if C(L) → 0 as m → ∞. Next, we describe when such a condition on C(L) holds for 0-1 loss classes. To do so, we begin with two familiar definitions from statistical learning theory. The VC dimension is arguably the most fundamental concept in statistical learning theory because it provides a crisp characterization of learnability for 0-1 loss classes. Next, we show that the VC dimension has, in fact, an equivalent characterization in terms of the learning capacity C(L). Specifically, under the Axiom of Choice, an ERM learning rule with a vanishing learning capacity C(L) exists if and only if the 0-1 loss class has a finite VC dimension.
Before we establish this important result, we describe why ERM by itself is not sufficient for uniform generalization to hold even when the hypothesis space has a finite VC dimension.
In other words, the goal is to learn a threshold in the unit interval that separates the positive from the negative examples. Let x ∈ X be uniformly distributed in [0, 1] and let h be an error-free separator. Then, for any training sample s ∈ Z^m, the set of all empirical risk minimizers Ĥ is: In particular, Ĥ is an interval, which has the power of the continuum, so it can be used to encode the entire training sample.
Fix δ > 0 in advance, which can be made arbitrarily small. Then, the probability over the random choice of the sample that |Ĥ| < δ can be made arbitrarily small for a sufficiently small δ > 0, where |Ĥ| is the length of the interval.
Let ĥ ∈ Ĥ be the hypothesis that lies at the middle of Ĥ, i.e., ĥ = (inf Ĥ + sup Ĥ)/2. Then, [ĥ − δ/2, ĥ + δ/2] ⊆ Ĥ holds with a high probability (which can be made arbitrarily close to 1 for a sufficiently small δ). Let h̃ be a hypothesis whose binary expansion agrees with ĥ in its first k + 1 bits and encodes the entire training sample in the rest of the bits. Finally, the output of the learning algorithm is ĥ_s, which is given by the following rule: 1.
If h̃ is an empirical risk minimizer, then set ĥ_s = h̃. 2.
Otherwise, set ĥ_s = ĥ. Now, define the parametric loss l : Z × H → [0, 1] to be a function that first uses ĥ_s to decode the training sample s based on the coding method constructed above and, then, assigns 1 if and only if x ∈ s. To reiterate, this decoding succeeds with a probability that can be made arbitrarily high for a sufficiently small δ > 0. Clearly, l is a loss defined on the product space Z × H and has a bounded range. However, the generalization risk w.r.t. l is, at least, equal to the probability that |Ĥ| ≥ δ, which can be made arbitrarily close to 1. Hence, the statement of the proposition holds.
Proposition 6 shows that one cannot obtain a non-trivial bound on the uniform generalization risk of an ERM learning rule in terms of the VC dimension d and the sample size m without making some additional assumptions. Next, we prove that an ERM learning rule exists that satisfies the uniform generalization property if the hypothesis space has a finite VC dimension. We begin by recalling a fundamental result in modern set theory. A non-empty set Q is said to be well-ordered if Q is endowed with a total order such that every non-empty subset of Q contains a least element. The following fundamental result, which was published in 1904, is due to Ernst Zermelo [35].
Theorem 7 (Well-Ordering Theorem). Under the Axiom of Choice, every non-empty set can be well-ordered. Theorem 8. Given a hypothesis space H, a domain Z, and a 0-1 loss l : H × Z → {0, 1}, let ⪯ be a well-ordering on H and let L : Z^m → H be the learning rule that outputs the "least" empirical risk minimizer of the training sample s ∈ Z^m according to ⪯. Then, C(L) → 0 as m → ∞ if H has a finite VC dimension. In particular: where d is the VC dimension of H, provided that m ≥ d.
Proof. The proof is in Appendix I.
Next, we prove a converse statement. Before we do this, we present a learning problem that shows why a converse to Theorem 8 is not generally possible without making some additional assumptions. Hence, our converse will be later established for the binary classification setting only.
Example 8 (Subset Learning Problem). Let Z = {1, 2, 3, ..., d} be a finite set of positive integers. Let H = 2^Z and define the 0-1 loss of a hypothesis h ∈ H to be l(z, h) = I{z ∉ h}. Then, the VC dimension is d. However, the learning rule that always outputs h = Z is an ERM learning rule that generalizes uniformly with rate 0, regardless of the sample size and the distribution of observations.
The previous example shows that a converse to Theorem 8 is not generally possible without making some additional assumptions. In particular, in the Subset Learning Problem, the VC dimension is not an accurate measure of the complexity of the hypothesis space H because many hypotheses dominate others (i.e., perform at least as well across all distributions of observations). For example, the hypothesis h = {1, 2, 3} dominates h′ = {1} because there is no distribution on observations for which h′ outperforms h. In fact, the hypothesis h = Z dominates all other hypotheses.
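Example 8 is easy to verify exhaustively for a small d (the value d = 5 and the names below are our own choices):

```python
from itertools import combinations

d = 5
Z = list(range(1, d + 1))
# H = 2^Z: every subset of Z is a hypothesis
H = [frozenset(c) for r in range(d + 1) for c in combinations(Z, r)]

def loss(z, h):
    return 0 if z in h else 1   # the 0-1 loss l(z, h) = I{z not in h}

full = frozenset(Z)
# h = Z incurs zero loss on every observation, so it is an empirical risk
# minimizer for every sample and its true risk is 0 under every p(z) ...
assert all(loss(z, full) == 0 for z in Z)
# ... and it (weakly) dominates every other hypothesis, observation by
# observation, which is why the VC dimension overstates the difficulty here.
assert all(loss(z, full) <= loss(z, h) for h in H for z in Z)
print(len(H), "hypotheses; h = Z dominates all of them")
```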
Consequently, in order to prove a lower bound for all ERM rules, we focus on the standard binary classification setting.

Theorem 9.
In any fixed domain Z = X × Y, let the hypothesis space H be a concept class on X and let l(x, y, h) = I{y ≠ h(x)} be the misclassification error. Then, any ERM learning rule L w.r.t. l has a learning capacity C(L) that is bounded from below by C(L) ≥ (1/2)(1 − 1/d)^m, where m is the training sample size and d is the VC dimension of H. Proof. The proof is in Appendix J.
Using both Theorems 8 and 9, we arrive at the following equivalence characterization of the VC dimension of a concept class with the learning capacity.
Theorem 10. Given a fixed domain Z = X × Y, let the hypothesis space H be a concept class on X and let l(x, y, h) = I{y ≠ h(x)} be the misclassification error. Let m be the sample size. Then, the following statements are equivalent under the Axiom of Choice:

1.
H admits an ERM learning rule L whose learning capacity C(L) satisfies C(L) → 0 as m → ∞.

2.
H has a finite VC dimension.
Proof. The lower bound in Theorem 9 holds for all ERM learning rules. Hence, an ERM learning rule that generalizes uniformly with a vanishing rate across all distributions exists only if H has a finite VC dimension. Conversely, under the Axiom of Choice, H can always be well-ordered by Theorem 7, so, by Theorem 8, a finite VC dimension is also sufficient to guarantee the existence of an ERM learning rule that generalizes uniformly.
Theorem 10 presents a characterization of the VC dimension in terms of information theory. According to the theorem, an ERM learning rule can be constructed that does not encode the training sample if and only if the hypothesis space has a finite VC dimension.

Remark 2.
One method of constructing a well-ordering on a hypothesis space H is to use the fact that computers are equipped with finite precision. Hence, in practice, every hypothesis space is countable, and the usual ordering of the natural numbers then induces a valid well-ordering on H.

Concluding Remarks
In this paper, we introduced the notion of "learning capacity" for algorithms that learn from data, which is analogous to the Shannon capacity of communication channels. Learning capacity is an information-theoretic quantity that measures the contribution of a single training example to the final hypothesis. It has three equivalent interpretations: (1) as a tight upper bound on the uniform generalization risk, (2) as a measure of information leakage, and (3) as a measure of algorithmic stability. Furthermore, by establishing a chain rule for the learning capacity, concentration bounds were derived, which revealed that the learning capacity controls both the expectation of the generalization risk and its variance. Moreover, the relationship between algorithmic stability and data processing revealed that algorithmic stability can be improved by post-processing the learned hypothesis.
Throughout this paper, we provided several bounds on the learning capacity under various settings. For instance, we established a relationship between algorithmic stability and the effective size of the domain of observations, which can be interpreted as a formal justification for dimensionality reduction methods. Moreover, we showed how learning capacity recovered classical bounds, such as in the finite hypothesis space setting, and derived new bounds for other settings as well, such as differential privacy. We also established that, under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule for 0-1 loss classes that had a vanishing learning capacity was equivalent to the assertion that the hypothesis space had a finite Vapnik-Chervonenkis (VC) dimension, thus establishing an equivalence relation between two of the most fundamental concepts in statistical learning theory and information theory.
More generally, the intent of this work is to bring to light a new information-theoretic approach for analyzing machine learning algorithms. Although "uniform generalization" might appear to be a strong condition at first sight, one of the central claims of this paper is that uniform generalization is, in fact, a natural condition that arises commonly in practice; it is not a condition that needs to be required or enforced. We believe this holds because any learning algorithm is a channel from the space of training samples to the hypothesis space. Because learning is a mapping between two spaces, its risk of overfitting should be determined from the mapping itself (i.e., independently of the choice of the loss function). Such an approach yields the uniform generalization bounds derived in this paper.
It is worth highlighting that uniform generalization bounds can be established for many other settings that have not been discussed in this paper, and the notion has found some promising applications. Using sample compression schemes, one can show that any learnable hypothesis space is also learnable by an algorithm that achieves uniform generalization [36]. Also, generalization bounds for stochastic convex optimization yield information criteria for model selection that can outperform the popular Akaike information criterion (AIC) and Schwarz's Bayesian information criterion (BIC) [37]. More recently, uniform generalization has inspired the development of new approaches for structured regression as well [38].

Further Research Directions
Before we conclude, we suggest future directions of research and list some open problems.

Induced VC Dimension
The variational information J(ẑ; h) provides an upper bound on the generalization risk of the learning algorithm L across all parametric loss classes. This upper bound is achieved by the generalization risk of the binary reconstruction loss: which assigns the value one to observations z ∈ Z that are more likely to have been present in the training sample s upon knowing h, and assigns zero otherwise. In expectation, the generalization risk of this parametric loss is the worst generalization risk across all parametric loss classes. Let both p(z) and p(h|z) be fixed; the first is the distribution of observations, while the second is entirely determined by the learning algorithm L. Then, because the loss in Equation (16) is binary, it has a VC dimension, which we will call the induced VC dimension of the learning algorithm L [39]. Note that this induced VC dimension is defined for all learning problems, including regression and clustering, but, quite unlike the traditional VC dimension of hypothesis spaces, it is distribution-dependent.
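The claim that the reconstruction loss attains J(ẑ; h) can be verified exactly in the Bernoulli empirical-average setting of Example 1. In the sketch below, the explicit form of the loss — flagging z when it is strictly more likely under p(ẑ | h) than under p(z) — is our assumed reading of Equation (16), and the parameter values are arbitrary:

```python
from math import comb

m, phi = 10, 0.3
# Empirical-average learner on Bernoulli(phi) trials: h = k/m for the
# count k of ones, and p(z_hat = 1 | h = k/m) = k/m.
p_h = {k: comb(m, k) * phi**k * (1 - phi)**(m - k) for k in range(m + 1)}

def recon_loss(z, k):
    # Assumed form of the binary reconstruction loss in Equation (16):
    # flag z when it is more likely under p(z_hat | h) than under p(z).
    p_cond = k / m if z else 1 - k / m
    p_prior = phi if z else 1 - phi
    return 1 if p_cond > p_prior else 0

gap = 0.0   # expected generalization gap of the reconstruction loss
J = 0.0     # variational information J(z_hat; h)
for k, pk in p_h.items():
    emp = sum((k / m if z else 1 - k / m) * recon_loss(z, k) for z in (0, 1))
    true = sum((phi if z else 1 - phi) * recon_loss(z, k) for z in (0, 1))
    gap += pk * (emp - true)
    J += pk * abs(k / m - phi)   # TV(p(z | h), p(z)) for a binary z
assert abs(gap - J) < 1e-12   # the loss attains the worst-case gap J
print(gap, J)
```

The equality holds term by term: for each value of h, the positive part of p(z|h) − p(z) summed over z is exactly the total variation distance.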
Many open questions relate to the induced VC dimension of learning algorithms. For instance, while a finite VC dimension implies a small variational information, when does the converse also hold? Can we obtain a non-trivial bound on the induced VC dimension of a learning algorithm L upon knowing its uniform generalization risk J(ẑ; h)? Along similar lines, suppose that L is an empirical risk minimization (ERM) algorithm for a 0-1 loss class that may or may not use an appropriate tie-breaking rule (in light of what was discussed in Section 5.5). Is there a non-trivial relation between the VC dimension of the 0-1 loss class that is being minimized and the induced VC dimension of the ERM learning algorithm?

Unsupervised Model Selection
Information criteria (such as AIC and BIC) are sometimes used in the unsupervised learning setting for model selection, such as when determining the value of k in the popular k-means algorithm [40]. Given that the notion of uniform generalization is developed in the general setting of learning, should the learning capacity C(L) serve as a model selection criterion in the unsupervised setting? Why or why not?

Effective Domain Size
The effective size of the domain of a random variable z in Definition 8 satisfies some intuitive properties and violates others. For instance, it reduces to the size of the domain |Z | when the distribution is uniform. Moreover, if z is Bernoulli, the effective domain size is determined by the variance of the Bernoulli distribution. Importantly, this notion is well-motivated because it determines the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense. As a result, it allowed us to relate the learning capacity to a property of the domain Z.
However, such a notion of effective domain size has some surprising properties. For instance, the effective size of the domain of two independent random variables is not equal to the product of the effective sizes of the individual domains! A similar phenomenon is observed in rate distortion theory. Reference [10] explains this observation by stating that "rectangular grid points (arising from independent descriptions) do not fill up the space efficiently." Can the effective domain size in Definition 8 be motivated using rate distortion theory?
Funding: This research received no external funding.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Proof of Lemma 2
Without loss of generality, assume that all domains are enumerable. We have: However, the minimum of sums is always at least as large as the sum of minimums. That is: Using marginalization p(x) = ∑_y p(x, y) and the above inequality, we obtain:

Appendix B. Proof of Theorem 1
We will first prove the inequality for k = 2. First, we write by definition: J(z; (h_1, h_2)) = ||p(z, h_1, h_2), p(z) p(h_1, h_2)||_T. Using the fact that the total variation distance is related to the ℓ1 distance by ||P, Q||_T = (1/2) ||P − Q||_1, we have: Using the triangle inequality: The above inequality is interpreted by expanding the ℓ1 distance into a sum of absolute values of terms in the product space Z × H_1 × H_2, where h_k ∈ H_k. Next, we bound each term on the right-hand side separately. For the first term, we note that: The equality holds by expanding the ℓ1 distance and using the fact that ∑_{h_2} p(h_2 | h_1) = 1. The second term, in turn, can be re-written as: Combining Equations (A1) and (A2) yields the inequality: where the first inequality follows from Lemma 2 and the second inequality follows from the chain rule. Thus, we obtain the desired bound: J(x; (y, z)) − J(x; y) ≤ J(x; z | y) (A5). Both Equations (A4) and (A5) imply that the chain rule is tight. More precisely, the inequality can be made arbitrarily close to an equality when one of the two terms in the upper bound is chosen to be arbitrarily close to zero.

Appendix D. Proof of Theorem 3
We will use the following fact:

First Setting:
We first consider the following scenario. Suppose a learning algorithm L produces a hypothesis h ∈ H from some marginal distribution p(h) independently of the training sample s. Afterwards, L produces a second hypothesis k ∈ K according to p(k | h, s). In other words, k depends on both h and s but the latter two random variables are independent of each other. Under this scenario, we have: where the equality follows from the chain rule in Theorem 1, the statement of Proposition 1, and the fact that J (ẑ; h) = 0.
The conditional variational information can be written as: where we used the fact that p(ẑ|h) = p(ẑ). By marginalization: Similarly: Next, we note that, since h is independent of the sample s, the variational information between ẑ ∼ s and k ∈ K can be bounded using Theorem 6. This follows because h is selected independently of the sample s and, hence, the i.i.d. property of the observations z_i continues to hold. Therefore, we obtain: Because p(k|ẑ, h) is arbitrary in our derivation, the above bound holds for any distribution of observations p(z), any distribution p(h), and any family of conditional distributions p(k|ẑ, h).
Original Setting: Next, we return to the original setting, where both h ∈ H and k ∈ K are chosen according to the training sample s. We have: In the last line, we used the triangle inequality. Next, we would like to bound the first term. Using the fact that the total variation distance is related to the ℓ1 distance by ||p, q||_T = (1/2) ||p − q||_1, we have: Here, the inequality follows from Fact 1. Next, we bound the second term in Equation (A7). This, along with the chain rule, implies the statement of the theorem.
without replacement from the sample s. To determine the variational information between ẑ and h, we consider the following two cases: 1.
So, by combining the two cases above, we deduce that: Therefore, L generalizes uniformly with rate tδ. Next, let the parametric loss be given by l(z′; h) = I{z′ ∈ h}. With this loss: which is the statement of the proposition. Due to the tightness of the Stirling approximation, the asymptotic expression for the variational information is tight. Because J(ẑ; h) = E_s ||p(z), p_s(z)||_T, we deduce that: E_s ||p(z), p_s(z)||_T ∼ √((Ess_p(z)[Z] − 1)/(2πm)), which provides the asymptotic rate of convergence of an empirical probability mass function to the true distribution.

Appendix H. Proof of Proposition 5
First, we note that, for any two adjacent samples s and s′ and any O ⊆ H, we have in the differential privacy setting: p(h ∈ O|s) − p(h ∈ O|s′) ≤ (e^ε − 1) p(h ∈ O|s′) + δ. Similarly, we have: where s and s′ are two adjacent samples. Finally, we use Equation (A11) to arrive at the statement of the proposition.

Appendix I. Proof of Theorem 8
The proof is similar to the classical VC argument. Given a fixed hypothesis space H, a fixed domain Z, and a 0-1 loss function l : H × Z → {0, 1}, let s = {z_1, ..., z_m} be a training sample that comprises m i.i.d. observations. Define the restriction of H to s by: F_s = {(l(h, z_1), ..., l(h, z_m)) : h ∈ H}, and declare two hypotheses equivalent if they share the same restriction. It is trivial to see that this defines an equivalence relation; i.e., it is reflexive, symmetric, and transitive. Let the set of equivalence classes w.r.t. s be denoted H_s. Note that we have a one-to-one correspondence between the members of F_s and the members of H_s. Moreover, H_s is a partitioning of H.
We use the standard twin-sample trick, where s_2 = s ∪ s′ ∈ Z^{2m} and L learns based on s only. For any fixed h ∈ H, let f : H × Z → [0, 1] be an arbitrary loss function, which can be different from the loss l that is optimized during training. A Hoeffding bound for sampling without replacement [41] states that: Hence: This holds for a hypothesis h ∈ H that is fixed independently of the random split of s_2 into training and ghost samples. When h is selected according to the random split of s_2, we need to employ the union bound.
For any subset H′ ⊆ H, let min(H′) be the least element in H′ according to the well-ordering on H. Let H_s be as defined previously and write H_min(s) = {min(H_k) : H_k ∈ H_s}. Then, it is easy to observe that the ERM learning rule of Theorem 8 must select one of the hypotheses in H_min(s_2) regardless of the split s_2 = s ∪ s′. This holds because H_{s_2} is a finer partitioning of H than H_s; in other words, every member of H_s is a union of some finite number of members of H_{s_2}. By the well-ordering property, the "least" element among the empirical risk minimizers must therefore be in H_min(s_2).
Hence, there are, at most, τ_H(2m) possible hypotheses given s_2, where τ_H(m) is the growth function (sometimes referred to as the shattering coefficient), and those hypotheses can be fixed independently of the random splitting of s_2 into a training sample s and a ghost sample s′. Consequently, we have by the union bound:

Appendix J. Proof of Theorem 9

Then, the expected number of distinct values in X that are not observed in the training sample s is: Here, we used the linearity of expectation, which holds even when the random variables are not independent. This shows that the expected fraction of instances in X that are not seen in the sample s is (1 − 1/d)^m. Next, given an ERM learning rule that outputs an empirical risk minimizer, the training error of this learning algorithm is zero because X is shattered by H. However, for any learning rule L, the expected error rate on the unseen examples is, at least, 1/2 by construction. Therefore, there exists a distribution p(z) in which the generalization risk is, at least, (1/2)(1 − 1/d)^m.
By Theorem 2, the learning capacity is an upper bound on the maximum generalization risk across all distributions of observations and all parametric loss functions. Consequently: which is the statement of the theorem.
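The counting step in the proof above — that the expected fraction of d equiprobable instances missing from a sample of size m is (1 − 1/d)^m — can be confirmed by simulation (the parameter values below are arbitrary choices of ours):

```python
import random

def unseen_fraction(d, m, trials=5000, seed=0):
    """Monte Carlo estimate of the expected fraction of d equiprobable
    values that never appear among m i.i.d. uniform draws."""
    rng = random.Random(seed)
    missing = 0
    for _ in range(trials):
        seen = {rng.randrange(d) for _ in range(m)}
        missing += d - len(seen)
    return missing / (trials * d)

d, m = 10, 15
exact = (1 - 1 / d) ** m   # by linearity of expectation, as in the proof
print(unseen_fraction(d, m), exact)   # the two agree closely
```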