Distribution-Dependent Weighted Union Bound †

In this paper, we deal with the classical Statistical Learning Theory’s problem of bounding, with high probability, the true risk R(h) of a hypothesis h chosen from a set H of m hypotheses. The Union Bound (UB) allows one to state that PLR^(h),δqh≤R(h)≤UR^(h),δph≥1−δ where R^(h) is the empirical errors, if it is possible to prove that P{R(h)≥L(R^(h),δ)}≥1−δ and P{R(h)≤U(R^(h),δ)}≥1−δ, when h, qh, and ph are chosen before seeing the data such that qh,ph∈[0,1] and ∑h∈H(qh+ph)=1. If no a priori information is available qh and ph are set to 12m, namely equally distributed. This approach gives poor results since, as a matter of fact, a learning procedure targets just particular hypotheses, namely hypotheses with small empirical error, disregarding the others. In this work we set the qh and ph in a distribution-dependent way increasing the probability of being chosen to function with small true risk. We will call this proposal Distribution-Dependent Weighted UB (DDWUB) and we will retrieve the sufficient conditions on the choice of qh and ph that state that DDWUB outperforms or, in the worst case, degenerates into UB. Furthermore, theoretical and numerical results will show the applicability, the validity, and the potentiality of DDWUB.


Introduction
Statistical learning theory [1][2][3][4] deals with the problem of understanding and estimating the performance of a statistical learning procedure. The goal is to better understand the factors that influence its behavior and to suggest ways to improve it. Although asymptotic analysis is a crucial first step in this direction, finite sample error bounds are of more value as they allow the design of model selection procedures [5][6][7]. These error bounds typically have the following form: with high probability, the generalization error of the selected hypothesis, chosen in a space of possible ones, is bounded by an empirical estimate of the generalization error plus a penalty term which depends on the size of the hypothesis space and the number of samples available. The latter term basically considers that the learning procedure selects a hypothesis in a set of possible ones based on the available data. Every data-dependent choice implies a risk, and the penalty term is exactly the measure of this risk. When the hypothesis space is composed of an arbitrary finite number of hypothesis, and no additional information is provided, the evaluation of the total risk is usually made with the Union Bound (UB) [2, 7,8]. The UB is an ubiquitous building block in statistical learning theory and is exploited in many context and in many different ways to derive the final result: in the Vapnik-Chervonenkis theory [2], in the Rademacher Complexity theory [9,10], in the Algorithmic Stability theory [11], in the Compression Bound [12], in the PAC-Bayes theory [13], and more recently in the Differential Privacy theory [14].
Let us consider the classical binary classification framework (The extension to the general supervised learning characterized by bounded loss functions will be discussed If the choice of h ∈ H does not depend on D n , namely if we want to bound the generalization error of a single hypothesis in the hypothesis space chosen before seeing the data, it is possible to prove that (Please note that for simplicity, we will refer to R(h) and R(h) with R andR respectively, when it is clear from the context.) where δ ∈ (0, 1) while L and U are respectively lower and upper bounds of the generalization error (see, for example, [15][16][17]).
In general, the choice of h ∈ H does depend on D n : in this case we must estimate the risk due to this data-dependent choice.
As an example, common practice for choosing h ∈ H based on D n is to choose the hypothesis with minimum empirical error arg min h∈HR (h), and this approach is called Empirical Risk Minimization [2,18], but others possibilities exist such as the Structural Risk Minimization [2, 19,20], or the penalized (regularized) Empirical Risk Minimization [21][22][23].
To guarantee a prescribed confidence level, or risk, of the chosen hypothesis, the UB can be applied. The UB can be expressed in two forms (Please note that for simplicity, we will refer to R(h i ) andR(h i ) with R i andR i respectively, when it is clear from the context.) [8]: a simplified version (Theorem 1) and a generalized version (Theorem 2). Theorem 1 (Simple UB). The following bounds hold Theorem 2 (Generalized UB). Let q(h i ) ∈ (0, 1) and p(h i ) ∈ (0, 1) be some weight associated with h i with i ∈ I before seeing the data (Please note that for simplicity, we will refer to q(h i ) and p(h i ) with q i and p i respectively, when it is clear from the context.) and such that ∑ i∈I (q i + p i ) = 1, then the following bounds hold Theorem 1 is a special case of Theorem 2 when q i = p i = 1 /2m ∀i ∈ {1, · · · , m}. Theorem 2 introduces a weight for each risk associated with each choice. Weighting more the risk associated with useful choices leads to tighter bounds on the generalization error of hypotheses that will be selected by the algorithm (hypotheses characterized by small empirical error) and looser estimates over the others (hypotheses characterized by high empirical error). Unfortunately, this approach is mainly theoretical since the weights must be chosen before seeing the data and consequently we cannot set them without an a priori knowledge about the problem. Finally, Theorem 2 does not propose any solution for the choice of these weights.
For this reason, in this work, we propose a Distribution-Dependent Weighted UB (DDWUB) where the weights depend on some parameters of the distribution which generated them, extending our preliminary work [24]. In particular, we define a set of functions f p i : R m → R and f q i : R m → R with i ∈ I such that q i = f q i (R 1 , · · · , R m ), p i = f p i (R 1 , · · · , R m ) ∈ (0, 1), ∀i ∈ I, ∑ i∈I f q i (R 1 , · · · , R m ) + f p i (R 1 , · · · , R m ) = 1.
Please note that f q i , f p i with i ∈ I are quite general and are data independent (Please note that in the framework of the paper (binary classification with a loss function which counts the number of misclassified samples), the generalization error is the only parameter of the distribution which is a Binomial). It is surely possible to consider even more general data independent functions for defining the weights, but we think that our definition is general enough to contemplate a wide variety of cases.
At this point the proposed DDWUB for bounding the generalization error of a hypothesis chosen from a finite set of possible ones can be stated. Theorem 3. If ∀r 1 , · · · , r m ∈ [0, 1] f q i (r 1 , · · · , r m ), f p i (r 1 , · · · , r m ) ∈ (0, 1), ∀i ∈ I, ∑ i∈I f q i (r 1 , · · · , r m ) + f p i (r 1 , · · · , r m ) = 1, then the following bound holds The proof is a direct consequence of Theorem 2. DDWUB allow the binding of the generalization error of each hypothesis in the space of hypotheses but, to prove that the DDWUB outperforms the UB, we will require some sufficient conditions that L, U, f q i , and f p i with i ∈ I must satisfy. These sufficient conditions define a class of functions and an open problem would be to find special and simpler classes of functions which satisfy them.
Nevertheless, we will show that it is possible to find simple classes of functions which satisfy these conditions. For example, if one is interested in having tighter upper bound of the generalization error of the empirical minimizer DDWUB suggest combining classical L and U, such as [15] or [16], and set with particular values of γ ∈ [0, ∞) and θ ∈ [0, 1]. As a last remark we would like to note that all our results easily extend to multiclass classification problems and regression problems if the loss function is bounded.
DDWUB is a distribution-dependent form of the UB analogously to the Computable Shell Decomposition Bounds (CSDB) [20]. The CSDB splits the hypothesis space in shells based on the generalization error of each hypothesis and, instead of taking into account the risk of each hypothesis in the space, show that it is possible to just take into account the risk of choosing one shell and the risk associated with each hypothesis in the shell. This allows, for example, to not consider hypotheses with high generalization error. The CSDB show also how to estimate the size of these shells based on the histogram of the empirical errors.
DDWUB takes inspiration from several works in the field. The first idea, which is also a driver of the CSDB, is that during any learning procedure the hypotheses with high error will be never taken into account and consequently we should not pay the risk for those hypotheses [10]. The second idea is that since we do not know the true error of the hypotheses but just its empirical one, we should discard those hypotheses for which the estimated confidence intervals do not overlap [25] with the ones of the hypothesis of minimal training error. The third idea is that since there is no supporting theory for discarding the hypothesis with non-overlapping confidence intervals, we should weight differently the risk associated with each hypothesis based on their true error analogously to what is done in the field of multiple hypotheses testing [26]. The fourth idea is that other researchers have shown that a distribution-dependent weighting strategy can be performed without the actual knowledge of the distribution [27]. DDWUB combines all these ideas and improves both on the UB and the CSDB.
DDWUB applies to finite hypotheses spaces and surely more sophisticated techniques, such as Local Vapnik-Chervonenkis [28] or the Local Rademacher Complexity [10], can be employed and can sometimes result in tighter bounds. However, insight into finite classes remains quite useful [20,29]. Finite class analysis can be exploited for as a pedagogical tool. Finite class analysis can teach new directions in which to look for the development and evolution of more sophisticated bounds. Finite class analysis can be useful for model selection purposes (e.g., selecting the most suitable hypothesis space, or set of hyperparameters, or algorithm). Finite class analysis can be useful when the models are represented with limited number of bits because of the constants involved in the bounds.
The rest of the paper is organized as follows. Section 2 presents the DDWUB in a simplified setting. In Section 3 we present the DDWUB in a generalized setting, we derive the sufficient conditions which state when DDWUB improves over the UB, we will show that it is possible to find simple classes of functions which satisfy these conditions, and we will make the connection between our results and the ones of [25]. Section 4 reports a comparison between DDWUB and the UB by means of closed form results. Section 5 reports a comparison between DDWUB and the UB by means of an extensive set of numerical results. Section 6 compares DDWUB with CSDB by means of an extensive set of numerical results. Section 7 shows the applicability and the potentiality of DDWUB. Section 8 concludes the paper. In the Appendices known results, proof, and technicalities (See in Appendixs A-C) are reported for completeness.

Distribution-Dependent Weighted Union Bound: Simplified Setting
Let us consider Theorem 3 and the bound proposed by [16] recalled by Theorem A1 in Appendix A. Let us also suppose, for simplicity, that we are interested in upper bounding the generalization error of the empirical risk minimizer (Extensions will be discussed at the end of this section). In this setting it is possible to state our DDWUB.
∑ j∈I e −γ max[θ,r j ] , ∀i ∈ I, with γ ∈ [0, ∞) and θ ∈ [0, 1] then the following bound holds Corollary 1 is a direct consequence of Theorems 3 and A1. The choice of the weights takes inspiration from the work of [27] which proposed, in the context of the PAC-Bayes theory, a distribution-dependent method for assigning an a priori distribution over a set of hypotheses to give a higher probability to the hypothesis with small generalization error. This method has been shown to possess interesting theoretical properties [30,31] and to be also quite effective in practical applications [32].
Since we are interested in choosing and bounding the generalization error of the empirical minimizer, let us define This approach is analogous to Page's criterion [33], which was designed as a process inspection scheme to detect deviations in average in only one direction (one-sided) in a stochastic process.
In Corollary 1, γ acts as a weighting factor. The larger is γ the larger are the weights of the risks associated with hypotheses with small empirical error and the smaller are the weights of the risks associated with hypotheses with large empirical error. For γ → ∞ we have that (For simplicity, we assume in this statement that the empirical minimizer is unique.) p i * → 1 and p i → 0 ∀i ∈ I \ i * . The smaller is γ the less is the difference between the weights of the risks. For γ → 0 we have that p i = 1 /m ∀i ∈ I.
In Corollary 1, θ, instead, acts as a protection against the fact that the empirical error is measured over a finite number of samples and, if the sample size is small, hypotheses with a small difference in the empirical error are indistinguishable. In other words, the weights depend on unknown parameters of the data generating distribution, then we will have to estimate them and since the number of sample is finite these estimates will not allow us to distinguish hypotheses which show similar empirical error. For this reasons, θ gives the same the weight to the risks associated with hypotheses with small empirical error.
The values of γ and θ must be set in a particular way to be sure that DDWUB improves over the UB. In particular • Lemma 1 shows that to upper bound the generalization error of the empirical risk minimizer based on DDWUB of Corollary 1 we must solve an optimization problem; • Lemmas 2 and 3 show that for particular values of γ the solution is unique and can be found by simply search for the fixed point of a simple function; • Theorem 4 and Lemma 4 show that for particular values of θ it is possible to prove that DDWUB is tighter than, or in the worst case as tight as, the UB.
Thanks to Corollary 1 we can state the following lemma. then we can state that following bound holds Lemma 1 can be further simplified as follows.
Lemma 2. Under the same conditions of Lemma 1 the following bound holds The proof can be found in Appendix B. Please note that the optimization problem of Lemma 2 can be further simplified noting that for particular values of γ, the solution of the optimization problem is unique.
The proof can be found in Appendix B. Please note that to find the fixed point defined in Lemma 3 a simple bisection method can be applied.
For particular values of θ, it is possible to state that DDWUB is tighter than, or in the worst case as tight as, the UB.
The proof can be found in Appendix B.
The problem of the θ defined in Theorem 4 is that it is data-dependent since we do not know i * before seeing the data. For this reason, the following lemma suggests a data independent threshold of θ which satisfies the conditions of Theorem 4.
The proof can be found in Appendix B.
Lemma 4 provides us a method for finding a θ ≥ U R i * , δ 2m in a data independent way by setting By finding the r * i * for all possible values of θ and then by selecting the largest one which satisfies Equation (1) we have the results of our DDWUB.
Please note that the above-mentioned result easily extends to the whole supervised learning framework, until a bounded loss function is employed, since the inequality proposed by [16] cover this case.
Following the same argument described in this section it is possible to derive the DDWUB for lower bounding the generalization error of the empirical risk minimizer by simply setting in Theorem 3 , ∀i ∈ I.
Finally, it is possible to derive the DDWUB for upper and lower bounding the generalization error of the empirical risk minimizer by simply setting in Theorem 3 ∑ j∈I e −γ max[θ,R j ] , ∀i ∈ I. Example 1. Before presenting DDWUB in the general setting we would like to show an application of DDWUB in the simplified setting. Let us consider the case when (More general examples can be derived, and we will do it later with both closed form and numerical results, but here we want to keep the presentation as simple as possible.) Let us set γ = 2 √ n (see Lemma 3) and note that to upper bound the function with the smallest empirical error (i.e., the one corresponding toR 1 ) we have that DDWUB states that .
Please note that for a finite but large enough value of n Thanks to the theory of DDWUB (see Lemma 4) we can state that Let us note that if Then we can easily state that which means that all the hypothesis in the space withR = 0, if m < δe 2n(ν/2) 2 /2, are not taken into account, asymptotically, in estimating the upper bound of the hypothesis with the smaller error with DDWUB.

Distribution-Dependent Weighted Union Bound: General Setting
In this section, we will derive the sufficient conditions for stating that DDWUB is tighter than, or in the worst case as tight as, the UB.
In particular, as we have done in Section 2, we will start by supposing that we are just interested in upper bounding the generalization error of the empirical risk minimizer. Nevertheless, as pointed out in Section 2, DDWUB can be easily generalized also to the lower bounds, or to both lower and upper bounds, and to the general supervised setting with bounded loss functions but, in this work, we did not report all these extensions in order not to make the notation and the presentation over-complicated.
As noted in the introduction, the weights should not depend on the data, but they can depend on some parameters of the data generating distribution. For this reason, we define a set of functions f i : R m → R with i ∈ I such that In this setting DDWUB can be formulated as follows.
To prove that DDWUB outperforms UB we will require some sufficient conditions that L, U, and f i with i ∈ I must satisfy. Please note that from Corollary 2, it is possible to derive all the lower bounds of the generalization error of the hypotheses in the class since L R i , δ 2m depends just on known quantities. For what concerns, instead, the upper bounds, the answer is not as easy.
In the rest of this section, we will show how to find the upper bound of the generalization error of a hypothesis chosen in H based on DDWUB (Corollary 2) and under which conditions these upper bounds are tighter than the one of UB (Theorem 1). For this purpose Thanks to Corollary 2 we can state the following lemma.

Lemma 5.
Under the same conditions of Corollary 2, if ∀r ∈ [0, 1] and ∀δ ∈ (0, 1) then the following bound holds with probability at least (1 − δ) and ∀i ∈ I The solution of the optimization problem of Lemma 5 is not trivial to be found and its properties are not easy to catch.
The following lemma helps us in simplifying the optimization problem of Lemma 5 under a quite natural condition: the upper bound of the generalization error of a hypothesis should decrease if the generalization error of one of the other hypotheses in the class increases.
In fact, the hypotheses of the lemma imply that to reach the maximum or r i , one must reach the lower bounds of r j with j ∈ I \ i.
Even if the optimization problem of Lemma 6 is much simpler than the one of Lemma 5, the next result further simplifies it under another sufficient condition which ensure the existence and uniqueness of the solution: the upper bound of the generalization error of a hypothesis should not increase too fast it its generalization error decreases.

Lemma 7.
Under the same conditions of Lemma 6, if ∀r i ∈ {0, 1 /n, · · · , 1}, ∀r 1 , · · · , r m ∈ [0, 1], and ∀r i , then the solution r * i of the optimization problem of Lemma 6 exists, it is unique, and it is the fixed point of the following function of r i In fact, the hypothesis of the lemma guarantees the uniqueness of the solution. Algorithm 1 reports a simple pseudo-code for finding the fixed point of Lemma 7 based on the bisection method.
The next result introduces a parameter, more specifically a threshold, θ and states the condition over θ which states that the DDWUB improves over the UB.
Unfortunately, we cannot set θ = U R i , δ 2m since this would be a data-dependent choice which will result in a data-dependent weighting strategy. The next lemma addresses this problem. Input: m,R i and f i (r 1 , · · · , r m ) with i ∈ I, δ, n, and the precision Output: r * Under the same conditions of Theorem 5, if ∀r,r ,r ∈ {0, 1 /m, · · · , 1} such that r <r we have that The proof of Lemma 8 can be found in Appendix B.
Lemma 8 provides us a method for finding a θ ≥ U R i * , δ 2m in a data independent way by setting Thanks to all these results we can provide a method for finding the fixed point of Lemma 7 but with data independent weighting strategy which satisfies the hypothesis of Theorem 5 and a data independent θ defined in Lemma 8.

Lemma 9.
Under the same conditions of Lemma 8, Algorithm 2 finds the fixed point of Lemma 7 but with a data independent weighting strategy which satisfies the hypothesis of Theorem 5 and a data independent θ defined in Lemma 8.
The proof of Lemma 9 can be found in Appendix B.
Algorithm 2: Algorithm for finding the fixed point of Lemma 7 but with data independent weighting strategy which satisfies the hypothesis of Theorem 5 and a data independent θ defined in Lemma 8.
Input: m,R i and f i (r 1 , · · · , r m ) with i ∈ I, δ, n, and the precision Output: r * Please note that if we apply this general theory to Corollary 1 we obtain the same results of Section 2.
In the next section, instead, we apply the general theory to a the more complex case of when [15] is employed together with the same weights exploited in Corollary 1.

From Theory to Practice
In this section, we will exploit the same solution of Corollary 1 for the weights needed in DDWUB and we will show that is satisfies hypothesis of the Theorems, the Corollaries, and the Lemmas presented in the previous section. In particular we will set where γ ∈ [0, +∞) and θ ∈ [0, 1] are finite constants which regulates the shape of the distribution of the weights.
The following lemma shows that these weighs satisfy the sufficient conditions which states that DDWUB outperforms, or in the worst case performs as, the UB.
then the hypotheses of Corollary 2 and Theorem 5 are satisfied.
The proof of Lemma 10 can be found in Appendix B. Please note that for what concerns the hypothesis of Lemmas 6 and 7, we cannot prove that condition holds without knowing the shape of the lower and upper bounds of the generalization error. For this reason, we exploit the generalization bounds of [15] which are the tightest ones in the settings of this paper [6]. This will allow us in Section 5 to compare the UB with the DDWUB with a set of numerical experiments.
To reach our goal let us define the Regularized Incomplete Beta Function p = F(r; a, b) = I r (a, b) and its inverse r = F −1 (p; a, b) with parameters specified by a ∈ N * and b ∈ N * for the corresponding values of r and probabilities in p The following lemma states the conditions under which all the hypotheses of Corollary 2, Lemmas 6 and 7, Theorem 5, and Lemma 8 are satisfied if, in Corollary 2, the lower and upper generalization bounds proposed by [15], recalled by Theorem A2 in the Appendix A, and the weights defined in Lemma 10 are exploited. Lemma 11. Let us exploit the lower and upper generalization bounds proposed by [15] in Corollary 2 The proof of Lemma 11 can be found in Appendix B.
The case whereR i * = 1 is trivial since, in this case, we can safely state that R i * ≤ 1. Please note that the limit in the value of γ is O( √ n) and this result is connected and in agreement with the consistency results derived in the PAC-Bayes [30] and in Algorithmic Stability [31] theories where the distribution-dependent prior of [27] is exploited.

Observation
Let us consider the case where the empirical errors of the hypotheses in the space have been sorted as followsR 1 ≤R 2 ≤ · · · ≤R m , and let us exploit the results of Section 3.1.
Let us set 7 6 , and suppose thatR Then by using the UB (see Theorem 1) we can state that with probability at least (1 − δ) while, by using DDWUB (see Lemma 11), we can state that with probability at least (1 − δ) By looking at this last problem and by setting r 1 = r * 1 for simplicity, we can observe some properties. The first one is that The second one is that ∀i ∈ I \ J These properties state that DDWUB, with respect to the UB, can discard, or more properly reduce, the risk of the hypotheses h i with i ∈ I \ J when estimating the generalization error of the hypothesis h 1 . As a first raw approximation, we can state that we must pay a risk only for the hypotheses h i with i ∈ J obtaining where |J | ≤ m. A quite important aspect is then to understand some properties of θ. Let us recall that θ = U L −1 min[r 1 , · · · , r m ], δ 2m , δ 2m , but we can easily state that min[r 1 , · · · , r m ] = min U R 1 , δ f 1 (r 1 , · · · , r m ) 2 , L R 2 , δ 2m .
Thanks to Theorem 5, we can state that f 1 (r 1 , · · · , r m ) ≥ 1 m and then the case where is quite unusual since it means thatR 2 −R 1 is large, or, in other words, it means that our set of hypotheses contains just one hypothesis with small error and many hypotheses with high error.
which means that the threshold θ is obtained from the upper bound of the second-best hypothesis in the space, namely the hypothesis with the second smallest empirical error. To the best of our knowledge, this result is new since in the past many researchers have tried, with similar approaches but with no supporting theory, in proposing to clean the hypothesis space from the hypotheses with high empirical error. The basic idea of these methods is that it is reasonable to discard all the hypotheses such that the lower bound of their generalization error is greater than U R 1 , δ 2m , namely the upper bound of the generalization error of the hypothesis with the smallest empirical error, see for example [25]. Our theory states, instead, that we can smooth the risk due to the hypotheses such that the lower bound of their generalization error is greater than U R 2 , n, δ 2m , namely the upper bound of the generalization error of the hypothesis with the second smallest empirical error.

Closed Form Results
In this section, we will report some closed form results regarding the Lemma 11. These examples are useful for providing an idea of the effect of the DDWUB, an extensive comparison with numerical results is provided in Section 5.
Let us consider the case wherê Let us set 7 6 , and note that arg min i∈IR i = 1,R 1 = 1.
If we use the UB (see Theorem 1) with the [15] we obtain that with probability at least (1 − δ) the following bound holds If, instead, we use DDWUB (see Lemma 9 and Algorithm 2) we obtain that with probability at least (1 − δ) the following bound holds Since we can surely state that Moreover, thanks to Theorem 5, we can state that

Asymptotic Case
Please note that f 1 (r * 1 , r 2 , · · · , r m ) = Consequently, at least asymptotically, DDWUB can discard entirely the risk due to the hypotheses with high empirical error.
Finite Sample Case DDWUB obviously gives an advantage, with respect to the UB, until This means that we have an advantage in terms of the tightness of the estimated upper bound for and consequently until m < δ2 n−1 .

Numerical Results
This section is devoted to the numerical comparison between the UB (see Theorem 1) and the DDWUB (see Lemma 11).
In particular we will focus on upper bounding the generalization error of the hypothesis, in the set of possible ones, characterized by the smallest empirical error. The comparison between the UB and the DDWUB will be made in different scenarios to better understand the advantages of the DDWUB.

Scenario A.
Scenario A is an optimistic scenario, the same of Section 4: the set of hypotheses contains few useful hypotheses (small empirical error) and a lot of useless hypotheses (high empirical error) such thatR 1 =R 2 = 0,R 3 = · · · =R m = 1.
We set δ = 0.05 and set the numerical precision of the algorithms to = 0.0001. Figure 1 reports the estimated generalization error upper bound of the hypothesis with the smallest empirical error estimated with the UB and the DDWUB, together with the percentage of improvement in three different sub-scenarios: Based on the results reported in Figure 1 we can observe that • DDWUB is always tighter, or equivalent in the worst case, with respect to the UB; • increasing the number of samples always increases the advantage of the DDWUB over the UB until all the risk of the hypothesis with largest empirical error is disregarded; • increasing m increases the advantage of the DDWUB over the UB until a limit value for m: if too many useless hypotheses are present it is not able anymore to disregard their risk. Nevertheless, the larger is n the far is this value for m.
In a slightly less optimistic scenario when which is the case of lot of charlatans and just a few good candidates in a hiring process [34], the results are reported in Figure 2a-c and do not change too much, apart from the limit of m when the DDWUB stops to improve over the UB which is obviously smaller.

Scenario B.
The second scenario is a more classical one when This case is exploited in many applications (e.g., the CSDB of [20], or the Structural Risk Minimization of [2], or the Structural Risk Minimization over data-dependent hierarchies of [19]). Also, in this scenario we set δ = 0.05 and we set the numerical precision to = 0.0001. Then we vary n ∈ {10, 20, 40, 100, 200, 400, 1000}. Figure 3 reports the estimated generalization error upper bound of the hypothesis with the smallest empirical error estimated with the UB and the DDWUB, together with the percentage of improvement. Figure 3 clearly shows the advantage of the DDWUB over the UB and the improvement in the advantage as soon as n increases.      Scenario C.
The last scenario involves the unlucky case in which our set of m hypotheses is taken from all 2 n possible binary hypotheses over n data. Please note that the number of hypotheses with i errors in the 2 n possible binary hypotheses over n data are ( n i ). Then we force our m hypotheses to have at least one hypothesis for each possible value of the empirical error and consequently m ≥ n + 1. The remaining m − n − 1 are taken from the 2 n possible binary hypotheses over n to approximate the distribution of the 2 n possible binary hypotheses. The result of this approach is that our set of hypotheses will be composed by m = ∑ n j=0 ( n j ) 2 n z hypotheses, with z ∈ N * , as followŝ Please note that for example, when z = 1 we obtain the Scenario B. Also, in this scenario we set δ = 0.05 and we set the numerical precision to = 0.0001. Figure 4a-c reports the estimated generalization error upper bound of the hypothesis with the smallest empirical error estimated with the UB and the DDWUB, together with the percentage of improvement in the same sub-scenarios of Scenario A. Figure 4 shows that even in this unlucky case, the DDWUB can remarkably outperform the UB.

The Importance of γ and θ
In this section, we would like to discuss the importance of γ and θ also by means of some numerical experiments supporting our claims.
Let us start with θ. From one side setting θ = 1 would make the DDWUB degenerate in the UB eliminating all the benefits of using the DDWUB. From the other side setting θ = 0, namely removing θ, is not possible because of the constraint on θ of Theorem 5 which does not allow us, in this case, to guarantee that the solution of the DDWUB always outperform, or in the worst case degenerates in, the UB. Hence, θ ∈ (0, 1) deals with the fact that we do not know the generalization error of our hypotheses, then we must estimate it, and consequently hypotheses with a small but close empirical error cannot be distinguished. Unfortunately, the constraint of Theorem 5 on θ is data-dependent and consequently we must resort to a suboptimal, but data independent, limitation on θ as reported in Lemma 8. The question which raises here is the practical difference of all these choices. For this reason, we consider the Scenario A previously defined withR 1 where we set δ = 0.05, n = 100, and m = 1000. Then we vary ν ∈ {0.1, 0.2, · · · , 1}, and we reported in Figure 5 the comparison between the UB and the DDWUB with θ = 0, θ =θ i.e., equal to the lower limit of the constraint of Theorem 5, θ =θ i.e., equal to the lower limit of the constraint of Lemma 8, and θ = 1.  . Scenario A (R 1 = 0.1,R 2 = ν andR 3 = · · · =R m = 1): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB with θ ∈ {0,θ,θ, 1} together with the percentage of improvement when we set δ = 0.05, n = 100, and m = 1000 and we vary ν ∈ {0.1, 0.2, · · · , 1}. The two figures above depict the whole range while the two figures below report a zoom on the most interesting parts. From Figure 5 it is possible to derive some observations. Setting θ = 0 in the DDWUB can result in worse estimates with respect to the ones of the UB since whenR 2 is close toR 1 (small ν) we cannot distinguish between the first two hypotheses which are the ones with the lowest generalization error. As soon as ν grows the difference betweenR 1 andR 2 becomes statistical relevant and so setting θ = 0 in the DDWUB results in better estimates with respect to the UB or even better that the ones of the DDWUB since θ in this particular scenario for large ν is useless. Setting θ =θ gives the best results and always outperform the UB while setting θ =θ results in worse estimates but still better than the ones of the UB. Please note that for small and large ν,θ andθ are equivalent while there is a middle range of ν for whichθ performs betterθ. The explanation of this phenomena can be derived using the observations of Section 3.2. The hypothesis that regulatesθ is the one corresponding toR 1 (the one with the smallest empirical error). The hypothesis that regulatesθ, instead, is the one corresponding toR 2 (the second-best hypothesis) if the distance betweenR 1 andR 2 is small, i.e., for small ν, while is the one corresponding toR 1 if the distance betweenR 1 andR 2 is large. Consequently, whenR 1 andR 2 are close it is indifferent to choose one or the other and the estimates of the DDWUB are almost equivalent. When, instead, the difference betweenR 1 andR 2 increases,R 2 , instead ofR 1 , regulates θ and the DDWUB withθ performs better than the DDWUB withθ. Finally, when the distance betweenR 1 andR 2 is large, θ is regulated byR 1 both forθ andθ and consequently the corresponding estimates of the DDWUB are equivalent. Finally, setting θ = 1 results in the UB.
Let us now consider γ. From one side setting γ = 0 would make the DDWUB degenerate in the UB eliminating all the benefits of using the DDWUB. From the other side setting γ = ∞ would result in splitting equally the confidence just over the hypotheses with estimated error less than θ, not considering all the other hypotheses, which is our scope but unfortunately this is not possible because of the constraints of Lemma 7, which then result in the limitation over γ in Lemma 11. This is the reason we set, in the experiments, γ to the limits of what Lemma 11 allows. After that limit we do not know how the solution of the DDWUB behaves since we do not even know how to retrieve it. Setting a γ smaller than the maximum value allowed by Lemma 11 would diminish the performance of the DDWUB until the DDWUB degenerates in the UB. To support this statement we consider Scenario A withR 1 =R 2 = 0 andR 3 = · · · =R m = 1, then set δ = 0.05, n = 100 and m = 1000, and we vary γ ∈ {10 −5 √ n, 10 −4 √ n, · · · , 10 −1 √ n,γ}, whereγ is the limit defined in Lemma 11, and we reported the comparison between the DDWUB and the UB in Figure 6. . Scenario A (R 1 =R 2 = 0 andR 3 = · · · =R m = 1): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB together with the percentage of improvement when we set n = 100 and m = 1000 and we vary γ ∈ {10 −5 √ n, 10 −4 √ n, · · · , 10 −1 √ n,γ}, whereγ is the limit defined in Lemma 11. From Figure 6 it is possible to clearly observe that the maximum improvement is achieved when γ is maximum and consequently when γ =γ. The same result can be observed in all the other scenarios.

What About the Computable Shell Decomposition Bounds?
In this section, we will show that the DDWUB also improves over the CSDB. Before starting the comparison, we must recall this milestone result.

Theorem 6 ([20]). The following bounds hold
where r = max[1, rn ] /n ∈ { 1 /n, · · · , n /n} and if r = k /n then r ∈ [ k − 1 /n, k /m], kl(q||p) = q ln( q /p) + (1 − q) ln( 1 − q /1 − p) is the Kullback-Leibler divergence, and Let us consider the same scenario of Section 4. The DDWUB is able, at least asymptotically, to discard all risk associated with the hypotheses with high empirical error obtaining that for n large enough, the rate of convergence of the bound on Instead, if we use the CSDB of Theorem 6 we get that for n large enough, the rate of convergence of the bound on R 1 is O( ln(n) /n). Moreover, note that O( ln(n) /n) is also the fastest possible rate of convergence for Theorem 6.
If instead of checking the asymptotic behavior of the DDWUB and the CSDB we check their finite sample behavior by means of numerical experiments as in Section 5, we can derive other interesting observations. Figures 7-10 (and the associated sub-figures) report the same comparison of Figures 1-4 in Section 5 but, instead of comparing the UB with the DDWUB, here we compare the CSDB with the DDWUB.       As it can be clearly seen from the results the DDWUB performs consistently better than the CSDB.

Improving the Computable Shell Decomposition Bounds
The purpose of this section is to demonstrate that the DDWUB can be exploited to improve known results in Statistical Learning Theory. In particular, this section we will show that DDWUB can be exploited to improve the CSDB.
The proof of the CSDB of Theorem 6 relies on two main results, reported in the next theorem, combined with the UB. Theorem 7 ([20]). The following bounds hold where S k n , n = ln max 1, 2 h i : i ∈ I, R i ∈ k−1 n , k n andŜ k n , n, δ is defined as in Theorem 6.
By splitting the hypotheses in shells based on their generalization error, namely H r = {h i : i ∈ I, R i ∈ [ k−1 /n, k /n]} with r ∈ { 1 /n, 2 /n, · · · , 1}, by combining the two probabilistic bounds of Theorem 7, by using the UB, and by considering the worst case scenario, the result of Theorem 6 is derived. Consequently, [20] consider 2n probabilistic bounds, two for each of one the shells, and spread the confidence (risk) equally over them.
We propose, instead, to use the same approach of the DDWUB in the CSDB. Instead of spreading the confidence equally over the 2n probabilistic bounds, we spread the confidence over them based on the maximum generalization error of the function in each of the n shells to which the bounds refer. The results is reported in the following lemma.

Lemma 12. The following bound holds
The proof is the simple application of the concepts behind the DDWUB. θ is not needed since for each one of the shells we exactly know by definition the maximum generalization error of the functions inside it. γ is set to n /ln(n) and the reason is the following one. The rate of convergence of CSDB is O( √ ln(n) /n) in the general case and O( ln(n) /n) whenR i = 0. Let us study instead the rate of convergence of the bound of Lemma 12. Thanks to the Geometric Series we can state that For n large enough we can state that Consequently, the rate of convergence of the bound of Lemma 12 is O( √ ln(ln(n)) /n) in the general case and O( ln(ln(n)) /n) whenR i = 0. This means that for γ = n /ln(n) the rate of convergence of the bound of Lemma 12 of better than the one of CSDB.
It could be possible to improve the bound with a different values of γ and θ or to prove that for particular values of γ and θ, Lemma 12 is always better than the CSDB but this is beyond of the scope in this paper.
If, instead, we compare the finite sample behavior of the CSDB and the Lemma 12 (that we will call CSDB+DDWUB) by means of numerical experiments as in Section 5, we can observe the possible benefit of using DDWUB in CSDB, a well-known result of Statistical Learning Theory. Figures 11-14 (and the associated sub-figures) report the same comparison of Figures 1-4 in Section 5 but, instead of comparing the UB with the DDWUB, here we compare the CSDB with the CSDB+DDWUB.       As it can be clearly seen from the results the CSDB+DDWUB performs consistently better than the CSDB.

Conclusions and Discussion
In this work we derived, for an arbitrary finite hypothesis space, a fully empirical new upper bound on the generalization error of the hypothesis of minimal training error. As noted in the paper, although we presented just the upper bound, our result can be easily generalized also to lower or both upper and lower bounds.
In particular we depicted a quite general framework under which it is possible to improve the Union Bound with a distribution depended weighting strategy associated with the risk of each choice. Then we stated the conditions under which the proposed Distribution-Dependent Weighed Union Bound is always tighter than the one based on the Union Bound. We showed that these conditions are quite easy to satisfy. By means of both closed form and numerical results we demonstrated that Distribution-Dependent Weighed Union Bound is consistently tighter than the Union Bound in different scenarios. Finally, we showed that the Distribution-Dependent Weighed Union Bound is also able to improve over the Computable Shell Decomposition Bound, another quite powerful distribution-dependent Union Bound.
The results of this work are quite promising and pave the way toward many different future improvements. One is surely to derive a class of weighting strategies which satisfies behind the Distribution-Dependent Weighed Union Bound. Another one is to understand how to exploit the results of this work for improving all the results in Statistical Learning Theory where the Union Bound is employed. A final one, and probably the most important one, is how to extend and apply our results to the infinite-dimensional hypothesis spaces. This extension, which is obviously not trivial, has multiple alternatives which can be speculated. The first, and naive, approach would be to plug our approach into the compression bound which already deals with the infinite-dimensional case, exploiting the concept of compression, and exploits naively the union bound. The second approach, less intuitive, would be to split the hypothesis spaces in a finite number of shells, estimate the size of each shell with classical bounds based on the Vapnik-Chervonenkis or Rademacher Complexity theories and then exploit our Distribution-Dependent Weighed Union Bound to pay the price of the choice of one of the shells. The last, and more challenging, approach would be to plug our weighting strategy directly in the derivation of the Vapnik-Chervonenkis or Rademacher Complexity bases bounds shrinking then the measures of complexity as alternative to the localization approaches.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Known Results
Known bounds and results exploited in the paper are included in this section.

Appendix B. Proofs
Proofs of our results are included in this section.
Proof. (Lemma 2) Please note that under the assumptions of the lemma, ∀r 1 , · · · , r m ∈ [0, 1], and ∀i ∈ I \ i * and ∀r k , r k ∈ [0, 1] such that r k < r k then the statement of the lemma is proved.

Proof. (Lemma 3)
Please note that under the assumptions of the lemma Note also that if then the statement of the lemma is proved.

Proof. (Theorem 4) Let us define
Let us suppose that If we set Consequently, by exploiting the same reasoning exploited before, r * i * ≤ ϑ. By induction, the statement of the theorem is proved.
Then we can state thatR i * ≤ min[R 1 , · · · , R m ] + log 2m δ 2n , and consequently, the statement of the theorem is proved.

Proof. (Theorem 5) Let us define
Let us suppose that r j ≤ ϑ, ∀j ∈ I \ i.
Consequently, by exploiting the same reasoning exploited before, r * i ≤ ϑ. By induction, the statement of the theorem is proved.
Then we can state thatR i * ≤ L −1 min[R 1 , · · · , R m ], δ 2m , and consequently, the statement of the theorem is proved.
The derivation of Proposition A1 is trivial.