Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds

“No free lunch” results state the impossibility of obtaining meaningful bounds on the error of a learning algorithm without prior assumptions and modelling, which are more or less realistic for a given problem. Some models are “expensive” (strong assumptions, such as sub-Gaussian tails), others are “cheap” (simply finite variance). As is well known, the more you pay, the more you get: in other words, the most expensive models yield the most interesting bounds. Recent advances in robust statistics have investigated procedures to obtain tight bounds while keeping the cost of assumptions minimal. The present paper explores and exhibits what the limits are for obtaining tight probably approximately correct (PAC)-Bayes bounds in a robust setting for cheap models.

A first type of restriction we can make can be called "expensive models". Consider that P belongs to the class P^expensive_σ consisting of all probability distributions over R such that, if X follows distribution P, then for all λ ∈ R, E[exp(λ(X − E[X]))] ≤ exp(λ²σ²/2). This class P^expensive_σ is known as the class of subgaussian random variables with variance factor σ² (see Boucheron et al., 2013, for a nice introduction to concentration theory). Let δ ∈ (0, 1); a Chernoff bound then shows that, denoting by x̄_N the empirical mean of an iid sample x_1, ..., x_N drawn from P,

[x̄_N − σ√(2 log(1/δ))/√N , x̄_N + σ√(2 log(1/δ))/√N]   (1)

is a confidence interval at level 1 − δ for the mean.
A second type of restriction can accordingly be called "cheap models". Assume that the distribution P belongs to the class P^cheap_σ, consisting of distributions with a finite variance upper bounded by σ². Here Chebyshev's inequality straightforwardly gives us a confidence interval. Let δ ∈ (0, 1); then

[x̄_N − σ/√(Nδ) , x̄_N + σ/√(Nδ)]   (2)

is a confidence interval at level 1 − δ for the mean.
In that case, there is no hope of obtaining significantly tighter confidence intervals if one uses the empirical mean (as proved in Catoni, 2012, Proposition 6.2).
Note that the dependence on δ is fairly different in the two confidence intervals defined in (1) and (2): for fixed σ² and N, the √(2 log(1/δ)) regime (the "good lunch") is obviously much more favorable than the 1/√δ regime (the "bad lunch"). This is illustrated by Figure 1. So while it is clear that the best confidence interval requires more stringent assumptions, there have been attempts at relaxing those assumptions, or in other words, at keeping equally good lunches for a cheaper cost.
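To make the two regimes concrete, here is a minimal Python sketch (function names are ours, not from the paper) comparing the half-widths of the confidence intervals (1) and (2) as δ shrinks:

```python
import math

def expensive_halfwidth(sigma, N, delta):
    # Subgaussian (Chernoff) half-width: sigma * sqrt(2 log(1/delta) / N)
    return sigma * math.sqrt(2 * math.log(1 / delta) / N)

def cheap_halfwidth(sigma, N, delta):
    # Chebyshev half-width: sigma / sqrt(N * delta)
    return sigma / math.sqrt(N * delta)

sigma, N = 1.0, 1000
for delta in (0.1, 0.01, 0.001):
    e = expensive_halfwidth(sigma, N, delta)
    c = cheap_halfwidth(sigma, N, delta)
    print(f"delta={delta}: expensive={e:.4f}, cheap={c:.4f}, ratio={c/e:.1f}")
```

The cheap interval blows up polynomially in 1/δ while the expensive one grows only logarithmically, which is exactly the gap plotted in Figure 1.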
Organisation of the paper. We provide an overview of recent advances in robust statistics (Section 2), and briefly introduce our notation (Section 3) for PAC-Bayes learning (Section 4). We then propose in Section 5 a detailed study of the structural limits which do not allow for PAC-Bayes bounds that are simultaneously tight and cheap. The paper closes with concluding remarks in Section 6.

Robust statistics
Robust statistics addresses the following question: is it possible to obtain a good lunch with just a cheap model? In the mean estimation case sketched in Section 1, the question becomes: in the situation where P ∈ P^cheap_σ, can we build a confidence interval at level 1 − δ with a size proportional to (σ/√N)√(2 log(1/δ))?
As mentioned above, there is no hope of achieving this goal with the empirical mean. Different alternative estimators have thus been considered in robust statistics, such as M-estimators (Catoni, 2012) or median-of-means (MoM) estimators (see Lerasle, 2019, for a recent survey, and references therein).
The key idea of MoM estimators is to achieve a compromise between the unbiased but non-robust empirical mean and the biased but robust median. As before, let us consider a sample of N real numbers x_1, ..., x_N, assumed to be an iid sequence drawn from a distribution P. Let K ≤ N be a positive integer and assume for simplicity that K is a divisor of N. To compute the MoM estimator, the first step consists in dividing the sample (x_1, ..., x_N) into K distinct blocks B_1, ..., B_K, each of length N/K. For each block we then compute the empirical mean x̄_{B_k} = (K/N) Σ_{i ∈ B_k} x_i. The MoM estimator is defined as the median of these means: MoM_K = median(x̄_{B_1}, ..., x̄_{B_K}). This estimator has the following nice property: combining Chebyshev's inequality on each block with Hoeffding's inequality on the median step, one gets that if P has mean m and variance at most σ², then

|MoM_K − m| ≤ 2σ√(K/N)   (3)

with probability at least 1 − exp(−K/8).
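The MoM construction above is easy to implement; here is a short self-contained Python sketch (the outlier-contaminated sample is purely illustrative):

```python
import random
import statistics

def mom(xs, K):
    """Median-of-means: split the sample into K equal blocks, return the median of block means."""
    N = len(xs)
    assert N % K == 0, "K must divide the sample size"
    block = N // K
    means = [sum(xs[i * block:(i + 1) * block]) / block for i in range(K)]
    return statistics.median(means)

random.seed(0)
# Mostly standard normal observations, plus a few gross outliers
xs = [random.gauss(0, 1) for _ in range(990)] + [100.0] * 10
print("empirical mean:", sum(xs) / len(xs))  # dragged towards the outliers
print("MoM (K=20):   ", mom(xs, 20))         # stays close to the true mean 0
```

Only the blocks containing outliers are corrupted, and the median discards them; in general the sample should be shuffled before blocking, but for iid data the order is already exchangeable.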
This property is quite encouraging: for a cheap model we obtain a confidence interval similar, up to a numerical constant, to the best one (1) in Section 1 (indeed, with δ = exp(−K/8), the bound reads 2σ√(8 log(1/δ)/N)). However, we also spot an important limitation here.
The confidence interval (3) for MoM is only valid for the particular error threshold δ = exp(−K/8), which depends on the number of blocks K (a parameter of the estimator MoM_K). The estimator must be changed each time we want to evaluate a different confidence level.
An even more limiting feature is that the error threshold δ is constrained and cannot be set arbitrarily small, as in (1) or (2). Obviously, the number of blocks cannot exceed the sample size N, so the error threshold reaches its lowest tolerable value at exp(−N/8). In other words, the interval defined in (3) can have confidence at most 1 − exp(−N/8).
Is this strong limitation specific to MoM estimators? No, say Devroye et al. (2016, Theorem 3.2 and following remark). This limitation is universal: over the class P^cheap_σ, roughly speaking, there is no estimator x̂ of the mean for which there exists a constant L > 1 such that, for every δ ∈ (0, 1), |x̂ − m| ≤ Lσ√(log(1/δ)/N) with probability at least 1 − δ; such a guarantee can only hold for error thresholds δ larger than a quantity of the form exp(−O(N)). Overall, a good and cheap lunch is possible, at the extra price that the bound is no longer valid for all confidence levels.

Notation
In the remainder of this paper, we focus on the supervised learning problem. We collect a sequence of input-output pairs (X_i, Y_i), i = 1, ..., N, in (X × Y)^N, which we assume to be N independent realisations of a random variable drawn from a distribution P on X × Y. The overarching goal in statistics and machine learning is to select a hypothesis f over a space F which, given a new input x in X, delivers an output f(x) in Y, hopefully close (in a certain sense) to the unknown true output y. The quality of f is assessed through a loss function ℓ which characterises the discrepancy between the true output y and its prediction f(x), and we define a global notion of risk

R(f) = E_{(X,Y)∼P}[ℓ(f(X), Y)].

As the expectation with respect to P is intractable, we need to resort to an estimator of the risk. The most intuitive and simple choice is the empirical risk, defined for each f ∈ F as

R_N(f) = (1/N) Σ_{i=1}^N ℓ(f(X_i), Y_i).

In the following, we consider integrals over the hypothesis space F. To keep notation as compact as possible, we will write µ[g] = ∫ g dµ if µ is a measure over F and g : F → R a µ-integrable function.
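In code, the empirical risk is a one-liner mirroring the definition of R_N; the sketch below uses an illustrative squared loss and a toy sample of our own:

```python
def empirical_risk(f, loss, data):
    """R_N(f): average of loss(f(x_i), y_i) over the sample."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Toy example (all values illustrative): squared loss and a linear hypothesis
loss = lambda yhat, y: (yhat - y) ** 2
f = lambda x: 2.0 * x
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]
print(empirical_risk(f, loss, data))
```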

PAC-Bayes
In this section, we briefly introduce the generalised Bayesian setting in machine learning, and the resulting generalisation bounds, the PAC-Bayesian bounds. PAC-Bayes is a sophisticated framework to derive new learning algorithms and obtain state-of-the-art generalisation bounds: as such, we are interested in studying how PAC-Bayes is compatible with good and cheap lunches. We refer the reader to Guedj (2019) for a recent survey on PAC-Bayes. We focus on bounds known in the PAC-Bayes literature based on the empirical risk as a risk estimator, in two settings corresponding to the "expensive" and "cheap" models introduced in Section 1.

Generalised Bayes and PAC bounds
The aim of machine learning is to find a good (in the sense of a low risk) hypothesis f ∈ F. In the generalised Bayes setting, the learning algorithm does not output a single hypothesis but rather a distribution ρ over the hypotheses space F.
The main advantage of PAC-Bayes over deterministic approaches which output single hypotheses (through optimisation of a particular criterion, model selection, etc.) is that distributions allow us to capture uncertainty on hypotheses, and to take into account correlations among possible hypotheses.
The quantity to control is then ρ[R] = ∫ R(f) dρ(f), which is an aggregated risk over the class F and represents the expected risk if the predictor f is drawn from ρ for each new prediction. The distribution ρ is usually data-dependent and is referred to as a "posterior" distribution (by analogy with Bayesian statistics). We also fix a reference measure π over F, called the "prior" (for similar reasons). We refer to Catoni (2007) and Guedj (2019) for in-depth discussions on the choice of the prior.
The generalisation bounds associated to this setting are known as "PAC-Bayesian" bounds, where PAC stands for Probably Approximately Correct. One important characteristic of PAC-Bayes bounds is that they hold true for any prior π and posterior ρ. In practice, bounds are optimised with respect to ρ. In the following, we focus on establishing bounds for any choice of π and ρ, and do not seek to optimise.

Notion of divergence
An important notion used in PAC-Bayesian theory is the divergence between two probability distributions (see for example Csiszár and Shields, 2004, for a survey on divergences). Let E be a measurable space and µ and ν two probability distributions on E. Let f be a nonnegative convex function defined on R+ such that f(1) = 0; we define the f-divergence between µ and ν by

D_f(µ, ν) = ν[f(dµ/dν)]

whenever µ is absolutely continuous with respect to ν (and D_f(µ, ν) = +∞ otherwise). Applying Jensen's inequality, we have that D_f(µ, ν) is always nonnegative and equal to zero if and only if µ = ν. The class of f-divergences includes many celebrated divergences, such as the Kullback-Leibler (KL) divergence, the reversed KL, the Hellinger distance, the total variation distance, χ²-divergences, α-divergences, etc.
A divergence can be thought of as a transport cost between two probability distributions. This interpretation will be useful for explaining PAC-Bayesian inequalities, where the divergence plays the role of a complexity term. In the following we will use just two types of divergence. The first is the Kullback-Leibler divergence, corresponding to the choice f(x) = x log x; we denote it by

KL(µ, ν) = µ[log(dµ/dν)].

The second is linked to Pearson's χ²-divergence and corresponds to the choice f(x) = x² − 1:

D2(µ, ν) = ν[(dµ/dν)²] − 1.

To illustrate the behaviour of these two divergences, consider the case where µ and ν are normal distributions on R^d, µ = N(a, I) and ν = N(0, I) (where I stands for the d × d identity matrix); we have

KL(µ, ν) = ‖a‖²/2   and   D2(µ, ν) = exp(‖a‖²) − 1.

We therefore see that the divergence D2 penalises the gap between the two distributions much more heavily than the Kullback-Leibler divergence.
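The closed forms above are easy to evaluate; here is a small Python sketch (function names are ours) computing both divergences for a Gaussian shifted by a vector a:

```python
import math

def kl_gaussian_shift(a):
    """KL( N(a, I), N(0, I) ) = ||a||^2 / 2 (closed form)."""
    return sum(ai ** 2 for ai in a) / 2

def chi2_gaussian_shift(a):
    """chi^2-type divergence D2( N(a, I), N(0, I) ) = exp(||a||^2) - 1 (closed form)."""
    return math.exp(sum(ai ** 2 for ai in a)) - 1

for t in (0.5, 1.0, 2.0, 3.0):
    a = (t,)  # shift along a single coordinate
    print(f"shift={t}: KL={kl_gaussian_shift(a):.3f}, D2={chi2_gaussian_shift(a):.3f}")
```

The quadratic-versus-exponential growth shows up immediately: already for a shift of 3, D2 is several orders of magnitude larger than KL.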

Expensive PAC-Bayesian bound
The first PAC-Bayesian bound we present is called the "expensive PAC-Bayesian bound" in the spirit of Section 1: it is obtained under a subgaussian tails assumption. More precisely, we suppose here that for every f ∈ F, the distribution of the random variable ℓ(f(X), Y) belongs to P^expensive_σ, which means that for all λ ∈ R,

E[exp(λ(ℓ(f(X), Y) − R(f)))] ≤ exp(λ²σ²/2).

In this situation we have the following bound, close to the ones obtained by Catoni (2007).

Proposition 3. Assume that for any f ∈ F, ℓ(f(X), Y) ∈ P^expensive_σ. For any prior π, posterior ρ and any δ ∈ (0, 1), the following inequality holds true with probability greater than 1 − δ:

ρ[R] ≤ ρ[R_N] + σ√(2(KL(ρ, π) + log(1/δ))/N).

Proof. The proof can be decomposed into two steps.
The first is to use the following change of measure lemma between the posterior and the prior: for any measurable function h : F → R,

ρ[h] ≤ KL(ρ, π) + log π[e^h].

Let λ be a positive number; applying this result to the function λ(R − R_N) gives

λ ρ[R − R_N] ≤ KL(ρ, π) + log π[e^{λ(R−R_N)}].

The next step is to control log π[e^{λ(R−R_N)}] in high probability. By Markov's inequality, with probability at least 1 − δ we have

π[e^{λ(R−R_N)}] ≤ E[π[e^{λ(R−R_N)}]]/δ.

By Fubini's theorem we can exchange the symbols E and π. Since R_N(f) is an average of N iid terms, the assumption ℓ(f(X), Y) ∈ P^expensive_σ yields E[e^{λ(R(f)−R_N(f))}] ≤ e^{λ²σ²/(2N)}, so that with probability greater than 1 − δ,

π[e^{λ(R−R_N)}] ≤ exp(λ²σ²/(2N))/δ.
Now, putting these results together and setting λ = √(2N(KL(ρ, π) + log(1/δ)))/σ, we obtain the desired bound.
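The trade-off in λ can be checked numerically: the two steps of the proof give a deviation term (KL(ρ, π) + log(1/δ))/λ + λσ²/(2N) for each fixed λ, and the choice of λ above recovers the deviation of Proposition 3. A minimal sketch (all numerical values illustrative):

```python
import math

def expensive_gap(lam, kl, delta, sigma, N):
    # Fixed-lambda deviation term: (KL + log(1/delta))/lam + lam*sigma^2/(2N)
    return (kl + math.log(1 / delta)) / lam + lam * sigma ** 2 / (2 * N)

def optimal_gap(kl, delta, sigma, N):
    # Value at lam* = sqrt(2N(KL + log(1/delta)))/sigma, i.e. sigma*sqrt(2(KL + log(1/delta))/N)
    return sigma * math.sqrt(2 * (kl + math.log(1 / delta)) / N)

kl, delta, sigma, N = 2.0, 0.05, 1.0, 1000
lam_star = math.sqrt(2 * N * (kl + math.log(1 / delta))) / sigma
print(expensive_gap(lam_star, kl, delta, sigma, N), optimal_gap(kl, delta, sigma, N))
```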
A PAC-Bayesian inequality is a bound which treats the complexity in the following manner:
• At first, a global complexity measure is introduced with the change of measure, and is characterised by the divergence term, measuring the price to switch from π (the reference distribution) to ρ (the posterior distribution on which all inference and prediction is based);
• Next, the stochastic assumption on the data-generating distribution is used to control π[e^{λ(R−R_N)}] with high probability.

Cheap PAC-Bayesian bound
The vast majority of works in the PAC-Bayesian literature focuses on expensive models. The main reason is that they include the situation where the loss is bounded, a common assumption in machine learning.
The case where ℓ(f(X), Y) belongs to a cheap model has attracted far less attention: recently, Alquier and Guedj (2018) obtained the following bound.
Proposition 5 (Alquier and Guedj (2018), Theorem 1). Assume that for any f ∈ F, ℓ(f(X), Y) ∈ P^cheap_σ. For any prior π, posterior ρ and any δ ∈ (0, 1), the following inequality holds true with probability greater than 1 − δ:

ρ[R] ≤ ρ[R_N] + √(σ²(D2(ρ, π) + 1)/(Nδ)).

The proof (see Alquier and Guedj, 2018) uses the same elementary ingredients as in the expensive case, replacing the Kullback-Leibler divergence by D2; the dependence in δ moves from √(2 log(1/δ)) to 1/√δ. Note the correspondence between these two bounds and the confidence intervals introduced in Section 1.
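To visualise this correspondence, here is a sketch evaluating the deviation terms of the two propositions as δ shrinks (the divergence values are illustrative placeholders, not computed from actual distributions):

```python
import math

def expensive_dev(kl, delta, sigma, N):
    # Proposition 3 deviation: sigma*sqrt(2(KL + log(1/delta))/N) -- sqrt-log dependence in delta
    return sigma * math.sqrt(2 * (kl + math.log(1 / delta)) / N)

def cheap_dev(d2, delta, sigma, N):
    # Proposition 5 deviation: sqrt(sigma^2 (D2 + 1)/(N*delta)) -- 1/sqrt(delta) dependence
    return math.sqrt(sigma ** 2 * (d2 + 1) / (N * delta))

sigma, N, kl, d2 = 1.0, 10000, 1.0, 3.0  # illustrative divergence values
for delta in (0.1, 0.01, 0.001):
    print(delta, expensive_dev(kl, delta, sigma, N), cheap_dev(d2, delta, sigma, N))
```

As in Section 1, the cheap deviation explodes as δ → 0 while the expensive one barely moves.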

A good cheap lunch: towards a robust PAC-Bayesian bound?
If we take a closer look at the aforementioned PAC-Bayesian bounds from a robust statistics viewpoint, the following question arises: can we obtain a PAC-Bayesian bound with a log(1/δ) dependence (possibly up to a numerical constant) on the confidence level in the cheap model? In this section we shed light on some structural issues. In the following, we assume the existence of σ > 0 such that for every f ∈ F, ℓ(f(X), Y) ∈ P^cheap_σ.

A necessary condition
Let R̂ be an estimator of the risk. Here is a prototype of the inequality we are looking for: for any δ ∈ (0, 1), with probability at least 1 − δ, for any posterior ρ,

|ρ[R] − ρ[R̂]| ≤ cσ√((KL(ρ, π) + log(1/δ))/N).

If we choose ρ = π = δ_{f} (Dirac mass at the single hypothesis f), the existence of such a PAC-Bayesian bound valid for all δ implies that

[R̂(f) − cσ√(log(1/δ)/N), R̂(f) + cσ√(log(1/δ)/N)]

is a confidence interval for the risk R(f) at any level 1 − δ, where c is a constant.
Thus, a necessary condition for a PAC-Bayesian bound to be valid for all error thresholds δ is to have tight confidence intervals for each f ∈ F.
However, as covered in Section 2, such estimators do not exist over the class P^cheap_σ, and the possibility of deriving tight confidence intervals is limited by the fact that the level δ must be greater than a positive constant of the form e^{−O(N)}.

A δ-dependent PAC-Bayesian bound?
As a consequence, there is simply no hope for a robust PAC-Bayesian bound valid for any error threshold δ, for essentially the same reason which prevents it in the mean estimation case. The question we address now is the possibility of obtaining a robust PAC-Bayesian bound, with a dependence of magnitude √(2 log(1/δ)) (possibly up to a constant), with a possible limitation on the error threshold δ. In the following we assume that we have a risk estimator R̂ and an error threshold δ > 0 such that there exists a constant C > 0 such that for all f ∈ F, with probability at least 1 − δ,

|R̂(f) − R(f)| ≤ Cσ√(log(1/δ)/N).

MoM is an example of such an estimator. Let us stress that δ is fixed and cannot be used as a free parameter.
As seen above, a PAC-Bayesian bound proof proceeds in two steps:
• First, we use a convexity argument to control the target quantity ρ[R − R̂] by an upper bound involving a divergence term and a term of the form g⁻¹(π[g(R − R̂)]), where g is a nonnegative, increasing and convex function;
• Second, we control the term π[g(R − R̂)] in high probability, using Markov's inequality.
The first step does not require any use of a stochastic model on the data, and is always valid, regardless of whether we have a cheap or an expensive model. The second step uses the model and introduces the dependence on the error rate δ on the right-hand side of the bound: g⁻¹(1/δ). In the case of the "expensive bound", we had g = exp and the dependence was log(1/δ); the final rate in √(log(1/δ)/N) was obtained by choosing a relevant value for λ.
Let us follow this scheme to obtain a robust PAC-Bayesian bound. The first step gives

λ ρ[R − R̂] ≤ KL(ρ, π) + log π[e^{λ(R−R̂)}].

Our goal is now to control π[e^{λ(R−R̂)}] in high probability. Let us see why this seems impossible.

The case π = δ_{f}

Let us start with a very special case, where the prior is a Dirac mass at some hypothesis f ∈ F. Then

π[e^{λ(R−R̂)}] = e^{λ(R(f)−R̂(f))}.

Using the defining property of R̂, we can bound this quantity in the following way: with probability at least 1 − δ,

log π[e^{λ(R−R̂)}] ≤ λCσ√(log(1/δ)/N).

Another way to formulate this result is to say that there exists an event A_f with probability greater than 1 − δ such that for all ω ∈ A_f, the inequality above holds true. In this example, we can control log π[e^{λ(R−R̂)}], at the price of a maximal constraint on the choice of the posterior. Indeed, the only possible choice of ρ for the Kullback-Leibler divergence KL(ρ, π) to be finite is ρ = π = δ_{f}.

The case π = αδ_{f_1} + (1 − α)δ_{f_2}

Consider now a somewhat more sophisticated choice of prior, a mixture of two Dirac masses at two distinct hypotheses f_1 and f_2. We do not fix the mixing proportion α and allow it to move freely between 0 and 1. The goal is to control the quantity

log π[e^{λ(R−R̂)}] = log(α e^{λ(R(f_1)−R̂(f_1))} + (1 − α) e^{λ(R(f_2)−R̂(f_2))}).

More precisely, for all α ∈ (0, 1), we want to find an event A_α on which this quantity is under control. In view of the prior's structure, the only way to ensure such a control is to have A_α ⊂ A_{f_1} ∩ A_{f_2}, where A_{f_1} (resp. A_{f_2}) is the favourable event for the concentration of R̂(f_1) (resp. R̂(f_2)) around its mean.
By the union bound, we have that with probability greater than 1 − 2δ,

log π[e^{λ(R−R̂)}] ≤ λCσ√(log(1/δ)/N).

We have a double problem here. As above, if we want the final bound to be non-vacuous, we have to ensure that KL(ρ, π) is finite, which restricts the support of the posterior to be included in the set {f_1, f_2}. In addition, the probability with which we can guarantee the PAC-Bayesian bound is now 1 − 2δ...

Limitation
... which hints at the fact that this will become 1 − Kδ if the support of the prior contains K distinct hypotheses. If K ≥ 1/δ, the bound becomes vacuous.
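The degradation is mechanical; a tiny sketch (values illustrative) showing the guaranteed confidence 1 − Kδ collapsing as the prior's support grows:

```python
def union_confidence(delta, K):
    """Confidence guaranteed by the union bound over K events, each valid with probability 1 - delta."""
    return max(0.0, 1 - K * delta)

delta = 0.01
for K in (1, 10, 50, 100, 1000):
    print(K, union_confidence(delta, K))
```

With δ = 0.01 the guarantee is already void at K = 100 hypotheses, long before any infinite class is reached.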
In particular, we cannot obtain a relevant bound using this approach in the situation where the cardinality of F is infinite (which is the case in most PAC-Bayes works).
This limiting fact highlights that to derive PAC-Bayesian bounds, we cannot rely on the construction of confidence intervals for all R(f) at a fixed error threshold δ. The issue is that when we want to transfer this local property into a global one (valid for any mixture of hypotheses under the prior π), we cannot avoid a worst-case reasoning through the use of the union bound.
The established bounds in the PAC-Bayesian literature, in both cheap and expensive models, repeatedly use the fact that when we assume that ℓ(f(X), Y) belongs to the model for any f ∈ F, we make an implicit assumption on the integrability of the tail of the distribution of ℓ(f(X), Y). This argument is crucial for the second step of the PAC-Bayesian proof because, by Fubini's theorem, it allows us to convert a local property (the tail distribution of each ℓ(f(X), Y)) into a global one (the control of π[e^{λ(R−R_N)}] or π[(R − R_N)²] in high probability).

Is there yet another path for hope?
We have identified a structural limitation to deriving a tight PAC-Bayesian bound in a cheap model. We have made the case that we cannot replicate the PAC-Bayesian proof presented in Section 4. To conclude this section, we want to highlight the fact that, to the best of our knowledge, no proof of PAC-Bayesian bounds avoids these two steps (see for example the general presentation in Bégin et al., 2016).
What if we try to avoid the change of measure step, and try to control ρ[R] − ρ[R̂] directly in high probability? We remark that ρ can only be chosen with the information given by the observation of R̂(f), for f ∈ F. In particular, we cannot obtain any information on the concentration of each R̂(f) around R(f), as such knowledge requires knowing the true risk. So it seems that a direct control cannot avoid starting as a "worst-case" bound:

ρ[R] − ρ[R̂] ≤ sup_{f ∈ F} (R(f) − R̂(f)).

We then have to control sup_{f ∈ F} (R(f) − R̂(f)) in high probability (see van der Vaart and Wellner, 1996, for a general presentation of such controls, and Lerasle, 2019, for recent results in the special case where R̂ is a MoM estimator). However, the obtained bound will take the following prototypic form:

ρ[R] ≤ ρ[R̂] + Complexity(F),

where the complexity term does not depend on the distribution ρ. Thus the optimisation of the right-hand side leads to choosing ρ as the Dirac mass at arg min_{f ∈ F} R̂(f). So the overall procedure amounts to a slightly modified empirical risk minimisation (where the empirical mean is replaced with another estimator of the risk), and does not fall into the category of generalised Bayesian approaches which take into account the uncertainty on hypotheses. We would therefore lose pretty much all the strengths of PAC-Bayes.

Conclusion
The present paper contributes to a better understanding of the profound structural reasons why good cheap lunches (tight bounds under minimal assumptions) are not possible with PAC-Bayes, by walking gently through elementary examples.
From a theoretical perspective, PAC-Bayesian bounds require assumptions too strong to adapt robust statistics results (where almost good lunches can be obtained for cheap models, with the limitation that the confidence level is constrained). The second step of the proof we have shown requires transforming a local hypothesis (a control of some moments of ℓ(f(X), Y)) into a global one, valid for all mixtures of hypotheses under the prior π. As covered above, this transformation seems impossible.
To close on a more positive note after this negative result, let us stress that even if it does not seem possible to reconcile PAC-Bayes and robust statistics, we believe that recent ideas from robust statistics could be used in practical algorithms inspired by PAC-Bayes.
In particular, we leave as an avenue for future work the empirical study of PAC-Bayesian posteriors (such as the Gibbs measure defined as ρ ∝ exp(−γR̂)π for any inverse temperature γ > 0) where the risk estimator R̂ is not the empirical mean (as in most PAC-Bayes works) but rather a robust estimator, such as MoM.
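As a closing illustration, here is a hedged sketch of such a Gibbs posterior computed from a MoM risk estimate over a finite hypothesis set (the hypothesis set, loss, and all parameter values are our own toy choices, not a recommendation):

```python
import math
import random
import statistics

def mom_risk(f, data, K):
    """MoM estimate of the risk of hypothesis f under a squared loss (illustrative choice)."""
    losses = [(f(x) - y) ** 2 for x, y in data]
    block = len(losses) // K
    means = [sum(losses[i * block:(i + 1) * block]) / block for i in range(K)]
    return statistics.median(means)

def gibbs_weights(hypotheses, data, gamma, K, prior=None):
    """Gibbs posterior rho(f) proportional to exp(-gamma * R_hat(f)) * pi(f), with R_hat the MoM risk."""
    n = len(hypotheses)
    prior = prior or [1.0 / n] * n
    scores = [p * math.exp(-gamma * mom_risk(f, data, K)) for f, p in zip(hypotheses, prior)]
    Z = sum(scores)
    return [s / Z for s in scores]

random.seed(1)
# Toy regression data generated with slope 2 plus small noise
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in [random.uniform(-1, 1) for _ in range(200)]]
hypotheses = [lambda x, s=s: s * x for s in (0.0, 1.0, 2.0, 3.0)]
w = gibbs_weights(hypotheses, data, gamma=50.0, K=10)
print(w)  # mass should concentrate on the slope-2 hypothesis
```

The choice of the inverse temperature γ and the number of blocks K would of course deserve the kind of careful study we are calling for.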

Figure 1: Relative sizes of confidence intervals: the dependence on δ.