Article

Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks

1 Centre for Artificial Intelligence, Department of Computer Science, University College London, London WC1V 6LJ, UK
2 Inria Lille—Nord Europe Research Centre and Inria London, 59800 Lille, France
* Author to whom correspondence should be addressed.
Entropy 2021, 23(10), 1280; https://doi.org/10.3390/e23101280
Submission received: 22 August 2021 / Revised: 23 September 2021 / Accepted: 27 September 2021 / Published: 29 September 2021
(This article belongs to the Special Issue Approximate Bayesian Inference)

Abstract: We make two related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC–Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of partially-aggregated estimators, proving that these lead to unbiased lower-variance output and gradient estimators; (2) we reformulate a PAC–Bayesian bound for signed-output networks to derive, in combination with the above, a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. We show empirically that this leads to competitive generalisation guarantees and compares favourably to other methods for training such networks. Finally, we note that the above leads to a simpler PAC–Bayesian training scheme for sign-activation networks than previous work.

1. Introduction

The use of stochastic neural networks has become widespread in the PAC–Bayesian and Bayesian deep learning [1] literature as a way to quantify predictive uncertainty and obtain generalisation bounds. PAC–Bayesian theorems generally bound the expected loss of randomised estimators, so it has proven easier to obtain non-vacuous numerical guarantees on generalisation in such networks.
However, we observe that when training these in the PAC–Bayesian setting, the objective used is generally somewhat divorced from the bound on misclassification loss itself, often because non-differentiability leads to difficulties with direct optimisation. For example, Langford and Caruana [2], Zhou et al. [3], and Dziugaite and Roy [4] all initially train non-stochastic networks before using them as the mode of a distribution, with variance chosen, respectively, through a computationally-expensive sensitivity analysis, as a proportion of weight norms, or by optimising an objective with both a surrogate loss function and a different dependence on the Kullback–Leibler (KL) divergence from their bound.
In exploring methods to circumvent this gap, we also note that PAC–Bayesian bounds can often be straightforwardly adapted to aggregates or averages of estimators, leading directly to analytic and differentiable objective functions (for example, [5]). Unfortunately, averages over deep stochastic networks are usually intractable or, if possible, very costly (as found by [6]).
Motivated by these observations, our main contribution is to obtain a compromise by defining new and general “partially-aggregated” Monte Carlo estimators for the average output and gradients of deep stochastic networks (Section 3), with the direct optimisation of PAC–Bayesian bounds in mind. Although our main focus here is on the use of this estimator in a PAC–Bayesian application, we emphasise that the technique applies generally to stochastic networks and thus relates to other variance-reduction techniques for training them, such as the pathwise estimator (used in the context of neural networks by [7], amongst many others) or Flipout [8]; indeed, it can be used in combination with these techniques. We provide proofs (Section 4) that this application leads to lower-variance outputs than a Monte Carlo forward pass and lower-variance final-layer gradients than REINFORCE [9].
A further contribution of ours is a first application of this general estimator to non-differentiable “signed-output” networks (with a final output in $\{-1, +1\}$ and arbitrarily complex other structure; see Section 4). As well as reducing variances as stated above, a small amount of additional structure in combination with partial-aggregation enables us to extend the pathwise estimator to the other layers, which would usually require a fully differentiable network; this eases training by reducing the variance of gradient estimates.
We adapt a binary classification bound (Section 5) from Catoni [10] to these networks, yielding straightforward and directly differentiable objectives when used in combination with aggregation. Closing this gap between objectives and bounds leads to improved theoretical properties.
Further, since most of the existing PAC–Bayes bounds for neural networks have a heavy dependency on the distance from initialisation of the obtained solution, we would intuitively expect these lower variances to lead to faster convergence and tighter bounds (from finding low-error solutions nearer to the initialisation). We indeed observe this experimentally, showing that training PAC–Bayesian objectives in combination with partial aggregation leads to competitive experimental generalisation guarantees (Section 6), and improves upon naive Monte Carlo and REINFORCE.
As a useful corollary, this application also leads us to a similar but simpler PAC–Bayesian training method for sign-activation neural networks than Letarte et al. [6], which successfully aggregated networks with all sign activation functions (outputs in $\{+1, -1\}$) and a non-standard tree structure, but incurred an exponential KL divergence penalty and a heavy computational cost (so that in practice they often resorted to a Monte Carlo estimate). Further, the lower variance of our obtained estimator predictions enables us to use the Gibbs estimator directly (where we draw a single sample function for every new example), leading to a modified bound on the misclassification loss which is a factor of two tighter, without a significant performance penalty.
We discuss further and outline future work in Section 7.

2. Background

We begin here by setting out our notation and the requisite background.
Generally, we consider parameterised functions, $\{f_\theta : \mathcal{X} \to \mathcal{Y} \mid \theta \in \Theta \subseteq \mathbb{R}^N\}$, in a specific form, choosing $\mathcal{X} \subseteq \mathbb{R}^{d_0}$ and an arbitrary output space $\mathcal{Y}$, which could be for example $\{-1, +1\}$ or $\mathbb{R}$. We wish to find functions minimising the out-of-sample risk, $R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(f(x), y)$, for some loss function $\ell$, for example the 0–1 misclassification loss for classification, $\ell^{0\text{-}1}(y, y') = \mathbf{1}\{y \neq y'\}$, or the binary linear loss, $\ell^{\mathrm{lin}}(y, y') = \frac{1}{2}(1 - y y')$, with $\mathcal{Y} = \{+1, -1\}$. These must be chosen based on an i.i.d. sample $S = \{(x_i, y_i)\}_{i=1}^{m} \sim \mathcal{D}^m$ from the data distribution $\mathcal{D}$, using the surrogate of in-sample empirical risk, $R_S(f) = \frac{1}{m}\sum_{i=1}^{m}\ell(f(x_i), y_i)$. We denote the expected and empirical risks under the misclassification and linear losses, respectively, $R^{0\text{-}1}$, $R^{\mathrm{lin}}$, $R_S^{0\text{-}1}$ and $R_S^{\mathrm{lin}}$.
In this paper, we consider learning a distribution (PAC–Bayesian posterior), Q, over the parameters $\theta$. PAC–Bayesian theorems then provide bounds on the expected generalisation risk of randomised classifiers, where every prediction is made using a newly sampled function from our posterior, $f_\theta$ with $\theta \sim Q$.
We also consider averaging the above to obtain Q-aggregated prediction functions,
$$F_Q(x) := \mathbb{E}_{\theta\sim Q}\, f_\theta(x).$$
In the case of a convex loss function, Jensen’s inequality lower bounds the risk of the randomised function by that of its Q-aggregate: $\ell(F_Q(x), y) \le \mathbb{E}_{f\sim Q}\,\ell(f(x), y)$. Equality is achieved by the linear loss, a fact we will exploit to obtain an easier PAC–Bayesian optimisation objective in Section 5.

2.1. Analytic Q-Aggregates for Signed Linear Functions

Q-aggregate predictors are analytically tractable for “signed-output” functions (here the sign function and “signed” functions have outputs in $\{+1, -1\}$; the terminology “binary”, sometimes used in the literature, suggests to us too strongly an output in $\{0, 1\}$) of the form $f_w(x) = \operatorname{sign}(w \cdot x)$ under a normal distribution, $Q(w) = \mathcal{N}(\mu, I)$, as specifically considered in a PAC–Bayesian context for binary classification by [5], obtaining a differentiable objective similar to that of the SVM. Provided $x \neq 0$:
$$F_Q(x) := \mathbb{E}_{w\sim\mathcal{N}(\mu, I)} \operatorname{sign}(w \cdot x) = \operatorname{erf}\left(\frac{\mu \cdot x}{\sqrt{2}\,\|x\|}\right). \qquad (2)$$
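As a quick illustration, the following numpy snippet (ours, not from the paper’s code; all names are illustrative) checks this closed form against a naive Monte Carlo average:

```python
# Sketch: verify F_Q(x) = erf(mu.x / (sqrt(2) * ||x||)) for w ~ N(mu, I)
# against a Monte Carlo average of sign(w.x). Illustrative only.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d = 20
mu = 0.1 * rng.normal(size=d)      # mean of the weight distribution Q
x = rng.normal(size=d)             # a fixed, non-zero input

closed_form = erf(mu @ x / (np.sqrt(2) * np.linalg.norm(x)))

w = rng.normal(loc=mu, scale=1.0, size=(100_000, d))   # samples w ~ N(mu, I)
monte_carlo = np.mean(np.sign(w @ x))

print(closed_form, monte_carlo)    # agree to roughly 1e-2 at 1e5 samples
```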
In Section 4, we will consider aggregating signed-output ($f(x) \in \{+1, -1\}$) functions of a more general form.

2.2. Monte Carlo Estimators for More Complex Q-Aggregates

The framework of Q-aggregates can be extended to less tractable cases (for example, with $f_\theta$ a randomised or “Bayesian” neural network; see, e.g., [1]) through a simple and unbiased Monte Carlo approximation with i.i.d. samples $\theta_t \sim Q$:
$$F_Q(x) = \mathbb{E}_{\theta\sim Q}\, f_\theta(x) \approx \frac{1}{T}\sum_{t=1}^{T} f_{\theta_t}(x) =: \hat{F}_Q(x). \qquad (3)$$
If we go on to parameterise our posterior Q by $\phi \in \Phi \subseteq \mathbb{R}^N$ as $Q_\phi$ and wish to obtain gradients without a closed form for $F_{Q_\phi}(x) = \mathbb{E}_{\theta\sim Q_\phi} f_\theta(x)$, there are two possibilities. One is REINFORCE [9], which requires only a differentiable density function, $q_\phi(\theta)$, and estimates the left-hand side of the identity $\nabla_\phi \mathbb{E}_{\theta\sim q_\phi} f_\theta(x) = \mathbb{E}_{\theta\sim q_\phi}[f_\theta(x)\,\nabla_\phi \log q_\phi(\theta)]$ by a Monte Carlo approximation of the right-hand side.
The other is the pathwise estimator, which additionally requires that $f_\theta(x)$ be differentiable w.r.t. $\theta$, and that the probability distribution chosen has a standardisation function, $S_\phi$, which removes the $\phi$ dependence, turning a parameterised $q_\phi$ into a non-parameterised distribution p: for example, $S_{\mu,\sigma}(X) = (X - \mu)/\sigma$ transforms a general normal distribution into a standard normal. If this exists, sampling the right-hand side of $\nabla_\phi \mathbb{E}_{\theta\sim q_\phi} f_\theta(x) = \mathbb{E}_{\epsilon\sim p}\,\nabla_\phi f_{S_\phi^{-1}(\epsilon)}(x)$ generally yields lower-variance estimates than REINFORCE (see [11] for a modern survey).
The variance introduced by REINFORCE can make it difficult to train neural networks when the pathwise estimator is not available, for example when non-differentiable activation functions are used. Below we find a compromise between the analytic closed form of (2) and the above estimators, which enables us to make certain classes of network differentiable and to extend the pathwise estimator where it could not otherwise be used. Through this we are able to stably train a new class of network.
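To make the contrast concrete, here is a minimal sketch (ours; a one-dimensional toy example, not the networks of this paper) of the two gradient estimators for a differentiable $f_\theta(x) = \tanh(\theta x)$ with $q_\phi = \mathcal{N}(\phi, 1)$:

```python
# Sketch of REINFORCE vs pathwise estimates of
# d/dphi E_{theta ~ N(phi, 1)} tanh(theta * x); toy example, ours.
import numpy as np

rng = np.random.default_rng(1)
phi, x, T = 0.5, 1.3, 100_000
theta = rng.normal(loc=phi, scale=1.0, size=T)   # theta ~ q_phi

f = np.tanh(theta * x)
score = theta - phi                  # grad_phi log N(theta; phi, 1)
g_reinforce = np.mean(f * score)     # REINFORCE estimate

# pathwise: theta = phi + eps with eps ~ N(0, 1); differentiate through f
eps = theta - phi
df = (1.0 - np.tanh((phi + eps) * x) ** 2) * x
g_pathwise = np.mean(df)

print(g_reinforce, g_pathwise)       # both unbiased, so they agree closely
print(np.var(f * score), np.var(df)) # the pathwise variance is much smaller
```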

2.3. PAC–Bayesian Approach

We use PAC–Bayes in this paper to obtain generalisation guarantees and theoretically-motivated training methods. The primary bound utilised is based on the following theorem, valid for a loss $\ell$ taking values in $[0, 1]$:
Theorem 1
([10], Theorem 1.2.6). Given a probability measure P on hypothesis space $\mathcal{F}$ and $\alpha > 1$, for all Q on $\mathcal{F}$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$,
$$\mathbb{E}_{f\sim Q} R(f) \le \inf_{\lambda > 1} \Phi_{\lambda/m}^{-1}\left( \mathbb{E}_{f\sim Q} R_S(f) + \frac{\alpha}{\lambda}\,\Delta \right)$$
with $\Phi_\gamma^{-1}(t) = \frac{1 - \exp(-\gamma t)}{1 - \exp(-\gamma)}$ and $\Delta = \mathrm{KL}(Q \,\|\, P) - \log\delta + 2\log\left(\frac{\log(\alpha^2 \lambda)}{\log \alpha}\right)$.
This slightly opaque formulation (used previously by [3]) gives results essentially identical to the better-known “small-kl” PAC–Bayes bounds originated by Langford and Seeger [12] and Seeger et al. [13] when KL/m is large. It is chosen because it leads to objectives that are linear in the empirical loss and KL divergence, like
$$\mathbb{E}_{f\sim Q} R_S(f) + \frac{\mathrm{KL}(Q \,\|\, P)}{\lambda}.$$
This objective is minimised by a Gibbs distribution and is closely related to the evidence lower bound (ELBO) usually optimised by Bayesian Neural Networks [1]. Such a connection has been noted throughout the PAC–Bayesian literature; we refer the reader to [14] or [15] for a formalised treatment.
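For reference, a small helper (ours; the grid over λ and all constants are illustrative) evaluating the bound of Theorem 1 for a given empirical risk and KL might look as follows:

```python
# Sketch of evaluating the bound of Theorem 1 for a fixed lambda; names ours.
import numpy as np

def catoni_bound(emp_risk, kl, m, lam, delta=0.05, alpha=1.1):
    # Delta = KL(Q||P) - log(delta) + 2 log(log(alpha^2 lam) / log(alpha))
    Delta = kl - np.log(delta) + 2 * np.log(np.log(alpha**2 * lam) / np.log(alpha))
    t = emp_risk + (alpha / lam) * Delta
    gamma = lam / m
    # Phi_gamma^{-1}(t) = (1 - exp(-gamma t)) / (1 - exp(-gamma))
    return (1 - np.exp(-gamma * t)) / (1 - np.exp(-gamma))

# the theorem holds for all lambda simultaneously, so take the best on a grid:
lams = np.geomspace(1.5, 600_000, 200)
print(min(catoni_bound(0.08, 2500.0, 60_000, lam) for lam in lams))
```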

3. The Partial Aggregation Estimator

Here we outline our main contribution: a reformulation of Q-aggregation for neural networks leading to different, lower-variance, Monte Carlo estimators for their outputs and gradients. These estimators apply to networks with a dense final layer, and arbitrary stochastic other structure (for example convolutions, residual layers or a non-feedforward structure). Specifically, they take the form
$$f_\theta(x) = A(w \cdot \eta_{\theta_{\neg w}}(x)) \qquad (5)$$
with $\theta = \mathrm{vec}(w, \theta_{\neg w}) \in \Theta \subseteq \mathbb{R}^D$, $w \in \mathbb{R}^d$, and $\theta_{\neg w} \in \Theta_{\neg w} \subseteq \mathbb{R}^{D-d}$ the parameter set excluding $w$, for the non-final layers of the network. These non-final layers are included in $\eta_{\theta_{\neg w}} : \mathcal{X} \to \mathcal{A}^d \subseteq \mathbb{R}^d$ and the final activation is $A : \mathbb{R} \to \mathcal{Y}$. For simplicity we have used a one-dimensional output, but we note that the formulation and below derivations trivially extend to a vector-valued function with elementwise activations. We require the distribution over parameters to factorise like $Q(\theta) = Q_w(w)\, Q_{\neg w}(\theta_{\neg w})$, which is consistent with the literature.
We recover a similar functional form to that considered in Section 2.1 by rewriting the function as $A(w \cdot a)$ with $a \in \mathcal{A}^d$ the randomised hidden-layer activations. The “aggregated” activation function on the final layer, which we define as $I(a) := \int A(w \cdot a)\, dQ_w(w)$, may then be analytically tractable. For example, with $w \sim \mathcal{N}(\mu, I)$ and a sign final activation, we recall (2), where $I(a) = \operatorname{erf}\left(\frac{\mu \cdot a}{\sqrt{2}\,\|a\|}\right)$.
Using these definitions we can write the Q-aggregate in terms of the conditional distribution on the activations, $a$, which takes the form of the pushforward $\tilde{Q}_{\neg w}(a \mid x) := (\eta_{(\cdot)}(x))_{*} Q_{\neg w}$ (i.e., the distribution of $\eta_{\theta_{\neg w}}(x)$ given $x$, with $\theta_{\neg w} \sim Q_{\neg w}$). The Q-aggregate can then be stated as
$$F_Q(x) := \mathbb{E}_{\theta\sim Q}[f_\theta(x)] = \int_{\Theta_{\neg w}} \int_{\mathbb{R}^d} A(w \cdot \eta_{\theta_{\neg w}}(x))\, dQ_w(w)\, dQ_{\neg w}(\theta_{\neg w}) = \int_{\Theta_{\neg w}} I(\eta_{\theta_{\neg w}}(x))\, dQ_{\neg w}(\theta_{\neg w}) = \int_{\mathcal{A}^d} I(a)\, d\{(\eta_{(\cdot)}(x))_{*} Q_{\neg w}\}(a) =: \int_{\mathcal{A}^d} I(a)\, d\tilde{Q}_{\neg w}(a \mid x).$$
In most cases, the final integral cannot be calculated exactly or involves a large summation, so we resort to a Monte Carlo estimate, for each x drawing T samples of the randomised activations, $\{a_t\}_{t=1}^{T} \sim \tilde{Q}_{\neg w}(a \mid x)$, to obtain the “partially-aggregated” estimator
$$F_Q(x) = \int_{\mathcal{A}^d} I(a)\, d\tilde{Q}_{\neg w}(a \mid x) \approx \frac{1}{T}\sum_{t=1}^{T} I(a_t) =: \hat{F}^{\mathrm{agg}}_Q(x). \qquad (6)$$
This is quite similar to the original estimator $\hat{F}_Q$ from (3), but in fact the aggregation of the final layer may significantly reduce the variance of the outputs and also make better gradient estimates possible, as we show below.
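To fix ideas, here is a minimal numpy sketch (ours; names illustrative) of the naive estimator (3) and the partially-aggregated estimator (6) for the sign/erf case of (2), where `a_samples` stands for draws from $\tilde{Q}_{\neg w}(a \mid x)$, however they are produced:

```python
# Sketch of the naive (3) and partially-aggregated (6) estimators for a
# network A(w . eta(x)) with sign final activation and w ~ N(mu, I).
import numpy as np
from scipy.special import erf

def F_hat_naive(a_samples, mu, rng):
    """a_samples: (T, d) draws of the hidden activations; also sample w."""
    w = rng.normal(loc=mu, scale=1.0, size=a_samples.shape)
    return np.mean(np.sign(np.sum(w * a_samples, axis=1)))

def F_hat_agg(a_samples, mu):
    """Integrate the final layer out analytically via I(a) = erf(...)."""
    norms = np.linalg.norm(a_samples, axis=1)
    return np.mean(erf(a_samples @ mu / (np.sqrt(2) * norms)))
```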

3.1. Reduced Variance Estimates

Proposition 1.
Lower variance outputs: for a neural network as defined by Equation (5) and the unbiased Q-aggregation estimators defined by Equations (3) and (6),
$$\mathbb{V}_Q[\hat{F}^{\mathrm{agg}}_Q(x)] \le \mathbb{V}_Q[\hat{F}_Q(x)].$$
Proof. 
Treating $a$ as a random variable, always conditioned on $x$, and noting that both estimators have mean $F_Q(x)$, we have
$$\mathbb{V}_Q[\hat{F}_Q(x)] - \mathbb{V}_Q[\hat{F}^{\mathrm{agg}}_Q(x)] = \mathbb{E}_Q|\hat{F}_Q(x)|^2 - \mathbb{E}_Q|\hat{F}^{\mathrm{agg}}_Q(x)|^2 = \frac{1}{T}\,\mathbb{E}_{a|x}\left( \mathbb{E}_w|A(w \cdot a)|^2 - |\mathbb{E}_w A(w \cdot a)|^2 \right) = \frac{1}{T}\,\mathbb{E}_{a|x}\,\mathbb{V}_w[A(w \cdot a)] \ge 0.$$
 □
From the above we see that the aggregate outputs estimated through partial-aggregation have lower variances. Next, we consider the two unbiased gradient estimators for the distribution over final-layer weights, $w$, arising from REINFORCE (as would be used, for example, where the final layer is non-differentiable) or partial-aggregation. Assuming $Q_w$ has a density, $q_\phi(w)$, parameterised by $\phi$, these use forward samples $\{w_t, \theta_{\neg w,(t)}\}_{t=1}^{T}$, with $\eta_t := \eta_{\theta_{\neg w,(t)}}(x)$, as:
$$\hat{G}(x) := \frac{1}{T}\sum_{t=1}^{T} A(w_t \cdot \eta_t)\, \nabla_\phi \log q_\phi(w_t), \qquad \hat{G}^{\mathrm{agg}}(x) := \frac{1}{T}\sum_{t=1}^{T} \nabla_\phi I_{q_\phi}(\eta_t).$$
Proposition 2.
Lower variance gradients: under the conditions of Proposition 1 and the above definitions,
$$\mathrm{Cov}_Q[\hat{G}^{\mathrm{agg}}(x)] \preceq \mathrm{Cov}_Q[\hat{G}(x)],$$
where $A \preceq B$ means that $B - A$ is positive semi-definite. Thus, for all $u \neq 0$, $\mathbb{V}[\hat{G}^{\mathrm{agg}}(x) \cdot u] \le \mathbb{V}[\hat{G}(x) \cdot u]$.
Proof. 
Writing $v := \nabla_\phi \log q_\phi(w)$ and using the unbiasedness of the estimators,
$$\mathrm{Cov}_Q[\hat{G}(x)] - \mathrm{Cov}_Q[\hat{G}^{\mathrm{agg}}(x)] = \mathbb{E}_Q[\hat{G}(x)\,\hat{G}(x)^T] - \mathbb{E}_Q[\hat{G}^{\mathrm{agg}}(x)\,\hat{G}^{\mathrm{agg}}(x)^T] = \frac{1}{T}\,\mathbb{E}_{a|x}\left( \mathbb{E}_w[A(w \cdot a)^2\, v v^T] - \nabla_\phi I_{q_\phi}(a)\, \nabla_\phi I_{q_\phi}(a)^T \right) = \frac{1}{T}\,\mathbb{E}_{a|x}\,\mathrm{Cov}_w[A(w \cdot a)\, \nabla_\phi \log q_\phi(w)] \succeq 0,$$
where in the final line we have used that $\nabla_\phi I_{q_\phi}(a) = \nabla_\phi \mathbb{E}_w[A(w \cdot a)] = \mathbb{E}_w[A(w \cdot a)\, v]$. □

3.2. Single Hidden Layer

For clarity (and to introduce notation to be used in Section 4.2) we will briefly consider the case of a neural network with one hidden layer, $f_\theta(x) = A_2(w_2 \cdot A_1(W_1 x))$. The randomised parameters are $\theta = \mathrm{vec}(w_2, W_1)$ with $W_1 \in \mathbb{R}^{d_1 \times d_0}$ and $w_2 \in \mathbb{R}^{d_1}$, and the elementwise activations are $A_1 : \mathbb{R}^{d_1} \to \mathcal{A}_1^{d_1} \subseteq \mathbb{R}^{d_1}$ and $A_2 : \mathbb{R} \to \mathcal{Y}$. We choose the distribution $Q(\theta) =: Q_2(w_2)\, Q_1(W_1)$ to factorise over the layers. This is identical to the above formulation with $\eta_{W_1}(x) = A_1(W_1 x)$.
Sampling $a$ is straightforward if sampling $W_1$ is. Further, if the final-layer aggregate $I$ is differentiable, and so is the hidden-layer activation $A_1$, we may be able to use the lower-variance pathwise gradient estimator for gradients with respect to $Q_1$. We note that this may be possible even if the activation $A_2$ is not differentiable, as in Section 4, where we extend the pathwise estimator to a setting where we could not otherwise use it.
Computationally, we may implement the above by analytically finding the distribution of the “pre-activations” $W_1 x$ (trivial for the normal distribution) before sampling from it and passing the result through the activation. With the pathwise estimator this is known as the “local reparameterisation trick” [16], which can lead to considerable computational savings on parallel minibatches compared to direct hierarchical sampling, $a = A_1(W_1 x)$ with $W_1 \sim Q_1$. We will utilise this in all our reparameterisable dense networks, and a variation on it to save computation when using REINFORCE in Section 4.2 and Section 6.
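A minimal sketch of this trick for a single dense layer with unit-variance normal weights (ours; names illustrative):

```python
# Local reparameterisation sketch: for W1 ~ N(M1, I) elementwise, each
# pre-activation w_j . x is N(mu_j . x, ||x||^2), so sample it directly.
import numpy as np

def sample_hidden(X, M1, A1, rng):
    """X: (batch, d0) inputs; M1: (d0, d1) weight means; A1: activation."""
    pre_mean = X @ M1                                    # E[W1 x] per example
    pre_std = np.linalg.norm(X, axis=1, keepdims=True)   # sqrt(x^T I x) = ||x||
    eps = rng.normal(size=pre_mean.shape)                # one draw per unit
    return A1(pre_mean + pre_std * eps)                  # a = A1(W1 x), W1 ~ Q1
```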

4. Aggregating Signed-Output Networks

Here we consider a first practical application of the aggregation estimator to stochastic neural networks with a final dense sign-activated layer. We have seen above that this partial aggregation leads to better-behaved training objectives and lower-variance gradient estimates across arbitrary other network structure. It may also allow use of pathwise gradients for the other layers, which would not be possible otherwise due to the non-differentiability of the final layer.
Specifically, these networks take the form of Equation (5) with the final-layer activation a sign function and weights drawn from a unit-variance normal distribution, $Q_w(w) = \mathcal{N}(\mu, I)$. The aggregate $I(a)$ is given by Equation (2). Normally-distributed weights are chosen because of the simple analytic forms for the aggregate (reminiscent of the tanh activation occasionally used for neural networks) and the KL divergence (effectively an $L_2$ regularisation penalty); we note however that closed forms are available for other commonly-used distributions such as the Laplace.
Using Equations (3) and (6) with independent samples $\{(w_t, \theta_{\neg w,(t)})\}_{t=1}^{T} \sim Q$ and $\eta_t := \eta_{\theta_{\neg w,(t)}}(x)$ leads to the two unbiased estimators for the output (henceforth assuming the technical condition $P_{\eta|x}\{\eta = 0\} = 0$, which ensures the aggregate is well-defined):
$$\hat{F}_Q(x) := \frac{1}{T}\sum_{t=1}^{T} \operatorname{sign}(w_t \cdot \eta_t), \qquad \hat{F}^{\mathrm{agg}}_Q(x) := \frac{1}{T}\sum_{t=1}^{T} \operatorname{erf}\left(\frac{\mu \cdot \eta_t}{\sqrt{2}\,\|\eta_t\|}\right). \qquad (7)$$
It follows immediately from Propositions 1 and 2 that the latter and the associated gradient estimators have lower variances than the former or the REINFORCE gradient estimates (which we would otherwise have to use due to the non-differentiability of the final layer).

4.1. Lower Variance Estimates of Aggregated Sign-Output Networks

We clarify the situation with the lower variance estimates further below. In particular, we find that the reduction in variance from using the partial-aggregation estimator is controlled by the norm $\|\mu\|$, so that for small $\|\mu\|$ (as could be expected early in training) the difference can be large, while as $\|\mu\|$ grows, the guaranteed difference in variance shrinks and we could reasonably revert to the Monte Carlo (or Gibbs) estimator. Note also that as $F_Q(x) \to \pm 1$ (as expected after training), both variances disappear.
We also show that a stricter condition than Proposition 2 holds on the variances of the aggregated gradients here, and thus that the non-aggregated gradients are noisier in all cases than the aggregate.
Proposition 3.
With the definitions given by Equation (7), for all $x \in \mathbb{R}^{d_0}$, $T \in \mathbb{N}$, and Q with normally-distributed final layer,
$$\mathbb{V}_Q[\hat{F}_Q(x)] - \mathbb{V}_Q[\hat{F}^{\mathrm{agg}}_Q(x)] \ge \frac{1}{T}\left[1 - \operatorname{erf}\left(\frac{\|\mu\|}{\sqrt{2}}\right)^{2}\right] \ge 0.$$
Proof. 
Non-negativity of the difference follows directly from Proposition 1. Further, since $|\mu \cdot a| \le \|\mu\|\,\|a\|$ by Cauchy–Schwarz and $\operatorname{erf}$ is increasing,
$$\mathbb{V}_Q[\hat{F}_Q(x)] - \mathbb{V}_Q[\hat{F}^{\mathrm{agg}}_Q(x)] = \frac{1}{T}\,\mathbb{E}_{a|x}\,\mathbb{V}_w[\operatorname{sign}(w \cdot a)] = \frac{1}{T}\left[1 - \mathbb{E}_{a|x}\operatorname{erf}\left(\frac{\mu \cdot a}{\sqrt{2}\,\|a\|}\right)^{2}\right] \ge \frac{1}{T}\left[1 - \operatorname{erf}\left(\frac{\|\mu\|}{\sqrt{2}}\right)^{2}\right].$$
 □
Proposition 4.
Under the conditions of Proposition 3,
$$\mathrm{Cov}[\hat{G}(x)] \succeq \mathrm{Cov}[\hat{G}^{\mathrm{agg}}(x)] + \frac{1 - 2/\pi}{T}\, I.$$
Thus, for all $u$ with $\|u\| = 1$,
$$\mathbb{V}[\hat{G}(x) \cdot u] \ge \mathbb{V}[\hat{G}^{\mathrm{agg}}(x) \cdot u] + \frac{1 - 2/\pi}{T}.$$
Proof. 
It is straightforward to show that, writing $G := \nabla_\mu F_Q(x)$ for the common mean of the two (unbiased) estimators,
$$\hat{G}(x) := \frac{1}{T}\sum_{t=1}^{T} \operatorname{sign}(w_t \cdot \eta_t)\,(w_t - \mu), \qquad \hat{G}^{\mathrm{agg}}(x) := \frac{1}{T}\sum_{t=1}^{T} \frac{\eta_t}{\|\eta_t\|}\sqrt{\frac{2}{\pi}}\exp\left(-\frac{1}{2}\left(\frac{\mu \cdot \eta_t}{\|\eta_t\|}\right)^{2}\right),$$
$$\mathrm{Cov}[\hat{G}(x)] = \frac{1}{T}\left( I - G G^T \right), \qquad \mathrm{Cov}[\hat{G}^{\mathrm{agg}}(x)] = \frac{1}{T}\left( \mathbb{E}\left[\frac{\eta \eta^T}{\|\eta\|^2}\,\frac{2}{\pi}\, e^{-\left(\frac{\mu \cdot \eta}{\|\eta\|}\right)^{2}}\right] - G G^T \right),$$
so for $u \neq 0$,
$$T\, u^T\left(\mathrm{Cov}[\hat{G}(x)] - \mathrm{Cov}[\hat{G}^{\mathrm{agg}}(x)]\right) u = \|u\|^2 - \frac{2}{\pi}\,\mathbb{E}\left[\frac{|u \cdot \eta|^2}{\|\eta\|^2}\, e^{-\left(\frac{\mu \cdot \eta}{\|\eta\|}\right)^{2}}\right] \ge \|u\|^2\left(1 - \frac{2}{\pi}\right) > 0.$$
Above we have brought $u$ inside the term with an expectation, which is then bounded using Cauchy–Schwarz, $|u \cdot \eta|/\|\eta\| \le \|u\|$, and $e^{-|t|} \le 1$ for all $t \in \mathbb{R}$. □

4.2. All Sign Activations

Here we examine an important special case previously examined by Letarte et al. [6]: a feed-forward network with all sign activations and normal weights. This takes the form
$$f_\theta(x) = \operatorname{sign}(w_L \cdot \operatorname{sign}(W_{L-1} \cdots \operatorname{sign}(W_1 x)))$$
with $\theta := \mathrm{vec}(w_L, \dots, W_1)$ and $W_l := [w_{l,1} \cdots w_{l,d_l}]^T$, where $l \in \{1, \dots, L\}$ indexes the layers. We choose unit-variance normal distributions on the weights, which factorise into $Q_l(W_l) = \prod_{i=1}^{d_l} q_{l,i}(w_{l,i})$ with $q_{l,i} = \mathcal{N}(\mu_{l,i}, I_{d_{l-1}})$. In the notation of Section 3, $\eta_{\theta_{\neg w}}(x) = \operatorname{sign}(W_{L-1} \cdots \operatorname{sign}(W_1 x))$ is the final-layer activation, which could easily be sampled by mapping x through the first $L-1$ layers with draws from the weight distribution.
Instead, we go on to make an iterative replacement of the weight distributions on each layer by conditionals on the layer activations to obtain the summation
$$F_Q(x) = \sum_{a_1 \in \{-1,+1\}^{d_1}} \tilde{Q}_1(a_1 \mid x) \times \cdots \times \sum_{a_{L-1} \in \{-1,+1\}^{d_{L-1}}} \tilde{Q}_{L-1}(a_{L-1} \mid a_{L-2})\, \operatorname{erf}\left(\frac{\mu_L \cdot a_{L-1}}{\sqrt{2}\,\|a_{L-1}\|}\right). \qquad (9)$$
The number of terms is exponential in the depth, so we instead hierarchically sample the $a_l$. Like local reparameterisation, this leads to a considerable computational saving over sampling a separate weight matrix for every input. The conditionals can be found in closed form: we can factorise into individual hidden units, $\tilde{Q}_l(a_l \mid a_{l-1}) := \prod_{i=1}^{d_l} \tilde{q}_{l,i}(a_{l,i} \mid a_{l-1})$, and find their activation distributions (with $a_0 := x$ and $z$ a dummy variable):
$$\tilde{q}_{l,i}(a_{l,i} = \pm 1 \mid a_{l-1}) = \int_0^\infty \mathcal{N}(z;\, \pm\mu_{l,i} \cdot a_{l-1},\, \|a_{l-1}\|^2)\, dz = \frac{1}{2}\left[1 \pm \operatorname{erf}\left(\frac{\mu_{l,i} \cdot a_{l-1}}{\sqrt{2}\,\|a_{l-1}\|}\right)\right].$$
A marginalised REINFORCE-style gradient estimator for the conditional distributions can then be used; this does not necessarily have better statistical properties, but in combination with the above it is much more computationally efficient. This idea of “conditional sampling” is a REINFORCE analogue of the local reparameterisation trick. Using samples $\{(a_1^t, \dots, a_{L-1}^t)\}_{t=1}^{T} \sim \tilde{Q}$,
$$\frac{\partial F_Q(x)}{\partial \mu_{l,i}} \approx \frac{1}{T}\sum_{t=1}^{T} \operatorname{erf}\left(\frac{\mu_L \cdot a_{L-1}^t}{\sqrt{2}\,\|a_{L-1}^t\|}\right) \frac{\partial}{\partial \mu_{l,i}} \log \tilde{q}_{l,i}(a_{l,i}^t \mid a_{l-1}^t). \qquad (10)$$
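A sketch of this hierarchical “conditional sampling” forward pass (ours; `mus` holds the mean matrices $M_l$ with rows $\mu_{l,i}$, and gradients would be taken through the log-conditionals as in (10)):

```python
# Sketch of hierarchical sampling through an all-sign network using the
# closed-form conditionals, with the final layer aggregated by erf.
import numpy as np
from scipy.special import erf

def sample_forward(x, mus, rng):
    a = x
    for M in mus[:-1]:   # hidden layers l = 1, ..., L-1
        p_plus = 0.5 * (1 + erf(M @ a / (np.sqrt(2) * np.linalg.norm(a))))
        a = np.where(rng.uniform(size=p_plus.shape) < p_plus, 1.0, -1.0)
    mu_L = mus[-1]       # final layer: partially-aggregated output
    return erf(mu_L @ a / (np.sqrt(2) * np.linalg.norm(a)))
```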
This formulation, along with Equation (9), resembles the PBGNet model of [6], but is derived in a very different way. Indeed, both are equivalent in the single-hidden-layer case, but with more layers PBGNet uses an unusual tree-structured network to make the individual activations independent and avoid an exponential computational dependency on the depth in Equation (9). This makes the above summation exactly calculable, but it is still not efficient enough in practice, so they resort further to a Monte Carlo approximation: informally, this draws new samples for every layer l based on an average of those from the previous layer, $a_l \mid \{a_{l-1}^{(t)}\}_{t=1}^{T} \sim \frac{1}{T}\sum_{t=1}^{T} \tilde{Q}(a_l \mid a_{l-1}^{(t)})$.
This is all justified within the tree-structured framework but leads to an exponential KL penalty which—as hinted by Letarte et al. [6] and shown empirically in Section 6—makes PAC–Bayes bound optimisation strongly favour shallower such networks. Our formulation avoids this, is more general—applying to alternative network structures—and we believe it is significantly easier to understand and use in practice.

5. PAC–Bayesian Objectives with Signed-Outputs

We now move to obtain binary classifiers with guarantees on the expected misclassification error, $R^{0\text{-}1}$, which we do by optimising PAC–Bayesian bounds. Such bounds (as in Theorem 1) will usually involve the non-differentiable and non-convex misclassification loss $\ell^{0\text{-}1}$; to train a neural network, this would normally be replaced by a differentiable surrogate, as discussed in the introduction.
Here we adopt a different approach by using our signed-output networks, where since $f(x) \in \{+1, -1\}$, there is an exact equivalence between the linear and misclassification losses, $\ell^{0\text{-}1}(f(x), y) = \ell^{\mathrm{lin}}(f(x), y)$, avoiding an extra factor of two from the inequality $\ell^{0\text{-}1} \le 2\ell^{\mathrm{lin}}$.
Although we have only moved the non-differentiability into f, the form of a PAC–Bayesian bound and the linearity of the loss and expectation allow us to go further and aggregate,
$$\mathbb{E}_{f\sim Q}\,\ell^{0\text{-}1}(f(x), y) = \mathbb{E}_{f\sim Q}\,\ell^{\mathrm{lin}}(f(x), y) = \ell^{\mathrm{lin}}(F_Q(x), y), \qquad (11)$$
which allows us to use the tools discussed in Section 4 to obtain lower-variance estimates and gradients. Below we prove a small result to show the utility of this:
Proposition 5.
Under the conditions of Proposition 3 and with $y \in \{+1, -1\}$,
$$\mathbb{V}_Q[\ell^{\mathrm{lin}}(\hat{F}^{\mathrm{agg}}_Q(x), y)] \le \mathbb{V}_Q[\ell^{\mathrm{lin}}(\hat{F}_Q(x), y)] \le \mathbb{V}_{f\sim Q}[\ell^{0\text{-}1}(f(x), y)] = \frac{1}{4}\left(1 - |F_Q(x)|^2\right).$$
Proof. 
$$\mathbb{V}_Q[\ell^{\mathrm{lin}}(\hat{F}_Q(x), y)] = \mathbb{E}_Q\left[\left(\frac{1}{2}\left(y F_Q(x) - y \hat{F}_Q(x)\right)\right)^{2}\right] = \frac{1}{4}\,\mathbb{V}_Q[\hat{F}_Q(x)],$$
and a similar result holds for $\hat{F}^{\mathrm{agg}}_Q$. Note that $f = \hat{F}_Q$ if $T = 1$, in which case $\ell^{\mathrm{lin}}(f(x), y) = \ell^{0\text{-}1}(f(x), y)$. The result then follows from this and Proposition 3. □
Combining (11) with Theorem 1, we obtain a directly optimisable, differentiable bound on the misclassification loss without introducing the above-mentioned factor of two.
Theorem 2.
Given P on $\Theta$ and $\alpha > 1$, for all Q on $\Theta$ and all $\lambda > 1$ simultaneously, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$,
$$\mathbb{E}_{\theta\sim Q}\, R^{0\text{-}1}(f_\theta) \le \Phi_{\lambda/m}^{-1}\left( R_S^{\mathrm{lin}}(F_Q) + \frac{\alpha}{\lambda}\,\Delta \right)$$
with $\Phi_\gamma^{-1}(t) = \frac{1 - \exp(-\gamma t)}{1 - \exp(-\gamma)}$, $f_\theta : \mathbb{R}^d \to \{+1, -1\}$ for $\theta \in \Theta$, and $\Delta = \mathrm{KL}(Q \,\|\, P) - \log\delta + 2\log\left(\frac{\log(\alpha^2 \lambda)}{\log \alpha}\right)$.
Thus, for each $\lambda$, which can be held fixed (“fix-λ”) or simultaneously optimised throughout training for automatic regularisation tuning (“optim-λ”), we obtain a gradient descent objective:
$$R_S^{\mathrm{lin}}(\hat{F}_Q) + \frac{\mathrm{KL}(Q \,\|\, P)}{\lambda}. \qquad (12)$$
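On a minibatch this objective is straightforward to compute from the aggregated outputs; a minimal sketch (ours; for the unit-variance normal posteriors and priors used here, the KL term reduces to half the squared distance between the mean vectors):

```python
# Sketch of the fix-lambda objective (12) on a minibatch; F_hat are
# (partially-aggregated) outputs in [-1, 1], y are labels in {-1, +1}.
import numpy as np

def pac_bayes_objective(F_hat, y, mu_q, mu_p, lam):
    linear_loss = 0.5 * (1.0 - y * F_hat)      # l_lin(F_Q(x), y)
    kl = 0.5 * np.sum((mu_q - mu_p) ** 2)      # KL(N(mu_q, I) || N(mu_p, I))
    return np.mean(linear_loss) + kl / lam     # R_S^lin + KL(Q||P) / lambda
```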

6. Experiments

All experiments (Table 1) run on “binary”-MNIST, dividing MNIST into two classes, of digits 0–4 and 5–9. Neural networks had three hidden layers with 100 units per layer and sign, sigmoid (sgmd) or relu activations, before a single-unit final layer with sign activation. Q was chosen as an isotropic, unit-variance normal distribution with initial means drawn from a truncated normal distribution of variance 0.05 . The data-free prior P was fixed equal to the initial Q, as motivated by Dziugaite and Roy [4] (Section 5 and Appendix B).
The objectives fix-λ and optim-λ from Section 5 were used for batch-size-256 gradient descent with Adam [17] for 200 epochs. Every five epochs, the bound (for a minimising λ) was evaluated using the entire training set; the learning rate was then halved if the bound had not improved over the previous two evaluations. The best hyperparameters were selected using the best bound achieved in these evaluations through a grid search over initial learning rates $\{0.1, 0.01, 0.001\}$ and sample sizes $T \in \{1, 10, 50, 100\}$. Once these were selected, training was repeated 10 times to obtain the values in Table 1.
λ in optim-λ was optimised through Theorem 2 on alternate mini-batches with SGD and a fixed learning rate of $10^{-4}$ (whilst still using the objective (12) for the network parameters, to avoid the varying λ effectively rescaling the learning rate with respect to the empirical loss). After preliminary experiments with fix-λ, we set $\lambda = m = 60{,}000$, the training set size, as is common in Bayesian deep learning.
We also report the values of three baselines: reinforce, which uses the fix-λ objective without partial-aggregation, forcing the use of REINFORCE gradients everywhere; mlp, an unregularised non-stochastic relu neural network with tanh output activation; and the PBGNet model (pbg) from Letarte et al. [6]. For the latter, a misclassification error bound obtained through $\ell^{0\text{-}1} \le 2\ell^{\mathrm{lin}}$ must be used, as their test predictions were made through the sign of a prediction function taking values in $[-1, +1]$, not $\{+1, -1\}$. Further, despite significant additional hyperparameter exploration, we were unable to train a three-layer network through the PBGNet algorithm directly comparable to our method, likely because of the exponential KL penalty (in their Equation 17) within that framework; to enable comparison, we therefore allowed the number of hidden layers in this scenario to vary in $\{1, 2, 3\}$. Other baseline tuning and setup was similar to the above; see Appendix A for more details.
During evaluation, reinforce draws a new set of weights for every test example, equivalent to the evaluation of the other models; but doing so during training, with multiple parallel samples, is prohibitively expensive. Two different approaches to straightforward (not partially-aggregated) gradient estimation for this case suggest themselves, arising from different approximations to the Q-expected loss of the minibatch, $B \subseteq S$ (with data indices $\mathcal{B}$). From the identities
$$\nabla_\phi\, \mathbb{E}_{\theta\sim q_\phi} R_B(f_\theta) = \mathbb{E}_{\theta\sim q_\phi}\left[ \frac{1}{|B|}\sum_{i\in\mathcal{B}} \ell(f_\theta(x_i), y_i)\, \nabla_\phi \log q_\phi(\theta) \right] = \frac{1}{|B|}\sum_{i\in\mathcal{B}} \mathbb{E}_{\theta\sim q_\phi}\left[ \ell(f_\theta(x_i), y_i)\, \nabla_\phi \log q_\phi(\theta) \right],$$
we obtain two slightly different estimators for $\nabla_\phi\, \mathbb{E}_{\theta\sim q_\phi} R_B(f_\theta)$:
$$\frac{1}{T|B|}\sum_{t=1}^{T}\sum_{i\in\mathcal{B}} \ell(f_{\theta^{(t,i)}}(x_i), y_i)\, \nabla_\phi \log q_\phi(\theta^{(t,i)}) \quad\text{and}\quad \frac{1}{T|B|}\sum_{i\in\mathcal{B}}\sum_{t=1}^{T} \ell(f_{\theta^{t}}(x_i), y_i)\, \nabla_\phi \log q_\phi(\theta^{t}).$$
The first draws many more samples and has lower variance, but is much slower computationally; even aside from the $O(|B|)$ increase in computation, there is a slowdown because the optimised BLAS matrix routines cannot be used, and the very large tensors involved may not fit in memory (see [16] for more information).
Therefore, as is standard in the Bayesian neural network literature with the pathwise estimator, we use the latter formulation, which has a computational complexity similar to local reparameterisation and our marginalised REINFORCE estimator (10). We note, though, that in preliminary experiments the alternative estimator did not appear to lead to improved results. This clarifies the advantage of marginalised sampling, which can lead to lower variance at a similar computational cost.
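In pseudo-numpy, the shared-sample estimator can be sketched as follows (ours; `loss_fn`, `sample_theta` and `score_fn` are placeholder callables for the per-example losses, posterior sampling, and $\nabla_\phi \log q_\phi$):

```python
# Sketch of the cheaper shared-sample REINFORCE minibatch gradient:
# a single theta_t is reused across the whole minibatch (X, y).
def reinforce_minibatch_grad(loss_fn, sample_theta, score_fn, X, y, T, rng):
    g = 0.0
    for _ in range(T):
        theta = sample_theta(rng)              # one draw shared by the batch
        batch_loss = loss_fn(theta, X, y).mean()
        g = g + batch_loss * score_fn(theta)   # (mean loss) * grad log q(theta)
    return g / T
```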

7. Discussion

The experiments demonstrate that partial-aggregation enables the training of multi-layer non-differentiable neural networks in a PAC–Bayesian context, which is not possible with REINFORCE gradients or a multiple-hidden-layer PBGNet [6]: these obtained only vacuous bounds. Our misclassification bounds also improve upon those of a single-hidden-layer PBGNet.
Our experiments raise a couple of questions: firstly, why is it that lower variance estimates empirically lead to tighter bounds? We speculate that the faster convergence of SGD in this case takes us to a more “local” minimum of the objective, closer to our initialisation. Since most existing PAC–Bayes bounds for neural networks have a very strong dependence on this distance from initialisation through the KL term, this leads to tighter bounds. This distance could also be reduced through other methods we consider out-of-scope, such as the data-dependent bounds employed by Dziugaite and Roy [18] and Letarte et al. [6].
A second and harder question is why the non-stochastic mlp model obtains a lower overall error. The bound optimisation is empirically quite conservative and does not necessarily lead to better generalisation; understanding this gap is a key question in the theory of deep learning.
In future work we will develop significant new tools to extend partial-aggregation to multi-class classification, and to improve test prediction bounds for $\operatorname{sign}(\hat{F}_Q(x))$ with $T > 1$, as in PBGNet, which gave slightly improved predictive performance despite the inferior theoretical guarantees.

Author Contributions

Conceptualization, F.B. and B.G.; Formal analysis, F.B. and B.G.; Methodology, F.B. and B.G.; Project administration, B.G.; Software, F.B.; Writing—original draft, F.B.; Writing—review and editing, F.B. and B.G. Both authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. BG acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02. FB acknowledges the support from the Foundational Artificial Intelligence Centre for Doctoral Training at University College London.

Data Availability Statement

Not applicable

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Further Experimental Details

Appendix A.1. Aggregating Biases with the Sign Function

We used a bias term in our network layers, leading to a simple extension of the above formulation, omitted in the main text for conciseness:
$$\mathbb{E}_{w\sim\mathcal{N}(\mu, \Sigma),\; b\sim\mathcal{N}(\beta, \sigma^2)}\, \operatorname{sign}(w \cdot x + b) = \operatorname{erf}\left(\frac{\mu \cdot x + \beta}{\sqrt{2\,(x^T \Sigma x + \sigma^2)}}\right) \qquad (A1)$$
since $w \cdot x + b \sim \mathcal{N}(\mu \cdot x + \beta,\; x^T \Sigma x + \sigma^2)$ and
$$\mathbb{E}_{z\sim\mathcal{N}(\alpha, \beta^2)} \operatorname{sign} z = P(z \ge 0) - P(z < 0) = [1 - \Phi(-\alpha/\beta)] - \Phi(-\alpha/\beta) = 2\Phi(\alpha/\beta) - 1 = \operatorname{erf}\left(\frac{\alpha}{\sqrt{2}\,\beta}\right).$$
The bias and weight covariances were chosen to be diagonal with a scale of 1, which leads to some simplification of the above.
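In code, with these diagonal unit covariances, the aggregate of Equation (A1) reduces to a one-liner (a numpy transcription, ours):

```python
# Equation (A1) with Sigma = I and sigma^2 = 1:
# erf((mu.x + beta) / sqrt(2 * (||x||^2 + 1))).
import numpy as np
from scipy.special import erf

def aggregate_sign_unit(x, mu, beta):
    return erf((mu @ x + beta) / np.sqrt(2.0 * (x @ x + 1.0)))
```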

Appendix A.2. Dataset Details

We used the MNIST dataset, version 3.0.1, available online at http://yann.lecun.com/exdb/mnist/ (accessed on 4 June 2021), which contains 60,000 training examples and 10,000 test examples; these were used without any further split, and images were rescaled to lie in the interval $[0, 1]$. For the “binary”-MNIST task, the labels +1 and −1 were assigned to digits in $\{5, 6, 7, 8, 9\}$ and $\{0, 1, 2, 3, 4\}$, respectively.

Appendix A.3. Hyperparameter Search for Baselines

The baseline comparison values offered with our experiments were optimised similarly to the above; for completeness, we report everything here.
The MLP model had three hidden ReLU layers of size 100 each, trained with Adam, a learning rate in $\{0.1, 0.01, 0.001\}$ and a batch size of 256 for 100 epochs. Complete test and train evaluation was performed after every epoch, and in the absence of a bound, the model and epoch with the lowest training linear loss were selected.
For PBGNet we chose the hyperparameter values from within these ranges giving the smallest bound. Note that, unlike in [6], we do not allow the hidden size to vary in $\{10, 50, 100\}$, and we use the entire MNIST training set, as we do not need a validation set. While attempting to train a three-hidden-layer network, we also searched through the hyperparameter settings with a batch size of 64 as in the original, but after this failed, we returned to the original batch size of 256 with Adam. All experiments were performed using the code from the original paper, available at https://github.com/gletarte/dichotomize-and-generalize (accessed on 4 June 2021).
Since we were unable to train a multiple-hidden-layer network through the PBGNet algorithm, for this model only we explored different numbers of hidden layers in $\{1, 2, 3\}$.

Appendix A.4. Final Hyperparameter Settings

In Table A1 we report the hyperparameter settings used for the experiments in Table 1 after exploration. To save computation, hyperparameter settings that were not learning (defined as having a whole-train-set linear loss of > 0.45 after ten epochs) were terminated early. This was also done on the later evaluation runs, where in a few instances the fix- λ sigmoid network failed to train after ten epochs; to handle this we reset the network to obtain the main experimental results.
Table A1. Chosen hyperparameter settings and additional details for results in Table 1. Best hyperparameters were chosen by bound if available and non-vacuous, otherwise by best training linear loss through a grid search as described in Section 6 and Appendix A.3. Run times are rounded to the nearest 5 min.

|             | mlp   | pbg  | reinf. (sign) | reinf. (relu) | fix-λ (sign) | fix-λ (relu) | fix-λ (sgmd) | optim-λ (sign) | optim-λ (relu) | optim-λ (sgmd) |
| Init. LR    | 0.001 | 0.01 | 0.1    | 0.1    | 0.01 | 0.1  | 0.1  | 0.01 | 0.1  | 0.1  |
| Samples, T  | -     | 100  | 100    | 100    | 100  | 50   | 10   | 100  | 100  | 10   |
| Hid. Layers | 3     | 1    | 3      | 3      | 3    | 3    | 3    | 3    | 3    | 3    |
| Hid. Size   | 100   | 100  | 100    | 100    | 100  | 100  | 100  | 100  | 100  | 100  |
| Mean KL     | -     | 2658 | 15,020 | 13,613 | 2363 | 3571 | 3011 | 5561 | 3204 | 4000 |
| Runtime/min | 10    | 5    | 40     | 40     | 35   | 30   | 25   | 35   | 30   | 25   |
For clarity we repeat here the hyperparameter settings and search space:
  • Initial learning rate ∈ {0.1, 0.01, 0.001}.
  • Training samples T ∈ {1, 10, 50, 100}.
  • Hidden size = 100.
  • Batch size = 256.
  • Fix-λ: λ = m = 60,000.
  • Number of hidden layers = 3 for all models, except PBGNet, where it varies in {1, 2, 3}.

Appendix A.5. Implementation and Runtime

Experiments were implemented using Python and the TensorFlow library [19]. Reported approximate runtimes are for execution on a NVIDIA GeForce RTX 2080 Ti GPU.

References

  1. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight Uncertainty in Neural Network. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 1613–1622. [Google Scholar]
  2. Langford, J.; Caruana, R. (Not) Bounding the True Error. In Advances in Neural Information Processing Systems 14; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 809–816. [Google Scholar]
  3. Zhou, W.; Veitch, V.; Austern, M.; Adams, R.P.; Orbanz, P. Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  4. Dziugaite, G.K.; Roy, D.M. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2016, Sydney, NSW, Australia, 11–15 August 2017. [Google Scholar]
  5. Germain, P.; Lacasse, A.; Laviolette, F.; Marchand, M. PAC-Bayesian Learning of Linear Classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning—ICML’09, Montreal, QC, Canada, 14–18 June 2009; pp. 1–8. [Google Scholar] [CrossRef]
  6. Letarte, G.; Germain, P.; Guedj, B.; Laviolette, F. Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., dAlché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 6872–6882. [Google Scholar]
  7. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  8. Wen, Y.; Vicol, P.; Ba, J.; Tran, D.; Grosse, R. Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  9. Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef] [Green Version]
  10. Catoni, O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. IMS Lect. Notes Monogr. Ser. 2007, 56, 1–163. [Google Scholar] [CrossRef]
  11. Mohamed, S.; Rosca, M.; Figurnov, M.; Mnih, A. Monte Carlo Gradient Estimation in Machine Learning. arXiv 2019, arXiv:1906.10652. [Google Scholar]
  12. Langford, J.; Seeger, M. Bounds for Averaging Classifiers. 2001. Available online: https://www.cs.cmu.edu/~jcl/papers/averaging/averaging_tech.pdf (accessed on 4 June 2021).
  13. Seeger, M.; Langford, J.; Megiddo, N. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; pp. 290–297. [Google Scholar]
  14. Germain, P.; Bach, F.; Lacoste, A.; Lacoste-Julien, S. PAC-Bayesian Theory Meets Bayesian Inference. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 1884–1892. [Google Scholar]
  15. Knoblauch, J.; Jewson, J.; Damoulas, T. Generalized Variational Inference: Three Arguments for Deriving New Posteriors. arXiv 2019, arXiv:1904.02063. [Google Scholar]
  16. Kingma, D.P.; Salimans, T.; Welling, M. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 2575–2583. [Google Scholar]
  17. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  18. Dziugaite, G.K.; Roy, D.M. Data-dependent PAC–Bayes priors via differential privacy. In Advances in Neural Information Processing Systems 31; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 8430–8441. [Google Scholar]
  19. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. arXiv 2015, arXiv:1603.04467. [Google Scholar]
Table 1. Average (from ten runs) binary-MNIST losses and bounds (δ = 0.05) for the best epoch and optimal hyperparameter settings of various algorithms. Hyperparameters and epochs were chosen by bound if available and non-vacuous, otherwise by training linear loss. Standard deviations are reported in the (1σ) rows.

|                    | mlp  | pbg  | reinf. (sign) | reinf. (relu) | fix-λ (sign) | fix-λ (sgmd) | fix-λ (relu) | optim-λ (sign) | optim-λ (sgmd) | optim-λ (relu) |
| Train linear error | 0.78 | 8.72 | 26.0 | 18.6 | 8.77 | 7.60 | 6.35 | 6.71 | 6.47 | 5.41 |
| (1σ)               | 0.08 | 0.08 | 0.8  | 1.4  | 0.04 | 0.19 | 0.10 | 0.11 | 0.18 | 0.16 |
| Test 0–1 error     | 1.82 | 5.26 | 25.4 | 17.9 | 8.73 | 7.88 | 6.51 | 6.85 | 6.84 | 5.61 |
| (1σ)               | 0.16 | 0.18 | 1.0  | 1.5  | 0.23 | 0.30 | 0.19 | 0.27 | 0.21 | 0.20 |
| Bound 0–1 error    | -    | 40.8 | 100  | 100  | 21.7 | 18.8 | 15.5 | 22.6 | 19.3 | 16.0 |
| (1σ)               | -    | 0.2  | 0.0  | 0.0  | 0.04 | 0.17 | 0.04 | 0.03 | 0.31 | 0.05 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
