Differentiable PAC-Bayes Objectives with Partially Aggregated Neural Networks

We make three related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC-Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of \emph{partially-aggregated} estimators; (2) we show that these lead to provably lower-variance gradient estimates for non-differentiable signed-output networks; (3) we reformulate a PAC-Bayesian bound for these networks to derive a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. This bound is twice as tight as that of Letarte et al. (2019) on a similar network type. We show empirically that these innovations make training easier and lead to competitive guarantees.


Introduction
The use of stochastic neural networks has become widespread in the PAC-Bayesian and Bayesian deep learning (Blundell et al., 2015) literature as a way to quantify predictive uncertainty. PAC-Bayesian theorems generally bound the expected loss of stochastic estimators, so it has proven easier to obtain non-vacuous numerical guarantees on generalisation in such networks, beginning with Langford and Caruana (2002), picked up by Dziugaite and Roy (2017) in the modern "deep" era, and continued by others including Zhou et al. (2019).
These bounds can often be straightforwardly adapted to aggregates or averages of estimators (as in, for example, Germain et al. (2009)), but averages over deep stochastic networks are generally intractable. Letarte et al. (2019) successfully aggregated networks with sign activation functions ∈ {+1, −1} and a non-standard tree structure, but this incurred an exponential KL divergence penalty, and a heavy computational cost that meant in practice they often resorted to a Monte Carlo estimate.
Our first innovation, not specific to PAC-Bayes, is to find a compromise by defining new "partially-aggregated" Monte Carlo estimators for the average output and gradients of general stochastic networks (Section 3). In the case of non-differentiable "signed-output" networks (with a dense final layer and sign activation, the rest of the network having arbitrarily complex structure) we prove that these have lower variance than REINFORCE (Williams, 1992) and a naive Monte Carlo forward pass (Section 4). This framework additionally leads to a similar but simpler training method for sign-activation neural networks than Letarte et al. (2019).
Next, we observe that when training neural networks in the PAC-Bayesian setting, the objective used is generally somewhat divorced from the bound on misclassification loss itself, because non-differentiability leads to difficulties with direct optimisation. For example, Dziugaite and Roy (2017) use an objective with both a surrogate loss function and a different dependence on the KL from their bound, and Letarte et al. (2019) use a different loss function which leads to a bound on the misclassification loss which is a factor of two worse. This contrasts with the situation for other estimator classes, where the bound may directly lead to a differentiable objective (for example in Germain et al. (2009)).
Motivated by this, we show (Section 5) that a straightforward adaptation of a bound from Catoni (2007) to our signed-output networks, in combination with aggregation, yields straightforward and directly differentiable objectives: one that fixes a regularisation parameter, and another that tunes it automatically through the bound. This closes the previous gap between bounds and objectives in neural networks. We demonstrate that training these objectives in combination with our partial aggregation estimators leads to competitive experimental generalisation guarantees (Section 6). A brief discussion follows in Section 7.

Background
We begin here by setting out our notation and the requisite background.
The standard statistical learning perspective considers parameterised functions, { f_θ : X → Y : θ ∈ Θ }, choosing X ⊂ R^{d_0} and an arbitrary output space Y, which could be for example {−1, +1} or R. In general, we wish to find functions minimising the out-of-sample expected risk, R_ℓ(f) = E_{(x,y)∼D} ℓ(f(x), y), for some loss function ℓ, for example the 0-1 misclassification loss for classification, ℓ^{0-1}(y, y′) = 1{y ≠ y′}, or the binary linear loss, ℓ^{lin}(y, y′) = ½(1 − y y′), with Y = {+1, −1}. These must be chosen based on an i.i.d. sample S ∼ D^m, through the proxy of the empirical risk, R_S(f) = (1/m) Σ_{(x,y)∈S} ℓ(f(x), y).
We denote the expected and empirical risks under the misclassification and linear losses respectively R^{0-1}, R^{lin}, R_S^{0-1} and R_S^{lin}. In this paper we consider learning a distribution (the PAC-Bayesian posterior), Q, over the parameters θ. PAC-Bayesian theorems then provide bounds on the expected generalisation risk of stochastic classifiers, where every prediction is made using a newly sampled function from our posterior, f_θ, θ ∼ Q.
We also consider averaging the above to obtain Q-aggregated prediction functions, F_Q(x) := E_{θ∼Q} f_θ(x). In the case of a convex loss function, Jensen's inequality lower bounds the risk of the stochastic function by that of its Q-aggregate: ℓ(F_Q(x), y) ≤ E_{f∼Q} ℓ(f(x), y). This is an equality for the linear loss, which we will exploit to obtain an easier PAC-Bayesian optimisation objective in Section 5.

Analytic Q-Aggregates for Signed Linear Functions
Q-aggregate predictors are analytically tractable for "signed-output" functions¹ of the form f_w(x) = sign(w · x) under a normal distribution, Q(w) = N(µ, I), as first considered in a PAC-Bayesian context for binary classification by Germain et al. (2009), obtaining a differentiable objective similar to that of the SVM. Provided x ≠ 0,

F_Q(x) = E_{w∼N(µ,I)} sign(w · x) = erf( µ · x / (√2 ‖x‖) ).  (1)

In Section 4 we will consider aggregating signed-output (f(x) ∈ {+1, −1}) functions of a more general form.
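As a concrete illustration, the closed form (1) can be checked against a naive Monte Carlo estimate. This is a minimal sketch with made-up inputs; the erf expression follows from w · x ∼ N(µ · x, ‖x‖²) when w ∼ N(µ, I):

```python
import numpy as np
from math import erf

def aggregate_sign_linear(mu, x):
    """Closed-form Q-aggregate: E_{w~N(mu,I)} sign(w . x) = erf(mu.x / (sqrt(2) ||x||))."""
    return erf(np.dot(mu, x) / (np.sqrt(2) * np.linalg.norm(x)))

def mc_sign_linear(mu, x, T, rng):
    """Naive Monte Carlo estimate of the same aggregate."""
    w = rng.normal(loc=mu, size=(T, len(mu)))  # T samples w ~ N(mu, I)
    return np.mean(np.sign(w @ x))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])   # illustrative mean parameters
x = np.array([1.0, 0.5, -0.25])   # illustrative input
exact = aggregate_sign_linear(mu, x)
approx = mc_sign_linear(mu, x, T=200_000, rng=rng)
assert abs(exact - approx) < 0.02  # the two agree up to Monte Carlo noise
```

Note that the closed form is differentiable in µ, whereas each Monte Carlo sample sign(w · x) is not.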

Monte Carlo Estimators for More Complex Q-Aggregates
The framework of Q-aggregates can be extended to less tractable cases (for example, with f_θ a stochastic or "Bayesian" neural network; see Blundell et al., 2015) through a simple Monte Carlo approximation:

F̂_Q(x) = (1/T) Σ_{t=1}^T f_{θ_t}(x),  θ_t ∼ Q.  (2)

If we go on to parameterise our posterior Q by φ as Q_φ and wish to obtain gradients without a closed form for F_{Q_φ}, there are two possibilities. One is REINFORCE (Williams, 1992), which requires only a differentiable density function, q_φ(θ), and makes a Monte Carlo approximation to the left hand side of the identity

∇_φ E_{θ∼Q_φ} f_θ(x) = E_{θ∼Q_φ}[ f_θ(x) ∇_φ log q_φ(θ) ].

The other is the pathwise estimator (used in the context of neural networks by Kingma and Welling, 2013, amongst others), which additionally requires that f_θ(x) be differentiable w.r.t. θ, and that the probability distribution chosen has a standardisation function, S_φ, which removes the φ dependence: for example, S_{µ,σ}(X) = (X − µ)/σ for a normal distribution. If this exists, a Monte Carlo approximation to the right hand side of

∇_φ E_{θ∼Q_φ} f_θ(x) = E_ε[ ∇_φ f_{S_φ^{-1}(ε)}(x) ],  ε ∼ S_φ(Q_φ),

generally yields lower-variance estimates than REINFORCE (see Mohamed et al., 2019, for a modern survey).
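The two gradient estimators can be sketched on a toy one-dimensional Gaussian. Here f(θ) = θ² is a hypothetical choice made so that the true gradient, 2µ, is known exactly; both estimators agree in expectation, with the pathwise one typically much less noisy:

```python
import numpy as np

def reinforce_grad(f, mu, sigma, T, rng):
    """REINFORCE: grad_mu E[f(theta)] = E[f(theta) * d/dmu log q(theta)], q = N(mu, sigma^2)."""
    theta = rng.normal(mu, sigma, size=T)
    score = (theta - mu) / sigma**2           # d/dmu log N(theta; mu, sigma^2)
    return np.mean(f(theta) * score)

def pathwise_grad(df, mu, sigma, T, rng):
    """Pathwise: theta = mu + sigma * eps with eps ~ N(0,1), so grad_mu E[f] = E[f'(theta)]."""
    eps = rng.standard_normal(T)              # standardised noise, no mu dependence
    return np.mean(df(mu + sigma * eps))

rng = np.random.default_rng(1)
f = lambda t: t**2    # E[f] = mu^2 + sigma^2, so the gradient w.r.t. mu is 2*mu
df = lambda t: 2 * t
g_r = reinforce_grad(f, mu=1.5, sigma=1.0, T=100_000, rng=rng)
g_p = pathwise_grad(df, mu=1.5, sigma=1.0, T=100_000, rng=rng)
assert abs(g_r - 3.0) < 0.1 and abs(g_p - 3.0) < 0.05
```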
The variance introduced by REINFORCE can make it difficult to train neural networks when the pathwise estimator is not available, for example when non-differentiable activation functions are used. Below we find a compromise between the analytically closed form of (1) and the above estimators that enables us to make certain classes of network differentiable and to extend the pathwise estimator where it could not otherwise be used. Through this we are able to stably train a new class of network.

PAC-Bayesian Approach
We use PAC-Bayes in this paper to obtain generalisation guarantees and theoretically-motivated training methods. The primary bound utilised is based on the following theorem, valid for a loss taking values in [0, 1]:

Theorem 1. (Catoni, 2007, Theorem 1.2.6) Given P on F and α > 1, for all Q on F with probability at least 1 − δ over S ∼ D^m,

E_{f∼Q} R(f) ≤ Φ^{-1}_{λ/m} [ E_{f∼Q} R_S(f) + (KL(Q‖P) + log(1/δ))/λ ],  where Φ^{-1}_γ(x) := (1 − e^{−γx}) / (1 − e^{−γ}),

for a fixed λ > 1; the parameter α governs a union bound over the geometric grid λ ∈ {α^k : k ∈ N}, extending the statement to all such λ simultaneously at the cost of a slightly enlarged logarithmic term.

This slightly opaque formulation (used previously by Zhou et al., 2019) gives essentially identical results when KL/m is large to the better-known "small-kl" family of bounds. It is chosen because it leads to objectives that are linear in the empirical loss and KL divergence, of the form

E_{f∼Q} R_S(f) + KL(Q‖P)/λ.

This objective is minimised by a Gibbs posterior and is closely related to the evidence lower bound (ELBO) usually optimised by Bayesian neural networks (Blundell et al., 2015). Such a connection has been noted throughout the PAC-Bayesian literature; we refer the reader to Germain et al. (2016) or Knoblauch et al. (2019) for a formalised treatment.
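The bound is cheap to evaluate numerically. Below is a sketch of the fixed-λ form, using Φ^{-1}_γ(x) = (1 − e^{−γx})/(1 − e^{−γ}) with γ = λ/m (the formulation used by Zhou et al., 2019); the input values are illustrative placeholders only:

```python
import numpy as np

def catoni_bound(emp_risk, kl, m, delta, lam):
    """Catoni-style PAC-Bayes bound on the expected risk for a fixed lambda:
    Phi^{-1}_{lam/m}( emp_risk + (KL + log(1/delta)) / lam ),
    with Phi^{-1}_gamma(x) = (1 - exp(-gamma*x)) / (1 - exp(-gamma))."""
    gamma = lam / m
    x = emp_risk + (kl + np.log(1.0 / delta)) / lam
    return (1.0 - np.exp(-gamma * x)) / (1.0 - np.exp(-gamma))

# Illustrative values: small empirical risk, moderate KL, MNIST-sized training set.
b = catoni_bound(emp_risk=0.05, kl=5000.0, m=60000, delta=0.05, lam=60000.0)
assert 0.0 < b < 1.0  # a non-vacuous bound on the 0-1 risk
```

As γ → 0 the bound approaches the linear objective emp_risk + (KL + log(1/δ))/λ, which is the sense in which the formulation is "linear in the empirical loss and KL".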

Partially-Aggregated Estimators
Consider stochastic networks of the form f_θ(x) = A(w · η_{θ¬w}(x)), where A is the final activation function, w is the weight vector of the dense final layer, η_{θ¬w}(x) is the final hidden-layer activation, and θ¬w is the parameter set excluding w, for the non-final layers of the network. For simplicity we have used a one-dimensional output, but we note that the formulation and below derivations trivially extend to a vector-valued function with elementwise activations. We require the distribution over parameters to factorise like Q(θ) = Q_w(w) Q_{¬w}(θ_{¬w}), which is consistent with the literature.
We recover a similar functional form to that considered in Section 2.1 by rewriting the function as A(w · a), with a ∈ A^d the hidden-layer activations. The "aggregated" activation function on the final layer, defined as I(a) := ∫ A(w · a) dQ_w(w), may then be analytically tractable, as in (1), where with w ∼ N(µ, I) and a sign final activation, I(a) = erf( µ · a / (√2 ‖a‖) ). With these definitions we can write the Q-aggregate in terms of the conditional distribution on the activations, a, which is a push-forward measure, Q̃_{¬w}(a|x):

F_Q(x) = E_{a∼Q̃_{¬w}(·|x)} I(a).

In most cases, this final integral cannot be calculated exactly or involves a large summation, so we resort to a Monte Carlo estimate, for each x drawing T samples of the activations, {a_t}_{t=1}^T ∼ Q̃_{¬w}(a|x), to obtain the "partially-aggregated" estimator

F̂*_Q(x) = (1/T) Σ_{t=1}^T I(a_t).  (4)

This is quite similar to the original estimator from (2), but in fact the aggregation of the final layer may significantly reduce the variance of the outputs and also make better gradient estimates possible, as we will show below in the case of "signed"-output networks.
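A minimal sketch of the partially-aggregated estimator, assuming a hypothetical one-hidden-layer network with relu hidden activations and a signed output; the relu choice and all dimensions are illustrative, not the paper's experimental setup:

```python
import numpy as np
from math import erf

def hidden_activations(x, W1_mu, T, rng):
    """Sample T activations a ~ Q_{-w}(.|x), with W1 ~ N(W1_mu, I) and a relu activation."""
    W1 = rng.normal(loc=W1_mu, size=(T,) + W1_mu.shape)  # (T, d1, d0)
    return np.maximum(W1 @ x, 0.0)                       # (T, d1)

def partially_aggregated(x, W1_mu, w_mu, T, rng):
    """F-hat*_Q(x) = (1/T) sum_t I(a_t), with the final sign layer aggregated
    analytically: I(a) = erf(mu . a / (sqrt(2) ||a||))."""
    A = hidden_activations(x, W1_mu, T, rng)
    out = []
    for a in A:
        n = np.linalg.norm(a)
        # a = 0 is the measure-zero corner case excluded by the technical condition
        out.append(erf(np.dot(w_mu, a) / (np.sqrt(2) * n)) if n > 0 else 0.0)
    return np.mean(out)

rng = np.random.default_rng(2)
d0, d1 = 4, 8
W1_mu = rng.standard_normal((d1, d0))  # illustrative mean parameters
w_mu = rng.standard_normal(d1)
x = rng.standard_normal(d0)
est = partially_aggregated(x, W1_mu, w_mu, T=500, rng=rng)
assert -1.0 <= est <= 1.0
```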

Single Hidden Layer
For clarity (and to introduce notation used in Section 4.3) we will briefly consider the case of a neural network with one hidden layer, f_θ(x) = A_2(w_2 · A_1(W_1 x)). The distribution Q(θ) =: Q_2(w_2) Q_1(W_1) factorises over the two layers. This is identical to the above with η_{θ¬w}(x) = A_1(W_1 x). Sampling a is straightforward; one method involves analytically finding the distribution on the "pre-activations" W_1 x, a trivial operation for the normal distribution among others, before sampling this and passing the samples through the activation. In combination with the pathwise gradient estimator, this is known as the "local reparameterization trick" (Kingma et al., 2015), and can lead to considerable computational savings on parallel minibatches compared to direct hierarchical sampling, a = A_1(W_1 x) with W_1 ∼ Q_1. We will utilise this in all our reparameterizable dense networks, and a variation on it in Section 4.3.
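The pre-activation sampling step can be sketched as follows for a dense layer with elementwise normal weights; shapes and values are illustrative:

```python
import numpy as np

def local_reparam_sample(X, mu, sigma, rng):
    """Local reparameterisation (Kingma et al., 2015): instead of sampling a weight
    matrix W ~ N(mu, diag(sigma^2)) per input, sample the pre-activations directly.
    For a dense layer, (W x)_i has mean mu_i . x and variance sum_j sigma_ij^2 x_j^2."""
    mean = X @ mu.T                    # (batch, d1) pre-activation means
    var = (X**2) @ (sigma**2).T        # (batch, d1) pre-activation variances
    eps = rng.standard_normal(mean.shape)
    return mean + np.sqrt(var) * eps   # one independent pre-activation sample per row

rng = np.random.default_rng(3)
X = rng.standard_normal((256, 10))    # illustrative minibatch
mu = rng.standard_normal((20, 10))    # layer mean parameters
sigma = np.ones((20, 10))             # unit variances, as in the paper's setup
pre = local_reparam_sample(X, mu, sigma, rng)
assert pre.shape == (256, 20)
```

The saving comes from never materialising a separate (batch, d1, d0) weight tensor; the activation A_1 is then applied to `pre`.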

Aggregating Signed-Output Neural Networks
Here we consider multi-layer feed-forward networks with a final sign activation function and a unit-variance normal distribution for the final layer, Q_w(w) = N(µ, I), so the aggregate is given by (1). Using (2) and (4) with independent samples {(w_t, θ_{¬w,(t)})}_{t=1}^T ∼ Q and η_t := η_{θ_{¬w,(t)}}(x) leads to the two estimators²

F̂_Q(x) = (1/T) Σ_{t=1}^T sign(w_t · η_t)   and   F̂*_Q(x) = (1/T) Σ_{t=1}^T erf( µ · η_t / (√2 ‖η_t‖) ).  (5)

Below we discuss how the second estimator, using aggregation of the final layer, leads to better-behaved training objectives and lower-variance gradient estimates, enabling stable training of a previously difficult network type.

Lower Variance Estimates of Aggregated Sign-Output Networks
Proposition 1. With the definitions given in (5), for all x ∈ R^{d_0}, T ∈ N, and Q with normally-distributed final layer,

Var[F̂*_Q(x)] ≤ Var[F̂_Q(x)].

Proof. Treating η as a random variable, always conditioned on x, the law of total variance gives

T · Var[F̂_Q(x)] = E_η Var_w[ sign(w · η) | η ] + Var_η E_w[ sign(w · η) | η ] = E_η[ 1 − I(η)² ] + Var_η[ I(η) ] ≥ Var_η[ I(η) ] = T · Var[F̂*_Q(x)].

The result follows as the error function is point-wise smaller than 1 in absolute value, so 1 − I(η)² ≥ 0.
An analogous statement (Proposition 2) holds for the linear losses of the two estimators: since ℓ^{lin}(ŷ, y) = ½(1 − y ŷ) is affine in ŷ, we have Var[ℓ^{lin}(F̂_Q(x), y)] = ¼ Var[F̂_Q(x)], and a similar result for F̂*_Q(x). The result then follows from this and Proposition 1.
Note that as F_Q(x) → ±1, both variances disappear, and that the difference in variances of F̂*_Q and F̂_Q will be biggest in the regime ‖µ‖ ≪ 1.
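The variance reduction of Proposition 1 is easy to check numerically. The sketch below uses a stand-in distribution for the activations η (any distribution of η conditioned on x would do):

```python
import numpy as np
from math import erf

# Empirical check of Proposition 1: with w ~ N(mu, I), the aggregated estimator
# erf(mu.eta / (sqrt(2)||eta||)) has the same mean as sign(w.eta) but lower variance.
rng = np.random.default_rng(4)
d = 5
mu = 0.3 * rng.standard_normal(d)          # small ||mu||: regime of biggest gap
N = 200_000
eta = rng.standard_normal((N, d)) + 1.0    # stand-in samples of eta | x

w = rng.normal(loc=mu, size=(N, d))
naive = np.sign(np.sum(w * eta, axis=1))   # one sign(w.eta) sample per eta
norms = np.linalg.norm(eta, axis=1)
agg = np.vectorize(erf)(eta @ mu / (np.sqrt(2) * norms))

assert abs(naive.mean() - agg.mean()) < 0.02  # same expectation
assert agg.var() < naive.var()                # aggregation reduces variance
```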

Lower Variance Gradient Estimators
Final layer derivative estimators given through REINFORCE and aggregation are derived from the two restatements below (using ∇_µ log q_µ(w) = w − µ for q_µ = N(µ, I)):

∇_µ F_Q(x) = E_η E_w[ sign(w · η)(w − µ) ]   and   ∇_µ F_Q(x) = E_η[ ∇_µ I(η) ],  where ∇_µ I(η) = √(2/π) exp( −(µ · η)² / (2‖η‖²) ) η / ‖η‖.

² Henceforth assuming the technical condition P_{η|x}{η = 0} = 0 that allows aggregation to be well-defined.
This gives rise to the gradient estimators (using the samples from (5))

∇̂_µ F_Q(x) = (1/T) Σ_{t=1}^T sign(w_t · η_t)(w_t − µ)   and   ∇̂*_µ F_Q(x) = (1/T) Σ_{t=1}^T ∇_µ I(η_t).

Proposition 3. Under the conditions of Proposition 1 and with these definitions, both estimators are unbiased, and the variance of the aggregated estimator ∇̂*_µ F_Q(x) is element-wise no greater than that of the REINFORCE estimator ∇̂_µ F_Q(x). Proof. It is straightforward to show that E_w[ sign(w · η)(w − µ) | η ] = ∇_µ I(η). The first equality (unbiasedness) follows directly from the linearity of the expectation, and the variance comparison then follows from the law of total variance, as in Proposition 1.
If the rest of the layers are reparameterizable, we can also go on to use the pathwise estimator for their gradients in the aggregated case, which is not possible otherwise. We show experimentally that this makes training significantly easier. First, though, we consider an important special case where this is not true: a feed-forward network with all sign activations.
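For the final layer alone, the two gradient estimators of Proposition 3 can be sketched and compared as follows. The η samples here are stand-ins, and the score of N(µ, I) is w − µ:

```python
import numpy as np

# Final-layer gradient estimators for F_Q(x) w.r.t. mu (illustrative, fixed eta samples):
# REINFORCE:  (1/T) sum_t sign(w_t . eta_t) * (w_t - mu)
# Aggregated: (1/T) sum_t grad_mu erf(mu . eta_t / (sqrt(2) ||eta_t||))
def reinforce_grad(mu, eta, rng):
    w = rng.normal(loc=mu, size=eta.shape)
    return np.mean(np.sign(np.sum(w * eta, axis=1))[:, None] * (w - mu), axis=0)

def aggregated_grad(mu, eta):
    norms = np.linalg.norm(eta, axis=1)
    z = eta @ mu / (np.sqrt(2) * norms)
    # d/dmu erf(z) = (2/sqrt(pi)) exp(-z^2) * eta / (sqrt(2) ||eta||)
    coef = (2 / np.sqrt(np.pi)) * np.exp(-z**2) / (np.sqrt(2) * norms)
    return np.mean(coef[:, None] * eta, axis=0)

rng = np.random.default_rng(5)
T, d = 100_000, 3
mu = np.array([0.2, -0.1, 0.4])               # illustrative final-layer means
eta = np.sign(rng.standard_normal((T, d)))    # e.g. sign activations from earlier layers
g_r = reinforce_grad(mu, eta, rng)
g_a = aggregated_grad(mu, eta)
assert np.allclose(g_r, g_a, atol=0.02)  # same expectation; aggregated version is less noisy
```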

All Sign Activations
Choosing all sign activations and unit-variance normal distributions on the weights of each feed-forward layer gives f_θ(x) = sign(w_L · sign(W_{L−1} . . . sign(W_1 x) . . . )), with θ := vec(w_L, . . . , W_1) and W_l := [w_{l,1} . . . w_{l,d_l}]^T; l ∈ {1, ..., L} indexes layers. The distribution factorises into Q_l(W_l) = ∏_{i=1}^{d_l} q_{l,i}(w_{l,i}) with q_{l,i} = N(µ_{l,i}, I_{d_{l−1}}). In the notation of Section 3, η_{θ¬w}(x) = sign(W_{L−1} . . . sign(W_1 x) . . . ) is the final layer activation, which could easily be sampled by mapping x through the first L − 1 layers with draws from the weight distribution. Instead, we go on to make an iterative replacement of the weight distributions on each layer by conditionals on the layer activations, and hierarchically sample the a_l; like local reparameterisation, this leads to a considerable computational saving over sampling a separate weight matrix for every input. The conditionals can be found in closed form as Q̃_l(a_l | a_{l−1}) := ∏_{i=1}^{d_l} q̃_{l,i}(a_{l,i} | a_{l−1}), and (with a_0 := x)

q̃_{l,i}(a_{l,i} = ±1 | a_{l−1}) = ½ ( 1 ± erf( µ_{l,i} · a_{l−1} / (√2 ‖a_{l−1}‖) ) ).

A marginalised REINFORCE-style gradient estimator for the conditional distributions can then be used; this does not necessarily have better statistical properties, but in combination with the above it is much more computationally efficient. Using samples {(a^t_1, . . . , a^t_{L−1})}_{t=1}^T ∼ Q̃,

∇ F̂*_Q(x) = (1/T) Σ_{t=1}^T [ I(a^t_{L−1}) ∇ log Q̃(a^t_1, . . . , a^t_{L−1} | x) + ∇ I(a^t_{L−1}) ],  (7)

where the first term handles the hidden-layer parameters and the second the final layer. The above formulation somewhat resembles the PBGNet model of Letarte et al. (2019), but derived in a very different way. Both are equivalent in the single-hidden-layer case, but with more layers PBGNet uses an unusual tree-structured network for which the exact aggregate can be calculated (very expensively, though avoiding an exponential dependency on depth). Generally, though, to avoid this cost, they resort to a Monte Carlo approximation: informally, this draws new samples for every layer l based on an average of those from the previous layer.
This is all justified within the tree-structured framework but leads to an exponential KL penalty which (as hinted by Letarte et al. (2019) and shown empirically in Section 6) makes PAC-Bayes bound optimisation strongly favour shallower such networks. In addition to avoiding this practical drawback, our formulation is more general, and we claim it makes the fundamental ideas of partial aggregation and marginalised sampling significantly clearer.
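A sketch of the marginalised hierarchical sampling for all-sign networks, using the closed-form Bernoulli conditionals above; layer sizes and parameter values are illustrative:

```python
import numpy as np
from math import erf

def sample_sign_layer(a_prev, mu, rng):
    """Sample a_l ~ Q-tilde_l(.|a_{l-1}) for a sign-activation layer with rows of
    W_l ~ N(mu_i, I): P(a_{l,i} = +1 | a_{l-1}) = (1 + erf(mu_i.a / (sqrt(2)||a||))) / 2."""
    norm = np.linalg.norm(a_prev)
    p = 0.5 * (1.0 + np.array([erf(np.dot(m, a_prev) / (np.sqrt(2) * norm)) for m in mu]))
    return np.where(rng.random(len(mu)) < p, 1.0, -1.0)

def forward_all_sign(x, mus, w_mu, T, rng):
    """Partially-aggregated forward pass: hierarchically sample hidden sign
    activations, then aggregate the final layer analytically."""
    outs = []
    for _ in range(T):
        a = x
        for mu in mus:  # hidden layers, sampled from their conditionals in turn
            a = sample_sign_layer(a, mu, rng)
        outs.append(erf(np.dot(w_mu, a) / (np.sqrt(2) * np.linalg.norm(a))))
    return np.mean(outs)

rng = np.random.default_rng(6)
mus = [rng.standard_normal((8, 4)), rng.standard_normal((8, 8))]  # two hidden layers
w_mu = rng.standard_normal(8)
x = rng.standard_normal(4)
F = forward_all_sign(x, mus, w_mu, T=100, rng=rng)
assert -1.0 <= F <= 1.0
```

Since the hidden activations are ±1 vectors, ‖a‖ is never zero and the conditionals are always well-defined.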

PAC-Bayesian Objectives with Signed-Outputs
We now move to obtain binary classifiers with guarantees for the expected misclassification error, R 0−1 , which we do by optimizing PAC-Bayesian bounds. Such bounds (as in Theorem 1) will usually involve the non-differentiable and non-convex misclassification loss 0−1 . However, to train a neural network we need to replace this by a differentiable surrogate, as discussed in the introduction.
Here we adopt a different approach by using our signed-output networks, where since f (x) ∈ {+1, −1}, there is an exact equivalence between the linear and misclassification losses, 0−1 ( f (x), y) = lin ( f (x), y), avoiding the extra factor of two from the inequality 0−1 ≤ 2 lin used by Letarte et al. (2019, Section 2.1).
Although we have only moved the non-differentiability into f , the form of a PAC-Bayesian bound and the linearity of the loss and expectation allow us to go further and aggregate, which as we saw in Section 2.1 can make some such non-differentiable functions differentiable.
Combining (8) with Theorem 1, we obtain a directly optimizable, differentiable bound on the misclassification loss without introducing an extra factor of 2:

Theorem 2. Given P on θ and α > 1, for all Q on θ and λ > 1 simultaneously with probability at least 1 − δ over S ∼ D^m,

E_{θ∼Q} R^{0-1}(f_θ) ≤ Φ^{-1}_{λ/m} [ R_S^{lin}(F_Q) + (KL(Q‖P) + log(1/δ_λ))/λ ],

where the confidence budget δ_λ for each λ arises from a union bound over the geometric grid of λ values determined by α (so that Σ_λ δ_λ = δ). Thus, for each λ, which can be held fixed ("fix-λ") or simultaneously optimized throughout training for automatic regularisation tuning ("optim-λ"), we obtain a gradient descent objective:

R_S^{lin}(F_Q) + KL(Q‖P)/λ.  (9)
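The fix-λ objective is then a few lines of code. This is a sketch; `F_hat` is a placeholder for the partially-aggregated network output in [−1, 1]:

```python
import numpy as np

def fix_lambda_objective(F_hat, y, kl, lam):
    """Differentiable fix-lambda training objective (sketch):
    empirical linear loss of the aggregate plus a KL penalty,
    R_S^lin(F_Q) + KL(Q||P) / lambda, with l^lin(y_hat, y) = (1 - y * y_hat) / 2."""
    emp_lin = np.mean(0.5 * (1.0 - y * F_hat))
    return emp_lin + kl / lam

# Placeholder aggregate outputs and labels for a minibatch of 256 examples.
rng = np.random.default_rng(7)
y = np.sign(rng.standard_normal(256))
F_hat = np.clip(rng.standard_normal(256), -1.0, 1.0)
obj = fix_lambda_objective(F_hat, y, kl=1000.0, lam=60000.0)
assert obj >= 0.0
```

Because F̂ is differentiable in the posterior parameters (via aggregation), so is the whole objective; no surrogate loss is needed.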

Experiments
All experiments run on "binary"-MNIST, dividing MNIST into two classes, of digits 0-4 and 5-9. Neural networks had three hidden layers with 100 units per layer and sign, sigmoid (sgmd) or relu activations, before a single-unit final layer with sign activation. Q was chosen as an isotropic, unit-variance normal distribution with initial means drawn from a truncated normal distribution of variance 0.05. The data-free prior P was fixed equal to the initial Q, as motivated by Dziugaite and Roy (2017, Section 5 and Appendix B).
The objectives fix-λ and optim-λ from Section 5 were used for batch-size 256 gradient descent with Adam (Kingma and Ba, 2014) for 200 epochs. Every five epochs, the bound (for a minimising λ ) was  evaluated using the entire training set; the learning rate was then halved if the bound was unimproved from the previous two evaluations. The best hyperparameters were selected using the best bound achieved in these evaluations through a grid search of initial learning rates ∈ {0.1, 0.01, 0.001}, sample sizes T ∈ {1, 10, 50, 100}. Once these were selected training was repeated 10 times to obtain the values in Table 1.
λ in optim-λ was optimised through Theorem 2 on alternate mini-batches with SGD and a fixed learning rate of 10^{−4} (whilst still using the objective (9), to avoid the varying λ effectively rescaling the learning rate with respect to the empirical loss). After preliminary experiments in fix-λ, we set λ = m = 60000, the training set size, as is common in Bayesian deep learning. We also report the values of three baselines: reinforce, which uses the fix-λ objective without partial-aggregation, forcing the use of REINFORCE gradients everywhere; mlp, an unregularised non-stochastic relu neural network with tanh output activation; and the PBGNet model (pbg) from Letarte et al. (2019), with the misclassification error bound obtained through ℓ^{0-1} ≤ 2ℓ^{lin}. Despite significant additional hyperparameter exploration for the latter, we were unable to train a three-layer network directly comparable to our method through the PBGNet algorithm, likely because of the exponential KL penalty (in their equation 17) within that framework; to enable comparison, we therefore allowed the number of hidden layers in this scenario to vary ∈ {1, 2, 3}. Other baseline tuning and setup was similar to the above; see the Appendix for more details.

Discussion
The experiments demonstrate that partial-aggregation enables training of multi-layer non-differentiable neural networks in a PAC-Bayesian context: REINFORCE gradients and a multiple-hidden-layer PBGNet (Letarte et al., 2019) obtained only vacuous bounds, and our misclassification bounds improve those of a single-hidden-layer PBGNet. We note that our bound optimisation is empirically quite conservative, and the non-stochastic mlp model obtains a lower overall error; understanding this gap is one of the key questions in the theory of deep learning. Finally, we also observe that using sign(F_Q(x)) with T > 1 for test prediction, as in PBGNet, gave improved empirical results despite the inferior theoretical guarantees; we consider this an interesting avenue of future research.

A.1 Aggregating Biases with the Sign Function
We used a bias term in our network layers, leading to a simple extension of the above formulation, omitted in the main text for conciseness: a signed layer sign(w · x + b) with (w, b) ∼ N((µ, µ_b), I) is equivalent to a bias-free layer acting on the input augmented by a constant 1 coordinate, so that the aggregate becomes erf( (µ · x + µ_b) / (√2 √(‖x‖² + 1)) ). The bias and weight co-variances were chosen to be diagonal with a scale of 1, which leads to some simplification in the above.
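A sketch of the resulting aggregate; the exact form here is our reading of the augmented-input construction with unit variances (so w · x + b ∼ N(µ · x + µ_b, ‖x‖² + 1)):

```python
import numpy as np
from math import erf

def aggregate_sign_with_bias(mu_w, mu_b, x):
    """E[sign(w.x + b)] with (w, b) ~ N((mu_w, mu_b), I): equivalent to appending a
    constant 1 to the input, giving erf((mu_w.x + mu_b) / (sqrt(2) sqrt(||x||^2 + 1)))."""
    return erf((np.dot(mu_w, x) + mu_b) / (np.sqrt(2) * np.sqrt(np.dot(x, x) + 1.0)))

# Illustrative parameters and input.
x = np.array([1.0, -2.0, 0.5])
val = aggregate_sign_with_bias(np.array([0.3, 0.1, -0.2]), 0.5, x)
assert -1.0 < val < 1.0
```

Note the augmented input is never zero, so the x ≠ 0 caveat of the bias-free case disappears.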

A.2 Reinforce Model
During evaluation, the reinforce model draws a new set of weights for every test example, matching the evaluation of the other models; but doing so during training, with multiple parallel samples, is prohibitively expensive.
Two different approaches to straightforward (not partially-aggregated) gradient estimation for the baseline reinforce suggest themselves, arising from different approximations to the Q-expected loss of the minibatch B ⊆ S (with data indices B). From the identities

∇_φ E_{θ∼Q_φ} R_B(f_θ) = (1/|B|) Σ_{i∈B} E_{θ_i∼Q_φ}[ ℓ(f_{θ_i}(x_i), y_i) ∇_φ log q_φ(θ_i) ] = E_{θ∼Q_φ}[ R_B(f_θ) ∇_φ log q_φ(θ) ],

we obtain two slightly different estimators for ∇_φ E_{θ∼Q_φ} R_B(f_θ). The first draws many more samples (one per example) and has lower variance, but is much slower computationally; even aside from the O(|B|) increase in computation, there is a slowdown as the optimised BLAS matrix routines cannot be used, and the very large matrices involved may not fit in memory (see Kingma et al., 2015, for more information).
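The two estimators can be sketched on a hypothetical scalar model; this is purely illustrative, the real baseline uses the full network:

```python
import numpy as np

# Two REINFORCE-style estimators for grad_mu E_{theta~N(mu,1)} R_B(f_theta)
# on a toy scalar model f_theta(x) = sign(theta * x).
def loss(theta, x, y):
    return 0.5 * (1.0 - y * np.sign(theta * x))   # linear loss

def per_example_grad(mu, X, Y, rng):
    theta = rng.normal(mu, 1.0, size=len(X))      # a fresh theta per example
    return np.mean(loss(theta, X, Y) * (theta - mu))

def shared_sample_grad(mu, X, Y, rng):
    theta = rng.normal(mu, 1.0)                   # one theta for the whole minibatch
    return np.mean(loss(theta, X, Y)) * (theta - mu)

rng = np.random.default_rng(8)
X = rng.standard_normal(256)                      # illustrative minibatch
Y = np.sign(rng.standard_normal(256))
g1 = per_example_grad(0.5, X, Y, rng)
g2 = shared_sample_grad(0.5, X, Y, rng)
assert np.isfinite(g1) and np.isfinite(g2)        # both are unbiased for the same gradient
```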
Therefore, as is standard in the Bayesian Neural Network literature with the pathwise estimator, we use the latter formulation, which has a similar computational complexity to local-reparameterisation and our marginalised REINFORCE estimator (7). We should note though that in preliminary experiments, the alternate estimator did not appear to lead to improved results. This clarifies the advantages of marginalised sampling, which can lead to lower variance with a similar computational cost.

A.3 Dataset Details
We used the MNIST dataset version 3.0.1, available online at http://yann.lecun.com/exdb/mnist/, which contains 60000 training examples and 10000 test examples; these were used without any further split, and images were rescaled to lie in the range [0, 1]. For the "binary"-MNIST task, the labels +1 and −1 were assigned to digits in {5, 6, 7, 8, 9} and {0, 1, 2, 3, 4} respectively. For PBGNet we chose the values of hyperparameters from within these ranges giving the least bound value. Note that, unlike in the original paper, we do not allow the hidden size to vary ∈ {10, 50, 100}, and we use the entire MNIST training set as we do not need a validation set. While attempting to train a three-hidden-layer network, we also searched through the hyperparameter settings with a batch size of 64 as in the original, but after this failed, we returned to the original batch size of 256 with Adam. All experiments were performed using the code from the original paper, available at https://github.com/gletarte/dichotomize-and-generalize.
Since we were unable to train a multiple-hidden-layer network through the PBGNet algorithm, for this model only we explored different numbers of hidden layers ∈ {1, 2, 3}.

A.5 Final Hyperparameter Settings
In Table 2 we report the hyperparameter settings used for the experiments in Table 1 after exploration. To save computation, hyperparameter settings that were not learning (defined as having a whole-trainset linear loss of > 0.45 after ten epochs) were terminated early. This was also done on the later evaluation runs, where in a few instances the fix-λ sigmoid network failed to train after ten epochs; to handle this we reset the network to obtain the main experimental results.
For clarity we repeat here the hyperparameter settings and search space: • Initial Learning Rate ∈ {0.1, 0.01, 0.001}. • Training Sample Size T ∈ {1, 10, 50, 100}.

A.6 Implementation and Runtime
Experiments were implemented using Python and the TensorFlow library (Abadi et al., 2015). Reported approximate runtimes are for execution on a NVIDIA GeForce RTX 2080 Ti GPU.