Gradient Regularisation as Approximate Variational Inference

Variational inference in Bayesian neural networks is usually performed using stochastic sampling, which gives very high-variance gradients and hence slow learning. Here, we show that it is possible to obtain a deterministic approximation of the ELBO for a Bayesian neural network by taking a Taylor-series expansion around the mean of the current variational distribution. The resulting approximate ELBO is the training log-likelihood plus a squared-gradient regulariser. In addition to learning the approximate posterior variance, we also consider a uniform-variance approximate posterior, inspired by the stationary distribution of SGD. The corresponding approximate ELBO has a simple form: the log-likelihood plus a simple squared-gradient regulariser. We argue that this squared-gradient regularisation may be at the root of the excellent empirical performance of SGD.


Introduction
Neural networks are increasingly being used in safety-critical settings such as self-driving cars (Bojarski et al., 2016) and medical diagnosis (Amato et al., 2013). In these settings, it is critical to be able to reason about uncertainty in the parameters of the network, for instance so that the system is able to call for additional human input when necessary (McAllister et al., 2017). Several approaches to Bayesian inference in neural networks are available, including stochastic gradient Langevin dynamics (Welling and Teh, 2011), but here we focus on variational inference (Blundell et al., 2015).
Despite the theoretical advantages of Bayesian neural networks (e.g. their relationship with PAC-Bayesian methods that provide bounds on generalisation error; Germain et al., 2016; Rivasplata et al., 2019), they are rare amongst networks claiming to give state-of-the-art performance on tasks such as image classification. Instead, much simpler techniques such as stochastic gradient descent (SGD; Krizhevsky et al., 2012) are found to give excellent performance in practice. Indeed, SGD is also found to give better performance than adaptive optimizers such as Adam (Kingma and Ba, 2014; Keskar and Socher, 2017; Loshchilov and Hutter, 2017; Wilson et al., 2017). This is usually attributed to regularisation that is implicitly embodied in SGD (Keskar et al., 2016; Wu et al., 2017; Lei et al., 2018; Roberts, 2018). While SGD's implicit regularisation gives excellent practical performance, it becomes problematic when trying to improve the optimizer used in deep learning. In the ideal case, we would be able to separate the objective (including regularisation terms) from the optimizer, such that e.g. improvements in the optimizer's convergence rate can always be expected to improve performance. In contrast, at the moment, methods that converge faster (such as Adam; Kingma and Ba, 2014) are generally believed to have worse performance at convergence, due to the lack of the implicit regularisation embodied in SGD (Wilson et al., 2017; Keskar and Socher, 2017). Therefore, it is important to understand the implicit regularisation embodied in SGD, such that we can use that regularisation in combination with other optimizers.
Here, we start by noting that variational inference for Bayesian neural networks typically involves stochastic sampling, which can give rise to high-variance gradients. We note that we can form a deterministic approximation of the ELBO by taking a second-order Taylor expansion around the mean of the approximate posterior. This gives an approximate ELBO that is composed of the log-likelihood, standard weight-decay regularisation and a squared-gradient regulariser, which is weighted by the variance of the approximate posterior. Next, we note that the implicit regularisation effects of SGD can be understood using its isotropic Gaussian stationary distribution under locally quadratic loss functions (Mandt et al., 2017). Inspired by this stationary distribution, we consider uniform-variance approximate posteriors, which correspond to a simpler squared-gradient regulariser. Incorporating this regulariser improves performance on standard benchmark tasks, and provides a potential explanation for the excellent performance of stochastic gradient descent.

Variational inference for Bayesian neural networks
Following the usual convention (Blundell et al., 2015), we use independent Gaussian priors and approximate posteriors for all parameters, where $\mu_\lambda$ and $\sigma^2_\lambda$ are learned parameters of the approximate posterior, and where $\Sigma$ is a diagonal matrix, with $\Sigma_{\lambda\lambda} = \sigma^2_\lambda$. Please see Appendix A for more information.
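For concreteness, one way to write these distributions is sketched below. The zero prior mean and the symbol $s^2_\lambda$ for the prior variance are assumptions made here (the latter matches the fixed prior variance used in the section on connections to SGD); Appendix A remains the reference for the exact definitions.

```latex
\begin{align}
  P(w_\lambda) &= \mathcal{N}\!\left(w_\lambda;\, 0,\, s^2_\lambda\right)
    && \text{(prior; zero mean assumed)} \\
  Q(w_\lambda) &= \mathcal{N}\!\left(w_\lambda;\, \mu_\lambda,\, \sigma^2_\lambda\right)
    && \text{(approximate posterior)} \\
  Q(w) &= \prod_\lambda Q(w_\lambda) = \mathcal{N}\!\left(w;\, \mu,\, \Sigma\right),
    \qquad \Sigma_{\lambda\lambda} = \sigma^2_\lambda .
\end{align}
```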

The stationary distribution of SGD
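Solving for the steady state of SGD under a locally quadratic loss gives an isotropic Gaussian stationary distribution (Mandt et al., 2017). For the derivation, please see Appendix B.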

Deterministic approximations to variational inference
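We begin by noting that the ELBO can be rewritten in terms of the KL divergence between the approximate posterior and the prior,
$$\mathcal{L} = \mathbb{E}_{Q(w)}\left[\log P(y|x,w)\right] - \mathrm{KL}\left(Q(w)\,\|\,P(w)\right),$$
and that this KL divergence can be evaluated analytically, because both distributions are Gaussian. As such, the only term we need to approximate is the expected log-likelihood.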
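As a minimal sketch of the analytic KL term, the function below evaluates the standard closed form for two diagonal Gaussians. It assumes a zero-mean prior with per-parameter variance `s2`; the function name and the toy numbers are hypothetical and are not taken from the experiments.

```python
import math

def kl_diag_gaussians(mu, sigma2, s2):
    """KL(Q || P) for Q = N(mu, diag(sigma2)) and P = N(0, diag(s2)).

    The zero prior mean is an assumption made for this sketch.
    """
    total = 0.0
    for m, v, s in zip(mu, sigma2, s2):
        # Per-parameter closed form: 0.5 * ((v + m^2) / s - 1 + log(s / v)).
        total += 0.5 * ((v + m * m) / s - 1.0 + math.log(s / v))
    return total

# The KL vanishes when the approximate posterior equals the prior.
kl_same = kl_diag_gaussians([0.0, 0.0], [1.0, 2.0], [1.0, 2.0])

# With the variances held fixed, only the mu^2 / (2 s^2) term changes,
# which is the usual weight-decay penalty on the means.
kl_shifted = kl_diag_gaussians([1.0, -2.0], [1.0, 2.0], [1.0, 2.0])
```

When the variances are held fixed, the only part of this expression that depends on the means is $\mu^2_\lambda / (2 s^2_\lambda)$, which is why the KL term reduces to weight decay in that setting.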
To evaluate this expectation, we begin by taking a second-order Taylor series expansion of the log-likelihood around the current setting of the mean parameters, $\mu$,
$$\log P(y|x,w) \approx \log P(y|x,\mu) + g^T (w - \mu) + \tfrac{1}{2} (w - \mu)^T H (w - \mu),$$
where $g$ and $H$ are the gradient and Hessian of the log-likelihood, evaluated at $\mu$. Now we consider the expectation of each of these terms under the approximate posterior, $Q(w)$. The first term is constant and independent of $w$. The second (linear) term is zero, because the expectation of $(w - \mu)$ under the approximate posterior is zero. The third (quadratic) term is difficult to evaluate because it involves $H$, the $N \times N$ matrix of second derivatives, where $N$ is the number of parameters in the model. Nonetheless, we begin by using properties of the trace, noting that the expectation is the covariance of the approximate posterior, writing the trace in index notation, and substituting for the (diagonal) posterior covariance, $\Sigma$,
$$\mathbb{E}_{Q(w)}\left[(w - \mu)^T H (w - \mu)\right] = \mathrm{Tr}\left(H\, \mathbb{E}_{Q(w)}\left[(w - \mu)(w - \mu)^T\right]\right) = \mathrm{Tr}\left(H \Sigma\right) = \sum_{\lambda\nu} H_{\lambda\nu} \Sigma_{\nu\lambda} = \sum_\lambda H_{\lambda\lambda} \sigma^2_\lambda.$$
Evaluating this term requires the diagonal of the Hessian, and as such our method can be used with any drop-in estimate of this diagonal. Note that this regulariser, written explicitly in terms of the Hessian, can also be justified by empirical work on generalisation in neural networks (Wu et al., 2017), which found that networks with smaller Hessian norm generalised better.
In our case, we use the Fisher Information (Kunstner et al., 2019), for three reasons. First, it is extremely stable (the Fisher Information matrix is always positive semi-definite, so its diagonal is never negative). Second, the Fisher Information will allow us to relate our approach to recently published work on implicit gradient regularisation. Third, use of the Fisher Information is standard practice in a variety of work (Khan and Lin, 2017; Khan et al., 2017, 2018; Aitchison, 2018). Nonetheless, it is interesting future work to establish whether any other estimates of the diagonal of the Hessian (e.g. from Dangel et al., 2019) improve performance.
The Fisher Information identity allows us to approximate the diagonal of the Hessian using squared gradients. Note that a factor of $S^2$ arises in this approximation because we defined $g_j$ as the average gradient for the minibatch (Eq. 6), whereas the Fisher Information requires the raw log-probability, which is formed by the sum. Substituting this approximation of the diagonal of the Hessian into the quadratic term of the Taylor expansion gives our deterministic approximation to the expected log-likelihood. Critically, this quantity is a sum over datapoints, and we can therefore use minibatches of data to give unbiased estimates of the objective in stochastic gradient descent.
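The trace argument can be checked numerically. The sketch below uses a hypothetical toy log-likelihood that is exactly quadratic in the weights, so the second-order expansion is exact and the deterministic value $\log P(y|x,\mu) + \tfrac{1}{2}\sum_\lambda H_{\lambda\lambda}\sigma^2_\lambda$ should agree with a sampling-based estimate of the same expectation. All names and numbers here are illustrative assumptions, not quantities from the experiments.

```python
import random

def log_lik(w, c, b):
    # Toy quadratic log-likelihood (hypothetical): -0.5 * sum_l c_l * (w_l - b_l)^2.
    return -0.5 * sum(ci * (wi - bi) ** 2 for wi, ci, bi in zip(w, c, b))

def deterministic_expectation(mu, sigma2, c, b):
    # Value at the mean plus 0.5 * sum_l H_ll * sigma2_l; here H_ll = -c_l.
    hess_diag = [-ci for ci in c]
    return log_lik(mu, c, b) + 0.5 * sum(h * v for h, v in zip(hess_diag, sigma2))

def sampled_expectation(mu, sigma2, c, b, n_samples, seed=0):
    # Monte Carlo estimate of E_Q[log_lik(w)] with w ~ N(mu, diag(sigma2)).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(m, v ** 0.5) for m, v in zip(mu, sigma2)]
        total += log_lik(w, c, b)
    return total / n_samples

c = [1.0, 2.0, 0.5]        # curvatures of the toy log-likelihood
b = [0.0, 1.0, -1.0]       # location of its maximum
mu = [0.5, 0.5, 0.0]       # posterior means
sigma2 = [0.1, 0.05, 0.2]  # posterior variances

exact = deterministic_expectation(mu, sigma2, c, b)
estimate = sampled_expectation(mu, sigma2, c, b, n_samples=100_000)
```

The deterministic value needs a single evaluation, whereas the sampled estimate is only accurate up to Monte Carlo noise, which is the source of the high-variance gradients discussed above.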
Our full approximation to the ELBO thus combines this approximate expected log-likelihood with the analytic KL term. We now have to be extremely careful when using minibatch estimates of the gradient. The easiest approach is to consider $x_i$ and $y_i$ as minibatches, where the full dataset is split into $M$ minibatches, each with $S$ examples. In that case, the objective can be written for a single minibatch, where, remember, $S$ is the minibatch size and $P$ is the total number of datapoints in the full dataset.

Connections to SGD
If we set the approximate posterior to that suggested by the stationary distribution of SGD, and we use a fixed prior variance, $s^2_\lambda$, then the ELBO simplifies further, and consists simply of a log-likelihood, a squared-gradient regulariser and weight decay.
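To make the shape of this objective concrete, the following is a minimal sketch for a toy logistic-regression model with analytic gradients. It is an illustration under stated assumptions rather than the exact objective used in the experiments: it uses per-example gradients instead of minibatch averages (so no $S^2$ or $P/S$ factors appear), a single shared posterior variance `sigma2`, a zero-mean prior with variance `s2`, and it drops additive constants. The name `regularised_objective` and the data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularised_objective(mu, xs, ys, sigma2, s2):
    """Log-likelihood minus squared-gradient regulariser minus weight decay."""
    log_lik = 0.0
    sq_grad = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(m * xi for m, xi in zip(mu, x)))
        log_lik += y * math.log(p) + (1 - y) * math.log(1.0 - p)
        # Gradient of this example's log-likelihood with respect to the weights.
        grad = [(y - p) * xi for xi in x]
        sq_grad += sum(g * g for g in grad)
    weight_decay = sum(m * m for m in mu) / (2.0 * s2)
    # Fisher-style approximation: H_ll is replaced by minus the squared gradient.
    return log_lik - 0.5 * sigma2 * sq_grad - weight_decay

xs = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]]
ys = [1, 0, 1]
mu = [0.1, -0.2]

map_objective = regularised_objective(mu, xs, ys, sigma2=0.0, s2=1.0)
reg_objective = regularised_objective(mu, xs, ys, sigma2=1e-2, s2=1.0)
```

Setting `sigma2=0.0` recovers the MAP objective, so the regulariser can be switched on and off independently of the optimizer.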

Network architecture
Importantly, here the regulariser is the squared gradient of the loss with respect to the parameters. As such, optimising the objective by gradient descent involves second derivatives of the loss, and we therefore should not use piecewise linear activation functions such as ReLU, which have pathological second derivatives. Instead, we used a softplus activation function, but any activation with well-behaved second derivatives is admissible.
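The difference between the two activations can be seen directly from their second derivatives. The sketch below is illustrative only; the function names are hypothetical, and the ReLU case returns NaN at the kink to signal that no finite second derivative exists there.

```python
import math

def softplus(x):
    # log(1 + exp(x)), written to stay stable for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_second_derivative(x):
    # d^2/dx^2 softplus(x) = s(x) * (1 - s(x)), with s the logistic sigmoid.
    s = 0.5 * (1.0 + math.tanh(0.5 * x))
    return s * (1.0 - s)

def relu_second_derivative(x):
    # Zero wherever it exists; the first derivative jumps at x = 0,
    # so there is no finite second derivative at that point.
    return float("nan") if x == 0.0 else 0.0

curvature_at_zero = softplus_second_derivative(0.0)
```

Because the ReLU second derivative is zero almost everywhere, gradients of a squared-gradient regulariser cannot see the curvature contributed by the activation, whereas softplus gives a smooth, strictly positive second derivative.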

Results
We trained a PreActResNet-18 (He et al., 2016b) on CIFAR-10 with fixed approximate posteriors inspired by SGD. We used the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of $10^{-4}$, which decreased by a factor of 10 after 100 epochs, a batch size of 128, and all other optimizer parameters set to their default values. We also ran experiments using beta-annealing with $\beta = 0.1$ (Huang et al., 2018). We compared the performance of adding noise to the parameters using "MCVI" (i.e. using samples to evaluate the expectation in Eq. 18) against using our approximate ELBO, and against MAP inference. We found that our approximate ELBO gave superior performance to both MCVI and MAP. We also tried MCVI for four times as many epochs ("rescaled"), to compensate for the additional compute required to backpropagate gradients. Remarkably, this actually performed worse than the standard settings of MCVI, and we hypothesise that the additional training time allows it to overfit more strongly.
Our approximate ELBO displayed optimal performance with a variance of around $\sigma^2 = 10^{-4}$ or $10^{-5}$ (Fig. 1). To understand whether this is sensible in comparison with SGD learning rates, we solved for the implied learning rate if we were to use a batch size of $S = 50$. These settings correspond to an SGD learning rate of $10^{-2}$ or $10^{-3}$, which is a common final learning rate (e.g. if we start at a learning rate of $10^{-1}$ and decay the learning rate by a factor of 10 once or twice; He et al., 2016a,b).
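The conversion between posterior variance and learning rate can be sketched as follows. The relation $\sigma^2 = \eta / (2S)$ used here is an assumption: it is the form suggested by the stationary-distribution analysis of Mandt et al. (2017) and it reproduces the numbers quoted above, but the exact expression is given in Appendix B. The symbol $\eta$ for the learning rate and the function name are introduced only for this illustration.

```python
def implied_learning_rate(sigma2, batch_size):
    # Assumed relation sigma^2 = eta / (2 * S), rearranged for eta.
    return 2.0 * batch_size * sigma2

# Posterior variances reported as optimal, converted with S = 50.
implied = {sigma2: implied_learning_rate(sigma2, batch_size=50)
           for sigma2 in (1e-4, 1e-5)}
```

Under this assumed relation, a variance of $10^{-4}$ maps to a learning rate of $10^{-2}$ and a variance of $10^{-5}$ to $10^{-3}$, up to floating-point rounding.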
In addition, our approximate ELBO can be used to learn the approximate posterior variance. We used a single scalar posterior variance for each convolutional weight matrix, which would correspond to using a different SGD learning rate for each convolutional weight matrix. Sampling-based MCVI performed poorly, even with small initial variances (initialized to $e^{-6}$ times their prior value). In past work on MCVI, it was found that achieving even 80% performance on CIFAR-10 required choosing an unusually small network (Ober and Aitchison, 2020) with 32 channels in all layers. Here, by contrast, we used a standard-size PreActResNet-18 with 64 channels in the first layers, increasing to 512 channels in the final layers. Remarkably, our approximate ELBO in these larger standard networks gave superior performance even to MCVI in networks specifically tuned for variational inference (Table 1).

Related work
Recent work also showed that gradient descent implies an implicit squared-gradient regularisation (Barrett and Dherin, 2020), with a very similar dependence of the strength of regularisation on the SGD step size. Remarkably, this arose despite their taking a radically different approach. In particular, they showed that squared-gradient regularisation can arise through errors when using finite-step-size Euler integration, as compared to following the true underlying gradient flow. However, this approach was derived for gradient descent, and does not take into account the fundamental stochasticity of stochastic gradient descent. The emergence of such similar results from fundamentally very different analytical approaches suggests that gradient regularisation is a fundamental component of the implicit regularisation inherent in SGD.
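The finite-step-size argument can be illustrated on a one-dimensional quadratic loss $L(w) = \tfrac{1}{2} a w^2$. The sketch below assumes, as we understand the result of Barrett and Dherin (2020), that a gradient-descent step of size $\eta$ follows the gradient flow of the modified loss $L + \tfrac{\eta}{4} \|\nabla L\|^2$ more closely than the flow of $L$ itself. The toy loss, the symbol $\eta$ and the function names are assumptions made for this illustration.

```python
import math

def gd_step(w0, a, eta):
    # One gradient-descent (Euler) step on L(w) = 0.5 * a * w**2.
    return w0 - eta * a * w0

def flow_original(w0, a, eta):
    # Exact gradient flow on L for time eta: dw/dt = -a * w.
    return w0 * math.exp(-eta * a)

def flow_modified(w0, a, eta):
    # Exact gradient flow on L + (eta / 4) * (dL/dw)**2 for time eta.
    # Here (dL/dw)**2 = a**2 * w**2, so the effective curvature is a * (1 + eta * a / 2).
    return w0 * math.exp(-eta * a * (1.0 + 0.5 * eta * a))

w0, a, eta = 1.0, 1.0, 0.1
err_original = abs(gd_step(w0, a, eta) - flow_original(w0, a, eta))
err_modified = abs(gd_step(w0, a, eta) - flow_modified(w0, a, eta))
```

For these values the discrete step lies much closer to the flow on the modified loss, which is the sense in which finite step sizes act as an implicit squared-gradient penalty.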

Conclusions
We showed that it is possible to use a second-order Taylor expansion to compute a deterministic approximation of the ELBO in Bayesian neural networks. While any drop-in estimate of the diagonal of the Hessian could be used, here we consider the squared gradient (via the Fisher Information), as this allows efficient optimization and allows us to connect to gradient regularisation (Barrett and Dherin, 2020). Further, we connect to SGD by noting that the stationary distribution for SGD is isotropic Gaussian (Mandt et al., 2017). This approximate ELBO gives superior performance to unbiased sampling-based estimation of the ELBO, and to MAP inference. Finally, our framework makes a prediction: that the optimal posterior variance for the ELBO corresponds to standard settings of the SGD learning rate. Remarkably, we found that this prediction held, with the optimal posterior variance in ResNets on CIFAR-10 being $10^{-4}$ or $10^{-5}$, which corresponds to standard values for the final SGD learning rate used in these models.

Figure 1: Training a PreActResNet-18 on CIFAR-10 with a fixed-variance approximate posterior, to mirror SGD. The posterior variance is shown on the lower x-axis and the corresponding value for the SGD learning rate is shown on the upper x-axis. "Sampled ELBO" corresponds to 200 training epochs and "Rescaled sampled ELBO" corresponds to 800 training epochs.

Table 1: Performance of a PreActResNet-18 with a single scalar learned variance ($\sigma^2$) and a learned diagonal covariance ($\Sigma$), respectively, for each convolutional weight matrix.