A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization

Stochastic gradient (SG)-based algorithms for Markov chain Monte Carlo sampling (SG-MCMC) tackle large-scale Bayesian modeling problems by operating on mini-batches and injecting noise into the SG steps. The sampling properties of these algorithms are determined by user choices, such as the covariance of the injected noise and the learning rate, and by problem-specific factors, such as assumptions on the loss landscape and the covariance of the SG noise. However, current SG-MCMC algorithms applied to popular complex models such as deep networks cannot simultaneously satisfy the assumptions on loss landscapes and on the behavior of the covariance of the SG noise while operating with the practical requirement of non-vanishing learning rates. In this work we propose a novel practical method, which makes the SG noise isotropic using a fixed learning rate that we determine analytically. Extensive experimental validations indicate that our proposal is competitive with the state of the art in SG-MCMC.


Introduction
Stochastic gradient (SG) methods have been extensively studied as a means for MCMC-based Bayesian posterior sampling algorithms to scale to large data regimes. Variants of SG-MCMC algorithms have been studied through the lens of first-order [1-3] or second-order [4,5] Langevin dynamics, which are mathematically convenient continuous-time processes corresponding to discrete-time gradient methods without and with momentum, respectively. The common traits underlying many methods from the literature can be summarized as follows: they address large data requirements using SG and mini-batching, they inject Gaussian noise throughout the algorithm execution, and they avoid the expensive Metropolis–Hastings accept/reject tests that use the whole dataset [1,2,4].
Despite mathematical elegance and some promising results restricted to simple models, current approaches fall short in dealing with the complexity of the loss landscape typical of popular modern machine learning models, e.g., neural networks [6,7], for which stochastic optimization poses some serious challenges [8,9].
In general, SG-MCMC algorithms inject random noise into SG descent algorithms: the covariance of this noise and the learning rate (or step size, in the stochastic differential equation simulation community) are tightly related to the assumptions on the loss landscape, which, together with the SG noise, determine the sampling properties of these methods [5]. However, current SG-MCMC algorithms applied to popular complex models such as deep networks cannot simultaneously satisfy the assumptions on the posterior distribution geometry and on the behavior of the covariance of the SG noise while operating with the practical requirement of non-vanishing learning rates. In this paper, in accordance with most of the neural network literature, we refer to the posterior distribution geometry as the loss landscape. Some recent work [10] instead argues for fixed step sizes, but settles for variational approximations of quadratic losses. Although we are not the first to highlight these issues, including the lack of a unified notation [5], we believe that the role of noise in SG-MCMC algorithms has not received enough attention, and a deeper understanding is truly desirable, as it can clarify how various methods compare. Most importantly, this endeavor can suggest novel and more practical algorithms relying on fewer parameters and less restrictive assumptions.
In this work we choose a mathematical notation that emphasizes the role of the noise covariances and of the learning rate in the behavior of SG-MCMC algorithms (Section 2). As a result, the equivalence between learning rate annealing and an extremely large injected noise covariance becomes apparent, which allows us to propose a novel practical SG-MCMC algorithm (Section 3). We derive our proposal by first analyzing the case where we inject the smallest complementary noise such that its combined effect with the SG noise results in isotropic noise. Thanks to this isotropic property, it is possible to deal with the intricate loss surfaces typical of deep models and produce samples from the true posterior without learning rate annealing. This, however, comes at the expense of matrix operations with cubic complexity. We address this issue through a practical variant of our scheme, which employs well-known approximations to the SG noise covariance (see, e.g., [11]). The result is an algorithm that produces approximate posterior samples with a fixed, theoretically derived learning rate. Please note that, in the generic Bayesian deep learning setting, none of the existing implementations of SG-MCMC methods converge to the true posterior without learning rate annealing. In contrast, our method automatically determines an appropriate learning rate through a simple estimation procedure. Furthermore, our approach can be readily applied to pre-trained models: after a "warmup" phase to compute SG noise estimates, it can efficiently perform Bayesian posterior sampling.
We evaluate SG-MCMC algorithms (Section 4) through an extensive experimental campaign, where we compare our approach to several alternatives, including Monte Carlo Dropout (MCD) [12] and Stochastic Weighted Averaging Gaussians (SWAG, [9]), which have been successfully applied to the Bayesian deep learning setting. Our results indicate that our approach offers performance that is competitive with the state of the art, according to metrics that assess predictive accuracy and uncertainty.

Preliminaries and Related Work
Consider a dataset of m-dimensional observations D = {U_i}_{i=1}^N. Given a prior p(θ) for a d-dimensional set of parameters and a likelihood model p(D|θ), the posterior is obtained by means of Bayes' theorem as follows:

p(θ|D) = p(D|θ) p(θ) / p(D), (1)

where p(D) is also known as the model evidence, defined as the integral p(D) = ∫ p(D|θ) p(θ) dθ. Except when the prior and the likelihood function are conjugate, Equation (1) is analytically intractable [13]. However, the joint likelihood term in the numerator is typically not hard to compute; this is a key element of many MCMC algorithms, since the normalization constant p(D) does not affect the shape of the distribution in any way other than scaling. The posterior distribution is necessary to obtain predictive distributions for new test observations U*, as:

p(U*|D) = ∫ p(U*|θ) p(θ|D) dθ. (2)

We focus in particular on Monte Carlo methods to obtain an estimate of this predictive distribution, by averaging over N_MC samples obtained from the posterior over θ, i.e.,

p(U*|D) ≈ (1/N_MC) ∑_{i=1}^{N_MC} p(U*|θ_i), with θ_i ∼ p(θ|D). (3)

We develop our work with an unnormalized version of the logarithm of the posterior density, by expressing the negative logarithm of the joint distribution of the dataset D and parameters θ as:

f(θ) = −log p(D|θ) − log p(θ). (4)

For computational efficiency, we use a minibatch stochastic gradient g(θ), which guarantees that the estimated gradient is an unbiased estimate of the true gradient ∇f(θ), and we assume that the randomness due to the minibatch introduces Gaussian noise:

g(θ) ∼ N(∇f(θ), 2B(θ)), (5)

where the matrix B(θ) denotes the SG noise covariance, which depends on the parametric model, the data distribution and the minibatch size. A survey of algorithms to sample from the posterior using SG methods can be found in Ma et al. [5]. In Appendix A we report some well-known facts which are relevant for the derivations in our paper. As shown in the literature [10,14], there are structural similarities between SG-MCMC algorithms and stochastic optimization methods, and both can be used to draw samples from posterior distributions.
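The Monte Carlo predictive estimate described above can be sketched in a few lines of NumPy. This is an illustrative toy (the Gaussian likelihood and the stand-in posterior samples are our own assumptions, not part of the paper's models):

```python
import numpy as np

def predictive_density(u_star, posterior_samples, likelihood):
    """Monte Carlo estimate of p(u*|D): average the likelihood
    p(u*|theta_i) over posterior samples theta_i ~ p(theta|D)."""
    return float(np.mean([likelihood(u_star, th) for th in posterior_samples]))

def gaussian_lik(u, theta):
    # Toy likelihood N(u | theta, 1); purely illustrative.
    return np.exp(-0.5 * (u - theta) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
# Stand-in "posterior samples" N(1, 0.1^2); a real run would use SG-MCMC draws.
samples = rng.normal(loc=1.0, scale=0.1, size=5000)
p_hat = predictive_density(1.0, samples, gaussian_lik)
```

In a real application, `samples` would be the θ_i draws produced by a posterior sampler rather than synthetic Gaussians.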
Notice that the original goal of stochastic optimization is to find the minimum of a given cost function, with stochasticity introduced by sub-sampling the dataset to scale. SG-MCMC methods instead aim at sampling from a given distribution, i.e., collecting multiple values, and the stochasticity is necessary to explore the whole landscape. In what follows, we use a unified notation to compare many existing algorithms in light of the role played by their noise components.
It is well known [15-17] that stochastic gradient descent (SGD), with and without momentum, can be studied through the following stochastic differential equation (SDE), valid when the learning rate η is small enough (in this work we do not consider discretization errors; the reader can refer to classical SDE texts such as [18] to investigate the topic in greater depth):

dz_t = s(z_t) dt + √(2 D(z_t)) dW_t, (6)

where s is usually referred to as the driving force and D as the diffusion matrix. We use a generic form of the SDE, with variable z instead of θ, which accommodates SGD variants with and without momentum. By doing this, we will be able to easily cast the expression for the two cases in what follows. (The operator ∇ applied to a matrix D(z) produces a row vector whose elements are the divergences of the columns of D(z). Our notation is aligned with Chen et al. [4].)
A distribution ρ(z) ∝ exp(−φ(z)) is said to be a stationary distribution for an SDE of the form (6) if and only if it satisfies the following Fokker–Planck equation (FPE):

∇ [−s(z) ρ(z) + ∇ (D(z) ρ(z))] = 0. (7)

Please note that, in general, the stationary distribution does not coincide with the desired posterior, i.e., in general φ(z) ≠ f(z), as shown by Chaudhari and Soatto [8]. Additionally, given an initial condition for z_t, its distribution converges to ρ(z) only for t → ∞. In practice, we observe the SDE dynamics for a finite amount of time: we then declare that the process is approximately in the stationary regime once the potential has reached low and stable values.
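To build intuition for the stationary-distribution argument, the following sketch (our own illustrative example, not the paper's algorithm) simulates the overdamped Langevin SDE dz = −∇f(z) dt + √2 dW on a toy quadratic potential via Euler–Maruyama; after burn-in, the samples should approximately follow ρ(z) ∝ exp(−f(z)), here a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
grad_f = lambda z: z                     # f(z) = z^2 / 2, so exp(-f) is N(0, 1)
eta, n_steps, burn_in = 1e-2, 200_000, 50_000
z, kept = 0.0, []
for n in range(n_steps):
    # Euler-Maruyama discretization of dz = -grad f(z) dt + sqrt(2) dW
    z = z - eta * grad_f(z) + np.sqrt(2.0 * eta) * rng.standard_normal()
    if n >= burn_in:
        kept.append(z)
kept = np.asarray(kept)
```

The empirical mean and variance of `kept` should be close to 0 and 1, the moments of the stationary Gaussian, up to discretization and Monte Carlo error.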
Next, we briefly overview known approaches to Bayesian posterior sampling, and interpret them as variants of an SGD process, using the FPE formalism.

Gradient Methods without Momentum
The generalized update rule of SGD, described as a discrete-time stochastic process, writes as:

δθ_n = −η P(θ_{n−1}) (g(θ_{n−1}) + w_n), (8)

where P(θ_{n−1}) is a user-defined preconditioning matrix and w_n is a noise term, distributed as w_n ∼ N(0, 2C(θ_n)) with a user-defined covariance matrix C(θ_n). The corresponding continuous-time SDE is [15]:

dθ_t = −P(θ_t) ∇f(θ_t) dt + √(2η P(θ_t) Σ(θ_t) P(θ_t)) dW_t. (9)

In this paper we use the symbol n to indicate discrete time and t for continuous time. We denote by C(θ) the covariance of the injected noise and by Σ(θ) the composite noise covariance; note that Σ(θ_t) = B(θ_t) + C(θ_t) combines the SG and the injected noise. Notice that our choice of notation differs from the standard one, in which the starting discrete-time process has the form δθ_n = −ηP(θ_{n−1}) g(θ_{n−1}) + w_n. By directly grouping the injected noise with the stochastic gradient, we can better appreciate the relationship between annealing the learning rate and an extremely large injected noise. Moreover, as will be explained in Section 3, this grouping allows the derivation of a new sampling algorithm. We define the stationary distribution of the SDE in Equation (9) as ρ(θ) ∝ exp(−φ(θ)). Please note that when C = 0, the potential φ(θ) differs from the desired posterior f(θ) [8].
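A minimal sketch of the generalized update rule, with the injected noise w_n ∼ N(0, 2C) grouped with the (here exact, i.e., B = 0) gradient; the function and parameter names are ours, not from the paper's code:

```python
import numpy as np

def noisy_sgd_step(theta, grad, eta, P, C, rng):
    """One step of the generalized rule: theta <- theta - eta * P (g + w),
    with injected noise w ~ N(0, 2C) grouped with the gradient."""
    w = rng.multivariate_normal(np.zeros(theta.size), 2.0 * C)
    return theta - eta * P @ (grad(theta) + w)

rng = np.random.default_rng(2)
d = 2
P, C = np.eye(d), 0.5 * np.eye(d)    # identity preconditioning, isotropic injected noise
grad = lambda th: th                 # exact gradient of f = ||theta||^2 / 2 (B = 0)
theta = np.ones(d)
for _ in range(2000):
    theta = noisy_sgd_step(theta, grad, 0.05, P, C, rng)
```

With a quadratic potential the iterates stay in a small neighborhood of the minimum, fluctuating with a variance controlled by η and C.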
The following theorem, an adaptation of known results in light of our formalism, states the conditions under which noisy SGD converges to the true posterior distribution (proof in Appendix A).

Theorem 1. Consider dynamics of the form (9) and define the stationary distribution ρ(θ) ∝ exp(−φ(θ)). If

η P(θ) Σ(θ) = I and ∇ P(θ) = 0, (10)

then φ(θ) = f(θ).
An extension of SGLD is Stochastic Gradient Fisher Scoring (SGFS) [2], which can be tuned to switch between sampling from an approximate posterior, using a non-vanishing learning rate, and the true posterior, by annealing the learning rate to zero. SGFS uses preconditioning, P(θ) ∝ B(θ)^{−1}. In practice, however, B(θ) is ill conditioned for complex models such as deep neural networks: many of its eigenvalues are almost zero [8], and computing B(θ)^{−1} is problematic. An in-depth analysis of SGFS reveals that conditions (10) would be met with a non-vanishing learning rate only if, at convergence, ∇(B(θ)^{−1}) = 0, which would be trivially true if B(θ) were constant. However, recent work [6,7] suggests that this condition is difficult to justify for deep neural networks.
The work by [10] investigates constant-rate SGD (with no injected noise), and determines analytically the learning rate and preconditioning that minimize the Kullback–Leibler (KL) divergence between an approximation and the true posterior. Moreover, it shows that the preconditioning used in SGFS is optimal, in the sense that it converges to the true posterior, when B(θ) is constant and the true posterior has a quadratic form.
In summary, to claim convergence to the true posterior distribution, existing approaches require either vanishing learning rates or assumptions on the SG noise covariance that are difficult to verify in practice, especially when considering deep models. We instead propose a novel practical method that induces isotropic SG noise and thus satisfies Theorem 1. We determine analytically a fixed learning rate, and we require weaker assumptions on the loss shape.

Gradient Methods with Momentum
Momentum-corrected methods emerge as a natural extension of SGD approaches. The general set of update equations for (discrete-time) momentum-based algorithms is:

θ_n = θ_{n−1} + η P(θ_{n−1}) M^{−1} r_{n−1},
r_n = r_{n−1} − η P(θ_{n−1}) g(θ_{n−1}) − η A(θ_{n−1}) M^{−1} r_{n−1} + w_n, (11)

where P(θ_{n−1}) is a preconditioning matrix, M is the mass matrix and A(θ_{n−1}) is the friction matrix, as shown by [4,19]. As with the first-order counterpart, the noise term is distributed as w_n ∼ N(0, 2C(θ_n)). Then, the SDE describing the continuous-time system dynamics is:

dθ_t = P(θ_t) M^{−1} r_t dt,
dr_t = −P(θ_t) ∇f(θ_t) dt − A(θ_t) M^{−1} r_t dt + √(2η P(θ_t)² Σ(θ_t)) dW_t,

where P(θ_t)² = P(θ_t)P(θ_t) and we assume P(θ_t) to be symmetric. The theorem hereafter describes the conditions under which noisy SGD with momentum converges to the true posterior distribution (proof in Appendix A).

Theorem 2. Consider dynamics of the above form and define the stationary distribution ρ(θ) ∝ exp(−φ(θ)). If

A(θ) = η P(θ)² Σ(θ) and ∇ P(θ) = 0, (12)

then φ(θ) = f(θ).
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [4] recognizes that estimating B(θ) can be costly. Hence, the injected noise C(θ) is chosen such that C(θ) = η^{−1}A(θ), where A(θ) is user-defined. When η → 0, the approximation Σ(θ) ≈ C(θ) holds, and it is then trivial to check that conditions (12) hold without the need for explicitly estimating B(θ). A further practical reason to avoid setting A(θ) = ηB(θ) is that the computational cost of the operation A(θ_{n−1})M^{−1}r_{n−1} is O(d²), whereas if C(θ) is diagonal this is reduced to O(d). Requiring a vanishingly small learning rate, however, severely slows down the sampling process.
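The SGHMC update can be sketched as follows (a simplified, illustrative implementation after [4], with constant user-chosen friction A and identity mass; names and constants are our own):

```python
import numpy as np

def sghmc_step(theta, r, grad, eta, A, M_inv, rng):
    """One SGHMC update: friction A M^{-1} r damps the momentum, and noise
    with per-step covariance 2*eta*A is injected, matching the choice
    C(theta) = A / eta so that Sigma is dominated by C for small eta."""
    d = theta.size
    noise = rng.multivariate_normal(np.zeros(d), 2.0 * eta * A)
    r = r - eta * grad(theta) - eta * A @ (M_inv @ r) + noise
    theta = theta + eta * (M_inv @ r)
    return theta, r

rng = np.random.default_rng(3)
d = 2
A, M_inv = 0.1 * np.eye(d), np.eye(d)
grad = lambda th: th                 # toy quadratic potential f = ||theta||^2 / 2
theta, r = np.ones(d), np.zeros(d)
thetas = []
for _ in range(5000):
    theta, r = sghmc_step(theta, r, grad, 1e-2, A, M_inv, rng)
    thetas.append(theta.copy())
```

Note that no estimate of B(θ) appears anywhere in the step, which is precisely the practical appeal of this choice of C(θ).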
From a practical standpoint, momentum-based methods suffer from the requirement to tune many hyperparameters, including the learning rate, and the parameters that govern the simulation of a second-order Langevin dynamics.
The method we propose in this work can be applied to momentum-based algorithms; in this case, it could be viewed as an extension of the work in [11], albeit addressing the complex loss landscapes typical of deep neural networks. However, we leave this avenue of research for future work.

Sampling by Layer-Wise Isotropization
We present a simple and practical approach to inject noise into SGD iterates to perform Bayesian posterior sampling. Our goal is to sample from the true posterior distribution (or approximations thereof) using a constant learning rate, relying on more lenient assumptions about the shape of the loss landscape of deep models than previous works. In general, modern machine learning applications deal with multi-layer neural networks [21]. We exploit the natural subdivision of the parameters of these architectures into different layers to propose a practical sampling scheme. Careful inspection of Theorem 1 reveals that the matrices P(θ) and Σ(θ) are instrumental in determining the convergence properties of SG methods to the true posterior. Therefore, we take the constructive approach of designing ηP(θ) to obtain a sampling scheme that meets our goals; we set ηP(θ) to be a constant, diagonal matrix which we constrain to be layer-wise uniform:

ηP(θ) = Λ^{−1}, with Λ = diag(λ^(1), …, λ^(1), …, λ^(N_l), …, λ^(N_l)), (13)

where λ^(p) is repeated once for each parameter of the p-th layer and N_l is the number of layers. By properly selecting the set of parameters {λ^(p)} we can simultaneously achieve a non-vanishing learning rate and a well-conditioned preconditioning matrix. This implies a layer-wise learning rate η^(p) = 1/λ^(p) for the p-th layer, without further preconditioning. We can now prove (see Appendix B), as a corollary to Theorem 1, that our design choices guarantee convergence to the true posterior distribution. Corollary 1. Consider dynamics of the form (9) and define the stationary distribution ρ(θ) ∝ exp(−φ(θ)). If ηP(θ) = Λ^{−1} as in (13), C(θ) = Λ − B(θ) and C(θ) ⪰ 0 ∀θ, then φ(θ) = f(θ).
If the aforementioned conditions are satisfied, it is in fact simple to show that the relevant matrices satisfy the conditions in Equation (10). The covariance matrix of the composite noise is then isotropic within the layers of (deep) models: in fact, Σ(θ) = B(θ) + C(θ) = B(θ) + Λ − B(θ) = Λ. From a practical point of view, we choose Λ to be, among all valid matrices satisfying Λ − B(θ) ⪰ 0, the smallest (the one with the smallest λ's). Indeed, a larger Λ induces a smaller learning rate, thus unnecessarily reducing sampling speed. Now, let us consider an ideal case, in which we assume the SG noise covariance B(θ), and hence Λ, to be known in advance. The procedure described in Algorithm 1 illustrates a naive SG method that uses the injected noise covariance C(θ) to sample from the true posterior.
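Under the diagonal-noise view, the construction of the smallest valid Λ and of the complementary noise C = Λ − B reduces to a per-layer maximum, as the following toy sketch illustrates (our own illustrative code, with a made-up diagonal of B):

```python
import numpy as np

def layerwise_lambda(b_diag, layer_sizes):
    """Given the (assumed known) diagonal b of the SG noise covariance and the
    layer sizes, pick for each layer the smallest lambda^(p) such that
    Lambda - B >= 0, i.e., lambda^(p) = max of b over that layer."""
    lams, start = [], 0
    for size in layer_sizes:
        lam = b_diag[start:start + size].max()
        lams.extend([lam] * size)
        start += size
    return np.array(lams)

b = np.array([0.2, 0.5, 0.1, 1.0, 0.3])        # toy diagonal of B(theta)
lam = layerwise_lambda(b, layer_sizes=[3, 2])  # two layers, of sizes 3 and 2
c = lam - b                                    # injected-noise variances, C = Lambda - B
```

Each layer's composite noise variance is then uniformly λ^(p), i.e., layer-wise isotropic, and c is non-negative by construction.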
This deceivingly simple procedure generates samples from the true posterior with a non-vanishing learning rate, as shown earlier. However, it cannot be used in practice, as B(θ) and Λ are unknown; furthermore, the algorithm requires computationally expensive matrix operations. Next, we describe a practical variant of our approach, where we use approximations at the expense of generating samples from the true posterior distribution. We note that [10] suggests exploring a related preconditioning, but does not develop this path. Moreover, the proposed method shares similarities with a scheme proposed in [22], although the analysis we perform here is different.

A Practical Method: Isotropic SGD
To render the idealized sampling method practical, some additional assumptions are necessary. As we explain at the end of this section, the assumptions that follow are less strict than those of other approaches in the literature. Assumption 1. The SG noise covariance B(θ) can be approximated with a diagonal matrix, i.e., B(θ) = diag(b(θ)).

Assumption 2.
The signal-to-noise ratio (SNR) of the gradient is small enough that, in the stationary regime, the second-order moment of the gradient is a good estimate of its true variance. Hence, combining with Assumption 1, b(θ) ≈ E[g(θ) ⊙ g(θ)], where ⊙ indicates the element-wise product. The diagonal covariance assumption (i.e., Assumption 1) is common in other works, such as [2,11]. The small signal-to-noise ratio stated in Assumption 2 is in line with recent studies, such as [11,23]. Assumption 3 is similar to assumptions that appeared in earlier work, such as [24]. Please note that Assumptions 2 and 3 must hold in the stationary regime, when the process reaches the bottom valley of the loss landscape. The matrix diag(b(θ)) has been associated in the literature with the empirical Fisher information matrix [2,25]. As we do not consider this matrix for preconditioning purposes, we do not further investigate this connection.
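Assumption 2 can be sanity-checked numerically: when the gradient mean is small relative to its standard deviation, the element-wise second moment is close to the variance (a toy simulation of our own, not the paper's estimator code):

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated per-parameter stochastic gradients in the stationary regime:
# a small mean (low SNR) plus comparatively large noise.
mean, std, n = 0.05, 1.0, 100_000
g = rng.normal(mean, std, size=(n, 3))
second_moment = (g * g).mean(axis=0)   # Assumption 2's estimator of the variance
true_var = g.var(axis=0)
```

The gap between the two estimators is exactly the squared sample mean, which is negligible when the SNR is low.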
Given our assumptions and design choices, it is then possible to show (see Appendix B) that the optimal (i.e., the smallest possible) Λ = diag(λ^(1), …, λ^(1), …, λ^(N_l), …, λ^(N_l)) satisfying Corollary 1 can be obtained layer-wise from the entries of b(θ) belonging to the p-th layer. Please note that we do not assume B(θ) to be known; instead, we use a simple procedure to estimate its components by computing b^(p)(θ) ≈ g^(p)(θ) ⊙ g^(p)(θ), where g^(p)(θ) is the portion of the stochastic gradient corresponding to the p-th layer. The composite noise matrix Σ = Λ is then a layer-wise isotropic covariance matrix, which inspires the name of our proposed method: Isotropic SGD (I-SGD).
The practical implementation of I-SGD is shown in Algorithm 2. An advantage of I-SGD is that it can be used to obtain posterior samples either starting from a pre-trained model or training a model from scratch. In either case, the estimates of B(θ) are used to compute Λ, as discussed above. An important consideration is that, once all λ^(p) have been estimated, the learning rate is determined automatically, layer by layer: for the p-th layer, the learning rate is η^(p) = (λ^(p))^{−1}. A simpler approach is to use a unique learning rate for all layers, where the equivalent λ is the sum of all λ^(p).
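A compact sketch of the practical scheme on a toy quadratic problem is given below. This is our own minimal reading of Algorithm 2 (warmup second-moment estimate of b, per-layer λ^(p) taken as the layer maximum, fixed learning rate 1/λ^(p), complementary injected noise), not the authors' implementation:

```python
import numpy as np

def isgd_sample(theta, sg, layer_slices, n_warmup, n_steps, rng):
    """Sketch of practical I-SGD: (i) a warmup phase estimates the diagonal
    SG noise b via second moments of minibatch gradients (Assumptions 1-2);
    (ii) lambda^(p) is the per-layer maximum of b, fixing the layer-wise
    learning rate eta^(p) = 1 / lambda^(p); (iii) sampling steps inject
    complementary noise of variance lambda^(p) - b, so the composite noise
    is layer-wise isotropic."""
    b = np.zeros_like(theta)
    for _ in range(n_warmup):
        g = sg(theta)
        b += g * g / n_warmup              # second-moment estimate of b(theta)
    samples = []
    for _ in range(n_steps):
        g = sg(theta)
        for sl in layer_slices:
            lam = b[sl].max()              # smallest valid lambda^(p)
            w = rng.normal(0.0, np.sqrt(2.0 * np.maximum(lam - b[sl], 0.0)))
            theta[sl] = theta[sl] - (g[sl] + w) / lam   # eta^(p) = 1 / lambda^(p)
        samples.append(theta.copy())
    return samples

rng = np.random.default_rng(5)
# Toy problem: f(theta) = ||theta||^2 / 2 with synthetic gradient noise.
sg = lambda th: th + 0.3 * rng.standard_normal(th.size)
samples = isgd_sample(np.ones(4), sg, [slice(0, 2), slice(2, 4)],
                      n_warmup=500, n_steps=500, rng=rng)
```

Note how no learning rate is supplied by the user: each layer's step size falls out of the noise estimate, which is the main practical selling point of the method.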

A Remark on Convergence
In summary, I-SGD is a practical method to perform approximate Bayesian posterior sampling, backed by solid theoretical foundations. Our assumptions, which are at the origin of the approximate nature of I-SGD, are less strict than those used in the SG-MCMC literature. More precisely, the theory behind I-SGD explains convergence to the true posterior with a non-vanishing learning rate in the particular case where Assumption 1 holds and the estimation of B(θ) is perfect. Even with perfect estimates, this is not the case for SGFS, which requires the correction term ∇(B(θ)^{−1}) = 0. Additionally, both SGRLD and SGRHMC are more demanding than I-SGD because they require computing ∇(B(θ)^{−1}), for which an estimation procedure is elusive. Finally, the method by Springenberg et al. [11] needs a constant, diagonal B(θ), a condition that does not necessarily hold for deep models.

Computational Cost
The computational cost of I-SGD is as follows. As with [4], we define the cost of computing a gradient minibatch as C g (N b , d). Thanks to Assumptions 1 and 2, the computational cost for estimating the noise covariance scales as O(d) multiplications. The computational cost of generating random samples with the desired covariance scales as O(d) square roots and O(d) multiplications (without considering the cost of generating random numbers). The overall cost of our method is the sum of the above terms. Notice that the cost of estimating the noise covariance does not depend on the minibatch size N b . We would like to stress that in many modern models, the real computational bottleneck is the backward propagation for the computation of the gradients. As all the SG-MCMC methods considered in this work require one gradient evaluation per step, the different methods have in practice the same complexity.
The space complexity of I-SGD is the same as that of SGHMC, SGFS and variants: it scales as O(N_sam d), where N_sam is the number of posterior samples.

Experiments
The empirical analysis of our method, and its comparison to alternative approaches from the literature, is organized as follows. First, we proceed with a validation of I-SGD using the standard UCI datasets [26] and a shallow neural network. Then we move to the case of deeper models: we begin with a simple CNN used on the MNIST [27] dataset, then move to the standard RESNET-18 [28] deep network using the CIFAR-10 [29] dataset.
We compare I-SGD to other Bayesian sampling methods such as SGHMC [4] and SGLD [2], and to alternative approaches to approximate Bayesian inference, including MCD [12], SWAG [9] and VSGD [10]. In general, our results indicate that: (1) I-SGD achieves similar or superior performance compared to its competitors in terms of uncertainty quantification, even with simple datasets and models; (2) I-SGD is simple to tune when compared to the alternatives; (3) I-SGD is competitive when used for deep Bayesian modeling, even when compared to standard methods used in the literature. In particular, the proposed method shares some of the strengths of VSGD, such as automatically determined learning rates, and the simplicity of SGLD. Appendix B includes additional implementation details on I-SGD. Appendix C presents the detailed configurations of all methods we compare, and additional experimental results.

A Disclaimer on Performance Characterization
It is important to stress a detail about the analysis of the experimental campaign. The discussion is usually focused on how well the various methods represent the true posterior distribution. Different methods may or may not claim convergence to the true posterior, according to certain assumptions and the nature of the hyperparameters. In the experimental validation, however, we do not have access to the form of the true posterior, as finding it is exactly the problem we are trying to solve. The practical solution is to compare the different methods in terms of proxy metrics evaluated on the test sets, such as accuracy and uncertainty metrics. Being better in terms of these performance metrics does not imply that a sampling method is better at approximating the posterior distribution, and outperforming competitors in terms of these metrics does not provide sufficient information about the intrinsic quality of the sampling scheme.

Regression Tasks, with Simple Models
We consider several regression tasks defined on the UCI datasets. We use a simple neural network configuration with two fully connected layers and a ReLU activation function; the hidden layer includes 50 units. In this set of experiments, we use the following metrics: the root mean square error (RMSE) to judge the model predictive performance and the mean negative log-likelihood (MNLL) as a proxy for uncertainty quantification. We note that the task of tuning our competitors was far from trivial. We used our own version of SGHMC, based on [11], to ensure a proper understanding of the implementation internals, and we proceeded with a tuning process to find appropriate values for the numerous hyperparameters. In this set of experiments, we omit results for SWAG, which we keep for more involved scenarios.
Tables 1 and 2 report a complete overview of our results for a selection of UCI datasets. For each method and each dataset, we also report how many of the 10 splits considered failed to converge, indicated as F = . . . . As explained in Appendix C, we implemented a temperature-scaled version of VSGD. A clear picture emerges from this first set of experiments: while the RMSE performance is similar across methods, for the MNLL averaging over multiple samples clearly improves the uncertainty quantification capabilities. SGHMC is in many cases better than the alternatives; however, considering the standard deviation of the results, it is difficult to claim clear superiority of one method over the others.

Classification Tasks, with Deeper Models
Next, we compare I-SGD against competitors on image classification tasks. First, we use the MNIST dataset and a simple LENET-5 CNN [30]. All methods are compared based on the test accuracy (ACC), MNLL and the expected calibration error (ECE, [31]). Additionally, at test time, we carry out predictions on both MNIST and NOT-MNIST; the latter is a dataset equivalent to MNIST, but it represents letters rather than numbers (http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html, accessed on 24 October 2021). This experimental setup is often used to check whether the entropy of the predictions on NOT-MNIST is higher than the entropy of the predictions on MNIST; the entropy of the output of an N_cl-class classifier, represented by the vector p, is defined as H(p) = −∑_{i=1}^{N_cl} p_i log p_i. Table 3 indicates that all methods are essentially equivalent in terms of accuracy and MNLL. Together with the classical in- and out-of-distribution entropies, we consider receiver operating characteristic (ROC) diagrams comparing the detection of out-of-distribution samples and false alarms when using the entropy as test statistic. The results, reported in Figure 1, clearly show that: (1) collecting multiple samples improves the uncertainty quantification capabilities; (2) I-SGD is competitive (although not the best scheme) and, importantly, outperforms the closest approach to ours, i.e., VSGD. The experimental results show that I-SGD improves the quality of the BASELINE model with respect to all metrics. To test whether the improvements are due merely to "additional training" or are intrinsically due to the Bayesian averaging properties, we consider alternative deterministic baselines (details in Appendix C). For this set of experiments the best performing is BASELINE R. As can be appreciated by comparing Table 3 and Figure 1, while it is possible to improve the classical metrics, I-SGD (and other methods) still outperforms the baselines by a large margin in terms of detection of out-of-distribution samples.
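The entropy test statistic used for out-of-distribution detection can be sketched as follows (toy prediction vectors, our own illustration): diffuse OOD predictions should score a higher entropy than confident in-distribution ones.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy H(p) = -sum_i p_i log p_i of categorical predictions
    (rows of p sum to 1); eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

rng = np.random.default_rng(6)
# Toy stand-ins: confident in-distribution predictions vs diffuse OOD ones.
p_in = np.full((100, 10), 0.01)
p_in[np.arange(100), rng.integers(0, 10, 100)] = 0.91   # one dominant class
p_ood = np.full((100, 10), 0.1)                         # uniform over 10 classes
h_in, h_ood = entropy(p_in).mean(), entropy(p_ood).mean()
```

Thresholding this statistic yields the detection/false-alarm trade-off summarized by the ROC diagrams in Figure 1.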
We now move on to a classical image classification problem with deep convolutional networks, whereby we use the CIFAR10 dataset and the RESNET-18 network architecture. For this set of experiments, we compare I-SGD, SGHMC, SWAG, and VSGD, using again the test accuracy and MNLL, which we report in Table 4. As usual, we compare the results against the baseline of the individual network resulting from the pre-training phase. Results are obtained by averaging over three independent seeds. Notice, as expanded in Appendix C, that for SWAG we consider two variants: the Bayesian-correct one (SWAG) and a second variant with better performance (SWAG wd). We stress again, as highlighted in Section 4.1, that goodness of approximation of the posterior and performance do not always correlate positively. In this case as well, we found I-SGD to be competitive with the other methods and superior to the baseline. Among the competitors, we found I-SGD to be the easiest to tune, given its fixed learning rate informed by theoretical considerations; we believe this is an important aspect for a wide adoption of our proposal by practitioners.

Conclusions
SG methods have allowed Bayesian posterior sampling algorithms, such as MCMC, to regain relevance in an age when datasets have reached extremely large sizes. However, despite mathematical elegance and promising results, current approaches from the literature are restricted to simple models. Indeed, the sampling properties of these algorithms are determined by simplifying assumptions on the loss landscape, which do not hold for the complex models that are popular today, such as deep models. Meanwhile, SG-MCMC algorithms require vanishing learning rates, which force practitioners to develop creative annealing schedules that are often model specific and difficult to justify.
We have attempted to target these weaknesses by suggesting a simpler algorithm that relies on fewer parameters and less strict assumptions than the SG-MCMC literature. We used a unified mathematical notation to deepen our understanding of the role that the covariance of the stochastic gradient noise and the learning rate play in the behavior of SG-MCMC algorithms. We then presented a practical variant of the SGD algorithm, which uses a constant learning rate and additional injected noise to perform Bayesian posterior sampling. Our proposal is derived from an ideal method, for which samples are guaranteed to be generated from the true posterior. When the learning rate and noise terms are empirically estimated, with no user intervention, our method offers a very good approximation to the posterior, as demonstrated by our extensive experimental campaign.
We verified empirically the quality of our approach, and compared its performance to state-of-the-art SG-MCMC and alternative methods. Results, which span a variety of settings, indicated that our method is competitive to the alternatives from the state-of-the-art, while being much simpler to use.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Background and Related Material
Appendix A.1. The Minibatch Gradient Approximation
Starting from the gradient of the logarithm of the posterior density,

∇f(θ) = −∑_{i=1}^N ∇log p(U_i|θ) − ∇log p(θ),

it is possible to define its minibatch version by computing the gradient on a random subset I_{N_b}, with cardinality N_b, of all the indices. The minibatch gradient g(θ) is computed as

g(θ) = −(N/N_b) ∑_{i∈I_{N_b}} ∇log p(U_i|θ) − ∇log p(θ).

By simple calculations it is possible to show that the estimate is unbiased, i.e., E[g(θ)] = ∇f(θ).
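The unbiasedness of the minibatch gradient can be verified numerically on a toy Gaussian model (the likelihood N(U_i | θ, 1) and prior N(θ | 0, 1) are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
N, Nb, theta = 1000, 32, 0.7
U = rng.normal(2.0, 1.0, size=N)   # synthetic dataset

def full_grad(theta):
    # grad of f(theta) = -sum_i log N(U_i | theta, 1) - log N(theta | 0, 1)
    return (N * theta - U.sum()) + theta

def minibatch_grad(theta):
    # g(theta): likelihood term on a random subset, re-scaled by N / Nb
    idx = rng.choice(N, size=Nb, replace=False)
    return (N / Nb) * (Nb * theta - U[idx].sum()) + theta

g_bar = np.mean([minibatch_grad(theta) for _ in range(20_000)])
```

Averaged over many minibatches, `g_bar` matches `full_grad(theta)` up to Monte Carlo error, while any single minibatch gradient fluctuates around it with the covariance denoted B(θ) in the main text.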
As the learning rate goes to zero (η → dt), similarly to the previous case, we can interpret the above difference equation as a discretization of an SDE of the form (6). As before, we assume that the stationary distribution has the form ρ(z) ∝ exp(−φ(z)). The corresponding FPE is

∇ [−s(z) ρ(z) + ∇ (D(z) ρ(z))] = 0.

Notice that, since ∇ D(z) = 0, the FPE reduces to the condition D(z)∇φ(z) = −s(z), which is verified if ∇P(θ) = 0 and A(θ) = ηP(θ)² Σ(θ).
as the quantity λ^(p) − b(θ) is necessary at every step. As the learning rate is derived as 2/λ^(p), we found that using second-order moments instead of variances, and in certain cases temperature scaling, kept the simulated trajectories more stable.
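A small numerical sketch of this design choice (variable names are ours): since E[g²] = Var[g] + E[g]², the second-order moment never underestimates the gradient-noise level, so the learning rate derived from it, 2/λ^(p), is more conservative and less prone to blowing up when the mean gradient is large relative to its spread:

```python
import numpy as np

rng = np.random.default_rng(2)

# A batch of per-example gradients for one parameter (toy numbers).
per_example_grads = rng.normal(0.3, 1.0, size=(64,))

second_moment = np.mean(per_example_grads ** 2)   # E[g^2]
variance = np.var(per_example_grads)              # E[g^2] - E[g]^2

# E[g^2] = Var[g] + E[g]^2 >= Var[g]: the second moment upper-bounds
# the variance, yielding a smaller (more stable) derived learning rate.
print(second_moment >= variance)  # True
```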

Appendix C. Methodology
We hereafter present additional implementation details.

Appendix C.1. Regression Tasks, with Simple Models
For this set of experiments, the BASELINE is obtained by running the ADAM optimizer for 20,000 steps with learning rate 0.01 and default parameters. At test time we use 100 samples to estimate the predictive posterior distribution, using Equation (3), for the sampling methods (I-SGD, SGLD, SGHMC, VSGD), with a keep-every value equal to 1000. The I-SGD and VSGD sampling methods are started from the BASELINE. For I-SGD we selected temperature 0.01, while for SGHMC and SGLD we performed experiments for temperatures 1 and 0.01. We modified the implementation of VSGD, as the original implementation produced unstable learning rates (as noticed also in [9]). A simple and effective solution, which we kept throughout the experimental campaigns, is to divide the learning rate by the number of parameters (thus performing variational inference on a tempered version of the posterior). For SGLD the learning rate decay is the one suggested in [2], with initial and final learning rates equal to 10^−6 and 10^−8, respectively. For MCD we collected 1000 samples with a standard dropout rate of 0.5. All our experiments use 10 splits. The batch size is 64 for all methods.
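The predictive-posterior estimate used at test time can be sketched as a simple Monte Carlo average over the collected posterior samples (shapes and values below are illustrative, not the actual experimental outputs):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-sample predictions: S posterior samples, N test points,
# preds[j, i] = yhat_{theta_j}(x_i).
S, N = 100, 50
preds = rng.normal(0.0, 0.1, size=(S, N)) + np.linspace(-1, 1, N)

# Monte Carlo estimate of the predictive posterior: mean prediction and
# empirical variance across the S samples, per test point.
pred_mean = preds.mean(axis=0)
pred_var = preds.var(axis=0)

print(pred_mean.shape, pred_var.shape)  # (50,) (50,)
```

The per-point empirical variance is what enters the Gaussian predictive likelihood used for the regression MNLL metric.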

Appendix C.2. Classification Task, CONVNET
For the LENET-5 on MNIST experiment, we also consider the SWAG algorithm. At test time we use 30 samples for all methods. Baselines are again trained using the ADAM optimizer for 20,000 steps with learning rate 0.01 and default parameters. For I-SGD and SGHMC we collected samples for the two temperatures 1 and 0.01. SGLD has initial and final learning rates of 10^−3 and 10^−5. For all the sampling methods we collect 100 samples with a keep-every of 10,000 steps. SWAG results are obtained by collecting the statistics over 300 epochs using the ADAM optimizer and decreasing the learning rate every epoch in accordance with the schedule of the original paper [9]. DROP results are obtained by training the networks with SGD, with learning rate 0.005 and momentum 0.5. The number of collected samples for this method is 1000. The batch size for all methods is 128.
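The thinned sample-collection procedure shared by the sampling methods can be sketched as follows. The step function is a stand-in for one noisy sampler step, and we use a smaller keep-every value than the 10,000 used in the actual experiments so the sketch runs quickly:

```python
import numpy as np

rng = np.random.default_rng(4)

def sgd_step(theta):
    # Stand-in for one noisy SGD step of the sampler (hypothetical dynamics).
    return theta - 0.01 * theta + 0.1 * rng.normal(size=theta.shape)

# Thin the chain: keep one sample every `keep_every` steps
# (illustrative values; the experiments use keep_every = 10,000).
keep_every, n_samples = 100, 100
theta = np.zeros(3)
posterior_samples = []
step = 0
while len(posterior_samples) < n_samples:
    theta = sgd_step(theta)
    step += 1
    if step % keep_every == 0:
        posterior_samples.append(theta.copy())

print(len(posterior_samples))  # 100
```

Thinning reduces the autocorrelation between the stored samples, so the retained set behaves more like independent draws from the posterior.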
As explained in the main text, we performed an ablation study on the considered baselines. In Table A1 we report the results for the additional variants obtained by early stopping (10,000 iterations instead of 20,000), BASELINE S, to ablate overfitting, and by training for 30,000 iterations, BASELINE L. Finally, we include the best performing variant, BASELINE R, obtained starting from BASELINE, reducing the learning rate by a factor of 10, and training for 10,000 more iterations.

Appendix C.3. Classification Task, RESNET
We here report details for the RESNET-18 on CIFAR10 experiments. The BASELINE is obtained with the ADAM optimizer with learning rate 0.01, decreased by a factor of 10 every 50 epochs, for a total of 200 epochs, and weight decay of 0.05. For this set of experiments no temperature scaling was required. We could not find good hyperparameters for the SGLD scheme. For I-SGD, SGHMC and VSGD the keep-every value is 10,000 and the number of collected samples is 30. For SWAG we used the default parameters described in [9]. Notice that for SWAG we performed the following ablation study: we trained the networks considering the joint log-likelihood as loss function, with and without the weight decay suggested in the original work [9]. From a purely Bayesian perspective no weight decay should be used, since this information is implicit in the prior; however, we found that without the extra decay SWAG was not able to obtain competitive results. As underlined in Section 4.1, a better posterior approximation does not necessarily translate into better empirical results.
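The step schedule used for the RESNET-18 baseline corresponds to a simple function of the epoch index (a sketch, assuming the factor-of-10 decay is applied every 50 epochs as stated; the function name is ours):

```python
def resnet_lr(epoch, base_lr=0.01, drop=0.1, every=50):
    """Step schedule: the learning rate starts at base_lr and is
    decreased by a factor of 10 every 50 epochs."""
    return base_lr * drop ** (epoch // every)

# Learning rate at a few representative epochs over the 200-epoch run.
print([resnet_lr(e) for e in (0, 49, 50, 100, 150, 199)])
```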

Appendix C.4. Definition of the Metrics
For regression datasets, we consider RMSE and MNLL. Consider a single datapoint U_i = (x_i, y_i), with x_i the input of the model and y_i the true corresponding output. The output of the model, for a single sample of parameters θ_j, is ŷ_{θ_j}(x_i), and the prediction averaged over the N_s samples is ȳ(x_i) = (1/N_s) ∑_j ŷ_{θ_j}(x_i). RMSE is defined as

RMSE = sqrt( (1/N_test) ∑_i (y_i − ȳ(x_i))² ),

and MNLL as

MNLL = −(1/N_test) ∑_i log N(y_i; ȳ(x_i), σ_i²),

where σ_i² is the empirical variance. For classification datasets, we consider ACC, MNLL and entropy. Consider a single datapoint U_i = (x_i, y_i), with x_i the input of the model and y_i the true corresponding label. The output of the model, for a single sample of parameters θ_j, is the N_cl-dimensional probability vector p_{θ_j}(x_i). The averaged probability vector for a single test point is p̄(x_i) = (1/N_s) ∑_j p_{θ_j}(x_i). MNLL is then computed as −(1/N_test) ∑_i log p̄_{y_i}(x_i). Entropy, as stated in the main text, is instead computed according to (1/N_test) ∑_i H(p̄(x_i)), with H(p̄) = −∑_c p̄_c log p̄_c.
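Assuming the standard definitions above, the metrics can be computed from the averaged predictions as follows (function names are ours):

```python
import numpy as np

def rmse(y, pred_mean):
    # Root mean squared error over the test set.
    return np.sqrt(np.mean((y - pred_mean) ** 2))

def regression_mnll(y, pred_mean, pred_var):
    # Mean Gaussian negative log-likelihood with empirical variance sigma_i^2.
    return np.mean(0.5 * np.log(2 * np.pi * pred_var)
                   + 0.5 * (y - pred_mean) ** 2 / pred_var)

def classification_mnll(y, pbar):
    # pbar[i] is the averaged probability vector for test point i;
    # pick out the probability assigned to the true label y_i.
    return -np.mean(np.log(pbar[np.arange(len(y)), y]))

def mean_entropy(pbar):
    # Average entropy of the predictive distributions.
    return -np.mean(np.sum(pbar * np.log(pbar), axis=1))

# Tiny sanity check with two test points and three classes.
y = np.array([0, 2])
pbar = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]])
print(rmse(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 0.0
```

Note that `mean_entropy` assumes strictly positive probabilities; a small epsilon inside the logarithm would be needed for hard (one-hot) predictions.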