Stochastic gradient (sg)-based algorithms for Markov chain Monte Carlo sampling (sgmcmc) tackle large-scale Bayesian modeling problems by operating on mini-batches and injecting noise on sg steps. The sampling properties of these algorithms are determined by user choices, such as the covariance of the injected noise and the learning rate, and by problem-specific factors, such as assumptions on the loss landscape and the covariance of sg noise. However, current sgmcmc algorithms applied to popular complex models such as Deep Nets cannot simultaneously satisfy the assumptions on loss landscapes and on the behavior of the covariance of the sg noise while operating with the practical requirement of non-vanishing learning rates. In this work we propose a novel practical method, which makes the sg noise isotropic using a fixed learning rate that we determine analytically. Extensive experimental validations indicate that our proposal is competitive with the state of the art on sgmcmc.
Stochastic gradient (sg) methods have been extensively studied as a means for mcmc-based Bayesian posterior sampling algorithms to scale to large data regimes. Variants of sg-mcmc algorithms have been studied through the lens of first [1,2,3] or second-order [4,5] Langevin Dynamics, which are mathematically convenient continuous-time processes that correspond to discrete-time gradient methods without and with momentum, respectively. The common traits underlying many methods from the literature can be summarized as follows: they address large data requirements using sg and mini-batching, they inject Gaussian noise throughout the algorithm execution, and they avoid the expensive Metropolis-Hastings accept/reject tests that use the whole dataset [1,2,4].
Despite mathematical elegance and some promising results restricted to simple models, current approaches fall short in dealing with the complexity of the loss landscape typical of popular modern machine learning models, e.g., neural networks [6,7], for which stochastic optimization poses some serious challenges [8,9].
In general, sg-mcmc algorithms inject random noise into sg descent algorithms: the covariance of such noise and the learning rate (or step size, in the stochastic differential equation simulation community) are tightly related to the assumptions on the loss landscape, which, together with the sg noise, determine the sampling properties of these methods. However, current sg-mcmc algorithms applied to popular complex models such as Deep Nets cannot simultaneously satisfy the assumptions on posterior distribution geometry and on the behavior of the covariance of the sg noise while operating with the practical requirement of non-vanishing learning rates. In this paper, in accordance with most of the Neural Network related literature, we refer to the posterior distribution geometry as the loss landscape. Some recent work, instead, argues for fixed step sizes, but settles for variational approximations of quadratic losses. Although we are not the first to highlight these issues, including the lack of a unified notation, we believe that studying the role of noise in sg-mcmc algorithms has not received enough attention, and a deeper understanding is truly desirable, as it can clarify how various methods compare. Most importantly, this endeavor can suggest novel and more practical algorithms relying on fewer parameters and less restrictive assumptions.
In this work we choose a mathematical notation that emphasizes the role of noise covariances and the learning rate on the behavior of sg-mcmc algorithms (Section 2). As a result, the equivalence between learning rate annealing and an extremely large injected noise covariance becomes apparent, and this allows us to propose a novel practical sg-mcmc algorithm (Section 3). We derive our proposal by first analyzing the case where we inject the smallest complementary noise such that its combined effect with the sg noise results in an isotropic noise. Thanks to this isotropic property of the noise, it is possible to deal with the intricate loss surfaces typical of deep models, and to produce samples from the true posterior without learning rate annealing. This, however, comes at the expense of cubic-complexity matrix operations. We address this issue through a practical variant of our scheme, which employs well-known approximations to the sg noise covariance. The result is an algorithm that produces approximate posterior samples with a fixed, theoretically derived, learning rate. Please note that in the generic Bayesian deep learning setting, none of the existing implementations of sg-mcmc methods converge to the true posterior without learning rate annealing. In contrast, our method automatically determines an appropriate learning rate through a simple estimation procedure. Furthermore, our approach can be readily applied to pre-trained models: after a “warmup” phase to compute sg noise estimates, it can efficiently perform Bayesian posterior sampling.
We evaluate sg-mcmc algorithms (Section 4) through an extensive experimental campaign, where we compare our approach to several alternatives, including Monte Carlo Dropout (mcd) and Stochastic Weighted Averaging Gaussians (swag), which have been successfully applied to the Bayesian deep learning setting. Our results indicate that our approach offers performance that is competitive with the state of the art, according to metrics that aim at assessing predictive accuracy and uncertainty.
2. Preliminaries and Related Work
Consider a dataset of m-dimensional observations. Given a prior for a d-dimensional set of parameters and a likelihood model, the posterior is obtained by means of Bayes’ theorem as follows:
where the denominator is also known as the model evidence, defined as the integral of the joint likelihood over the parameters. Except when the prior and the likelihood function are conjugate, Equation (1) is analytically intractable. However, the joint likelihood term in the numerator is typically not hard to compute; this is a key element of many mcmc algorithms, since the normalization constant does not affect the shape of the distribution in any way other than scaling. The posterior distribution is necessary to obtain predictive distributions for new test observations, as:
We focus in particular on Monte Carlo methods to obtain an estimate of this predictive distribution, by averaging over samples obtained from the posterior over , i.e.,
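The Monte Carlo averaging above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the Bernoulli likelihood, the sigmoid link, and the scalar "posterior samples" are all toy assumptions introduced here to make the averaging concrete.

```python
import numpy as np

def mc_predictive(likelihood, x_star, posterior_samples):
    """Approximate p(y*|x*, D) by (1/S) * sum_s p(y*|x*, theta_s)."""
    return np.mean([likelihood(x_star, theta) for theta in posterior_samples],
                   axis=0)

def bernoulli_probs(x, theta):
    """Toy likelihood: class probabilities of a logistic model sigmoid(theta*x)."""
    p = 1.0 / (1.0 + np.exp(-theta * x))
    return np.array([1.0 - p, p])  # probabilities of y=0 and y=1

# Stand-in posterior samples of a scalar parameter (hypothetical values).
samples = [0.5, 1.0, 1.5]
pred = mc_predictive(bernoulli_probs, 2.0, samples)  # averaged predictive
```

Averaging the per-sample probability vectors (rather than the parameters themselves) is what yields a proper predictive distribution.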
We develop our approach by working with an unnormalized version of the logarithm of the posterior density, expressing the negative logarithm of the joint distribution of the dataset and parameters as:
For computational efficiency, we use a minibatch stochastic gradient, which guarantees that the estimated gradient is an unbiased estimate of the true gradient, and we assume that the randomness due to the minibatch introduces a Gaussian noise:
where the matrix denotes the sg noise covariance, which depends on the parametric model, the data distribution and the minibatch size.
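The Gaussian model of sg noise can be checked empirically. The sketch below, under the toy assumption of a quadratic per-example loss, draws many minibatch gradients around the full-data gradient and estimates the empirical sg noise covariance; names like `C_hat` are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # toy dataset, m = 3
theta = np.array([1.0, -2.0, 0.5])

def grad(batch, theta):
    # Gradient of the average per-example loss 0.5 * ||x - theta||^2.
    return np.mean(theta - batch, axis=0)

full_grad = grad(X, theta)
mb_grads = np.stack([grad(X[rng.choice(1000, 32, replace=False)], theta)
                     for _ in range(500)])
noise = mb_grads - full_grad            # deviations of minibatch gradients
C_hat = np.cov(noise.T)                 # empirical sg noise covariance
```

The near-zero mean of `noise` reflects the unbiasedness of the minibatch estimator, while `C_hat` plays the role of the sg noise covariance, which shrinks as the minibatch size grows.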
A survey of algorithms to sample from the posterior using sg methods can be found in Ma et al. In Appendix A we report some well-known facts which are relevant for the derivations in our paper. As shown in the literature [10,14], there are structural similarities between sg-mcmc algorithms and stochastic optimization methods, and both can be used to draw samples from posterior distributions. Notice that the original goal of stochastic optimization is to find the minimum of a given cost function, and the stochasticity is introduced by sub-sampling the dataset to scale. sg-mcmc methods instead aim at sampling from a given distribution, i.e., collecting multiple values, and the stochasticity is necessary to explore the whole landscape. In what follows, we use a unified notation to compare many existing algorithms in light of the role played by their noise components.
It is well-known [15,16,17] that stochastic gradient descent (sgd), with and without momentum, can be studied through the following stochastic differential equation (sde), when the learning rate is small enough (in this work we do not consider discretization errors; the reader can refer to classical sde texts to investigate the topic in greater depth):
where the first term is usually referred to as the driving force and the second as the diffusion matrix. We use a generic form of the sde, with a generic state variable in place of the parameters, which accommodates sgd variants with and without momentum. By doing this, we will be able to easily cast the expression for the two cases in what follows (the divergence operator applied to a matrix produces a row vector whose elements are the divergences of the columns; our notation is aligned with Chen et al.).
A distribution is said to be a stationary distribution for the sde of the form (6), if and only if it satisfies the following Fokker-Planck equation (fpe):
Please note that in general, the stationary distribution does not coincide with the desired posterior distribution, as shown by Chaudhari and Soatto. Additionally, given an initial condition, the distribution of the process converges to the stationary one only in the limit of infinite time. In practice, we observe the sde dynamics for a finite amount of time: then, we declare that the process is approximately in the stationary regime once the potential has reached low and stable values.
Next, we briefly overview known approaches to Bayesian posterior sampling, and interpret them as variants of an sgd process, using the fpe formalism.
2.1. Gradient Methods without Momentum
The generalized update rule of sgd, described as a discrete-time stochastic process, is:
where the first factor is a user-defined preconditioning matrix, and the last term is a noise term, distributed as a Gaussian with a user-defined covariance matrix. Then, the corresponding continuous-time sde is:
In this paper we use the symbol n to indicate discrete time, while t for continuous time. We denote by the covariance of the injected noise and the composite noise covariance. Please note that combines the sg and the injected noise. Notice that our choice of notation is different from the standard one, in which the starting discrete-time process is in the form . By directly grouping the injected noise with the stochastic gradient we can better appreciate the relationship between annealing the learning rate and extremely large injected noise. Moreover, as will be explained in Section 3, this allows derivation of a new sampling algorithm.
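A single step of the generalized update rule above can be sketched as follows. The quadratic loss, the identity preconditioner `P`, and the injected covariance `Sigma` are illustrative assumptions; the point is only the shape of the update `theta <- theta - eta * P @ (grad + noise)`.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sgd_step(theta, grad_fn, eta, P, Sigma):
    """One discrete-time step of preconditioned sgd with injected Gaussian noise."""
    eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma)
    return theta - eta * P @ (grad_fn(theta) + eps)

grad_fn = lambda th: th          # gradient of f(theta) = 0.5 * ||theta||^2
theta = np.array([5.0, -3.0])
P = np.eye(2)                    # no preconditioning
Sigma = 0.1 * np.eye(2)          # injected noise covariance (toy choice)
for _ in range(200):
    theta = noisy_sgd_step(theta, grad_fn, 0.1, P, Sigma)
```

After enough iterations the iterates hover around the minimum, with a spread controlled jointly by the learning rate and the noise covariance, which is exactly the interplay the notation above makes explicit.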
We define the stationary distribution of the sde in Equation (9) as . Please note that when , the potential differs from the desired posterior . The following theorem, which is an adaptation of known results in light of our formalism, states the conditions for which the noisy sgd converges to the true posterior distribution (proof in Appendix A).
Consider dynamics of the form (9) and define the stationary distribution . If
Stochastic Gradient Langevin Dynamics (sgld) is a simple approach to satisfy Equation (10); it uses no preconditioning and sets the injected noise covariance so that the conditions hold in the limit of a vanishing learning rate. Although sgld succeeds in (asymptotically) generating samples from the true posterior, its mixing rate is unnecessarily slow, due to the extremely small learning rate.
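For concreteness, here is a hedged sketch of sgld on a toy one-dimensional standard-normal target, using the classic formulation `theta <- theta - (eta/2) * grad + N(0, eta)`; the target, step size, and chain length are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def sgld_step(theta, grad, eta):
    """One sgld update: half-gradient step plus N(0, eta) injected noise."""
    return theta - 0.5 * eta * grad + np.sqrt(eta) * rng.normal()

theta, eta = 0.0, 0.05
chain = []
for _ in range(20000):
    theta = sgld_step(theta, theta, eta)  # grad of 0.5*theta^2 is theta
    chain.append(theta)
samples = np.array(chain[5000:])          # discard burn-in
```

With a small fixed step the chain approximately targets N(0, 1); making `eta` vanish removes the bias but slows mixing, which is the trade-off discussed above.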
An extension to sgld is Stochastic Gradient Fisher Scoring (sgfs), which can be tuned to switch between sampling from an approximate posterior, using a non-vanishing learning rate, and the true posterior, by annealing the learning rate to zero. sgfs uses preconditioning based on the sg noise covariance. In practice, however, this matrix is ill-conditioned for complex models such as deep neural networks: many of its eigenvalues are almost zero, and computing its inverse is problematic. An in-depth analysis of sgfs reveals that conditions (10) would be met with a non-vanishing learning rate only if a certain condition held at convergence, which would be trivially true if the sg noise covariance was constant. However, recent work [6,7] suggests that this condition is difficult to justify for deep neural networks.
The Stochastic Gradient Riemannian Langevin Dynamics (sgrld) algorithm extends sgfs to the setting of a space-varying preconditioner. The process dynamics are adjusted by adding a correction term. However, this term does not have a clear estimation procedure, restricting sgrld to cases where it can be computed analytically.
Another line of work investigates constant-rate sgd (with no injected noise), and determines analytically the learning rate and preconditioning that minimize the Kullback–Leibler (kl) divergence between an approximation and the true posterior. Moreover, it shows that the preconditioning used in sgfs is optimal, in the sense that it converges to the true posterior, when the sg noise covariance is constant and the true posterior has a quadratic form.
In summary, to claim convergence to the true posterior distribution, existing approaches require either vanishing learning rates or assumptions on the sg noise covariance that are difficult to verify in practice, especially when considering deep models. We instead propose a novel practical method that induces isotropic sg noise and thus satisfies Theorem 1. We determine analytically a fixed learning rate, and we require weaker assumptions on the loss shape.
2.2. Gradient Methods with Momentum
Momentum-corrected methods emerge as a natural extension to sgd approaches. The general set of update equations for (discrete-time) momentum-based algorithms is:
where the first factor is a preconditioning matrix, M is the mass matrix and the remaining one is the friction matrix, as shown by [4,19]. As with the first-order counterpart, the noise term is Gaussian with a user-defined covariance. Then, the sde describing the continuous-time system dynamics is:
where we assume the corresponding matrix to be symmetric. The theorem hereafter describes the conditions under which noisy sgd with momentum converges to the true posterior distribution (Appendix A).
Consider dynamics of the form (11) and define the stationary distribution for as . If
In the naive case, where , Equation (12) is not satisfied and the stationary distribution does not correspond to the true posterior . To generate samples from the true posterior it is sufficient to set (as in Equation (9) in ).
Stochastic Gradient Hamiltonian Monte Carlo (sghmc) suggests that estimating the sg noise covariance can be costly. Hence, the injected noise is chosen with a user-defined covariance that dominates the sg noise. When the learning rate is small, the sg noise contribution becomes negligible, and it is then trivial to check that conditions (12) hold without the need for explicitly estimating the sg noise covariance. A further practical reason for this choice is that the full matrix operation has cubic complexity, whereas a diagonal choice reduces it to linear complexity. The small learning rate, however, severely slows down the sampling process.
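A minimal sketch of one sghmc step on a toy one-dimensional standard-normal target follows, in the Chen et al. style with unit mass matrix; setting the sg-noise estimate to zero in the injected noise (so its variance is `2*C*eta`) mirrors the approximation discussed above, and the friction `C`, step size, and chain length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sghmc_step(theta, v, grad, eta, C):
    """One sghmc update: friction C, unit mass, sg-noise estimate set to 0."""
    v = v - eta * grad - eta * C * v + np.sqrt(2.0 * C * eta) * rng.normal()
    return theta + eta * v, v

theta, v, eta, C = 0.0, 0.0, 0.05, 1.0
chain = []
for _ in range(40000):
    theta, v = sghmc_step(theta, v, theta, eta, C)  # grad of 0.5*theta^2
    chain.append(theta)
samples = np.array(chain[10000:])                   # discard burn-in
```

The momentum variable lets the chain take longer coherent moves than sgld, at the cost of extra hyperparameters (friction, mass) to tune.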
Stochastic Gradient Riemannian Hamiltonian Monte Carlo (sgrhmc) is an extension to sghmc which considers a generic, space-varying preconditioning matrix derived from information geometric arguments. sgrhmc suggests preconditioning with the inverse of the Fisher Information matrix. To meet the requirements of Theorem 2, it includes a correction term, and sets the injected noise and the friction matrix accordingly. With all these choices, Theorem 2 is satisfied. Although appealing, the main drawbacks of this method are the need for an analytical expression of the correction term, and the assumption that the Fisher Information matrix is known.
From a practical standpoint, momentum-based methods suffer from the requirement to tune many hyperparameters, including the learning rate, and the parameters that govern the simulation of a second-order Langevin dynamics.
The method we propose in this work can be applied to momentum-based algorithms; in this case, it could be viewed as an extension of the work in , albeit addressing the complex loss landscapes typical of deep neural networks. However, we leave this avenue of research for future work.
3. Sampling by Layer-Wise Isotropization
We present a simple and practical approach to inject noise into sgd iterates to perform Bayesian posterior sampling. Our goal is to sample from the true posterior distribution (or approximations thereof) using a constant learning rate, and to rely on more lenient assumptions about the shape of the loss landscape that characterizes deep models, compared to previous works. In general, in modern machine learning applications, we deal with multi-layer neural networks. We exploit the natural subdivision of the parameters of these architectures into different layers to propose a practical sampling scheme.
Careful inspection of Theorem 1 reveals that the matrices are instrumental in determining the convergence properties of sg methods to the true posterior. Therefore, we consider the constructive approach of designing to obtain a sampling scheme that meets our goals; we set to be a constant, diagonal matrix which we constrain to be layer-wise uniform:
By properly selecting the set of parameters we can achieve the simultaneous result of non-vanishing learning rate and well-conditioned preconditioning matrix. This implies a layer-wise learning rate for the p-th layer, without further preconditioning.
We can now prove (see Appendix B), as a corollary to Theorem 1, that our design choices can guarantee convergence to the true posterior distribution.
(Theorem 1) Consider dynamics of the form (9) and define the stationary distribution . If as in (13), and , then .
If the aforementioned conditions are satisfied, it is in fact simple to show that the relevant matrices satisfy the conditions in Equation (10). The covariance matrix of the composite noise is then isotropic within the layers of (deep) models. From a practical point of view, we choose the injected noise covariance to be, among all valid matrices, the smallest (the one with the smallest layer-wise constants). Indeed, larger constants induce a smaller learning rate, thus unnecessarily reducing sampling speed.
Now, let us consider an ideal case, in which we assume the sg noise covariance to be known in advance. The procedure described in Algorithm 1 illustrates a naive sg method that uses the injected noise covariance to sample from the true posterior.
Algorithm 1 Idealized posterior sampling
This deceptively simple procedure generates samples from the true posterior, with a non-vanishing learning rate, as shown earlier. However, it cannot be used in practice, as the sg noise covariance is unknown. Furthermore, the algorithm requires computationally expensive matrix operations, whose costs grow rapidly with the number of parameters.
Next, we describe a practical variant of our approach, where we use approximations at the expense of generating samples from the true posterior distribution. We note that related work suggests exploring a related preconditioning, but does not develop this path. Moreover, the proposed method shares similarities with a previously proposed scheme, although the analysis we perform here is different.
3.1. A Practical Method: Isotropic SGD
To render the idealized sampling method practical, it is necessary to consider some additional assumptions. As we explain at the end of this section, the assumptions that follow are less strict than other approaches in the literature.
The sg noise covariance can be approximated with a diagonal matrix.
The signal-to-noise ratio (SNR) of a gradient is small enough such that, in the stationary regime, the second-order moment of the gradient is a good estimate of the true variance. Hence, combining with Assumption 1, the sg noise covariance can be estimated from the elementwise second moment of the stochastic gradient, where ⊙ indicates the elementwise product.
The sum of the variances of the noise components, layer by layer, can be assumed to be constant in the stationary regime, where each sum runs over the set of indices of parameters belonging to the corresponding layer.
The diagonal covariance assumption (i.e., Assumption 1) is common in other works, such as [2,11]. The small signal-to-noise ratio as stated in Assumption 2 is in line with recent studies, such as [11,23]. Assumption 3 is similar to those that appeared in earlier work. Please note that Assumptions 2 and 3 must hold in the stationary regime, when the process reaches the bottom valley of the loss landscape. The sg noise covariance has been associated in the literature with the empirical Fisher information matrix [2,25]. As we do not consider this matrix for preconditioning purposes, we do not further investigate this connection.
Given our assumptions and our design choices, it is then possible to show (see Appendix B) that the optimal (i.e., the smallest possible) noise scale satisfying Corollary 1 can be obtained in closed form. Please note that we do not assume the sg noise to be known; instead, we use a simple procedure to estimate its components from the portion of the stochastic gradient corresponding to the p-th layer. Then, the composite noise matrix is a layer-wise isotropic covariance matrix, which inspires the name of our proposed method: Isotropic SGD (i-sgd).
The practical implementation of i-sgd is shown in Algorithm 2. The advantage of i-sgd is that it can either be used to obtain posterior samples starting from a pre-trained model, or to do so by training a model from scratch. In either case, the noise estimates are used to compute the injected noise covariance, as discussed above. An important consideration is that once all the layer-wise noise scales have been estimated, the learning rate, layer by layer, is determined automatically. A simpler approach is to use a unique learning rate for all layers, where the equivalent noise scale is the sum of the layer-wise ones.
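The warmup estimation step can be sketched as follows. This is a hedged illustration: the per-layer scale is taken to be the mean squared gradient over the layer (in the spirit of Assumptions 1 and 2), and the constant `2 / (N * lambda_p)` mapping it to a layer-wise learning rate is an illustrative placeholder, not the paper's derived expression. The stand-in gradient arrays are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10000  # dataset size (assumed)

# Stand-in stochastic gradients collected during a warmup phase:
# 100 minibatch gradients per layer, one row per step.
warmup_grads = {"layer1": rng.normal(0.0, 0.3, size=(100, 50)),
                "layer2": rng.normal(0.0, 0.1, size=(100, 10))}

lambdas, lrs = {}, {}
for name, g in warmup_grads.items():
    lambdas[name] = float(np.mean(g ** 2))      # layer-wise second moment
    lrs[name] = 2.0 / (N * lambdas[name])       # placeholder rate formula
```

The qualitative behavior is the relevant part: noisier layers get larger scale estimates and therefore smaller automatically determined learning rates, with no user-tuned step size.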
Algorithm 2 i-sgd: practical posterior sampling
for n = 1 to Nl do
A Remark on Convergence
In summary, i-sgd is a practical method to perform approximate Bayesian posterior sampling, backed by solid theoretical foundations. Our assumptions, which are at the origin of the approximate nature of i-sgd, are less strict than those used in the literature on sg-mcmc methods. More precisely, the theory behind i-sgd can explain convergence to the true posterior with a non-vanishing learning rate in the particular case when Assumption 1 holds and the estimation of the sg noise covariance is perfect. Even with perfect estimates, this is not the case for sgfs, which requires a correction term. Additionally, both sgrld and sgrhmc are more demanding than i-sgd because they require computing a term for which an estimation procedure is elusive. Finally, the method by Springenberg et al. needs a constant, diagonal sg noise covariance, a condition that does not necessarily hold for deep models.
3.2. Computational Cost
The computational cost of i-sgd is as follows. As in related work, we take the cost of computing a minibatch gradient as the baseline. Thanks to Assumptions 1 and 2, the computational cost for estimating the noise covariance scales linearly in the number of parameters, as does the cost of generating random samples with the desired covariance (without considering the cost of generating random numbers). The overall cost of our method is the sum of the above terms. Notice that the cost of estimating the noise covariance does not depend on the minibatch size. We stress that in many modern models, the real computational bottleneck is the backward propagation for the computation of the gradients. As all the sg-mcmc methods considered in this work require one gradient evaluation per step, the different methods have in practice the same complexity.
The space complexity of i-sgd is the same as that of sghmc, sgfs and variants: it scales linearly with the number of posterior samples.
4. Experimental Evaluation
The empirical analysis of our method, and its comparison to alternative approaches from the literature, is organized as follows. First, we validate i-sgd using the standard uci datasets and a shallow neural network. Then we move to the case of deeper models: we begin with a simple CNN on the mnist dataset, then move to the standard ResNet-18 deep network on the Cifar-10 dataset.
We compare i-sgd to other Bayesian sampling methods such as sghmc and sgld, and to alternative approaches to approximate Bayesian inference, including mcd, swag and vsgd. In general, our results indicate that: (1) i-sgd achieves similar or superior performance compared to competitors in terms of uncertainty quantification, even with simple datasets and models; (2) i-sgd is simple to tune, when compared to alternatives; (3) i-sgd is competitive when used for deep Bayesian modeling, even when compared to standard methods used in the literature. In particular, the proposed method shares some of the strengths of vsgd, such as automatically determined learning rates, and the simplicity of sgld. Appendix B includes additional implementation details on i-sgd. Appendix C presents detailed configurations of all the methods we compare, and additional experimental results.
4.1. A Disclaimer on Performance Characterization
It is important to stress a detail concerning the analysis of the experimental campaign. The discussion usually focuses on how well the various methods represent the true posterior distribution. Different methods can or cannot claim convergence to the true posterior according to certain assumptions and the nature of the hyperparameters. In the experimental validation of the results, however, we do not have access to the form of the true posterior, as this is exactly the problem we are trying to solve. The practical solution adopted is to compare the different methods in terms of proxy metrics evaluated on the test sets, such as accuracy and uncertainty metrics. Being better in terms of these performance metrics does not imply that the sampling method is better at approximating the posterior distribution, and outperforming competitors in terms of these metrics does not provide sufficient information about the intrinsic quality of the sampling scheme.
4.2. Regression Tasks, with Simple Models
We consider several regression tasks defined on the uci datasets. We use a simple neural network configuration with two fully connected layers and a relu activation function; the hidden layer includes 50 units. In this set of experiments, we use the following metrics: the root mean square error (rmse) to judge the model predictive performance, and the mean negative log-likelihood (mnll) as a proxy for uncertainty quantification. We note that the task of tuning our competitors was far from trivial. We used our own version of sghmc to ensure a proper understanding of the implementation internals, and we proceeded with a tuning process to find appropriate values for the numerous hyperparameters. In this set of experiments, we omit results for swag, which we keep for more involved scenarios.
Table 1 and Table 2 report a complete overview of our results, for a selection of uci datasets. For each method and each dataset, we also indicate how many of the 10 splits considered failed to converge. As explained in Appendix C, we implemented a temperature-scaled version of vsgd. A clear picture emerges from this first set of experiments: while the rmse performance is similar across methods, for the mnll, averaging over multiple samples clearly improves the uncertainty quantification capabilities. sghmc is in many cases better than the alternatives; however, considering the standard deviation of the results, it is difficult to claim clear superiority of one method over the others.
4.3. Classification Tasks, with Deeper Models
Next, we compare i-sgd against competitors on image classification tasks. First, we use the mnist dataset, and a simple LeNet-5 CNN. All methods are compared based on the test accuracy (acc), mnll and the expected calibration error (ece) [31]. Additionally, at test time, we carry out predictions on both mnist and not-mnist; the latter is a dataset equivalent to mnist, but it represents letters rather than numbers (http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html, accessed on 24 October 2021). This experimental setup is often used to check whether the entropy of the predictions on not-mnist is higher than the entropy of the predictions on mnist (the entropy of the output of a multi-class classifier, represented by a probability vector, is defined in the usual information-theoretic sense).
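The predictive entropy used as test statistic can be computed directly from a class-probability vector. A minimal sketch, with the small epsilon added for numerical stability (an implementation detail assumed here, not taken from the paper):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_c p_c * log(p_c) of a probability vector."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

# A confident (in-distribution-like) prediction versus a maximally
# uncertain one, for a 4-class classifier.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
```

Out-of-distribution inputs (here, not-mnist letters) are expected to yield predictions closer to `uniform`, hence higher entropy, which is what the ROC analysis below thresholds on.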
Table 3 indicates that all methods are essentially equivalent in terms of accuracy and mnll. We consider, together with the classical in- and out-of-distribution entropies, the receiver operating characteristic (roc) diagrams comparing detection of out-of-distribution samples against false alarms when using the entropy as test statistic. The results, reported in Figure 1, clearly show that: (1) collecting multiple samples improves the uncertainty quantification capabilities; (2) i-sgd is competitive (but not the best scheme) and, importantly, outperforms the closest approach to ours, i.e., vsgd. The experimental results show that i-sgd improves the quality of the baseline model with respect to all metrics. To test whether the improvements are due merely to “additional training” or are intrinsically due to the Bayesian averaging properties, we consider alternative deterministic baselines (details in Appendix C). For this set of experiments the best performing is baseline R. As can be appreciated by comparing Table 3 and Figure 1, while it is possible to improve the classical metrics, i-sgd (and other methods) still outperform the baselines by a large margin in terms of detection of out-of-distribution samples.
We now move on to a classical image classification problem with deep convolutional networks, whereby we use the cifar10 dataset and the ResNet-18 network architecture. For this set of experiments, we compare i-sgd, sghmc, swag, and vsgd, using again test accuracy and mnll, which we report in Table 4. As usual, we compare the results against the baseline of the individual network resulting from the pre-training phase. Results are obtained by averaging over three independent seeds. Notice, as expanded in Appendix C, that for swag we consider two variants: the Bayesian-correct one (swag) and a second variant that has better performance (swag wd). We stress again, as highlighted in Section 4.1, that goodness of approximation of the posterior and performance do not always correlate positively. In this case as well, we found i-sgd to be competitive with other methods and superior to the baseline. Among the competitors, we found i-sgd to be the easiest to tune, given the feature of a fixed learning rate informed by theoretical considerations; we believe that this is an important aspect to consider for wide adoption of our proposal by practitioners.
5. Conclusions
sg methods allowed Bayesian posterior sampling algorithms, such as mcmc, to regain relevance in an age when datasets have reached extremely large sizes. However, despite mathematical elegance and promising results, current approaches from the literature are restricted to simple models. Indeed, the sampling properties of these algorithms are determined by simplifying assumptions on the loss landscape, which do not hold for the kind of complex models that are popular these days, such as deep models. Meanwhile, sg-mcmc algorithms require vanishing learning rates, which force practitioners to develop creative annealing schedules that are often model specific and difficult to justify.
We have attempted to target these weaknesses by suggesting a simpler algorithm that relies on fewer parameters and less strict assumptions compared to the literature on sg-mcmc. We used a unified mathematical notation to deepen our understanding of the role of the covariance of the stochastic gradient noise and of the learning rate on the behavior of sg-mcmc algorithms. We then presented a practical variant of the sgd algorithm, which uses a constant learning rate and additional injected noise to perform Bayesian posterior sampling. Our proposal is derived from the ideal method, in which it is guaranteed that samples are generated from the true posterior. When the learning rate and noise terms are empirically estimated, with no user intervention, our method offers a very good approximation to the posterior, as demonstrated by the extensive experimental campaign.
We verified empirically the quality of our approach, and compared its performance to state-of-the-art sg-mcmc and alternative methods. Results, which span a variety of settings, indicated that our method is competitive to the alternatives from the state-of-the-art, while being much simpler to use.
Formal analysis, G.F., D.M., M.F. and P.M.; Methodology, G.F., D.M., M.F. and P.M.; Software, G.F., D.M., M.F. and P.M.; Writing—original draft, G.F., D.M., M.F. and P.M.; Writing—review & editing, G.F., D.M., M.F. and P.M. All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
MF gratefully acknowledges support from the AXA Research Fund and the Agence Nationale de la Recherche (grant ANR-18-CE46-0002 and ANR-19-P3IA-0002).
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Background and Related Material
Appendix A.1. The Minibatch Gradient Approximation
Starting from the gradient of the logarithm of the posterior density:

$$ g(\theta) = \nabla \log p(\theta \mid \mathcal{D}) = \nabla \log p(\theta) + \sum_{i=1}^{N} \nabla \log p(d_i \mid \theta), $$

it is possible to define its minibatch version by computing the gradient on a random subset $\mathcal{B}$, with cardinality $B$, of all the indices $\{1, \dots, N\}$. The minibatch gradient is computed as

$$ \hat{g}(\theta) = \nabla \log p(\theta) + \frac{N}{B} \sum_{i \in \mathcal{B}} \nabla \log p(d_i \mid \theta). $$

By simple calculations it is possible to show that the estimation is unbiased ($\mathbb{E}[\hat{g}(\theta)] = g(\theta)$). The estimation error covariance is defined to be $C(\theta) = \mathbb{E}\big[(\hat{g}(\theta) - g(\theta))(\hat{g}(\theta) - g(\theta))^{\top}\big]$.

If the minibatch size is large enough, invoking the central limit theorem, we can state that the minibatch gradient is normally distributed:

$$ \hat{g}(\theta) \sim \mathcal{N}\big(g(\theta),\, C(\theta)\big). $$
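These properties are easy to check numerically; the sketch below uses a toy one-dimensional Gaussian likelihood (all names and sizes are illustrative) and compares minibatch gradient draws against the full-data gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 10_000, 64                       # dataset size and minibatch size
data = rng.normal(loc=2.0, scale=1.0, size=N)
theta = 0.0                             # current parameter value

def full_gradient(theta):
    # gradient of the log-likelihood of N(theta, 1): sum_i (x_i - theta)
    return np.sum(data - theta)

def minibatch_gradient(theta):
    # rescaled gradient over a random subset of B of the N indices
    idx = rng.choice(N, size=B, replace=False)
    return (N / B) * np.sum(data[idx] - theta)

# the empirical mean of many draws approaches the full gradient (unbiasedness),
# while their spread around it is the minibatch noise with covariance C(theta)
draws = np.array([minibatch_gradient(theta) for _ in range(5_000)])
```

Averaging the draws recovers the full gradient to within Monte Carlo error, while their empirical variance provides a direct estimate of the noise covariance.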
Appendix A.2. Gradient Methods without Momentum
Appendix A.2.1. The sde from Discrete Time
We start from the generalized update rule of sgd:

$$ \theta_{t+1} = \theta_t + \eta\, P\, \hat{g}(\theta_t), $$

where $\eta$ is the learning rate and $P$ a constant preconditioning matrix. Since $\hat{g}(\theta) = g(\theta) + \big(\hat{g}(\theta) - g(\theta)\big)$, we can rewrite the above equation as:

$$ \theta_{t+1} = \theta_t + \eta\, P\, g(\theta_t) + \eta\, P\, n(\theta_t), $$

where $n(\theta_t) = \hat{g}(\theta_t) - g(\theta_t) \sim \mathcal{N}(0, C(\theta_t))$. If we separate the deterministic and random components we can equivalently write:

$$ \theta_{t+1} = \theta_t + \eta\, P\, g(\theta_t) + \sqrt{\eta}\, P \sqrt{\eta\, C(\theta_t)}\, z_t, $$

where $z_t \sim \mathcal{N}(0, I)$. When $\eta$ is small enough ($\eta \to 0$) we can interpret the above equation as the discrete-time simulation of the following sde:

$$ d\theta_t = P\, g(\theta_t)\, dt + P \sqrt{\eta\, C(\theta_t)}\, dW_t, $$

where $W_t$ is a $d$-dimensional Brownian motion.
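To illustrate the correspondence between the discrete recursion and its sde limit, the following toy simulation (our own construction: one-dimensional quadratic negative log-posterior, identity preconditioner, constant noise variance c) checks that the iterates reach the stationary variance of the limiting Ornstein–Uhlenbeck process.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, c = 0.01, 1.0        # learning rate and (constant, scalar) sg-noise variance
theta, samples = 0.0, []

# discrete-time simulation of  d(theta) = -theta dt + sqrt(eta * c) dW,
# i.e. log p(theta) = -theta^2 / 2, so the drift P g(theta) is -theta
for t in range(400_000):
    theta = theta - eta * theta + eta * np.sqrt(c) * rng.normal()
    if t > 50_000:        # discard burn-in
        samples.append(theta)

empirical_var = np.var(samples)
predicted_var = eta * c / 2   # stationary variance of the Ornstein-Uhlenbeck limit
```

The empirical variance matches the prediction up to Monte Carlo error, consistent with the small-learning-rate interpretation above.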
Appendix A.2.2. Proof of Theorem 1
The stationary distribution of the above sde, $\rho(\theta)$, satisfies the following fpe

$$ 0 = \nabla \cdot \left[ -P\, g(\theta)\, \rho(\theta) + \frac{1}{2}\, \nabla \cdot \big( \eta\, P\, C(\theta)\, P^{\top} \rho(\theta) \big) \right], $$

that we rewrite (the condition below makes the diffusion matrix constant) as

$$ 0 = \nabla \cdot \left[ \rho(\theta) \left( -P\, g(\theta) + \frac{1}{2}\, \eta\, P\, C(\theta)\, P^{\top}\, \nabla \log \rho(\theta) \right) \right]. $$

The above equation is verified with $\rho(\theta) \propto p(\theta \mid \mathcal{D})$, so that $\nabla \log \rho(\theta) = g(\theta)$, if

$$ \eta\, P\, C(\theta)\, P^{\top} = 2P, $$

that proves Theorem 1.
Appendix A.3. Gradient Methods with Momentum
Appendix A.3.1. The SDE from Discrete Time
The general set of update equations for (discrete-time) momentum-based algorithms is:

$$ \begin{cases} \theta_{t+1} = \theta_t + \eta\, M^{-1} r_t, \\ r_{t+1} = r_t + \eta\, \hat{g}(\theta_t) - \eta\, A\, M^{-1} r_t, \end{cases} $$

where $r$ is the momentum variable, $M$ a mass matrix and $A$ a friction matrix. Similarly to the case without momentum, we rewrite the second equation of the system as

$$ r_{t+1} = r_t + \eta\, g(\theta_t) - \eta\, A\, M^{-1} r_t + \sqrt{\eta}\, \sqrt{\eta\, C(\theta_t)}\, z_t, $$

where again $z_t \sim \mathcal{N}(0, I)$. If we define the supervariable $\zeta_t = [\theta_t^{\top}, r_t^{\top}]^{\top}$ we can rewrite the system as

$$ \zeta_{t+1} = \zeta_t - \eta\, Q\, \nabla H(\zeta_t) + \sqrt{\eta}\, \sqrt{\eta\, \tilde{C}(\zeta_t)}\, z_t, $$

where $H(\zeta) = -\log p(\theta \mid \mathcal{D}) + \frac{1}{2} r^{\top} M^{-1} r$, $Q = \begin{bmatrix} 0 & -I \\ I & A \end{bmatrix}$, and $\tilde{C}(\zeta) = \mathrm{blockdiag}\big(0, C(\theta)\big)$.

As the learning rate goes to zero ($\eta \to 0$), similarly to the previous case, we can interpret the above difference equation as a discretization of the following sde

$$ d\zeta_t = -Q\, \nabla H(\zeta_t)\, dt + \sqrt{\eta\, \tilde{C}(\zeta_t)}\, dW_t. $$
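A similar toy simulation (again our own construction) illustrates the momentum dynamics: for a one-dimensional quadratic energy with unit mass and scalar friction a, injecting per-step noise of variance 2aη, which matches the stationarity condition of Theorem 2 exactly, should yield a θ-marginal variance of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, a = 0.01, 1.0               # learning rate and (scalar) friction A
theta, r = 0.0, 0.0
thetas = []

# discrete momentum updates for the toy posterior p(theta) ∝ exp(-theta^2/2);
# the injected noise variance 2*a*eta realizes the condition eta*C = 2A
for t in range(500_000):
    theta = theta + eta * r
    r = r - eta * theta - eta * a * r + np.sqrt(2 * a * eta) * rng.normal()
    if t > 50_000:               # discard burn-in
        thetas.append(theta)

empirical_var = np.var(thetas)   # should approach Var(theta) = 1
```

The long-run variance of the position variable approaches the unit variance of the target Gaussian, up to discretization bias of order η.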
Appendix A.3.2. Proof of Theorem 2
As before we assume that the stationary distribution has the form $\rho(\zeta) \propto \exp(-H(\zeta))$. The corresponding fpe is

$$ 0 = \nabla \cdot \left[ Q\, \nabla H(\zeta)\, \rho(\zeta) + \frac{1}{2}\, \nabla \cdot \big( \eta\, \tilde{C}(\zeta)\, \rho(\zeta) \big) \right]. $$

Notice that since $Q = S + W$, with symmetric part $S = \mathrm{blockdiag}(0, A)$ and anti-symmetric part $W = \begin{bmatrix} 0 & -I \\ I & 0 \end{bmatrix}$, we can rewrite

$$ 0 = \nabla \cdot \left[ S\, \nabla H\, \rho + \frac{1}{2}\, \nabla \cdot \big( \eta\, \tilde{C}\, \rho \big) \right] + \nabla \cdot \big[ W\, \nabla H\, \rho \big], $$

that is verified with $\rho(\zeta) \propto \exp(-H(\zeta))$ if

$$ \eta\, \tilde{C}(\zeta) = 2S, \quad \text{i.e.,} \quad \eta\, C(\theta) = 2A. $$

If in fact $\rho \propto \exp(-H)$, then

$$ \nabla \cdot \big[ W\, \nabla H\, \rho \big] = \rho\, \nabla \cdot (W \nabla H) - \rho\, \nabla H^{\top} W\, \nabla H = 0, $$

since $\nabla H^{\top} W\, \nabla H = 0$ for any anti-symmetric $W$, and the second term $\nabla \cdot (W \nabla H) = \sum_{ij} W_{ij}\, \partial_i \partial_j H$ is zero due to the fact that $W$ is anti-symmetric while the Hessian of $H$ is symmetric.

Thus, we can rewrite

$$ 0 = \nabla \cdot \big[ S\, \nabla H\, \rho + S\, \nabla \rho \big] = \nabla \cdot \big[ \rho\, S\, (\nabla H - \nabla H) \big] = 0, $$

and then $\rho(\zeta) \propto \exp(-H(\zeta))$ is indeed stationary, proving Theorem 2.
Appendix B. i-sgd Method Proofs and Details
Appendix B.1. Proof of Corollary 1
The requirement on the injected noise covariance ensures that it is a valid (i.e., positive semidefinite) covariance matrix. The composite noise matrix is equal to the sum of the sg noise covariance and of the injected noise covariance. Since, by construction, this composite matrix satisfies the condition of Theorem 1, the corollary follows.
Appendix B.2. Proof of Optimality of Λ
Our design choice is to select the smallest valid Λ. By the assumptions, the matrix is diagonal, and consequently the composite noise covariance is diagonal as well. The preconditioner must be chosen to satisfy the positive semidefinite constraint. Equivalently, the constraint must hold for every index in the set of indices of parameters belonging to each layer. By Assumption 3, it is easy to show that the corresponding per-layer quantity is upper bounded. To satisfy the positive semidefinite requirement in all cases, the minimum valid value of Λ is then determined by this bound.
Appendix B.3. Algorithmic Details
In this section, we provide further details about the practical implementation of the proposed scheme. At any (discrete) time instant, a minibatch version of the gradient is computed which is distributed, according to the hypotheses of the main paper, as a Gaussian centered on the full gradient. Since we assumed that the second-order moment is a good approximation of the variance, we can estimate it from the squared minibatch gradients. In practice, we found the following running-average estimation procedure to be the most robust
where . In all experiments we considered
After a warmup period, the various λ are estimated, layer by layer, and kept constant until the end. The estimation procedure continues during the sampling phase, as this quantity is necessary at every step. As the learning rate is derived from these estimates, we found that the usage of second-order moments instead of variances, and in certain cases temperature scaling, kept the simulated trajectories more stable.
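The running-average procedure can be sketched as follows; the decay constant beta and the class interface are assumptions of ours, not the exact values used in the experiments.

```python
import numpy as np

class SecondMomentEstimator:
    """Exponential running average of squared minibatch gradients,
    used as a surrogate for the sg-noise variance (the second-order
    moment approximates the variance when the mean gradient is small)."""

    def __init__(self, dim, beta=0.99):   # beta: assumed decay constant
        self.beta = beta
        self.m2 = np.zeros(dim)

    def update(self, grad):
        # m2 <- beta * m2 + (1 - beta) * grad^2, elementwise
        self.m2 = self.beta * self.m2 + (1.0 - self.beta) * grad ** 2
        return self.m2

# feed the estimator synthetic gradients with per-coordinate noise scales
est = SecondMomentEstimator(dim=3)
rng = np.random.default_rng(0)
for _ in range(2_000):
    est.update(rng.normal(0.0, [1.0, 2.0, 3.0]))
```

After enough updates the estimate tracks the per-coordinate second moments (here 1, 4 and 9), from which per-layer quantities can be reduced, e.g. by taking a maximum over the layer's coordinates.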
Appendix C. Methodology
We hereafter present additional implementation details.
Appendix C.1. Regression Tasks, with Simple Models
For this set of experiments, the baseline is obtained by running the adam optimizer for 20,000 steps with learning rate and default parameters. At test time we use 100 samples to estimate the predictive posterior distribution, using Equation (3), for the sampling methods (i-sgd, sgld, sghmc, vsgd), with a keep-every value equal to 1000. The i-sgd and vsgd sampling methods are started from the baseline. For i-sgd we selected temperature , while for sghmc and sgld we performed experiments for temperatures 1 and . We modified the implementation of vsgd, as the original implementation produced unstable learning rates (as noticed also in ). A simple and effective solution, which we implemented and kept throughout the experimental campaign, is to divide the learning rate by the number of parameters (thus performing variational inference on a tempered version of the posterior). For sgld, the learning rate decay is the one suggested in , with initial and final learning rates equal to and , respectively. For mcd we collected 1000 samples with a standard dropout rate of . All our experiments use 10 splits. The considered batch size is 64 for all methods.
Appendix C.2. Classification Task, ConvNet
For the LeNet-5 on mnist experiment, we also consider the swag algorithm. At test time we use 30 samples for all methods. Baselines are again trained using the adam optimizer for 20,000 steps with learning rate and default parameters. For i-sgd and sghmc we collected samples for the different temperatures of 1 and . sgld has initial and final learning rates of and . For all the sampling methods we collect 100 samples with a keep-every of 10,000 steps. swag results are obtained by collecting the statistics over 300 epochs using the adam optimizer and decreasing the learning rate every epoch in accordance with the schedule of the original paper . drop results are obtained by training the networks with sgd, with learning rate 0.005 and momentum 0.5. The number of collected samples for this method is 1000. The batch size for all the methods is 128.
As explained in the main text, we performed an ablation study on the considered baselines. In Table A1 we report the results for the additional variants obtained by early stopping (10,000 iterations instead of 20,000), baseline S, to ablate overfitting, and by training for 30,000 iterations, baseline L. Finally, we include the best performing baseline R, obtained starting from the baseline, reducing the learning rate by a factor of 10 and training for 10,000 more iterations.
Table A1. Baselines comparison for classification on the mnist dataset.
baseline   | 9886.6667 ± 11.0252 | 352.6640 ± 20.8622 | 0.0353 ± 0.0058 | 0.0468 ± 0.0001 | 0.0019 ± 0.0003
baseline S | 9871.6667 ± 20.7579 | 389.7142 ± 79.0354 | 0.0378 ± 0.0051 | 0.0468 ± 0.0008 | 0.0025 ± 0.0006
baseline L | 9893.0000 ± 4.8990  | 339.8170 ± 7.9855  | 0.0392 ± 0.0042 | 0.0477 ± 0.0008 | 0.0024 ± 0.0001
baseline R | 9919.0000 ± 9.4163  | 242.7644 ± 17.0736 | 0.0303 ± 0.0001 | 0.0482 ± 0.0006 | 0.0021 ± 0.0002
Appendix C.3. Classification Task, Deeper Models
We here report details for the ResNet-18 on cifar10 experiments. The baseline is obtained with the adam optimizer with learning rate 0.01, decreased by a factor of 10 every 50 epochs for a total of 200 epochs, and weight decay of . For this set of experiments no temperature scaling was required. We could not find good hyperparameters for the sgld scheme. Concerning i-sgd, sghmc and vsgd, the keep-every value is chosen as 10,000 and the number of collected samples is 30. For swag we used the default parameters described in . Notice that for swag we performed the following ablation study: we trained the networks considering as loss function the joint log-likelihood, and either included or omitted the weight decay suggested in the original work . From a purely Bayesian perspective no weight decay should be considered, as this information is implicit in the prior; however, we found that without the extra decay swag was not able to obtain competitive results. As underlined in Section 4.1, a better posterior approximation does not necessarily translate into better empirical results.
Appendix C.4. Definition of the Metrics
For regression datasets, we consider rmse and mnll. Consider a single datapoint $(x, y)$, with $x$ the input of the model and $y$ the true corresponding output. The output of the model, for a single sample of parameters $\theta_s$, is $\hat{y}_s = f_{\theta_s}(x)$. rmse is defined as $\sqrt{\frac{1}{N} \sum_{n=1}^{N} \big( y^{(n)} - \bar{y}^{(n)} \big)^2}$, where $\bar{y} = \frac{1}{S} \sum_{s=1}^{S} \hat{y}_s$ is the empirical mean. mnll is defined instead as $-\frac{1}{N} \sum_{n=1}^{N} \log \mathcal{N}\big( y^{(n)} \mid \bar{y}^{(n)}, \hat{\sigma}^2_{(n)} \big)$, where $\hat{\sigma}^2 = \frac{1}{S} \sum_{s=1}^{S} \big( \hat{y}_s - \bar{y} \big)^2$ is the empirical variance.
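Under these definitions, the two regression metrics can be computed as follows (a minimal sketch; array shapes and the epsilon guard are our own choices):

```python
import numpy as np

def regression_metrics(y_true, preds, eps=1e-12):
    """rmse and mnll from posterior-sample predictions.

    y_true: (N,) targets; preds: (S, N) predictions, one row per
    posterior sample. mnll uses a Gaussian predictive built from the
    empirical mean and variance of the samples.
    """
    mu = preds.mean(axis=0)                 # empirical mean over samples
    var = preds.var(axis=0) + eps           # empirical variance (guarded)
    rmse = np.sqrt(np.mean((y_true - mu) ** 2))
    # negative log of a Gaussian density, averaged over datapoints
    nll = 0.5 * (np.log(2 * np.pi * var) + (y_true - mu) ** 2 / var)
    return rmse, np.mean(nll)
```

For instance, two posterior samples predicting symmetrically around the targets at unit spread give rmse 1 and mnll equal to the entropy-like constant of a unit Gaussian.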
For classification datasets, we consider acc, mnll and entropy. Consider a single datapoint $(x, y)$, with $x$ the input of the model and $y$ the true corresponding label. The output of the model, for a single sample of parameters $\theta_s$, is the probability vector $p_s = f_{\theta_s}(x)$. The probability vector averaged over samples is $\bar{p} = \frac{1}{S} \sum_{s=1}^{S} p_s$. acc is defined as $\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\big[ \arg\max_c \bar{p}^{(n)}_c = y^{(n)} \big]$. mnll is computed as $-\frac{1}{N} \sum_{n=1}^{N} \log \bar{p}^{(n)}_{y^{(n)}}$. Entropy, as stated in the main text, is instead computed according to $-\sum_{c} \bar{p}_c \log \bar{p}_c$.
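The classification metrics admit an equally short sketch (again with shapes and naming of our own choosing):

```python
import numpy as np

def classification_metrics(y_true, probs, eps=1e-12):
    """acc, mnll and mean predictive entropy from posterior samples.

    y_true: (N,) integer labels; probs: (S, N, K) per-sample class
    probabilities. All three metrics are computed on the probability
    vector averaged over the S posterior samples.
    """
    p_bar = probs.mean(axis=0)                         # (N, K)
    acc = np.mean(p_bar.argmax(axis=1) == y_true)
    # negative log of the averaged probability of the true class
    mnll = -np.mean(np.log(p_bar[np.arange(len(y_true)), y_true] + eps))
    entropy = -np.sum(p_bar * np.log(p_bar + eps), axis=1).mean()
    return acc, mnll, entropy
```

The same entropy values, computed separately on in- and out-of-distribution inputs, are what feeds the roc analysis of Section 4.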
Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688.
Ahn, S.; Korattikara, A.; Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv 2012, arXiv:1206.6380.
Patterson, S.; Teh, Y.W. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In Advances in Neural Information Processing Systems 26; Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; pp. 3102–3110.
Chen, T.; Fox, E.; Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 1683–1691.
Ma, Y.A.; Chen, T.; Fox, E. A complete recipe for stochastic gradient MCMC. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2917–2925.
Draxler, F.; Veschgini, K.; Salmhofer, M.; Hamprecht, F.A. Essentially no barriers in neural network energy landscape. arXiv 2018, arXiv:1803.00885.
Garipov, T.; Izmailov, P.; Podoprikhin, D.; Vetrov, D.; Wilson, A.G. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018.
Chaudhari, P.; Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In Proceedings of the 2018 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 11–16 February 2018; pp. 1–10.
Maddox, W.J.; Izmailov, P.; Garipov, T.; Vetrov, D.P.; Wilson, A.G. A simple baseline for Bayesian uncertainty in deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13132–13143.
Springenberg, J.T.; Klein, A.; Falkner, S.; Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4134–4142.
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, ICML, New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
Chen, C.; Carlson, D.; Gan, Z.; Li, C.; Carin, L. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 1051–1060.
Gardiner, C.W. Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences, 3rd ed.; Springer Series in Synergetics; Springer: Berlin/Heidelberg, Germany, 2004; Volume 13.
Kushner, H.; Yin, G. Stochastic Approximation and Recursive Algorithms and Applications; Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2003.
Ljung, L.; Pflug, G.; Walk, H. Stochastic Approximation and Optimization of Random Systems; Birkhauser Verlag: Basel, Switzerland, 1992.
Kloeden, P.E.; Platen, E. Numerical Solution of Stochastic Differential Equations; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 23.
Neal, R.M. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo; CRC Press: Boca Raton, FL, USA, 2011; Volume 2, p. 2.
Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2011, 73, 123–214.
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
Vollmer, S.J.; Zygalakis, K.C.; Teh, Y.W. Exploration of the (non-) asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 2016, 17, 5504–5548.
Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020.
Zhu, Z.; Wu, J.; Yu, B.; Wu, L.; Ma, J. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv 2018, arXiv:1803.00195.
Scott, W.A. Maximum likelihood estimation using the empirical Fisher information matrix. J. Stat. Comput. Simul. 2002, 72, 599–611.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely
those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or
the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas,
methods, instructions or products referred to in the content.