# A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization


## Abstract


## 1. Introduction

## 2. Preliminaries and Related Work

**Definition 1.** *A distribution is a* **stationary** *distribution for the SDE of the form (6) if and only if it satisfies the following Fokker–Planck equation (FPE):*
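The equation itself did not survive extraction. For reference, for a generic SDE $\mathrm{d}\theta = -g(\theta)\,\mathrm{d}t + \sqrt{2 D(\theta)}\,\mathrm{d}W_t$ (a placeholder for the form in (6); $g$ and $D$ here denote a generic drift and diffusion matrix, not necessarily the paper's exact notation), the standard Fokker–Planck equation reads

$$\frac{\partial p(\theta, t)}{\partial t} = \nabla \cdot \big( p(\theta, t)\, g(\theta) \big) + \sum_{i,j} \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \big( D_{ij}(\theta)\, p(\theta, t) \big),$$

so a stationary distribution $p_s(\theta)$ is one for which the right-hand side vanishes.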

#### 2.1. Gradient Methods without Momentum

**Theorem 1.**

#### 2.2. Gradient Methods with Momentum

**Theorem 2.**

## 3. Sampling by Layer-Wise Isotropization

**Corollary 1.**

**Algorithm 1** Idealized posterior sampling

{Initialization: ${\theta}_{0}$}

$\mathrm{SAMPLE}(\theta_0, B(\theta), \Lambda)$:

$\quad \theta \leftarrow \theta_0$

$\quad$ **loop**

$\qquad g \leftarrow \nabla \tilde{f}(\theta)$

$\qquad n \sim \mathcal{N}(0, I)$

$\qquad C(\theta)^{1/2} \leftarrow \left(\Lambda - B(\theta)\right)^{1/2}$

$\qquad g \leftarrow \Lambda^{-1}\left(g + \sqrt{2}\, C(\theta)^{1/2}\, n\right)$

$\qquad \theta \leftarrow \theta - g$

$\quad$ **end loop**
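As a concrete illustration, the idealized sampler of Algorithm 1 can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: `grad_f`, `B`, and `Lam` stand in for the minibatch gradient $\nabla\tilde{f}$, the gradient-noise covariance $B(\theta)$, and the preconditioner $\Lambda$, and the toy quadratic potential, step count, and seed are our own choices.

```python
import numpy as np

def sample_idealized(theta0, grad_f, B, Lam, n_steps=5000, rng=None):
    """One-to-one transcription of Algorithm 1 (idealized posterior sampling).

    theta0 : initial parameter vector
    grad_f : callable returning the (minibatch) gradient of the potential
    B      : callable returning the gradient-noise covariance B(theta)
    Lam    : preconditioning matrix Lambda; must dominate B(theta)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    Lam_inv = np.linalg.inv(Lam)
    samples = []
    for _ in range(n_steps):
        g = grad_f(theta)
        n = rng.standard_normal(theta.shape)
        # C(theta)^{1/2} = (Lambda - B(theta))^{1/2}, via eigendecomposition
        w, V = np.linalg.eigh(Lam - B(theta))
        C_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
        g = Lam_inv @ (g + np.sqrt(2.0) * C_half @ n)
        theta = theta - g
        samples.append(theta.copy())
    return np.array(samples)

# Toy check: f(theta) = ||theta||^2 / 2 with exact gradients (so B = 0).
# For a large Lambda the chain approximately targets exp(-f) = N(0, I).
d = 2
trace = sample_idealized(
    np.ones(d),
    grad_f=lambda th: th,              # gradient of the quadratic potential
    B=lambda th: np.zeros((d, d)),     # no minibatch noise in this toy
    Lam=10.0 * np.eye(d),              # effective step size 1/10
)
```

With exact gradients the noise term reduces to preconditioned Langevin noise, so the chain's empirical moments should be close to those of the standard Gaussian target.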

#### 3.1. A Practical Method: Isotropic SGD

**Assumption 1.**

**Assumption 2.**

**Assumption 3.**

**Algorithm 2** i-sgd: practical posterior sampling

$\mathrm{SAMPLE}(\theta_0)$:

$\quad \theta \leftarrow \theta_0$

$\quad$ **loop**

$\qquad g \leftarrow \nabla \tilde{f}(\theta)$

$\qquad$ **for** $p \leftarrow 1$ **to** $N_l$ **do**

$\qquad\quad n \sim \mathcal{N}(0, I)$

$\qquad\quad C(\theta)^{1/2} \leftarrow \left(\lambda^{(p)} - \tfrac{1}{2}\, g^{(p)} \odot g^{(p)}\right)^{1/2}$

$\qquad\quad g^{(p)} \leftarrow \tfrac{1}{\lambda^{(p)}}\left(g^{(p)} + \sqrt{2}\, C(\theta)^{1/2}\, n\right)$

$\qquad$ **end for**

$\qquad \theta \leftarrow \theta - g$

$\quad$ **end loop**
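The layer-wise structure of Algorithm 2 can be sketched as a per-layer update over a list of parameter blocks. Again a minimal sketch under our own naming: `params`, `grads`, and `lam` hold the per-layer arrays $\theta^{(p)}$, $g^{(p)}$, and scalars $\lambda^{(p)}$; the toy potential and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def isgd_step(params, grads, lam, rng):
    """One i-sgd update (Algorithm 2 sketch): layer-wise isotropization.

    params : list of per-layer parameter arrays theta^(p)
    grads  : list of per-layer minibatch gradients g^(p)
    lam    : list of per-layer scalars lambda^(p)
    """
    new_params = []
    for theta_p, g_p, lam_p in zip(params, grads, lam):
        n = rng.standard_normal(g_p.shape)
        # Diagonal noise scale: (lambda^(p) - (1/2) g^(p) ⊙ g^(p))^{1/2},
        # clipped at zero so the square root is always defined.
        c_half = np.sqrt(np.clip(lam_p - 0.5 * g_p * g_p, 0.0, None))
        update = (g_p + np.sqrt(2.0) * c_half * n) / lam_p
        new_params.append(theta_p - update)
    return new_params

# Toy usage: two "layers" under the quadratic potential f = ||theta||^2 / 2.
rng = np.random.default_rng(0)
params = [np.ones(3), np.ones(2)]
for _ in range(2000):
    grads = [p.copy() for p in params]  # exact gradients of the toy potential
    params = isgd_step(params, grads, lam=[10.0, 10.0], rng=rng)
```

Note that, unlike Algorithm 1, only per-layer scalars and elementwise products are needed, which is what keeps the per-step cost linear in the number of parameters.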

#### A Remark on Convergence

#### 3.2. Computational Cost

## 4. Experiments

#### 4.1. A Disclaimer on Performance Characterization

#### 4.2. Regression Tasks, with Simple Models

#### 4.3. Classification Tasks, with Deeper Models

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Background and Related Material

#### Appendix A.1. The Minibatch Gradient Approximation
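The body of this appendix was lost in extraction. For completeness, the standard minibatch approximation the title refers to is, in the usual notation ($N$ data points, a minibatch $\mathcal{B}$ drawn uniformly at random, per-datum potentials $f_i$ — our labeling, not necessarily the paper's),

$$\nabla \tilde{f}(\theta) = \frac{N}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla f_i(\theta), \qquad \mathbb{E}_{\mathcal{B}}\!\left[\nabla \tilde{f}(\theta)\right] = \nabla f(\theta),$$

an unbiased estimator whose fluctuations around $\nabla f(\theta)$ give rise to the gradient-noise covariance $B(\theta)$ used in the main text.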

#### Appendix A.2. Gradient Methods without Momentum

#### Appendix A.2.1. The SDE from Discrete Time

#### Appendix A.2.2. Proof of Theorem 1

#### Appendix A.3. Gradient Methods with Momentum

#### Appendix A.3.1. The SDE from Discrete Time

#### Appendix A.3.2. Proof of Theorem 2

## Appendix B. i-sgd Method Proofs and Details

#### Appendix B.1. Proof of Corollary 1

#### Appendix B.2. Proof of Optimality of Λ

#### Appendix B.3. Algorithmic Details

## Appendix C. Methodology

#### Appendix C.1. Regression Tasks, with Simple Models

#### Appendix C.2. Classification Task, ConvNet

Method | acc | mnll | Mean ${\mathit{H}}_{0}$ | ece | Mean ${\mathit{H}}_{1}$ | Failed
---|---|---|---|---|---|---
baseline | 9886.6667 ± 11.0252 | 352.6640 ± 20.8622 | 0.0353 ± 0.0058 | 0.0468 ± 0.0001 | 0.0019 ± 0.0003 | 0.0000
baseline l | 9871.6667 ± 20.7579 | 389.7142 ± 79.0354 | 0.0378 ± 0.0051 | 0.0468 ± 0.0008 | 0.0025 ± 0.0006 | 0.0000
baseline s | 9893.0000 ± 4.8990 | 339.8170 ± 7.9855 | 0.0392 ± 0.0042 | 0.0477 ± 0.0008 | 0.0024 ± 0.0001 | 0.0000
baseline r | 9919.0000 ± 9.4163 | 242.7644 ± 17.0736 | 0.0303 ± 0.0001 | 0.0482 ± 0.0006 | 0.0021 ± 0.0002 | 0.0000

#### Appendix C.3. Classification Task, Deeper Models

#### Appendix C.4. Definition of the Metrics
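The metric definitions themselves were lost in extraction. As one example, the expected calibration error (ece) reported in the tables is conventionally computed by binning predictions by confidence and averaging the accuracy–confidence gap weighted by bin mass. The sketch below follows that standard definition; the function name and the bin count of 15 are our assumptions, not necessarily the paper's choices.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard binned ECE: sum over bins of (bin mass) * |accuracy - confidence|.

    probs  : (N, K) array of predicted class probabilities
    labels : (N,) array of integer class labels
    """
    conf = probs.max(axis=1)                      # top-class confidence
    pred = probs.argmax(axis=1)                   # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Perfectly calibrated one-hot predictions give zero ECE.
ece0 = expected_calibration_error(np.eye(3), np.arange(3))  # -> 0.0
```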


Method | wine | protein | naval | kin8nm | power | boston
---|---|---|---|---|---|---
sgld | 0.759 ± 0.07 | 5.687 ± 0.05 | 0.007 ± 0.00 (F = 6.000) | 0.171 ± 0.07 (F = 3.000) | 11.753 ± 3.25 | 9.602 ± 2.06
i-sgd | 0.635 ± 0.05 | 4.699 ± 0.03 | 0.001 ± 0.00 | 0.079 ± 0.00 | 4.320 ± 0.13 | 3.703 ± 1.19
Baseline | 0.641 ± 0.05 | 4.733 ± 0.05 | 0.001 ± 0.00 | 0.080 ± 0.00 | 4.354 ± 0.12 | 3.705 ± 1.19
vsgd | 0.635 ± 0.05 | 4.699 ± 0.03 | 0.001 ± 0.00 | 0.079 ± 0.00 | 4.325 ± 0.13 | 3.588 ± 1.06 (F = 1.000)
sghmc | 0.628 ± 0.04 | 4.712 ± 0.03 | 0.000 ± 0.00 (F = 2.000) | 0.076 ± 0.00 (F = 1.000) | 4.310 ± 0.14 | 3.659 ± 1.24
sgld T | 0.752 ± 0.07 | 5.673 ± 0.04 | 0.007 ± 0.00 (F = 6.000) | 0.169 ± 0.07 (F = 3.000) | 11.351 ± 3.02 | 9.417 ± 2.07
drop | 0.637 ± 0.04 | 4.968 ± 0.05 | 0.003 ± 0.00 | 0.139 ± 0.01 | 4.531 ± 0.16 | 3.803 ± 1.26
sghmc T | 0.628 ± 0.04 | 4.684 ± 0.03 | 0.000 ± 0.00 (F = 6.000) | 0.076 ± 0.00 | 4.326 ± 0.13 | 3.692 ± 1.19

Method | wine | protein | naval | kin8nm | power | boston
---|---|---|---|---|---|---
sgld | 1.546 ± 0.25 | 5.604 ± 0.08 | −1.751 ± 0.28 (F = 6.000) | 5.140 ± 7.05 (F = 3.000) | 8.429 ± 3.14 | 30.386 ± 15.77
i-sgd | 1.129 ± 0.15 | 4.371 ± 0.03 | −2.466 ± 1.12 | −0.460 ± 0.65 | 3.122 ± 0.07 | 9.799 ± 5.69
Baseline | 1.182 ± 0.03 | 3.964 ± 0.04 | 0.920 ± 0.00 | 0.924 ± 0.00 | 3.071 ± 0.06 | 5.421 ± 2.73
vsgd | 1.128 ± 0.15 | 4.371 ± 0.03 | −2.466 ± 1.12 | −0.480 ± 0.65 | 3.088 ± 0.06 | 8.413 ± 5.89 (F = 1.000)
sghmc | 1.041 ± 0.12 | 4.142 ± 0.02 | −2.763 ± 1.33 (F = 2.000) | −0.798 ± 0.39 (F = 1.000) | 2.924 ± 0.04 | 3.097 ± 0.83
sgld T | 1.526 ± 0.24 | 5.591 ± 0.07 | −1.752 ± 0.28 (F = 6.000) | 5.118 ± 7.06 (F = 3.000) | 8.288 ± 3.04 | 33.212 ± 19.69
drop | 1.065 ± 0.12 | 4.218 ± 0.06 | −2.322 ± 0.75 | −0.086 ± 0.41 | 2.941 ± 0.04 | 3.989 ± 1.23
sghmc T | 1.104 ± 0.14 | 4.191 ± 0.02 | −2.966 ± 1.89 (F = 6.000) | −0.756 ± 0.42 | 3.116 ± 0.07 | 9.826 ± 5.72

Method | acc | mnll | Mean ${\mathit{H}}_{0}$ | ece | Mean ${\mathit{H}}_{1}$ | Failed
---|---|---|---|---|---|---
i-sgd | 9916.3333 ± 2.8674 | 263.5311 ± 16.3600 | 0.0368 ± 0.0019 | 0.0491 ± 0.0003 | 0.4558 ± 0.0591 | 0.0000
sghmc | 9930.6667 ± 2.4944 | 268.2559 ± 6.8172 | 0.0593 ± 0.0018 | 0.0531 ± 0.0003 | 1.0369 ± 0.0346 | 0.0000
drop | 9912.6667 ± 6.0185 | 362.8973 ± 24.8881 | 0.0960 ± 0.0090 | 0.0541 ± 0.0011 | 0.5507 ± 0.0577 | 0.0000
baseline | 9886.6667 ± 11.0252 | 352.6640 ± 20.8622 | 0.0353 ± 0.0058 | 0.0468 ± 0.0001 | 0.0019 ± 0.0003 | 0.0000
baseline r | 9919.0000 ± 9.4163 | 242.7644 ± 17.0736 | 0.0303 ± 0.0001 | 0.0482 ± 0.0006 | 0.0021 ± 0.0002 | 0.0000
swag | 9917.0000 ± 2.8284 | 308.8182 ± 20.0979 | 0.0675 ± 0.0108 | 0.0524 ± 0.0011 | 0.3953 ± 0.0442 | 0.0000
sgld | 9927.0000 ± 1.0000 | 279.7685 ± 16.6563 | 0.0556 ± 0.0034 | 0.0531 ± 0.0004 | 1.3032 ± 0.1942 | 1.0000
vsgd | 9927.3333 ± 6.7987 | 225.3725 ± 16.3739 | 0.0274 ± 0.0008 | 0.0481 ± 0.0005 | 0.0414 ± 0.0070 | 0.0000
i-sgd T | 9915.6667 ± 0.9428 | 255.9641 ± 12.8051 | 0.0289 ± 0.0014 | 0.0478 ± 0.0002 | 0.0284 ± 0.0122 | 0.0000
sghmc T | 9937.0000 ± 0.0000 | 231.5332 ± 0.0000 | 0.0434 ± 0.0000 | 0.0518 ± 0.0000 | 0.4623 ± 0.0000 | 2.0000

Method | acc | mnll | Mean ${\mathit{H}}_{0}$ | ece
---|---|---|---|---
i-sgd | 8591.3333 ± 17.4611 | 4393.3557 ± 107.0878 | 0.6107 ± 0.0337 | 0.0731 ± 0.0075
sghmc | 8634.6667 ± 5.1854 | 4357.8998 ± 11.2722 | 0.6300 ± 0.0023 | 0.0819 ± 0.0017
swag wd | 8740.6667 ± 35.5653 | 3931.9900 ± 45.6605 | 0.4130 ± 0.0066 | 0.0275 ± 0.0015
swag | 8061.0000 ± 11.4310 | 5903.2605 ± 62.8167 | 0.5308 ± 0.0135 | 0.0163 ± 0.0019
baseline | 8273.3333 ± 26.7872 | 8050.4467 ± 109.9864 | 0.2250 ± 0.0005 | 0.0809 ± 0.0020
vsgd | 8255.6667 ± 24.1155 | 8919.8062 ± 106.3571 | 0.1761 ± 0.0078 | 0.0905 ± 0.0020

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Franzese, G.; Milios, D.; Filippone, M.; Michiardi, P. A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization. *Entropy* **2021**, *23*, 1426.
https://doi.org/10.3390/e23111426
