# Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation


## Abstract


## 1. Introduction

Variational Inference (VI) methods can rely on well-understood optimization techniques and scale well to large datasets, at the cost of an approximation quality that depends heavily on the assumptions made. The Gaussian family is by far the most popular variational approximation used in VI [6,7]. This is for several reasons. First, Gaussian variational families are easy to sample from, reparametrize, and marginalize. Second, they are easily amenable to diagonal covariance approximations, making them scalable to high dimensions. Third, most expectations are either easily computable by quadrature or Monte Carlo integration, or known in closed form.
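The sampling and reparametrization properties can be made concrete in a few lines. The snippet below is our own sketch (not code from the paper): a sample from $\mathcal{N}(m,C)$ is obtained via the reparametrization $x=m+Lz$ with $C=LL^{\top}$, so the sample is a differentiable function of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
m = np.array([1.0, -2.0, 0.5])      # mean of the variational Gaussian
A = rng.normal(size=(D, D))
C = A @ A.T + D * np.eye(D)         # an arbitrary positive-definite covariance

# Reparametrization: x = m + L z with z ~ N(0, I) and C = L L^T, so
# gradients with respect to (m, L) can flow through the sample.
L = np.linalg.cholesky(C)
z = rng.normal(size=(D, 100_000))
x = m[:, None] + L @ z              # 100,000 samples, one per column
```

Marginalizing is equally easy: keeping a subset of the rows of $m$ and the corresponding rows/columns of $C$ yields the exact marginal over those dimensions.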

Much effort has gone into computing the VGA efficiently, with the speed of convergence and the scalability in dimensions as the main concerns. From the perspective of convergence speed, the major bottleneck when computing gradients with stochastic estimators is the estimator variance [8]. Particle-based methods with deterministic paths do not have this issue and have proven highly successful in many applications [9,10,11]. However, can we use a particle-based algorithm to compute a VGA? If so, what are its properties, and is it competitive with other VGA methods?

In this paper, we introduce Gaussian Particle Flow (GPF), a framework to approximate a Gaussian variational distribution with particles. GPF is derived from a continuous-time flow, where the necessary expectations over the evolving densities are approximated by particles. The complexity of the method grows quadratically with the number of particles but linearly with the dimension, remaining compatible with other approximations such as structured mean-field approximations. Using the same dynamics, we also derive a stochastic version of the algorithm, Gaussian Flow (GF).

To show convergence, we prove the decrease of an empirical version of the free energy that is valid for a finite number of particles. For the special case of $D$-dimensional Gaussian target densities, we show that $D+1$ particles are enough to obtain convergence to the true distribution; we also find, for this case, that convergence is exponentially fast. Finally, we compare our approach with other VGA algorithms, both in fully controlled synthetic settings and on a set of real-world problems.

## 2. Related Work

Variational Inference (VI) aims to simplify this problem by turning it into an optimization one. The intractable posterior is approximated by the closest distribution within a tractable family, with closeness being measured by the Kullback-Leibler (KL) divergence, defined by $\mathrm{KL}\left(q\,\|\,p\right)={\mathbb{E}}_{q}\left[\log \frac{q(x)}{p(x)}\right]$.
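For two Gaussian distributions, the KL divergence is available in closed form, which is part of what makes the Gaussian family so convenient as a variational family. A small illustration (the helper `kl_gauss` is ours, not from the paper):

```python
import numpy as np

def kl_gauss(m0, C0, m1, C1):
    """Closed-form KL( N(m0, C0) || N(m1, C1) ) for D-dimensional Gaussians."""
    D = m0.size
    C1_inv = np.linalg.inv(C1)
    diff = m1 - m0
    return 0.5 * (np.trace(C1_inv @ C0) + diff @ C1_inv @ diff - D
                  + np.log(np.linalg.det(C1) / np.linalg.det(C0)))

m, C = np.zeros(2), np.eye(2)
kl_gauss(m, C, m, C)        # identical distributions: KL is zero
kl_gauss(m, C, m, 2 * C)    # positive, and asymmetric in its arguments
```

Note the asymmetry: swapping the two arguments generally gives a different value, which is why VI commits to one particular direction of the divergence.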

#### 2.1. The Variational Gaussian Approximation

The Mean-Field (MF) approach [18,19] imposes independence between variables in the variational distribution. The number of variational parameters is then $2D$, but covariance information between dimensions is lost.

#### 2.2. Natural Gradients

#### 2.3. Particle-Based VI

A popular particle-based method is Stein Variational Gradient Descent (SVGD) [24], which computes a nonparametric transformation based on the kernelized Stein discrepancy [9]. SVGD has the advantage of not being restricted to a parametric form of the variational distribution. However, using standard distance-based kernels such as the squared-exponential kernel ($k(x,y)=\exp(-\|x-y\|_{2}^{2}/2)$) can lead to underestimated covariances and poor performance in high dimensions [11,25]. Hence, it is interesting to develop particle approaches that approximate the VGA. We provide a more thorough comparison between our method and SVGD in Section 3.6.
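For concreteness, one SVGD update with the squared-exponential kernel above can be sketched as follows. This is our own illustration with a fixed bandwidth `h`; practical implementations typically set the bandwidth with a median heuristic.

```python
import numpy as np

def svgd_step(x, grad_logp, h=1.0, eta=0.1):
    """One SVGD update; x is (N, D), grad_logp maps (N, D) -> (N, D) scores."""
    N = x.shape[0]
    diffs = x[:, None, :] - x[None, :, :]          # pairwise differences x_i - x_j
    K = np.exp(-(diffs ** 2).sum(-1) / (2 * h))    # SE kernel matrix (N, N)
    # Attractive term: kernel-weighted average of the target scores.
    drive = K @ grad_logp(x)
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = sum_j K_ij (x_i - x_j) / h.
    repulse = (K[:, :, None] * diffs).sum(axis=1) / h
    return x + eta * (drive + repulse) / N
```

Iterating this step on a Gaussian target pulls the particle cloud toward the target while the repulsive term keeps the particles from collapsing onto the mode.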

#### 2.4. VGA in Bayesian Neural Networks

Bayesian Neural Networks (BNNs) are obtained by adding priors to the parameters of neural networks. The true form of the posterior is unknown, but the VGA has been used due to its ease of use and scalability with the number of dimensions (typically $D\gg {10}^{5}$). Most of the aforementioned methods apply to BNNs, but some techniques have been tailored specifically with BNNs in mind. [26] use the low-rank structure of [13] but exploit the Local Reparametrization Trick, where each datapoint ${y}_{i}$ gets a different sample from $q$ in order to reduce the variance of the stochastic gradient estimator. Another approach is Stochastic Weight Averaging-Gaussian (SWAG) [27], in which a set of particles obtained via stochastic gradient descent represents a low-rank Gaussian distribution approximating the true posterior, with the prior implicitly given by the network's regularization. While easy to implement, SWAG does not allow one to incorporate an explicit prior, and the resulting distribution does not derive from a principled Bayesian approach.

#### 2.5. Related Approaches

A related technique is the Ensemble Kalman Filter (EKF) [28]. It assumes that the posterior is computed sequentially: at each time step, only single data observations (or small batches of them), represented by their likelihoods, become available. An ensemble of particles, representing a Gaussian distribution, is iteratively updated with every new batch of observations. The EKF allows us to work on high-dimensional problems with a limited number of particles but is restricted to factorizable likelihoods for which a sequential representation is possible. While the EKF maintains a representation of a Gaussian posterior, it is not clear how this relates to the goal of minimizing the free energy or the KL divergence.

## 3. Gaussian (Particle) Flow

We now present Gaussian Particle Flow (GPF) and Gaussian Flow (GF), two computationally tractable approaches to obtain a Variational Gaussian Approximation (VGA). In the following, we derive deterministic linear dynamics which decrease the variational free energy. We additionally give some variants with a Mean-Field (MF) approach and prove theoretical convergence guarantees.

#### 3.1. Gaussian Variable Flows

The flow is defined through an ordinary differential equation (ODE). For the Gaussian case, in the spirit of the reparametrization trick (3), we choose a corresponding linear map $f$ and write

#### 3.2. From Variable Flows to Parameter Flows

The Gaussian Flow (GF) algorithm can be easily derived. By differentiating the parametrisation ${x}^{t}={\mathrm{\Gamma}}^{t}({x}^{0}-{m}^{0})+{m}^{t}$ (with ${m}^{t}$ now considered as a free variational parameter) with respect to time $t$ and using (5), we obtain

#### 3.3. Particle Dynamics

#### Relaxation of Empirical Free Energy and Convergence

#### 3.4. Algorithm and Properties

- Gradients of expectations have zero variance, at the cost of a bias that decreases with the number of particles and vanishes for Gaussian targets (see Theorem 1);
- It works with noisy gradients (e.g., when subsampling data);
- The rank of the approximated covariance $C$ is $\min(N-1,D)$; when $N\le D$, the algorithm can be used to obtain a low-rank approximation;
- The computational complexity of our algorithm is $\mathcal{O}\left({N}^{2}D\right)$ and its storage complexity is $\mathcal{O}\left(N(N+D)\right)$; by adjusting the number of particles, we can control the performance trade-off;
- GPF (and GF) are also compatible with any kind of structured MF (see Section 3.5);
- Despite working with an empirical distribution, we can compute a surrogate of the free energy $\mathcal{F}\left(q\right)$ to optimize hyper-parameters, compute the lower bound of the log-evidence, or simply monitor convergence.

Algorithm 1: Gaussian Flow (GF)

Algorithm 2: Gaussian Particle Flow (GPF)
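As a sketch (ours, not the paper's exact pseudocode) of how one GPF iteration can run in $\mathcal{O}(N^{2}D)$, write $E(x)=-\log p(x)$ and replace the expectations of the flow by particle averages; the published algorithms may additionally use natural gradients for the mean and adaptive step sizes.

```python
import numpy as np

def gpf_step(x, grad_E, eta=0.01):
    """One Gaussian Particle Flow iteration (sketch).

    x: (N, D) particles; grad_E maps (N, D) to the (N, D) gradients of
    E(x) = -log p(x). Implements dx_n = -mean(g) + (I - A)(x_n - m) with
    A = (1/N) sum_j g_j (x_j - m)^T estimated from the particles."""
    N = x.shape[0]
    m = x.mean(axis=0)                 # empirical mean
    xc = x - m                         # centred particles
    g = grad_E(x)
    # (A xc_n) computed as (1/N) sum_j g_j <xc_j, xc_n>, which is O(N^2 D)
    # and avoids forming the D x D matrix A explicitly.
    inner = xc @ xc.T / N              # (N, N) Gram matrix of centred particles
    A_xc = inner @ g
    return x + eta * (-g.mean(axis=0) + xc - A_xc)
```

For a Gaussian target $\mathcal{N}(\mu, I)$ (so that $\nabla E(x)=x-\mu$) and $N=D+1$ particles, iterating this step drives the empirical mean to $\mu$ and the empirical covariance to the identity, consistent with the Gaussian-target analysis of Section 3.4.2.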

#### 3.4.1. Relaxation of Empirical Free Energy

#### 3.4.2. Dynamics and Fixed Points for Gaussian Targets

**Theorem 1.**

**Proof.**

**Theorem 2.**

**Proof.**

**Theorem 3.** The **global minimum** of the regularised version $\tilde{\mathcal{F}}$ of the free energy (17) corresponds to the **largest** eigenvalues of $\mathrm{\Sigma}$.

**Proof.**

#### 3.5. Structured Mean-Field

#### 3.6. Comparison with SVGD

## 4. Experiments

#### 4.1. Multivariate Gaussian Targets

We compare GPF and GF with Doubly Stochastic Variational Inference (DSVI) [14], the Factor Covariance Structure (FCS) [15] with rank $p=D$, the iBayes Learning Rule (IBLR) [17] with a full-rank covariance and their Hessian approach, and Stein Variational Gradient Descent with both a linear kernel (Linear SVGD) [10] and a squared-exponential kernel (Sq. Exp. SVGD) [24]. For all methods, we set the number of particles or, alternatively, the number of samples used by the estimator, to $D+1$, and use standard gradient descent (${x}^{t+1}={x}^{t}+\eta\,{\phi}^{t}({x}^{t})$) with a learning rate of $\eta =0.01$ for all particle methods. We use RMSProp [37] with a learning rate of $0.01$ for all stochastic methods. We run each experiment 10 times with 30,000 iterations, and plot the average error on the mean and the covariance with one standard deviation. For GPF, we additionally evaluate the method with and without natural gradients for the mean (i.e., pre-multiplying the averaged gradient with ${C}^{t}$), indicated with a dashed and a solid line, respectively. Figure 2 reports the ${L}_{2}$ norm of the difference of the mean and covariance with those of the true posterior over time for target condition numbers $\kappa \in \{1,10,100\}$.
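One plausible way to build such targets with a prescribed condition number (our construction; the paper does not spell out its own) is to log-space the covariance eigenvalues between $1/\kappa$ and $1$ and apply a random rotation:

```python
import numpy as np

def gaussian_target(D, kappa, rng):
    """Return (mu, Sigma) with cond(Sigma) = kappa by construction."""
    eigvals = np.logspace(-np.log10(kappa), 0, D)    # from 1/kappa up to 1
    Q, _ = np.linalg.qr(rng.normal(size=(D, D)))     # random orthogonal basis
    Sigma = Q @ np.diag(eigvals) @ Q.T               # symmetric positive definite
    mu = rng.normal(size=D)
    return mu, Sigma

rng = np.random.default_rng(0)
mu, Sigma = gaussian_target(20, 100.0, rng)          # kappa = 100 target
```

With $\kappa=1$ the construction degenerates to the identity covariance, recovering the easiest of the three settings.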

#### 4.2. Low-Rank Approximation for Full Gaussian Targets

#### 4.3. High-Dimensional Low-Rank Gaussian Targets

#### 4.4. Non-Gaussian Target

#### 4.5. Bayesian Logistic Regression

We consider four binary classification datasets from the UCI repository [40]: `spam` ($N=4601$, $D=104$), `krkp` ($N=3196$, $D=37$), `ionosphere` ($N=351$, $D=111$), and `mushroom` ($N=8124$, $D=95$). We ran all algorithms discussed in Section 4.1, both with and without a mean-field approximation; SVGD was omitted since it is too unstable. All algorithms were run with a fixed learning rate $\eta ={10}^{-4}$, and we used mini-batches of size 100. We show alternative training settings in Appendix I. Note that FCS, for mean-field, simplifies to DSVI. Additionally, we did not consider full-rank IBLR, as it is too expensive, and we used their reparametrized gradient version for the Hessian. Figure 6 shows the average negative log-likelihood on 10-fold cross-validation with one standard deviation for each dataset. While, as expected, the advantages shown for Gaussian targets do not transfer to non-Gaussian targets, GPF and GF are consistently on par with competitors. On the other hand, IBLR tends to be outperformed. It is also interesting to note that mean-field does not seem to have a negative impact on these problems: performance remains the same even with a full-rank matrix.
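For reference, the gradient needed by all of these methods has a simple form for Bayesian logistic regression. The sketch below (helper name ours, not code from the paper) assumes labels in $\{0,1\}$ and a $\mathcal{N}(0,\sigma^{2}I)$ prior, with the mini-batch likelihood term rescaled by $N/B$ so the stochastic gradient stays unbiased:

```python
import numpy as np

def grad_E(w, X, y, idx=None, prior_var=1.0):
    """Gradient of E(w) = -log p(y, w | X) for Bayesian logistic regression.

    If idx is given, the likelihood term is estimated from that mini-batch
    and rescaled so the stochastic gradient remains unbiased."""
    N = X.shape[0]
    if idx is None:
        idx = np.arange(N)
    scale = N / len(idx)
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))          # sigmoid predictions
    return scale * X[idx].T @ (p - y[idx]) + w / prior_var
```

The two terms mirror the free energy: the data term $X^{\top}(\sigma(Xw)-y)$ from the likelihood and the $w/\sigma^{2}$ term from the Gaussian prior.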

#### 4.6. Bayesian Neural Network

We compare against Stochastic Weight Averaging-Gaussian (SWAG) [27], with an SGD learning rate of ${10}^{-6}$ (selected empirically), and Efficient Low-Rank Gaussian Variational Inference (ELRGVI) [26]. We varied the assumptions on the covariance matrix to be diagonal (Mean-Field) or to have rank $L\in \{5,10\}$. Additionally, we show, for GPF, the effect of using a structured mean-field assumption by imposing independence of the weights between each layer (GPF (Layers)).

## 5. Discussion

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Derivation of the Optimal Parameters

## Appendix B. Relaxation of the Empirical Free Energy

## Appendix C. Riemannian Gradient for Matrix Parameter Γ

## Appendix D. Regularised Free Energy for N ≤ D

#### Efficient Computation of $\log|\tilde{C}|$

## Appendix E. Proof of Theorem 1: Fixed Points for a Gaussian Model (N > D)

**Theorem A1.**

## Appendix F. Proof of Theorem 2: Rates of Convergence for Gaussian Targets

**Theorem A2.**

#### Appendix F.1. Convergence of the Mean

#### Appendix F.2. Convergence of the Covariance Matrix

#### Appendix F.3. Convergence of the Trace of the Covariance

**Lemma A1.**

#### Appendix F.4. Decay of Fluctuation Part of the Free Energy

#### Appendix F.5. Asymptotic Decay of the Free Energy

## Appendix G. Proof of Theorem 3: Fixed-Points for Gaussian Model (N ≤ D)

**Theorem A3.** The **global minimum** of the regularised version $\tilde{\mathcal{F}}$ of the free energy (17) corresponds to the **largest** eigenvalues of $\mathrm{\Sigma}$. The **absolute minimum** of $\tilde{\mathcal{F}}$ is achieved when the ${\lambda}_{i}$ are the $N-1$ **largest** eigenvalues of $\mathrm{\Sigma}$. Our simulations empirically show that the algorithm usually converges to the absolute minimum.

## Appendix H. Dimension-Wise Optimizers

#### Appendix H.1. ADAM

Algorithm A1: ADAM

Input: ${\phi}^{t},{m}^{t-1},{v}^{t-1},{\beta}_{1},{\beta}_{2},\eta$. Output: $\mathrm{\Delta}$.

${m}_{n,d}^{t}={\beta}_{1}{m}_{n,d}^{t-1}+(1-{\beta}_{1}){\phi}_{n,d}^{t}$

${v}_{n,d}^{t}={\beta}_{2}{v}_{n,d}^{t-1}+(1-{\beta}_{2})({\phi}_{n,d}^{t})^{2}$

${\mathrm{\Delta}}_{n,d}=\eta\,\frac{{m}_{n,d}^{t}}{(1-{\beta}_{1}^{t})\left(\sqrt{{v}_{n,d}^{t}\,(1-{\beta}_{2}^{t})^{-1}}+\epsilon\right)}$

Algorithm A2: Dimension-wise ADAM

Input: ${\phi}^{t},{m}^{t-1},{v}^{t-1},{\beta}_{1},{\beta}_{2},\eta$. Output: $\mathrm{\Delta}$.

${m}_{n,d}^{t}={\beta}_{1}{m}_{n,d}^{t-1}+(1-{\beta}_{1}){\phi}_{n,d}^{t}$

${v}_{d}^{t}={\beta}_{2}{v}_{d}^{t-1}+(1-{\beta}_{2})\frac{1}{N}\sum_{n=1}^{N}({\phi}_{n,d}^{t})^{2}$

${\mathrm{\Delta}}_{n,d}=\eta\,\frac{{m}_{n,d}^{t}}{(1-{\beta}_{1}^{t})\left(\sqrt{{v}_{d}^{t}\,(1-{\beta}_{2}^{t})^{-1}}+\epsilon\right)}$
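A numpy rendering of the dimension-wise variant (our sketch of Algorithm A2): the second-moment estimate is averaged over particles, so all particles in a given dimension share one rescaling and the geometry of the particle cloud is not distorted.

```python
import numpy as np

def adam_dimwise(phi, m, v, t, beta1=0.9, beta2=0.999, eta=1e-3, eps=1e-8):
    """Dimension-wise ADAM step in the spirit of Algorithm A2.

    phi, m: (N, D) particle gradients and first moments; v: (D,) second
    moments shared across particles; t: step count starting at 1.
    Returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * phi
    v = beta2 * v + (1 - beta2) * np.mean(phi ** 2, axis=0)  # average over particles
    m_hat = m / (1 - beta1 ** t)                             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    delta = eta * m_hat / (np.sqrt(v_hat)[None, :] + eps)
    return delta, m, v
```

Standard ADAM would instead keep a `(N, D)` second-moment array, giving every particle its own per-coordinate scale.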

#### Appendix H.2. AdaGrad

Algorithm A3: AdaGrad

Input: ${\phi}^{t},{v}^{t-1},\eta$. Output: $\mathrm{\Delta}$.

${v}_{n,d}^{t}={v}_{n,d}^{t-1}+({\phi}_{n,d}^{t})^{2}$

${\mathrm{\Delta}}_{n,d}=\eta\,\frac{{\phi}_{n,d}^{t}}{\sqrt{{v}_{n,d}^{t}}+\epsilon}$

Algorithm A4: Dimension-wise AdaGrad

Input: ${\phi}^{t},{v}^{t-1},\eta$. Output: $\mathrm{\Delta}$.

${v}_{d}^{t}={v}_{d}^{t-1}+\frac{1}{N}\sum_{n=1}^{N}({\phi}_{n,d}^{t})^{2}$

${\mathrm{\Delta}}_{n,d}=\eta\,\frac{{\phi}_{n,d}^{t}}{\sqrt{{v}_{d}^{t}}+\epsilon}$

#### Appendix H.3. RMSProp

Algorithm A5: RMSProp

Input: ${\phi}^{t},{v}^{t-1},\rho,\eta$. Output: $\mathrm{\Delta}$.

${v}_{n,d}^{t}=\rho {v}_{n,d}^{t-1}+(1-\rho)({\phi}_{n,d}^{t})^{2}$

${\mathrm{\Delta}}_{n,d}=\eta\,\frac{{\phi}_{n,d}^{t}}{\sqrt{{v}_{n,d}^{t}}+\epsilon}$

Algorithm A6: Dimension-wise RMSProp

Input: ${\phi}^{t},{v}^{t-1},\rho,\eta$. Output: $\mathrm{\Delta}$.

${v}_{d}^{t}=\rho {v}_{d}^{t-1}+(1-\rho)\frac{1}{N}\sum_{n=1}^{N}({\phi}_{n,d}^{t})^{2}$

${\mathrm{\Delta}}_{n,d}=\eta\,\frac{{\phi}_{n,d}^{t}}{\sqrt{{v}_{d}^{t}}+\epsilon}$

## Appendix I. Additional Figures

#### Appendix I.1. Bayesian Logistic Regression

**Figure A1.** Similarly to Figure 6, we show the average negative log-likelihood on a test set over 10 runs against training time on different datasets for a Bayesian logistic regression problem. The dashed curves represent the low-rank approximation with RMSProp for methods based on stochastic estimators.

#### Appendix I.2. Bayesian Neural Network

**Figure A2.** Convergence of the classification error and average negative log-likelihood as a function of time.

**Figure A3.** Accuracy vs. confidence. Every test sample is grouped according to its highest predictive probability, and the accuracy of each group is then computed. A perfectly calibrated estimator would return the identity.
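The binning behind this plot is also what underlies the ECE numbers reported in Table 1; a minimal version (our sketch) reads:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin test samples by confidence (highest predictive probability)
    and average the gap between each bin's accuracy and mean confidence,
    weighted by the fraction of samples in the bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

A perfectly calibrated classifier has zero gap in every bin and hence zero ECE.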

## References

1. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2016, 104, 148–175.
2. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin–Madison: Madison, WI, USA, 2009.
3. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
4. Bardenet, R.; Doucet, A.; Holmes, C. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 2017, 18, 1515–1557.
5. Cowles, M.K.; Carlin, B.P. Markov chain Monte Carlo convergence diagnostics: A comparative review. J. Am. Stat. Assoc. 1996, 91, 883–904.
6. Barber, D.; Bishop, C.M. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; pp. 395–401.
7. Graves, A. Practical Variational Inference for Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Volume 24, pp. 2348–2356.
8. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822.
9. Liu, Q.; Lee, J.; Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 276–284.
10. Liu, Q.; Wang, D. Stein variational gradient descent as moment matching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 32, pp. 8868–8877.
11. Zhuo, J.; Liu, C.; Shi, J.; Zhu, J.; Chen, N.; Zhang, B. Message Passing Stein Variational Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 6018–6027.
12. Opper, M.; Archambeau, C. The variational Gaussian approximation revisited. Neural Comput. 2009, 21, 786–792.
13. Challis, E.; Barber, D. Gaussian Kullback-Leibler approximate inference. J. Mach. Learn. Res. 2013, 14, 2239–2286.
14. Titsias, M.; Lázaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1971–1979.
15. Ong, V.M.H.; Nott, D.J.; Smith, M.S. Gaussian variational approximation with a factor covariance structure. J. Comput. Graph. Stat. 2018, 27, 465–478.
16. Tan, L.S.; Nott, D.J. Gaussian variational approximation with sparse precision matrices. Stat. Comput. 2018, 28, 259–275.
17. Lin, W.; Schmidt, M.; Khan, M.E. Handling the Positive-Definite Constraint in the Bayesian Learning Rule. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 6116–6126.
18. Hinton, G.E.; van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT '93), Santa Cruz, CA, USA, 26–28 July 1993; Association for Computing Machinery: New York, NY, USA, 1993; pp. 5–13.
19. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
20. Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276.
21. Khan, M.E.; Nielsen, D. Fast yet simple natural-gradient descent for variational inference in complex models. In Proceedings of the International Symposium on Information Theory and Its Applications (ISITA), Singapore, 28–31 October 2018; pp. 31–35.
22. Lin, W.; Khan, M.E.; Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3992–4002.
23. Salimbeni, H.; Eleftheriadis, S.; Hensman, J. Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Lanzarote, Canary Islands, 9–11 April 2018; pp. 689–697.
24. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv 2016, arXiv:1608.04471.
25. Ba, J.; Erdogdu, M.A.; Ghassemi, M.; Suzuki, T.; Sun, S.; Wu, D.; Zhang, T. Towards Characterizing the High-dimensional Bias of Kernel-based Particle Inference Algorithms. In Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, Vancouver, BC, Canada, 8 December 2019.
26. Tomczak, M.; Swaroop, S.; Turner, R. Efficient Low Rank Gaussian Variational Inference for Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33.
27. Maddox, W.J.; Izmailov, P.; Garipov, T.; Vetrov, D.P.; Wilson, A.G. A simple baseline for Bayesian uncertainty in deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13153–13164.
28. Evensen, G. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res. Oceans 1994, 99, 10143–10162.
29. Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1530–1538.
30. Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 6572–6583.
31. Ingersoll, J.E. Theory of Financial Decision Making; Rowman & Littlefield: Lanham, MD, USA, 1987; Volume 3.
32. Barfoot, T.D.; Forbes, J.R.; Yoon, D.J. Exactly sparse Gaussian variational inference with application to derivative-free batch nonlinear state estimation. Int. J. Robot. Res. 2020, 39, 1473–1502.
33. Korba, A.; Salim, A.; Arbel, M.; Luise, G.; Gretton, A. A Non-Asymptotic Analysis for Stein Variational Gradient Descent. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 4672–4682.
34. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
35. Zaki, N.; Galy-Fajou, T.; Opper, M. Evidence Estimation by Kullback-Leibler Integration for Flow-Based Methods. In Proceedings of the Third Symposium on Advances in Approximate Bayesian Inference, Virtual Event, January–February 2021.
36. Bezanson, J.; Edelman, A.; Karpinski, S.; Shah, V.B. Julia: A fresh approach to numerical computing. SIAM Rev. 2017, 59, 65–98.
37. Tieleman, T.; Hinton, G. Lecture 6.5-RMSProp, Coursera: Neural Networks for Machine Learning; Technical Report; University of Toronto: Toronto, ON, USA, 2012.
38. Zhang, G.; Li, L.; Nado, Z.; Martens, J.; Sachdeva, S.; Dahl, G.; Shallue, C.; Grosse, R.B. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 8196–8207.
39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
40. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets.php (accessed on 28 July 2021).
41. Agarap, A.F. Deep learning using rectified linear units (ReLU). arXiv 2018, arXiv:1803.08375.
42. LeCun, Y. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 20 July 2021).
43. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
44. Liu, C.; Zhuo, J.; Cheng, P.; Zhang, R.; Zhu, J. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4082–4092.
45. Zhu, M.H.; Liu, C.; Zhu, J. Variance Reduction and Quasi-Newton for Particle-Based Variational Inference. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020.
46. Gronwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann. Math. 1919, 20, 292–296.
47. Zhang, F. Matrix Theory: Basic Results and Techniques; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.

**Figure 1.** Illustration of the Gaussian Particle Flow algorithm, with ${q}^{0}\left(x\right)$ and $p\left(x\right)$ representing the initial and target distributions, respectively. Particles are iteratively moved according to the gradient flow starting from ${q}^{0}\left(x\right)$, approximating a new Gaussian distribution ${q}^{t}\left(x\right)$ at each iteration $t$.

**Figure 2.** ${L}^{2}$ norm of the difference between the target mean $\mu$ (left side) and target covariance $\mathrm{\Sigma}$ (right side) and the inferred variational parameters ${m}^{t}$ and ${C}^{t}$ against time, for 20-dimensional Gaussian targets with condition number $\kappa$. We use $D+1$ particles/samples and show the mean over 10 runs as well as the 68% credible interval. Methods with dashed curves use natural gradients on the mean. Note that DSVI, GF and FCS overlap and are, at this scale, indistinguishable from one another.

**Figure 3.** Trace error for a Gaussian target with $D=50$ and condition number $\kappa$, for a varying number of particles with GPF. Predictions from Theorem 3 are shown in dashed black.

**Figure 4.** Convergence plot of low-rank methods for a 500-dimensional multivariate Gaussian target with effective rank $K\in \{10,20,30\}$. The rank of each method is fixed to 20. The difference in the starting point for the covariance is due to the initialization difference between the methods. We show the mean over 10 runs for each method, with shaded areas representing the 68% credible interval.

**Figure 5.** Two-dimensional Banana distribution. Comparison of GPF using an increasing number of particles and a different optimizer (ADAM) with the standard VGA (rightmost plot).

**Figure 6.** Average negative log-likelihood on a test set over 10 runs against training time for a Bayesian logistic regression model applied to different datasets. Top plots use a mean-field approximation, while bottom plots use a low-rank structure for the covariance with rank $L=100$.

**Table 1.** Negative Log-Likelihood (NLL), Accuracy (Acc), and Expected Calibration Error (ECE) for a Bayesian Neural Network (BNN) on the MNIST dataset. We varied the rank of the variational covariance from mean-field (all variables are independent) to a low-rank structure with $L\in \{5,10\}$. Bold numbers indicate the best performance, and bold italic numbers indicate the best performance when restricted to VGA methods. Convergence and calibration plots can be found in Appendix I.

| Alg. | Mean-Field NLL | Mean-Field Acc | Mean-Field ECE | $L=5$ NLL | $L=5$ Acc | $L=5$ ECE | $L=10$ NLL | $L=10$ Acc | $L=10$ ECE |
|---|---|---|---|---|---|---|---|---|---|
| GPF | 0.183 | 0.95 | 0.0384 | 0.166 | ***0.96*** | 0.0918 | 0.172 | 0.955 | 0.0869 |
| GPF (Layers) | - | - | - | ***0.147*** | 0.958 | ***0.0181*** | 0.178 | 0.952 | 0.0395 |
| GF | 0.178 | 0.953 | 0.0706 | 0.185 | 0.956 | 0.136 | 0.171 | 0.952 | 0.0455 |
| DSVI | 0.204 | 0.945 | 0.11 | - | - | - | - | - | - |
| SVGD (Sq. Exp) | - | - | - | 0.139 | 0.965 | 0.0732 | **0.133** | **0.967** | 0.0879 |
| SWAG | - | - | - | 0.257 | 0.957 | 0.0662 | 0.287 | 0.956 | 0.0878 |
| ELRGVI | - | - | - | 0.453 | 0.901 | 0.53 | 0.537 | 0.882 | 0.777 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Galy-Fajou, T.; Perrone, V.; Opper, M.
Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation. *Entropy* **2021**, *23*, 990.
https://doi.org/10.3390/e23080990
