Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation

Variational inference is a powerful framework, used to approximate intractable posteriors through variational distributions. The de facto standard is to rely on Gaussian variational families, which come with numerous advantages: they are easy to sample from, simple to parametrize, and many expectations are known in closed-form or readily computed by quadrature. In this paper, we view the Gaussian variational approximation problem through the lens of gradient flows. We introduce a flexible and efficient algorithm based on a linear flow leading to a particle-based approximation. We prove that, with a sufficient number of particles, our algorithm converges linearly to the exact solution for Gaussian targets, and to a low-rank approximation otherwise. In addition to the theoretical analysis, we show, on a set of synthetic and real-world high-dimensional problems, that our algorithm outperforms existing methods on Gaussian targets while performing on a par with them on non-Gaussian targets.


Introduction
Representing uncertainty is a ubiquitous problem in machine learning. Reliable uncertainties are key for decision making, especially in contexts where the trade-off between exploitation and exploration plays a central role, such as Bayesian optimization [1], active learning [2], and reinforcement learning [3]. While Bayesian inference is a principled tool to provide uncertainty estimation, computing posterior distributions is intractable for many problems of interest. Most sampling methods struggle to scale up to large datasets [4], while the diagnosis of convergence is not always straightforward [5]. On the other hand, Variational Inference (VI) methods can rely on well-understood optimization techniques and scale well to large datasets, at the cost of an approximation quality depending heavily on the assumptions made. The Gaussian family is by far the most popular variational approximation used in VI [6,7]. This is for several reasons. First, Gaussian variational families are easy to sample from, reparametrize, and marginalize. Second, they are easily amenable to diagonal covariance approximations, making them scalable to high dimensions. Third, most expectations are either easily computable by quadrature or Monte Carlo integration, or known in closed-form.
A large body of work covers different approaches to optimize the Variational Gaussian Approximation (VGA), with the speed of convergence and the scalability in dimensions as the main concerns. From the perspective of convergence speed, the major bottleneck when computing gradients with stochastic estimators is the estimator variance [8]. Particle-based methods with deterministic paths do not have this issue and have been proven to be highly successful in many applications [9][10][11]. However, can we use a particle-based approach to compute the VGA itself? The VGA is obtained by minimizing the variational free energy

F(q) = E_q[ϕ(x)] − H_q,    (2)

where ϕ(x) = −log(p(y|x)p(x)) and H_q denotes the entropy of q. A standard descent algorithm based on gradients of Equation (2) with respect to the variational parameters m, C gives rise to some issues. First, naively computing the gradient of the expectation with respect to the covariance matrix C involves unwanted second derivatives of ϕ(x) [12], which may not be available or may be computationally too expensive in a black-box setting. Second, the gradient of the entropy term H_q entails inverting a non-sparse matrix, which we would like to avoid for higher-dimensional cases. Finally, the positive-definiteness of the covariance matrix leads to non-trivial constraints on parameter updates, which can lead to a slowdown of convergence or, if ignored, to instabilities in the algorithm. To solve these issues, a variety of approaches have been proposed in the literature. If we focus on factorizable models, we can make a simplification: for problems with likelihoods that can be rewritten as p(y|x) = ∏_{d=1}^{D} p(y|x_d), the number of independent variational parameters is reduced to 2D [12,13]. In this special case, the Gaussian expectations in the free energy (2) split into a sum of one-dimensional integrals, which can be efficiently computed using numerical quadrature methods. To extend to the general case, gradients of the free energy are estimated by a stochastic sampling approach, which also forms the starting point of our method.
This relies on the so-called reparametrization trick, where the expectation over the parameter-dependent variational density q_θ is replaced by an expectation over a fixed density q_0 instead. This facilitates the gradient computation because unwanted derivatives of the type ∇_θ q_θ(x) are avoided. For the Gaussian case, the reparametrization trick is a linear transformation expressing a D-dimensional Gaussian random variable x ∼ q_θ(x) in terms of a D-dimensional Gaussian random variable x_0 ∼ q_0 = N(m_0, C_0):

x = Γ(x_0 − m_0) + m,    (3)

where Γ ∈ R^{D×D} and m ∈ R^D are the variational parameters. We assume that the covariance C_0 is not degenerate and, for simplicity, we set it to the identity. For instance, the gradient with respect to the mean m of the expectation of a function f under q becomes ∇_m E_q[f(x)] = E_q[∇_x f(x)]. This can be simply proved by using the reparametrization (3) inside the integral and passing the gradient inside; for more details, see [14]. Given this representation, the free energy is easily obtained as a function of the variational parameters:

F(m, Γ) = E_{q_0}[ϕ(Γ(x_0 − m_0) + m)] − log|det Γ| + const.    (4)

Other representations are possible. Challis and Barber [13] and Ong et al. [15] use a different reparametrization with a factorized structure of the covariance, C = ΓΓ^T + diag(d), where Γ ∈ R^{D×P} and d ∈ R^D, with P ≤ D the rank of ΓΓ^T. Other representations assume special structures of the precision matrix Λ = C^{−1}, which allow one to enforce special properties, such as sparsity [16,17]. In general, these methods tend to scale poorly with the number of dimensions, as one needs to optimize D(D + 3)/2 parameters. The (structured) Mean-Field (MF) approach [18,19] imposes independence between variables in the variational distribution. The number of variational parameters is then 2D, but covariance information between dimensions is lost.
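As a concrete illustration of the reparametrization trick above, the following NumPy sketch (a minimal example of our own, not the paper's Julia code; the quadratic test function f is an arbitrary choice) checks that the Monte Carlo estimate of ∇_m E_q[f(x)] obtained through x = Γx_0 + m matches the closed-form answer, which equals m for f(x) = ½‖x‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
m = rng.normal(size=D)
Gamma = np.tril(rng.normal(size=(D, D))) + 2.0 * np.eye(D)  # so C = Gamma @ Gamma.T

# For f(x) = 0.5 * ||x||^2 we have grad f(x) = x, hence
# grad_m E_q[f(x)] = E_q[grad_x f(x)] = E_q[x] = m in closed form.
def grad_f(x):
    return x

x0 = rng.normal(size=(100_000, D))        # samples from q0 = N(0, I)
x = x0 @ Gamma.T + m                      # reparametrized samples from q = N(m, C)
grad_m_estimate = grad_f(x).mean(axis=0)  # no derivatives of q_theta needed
print("max abs error:", np.abs(grad_m_estimate - m).max())
```

Note that only gradients of f appear; the variational parameters enter purely through the sample transformation, which is what avoids the ∇_θ q_θ(x) terms.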

Natural Gradients
Besides the issue of computing expectations, more efficient optimization directions, beyond ordinary gradient descent, have been considered. These can help to deal with constraints such as those given for the covariance matrix. Natural gradients [20] are a special case of Riemannian gradients and utilize the specific Riemannian manifold structure of variational parameters. They can often deal with constraints on parameters (such as the positive definiteness of the covariance), accelerate inference, and improve the convergence of algorithms. The application of such advanced gradient methods typically requires an estimate of the inverse Fisher information matrix as a preconditioner of ordinary gradients. Khan and Nielsen [21] and Lin et al. [22] propose a solution that requires extra second derivatives of the log-posteriors. Salimbeni et al. [23] developed an automatic process to compute these without the second derivatives, but with instability issues. Lin et al. [17] solved these issues by using geodesics on the manifold of parameters, at the price of having to compute inverse matrices as well as Hessians.

Particle-Based VI
Stochastic gradient descent methods compute expectations (and gradients) at each time step with new independent Monte Carlo samples drawn from the current approximation of the variational density. Particle-based methods for variational inference draw samples only once at the beginning of the algorithm instead. They iteratively construct transformations of an initial random variable (having a simple tractable density) where the transformed density leads to the decrease and finally to the minimum of the variational free energy. The iterative approach induces a deterministic temporal flow of random variables which depends on the current density of the variable itself. Using an approximation by the empirical density (which is represented by the positions of a set of 'particles') one obtains a flow of interacting particles which converges asymptotically to an empirical approximation of the desired optimal variational density.
The most popular approach is Stein Variational Gradient Descent (SVGD) [24], which computes a nonparametric transformation based on the kernelized Stein discrepancy [9]. SVGD has the advantage of not being restricted to a parametric form of the variational distribution. However, using standard distance-based kernels like the squared-exponential kernel, k(x, y) = exp(−‖x − y‖²/2), can lead to underestimated covariances and poor performance in high dimensions [11,25]. Hence, it is interesting to develop particle approaches that approximate the VGA. We provide a more thorough comparison between our method and SVGD in Section 3.6.
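For reference, a minimal NumPy sketch of an SVGD update with the squared-exponential kernel follows (our own toy implementation on a 2-D Gaussian target; the bandwidth, step size, and iteration count are arbitrary choices, not values from [24]):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 2, 50
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
Lam = np.linalg.inv(Sigma)                     # target is N(0, Sigma)

def grad_log_p(x):                             # score of the Gaussian target
    return -x @ Lam

def svgd_step(x, step=0.1, h=1.0):
    # squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 h))
    diff = x[:, None, :] - x[None, :, :]       # (N, N, D): diff[j, i] = x_j - x_i
    sqd = (diff ** 2).sum(-1)
    K = np.exp(-sqd / (2 * h))                 # (N, N) kernel matrix
    grad_K = -K[:, :, None] * diff / h         # grad of k wrt its first argument
    phi = (K @ grad_log_p(x) + grad_K.sum(axis=0)) / len(x)
    return x + step * phi

x = rng.normal(size=(N, D)) * 3.0              # particles, far from the target
for _ in range(2000):
    x = svgd_step(x)
print("empirical mean:", x.mean(axis=0))
```

The second (repulsive) term in `phi` is what keeps the particles spread out; its weakness relative to the attractive term in high dimensions is the failure mode discussed above.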

GVA in Bayesian Neural Networks
There has been increased interest in building Bayesian Neural Networks (BNN) by adding priors to neural network parameters. The true form of the posterior is unknown, but the VGA has been used due to its ease of use and scalability with the number of dimensions (typically D ≳ 10⁵). Most of the aforementioned methods apply to BNN, but some techniques have been specifically tailored with BNN in mind. [26] use the low-rank structure of [13] but exploit the Local Reparametrization Trick, where each datapoint y_i gets a different sample from q in order to reduce the variance of the stochastic gradient estimator. In Stochastic Weight Averaging-Gaussian (SWAG) [27], a set of particles obtained via stochastic gradient descent represents a low-rank Gaussian distribution approximating the true posterior, with a prior implicitly induced by the network's regularization. While easy to implement, SWAG does not allow one to incorporate an explicit prior, and the resulting distribution does not derive from a principled Bayesian approach.

Related Approaches
The closest approach to our proposed method is the Ensemble Kalman Filter (EKF) [28]. It assumes that the posterior is computed in a sequential way, where, at each time step, only single observations (or small batches) of data, represented by their likelihoods, become available. An ensemble of particles, representing a Gaussian distribution, is iteratively updated with every new batch of observations. EKF allows us to work on high-dimensional problems with a limited number of particles but is restricted to factorizable likelihoods for which a sequential representation is possible. While EKF maintains a representation of a Gaussian posterior, it is not clear how this relates to the goal of minimizing the free energy or the KL divergence.

Gaussian (Particle) Flow
We introduce Gaussian Particle Flow (GPF) and Gaussian Flow (GF), two computationally tractable approaches to obtain a Variational Gaussian Approximation (VGA). In the following, we derive deterministic linear dynamics that decrease the variational free energy. We additionally give some variants with a Mean-Field (MF) approach and prove theoretical convergence guarantees.
In the following, d/dt denotes the total derivative with respect to time, ∂/∂t the partial derivative with respect to time, and ∇_x(·) the gradient with respect to a vector x.

Gaussian Variable Flows
We next discuss an alternative approach to generate the desired transformation of random variables, leading from a simple (prior) Gaussian density to a more complex Gaussian, which minimizes the variational free energy. It is based on the idea of variable flows, i.e., recursive deterministic transformations of the random variables defined by a mapping x_{n+1} = x_n + ε f_n(x_n), where f_n : R^D → R^D. Well-known examples of flows are Normalizing Flows [29], where the f_n are bijections, or Neural ODEs [30], where f_n = f is defined by a neural network and x_0 is the input. For simplicity, we will consider small changes ε → 0 and work with flows in the continuous-time limit (t = nε), which follow a system of Ordinary Differential Equations (ODE). For the Gaussian case, in the spirit of the reparametrization trick (3), we choose a corresponding linear map f and write

dx_t/dt = f_t(x_t) = A_t(x_t − m_t) + b_t,    (5)

where A_t is a matrix, b_t a vector, and m_t := E_{q_t}[x] (which is no longer interpreted as an independent variational parameter). When the initial random variable x_0 is Gaussian distributed, the vectors x_t are also Gaussian for any t. To construct a flow that decreases the free energy over time, we can either compute the time derivative of the specific free energy (2) induced by the ODE (5), or simply derive the general result valid for smooth maps f (see, e.g., [24]). To be self-contained, we briefly repeat the main steps. We first compute the change of the free energy in terms of the time derivative of q_t:

dF(q_t)/dt = ∫ (∂q_t(x)/∂t)(ϕ(x) + log q_t(x)) dx,

where we have used the facts that ∫ (∂q_t(x)/∂t) dx = (d/dt) ∫ q_t(x) dx = 0 and ∂ϕ(x)/∂t = 0. We next use the continuity equation ∂q_t/∂t = −∇_x · (q_t f_t) for the density related to the deterministic flow to obtain

dF(q_t)/dt = E_{q_t}[∇_x ϕ(x)^T f_t(x) − ∇_x · f_t(x)],

where we have applied Green's identity twice and used the fact that lim_{‖x‖→∞} q_t(x) = 0. Specializing to the linear flow (5), we obtain

dF(q_t)/dt = tr(A_t^T (E_{q_t}[∇_x ϕ(x)(x − m_t)^T] − I)) + b_t^T E_{q_t}[∇_x ϕ(x)].    (6)

Equation (6) represents the change in the free energy F for an infinitesimal change of the variables x given by the flow (5).
Obviously, the simplest choices

A_t = −(E_{q_t}[∇_x ϕ(x)(x − m_t)^T] − I),   b_t = −E_{q_t}[∇_x ϕ(x)]    (7)

lead to a decrease in the free energy, dF(q_t)/dt ≤ 0. More detailed derivations are given in Appendix A. Additionally, equality only holds when

E_{q_t}[∇_x ϕ(x)(x − m_t)^T] = I   and   E_{q_t}[∇_x ϕ(x)] = 0.    (9)

Using Stein's lemma [31], we can show that these fixed-point solutions are equal to the conditions for the optimal variational Gaussian distribution given in [12]. In Appendix C, we show that our parameter updates can be interpreted as a Riemannian gradient descent method for the free energy (4). This is based on the metric introduced by ([20], Theorem 7.6) as an efficient technique for learning the mixing matrix in models of blind source separation. This gradient should not be confused with the so-called natural gradient obtained by pre-multiplying with the inverse Fisher information matrix.
Of course, there are other choices for A t and b t , which lead to a decrease in the free energy and the same fixed-point equations. In Section 3.6, we discuss how SVGD, with a linear kernel, can lead to the same fixed points but with different dynamics.

From Variable Flows to Parameter Flows
Before we introduce the particle algorithm, we show that the results for the variable flow can also be converted into a temporal change of the parameters Γ_t, m_t, as defined for Equation (3). From this, a corresponding Gaussian Flow (GF) algorithm can be easily derived. By differentiating the parametrisation x_t = Γ_t(x_0 − m_0) + m_t (with m_t now considered as a free variational parameter) with respect to time t and using (5), we obtain

dx_t/dt = (dΓ_t/dt)(x_0 − m_0) + dm_t/dt = A_t Γ_t(x_0 − m_0) + b_t.    (10)

By inserting x_t = Γ_t(x_0 − m_0) + m_t into the right-hand side of (10), and using the optimal parameters from (7), we obtain

dΓ_t/dt = A_t Γ_t = Γ_t − E_{q_0}[∇_x ϕ(x_t)(x_0 − m_0)^T] Γ_t^T Γ_t,
dm_t/dt = b_t = −E_{q_0}[∇_x ϕ(x_t)].    (11)

Note that the expectations are over the probability distribution of the initial random variable x_0. Discretizing Equations (11) in time, and estimating the expectations by drawing independent samples from the fixed Gaussian q_0 at each time step, we obtain our GF algorithm to minimize the variational free energy in the space of Gaussian densities. We summarize the steps of GF in Algorithm 1. Remarkably, this scheme differs from previous VGA algorithms with Riemannian gradients based on the Fisher information metric (see, e.g., [17,32]) because no matrix inversions or second-order derivatives of the function ϕ are required. GF also allows for the computation of a low-rank VGA by enforcing Γ ∈ R^{D×K} and x_0 ∈ R^K. This algorithm scales linearly in the number of dimensions and quadratically in the rank K of the covariance.
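The GF scheme can be sketched as follows in NumPy (our own illustration on a synthetic Gaussian target; the step size, sample count, and plain Euler discretization are arbitrary choices, and Algorithm 1 in the paper may differ in such details):

```python
import numpy as np

rng = np.random.default_rng(2)
D, S = 4, 256                               # dimension, fresh samples per step
A_rand = rng.normal(size=(D, D))
Sigma = A_rand @ A_rand.T / D + np.eye(D)   # synthetic SPD target covariance
Lam = np.linalg.inv(Sigma)
mu = rng.normal(size=D)

def grad_phi(x):                            # gradient of phi(x) = -log N(x | mu, Sigma)
    return (x - mu) @ Lam

m, Gamma = np.zeros(D), np.eye(D)           # variational parameters (m_0 = 0, C_0 = I)
eta = 0.05
for _ in range(2000):
    x0 = rng.normal(size=(S, D))            # fresh draws from q0 = N(0, I)
    x = x0 @ Gamma.T + m                    # reparametrized samples from q_t
    g = grad_phi(x)
    A = np.eye(D) - (g.T @ (x - m)) / S     # I - E[grad_phi(x)(x - m)^T]
    b = -g.mean(axis=0)                     # -E[grad_phi(x)]
    Gamma = Gamma + eta * A @ Gamma         # Euler step on dGamma/dt = A_t Gamma_t
    m = m + eta * b                         # Euler step on dm/dt = b_t
C = Gamma @ Gamma.T
print("mean err:", np.abs(m - mu).max(), "cov err:", np.abs(C - Sigma).max())
```

Note that no matrix is inverted and only first derivatives of ϕ are used, in line with the discussion above.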
It is interesting to note that the reverse construction of a variable flow from a parameter flow is, in general, not possible. This would require the ability to eliminate all variational parameters and the initial variables x_0 in the resulting differential equation for x_t, and to replace them with functions of x_t alone. For instance, if we eliminate the initial variables x_0 in terms of (Γ_t)^{−1} and x_t in the algorithm of [14], the resulting expression still depends on Γ_t.

Particle Dynamics
The main idea of the particle approach is to approximate the Gaussian density q_t in (7) by the empirical distribution

q̂^t(x) = (1/N) Σ_{i=1}^{N} δ(x − x_i^t),    (12)

computed from N samples x_i^t, i = 1, …, N. These are initially sampled from the density q_0 at time t = 0 and are then propagated using the discretized dynamics of the ODE (5):

x_i^{t+1} = x_i^t + η_1^t Â^t (x_i^t − m̂^t) + η_2^t b̂^t,    (13)

where

Â^t = I − (1/N) Σ_{j=1}^{N} ∇_x ϕ(x_j^t)(x_j^t − m̂^t)^T,   b̂^t = −(1/N) Σ_{j=1}^{N} ∇_x ϕ(x_j^t)

are the empirical estimates of the optimal parameters (7), and η_1^t and η_2^t are learning rates (we further comment on the use of different optimization schemes in Section 4.4). Note that although E_{q̂^t}[∇_x ϕ(x)(x − m̂^t)^T] is a D × D matrix, changing the matrix multiplication order leads to a computational complexity of O(N²D) with a storage complexity of O(N(N + D)), since neither the empirical covariance matrix nor Â^t needs to be explicitly computed.
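The particle update and the reordered matrix products can be sketched as follows in NumPy (our own toy implementation of the GPF step on a synthetic Gaussian target; the step sizes and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 10, 40                               # N >= D + 1, so the solution is full rank
Q = rng.normal(size=(D, D))
Sigma = Q @ Q.T / D + np.eye(D)             # synthetic Gaussian target N(mu, Sigma)
Lam = np.linalg.inv(Sigma)
mu = rng.normal(size=D)

def grad_phi(x):                            # gradient of phi(x) = -log N(x | mu, Sigma)
    return (x - mu) @ Lam

x = rng.normal(size=(N, D))                 # particles, drawn once from q0 = N(0, I)
eta = 0.05
for _ in range(3000):
    m_hat = x.mean(axis=0)
    xc = x - m_hat                          # centered particles, shape (N, D)
    g = grad_phi(x)                         # shape (N, D)
    # A_hat (x_i - m_hat) without ever forming the D x D matrix A_hat:
    # A_hat (x_i - m) = (x_i - m) - (1/N) sum_j grad_phi(x_j) <x_j - m, x_i - m>
    inner = xc @ xc.T                       # (N, N) Gram matrix, O(N^2 D)
    drift = xc - inner @ g / N              # row i equals A_hat (x_i - m_hat)
    x = x + eta * drift - eta * g.mean(axis=0)   # b_hat = -mean of grad_phi
m_hat = x.mean(axis=0)
C_hat = (x - m_hat).T @ (x - m_hat) / N
print("mean err:", np.abs(m_hat - mu).max(), "cov err:", np.abs(C_hat - Sigma).max())
```

The (N, N) Gram matrix `inner` is the only large intermediate, which is what gives the O(N²D) time and O(N(N + D)) storage quoted above.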

Relaxation of Empirical Free Energy and Convergence
We have shown that the continuous-time dynamics (10) of the random variables lead to a decay of the free energy F(q_t) with time t. Assuming that the free energy is bounded from below, one might conjecture that this property would imply the convergence of the particle algorithm to a fixed point when learning rates are sufficiently small, such that the discrete-time dynamics are approximated well by the continuous limit. Unfortunately, the finite number N of particles poses an extra problem. The definition of the free energy F(q) by the KL-divergence (1) for continuous random variables assumes that both q(·) and p(·|y) are densities with respect to the Lebesgue measure. Hence, F(q) is not defined if we take q ≡ q̂, the empirical distribution (12) of the finite particle approximation. Nevertheless, we define a finite-N approximation to the Gaussian free energy, which is then also found to decay under the finite-N dynamics. Let us first assume that N > D and define

F̂(q̂^t) = (1/N) Σ_{i=1}^{N} ϕ(x_i^t) − (1/2) log det Ĉ^t − (D/2) log(2πe),    (14)

with the empirical covariance matrix

Ĉ^t = (1/N) Σ_{i=1}^{N} (x_i^t − m̂^t)(x_i^t − m̂^t)^T.    (15)

The definition (14) is chosen in such a way that in the large-N limit, when the empirical distribution q̂^t converges to a Gaussian distribution q_t, we also obtain the convergence of the approximation (14) to F(q_t). It can be shown (see Appendix B) that dF̂(q̂^t)/dt ≤ 0, with equality only at the fixed points of the dynamics.
In applications of our particle method to high-dimensional problems, the limitations of computational power may force us to restrict particle numbers to be smaller than the dimensionality D. For N < D + 1, the empirical covariance Ĉ^t will be singular and typically contain only N − 1 non-zero eigenvalues, which leads to −log det Ĉ^t = ∞ and makes Equation (14) meaningless. We resolve this issue through a regularisation of the log-determinant term in (14), replacing all zero eigenvalues of Ĉ^t by the value 1, i.e., λ_i = 0 → λ̃_i = 1. We show in Appendix B that the free energy still decays, provided that the dynamics of the particles stay the same. This regularisation step can be formally stated as a replacement of the empirical covariance (15) in (14) by

C̃^t = Ĉ^t + Σ_{i : λ_i = 0} e_i^t (e_i^t)^T,    (16)

where e_i^t is the i-th eigenvector of Ĉ^t.
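The regularisation can be sketched as follows (a NumPy illustration of the eigenvalue replacement on a synthetic rank-deficient ensemble; the numerical tolerance for treating an eigenvalue as zero is our own choice):

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 50, 10                                # fewer particles than dimensions
x = rng.normal(size=(N, D))
m_hat = x.mean(axis=0)
C_hat = (x - m_hat).T @ (x - m_hat) / N      # rank at most N - 1: singular

def reg_logdet(C, tol=1e-10):
    # Replace (near-)zero eigenvalues by 1, so they contribute 0 to log det
    lam = np.linalg.eigvalsh(C)
    lam = np.where(lam > tol, lam, 1.0)
    return np.log(lam).sum()

val = reg_logdet(C_hat)                      # finite, unlike log det of C_hat itself
rank = np.linalg.matrix_rank(C_hat)
print("rank:", rank, "regularized log det:", val)
```

Since log 1 = 0, the replaced eigenvalues simply drop out of the log-determinant, which is exactly the effect of the rank-one corrections in the equation above.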

Algorithm and Properties
The algorithm we propose is to sample N particles {x_1^0, …, x_N^0}, where x_i^0 ∈ R^D, from q_0 (which can be centered around the MAP, for example), and iteratively optimize their positions using Equation (13). Once convergence is reached, i.e., dF̂/dt = 0, we can easily make predictions using the converged empirical distribution q̂(x) = (1/N) Σ_{i=1}^{N} δ(x − x_i), where δ is the Dirac delta function, or, alternatively, the Gaussian density it represents, i.e., q(x) = N(x | m̂, Ĉ). To draw samples from q̂, no inversions of the empirical covariance Ĉ are needed, as we can obtain new samples by computing

x̃ = m̂ + (1/√N) Σ_{i=1}^{N} ξ_i (x_i − m̂),

where the ξ_i are i.i.d. standard normal variables, ξ_i ∼ N(0, 1). This can be shown by defining the deviation matrix D, whose i-th column equals (x_i − m̂)/√N. We naturally have DD^T = Ĉ, which makes D a square root of Ĉ playing the role of a Cholesky factor.
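In code, sampling from the converged ensemble amounts to mixing the centered particles with standard normal weights (our own NumPy sketch; the particle positions here are synthetic stand-ins for a converged ensemble):

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 3, 20
x = rng.normal(size=(N, D)) @ np.diag([1.0, 2.0, 0.5]) + 1.0  # stand-in particles
m_hat = x.mean(axis=0)
Dev = (x - m_hat).T / np.sqrt(N)             # deviation matrix, shape (D, N)
C_hat = Dev @ Dev.T                          # equals the empirical covariance

def sample(n):
    xi = rng.normal(size=(n, N))             # one standard normal weight per particle
    return m_hat + xi @ Dev.T                # x_tilde = m_hat + Dev @ xi

s = sample(200_000)
C_emp = np.cov(s.T, bias=True)               # covariance of the generated samples
print("max cov deviation:", np.abs(C_emp - C_hat).max())
```

Because Dev Devᵀ = Ĉ by construction, the generated samples have exactly the empirical mean and covariance of the ensemble, with no matrix factorization or inversion.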
All the inference steps are summarized in Algorithm 2 and an illustration in two dimensions is provided in Figure 1.
We summarize the principal points of our approach:
• Gradients of expectations have zero variance, at the cost of a bias that decreases with the number of particles and equals zero for Gaussian targets (see Theorem 1);
• The method works with noisy gradients (e.g., when subsampling data);
• The rank of the approximated covariance Ĉ is min(N − 1, D); when N ≤ D, the algorithm can be used to obtain a low-rank approximation;
• The complexity of our algorithm is O(N²D) and the storage complexity is O(N(N + D)); by adjusting the number of particles used, we can control the performance trade-off;
• GPF (and GF) are also compatible with any kind of structured MF (see Section 3.5);
• Despite working with an empirical distribution, we can compute a surrogate of the free energy F(q) to optimize hyper-parameters, compute a lower bound of the log-evidence, or simply monitor convergence.

Relaxation of Empirical Free Energy
The definition of the free energy F(q) from the KL-divergence (1) for continuous random variables assumes that both q(·) and p(·|y) are densities with respect to the Lebesgue measure. Hence, it is not a priori clear that a specific approximation F̂(q̂^t), based on an empirical distribution q̂^t(x) with a finite number of particles N, will decrease under the particle flow. Thus, we may not be able to guarantee convergence to a fixed point for finite N. Luckily, as we show in Appendix D, the finite-N approximation of the free energy is guaranteed not to increase under the particle dynamics. For N < D + 1, the empirical covariance Ĉ^t will typically contain N − 1 non-zero eigenvalues, leading to −log det Ĉ^t = ∞ and making Equation (17) meaningless. We resolve this issue by introducing a regularized free energy F̃, where log det Ĉ^t is replaced by Σ_{i : λ_i^t > 0} log λ_i^t, where the λ_i^t are the eigenvalues of Ĉ^t. We show in Appendix D that, given the dynamics from Equation (5), F̃ is also guaranteed to not increase over time. It can, therefore, be used as a regularized proxy for the true F to optimize hyper-parameters or to monitor convergence. Note that similar results exist for SVGD [33], and their proofs turned out to be highly non-trivial.

Dynamics and Fixed Points for Gaussian Targets
We illustrate our method by some exact theoretical results for the dynamics and the fixed points of our algorithm when the target is a multivariate Gaussian density. While such targets may seem like a trivial application, our analysis could still provide some insight into the performance for more complicated densities. Theorem 1. If the target density p(x) is a D-dimensional multivariate Gaussian, only D + 1 particles are needed for Algorithm 2 to converge to the exact target parameters.
Proof. The proof is given in Appendix E.
Theorem 2. Given a D-dimensional multivariate Gaussian target density p(x) = N(x | µ, Σ) with precision matrix Λ = Σ^{−1}, where x ∈ R^D, and N ≥ D + 1 particles, the continuous-time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix:

m_t − µ = exp(−Λt)(m_0 − µ),

while tr(ΛC_t) converges to D exponentially fast at a rate independent of the target, where m_t and C_t are the empirical mean and covariance matrix at time t and exp(−Λt) is the matrix exponential.
Proof. The proof is given in Appendix F.
Our result shows that the convergence of the mean m_t directly depends on Λ. However, we can also precondition the gradient on m by C_t, i.e., use the natural gradient approximation in the Fisher sense, and eventually get rid of the dependency on Λ once C_t ≈ Λ^{−1}. The exponential relaxation of fluctuations also manifests itself in the decay of the free energy towards its minimum. For the Gaussian target, the free energy exactly separates into two terms corresponding to the mean and the fluctuations, F(m_t, C_t) = F_m(m_t) + F_fl(C_t), where the nontrivial fluctuation part (subtracted by its minimum) is given by

F_fl(C_t) − min F_fl = (1/2)(tr(ΛC_t) − D − log det(ΛC_t)).

We can show that this quantity asymptotically decreases faster than e^{−4t}, independent of the target, and we can also prove a finite-time bound on this decay.

The degenerate case N < D + 1. Additionally, we can show the following result for the fixed points:

Theorem 3. Given a D-dimensional multivariate Gaussian target density p(x) = N(x | µ, Σ), using Algorithm 2 with N < D + 1 particles, the empirical mean converges to the exact mean µ. The N − 1 non-zero eigenvalues of C_t converge to a subset of the spectrum of the target covariance Σ. Furthermore, the global minimum of the regularised version F̃ of the free energy (17) corresponds to the largest eigenvalues of Σ.
Proof. The proof is given in Appendix G.
This result suggests that C_t might typically converge to an optimal low-rank approximation of Σ. We provide an empirical confirmation of this conjecture in Section 4.2. It therefore makes sense to apply our algorithm to high-dimensional problems even when the number of particles is not large: if the target density has significant support close to a low-dimensional submanifold, we might still obtain a reasonable approximation.

Structured Mean-Field
For high-dimensional problems, it may be useful to restrict the variational Gaussian approximation of the posterior to a specific structure via a structured mean-field approximation. In this way, spurious dependencies between variables that are caused by finite-sample effects can be explicitly removed from the algorithms. This is most easily incorporated in our approach by splitting a given collection of latent variables x into M disjoint subsets x^(i). We reorder the vector indices in such a way that the first components correspond to x^(1), the next ones to x^(2), and so on. Hence, we obtain x = {x^(1), x^(2), …, x^(M)}. A structured mean-field approach is enforced by imposing a block structure on the update matrix, A_MF = A^(1) ⊕ ⋯ ⊕ A^(M), where ⊕ is the direct sum operator. It is easy to see that this construction corresponds to a related block structure of the Γ matrix in Equation (3), which means that the subsets of the random vectors are modeled as independent. Hence, as the number of particles grows to infinity, we recover the fixed-point equations of the optimal structured MF Gaussian variational approximation. Note that using a structured MF does not change the complexity of the algorithm but requires fewer particles to obtain a full-rank solution.
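The direct-sum construction is straightforward; a small NumPy helper (our own sketch, with arbitrary example blocks) builds the block-diagonal update matrix A_MF:

```python
import numpy as np

def direct_sum(blocks):
    # A_MF = A^(1) (+) ... (+) A^(M): block-diagonal update matrix
    D = sum(b.shape[0] for b in blocks)
    A = np.zeros((D, D))
    i = 0
    for b in blocks:
        n = b.shape[0]
        A[i:i + n, i:i + n] = b
        i += n
    return A

A1 = np.array([[1.0, 0.2], [0.2, 1.0]])      # coupling within the first subset
A2 = np.array([[3.0]])                       # a one-dimensional second subset
A_mf = direct_sum([A1, A2])
print(A_mf)
```

All cross-block entries of A_MF are zero, so the flow never introduces dependencies between the subsets, which is exactly the structured MF constraint described above.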

Comparison with SVGD
Given the similarities with the SVGD method [24], one could question the differences with our approach. The model proposed by [10], using a linear kernel k(x, x′) = x^T x′ + 1, has similar properties to our approach. The variable update becomes

x_i ← x_i + η ( −(1/N) Σ_{j=1}^{N} (x_j^T x_i + 1) ∇_x ϕ(x_j) + x_i ).

The fixed points are

E_q̂[∇_x ϕ(x) x^T] = I,   E_q̂[∇_x ϕ(x)] = 0,

so that E_q̂[∇_x ϕ(x)(x − m̂)^T] = I, where the last equality holds since E_q̂[∇_x ϕ(x)] = 0. This is the same as our algorithm's fixed points (9). Similarly to Theorem 1, D + 1 particles will converge to the exact D-dimensional multivariate Gaussian target. However, the generated flows are different. The main difference is that we normalize our flow via the L² norm, whereas [10] rely on the reproducing kernel Hilbert space (RKHS) norm ‖φ‖²_H. For a full introduction to RKHS, we recommend [34]. Remarkably, centering the particles on the mean, namely, using the modified linear kernel k(x, x′) = (x − m)^T(x′ − m) + 1, leads to the same dynamics. Additionally, when using SVGD, there is no direct possibility of computing the current KL divergence between the variational distribution and the target, unless some values are accumulated [35]. There is also no clear theory explaining what happens when the number of particles is smaller than the number of dimensions, for both distance-based kernels and the linear kernel.
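A NumPy sketch of the linear-kernel SVGD update follows (our own toy implementation on a centered 2-D Gaussian target; since the discussion above notes that this update is unstable for non-centered targets, the example deliberately uses a centered one with small steps):

```python
import numpy as np

rng = np.random.default_rng(7)
D, N = 2, 10
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
Lam = np.linalg.inv(Sigma)

def grad_log_p(x):                           # score of the centered Gaussian target
    return -x @ Lam

def linear_svgd_step(x, step):
    # k(x, y) = x^T y + 1; sum_j grad_{x_j} k(x_j, x_i) = N * x_i (repulsion)
    K = x @ x.T + 1.0                        # (N, N) linear kernel matrix
    return x + step * ((K @ grad_log_p(x)) / len(x) + x)

x = rng.normal(size=(N, D))
for _ in range(5000):
    x = linear_svgd_step(x, step=0.01)
m_hat = x.mean(axis=0)
C_hat = x.T @ x / N - np.outer(m_hat, m_hat)
print("cov err:", np.abs(C_hat - Sigma).max())
```

At the fixed point the particles satisfy the same moment conditions as GPF, so the empirical covariance matches the target even though the path taken by the particles is different.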

Experiments
We now evaluate the efficiency of GPF and GF. First, given a Gaussian target, we compare the convergence of our approach with popular VGA methods, which are all described in Section 2. Second, we evaluate the effect of varying the number of particles for both Gaussian targets and non-Gaussian targets, especially with a low-rank covariance. Then, we evaluate the efficiency of our algorithm on a range of real-world binary classification problems through a Bayesian logistic regression model and a series of BNN on the MNIST dataset.
All the Julia [36] code and data used to reproduce the experiments are available at the Github repository: https://github.com/theogf/ParticleFlow_Exp (accessed on 27 July 2021).

Multivariate Gaussian Targets
We consider a 20-dimensional multivariate Gaussian target distribution. The mean is sampled from a standard Gaussian, µ ∼ N(0, I_D), and the covariance is a dense matrix defined as Σ = UΛU^T, where U is a unitary matrix and Λ is a diagonal matrix. Λ is constructed as log₁₀(Λ_ii) = log₁₀(κ)(i − 1)/(D − 1) − 1, where κ is the condition number, i.e., κ = Λ_max/Λ_min. This means that, for κ = 1, we obtain Σ = 0.1 I, and for κ = 100, we obtain eigenvalues ranging uniformly from 0.1 to 10 in log-space.
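For concreteness, this target construction reads as follows in NumPy (our own sketch mirroring the description; obtaining the random orthogonal matrix via QR is one of several valid choices):

```python
import numpy as np

def make_target(D, kappa, rng):
    # eigenvalues log-uniform between 0.1 and 0.1 * kappa
    exponents = np.log10(kappa) * np.arange(D) / (D - 1) - 1.0
    lam = 10.0 ** exponents
    U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random orthogonal matrix
    mu = rng.normal(size=D)                        # mean from a standard Gaussian
    return mu, U @ np.diag(lam) @ U.T

rng = np.random.default_rng(8)
mu, Sigma = make_target(D=20, kappa=100.0, rng=rng)
eigs = np.linalg.eigvalsh(Sigma)
print("condition number:", eigs.max() / eigs.min())
```

For κ = 100 the resulting spectrum runs from 0.1 to 10, matching the description above.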
We compare GPF and GF to the state-of-the-art methods for VGA described in Section 2, namely Doubly Stochastic VI (DSVI) [14], Factor Covariance Structure (FCS) [15] with rank p = D, the iBayes Learning Rule (IBLR) [17] with a full-rank covariance and their Hessian approach, and Stein Variational Gradient Descent with both a linear kernel (Linear SVGD) [10] and a squared-exponential kernel (Sq. Exp. SVGD) [24]. For all methods, we set the number of particles or, alternatively, the number of samples used by the estimator, to D + 1, and use standard gradient descent (x^{t+1} = x^t + η φ^t(x^t)) with a learning rate of η = 0.01 for all particle methods. We use RMSProp [37] with a learning rate of 0.01 for all stochastic methods. We run each experiment 10 times with 30,000 iterations, and plot the average error on the mean and the covariance with one standard deviation. For GPF, we additionally evaluate the method with and without using natural gradients for the mean (i.e., pre-multiplying the averaged gradient with C_t), indicated, respectively, with a dashed and solid line. Figure 2 reports the L² norm of the difference between the mean and covariance and those of the true posterior over time, for target condition numbers κ ∈ {1, 10, 100}; we show the mean over 10 runs as well as the 68% credible interval, with dashed curves for methods using natural gradients on the mean. Note that DSVI, GF and FCS are overlapping and are, at this scale, indistinguishable from one another.
As Theorem 1 predicts, GPF converges exactly to the true distribution, regardless of the target. GF and other methods based on stochastic estimators cannot obtain the same precision, as their accuracy is penalized by the gradient noise. IBLR approximates the covariance perfectly, despite the stochasticity of its estimator; however, IBLR needs to compute the true Hessian at each step. When using a Hessian approximation instead, IBLR performed just like DSVI; the true benefit of IBLR appears when exact second-order quantities are computed, which is naturally intractable in high dimensions. SVGD with a linear kernel achieves good performance but is highly unstable: most of the runs (ignored here) diverge. This is due to the inner product x^T x′, which can become extremely large, especially for non-centered data. For this reason, we do not consider this method in the later experiments. SVGD with a sq. exp. kernel obtains a good estimate of the mean but fails to approximate the covariance.
Perhaps surprisingly, GF does not perform much better than DSVI or FCS. This is potentially due to the benefit of Riemannian gradients being canceled by the gradient noise [38], which provides a strong argument for particle-based methods over stochastic estimators.
Remarkably, we also confirm Theorem 2: the convergence speed of C_t is independent of the target Σ, while the convergence speed of m_t has this dependency unless the natural gradient is used (see the dashed curves). The case κ = 1 highlights that natural gradients do not necessarily improve convergence speed.

Low-Rank Approximation for Full Gaussian Targets
We explore the effect of the number of particles for both Gaussian and non-Gaussian targets. We use the same Gaussian target as in the previous experiment, in 50 dimensions, with a full-rank covariance determined by its condition number κ = λ_max/λ_min. The covariance eigenvalues λ_i range uniformly in log-space from 0.1 to 0.1κ. For a given target multivariate Gaussian, we vary the number of particles from 2 to D + 1 and look at the absolute difference |tr(C − Σ)|. The results for D = 50, as well as the corresponding predictions (in dashed black) from Theorem 3, are shown in Figure 3.
The empirical results perfectly match the theoretical predictions, confirming that, for Gaussian targets, the particles determine a low-rank approximation whose spectrum is equal to the largest eigenvalues from the target.
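These predictions are easy to reproduce: by Theorem 3, the N-particle fixed point keeps the N − 1 largest eigenvalues of Σ, so the trace error is the sum of the remaining ones. A small sketch with our own target construction (eigenvalues log-uniform between 0.1 and 0.1κ, as described in the text; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
D, kappa = 50, 100.0

# Eigenvalues log-uniformly spaced between 0.1 and 0.1 * kappa, as in the text.
eigvals = np.logspace(np.log10(0.1), np.log10(0.1 * kappa), D)
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))     # random orthogonal basis
Sigma = Q @ np.diag(eigvals) @ Q.T                   # full-rank Gaussian target

def predicted_trace_error(N):
    """Theorem 3 prediction: the flow keeps the N - 1 largest eigenvalues
    of Sigma, so |tr(C - Sigma)| is the sum of the remaining ones."""
    return np.sort(eigvals)[: D - (N - 1)].sum()

for N in (2, 11, 26, 51):
    print(N, predicted_trace_error(N))
```

With N = D + 1 particles the predicted error vanishes, recovering Theorem 1 as a special case.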

High-Dimensional Low-Rank Gaussian Targets
We consider a typical low-rank case where the dimensionality is high but the effective rank of the covariance is unknown. The target is p(x) = N(µ, Σ), with µ ∼ N(0, I_D) and covariance Σ = UΛUᵀ, where U is a D × D unitary matrix and Λ is a diagonal matrix whose entries depend on the effective rank K of the target. We pick D = 500 and vary K ∈ {10, 20, 30} to simulate a real problem where the correct K is not known. We test all methods allowing for low-rank structure, namely GPF, GF, FCS, and SVGD (Linear and Sq. Exp.). We fix the rank (or the number of particles) to 20; we therefore obtain three cases where the rank is exact, under-estimated, and over-estimated. We use RMSProp [37] for the stochastic methods, or a diagonal version of it (see Section 4.4) for the particle ones. The error of the mean and the covariance is shown in Figure 4; note that the difference in the initial error on the covariance is due to the difficulty of starting with the same covariance for particle and stochastic methods.
Figure 4. Convergence plot of low-rank methods for a 500-dimensional multivariate Gaussian target with effective rank K ∈ {10, 20, 30}. The rank of each method is fixed at 20. The difference in the starting points for the covariance is due to the initialization differences between methods. We show the mean over 10 runs for each method, with shaded areas representing the 68% credible interval.
We observe once again that SVGD with a linear kernel fails to converge due to the large gradients. All methods perform equally well in the estimation of the mean, and are unaffected by the rank of the target. As expected, the approximation quality for the covariance degrades as the rank grows, but all algorithms still converge to good approximations. SVGD with a sq. exp. kernel performs much worse than the rest of the methods; this is a known phenomenon where, in high dimensions, the SVGD covariance is either over- or underestimated.

Non-Gaussian Target
We now investigate the behavior of our algorithm with non-Gaussian target distributions. We built a two-dimensional banana distribution, p(x) ∝ exp(−0.5(0.01x₁² + 0.1(x₂ + 0.1x₁² − 10)²)), varied the number of particles used for GPF in {3, 5, 10, 20, 50}, and compared the result with a standard full-rank VGA approach. We also show the impact of replacing a fixed η with the Adam [39] optimizer for 50 particles. The results are shown in Figure 5. As expected, increasing the number of particles makes the distribution obtained via GPF increasingly closer to the optimal standard VGA, even in this non-Gaussian setting. However, using a momentum-based optimizer such as Adam breaks the linearity assumption of the original flow (5) and leads to a twisted representation of the particles (we observed the same behavior with other momentum-based optimizers). A simple modification of the best-known optimizers allows the linearity to be maintained while still correctly adapting the learning rate to the shape of the problem. Most optimizers accumulate momentum or gradients element-wise and end up modifying the updates as x_{t+1} = x_t + P_t ⊙ ϕ_t(x_t), where P_t is the preconditioner obtained via the optimizer and ⊙ is the Hadamard product. By instead averaging the accumulated statistics over each dimension, we obtain updates that preserve the linear structure. The details of the dimension-wise preconditioners for ADAM, AdaGrad, and AdaDelta are given in Appendix H.
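The exact modified optimizers are listed in Appendix H; as an illustration, here is a dimension-wise RMSProp step in our own notation, where the squared-flow accumulator is averaged over the particles so that all particles share one preconditioner entry per dimension. This is our reading of the construction; `dimwise_rmsprop_step` and its signature are ours, not the paper's.

```python
import numpy as np

def dimwise_rmsprop_step(X, phi, state, lr=1e-3, rho=0.9, eps=1e-8):
    """One dimension-wise RMSProp step for an (N, D) particle array X.

    Instead of the usual element-wise accumulator (one entry per particle
    and per dimension), the squared flow phi is averaged over the N
    particles, so every particle shares the same per-dimension
    preconditioner and the update remains a single affine map of the
    particle positions.
    """
    state = rho * state + (1.0 - rho) * np.mean(phi ** 2, axis=0)  # (D,)
    precond = lr / (np.sqrt(state) + eps)                          # (D,)
    return X + precond * phi, state
```

Here `state` starts as `np.zeros(D)` and `phi` is the flow evaluated at the current particles; because the preconditioner is shared across particles, the update applies one common affine map and cannot twist the particle cloud.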

Bayesian Logistic Regression
Finally, we considered a range of real-world binary classification problems modeled with Bayesian logistic regression. Given data {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^D and y_i ∈ {−1, 1}, we defined the model y_i ∼ Bernoulli(σ(wᵀx_i)) with weights w ∈ R^D, where σ is the logistic function, and we set a prior w ∼ N(0, 10 I_D). We benchmarked the competing approaches on four datasets from the UCI repository [40]: spam (N = 4601, D = 104), krkp (N = 3196, D = 37), ionosphere (N = 351, D = 111), and mushroom (N = 8124, D = 95). We ran all algorithms discussed in Section 4.1, both with and without a mean-field approximation; SVGD was omitted, since it is too unstable. All algorithms were run with a fixed learning rate η = 10⁻⁴ and mini-batches of size 100; we show alternative training settings in Appendix I. Note that, under mean-field, FCS simplifies to DSVI. Additionally, we did not consider full-rank IBLR, as it is too expensive, and we used its reparametrized-gradient version for the Hessian. Figure 6 shows the average negative log-likelihood under 10-fold cross-validation, with one standard deviation, for each dataset. While, as expected, the advantages shown for Gaussian targets do not transfer to non-Gaussian targets, GPF and GF are consistently on par with competitors. On the other hand, IBLR tends to be outperformed. It is also interesting to note that mean-field does not seem to have a negative impact on these problems: performance remains the same even with a full-rank matrix.
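All the benchmarked methods consume gradients of the log-posterior of this model; a minimal sketch in plain NumPy (our variable names; the N/B mini-batch rescaling is the usual unbiased estimator, assumed here since the text does not spell it out):

```python
import numpy as np

def log_posterior_grad(w, X, y, N_total, prior_var=10.0):
    """Stochastic gradient of log p(w | data) for the logistic model above.

    y_i ~ Bernoulli(sigmoid(w^T x_i)) with labels y in {-1, +1} and prior
    w ~ N(0, prior_var * I). X is a mini-batch of shape (B, D); the
    likelihood term is rescaled by N_total / B to stay unbiased.
    """
    B = X.shape[0]
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigmoid(-y_i w^T x_i)
    grad_lik = (y * s) @ X                  # sum_i y_i s_i x_i over the batch
    return (N_total / B) * grad_lik - w / prior_var
```

With `prior_var=10.0` this matches the w ∼ N(0, 10 I_D) prior used in the experiments.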

Bayesian Neural Network
We ran our algorithm on a standard network with two hidden layers of L = 200 neurons each and tanh activation functions (we additionally tried ReLU [41], but some baselines failed to converge). We trained on the MNIST dataset [42] (N = 60,000, D = 784) and used an isotropic prior on the weights, p(w) = N(0, αI_D) with α = 1.0. We additionally compared with Stochastic Weight Averaging-Gaussian (SWAG) [27], using an SGD learning rate of 10⁻⁶ (selected empirically), and Efficient Low-Rank Gaussian Variational Inference (ELRGVI) [26]. We varied the assumptions on the covariance matrix to be diagonal (mean-field) or to have rank L ∈ {5, 10}. Additionally, for GPF, we show the effect of a structured mean-field assumption imposing independence of the weights between layers (GPF (Layers)).
We trained each algorithm for 5000 iterations with a batch size of 128 (∼10 epochs) and report the final average negative log-likelihood, accuracy, and expected calibration error [43] on the test set (N = 10,000) in Table 1. The predictive distribution is given by p(y* | x*, D) = E_{q(w)}[p(y* | x*, w)], where D is the training data and x* is a test sample. We computed the accuracy and the average negative test log-likelihood as acc = (1/N) ∑_i 1_{y_i}(ŷ_i) and NLL = −(1/N) ∑_i log p(y_i | x_i, D), where 1_y(x) is the indicator function (equal to 1 for y = x, 0 otherwise) and ŷ_i is the most probable class under the predictive distribution. For the definition of expected calibration error, we refer the reader to [43]. Additional convergence and uncertainty-calibration plots can be found in Appendix I. Overall, the SVGD method performed best in terms of both accuracy and negative log-likelihood; however, SVGD is not in the same category as the others, since it is not a VGA. Among VGAs, we observed that a low-rank approximation improves upon mean-field methods. In particular, assuming independence between layers provides a large advantage to GPF. GPF and GF generally perform as well as or better than all the other VGA methods. Note that, although not reported here, all methods needed approximately the same time for the 5000 iterations, except for SWAG, which only needs the MAP and a few thousand iterations of SGD afterward, making it generally faster but also less controlled (a grid search was needed to find the appropriate learning rate for SGD).
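The Monte Carlo form of this predictive and the two test metrics can be sketched as follows; `prob_fn` and `weights` are our placeholders for the model's class-probability function and the posterior samples (for GPF, the particles themselves):

```python
import numpy as np

def evaluate(prob_fn, weights, X_test, y_test):
    """Monte Carlo predictive distribution plus test accuracy and NLL.

    prob_fn(w, X) -> (N, K) class probabilities for one weight sample w;
    the predictive p(y* | x*, D) is approximated by averaging over the
    samples in `weights`.
    """
    pred = np.mean([prob_fn(w, X_test) for w in weights], axis=0)  # (N, K)
    acc = np.mean(np.argmax(pred, axis=1) == y_test)
    nll = -np.mean(np.log(pred[np.arange(len(y_test)), y_test]))
    return acc, nll
```

Averaging probabilities (rather than logits) before taking the log is what makes the reported NLL the negative log of the true Monte Carlo predictive.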

Discussion
We introduced GPF, a general-purpose, theoretically grounded, particle-based approach to inference with variational Gaussians, as well as GF, its parametric version. We showed the convergence of the particle algorithm based on an empirical approximation of the free energy. We also showed that we can approximate high-dimensional targets by allowing for low-rank approximations with a small number of particles. The results for Gaussian targets suggest that the posterior covariance approximation relaxes asymptotically fast, with little dependence on the target. This work is a first step in analyzing convergence speed and guarantees for inference with variational Gaussians, and future work could extend the guarantees to non-Gaussian problems. One could also take advantage of existing particle-based VI methods to accelerate inference further or reach a better optimum [44,45].
Data Availability Statement: Datasets can be found on the UCI dataset website [40], and the MNIST dataset can be found on Yann LeCun's website [42].

Appendix A. Derivation of the Optimal Parameters
In Section 3, we considered the optimization problem stated there, where we introduced ∥A∥²_F = tr(AAᵀ), the Frobenius norm, and ∥b_t∥₂, the L₂ norm. To solve this problem, we used the Lagrange multiplier method. We write the Lagrangian with the constraint functions g(A) = tr(AAᵀ) − 1 and h(b) = ∥b∥₂² − 1. For simplicity, we can split the problem into separate subproblems for A_t and b_t. For A_t, computing the gradients of the Lagrangian is straightforward; setting them to zero gives the stated result for A_t. Proceeding similarly for b_t and replacing the gradients gives the result for b_t.

Appendix B. Relaxation of the Empirical Free Energy
We prove the decrease of the empirical free energy (17) under the particle flow when the covariance C is nonsingular. We define the empirical distribution q̂(x) = (1/N) ∑_{i=1}^N δ(x − x_i) with a finite number N of particles, and the empirical free energy accordingly. We are interested in the temporal change of the free energy when the particles move under a general linear dynamics; the induced dynamics for F are given in Equation (A2). For notational simplicity, we introduce g(x) = ∇_x ϕ(x) and ẋ = dx/dt (similarly ṁ = dm/dt).
where we used the permutation properties of the trace. Plugging the dynamics into Equation (A2), we obtain Equation (A3). We next look for conditions on b and A under which dF/dt < 0, i.e., under which the dynamics decrease the free energy. We pick b = −β₁E_q̂[g(x)], where β₁ > 0, which makes the first term in (A3) negative. For A, let us first define ψ = E_q̂[g(x)(x − m)ᵀ] and rewrite the second and last terms of Equation (A3) as tr(Aᵀψ) and −tr(A), respectively.

Combining both, we get tr(Aᵀ(ψ − I)). Similarly to the previous step, we pick A = −β₂(ψ − I), where β₂ ≥ 0, which leads to another negative term, −β₂ tr((ψ − I)ᵀ(ψ − I)) ≤ 0, where we use the fact that XᵀX is a positive semi-definite matrix for any real-valued X.
Note that different forms of A could be used (e.g., β₂ replaced by a positive definite matrix), as long as the trace of the product stays positive. Inserting this b and A gives the free energy dynamics, and the resulting variable dynamics are equivalent to Equation (5) for β₁ = β₂ = 1. Our result shows that the empirical approximation of the free energy decreases under the particle flow.
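For concreteness, one Euler discretization of this flow with β₁ = β₂ = 1 (i.e., b = −E_q̂[g(x)] and A = −(ψ − I)) can be sketched as follows; the function name and step size are ours:

```python
import numpy as np

def gpf_step(X, grad_neg_logp, eta=0.05):
    """One Euler step of the linear particle flow, as we read this appendix.

    X: (N, D) array of particles; grad_neg_logp(x) returns g(x) = -grad log p(x).
    b = -E[g(x)] drives the mean and A = -(psi - I), with
    psi = E[g(x)(x - m)^T], drives the spread; every particle receives the
    same affine update, which is what keeps the flow linear.
    """
    N, D = X.shape
    m = X.mean(axis=0)
    G = np.stack([grad_neg_logp(x) for x in X])   # (N, D), rows g(x_i)
    b = -G.mean(axis=0)
    psi = G.T @ (X - m) / N                       # E[g(x)(x - m)^T]
    A = -(psi - np.eye(D))
    return X + eta * (b + (X - m) @ A.T)
```

On a Gaussian target, g(x) = Λ(x − µ), iterating this step drives the empirical mean to µ and, with N ≥ D + 1 particles, the empirical covariance to Σ, matching Theorem 1.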

Appendix C. Riemannian Gradient for Matrix Parameter Γ
The parameter flow for the matrix Γ in (11) is easily rewritten in terms of the parameter gradient as dΓ_t/dt = −(∂F/∂Γ)ΓᵀΓ. Similar to natural gradients, which are defined via the metric induced by the Fisher matrix, we can rewrite the parameter change in terms of a different Riemannian gradient. This gradient is the direction of change dΓ = Γ(t + dt) − Γ(t) which yields the steepest descent of the free energy over a small time interval dt. As an extra condition, one keeps the length of dΓ fixed, measured by a 'natural' metric with specific invariance properties, defined by an inner product (the squared length) ⟨dΓ, dΓ⟩_Γ on the tangent space of small deviations dΓ from the matrix Γ. Hence, dΓ is found by minimising F(Γ(t) + dΓ, m) (for small dΓ) under the condition that ⟨dΓ, dΓ⟩_{Γ(t)} is fixed. Following [20] (Theorem 6), a natural metric on the space of symmetric nonsingular matrices can be defined so that it is invariant under multiplications of Γ and dΓ by matrices Y, i.e., ⟨dΓ, dΓ⟩_Γ = ⟨dΓY, dΓY⟩_{ΓY}, and reduces to the Euclidean metric at the unit matrix Γ = I. The direction of the natural gradient is obtained by expanding the free energy for small dΓ and introducing a Lagrange multiplier λ for the constraint. One ends up with a quadratic form to be minimised over dΓ; taking the derivative with respect to dΓ, one finds that the direction of dΓ agrees with the right-hand equation of the flow (11).

Appendix D. Regularised Free Energy for N ≤ D
The problem with defining an empirical approximation for N ≤ D particles is that the empirical covariance becomes singular: it typically has N − 1 nonzero eigenvalues, and thus |C| = 0. Note that the extra zero eigenvalue comes from the fact that the empirical sum of fluctuations must be zero, which provides an additional linear constraint.
We can regularise the log-determinant term by replacing the zero eigenvalues of C by 1 in ∑_i log λ_i, since log 1 = 0; the dynamics of the particles stay the same. To write this formally in terms of matrices, we define the projector C_⊥ = ∑_{i: λ_i = 0} e_i e_iᵀ, where e_i is the i-th eigenvector of C. Adding C_⊥ to C (giving C̃ = C + C_⊥) replaces all zero eigenvalues by 1. Since C_⊥ is a projector, we also have tr(C_⊥) = D − (N − 1). In the following, it is useful to introduce the D × N matrix of fluctuations Z, such that C = ZZᵀ/N. The column vectors of Z span the subspace of eigenvectors e_i with λ_i > 0; hence, it follows that C_⊥Z = 0.
We want to show that the regularised free energy F̃ decreases under the particle dynamics for N ≤ D. Since the part of the time derivative of F̃ that depends on dm/dt is unchanged, we only discuss the fluctuation part in the following.
It is useful to introduce the matrix C̃ = C + C_⊥. We need to work out the time derivative of log|C̃|, where the eigenvalues λ_i = 1 of C̃ have a zero time derivative and can be omitted. We use the linear dynamics dZ/dt = AZ, together with C_⊥² = C_⊥ and C_⊥A = 0, to obtain the temporal change of the free energy due to the fluctuations. Note that this proof is valid not only for N ≤ D but also for N > D, where the computations simplify since C_⊥ = 0; a more detailed proof for N > D is, furthermore, given in Appendix B.

Efficient Computation of log |C̃|
A practical way to compute log|C̃| without performing an eigendecomposition is to consider the N × N matrix ZᵀZ/N + J_{N,N}/N, where J_{N,N} is the N × N all-ones matrix. ZᵀZ/N shares the N − 1 nonzero eigenvalues of C and has an additional eigenvalue 0 corresponding to the constant eigenvector (e_N)_i = 1/√N. Adding the all-ones matrix preserves all existing eigenvalues while replacing the zero one with a constant. This leads to log|C̃| = log det(ZᵀZ/N + J_{N,N}/N).
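A sketch of this computation in NumPy (our names; C̃ denotes the empirical covariance with its zero eigenvalues replaced by 1):

```python
import numpy as np

def reg_logdet(X):
    """Regularised log-determinant log|C~| for an (N, D) particle array.

    C~ is the empirical covariance C = Z Z^T / N with its zero eigenvalues
    replaced by 1. The N x N matrix Z^T Z / N shares the N - 1 nonzero
    eigenvalues of C; its remaining zero eigenvalue sits on the constant
    vector (since the fluctuations sum to zero), which the all-ones matrix
    J/N lifts to 1, so the determinant picks up exactly the nonzero spectrum.
    """
    N = X.shape[0]
    Z = (X - X.mean(axis=0)).T             # (D, N) fluctuations, Z 1 = 0
    M = Z.T @ Z / N + np.ones((N, N)) / N
    return np.linalg.slogdet(M)[1]
```

For N much smaller than D, this replaces a D × D eigendecomposition with an N × N determinant.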

Appendix E. Proof of Theorem 1: Fixed Points for a Gaussian Model (N > D)
Theorem A1 (1). If the target density p(x) is a D-dimensional multivariate Gaussian, only D + 1 particles are needed for Algorithm 2 to converge to the exact target parameters.
The general fixed-point condition for the dynamics (13) of the position x_i of particle i is given for i = 1, . . . , N. By taking the expectation over all particles, we obtain a condition in terms of q̂, the empirical distribution of the particles at the fixed point. Note that this result is independent of N, i.e., it is also valid for N = 1. For a D-dimensional Gaussian target p(x) = N(µ, Σ), we will show that the empirical mean and covariance given by the particle algorithm converge to the true mean and covariance matrix of the Gaussian when we use N ≥ D + 1 particles. For simplification, we use the precision matrix Λ = Σ⁻¹; the gradient then becomes g(x) = Λ(x − µ). At the fixed points, dm/dt and dΓ/dt are equal to 0. The condition for the mean gives m = µ. For the matrix Γ, we use the result for the mean, m = µ, and right-multiply by Γᵀ (with C = ΓΓᵀ). We can then simplify to C = Λ⁻¹ = Σ only if C is nonsingular, which holds only if its rank is equal to D, requiring D + 1 particles.

Appendix F. Proof of Theorem 2: Rates of Convergence for Gaussian Targets
Theorem A2 (2). For a target p(x) = N(x | µ, Λ⁻¹), where x ∈ R^D, and N ≥ D + 1 particles, the continuous-time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix, where m_t and C_t are the empirical mean and covariance matrix at time t and exp(−Λt) is the matrix exponential.
Appendix F.1. Convergence of the Mean
Given our target p(x), similarly to Appendix E, we have g(x) = Λ(x − µ), with natural parameters η₁ = Σ⁻¹µ and η₂ = −(1/2)Σ⁻¹. This transforms the first of Equations (11) into dm/dt = −Λ(m − µ). If we now consider the error on m, δm = m − µ, we obtain d(δm)/dt = −Λδm, i.e., δm_t = exp(−Λt)δm₀. Therefore, the mean converges exponentially fast to the true mean. The asymptotic rate is governed by the smallest eigenvalue of Λ, i.e., the inverse of the largest eigenvalue λ_max of Σ.
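The mean dynamics dm/dt = −Λ(m − µ) and its matrix-exponential solution δm_t = exp(−Λt)δm₀ can be checked numerically; a small sketch of ours, with an arbitrary symmetric positive definite Λ:

```python
import numpy as np

# Euler-integrate dm/dt = -Lam (m - mu) and compare with
# delta_m(t) = expm(-Lam t) delta_m(0). Lam and the starting point are
# arbitrary choices of ours.
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
mu = np.array([1.0, -1.0])
m0 = np.array([3.0, 3.0])

m, dt, T = m0.copy(), 1e-3, 1.0
for _ in range(int(T / dt)):
    m = m - dt * Lam @ (m - mu)

# expm(-Lam T) via the eigendecomposition of the symmetric Lam.
w, V = np.linalg.eigh(Lam)
delta_pred = V @ np.diag(np.exp(-w * T)) @ V.T @ (m0 - mu)
```

Up to the Euler discretization error, the integrated error m − µ matches the matrix-exponential prediction, and its slowest-decaying component lies along the eigenvector of the smallest eigenvalue of Λ.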

Appendix F.2. Convergence of the Covariance Matrix
Let z = x − m. From Equation (5), we obtain the dynamics of C in terms of an expectation, which can be further simplified with q ∼ N(m, C). Hence, we have the exact result (A6). We know that the optimal target is C = Σ; therefore, we define the error δC = C − Σ.
Linearizing Equation (A6) gives the dynamics of δC. We have not yet been able to find a general solution of this equation, but we can obtain a simple result for the trace y_t = tr(δC_t).
Taking the trace of the linearized dynamics and using ΣΛ = I yields dy/dt = −2y. We therefore have asymptotic linear convergence, y_t = e^{−2t}y₀, which is independent of the parameters of the Gaussian model. We can also equivalently obtain a non-asymptotic estimate of a specific error measure for the precision matrix. Using Equation (A6), the dynamics of the precision are dC⁻¹/dt = 2Λ − 2C⁻¹. Taking the trace, we get the exact result tr(C_t⁻¹ − Λ) = e^{−2t} tr(C₀⁻¹ − Λ), which is again independent of the parameters of the Gaussian model. Additionally, this tells us that if the covariance C is nonsingular at time t = 0, it will remain nonsingular for all t (otherwise tr(C⁻¹) would be infinite). Hence, if we start with N > D particles with a proper empirical covariance, they cannot collapse to make C singular.
where we have introduced the matrix B := I − ΛC. One can show that its eigenvalues are real and upper bounded by 1. First, this can be shown from the equations of motion. Second, using the elementary bound −ln(1 − u) ≤ u/(1 − u), valid for u ≤ 1, applied to the eigenvalues of B yields a bound whose last two equalities used the definition B = I − ΛC. Since BᵀΛ⁻¹B and C⁻¹ are both positive definite, we can bound the last term (see [47], Theorem 6.5), where, in the last line, we have bounded the trace of a product of p.d. matrices a second time.
Combining with Equation (A7), we can plug in our result from Theorem 2. In the last inequality, we used tr(BBᵀ) ≥ tr(B²). Everything is expressed through traces of functions of B, and thus through its eigenvalues. Since B → 0 as t → ∞ (and so do its eigenvalues u), we can use the Taylor expansion ln(1 − u) + u = −u²/2 + O(u³) to show that λ_free ≥ 4, which is independent of Λ.
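The precision-trace relaxation discussed in this appendix can be checked numerically. Assuming the flow matrix A = I − ΛC (matching B = I − ΛC above), integrating dC/dt = AC + CAᵀ should reproduce tr(C_t⁻¹) = tr(Λ) + e^{−2t}(tr(C₀⁻¹) − tr(Λ)); both the dynamics and this closed form are our reconstruction of the elided equations, so treat this only as a consistency sketch:

```python
import numpy as np

# Integrate dC/dt = A C + C A^T with A = I - Lam C and compare tr(C^{-1})
# against the exponential relaxation with rate exp(-2 t), which is
# independent of the model parameters.
Lam = np.diag([2.0, 0.5, 1.0])     # target precision (our arbitrary choice)
C = np.diag([0.3, 1.5, 0.8])       # initial covariance
tr0 = np.trace(np.linalg.inv(C))

dt, T = 1e-4, 1.0
for _ in range(int(T / dt)):
    A = np.eye(3) - Lam @ C
    C = C + dt * (A @ C + C @ A.T)

pred = np.trace(Lam) + np.exp(-2.0 * T) * (tr0 - np.trace(Lam))
```

The relaxation factor e^{−2t} shows up regardless of the chosen Λ, which is the model-independence claimed in the text.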

Appendix G. Proof of Theorem 3: Fixed-Points for Gaussian Model (N ≤ D)
Theorem A3 (3). Given a D-dimensional multivariate Gaussian target density p(x) = N(x | µ, Σ), using Algorithm 2 with N < D + 1 particles, the empirical mean converges to the exact mean µ, and the N − 1 nonzero eigenvalues of C_t converge to a subset of the spectrum of the target covariance Σ. Furthermore, the global minimum of the regularised version F̃ of the free energy (17) corresponds to the largest eigenvalues of Σ.

At the fixed point, the same computation as in Appendix E gives E_q̂[g(x)(x − m)ᵀ] = ΛC. Using this equality, we get C(x_i − m) = Σ(x_i − m), ∀i = 1, . . . , N, which shows that the obtained low-rank covariance C and the target covariance Σ have N − 1 eigenvectors and eigenvalues in common.
However, are these the largest ones? We look at the modified free energy (17), ignoring the contribution of the mean, which depends on tr(ΛC) and on ∑_i log λ_i, where λ_i are the eigenvalues of the empirical covariance C. We first note that tr(ΛC) = N − 1, independent of which eigenvalues are obtained at the fixed point. This is easily seen by the following argument: if we use the index set I for the common eigenvectors e_i and eigenvalues λ_i, i ∈ I, then tr(ΛC) = ∑_{i∈I} λ_i/λ_i = N − 1. The term N − 1 is a constant, but the log-determinant term makes a difference: the absolute minimum of F̃ is achieved when the λ_i are the N − 1 largest eigenvalues of Σ. Our simulations empirically show that the algorithm usually converges to this absolute minimum.

Appendix H. Dimension-Wise Optimizers
Here, we list some of the most popular optimizers and their dimension-wise versions. In all algorithms, we consider ϕ to be the matrix created by the concatenation of the flows of each particle: ϕ = [ϕ₁, . . . , ϕ_N], where ϕ_n = ϕ(x_n). We additionally use the notation ϕ_{n,i} for the i-th dimension of the flow of the n-th particle. The main differences between the original algorithms and their modified versions are highlighted in red.

Appendix H.1. ADAM
The ADAM algorithm is given by:
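Since the red-marked modifications are lost in this copy, the following is only our hedged reading of a dimension-wise ADAM step: the second-moment accumulator is averaged over the particles (one entry per dimension), so all particles share the same preconditioner, while the momentum keeps scalar, particle-shared coefficients. Both `dimwise_adam_step` and its state layout are our own, not the paper's:

```python
import numpy as np

def dimwise_adam_step(X, phi, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """A sketch of dimension-wise ADAM for an (N, D) particle array.

    The second moment v is averaged over particles, giving one
    preconditioner entry per dimension shared by all particles; the
    per-particle momentum m uses the same scalar coefficients everywhere,
    so the overall update remains a common affine map of the particles.
    """
    m, v, t = state                                     # m: (N, D), v: (D,)
    t += 1
    m = b1 * m + (1 - b1) * phi                         # per-particle momentum
    v = b2 * v + (1 - b2) * np.mean(phi ** 2, axis=0)   # shared per-dim scale
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return X + lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)
```

The initial state would be `(np.zeros((N, D)), np.zeros(D), 0)`.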

Appendix H.2. AdaGrad
The AdaGrad algorithm is given by:

Appendix I.1. Bayesian Logistic Regression
Figure A1. Similarly to Figure 6, we show the average negative log-likelihood on a test set over 10 runs against training time on different datasets for a Bayesian logistic regression problem. The dashed curves represent the low-rank approximation with RMSProp for methods based on stochastic estimators.
Appendix I.2. Bayesian Neural Network
Figure A2. Convergence of the classification error and average negative log-likelihood as a function of time.
Figure A3. Accuracy vs. confidence. Every test sample is clustered according to its highest predictive probability, and the accuracy of each cluster is then computed. A perfectly calibrated estimator would follow the identity.