Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation

Galy-Fajou, Théo; Perrone, Valerio; Opper, Manfred

doi:10.3390/e23080990


by Théo Galy-Fajou ¹,*, Valerio Perrone ² and Manfred Opper ¹,³

¹ Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany

² Amazon Web Services, 10969 Berlin, Germany

³ Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham, Birmingham B15 2TT, UK

* Author to whom correspondence should be addressed.

Entropy 2021, 23(8), 990; https://doi.org/10.3390/e23080990

Submission received: 22 June 2021 / Revised: 15 July 2021 / Accepted: 21 July 2021 / Published: 30 July 2021

(This article belongs to the Special Issue Approximate Bayesian Inference)


Abstract

Variational inference is a powerful framework, used to approximate intractable posteriors through variational distributions. The de facto standard is to rely on Gaussian variational families, which come with numerous advantages: they are easy to sample from, simple to parametrize, and many expectations are known in closed-form or readily computed by quadrature. In this paper, we view the Gaussian variational approximation problem through the lens of gradient flows. We introduce a flexible and efficient algorithm based on a linear flow leading to a particle-based approximation. We prove that, with a sufficient number of particles, our algorithm converges linearly to the exact solution for Gaussian targets, and a low-rank approximation otherwise. In addition to the theoretical analysis, we show, on a set of synthetic and real-world high-dimensional problems, that our algorithm outperforms existing methods with Gaussian targets while performing on a par with non-Gaussian targets.

Keywords:

variational inference; Gaussian; particle flow; variable flow

1. Introduction

Representing uncertainty is a ubiquitous problem in machine learning. Reliable uncertainties are key for decision making, especially in contexts where the trade-off between exploitation and exploration plays a central role, such as Bayesian optimization [1], active learning [2], and reinforcement learning [3]. While Bayesian inference is a principled tool to provide uncertainty estimation, computing posterior distributions is intractable for many problems of interest. Most sampling methods struggle to scale up to large datasets [4], while the diagnosis of convergence is not always straightforward [5]. On the other hand, Variational Inference (VI) methods can rely on well-understood optimization techniques and scale well to large datasets, at the cost of an approximation quality depending heavily on the assumptions made. The Gaussian family is by far the most popular variational approximation used in VI [6,7]. This is for several reasons. First, Gaussian variational families are easy to sample from, reparametrize, and marginalize. Second, they are easily amenable to diagonal covariance approximations, making them scalable to high dimensions. Third, most expectations are either easily computable by quadrature or Monte Carlo integration, or known in closed-form.

A large body of work covers different approaches to optimizing the Variational Gaussian Approximation (VGA), with the speed of convergence and the scalability with dimension as the main concerns. From the perspective of convergence speed, the major bottleneck when computing gradients with stochastic estimators is the estimator variance [8]. Particle-based methods with deterministic paths do not have this issue and have proven highly successful in many applications [9,10,11]. However, can we use a particle-based algorithm to compute a VGA? If so, what are its properties, and is it competitive with other VGA methods?

In this paper, we attempt to answer these questions by introducing the Gaussian Particle Flow (GPF), a framework to approximate a Gaussian variational distribution with particles. GPF is derived from a continuous-time flow, where the necessary expectations over the evolving densities are approximated by particles. The complexity of the method grows quadratically with the number of particles but linearly with the dimension, and the method remains compatible with other approximations such as structured mean-field approximations. Using the same dynamics, we also derive a stochastic version of the algorithm, Gaussian Flow (GF). To show convergence, we prove that an empirical version of the free energy, valid for a finite number of particles, decreases along the flow. For the special case of D-dimensional Gaussian target densities, we show that

D + 1

particles are enough to obtain convergence to the true distribution. We also find, for this case, that convergence is exponentially fast. Finally, we compare our approach with other VGA algorithms, both in fully controlled synthetic settings and on a set of real-world problems.

2. Related Work

The goal of Bayesian inference is to carry out computations with the posterior distribution of a latent variable

x \in R^{D}

given some observations y. By Bayes' theorem, the posterior distribution is

p (x | y) = \frac{p (y | x) p (x)}{p (y)}

, where

p (y | x)

and

p (x)

are, respectively, the likelihood and the prior distribution. Even if the likelihood and the prior are known analytically, marginalizing out high-dimensional variables in the product

p (y | x) p (x)

in order to compute quantities such as

p (y)

is typically intractable. Variational Inference (VI) aims to simplify this problem by turning it into an optimization one. The intractable posterior is approximated by the closest distribution within a tractable family, with closeness being measured by the Kullback-Leibler (KL) divergence, defined by

\begin{matrix} KL [q (x) | | p (x)] = E_{q} [log q (x) - log p (x)], \end{matrix}

where

E_{q} [f (x)] = \int f (x) q (x) d x

denotes the expectation of f over q. Denoting by

Q

a family of distributions, we look for

\begin{matrix} \underset{q \in Q}{arg min} KL [q (x) | | p (x | y)] . \end{matrix}

Since

p (y)

is not computable in an efficient way, we equivalently minimize the upper bound

F

:

\begin{matrix} KL [q (x) | | p (x | y)] \leq F [q] = - E_{q} [log p (y | x) p (x)] - H_{q}, \end{matrix}

(1)

where

H_{q}

is the entropy of q (

- E_{q} [log q (x)]

). Here,

F

is known as the variational free energy and

- F

is known as the Evidence Lower BOund (ELBO). A diverse set of approaches to perform VI with Gaussian families

Q

has been developed in the literature, which we review in the following.

2.1. The Variational Gaussian Approximation

The VGA is the restriction of

Q

to be the family of multivariate Gaussian distributions

q (x) = N (m, C)

, where

m \in R^{D}

is the mean and

C \in {A \in R^{D \times D} | x^{⊤} A x \geq 0, \forall x \in R^{D}}

is the covariance matrix, for which the free energy is found to be

\begin{matrix} F [q] = - \frac{1}{2} log | C | + E_{q} [φ (x)] . \end{matrix}

(2)

where

φ (x) = - log (p (y | x) p (x))

. A standard descent algorithm based on gradients of Equation (2) with respect to variational parameters

m, C

gives rise to several issues. First, naively computing the gradient of the expectation with respect to the covariance matrix C involves unwanted second derivatives of

φ (x)

[12], which may not be available or may be computationally too expensive in a black-box setting. Second, the gradient of the entropy term

H_{q}

entails inverting a non-sparse matrix, which we would like to avoid for higher-dimensional cases. Finally, the positive-definiteness of the covariance matrix leads to non-trivial constraints on parameter updates, which can lead to a slowdown of convergence or, if ignored, to instabilities in the algorithm.

To solve these issues, a variety of approaches have been proposed in the literature. If we focus on factorizable models, we can make a simplification: for problems with likelihoods that can be rewritten as

p (y | x) = \prod_{d = 1}^{D} p (y | x_{d})

, the number of independent variational parameters is reduced to

2 D

[12,13]. In this special case, the Gaussian expectations in the free energy (2) split into a sum of 1-dimensional integrals, which can be efficiently computed by using numerical quadrature methods. To extend to the general case, gradients of the free energy are estimated by a stochastic sampling approach, which also forms the starting point of our method. This relies on the so-called reparametrization trick, where the expectation over the parameter-dependent variational density

q_{θ}

is replaced by an expectation over a fixed density

q^{0}

instead. This facilitates the gradient computation because unwanted derivatives of the type

\nabla_{θ} q_{θ} (x)

are avoided. For the Gaussian case, the reparametrization trick is a linear transformation expressing an arbitrary D-dimensional Gaussian random variable

x \sim q_{θ} (x)

in terms of a D-dimensional Gaussian random variable

x^{0} \sim q^{0} = N (m^{0}, C^{0})

:

\begin{matrix} x = Γ (x^{0} - m^{0}) + m, \end{matrix}

(3)

where

Γ \in R^{D \times D}

and

m \in R^{D}

are the variational parameters. We assume that the covariance

C^{0}

is not degenerate and, for simplicity, we set it to the identity. For instance, the gradient, with respect to the mean m, of the expectation of a function f under q becomes

\nabla_{m} E_{q} [f (x)] = E_{q^{0}} [\nabla_{m} f (Γ (x^{0} - m^{0}) + m)]

. This can be simply proved by using the reparametrization (3) inside the integral and passing the gradient inside; for more details, see [14].
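The identity can be checked numerically. The sketch below is an illustration rather than code from the paper: it assumes q⁰ = N(0, I) (so m⁰ = 0), a toy function f(x) = ½‖x‖² whose exact gradient with respect to m is m itself, and a hypothetical helper named `reparam_grad_mean`.

```python
import numpy as np

def reparam_grad_mean(grad_f, m, Gamma, x0):
    """Monte Carlo estimate of grad_m E_q[f(x)] via x = Gamma @ x0 + m.

    x0 has shape (S, D): S samples from the fixed base density q0 = N(0, I).
    grad_f maps an (S, D) array of points to their (S, D) gradients.
    """
    x = x0 @ Gamma.T + m           # reparametrized samples from q = N(m, Gamma Gamma^T)
    return grad_f(x).mean(axis=0)  # dx/dm = I, so the gradient passes straight through

rng = np.random.default_rng(0)
D, S = 3, 200_000
m = np.array([1.0, -2.0, 0.5])
Gamma = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.3, 2.0]])
x0 = rng.standard_normal((S, D))

# Toy test function f(x) = 0.5 * ||x||^2, whose gradient is x itself.
estimate = reparam_grad_mean(lambda x: x, m, Gamma, x0)
```

Because the samples x⁰ do not depend on m, the estimate is a smooth function of m and Γ, which is what makes the gradient computation straightforward.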

Given this representation, the free energy is easily obtained as a function of the variational parameters:

\begin{matrix} F (q) = - log | Γ | + E_{q^{0}} [φ (Γ (x^{0} - m^{0}) + m)] . \end{matrix}

(4)

Other representations are possible. Challis and Barber [13] and Ong et al. [15] use a different reparametrization with a factorized structure of the covariance

C = Γ Γ^{⊤} + diag (d)

, where

Γ \in R^{D \times P}

and

d \in R^{D}

, and

P \leq D

is the rank of

Γ Γ^{⊤}

. Other representations assume special structures of the precision matrix

Λ = C^{- 1}

, which allow one to enforce special properties such as sparsity [16,17].

In general, these methods tend to scale poorly with the number of dimensions, as one needs to optimize

D (D + 3) / 2

parameters. The (structured) Mean-Field (MF) [18,19] approach imposes independence between variables in the variational distribution. The number of variational parameters is then

2 D

, but covariance information between dimensions is lost.

2.2. Natural Gradients

Besides the issue of expectations, more efficient optimization directions, beyond ordinary gradient descent, have been considered. These can help to deal with constraints such as those on the covariance matrix. Natural gradients [20] are a special case of Riemannian gradients and exploit the specific Riemannian manifold structure of the variational parameters. They can often deal with parameter constraints (such as the positive definiteness of the covariance), accelerate inference, and improve the convergence of algorithms. The application of such advanced gradient methods typically requires an estimate of the inverse Fisher information matrix as a preconditioner of ordinary gradients. Khan and Nielsen [21] and Lin et al. [22] propose a solution that requires extra second derivatives of the log-posterior. Salimbeni et al. [23] developed an automatic process to compute these without the second derivatives, but with instability issues. Lin et al. [17] solved these issues by using geodesics on the manifold of parameters, at the price of having to compute inverse matrices as well as Hessians.

2.3. Particle-Based VI

Stochastic gradient descent methods compute expectations (and gradients) at each time step with new independent Monte Carlo samples drawn from the current approximation of the variational density. Particle-based methods for variational inference draw samples only once at the beginning of the algorithm instead. They iteratively construct transformations of an initial random variable (having a simple tractable density) where the transformed density leads to the decrease and finally to the minimum of the variational free energy. The iterative approach induces a deterministic temporal flow of random variables which depends on the current density of the variable itself. Using an approximation by the empirical density (which is represented by the positions of a set of ’particles’) one obtains a flow of interacting particles which converges asymptotically to an empirical approximation of the desired optimal variational density.

The most popular approach is Stein Variational Gradient Descent (SVGD) [24], which computes a nonparametric transformation based on the kernelized Stein discrepancy [9]. SVGD has the advantage of not being restricted to a parametric form of the variational distribution. However, using standard distance-based kernels like the squared exponential kernel (

k (x, y) = exp (- ∥ x - y ∥_{2}^{2} / 2)

) can lead to underestimated covariances and poor performance in high dimensions [11,25]. Hence, it is interesting to develop particle approaches that approximate the VGA. We provide a more thorough comparison between our method and SVGD in Section 3.6.

2.4. VGA in Bayesian Neural Networks

There has been increased interest in building Bayesian Neural Networks (BNNs) by placing priors on neural network parameters. The true form of the posterior is unknown, but the VGA has been used due to its ease of use and scalability with the number of dimensions (typically

D ≫ 10^{5}

). Most of the aforementioned methods apply to BNNs, but some techniques have been tailored specifically to them. The approach of [26] uses the low-rank structure of [13] but exploits the Local Reparametrization Trick, where each datapoint

y_{i}

gets a different sample from q in order to reduce the variance of the stochastic gradient estimator. In Stochastic Weight Averaging-Gaussian (SWAG) [27], a set of particles obtained via stochastic gradient descent represents a low-rank Gaussian distribution, approximating the true posterior with a prior implied by the network’s regularization. While easy to implement, SWAG does not allow one to incorporate an explicit prior, and the resulting distribution does not derive from a principled Bayesian approach.

2.5. Related Approaches

The closest approach to our proposed method is the Ensemble Kalman Filter (EKF) [28]. It assumes that the posterior is computed sequentially, where, at each time step, only a single data observation (or a small batch of observations), represented by its likelihood, becomes available. An ensemble of particles representing a Gaussian distribution is iteratively updated with every new batch of observations. EKF allows us to work on high-dimensional problems with a limited number of particles but is restricted to factorizable likelihoods for which a sequential representation is possible. While EKF maintains a representation of a Gaussian posterior, it is not clear how this relates to the goal of minimizing the free energy or the KL divergence.

3. Gaussian (Particle) Flow

We introduce Gaussian Particle Flow (GPF) and Gaussian Flow (GF), two computationally tractable approaches to obtain a Variational Gaussian Approximation (VGA). In the following, we derive deterministic linear dynamics that decrease the variational free energy. We additionally give some variants with a Mean-Field (MF) approach and prove theoretical convergence guarantees.

In the following,

\frac{d (\cdot)}{d t}

indicates the total derivative with respect to time,

\frac{\partial (\cdot)}{\partial t}

the partial derivative with respect to time, and

\nabla_{x} (\cdot)

the gradient with respect to a vector x.

3.1. Gaussian Variable Flows

We next discuss an alternative approach to generate the desired transformation of random variables, leading from a simple (prior) Gaussian density to a more complex Gaussian, which minimizes the variational free energy. It is based on the idea of variable flows, i.e., recursive deterministic transformations of the random variables defined by a mapping

x^{n + 1} = x^{n} + ϵ f^{n} (x^{n})

where

f^{n} : R^{D} \to R^{D}

. Well-known examples of flows are Normalizing Flows [29], where

f^{n}

are bijections, or Neural ODEs [30] where

f^{n} = f

is defined by a neural network and

x^{0}

is the input. For simplicity, we will consider small changes

ϵ \to 0

and work with flows in the continuous-time limit (

t = n ϵ

), which follow a system of Ordinary Differential Equations (ODEs). For the Gaussian case, in the spirit of the reparametrization trick (3), we choose a corresponding linear map f and write

\begin{matrix} \frac{d x^{t}}{d t} = f^{t} (x^{t}) = A^{t} (x^{t} - m^{t}) + b^{t}, \end{matrix}

(5)

where

A^{t}

is a matrix and

m^{t} ≐ E_{q^{t}} [x]

(which is no longer interpreted as an independent variational parameter). When the initial random variable

x^{0}

is Gaussian distributed, the vectors

x^{t}

are also Gaussian for any t. To construct a flow that decreases the free energy over time, we can either compute the time derivative of the specific free energy (2) induced by the ODE (5), or simply derive the general result valid for smooth maps f (see, e.g., [24]). To be self-contained, we briefly repeat the main steps. We first compute the change in the free energy in terms of the time derivative of

q^{t}

:

\begin{matrix} \frac{d F [q^{t}]}{d t} = & \frac{d}{d t} \int q^{t} (x) (log q^{t} (x) + φ (x)) d x \\ = & \int \frac{\partial q^{t} (x)}{\partial t} (log q^{t} (x) + φ (x)) d x + \int q^{t} (x) (\frac{\partial q^{t} (x)}{\partial t} \frac{1}{q^{t} (x)} + \frac{\partial φ (x)}{\partial t}) d x \\ = & \int \frac{\partial q^{t} (x)}{\partial t} (log q^{t} (x) + φ (x)) d x \end{matrix}

where we have used the fact that

\int \frac{\partial q^{t} (x)}{\partial t} d x = \frac{d}{d t} \int q^{t} (x) d x = 0

and

\frac{\partial φ (x)}{\partial t} = 0

. We next use the continuity equation for the density

\begin{matrix} \frac{\partial q^{t} (x)}{\partial t} = - \nabla_{x} \cdot (q^{t} (x) f^{t} (x)), \end{matrix}

related to the deterministic flow to obtain

\begin{matrix} \frac{d F [q^{t}]}{d t} = & - \int \nabla_{x} \cdot (q^{t} (x) f^{t} (x)) (log q^{t} (x) + φ (x)) d x \\ = & \int q^{t} (x) f^{t} (x) \cdot \nabla_{x} (log q^{t} (x) + φ (x)) d x \\ = & \int (f^{t} (x) \cdot \nabla_{x} q^{t} (x) + q^{t} (x) f^{t} (x) \cdot \nabla_{x} φ (x)) d x \\ = & - E_{q^{t}} [\nabla_{x} \cdot f^{t} (x) - f^{t} (x) \cdot \nabla_{x} φ (x)] \end{matrix}

where we have applied Green’s identity twice and used the fact that

{lim}_{∥ x ∥ \to \infty} q^{t} (x) = 0

. Specializing to the linear flow (5), we obtain

\begin{matrix} \frac{d F [q^{t}]}{d t} = - tr [A^{t} {(A_{★}^{t})}^{⊤}] - {(b^{t})}^{⊤} b_{★}^{t}, \end{matrix}

(6)

where

\begin{matrix} A_{★}^{t} ≐ & I - E_{q^{t}} [\nabla_{x} φ (x) {(x - m^{t})}^{⊤}] \\ b_{★}^{t} ≐ & - E_{q^{t}} [\nabla_{x} φ (x)] \end{matrix}

(7)

Equation (6) represents the change in the free energy

F

for an infinitesimal change in the variables x given by the flow (5). Obviously, the simplest choices

\begin{matrix} A^{t} \equiv A_{★}^{t}, \quad b^{t} \equiv b_{★}^{t} \end{matrix}

(8)

lead to a decrease in the free energy

\frac{d F [q^{t}]}{d t} \leq 0

. More detailed derivations are given in Appendix A. Equality holds only when

\begin{matrix} I - E_{q} [\nabla_{x} φ (x) {(x - m)}^{⊤}] = 0 \\ E_{q} [\nabla_{x} φ (x)] = 0 \end{matrix}

(9)

Using Stein’s lemma [31], we can show that these fixed-point conditions coincide with the conditions for the optimal variational Gaussian distribution given in [12]. In Appendix C, we show that our parameter updates can be interpreted as a Riemannian gradient descent method for the free energy (4). This is based on the metric introduced in ([20], Theorem 7.6) as an efficient technique for learning the mixing matrix in models of blind source separation. This gradient should not be confused with the so-called natural gradient obtained by pre-multiplying with the inverse Fisher information matrix.

Of course, there are other choices for

A^{t}

and

b^{t}

, which lead to a decrease in the free energy and the same fixed-point equations. In Section 3.6, we discuss how SVGD, with a linear kernel, can lead to the same fixed points but with different dynamics.
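As a worked example of the fixed-point conditions (9), consider a Gaussian target with mean μ and precision matrix Λ (symbols assumed here for illustration). The short derivation below is a sketch showing that the conditions then recover the target's mean and covariance:

```latex
\begin{aligned}
\varphi(x) &= \tfrac{1}{2} (x - \mu)^{\top} \Lambda (x - \mu) + \text{const},
  \qquad \nabla_{x} \varphi(x) = \Lambda (x - \mu), \\
E_{q}[\nabla_{x} \varphi(x)] &= \Lambda (m - \mu) = 0
  \;\Longrightarrow\; m = \mu, \\
E_{q}[\nabla_{x} \varphi(x) (x - m)^{\top}]
  &= \Lambda\, E_{q}[(x - m)(x - m)^{\top}] + \Lambda (m - \mu)\, E_{q}[(x - m)^{\top}]
   = \Lambda C, \\
I - \Lambda C &= 0 \;\Longrightarrow\; C = \Lambda^{-1}.
\end{aligned}
```

The second term in the third line vanishes because the expectation of x − m under q is zero, so no Gaussian assumption on q is needed for this step.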

3.2. From Variable Flows to Parameter Flows

Before we introduce the particle algorithm, we show that the results for the variable flow can also be converted into a temporal change of the parameters

Γ^{t}

,

m^{t}

, as defined for Equation (3). From this, a corresponding Gaussian Flow (GF) algorithm can be easily derived. By differentiating the parametrisation

x^{t} = Γ^{t} (x^{0} - m^{0}) + m^{t}

(with

m^{t}

now considered as a free variational parameter) with respect to time t and using (5), we obtain

\begin{matrix} \frac{d x^{t}}{d t} = \frac{d Γ^{t}}{d t} (x^{0} - m^{0}) + \frac{d m^{t}}{d t} = A^{t} (x^{t} - m^{t}) + b^{t} \end{matrix}

(10)

By inserting

x^{t} = Γ^{t} (x^{0} - m^{0}) + m^{t}

into the right hand side of (10), and using the optimal parameters from (7), we obtain

\begin{matrix} \frac{d Γ^{t}}{d t} = & Γ^{t} - E_{q^{0}} [\nabla_{x} φ (x^{t}) {(x^{0} - m^{0})}^{⊤}] {(Γ^{t})}^{⊤} Γ^{t} \\ \frac{d m^{t}}{d t} = & - E_{q^{0}} [\nabla_{x} φ (x^{t})] \end{matrix}

(11)

Note that the expectations are over the probability distribution of the initial random variable

x^{0}

. Discretizing Equations (11) in time, and estimating the expectations by drawing independent samples from the fixed Gaussian

q^{0}

at each time step, we obtain our GF algorithm to minimize the variational free energy in the space of Gaussian densities. We summarize the steps of GF in Algorithm 1. Notably, this scheme differs from previous VGA algorithms with Riemannian gradients based on the Fisher information metric (see, e.g., [17,32]) because no matrix inversions or second-order derivatives of the function

φ

are required.

GF also allows for the computation of a low-rank VGA by enforcing

Γ \in R^{D \times K}

and

x^{0} \in R^{K}

. This algorithm scales linearly in the number of dimensions and quadratically in the rank K of the covariance.
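The parameter flow can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it takes q⁰ = N(0, I), uses plain Euler steps with a fixed step size `eta`, writes the matrix update as dΓ/dt = A★Γ with A★ from (7), and tests on a toy Gaussian target. The function name `gaussian_flow` and all default values are hypothetical.

```python
import numpy as np

def gaussian_flow(grad_phi, m, Gamma, n_steps=2000, n_samples=500, eta=0.05, seed=0):
    """Euler discretization of the parameter flow with q0 = N(0, I).

    grad_phi maps an (S, D) array of points to their (S, D) gradients of phi.
    Fresh samples x0 ~ q0 are drawn at every step (stochastic estimates).
    """
    rng = np.random.default_rng(seed)
    K = Gamma.shape[1]
    for _ in range(n_steps):
        x0 = rng.standard_normal((n_samples, K))   # x0 ~ N(0, I_K)
        x = x0 @ Gamma.T + m                       # x^t = Gamma x0 + m
        g = grad_phi(x)                            # gradients, shape (S, D)
        psi = g.T @ x0 / n_samples                 # estimate of E[grad_phi(x^t) x0^T], (D, K)
        # dGamma/dt = A* Gamma = Gamma - psi (Gamma^T Gamma); only a K x K product is formed.
        Gamma = Gamma + eta * (Gamma - psi @ (Gamma.T @ Gamma))
        m = m - eta * g.mean(axis=0)               # dm/dt = -E[grad_phi(x^t)]
    return m, Gamma

# Toy Gaussian target N(mu, Lam^{-1}) with phi(x) = 0.5 (x - mu)^T Lam (x - mu).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])
m, Gamma = gaussian_flow(lambda x: (x - mu) @ Lam, np.zeros(2), np.eye(2))
C = Gamma @ Gamma.T   # close to inv(Lam), up to Monte Carlo noise
```

Passing a D × K matrix for `Gamma` with K < D gives the low-rank variant; the only square matrix formed is the K × K product `Gamma.T @ Gamma`.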

It is interesting to note that the reverse construction of a variable flow from a parameter flow is, in general, not possible. This would require the ability to eliminate all variational parameters and the initial variables

x^{0}

in the resulting differential equation for

x^{t}

, and replace them with functions of

x^{t}

alone. For instance, if we eliminate the initial variables

x^{0}

in terms of

{(Γ^{t})}^{- 1}

and

x^{t}

in the algorithm of [14], the resulting expression still depends on

Γ^{t}

.

3.3. Particle Dynamics

The main idea of the particle approach is to approximate the Gaussian density

q^{t}

in (7) by the empirical distribution

\begin{matrix} {\hat{q}}^{t} ≐ \frac{1}{N} \sum_{i = 1}^{N} δ (x - x_{i}^{t}) \end{matrix}

(12)

computed from N samples

x_{i}^{t}

,

i = 1, \dots, N

. These are initially sampled from the density

q^{0}

at time

t = 0

and are then propagated using the discretized dynamics of the ODE (5):

\begin{matrix} \frac{d x_{i}^{t}}{d t} = - η_{1}^{t} E_{{\hat{q}}^{t}} [\nabla_{x} φ (x)] + η_{2}^{t} {\hat{A}}^{t} (x_{i}^{t} - {\hat{m}}^{t}) \end{matrix}

(13)

where

\begin{matrix} {\hat{A}}^{t} = I - \frac{1}{N} \sum_{i = 1}^{N} \nabla_{x} φ (x_{i}^{t}) {(x_{i}^{t} - {\hat{m}}^{t})}^{⊤}, \\ {\hat{b}}^{t} = E_{{\hat{q}}^{t}} [\nabla_{x} φ (x)] = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{x} φ (x_{i}^{t}), {\hat{m}}^{t} = \frac{1}{N} \sum_{i = 1}^{N} x_{i}^{t} . \end{matrix}

Here,

η_{1}^{t}

and

η_{2}^{t}

are learning rates (we further comment on the use of different optimization schemes in Section 4.4). Note that although

E_{{\hat{q}}^{t}} [\nabla_{x} φ (x) {(x - {\hat{m}}^{t})}^{⊤}]

is a

D \times D

matrix, changing the matrix multiplication order leads to a computational complexity of

O (N^{2} D)

with a storage complexity of

O (N (N + D))

, since neither the empirical covariance matrix nor

{\hat{A}}^{t}

needs to be explicitly computed.
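The reordering of the matrix products can be made concrete. The sketch below compares two ways of evaluating the particle velocities implied by (5), (7), and (8) with expectations replaced by particle averages; it assumes unit learning rates, particles stored as rows of an N × D array, and random stand-in gradients. Both function names are hypothetical.

```python
import numpy as np

def gpf_velocity_naive(X, G):
    """Particle velocities using an explicit D x D matrix A_hat (O(N D^2) time)."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)               # centred particles, rows x_i - m_hat
    A_hat = np.eye(D) - G.T @ Xc / N      # I - (1/N) sum_j grad_j (x_j - m_hat)^T
    return -G.mean(axis=0) + Xc @ A_hat.T

def gpf_velocity_fast(X, G):
    """Same velocities via the N x N Gram matrix (O(N^2 D) time, no D x D array)."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    gram = Xc @ Xc.T                      # gram[i, j] = (x_i - m_hat) . (x_j - m_hat)
    return -G.mean(axis=0) + Xc - gram @ G / N

rng = np.random.default_rng(1)
N, D = 5, 40                              # fewer particles than dimensions
X = rng.standard_normal((N, D))
G = rng.standard_normal((N, D))           # stand-in for the gradients grad phi(x_i)
same = np.allclose(gpf_velocity_naive(X, G), gpf_velocity_fast(X, G))
```

The fast version stores only the N × N Gram matrix alongside the N × D particle and gradient arrays, which matches the stated O(N(N + D)) storage.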

Relaxation of Empirical Free Energy and Convergence

We have shown that the continuous-time dynamics (10) of the random variables leads to a decay of the free energy

F (q^{t})

with time t. Assuming that the free energy is bounded from below, one might conjecture that this property would imply the convergence of the particle algorithm to a fixed point when learning rates are sufficiently small such that the discrete-time dynamics are approximated well by the continuous limit. Unfortunately, the finite number N of particles poses an extra problem. The definition of the free energy

F (q)

by the KL divergence (1) for continuous random variables assumes that both

q (\cdot)

and

p (\cdot | y)

are densities with respect to the Lebesgue measure. Hence,

F (\hat{q})

is not defined if we take

q \equiv \hat{q}

, the empirical distribution (12) of the finite particle approximation. Nevertheless, we define a finite-N approximation to the Gaussian free energy, which is then also found to decay under the finite-N dynamics. Let us first assume that

N > D

and define

\begin{matrix} \tilde{F} ({\hat{q}}^{t}) ≐ - \frac{1}{2} log | {\hat{C}}^{t} | + E_{{\hat{q}}^{t}} [φ (x)] \end{matrix}

(14)

with the empirical covariance matrix

\begin{matrix} {\hat{C}}^{t} = \frac{1}{N} \sum_{i = 1}^{N} (x_{i}^{t} - {\hat{m}}^{t}) {(x_{i}^{t} - {\hat{m}}^{t})}^{⊤} \end{matrix}

(15)

The definition (14) is chosen in such a way that, in the large-N limit, when the empirical distribution

{\hat{q}}^{t}

converges to a Gaussian distribution

q^{t}

, we will also obtain the convergence of the approximation (14) to

F (q^{t})

. It can be shown (see Appendix B) that

\frac{d \tilde{F} ({\hat{q}}^{t})}{d t} \leq 0

, with equality only at the fixed points of the dynamics.

In applications of our particle method to high-dimensional problems, the limitations of computational power may force us to restrict particle numbers to be smaller than the dimensionality D. For

N < D + 1

, the empirical covariance

{\hat{C}}^{t}

will be singular and will typically have only

N - 1

non-zero eigenvalues, which leads to

- log |\hat{C}| = \infty

and makes Equation (14) meaningless. We resolve this issue through a regularisation of the log-determinant term in (14), replacing all zero eigenvalues of

\hat{C}

by the value 1, i.e.,

λ_{i} = 0 \to {\tilde{λ}}_{i} = 1

. We show in Appendix B that the free energy still decays, provided that the dynamics of the particles stay the same. This regularisation step can be formally stated as a replacement of the empirical covariance (15) in (14) by

\begin{matrix} {\hat{C}}^{t} \to {\hat{C}}^{t} + \sum_{i : λ_{i}^{t} = 0} e_{i}^{t} {(e_{i}^{t})}^{⊤} \end{matrix}

where

e_{i}^{t}

is the i-th eigenvector of

{\hat{C}}^{t}

.
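One way to evaluate the regularised log-determinant is sketched below. It is an illustration, not the authors' code: it assumes particles stored as rows of an N × D array, uses the fact that the non-zero eigenvalues of the D × D empirical covariance coincide with those of the N × N matrix of centred inner products, and treats eigenvalues below a relative tolerance `rel_tol` (an assumed numerical threshold) as zero. The function name is hypothetical.

```python
import numpy as np

def regularized_logdet(X, rel_tol=1e-10):
    """Sum of log-eigenvalues of the empirical covariance, skipping zero ones.

    Replacing each zero eigenvalue by 1 adds log(1) = 0, so only the non-zero
    spectrum contributes. The non-zero eigenvalues of C_hat = Xc^T Xc / N
    (D x D) coincide with those of the N x N matrix Xc Xc^T / N.
    """
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    lam = np.linalg.eigvalsh(Xc @ Xc.T / N)      # N eigenvalues, ascending
    keep = lam > rel_tol * lam.max()             # numerical test for "non-zero"
    return np.log(lam[keep]).sum(), int(keep.sum())

rng = np.random.default_rng(2)
X_low = rng.standard_normal((4, 10))             # N = 4 < D + 1 = 11
logdet_low, rank_low = regularized_logdet(X_low)

X_full = rng.standard_normal((30, 3))            # N = 30 > D = 3
logdet_full, rank_full = regularized_logdet(X_full)
```

For N > D the result agrees with the ordinary log-determinant of the empirical covariance, and for N < D + 1 exactly N − 1 eigenvalues are typically retained.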

3.4. Algorithm and Properties

The algorithm we propose is to sample N particles

{x_{1}^{0}, \dots, x_{N}^{0}}

where

x_{i}^{0} \in R^{D}

from

q^{0}

(which can be centered around the MAP for example), and iteratively optimize their positions using Equation (13). Once convergence is reached, i.e.,

\frac{d F}{d t} = 0

, we can easily make predictions using the converged empirical distribution

\hat{q} (x) = \frac{1}{N} \sum_{i = 1}^{N} δ (x - x_{i})

, where

δ

is the Dirac delta function, or, alternatively, the Gaussian density it represents, i.e.,

q (x) = N (m, C)

, where

m = \frac{1}{N} \sum_{i = 1}^{N} x_{i}

and

C = \frac{1}{N} \sum_{i = 1}^{N} (x_{i} - m) {(x_{i} - m)}^{⊤}

. To draw samples from the Gaussian q, no inversion or factorization of the empirical covariance C is needed, as we can obtain new samples by computing:

\begin{matrix} x = \frac{1}{\sqrt{N}} \sum_{i = 1}^{N} ξ_{i} (x_{i} - m) + m, \end{matrix}

(16)

where

ξ_{i}

are i.i.d. scalar standard normal variables:

ξ_{i} \sim N (0, 1)

. This can be shown by defining the deviation matrix D, the matrix whose columns are

D_{i} = \frac{x_{i} - m}{\sqrt{N}}

. We naturally have

D D^{⊤} = C

, which makes D a square-root factor of C that plays the role of a Cholesky factor.
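The sampling rule (16) can be sketched as follows. This is an illustration under the assumption that each particle receives one scalar standard normal weight ξ_i, which is the choice that reproduces the covariance C; the function name and the toy particles are hypothetical.

```python
import numpy as np

def sample_from_particles(X, n_samples, rng):
    """Draw from N(m, C) implied by particles X (N, D) without factorizing C."""
    N = X.shape[0]
    m = X.mean(axis=0)
    dev = (X - m) / np.sqrt(N)                 # rows are (x_i - m) / sqrt(N)
    xi = rng.standard_normal((n_samples, N))   # one scalar weight per particle
    return m + xi @ dev                        # m + sum_i xi_i (x_i - m) / sqrt(N)

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 2)) @ np.array([[1.0, 0.8], [0.0, 0.5]])
m = X.mean(axis=0)
C = (X - m).T @ (X - m) / X.shape[0]           # empirical covariance of the particles

samples = sample_from_particles(X, 200_000, rng)
C_samples = np.cov(samples, rowvar=False)      # approximates C for many samples
```

Each draw costs O(ND), and the N × D deviation array is reused across draws.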

All the inference steps are summarized in Algorithm 2 and an illustration in two dimensions is provided in Figure 1.

We summarize the principal points of our approach:

Gradients of expectations have zero variance, at the cost of a bias that decreases with the number of particles and is equal to zero for Gaussian targets (see Theorem 1);
It works with noisy gradients (when subsampling data, for example);
The rank of the approximated covariance C is $min (N - 1, D)$. When $N \leq D$, the algorithm can be used to obtain a low-rank approximation;
The computational complexity of our algorithm is $O (N^{2} D)$ and its storage complexity is $O (N (N + D))$. By adjusting the number of particles used, we can control the performance trade-off;
GPF (and GF) are also compatible with any kind of structured MF (see Section 3.5);
Despite working with an empirical distribution, we can compute a surrogate of the free energy $F (q)$ to optimize hyper-parameters, compute the lower bound of the log-evidence, or simply monitor convergence.

Algorithm 1: Gaussian Flow (GF)

Algorithm 2: Gaussian Particle Flow (GPF)
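The body of Algorithm 2 is not reproduced in the text here, so the following is only a sketch of what such a loop could look like. It assumes fixed and equal learning rates (`eta`), plain Euler steps, and particle velocities given by (5), (7), and (8) with expectations replaced by particle averages. The function name and the toy Gaussian target are assumptions made for the example.

```python
import numpy as np

def gaussian_particle_flow(grad_phi, X, n_steps=2000, eta=0.05):
    """Fixed-step Euler iteration of the particle flow; X has shape (N, D)."""
    N = X.shape[0]
    for _ in range(n_steps):
        G = grad_phi(X)                          # gradients at the particles, (N, D)
        Xc = X - X.mean(axis=0)                  # rows x_i - m_hat
        # velocity_i = -mean(G) + A_hat (x_i - m_hat), with A_hat never formed
        X = X + eta * (-G.mean(axis=0) + Xc - (Xc @ Xc.T) @ G / N)
    return X

# Toy Gaussian target N(mu, Lam^{-1}) with grad phi(x) = Lam (x - mu).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])
rng = np.random.default_rng(4)
X0 = rng.standard_normal((3, 2))                 # N = D + 1 = 3 particles from q0 = N(0, I)
X = gaussian_particle_flow(lambda x: (x - mu) @ Lam, X0)

m_hat = X.mean(axis=0)
C_hat = (X - m_hat).T @ (X - m_hat) / X.shape[0]
```

With N = D + 1 = 3 particles, the empirical mean and covariance approach μ and Λ⁻¹ in this toy problem, in line with Theorem 1.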

3.4.1. Relaxation of Empirical Free Energy

The definition of the free energy

F (q)

from the KL divergence (1) for continuous random variables assumes that both

q (\cdot)

and

p (\cdot | y)

are densities with respect to the Lebesgue measure. Hence, it is not a priori clear that a specific approximation

F ({\hat{q}}^{t})

, based on an empirical distribution

{\hat{q}}^{t} (x) ≐ \frac{1}{N} \sum_{i = 1}^{N} δ (x - x_{i}^{t})

with a finite number of particles N, will decrease under the particle flow. Thus we may not be able to guarantee convergence to a fixed point for finite N. Luckily, as we show in Appendix D, we find that:

\begin{matrix} \frac{d F ({\hat{q}}^{t})}{d t} = \frac{d}{d t} (E_{{\hat{q}}^{t}} [φ (x)] - \frac{1}{2} log |C^{t}|) \leq 0 . \end{matrix}

(17)

For

N < D + 1

, the empirical covariance

C^{t}

will typically contain

N - 1

non-zero eigenvalues and lead to

- log |C| = \infty

, making Equation (17) meaningless. We resolve this issue by introducing a regularized free energy

\tilde{F}

where

log |C^{t}|

is replaced by

\sum_{i : λ_{i} > 0} log λ_{i}

where

{λ_{i}}_{i = 1}^{D}

are the eigenvalues of

C^{t}

. We show in Appendix D that, given the dynamics from Equation (5),

\tilde{F}

is also guaranteed to not increase over time. It can, therefore, be used as a regularized proxy for the true

F

and used to optimize over hyper-parameters or to monitor convergence. Note that similar proofs exist for SVGD [33] and have proved to be highly non-trivial.

3.4.2. Dynamics and Fixed Points for Gaussian Targets

We illustrate our method by some exact theoretical results for the dynamics and the fixed points of our algorithm when the target is a multivariate Gaussian density. While such targets may seem like a trivial application, our analysis could still provide some insight into the performance for more complicated densities.

Theorem 1.

If the target density

p (x)

is a D-dimensional multivariate Gaussian, only

D + 1

particles are needed for Algorithm 2 to converge to the exact target parameters.

Proof.

The proof is given in Appendix E. □

Theorem 2.

For a target

p (x) = N (x ∣ μ, Λ^{- 1})

, i.e., with precision matrix Λ, where

x \in R^{D}

, and

N \geq D + 1

particles, the continuous time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix:

\begin{matrix} m^{t} - μ = & e^{- Λ t} (m^{0} - μ), \\ tr ({(C^{t})}^{- 1} - Λ) = & e^{- 2 t} tr ({(C^{0})}^{- 1} - Λ), \end{matrix}

where

m^{t}

and

C^{t}

are the empirical mean and covariance matrix at time t and

exp (- Λ t)

is the matrix exponential.

Proof.

The proof is given in Appendix F. □

Our result shows that convergence of the mean $m^{t}$ directly depends on $Λ$. However, we can also precondition the gradient on m by $C^{t}$, i.e., using the natural gradient approximation in the Fisher sense, and eventually get rid of the dependency on $Λ$ when ${(C^{t})}^{- 1} \approx Λ$.
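The two rates in Theorem 2 can be checked numerically in the simplest setting. The sketch below is only an illustration under stated assumptions: a one-dimensional Gaussian target (so the trace reduces to a scalar), $N = D + 1 = 2$ particles, and a plain Euler discretization of the linear flow with a small step. The function name `simulate_gpf_1d` and all numerical values are hypothetical and are not taken from the authors' code.

```python
import math

def simulate_gpf_1d(particles, mu, lam, dt=1e-4, t_end=1.0):
    """Euler-integrate the linear particle flow for a 1-D Gaussian target.

    Target: N(mu, 1/lam), so grad phi(x) = lam * (x - mu).
    Returns the empirical mean and variance at time t_end.
    """
    x = list(particles)
    n = len(x)
    for _ in range(round(t_end / dt)):
        m = sum(x) / n
        c = sum((xi - m) ** 2 for xi in x) / n   # empirical variance (1/N normalisation)
        b = -lam * (m - mu)                       # -E[grad phi(x)]
        a = 1.0 - lam * c                         # 1 - E[grad phi(x) (x - m)]
        x = [xi + dt * (b + a * (xi - m)) for xi in x]
    m = sum(x) / n
    c = sum((xi - m) ** 2 for xi in x) / n
    return m, c

mu, lam = 1.0, 2.0
x0 = [0.0, 4.0]        # N = D + 1 = 2 particles
m0, c0 = 2.0, 4.0      # empirical mean and variance of x0
t = 1.0
m_t, c_t = simulate_gpf_1d(x0, mu, lam, t_end=t)

# Closed-form predictions of Theorem 2, specialised to one dimension.
mean_pred = mu + math.exp(-lam * t) * (m0 - mu)
prec_pred = lam + math.exp(-2.0 * t) * (1.0 / c0 - lam)
print(m_t, mean_pred)
print(1.0 / c_t, prec_pred)
```

The simulated and predicted values should agree up to the Euler discretization error, which shrinks with `dt`.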

The exponential relaxation of fluctuations also manifests itself in the decay of the free energy towards its minimum. For the Gaussian target, the free energy exactly separates into two terms corresponding to the mean and fluctuations. We can write $F (m^{t}, C^{t}) = \frac{1}{2} {(m^{t} - μ)}^{⊤} Λ (m^{t} - μ) + \frac{D}{2} + F_{f l} (C^{t})$, where the nontrivial fluctuation part (with its minimum subtracted) is given by

\begin{matrix} F_{f l} (C^{t}) = - \frac{1}{2} log |C^{t}| + \frac{1}{2} tr (Λ C^{t} - I) . \end{matrix}

We can show that

\begin{matrix} - lim_{t \to \infty} \frac{d ln F_{f l} (C^{t})}{d t} \geq 4, \end{matrix}

indicating an asymptotic decrease in $F_{f l} (C^{t})$ at least as fast as $e^{- 4 t}$, independent of the target. We can also prove the finite time bound

\begin{matrix} F_{f l} (C^{t}) \leq F_{f l} (C^{0}) e^{- [\frac{2 t}{tr (Λ^{- 1}) (tr (Λ) + | tr ({(C^{0})}^{- 1} - Λ) |)}]} . \end{matrix}

The degenerate case $N < D + 1$

Additionally, we can show the following result for the fixed points:

Theorem 3. Given a D-dimensional multivariate Gaussian target density $p (x) = N (x | μ, Σ)$, using Algorithm 2 with $N < D + 1$ particles, the empirical mean converges to the exact mean μ. The $N - 1$ non-zero eigenvalues of $C^{t}$ converge to a subset of the spectrum of the target covariance Σ. Furthermore, the global minimum of the regularised version $\tilde{F}$ of the free energy (17) corresponds to the largest eigenvalues of Σ.

Proof. The proof is given in Appendix G. □

This result suggests that $C^{t}$ might typically converge to an optimal low-rank approximation of $Σ$. We show an empirical confirmation of this conjecture in Section 4.2. This suggests that it makes sense to apply our algorithm to high-dimensional problems even when the number of particles is not large. If the target density has significant support close to a low-dimensional submanifold, we might still obtain a reasonable approximation.

3.5. Structured Mean-Field

For high-dimensional problems, it may be useful to restrict the variational Gaussian approximation to the posterior to a specific structure via a structured mean-field approximation. In this way, spurious dependencies between variables that are caused by finite-sample effects could be explicitly removed from the algorithms. This is most easily incorporated in our approach by splitting a given collection of latent variables x into M disjoint subsets $x^{(i)}$. We reorder the vector indices in such a way that the first components correspond to $x^{(1)}$, the next ones to $x^{(2)}$, and so on. Hence, we obtain $x = \{x^{(1)}, x^{(2)}, \dots, x^{(M)}\}$. A structured mean-field approach is enforced by imposing a block matrix structure for the update matrix $A_{M F} = A_{(1)} \oplus \dots \oplus A_{(M)}$, where ⊕ is the direct sum operator. It is easy to see that this construction corresponds to a related block structure of the $Γ$ matrix in Equation (3). This means that the subsets of the random vectors are modeled as independent. Hence, as the number of particles grows to infinity, one recovers the fixed-point equations for the optimal MF structured Gaussian variational approximation from our approach. Note that using a structured MF does not change the complexity of the algorithm but requires fewer particles to obtain a full-rank solution.

3.6. Comparison with SVGD

Given the similarities with the SVGD methods [24], one could question how our approach differs. The model proposed by [10], using a linear kernel $k (x, x^{'}) = x^{⊤} x^{'} + 1$, has similar properties to our approach. The variable update becomes:

\begin{matrix} \frac{d x}{d t} & = \frac{1}{N} \sum_{i = 1}^{N} (- k (x_{i}, x) \nabla φ (x_{i}) + \nabla_{x_{i}} k (x_{i}, x)) \\ & = E_{\hat{q}} [I - \nabla φ (x) x^{⊤}] x - E_{\hat{q}} [\nabla φ (x)] \end{matrix}
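The step from the kernel sum to the expectation form uses only $\nabla_{x_{i}} k (x_{i}, x) = x$ for the linear kernel. A minimal numerical check is sketched below, assuming a made-up potential $φ (x) = \sum_{d} x_{d}^{4} / 4$ and three arbitrary particles in two dimensions; the helper names are hypothetical.

```python
def grad_phi(x):
    # Hypothetical test potential phi(x) = sum_d x_d^4 / 4, so grad phi(x) = x^3.
    return [xd ** 3 for xd in x]

def svgd_linear_sum(particles, x):
    """(1/N) sum_i [ -k(x_i, x) grad phi(x_i) + grad_{x_i} k(x_i, x) ]."""
    n, d = len(particles), len(x)
    out = [0.0] * d
    for xi in particles:
        k = sum(a * b for a, b in zip(xi, x)) + 1.0   # k(x_i, x) = x_i^T x + 1
        g = grad_phi(xi)
        for j in range(d):
            out[j] += (-k * g[j] + x[j]) / n          # grad_{x_i} k(x_i, x) = x
    return out

def svgd_linear_expectation(particles, x):
    """E[I - grad phi(x_i) x_i^T] x - E[grad phi(x_i)]."""
    n, d = len(particles), len(x)
    out = list(x)                                     # the identity term I x
    for xi in particles:
        g = grad_phi(xi)
        proj = sum(a * b for a, b in zip(xi, x))      # x_i^T x
        for j in range(d):
            out[j] -= (g[j] * proj + g[j]) / n
    return out

particles = [[0.5, -1.0], [1.5, 0.2], [-0.7, 2.0]]
x = particles[0]
lhs = svgd_linear_sum(particles, x)
rhs = svgd_linear_expectation(particles, x)
print(lhs, rhs)
```

Both functions should return the same vector up to floating-point rounding, for any evaluation point `x`.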

The fixed points are

\begin{matrix} 0 = & E_{\hat{q}} [\nabla φ (x)] \\ I = & E_{\hat{q}} [\nabla φ (x) x^{⊤}] = E_{\hat{q}} [\nabla φ (x) {(x - m)}^{⊤}] \end{matrix}

where the last equality holds since $E_{\hat{q}} [\nabla φ (x)] = 0$. These are the same as the fixed points (9) of our algorithm. Similarly to Theorem 1, $D + 1$ particles will converge to the exact D-dimensional multivariate Gaussian target. However, the generated flows are different. The main difference is that we normalize our flow via the $L_{2}$ norm, whereas [10] relies on the reproducing kernel Hilbert space (RKHS) norm, i.e., ${∥ φ ∥}_{k}^{2} = φ^{⊤} K^{- 1} φ$, where $φ_{i} = φ (x_{i})$ and $K_{i j} = k (x_{i}, x_{j})$. For a full introduction to RKHS, we recommend [34]. Remarkably, centering the particles on the mean, namely, using the modified linear kernel $k (x, x^{'}) = {(x - m)}^{⊤} (x^{'} - m) + 1$, leads to the same dynamics. Additionally, when using SVGD, there is no direct possibility of computing the current KL divergence between the variational distribution and the target, unless some values are accumulated [35]. There is also no clear theory explaining what happens when the number of particles is smaller than the number of dimensions, for both distance-based kernels and the linear kernel.

4. Experiments

We now evaluate the efficiency of GPF and GF. First, given a Gaussian target, we compare the convergence of our approach with popular VGA methods, which are all described in Section 2. Second, we evaluate the effect of varying the number of particles for both Gaussian targets and non-Gaussian targets, especially with a low-rank covariance. Then, we evaluate the efficiency of our algorithm on a range of real-world binary classification problems through a Bayesian logistic regression model and on a series of Bayesian neural networks (BNNs) trained on the MNIST dataset.

All the Julia [36] code and data used to reproduce the experiments are available at the Github repository: https://github.com/theogf/ParticleFlow_Exp (accessed on 27 July 2021).

4.1. Multivariate Gaussian Targets

We consider a 20-dimensional multivariate Gaussian target distribution. The mean is sampled from a standard Gaussian $μ \sim N (0, I_{D})$ and the covariance is a dense matrix defined as $Σ = U Λ U^{⊤}$, where U is a unitary matrix and $Λ$ is a diagonal matrix. $Λ$ is constructed as $\log_{10} (Λ_{i i}) = \frac{\log_{10} (κ) (i - 1)}{D - 1} - 1$, where $κ$ is the condition number, i.e., $κ = Λ_{max} / Λ_{min}$. This means that, for $κ = 1$, we obtain $Σ = 0.1 I$, and for $κ = 100$, we obtain eigenvalues ranging uniformly in log-space from $0.1$ to 10.
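As a small illustration of this construction, the sketch below builds only the diagonal spectrum $Λ_{i i}$ from the formula above (the random unitary matrix U is left out). The function name `target_spectrum` is hypothetical, and the code assumes $D \geq 2$.

```python
import math

def target_spectrum(kappa, dim):
    """Eigenvalues with log10(lambda_i) = log10(kappa) * (i - 1) / (D - 1) - 1.

    The loop index runs over i - 1 = 0, ..., D - 1; dim must be at least 2.
    """
    return [10.0 ** (math.log10(kappa) * i / (dim - 1) - 1.0) for i in range(dim)]

spec_1 = target_spectrum(1.0, 20)      # kappa = 1: all eigenvalues equal 0.1
spec_100 = target_spectrum(100.0, 20)  # kappa = 100: from 0.1 up to 10
print(spec_1[0], spec_1[-1])
print(spec_100[0], spec_100[-1])
```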

We compare GPF and GF to the state-of-the-art methods for VGA described in Section 2, namely Doubly Stochastic VI (DSVI) [14], Factor Covariance Structure (FCS) [15] with rank $p = D$, iBayes Learning Rule (IBLR) [17] with a full-rank covariance and their Hessian approach, and Stein Variational Gradient Descent with both a linear kernel (Linear SVGD) [10] and a squared-exponential kernel (Sq. Exp. SVGD) [24]. For all methods, we set the number of particles or, alternatively, the number of samples used by the estimator, to $D + 1$. We use standard gradient descent ($x^{t + 1} = x^{t} + η φ^{t} (x^{t})$) with a learning rate of $η = 0.01$ for all particle methods, and RMSProp [37] with a learning rate of $0.01$ for all stochastic methods. We run each experiment 10 times with 30,000 iterations, and plot the average error on the mean and the covariance with one standard deviation. For GPF, we additionally evaluate the method with and without using natural gradients for the mean (i.e., pre-multiplying the averaged gradient with $C^{t}$), indicated, respectively, with a dashed and solid line. Figure 2 reports, over time, the $L_{2}$ norm of the difference between the estimated mean and covariance and those of the true posterior, for the target condition numbers $κ \in \{1, 10, 100\}$.

As Theorem 1 predicts, GPF converges exactly to the true distribution, regardless of the target. GF and other methods based on stochastic estimators cannot obtain the same precision, as their accuracy is penalized by the gradient noise. IBLR approximates the covariance perfectly, despite the stochasticity of its estimator; however, IBLR needs to compute the true Hessian at each step. When using a Hessian approximation instead, IBLR performed just like DSVI; the true benefit of IBLR appears when second-order functions are computed, which is naturally intractable in high dimensions. SVGD with a linear kernel achieves a good performance but is highly unstable: most of the runs (ignored here) diverge. This is due to the dot product $x^{⊤} x$, which can become extremely large, especially for non-centered data. For this reason, we do not consider this method for the later experiments. SVGD with a sq. exp. kernel obtains a good estimate for the mean but fails to approximate the covariance.

Perhaps surprisingly, GF does not perform much better than DSVI or FCS. This is potentially due to the benefit of Riemannian gradients being canceled by the gradient noise [38], which provides a strong argument for particle-based methods over stochastic estimators.

Remarkably, we also confirm the prediction of Theorem 2 that the convergence speed of $C^{t}$ is independent of the target $Σ$, while the convergence speed of $m^{t}$ has this dependency unless the natural gradient is used (see the dashed curves). The case $κ = 1$ highlights that natural gradients do not necessarily improve convergence speed.

4.2. Low-Rank Approximation for Full Gaussian Targets

We explore the effect of the number of particles for both Gaussian and non-Gaussian targets. We use the same Gaussian target as in the previous experiment, now in 50 dimensions, with a full-rank covariance determined by its condition number $κ = \frac{λ_{max}}{λ_{min}}$. The covariance eigenvalues $λ_{i}$ range uniformly in log-space from $0.1$ to $0.1 κ$. For a given target multivariate Gaussian, we vary the number of particles from 2 to $D + 1$ and look at the absolute difference $| tr (C - Σ) |$. The results for $D = 50$, as well as the corresponding predictions from Theorem 3 (dashed black), are shown in Figure 3.

The empirical results perfectly match the theoretical predictions, confirming that, for Gaussian targets, the particles determine a low-rank approximation whose spectrum is equal to the largest eigenvalues from the target.
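The prediction can be written down directly: if the empirical covariance keeps exactly the $N - 1$ largest eigenvalues of $Σ$, then $| tr (C - Σ) |$ is the sum of the eigenvalues that are left out. The sketch below computes this quantity under that assumption (it does not run the particle flow); the function name and the choice $κ = 10$ are illustrative only.

```python
def predicted_trace_gap(kappa, dim, n_particles):
    """|tr(C - Sigma)| if C keeps exactly the N - 1 largest eigenvalues of Sigma."""
    # Spectrum uniform in log-space between 0.1 and 0.1 * kappa, in ascending order.
    spectrum = [0.1 * kappa ** (i / (dim - 1)) for i in range(dim)]
    rank = min(n_particles - 1, dim)
    missing = spectrum[: dim - rank]      # the smallest eigenvalues are not captured
    return sum(missing)

# Number of particles from 2 to D + 1 for D = 50.
gaps = [predicted_trace_gap(10.0, 50, n) for n in range(2, 52)]
print(gaps[0], gaps[-1])
```

The gap shrinks monotonically as particles are added and vanishes at $N = D + 1$, where the approximation becomes full rank.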

4.3. High-Dimensional Low-Rank Gaussian Targets

We consider a typical low-rank target case where the dimensionality is high but the effective rank of the covariance is unknown. The target is given by $p (x) = N (μ, Σ)$, where $μ \sim N (0, I_{D})$ and the covariance is defined by $Σ = U Λ U^{⊤}$, where U is a $D \times D$ unitary matrix and $Λ$ is a diagonal matrix defined by

Λ_{i i} = \begin{cases} N (2, 1), & \text{if } i \leq K \\ 10^{- 8}, & \text{otherwise} \end{cases}

where K is the effective rank of the target. We pick $D = 500$ and vary $K \in \{10, 20, 30\}$ to simulate a true problem where the correct K is not known. We test all methods allowing for low-rank structure, namely, GPF, GF, FCS and SVGD (Linear and Sq. Exp.). We fix the rank (or the number of particles) to be 20; therefore, we obtain three cases where the rank is exact, under-estimated, and over-estimated. We use RMSProp [37] for the stochastic methods and a diagonal version of it (see Section 4.4) for the particle ones. The error of the mean and the covariance is shown in Figure 4. Note that the difference in the initial error on the covariance is due to the difficulty of starting with the same covariance between particle and stochastic methods.

We observe once again that SVGD with a linear kernel fails to converge due to the large gradients. All methods perform equally well in the estimation of the mean and are unaffected by the rank of the target. As expected, the approximation quality for the covariance degrades when the rank gets bigger, but all algorithms still converge to good approximations. SVGD with a sq. exp. kernel performs much worse than the rest of the methods. This is a known phenomenon where, in high dimensions, the covariance estimated by SVGD is either over- or underestimated.

4.4. Non-Gaussian Target

We now investigate the behavior of our algorithm with non-Gaussian target distributions. We built a two-dimensional banana distribution, $p (x) \propto \exp (- 0.5 (0.01 x_{1}^{2} + 0.1 {(x_{2} + 0.1 x_{1}^{2} - 10)}^{2}))$, varied the number of particles used for GPF in $\{3, 5, 10, 20, 50\}$ and compared it with a standard full-rank VGA approach. We also showed the impact of replacing a fixed $η$ with the Adam [39] optimizer for 50 particles. The results are shown in Figure 5. As expected, increasing the number of particles made the distribution obtained via GPF increasingly closer to the optimal standard VGA, even in a non-Gaussian setting. However, using a momentum-based optimizer such as Adam breaks the linearity assumption of the original flow (5) and leads to a twisted representation of the particles. (We observed the same behavior with other momentum-based optimizers.) A simple modification of the best-known optimizers allows the linearity to be maintained while correctly adapting the learning rate to the shape of the problem. Most optimisers accumulate momentum or gradients element-wise, and end up modifying the updates as $x^{t + 1} = x^{t} + P^{t} ⊙ φ^{t} (x^{t})$, where $P^{t} \in R^{D \times N}$ is the preconditioner obtained via the optimiser and ⊙ is the Hadamard product. By instead taking the average over each dimension, we obtained the updates $x^{t + 1} = x^{t} + P^{t} φ^{t} (x^{t})$, where $P^{t}$ is a $D \times D$ diagonal matrix. The details of the dimension-wise conditioners for Adam, AdaGrad and AdaDelta are given in Appendix H.
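To make the idea concrete, the sketch below shows one possible dimension-wise preconditioner. It is an assumption for illustration only: an RMSProp-style running average of the squared flow, averaged over particles so that a single diagonal $P^{t}$ is shared by all of them. It is not the rule from Appendix H, and the function name, the hyper-parameters and the toy flow are all hypothetical. Because every particle is moved by the same affine map, collinear particles remain collinear.

```python
import math

def dimensionwise_rmsprop_step(particles, flows, state, lr=0.1, rho=0.9, eps=1e-8):
    """One update x_i <- x_i + P * flow_i with a diagonal P shared by all particles.

    `state` holds one running second-moment estimate per dimension (not per
    particle), obtained by averaging the squared flow over the particles.
    It is updated in place.
    """
    n, d = len(particles), len(particles[0])
    for j in range(d):
        mean_sq = sum(f[j] ** 2 for f in flows) / n
        state[j] = rho * state[j] + (1.0 - rho) * mean_sq
    precond = [lr / (math.sqrt(state[j]) + eps) for j in range(d)]
    return [[x[j] + precond[j] * f[j] for j in range(d)]
            for x, f in zip(particles, flows)]

# Three collinear particles; the middle one is the midpoint of the other two.
particles = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
# A toy linear flow phi(x) = b + A x, evaluated at each particle.
A = [[-0.5, 0.2], [0.1, -1.0]]
b = [0.3, -0.4]
flows = [[b[r] + sum(A[r][c] * x[c] for c in range(2)) for r in range(2)]
         for x in particles]
state = [0.0, 0.0]
updated = dimensionwise_rmsprop_step(particles, flows, state)
print(updated)
```

With an element-wise preconditioner (one entry per particle and dimension), the midpoint property checked here would generally be lost, which is the twisting effect described above.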

4.5. Bayesian Logistic Regression

Finally, we considered a range of real-world binary classification problems modeled with a Bayesian logistic regression. Given some data $\{(x_{i}, y_{i})\}_{i = 1}^{N}$, where $x_{i} \in R^{D}$ and $y_{i} \in \{- 1, 1\}$, we defined the model $y_{i} \sim Bernoulli (σ (w^{⊤} x_{i}))$ with weight $w \in R^{D}$, and with $σ$ being the logistic function. We set a prior on w: $w \sim N (0, 10 I_{D})$. We benchmarked the competing approaches over four datasets from the UCI repository [40]: spam ($N = 4601, D = 104$), krkp ($N = 3196, D = 37$), ionosphere ($N = 351, D = 111$) and mushroom ($N = 8124, D = 95$). We ran all algorithms discussed in Section 4.1, both with and without a mean-field approximation; SVGD was omitted since it is too unstable. All algorithms were run with a fixed learning rate $η = 10^{- 4}$, and we used mini-batches of size 100. We show alternative training settings in Appendix I. Note that FCS, for mean-field, simplifies to DSVI. Additionally, we did not consider full-rank IBLR, as it is too expensive, and we used their reparametrized gradient version for the Hessian. Figure 6 shows the average negative log-likelihood on 10-fold cross-validation with one standard deviation for each dataset. While, as expected, the advantages shown for Gaussian targets do not transfer to non-Gaussian targets, GPF and GF are consistently on par with competitors. On the other hand, IBLR tends to be outperformed. It is also interesting to note that mean-field does not seem to have a negative impact on these problems, and performance remains the same even with a full-rank matrix.

4.6. Bayesian Neural Network

We ran our algorithm on a standard network with two hidden layers, each with $L = 200$ neurons and tanh activation functions (we additionally tried ReLU [41], but some baselines failed to converge). We trained on the MNIST dataset [42] ($N = 60{,}000$, $D = 784$) and used an isotropic prior on the weights $p (w) = N (0, α I_{D})$ with $α = 1.0$. We additionally compared our methods with Stochastic Weight Averaging-Gaussian (SWAG) [27] with an SGD learning rate of $10^{- 6}$ (selected empirically) and Efficient Low-Rank Gaussian Variational Inference (ELRGVI) [26]. We varied the assumptions on the covariance matrix to be diagonal (Mean-Field), or to have rank $L \in \{5, 10\}$. Additionally, we showed, for GPF, the effect of using a structured mean-field assumption by imposing the independence of the weights between each layer (GPF (Layers)).

We trained each algorithm for 5000 iterations with a batch size of 128 (∼10 epochs) and reported the final average negative log-likelihood, accuracy and expected calibration error [43] on the test set ($N = 10{,}000$) in Table 1. The predictive distribution is given by

\begin{matrix} p (y = k | x^{*}, D) = \int p (y = k | x^{*}, w) p (w | D) d w \approx \int p (y = k | x^{*}, w) q (w) d w, \end{matrix}

where $D$ is the training data, and $x^{*}$ is a test sample. We computed the accuracy and the average negative test log-likelihood as:

\begin{matrix} Acc & = \frac{1}{N} \sum_{i = 1}^{N} 1_{y_{i}} ({arg}_{k} max p (y = k | x_{i}^{*}, D)) \\ NLL & = - \frac{1}{N} \sum_{i = 1}^{N} log p (y = y_{i} | x_{i}^{*}, D) \end{matrix}

where $1_{y} (x)$ is the indicator function (equal to 1 for $y = x$, 0 otherwise). For the definition of expected calibration error, we refer the reader to [43]. Additional convergence and uncertainty calibration plots can be found in Appendix I.
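The two metrics can be sketched as follows, under the assumption that the integral over $q (w)$ is approximated by averaging the class probabilities over a finite set of weight samples (for a particle method, the particles themselves could play this role). The probabilities and labels below are made-up numbers, not results from the experiments, and the helper names are hypothetical.

```python
import math

def predictive(probs_per_sample):
    """Average class probabilities p(y = k | x*, w_s) over S weight samples w_s."""
    n_samples = len(probs_per_sample)
    n_classes = len(probs_per_sample[0])
    return [sum(p[k] for p in probs_per_sample) / n_samples for k in range(n_classes)]

def accuracy_and_nll(predictions, labels):
    """Accuracy and average negative log-likelihood from predictive distributions."""
    n = len(labels)
    correct = sum(1 for p, y in zip(predictions, labels) if p.index(max(p)) == y)
    nll = -sum(math.log(p[y]) for p, y in zip(predictions, labels)) / n
    return correct / n, nll

# Two test inputs, two weight samples each, three classes (made-up numbers).
per_input_samples = [
    [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]],   # averages to about [0.7, 0.2, 0.1]
    [[0.2, 0.5, 0.3], [0.4, 0.1, 0.5]],   # averages to about [0.3, 0.3, 0.4]
]
labels = [0, 1]
predictions = [predictive(s) for s in per_input_samples]
acc, nll = accuracy_and_nll(predictions, labels)
print(acc, nll)
```

Note that the probabilities are averaged before taking the logarithm, which matches the definition of NLL in terms of the predictive distribution rather than an average of per-sample log-likelihoods.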

Overall, the SVGD method performed best in terms of both accuracy and negative log-likelihood. However, SVGD is not in the same category as others, since it is not a VGA. For VGAs, we observed that a low-rank approximation improves upon mean-field methods. In particular, assuming independence between layers provides a large advantage to GPF. GPF and GF generally perform as well as or better than all the other VGA methods. Note that, although not reported here, all methods needed approximately the same time for the 5000 iterations, except for SWAG, which only needed the MAP and a few thousand iterations of SGD afterward, making it generally faster but also less controlled (a grid search was needed to find the appropriate learning rate for SGD).

5. Discussion

We introduced GPF, a general-purpose and theoretically grounded particle-based approach to perform inference with variational Gaussians, as well as GF, its parameter version. We were able to show the convergence of the particle algorithm based on an empirical approximation of the free energy. We also showed that we can approximate high-dimensional targets by allowing for low-rank approximations with a small number of particles. The results for Gaussian targets suggest that the posterior covariance approximation may relax asymptotically fast, with small dependence on the target. This work is the first step in analyzing convergence speed and guarantees in inference with variational Gaussians, and future work could extend guarantees to non-Gaussian problems. One could also take advantage of existing particle-based VI methods to accelerate inference further or reach better optima [44,45].

Author Contributions

Conceptualization, T.G.-F. and M.O.; methodology, T.G.-F., V.P. and M.O.; software, T.G.-F.; validation, T.G.-F.; formal analysis, T.G.-F.; investigation, T.G.-F.; resources, T.G.-F. and V.P.; data curation, T.G.-F.; writing—original draft preparation, T.G.-F., V.P. and M.O.; writing—review and editing, T.G.-F., V.P. and M.O.; visualization, T.G.-F.; supervision, M.O.; project administration, T.G.-F.; funding acquisition, M.O. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge the support of the German Research Foundation and the Open Access Publication Fund of TU Berlin.

Data Availability Statement

Datasets can be found on the UCI dataset website [40] and the MNIST dataset can be found on Yann LeCun's website [42].

Acknowledgments

We thank Fela Winkelmolen for his initial help on computations, Jannik Thümmel for his work on the linear SVGD and the reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of the Optimal Parameters

In Section 3, we considered the optimization problem:

\begin{matrix} min_{A^{t}, b^{t} \in B} \frac{d F [q^{t}]}{d t} where B = {A^{t}, b^{t} : ∥ A^{t} ∥_{F}^{2} = 1, ∥ b^{t} ∥^{2} = 1}, \end{matrix}

where we have introduced $∥ A^{t} ∥_{F}^{2} = tr (A^{t} {(A^{t})}^{⊤})$, the squared Frobenius norm, and $∥ b^{t} ∥$, the $L_{2}$ norm, and

\begin{matrix} \frac{d F [q^{t}]}{d t} = - tr [A^{t} {(A_{★}^{t})}^{⊤}] - {(b^{t})}^{⊤} b_{★}^{t} \end{matrix}

(A1)

To solve this problem, we used the Lagrange multiplier method. We write the Lagrangian as:

\begin{matrix} L (A^{t}, b^{t}) = \frac{d F [q^{t}]}{d t} - λ_{A} g (A^{t}) - λ_{b} h (b^{t}), \end{matrix}

where $g (A) = tr (A A^{⊤}) - 1$ and $h (b) = {∥ b ∥}_{2}^{2} - 1$. For simplicity, we can divide the problem as:

\begin{matrix} L (A^{t}) = & - tr [A^{t} {(A_{★}^{t})}^{⊤}] - λ_{A} g (A^{t}) \\ L (b^{t}) = & - {(b^{t})}^{⊤} b_{★}^{t} - λ_{b} h (b^{t}) \end{matrix}

For $A^{t}$, we have the constraints:

\begin{matrix} \nabla_{A^{t}} tr [A^{t} {(A_{★}^{t})}^{⊤}] = & λ_{A} \nabla_{A^{t}} g (A^{t}) \\ g (A^{t}) = & 0 \end{matrix}

Computing the gradients is straightforward:

\begin{matrix} A_{★}^{t} = & 2 λ_{A} A^{t} \\ \Rightarrow A^{t} = \frac{A_{★}^{t}}{2 λ_{A}} \\ \Rightarrow \frac{1}{4 λ_{A}^{2}} tr (A_{★}^{t} {(A_{★}^{t})}^{⊤}) = & 1 \\ \Rightarrow λ_{A} = \sqrt{\frac{tr (A_{★}^{t} {(A_{★}^{t})}^{⊤})}{4}} . \end{matrix}

which gives us the result $A^{t} = \frac{A_{★}^{t}}{∥ A_{★}^{t} ∥_{F}}$. Similarly for $b^{t}$:

\begin{matrix} \nabla_{b^{t}} {(b^{t})}^{⊤} b_{★}^{t} = & λ_{b} \nabla_{b^{t}} h (b^{t}) \\ h (b^{t}) = & 0 . \end{matrix}

Replacing the gradients gives:

\begin{matrix} b_{★}^{t} = & 2 λ_{b} b^{t} \\ \Rightarrow b^{t} = & \frac{b_{★}^{t}}{2 λ_{b}} \\ \Rightarrow \frac{1}{4 λ_{b}^{2}} {∥ b_{★}^{t} ∥}_{2}^{2} = & 1 \\ \Rightarrow λ_{b} = & \frac{∥ b_{★}^{t} ∥_{2}}{2} \end{matrix}

which gives us the result $b^{t} = \frac{b_{★}^{t}}{∥ b_{★}^{t} ∥_{2}}$.
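As a cross-check of the Lagrangian solution, the same optimum follows from the Cauchy–Schwarz inequality. The short derivation below is a sketch that assumes $A_{★}^{t}$ and $b_{★}^{t}$ are nonzero and uses only the unit-norm constraints defining $B$:

```latex
\begin{aligned}
\mathrm{tr}\left(A^{t} (A_{\star}^{t})^{\top}\right)
  &\leq \| A^{t} \|_{F} \, \| A_{\star}^{t} \|_{F} = \| A_{\star}^{t} \|_{F},
  \qquad \text{with equality iff } A^{t} = \frac{A_{\star}^{t}}{\| A_{\star}^{t} \|_{F}}, \\
(b^{t})^{\top} b_{\star}^{t}
  &\leq \| b^{t} \|_{2} \, \| b_{\star}^{t} \|_{2} = \| b_{\star}^{t} \|_{2},
  \qquad \text{with equality iff } b^{t} = \frac{b_{\star}^{t}}{\| b_{\star}^{t} \|_{2}}, \\
\min_{A^{t}, b^{t} \in B} \frac{d F[q^{t}]}{d t}
  &= - \| A_{\star}^{t} \|_{F} - \| b_{\star}^{t} \|_{2}.
\end{aligned}
```

The normalized directions therefore give the steepest decrease of the free energy allowed by the constraints.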

Appendix B. Relaxation of the Empirical Free Energy

We prove the decrease in the empirical free energy (17) under the particle flow when the covariance C is nonsingular. We define the empirical distribution $\hat{q} (x) = \frac{1}{N} \sum_{i = 1}^{N} δ_{x, x_{i}}$ with a finite number N of particles. The empirical free energy is defined as

\begin{matrix} F [\hat{q}] = E_{\hat{q}} [φ (x)] - \frac{1}{2} log | C | . \end{matrix}

We are interested in the temporal change of the free energy, when particles move under a general linear dynamics

\begin{matrix} \frac{d x_{i}}{d t} = b + A (x_{i} - m) . \end{matrix}

The induced dynamics for $F$ are:

\begin{matrix} \frac{d F}{d t} = E_{q^{t}} [\nabla_{x} φ {(x)}^{⊤} \frac{d x}{d t}] - \frac{1}{2} tr (C^{- 1} \frac{d C}{d t}) \end{matrix}

For notational simplicity, we introduce $g (x) = \nabla_{x} φ (x)$ and $\dot{x} = \frac{d x}{d t}$ (similarly $\dot{m} = \frac{d m}{d t}$). The time derivative of the covariance is

\begin{matrix} \frac{d C}{d t} = & \frac{d}{d t} E_{q} [(x - m) {(x - m)}^{⊤}] \\ = & E_{q} [(\dot{x} - \dot{m}) {(x - m)}^{⊤}] + E_{q} [(x - m) {(\dot{x} - \dot{m})}^{⊤}] \\ = & E_{q} [\dot{x} x^{⊤} + x {\dot{x}}^{⊤} - \dot{m} m^{⊤} - m {\dot{m}}^{⊤}] \\ = & E_{q} [\dot{x} {(x - m)}^{⊤}] + E_{q} [(x - m) {\dot{x}}^{⊤}] \end{matrix}

so that

\begin{matrix} \frac{d F}{d t} = & E_{q} [g {(x)}^{⊤} \dot{x}] - \frac{1}{2} E_{q} [tr (C^{- 1} \dot{x} {(x - m)}^{⊤}) + tr (C^{- 1} (x - m) {\dot{x}}^{⊤})] \\ = & E_{q} [{\dot{x}}^{⊤} (g (x) - C^{- 1} (x - m))] \end{matrix}

(A2)

where we used the permutation properties of the trace.

Plugging the dynamics into Equation (A2), we obtain:

\begin{matrix} \begin{matrix} \frac{d F}{d t} = & b^{⊤} E_{q} [g (x)] + E_{q} [{(x - m)}^{⊤} A^{⊤} g (x)] \\ - E_{q} [{(x - m)}^{⊤} A^{⊤} C^{- 1} (x - m)] \end{matrix} \end{matrix}

(A3)

where we used the fact that $b^{⊤} C^{- 1} E_{q} [x - m] = 0$.

We next look for conditions on b and A, under which $\frac{d F}{d t} < 0$, i.e., the dynamics will lead to a decrease in the free energy. We pick $b = - β_{1} E_{q} [g (x)]$, where $β_{1} > 0$, and we obtain, for the first term in (A3):

- β_{1} {∥ E_{q} [g (x)] ∥}^{2} \leq 0 .

For A, let us first define $ψ = E_{q} [g (x) {(x - m)}^{⊤}]$ and rewrite the second and last terms of Equation (A3) as:

\begin{matrix} E_{q} [{(x - m)}^{⊤} A^{⊤} g (x)] = & tr (E_{q} [A^{⊤} g (x) {(x - m)}^{⊤}]) \\ = & tr (A^{⊤} ψ) \\ E_{q} [{(x - m)}^{⊤} A^{⊤} C^{- 1} (x - m)] = & tr (A^{⊤} C^{- 1} C) \\ = & tr (A) \end{matrix}

Combining both, we get $tr (A^{⊤} (ψ - I))$. Similarly to the previous step, we pick $A = - β_{2} (ψ - I)$, where $β_{2} \geq 0$, which leads to another non-positive term:

- β_{2} tr ({(ψ - I)}^{⊤} (ψ - I)) \leq 0,

where we use the fact that $X^{⊤} X$ is a positive semi-definite matrix for any real-valued X.

Note that different forms of A (e.g., with $β_{2}$ replaced by a positive definite matrix) could be used, as long as the trace of the product stays positive. Inserting b and A, the free energy dynamics become

\begin{matrix} \frac{d F}{d t} = & - β_{1} {∥ E_{q} [g (x)] ∥}^{2} - β_{2} tr ({(ψ - I)}^{⊤} (ψ - I)) \end{matrix}

The variable dynamics are given by

\begin{matrix} \frac{d x}{d t} = & - β_{1} E_{q} [g (x)] - β_{2} (ψ - I) (x - m) \\ = & - β_{1} E_{q} [g (x)] \\ - β_{2} (E_{q} [g (x) {(x - m)}^{⊤}] - I) (x - m), \end{matrix}

which is equivalent to Equation (5), for $β_{1} = β_{2} = 1$. Our result shows that the empirical approximation of the free energy decreases under the particle flow.
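The monotone decrease can be observed numerically on a toy problem. The following sketch assumes a one-dimensional, non-Gaussian potential $φ (x) = x^{4} / 4$ (a made-up example), three particles, $β_{1} = β_{2} = 1$ and an Euler step small enough for the discretized flow to inherit the decrease of the continuous-time flow. Function names are hypothetical.

```python
import math

def phi(x):
    return 0.25 * x ** 4          # hypothetical non-Gaussian potential

def grad_phi(x):
    return x ** 3

def free_energy(x):
    """Empirical free energy E[phi(x)] - 0.5 * log C for 1-D particles."""
    n = len(x)
    m = sum(x) / n
    c = sum((xi - m) ** 2 for xi in x) / n
    return sum(phi(xi) for xi in x) / n - 0.5 * math.log(c)

def flow_step(x, dt):
    """One Euler step of dx/dt = -E[g] - (E[g (x - m)] - 1) (x - m)."""
    n = len(x)
    m = sum(x) / n
    g = [grad_phi(xi) for xi in x]
    g_mean = sum(g) / n
    psi = sum(gi * (xi - m) for gi, xi in zip(g, x)) / n
    return [xi + dt * (-g_mean - (psi - 1.0) * (xi - m)) for xi in x]

x = [0.5, 1.5, 3.0]
energies = [free_energy(x)]
for _ in range(3000):
    x = flow_step(x, dt=1e-3)
    energies.append(free_energy(x))
print(energies[0], energies[-1])
```

The recorded sequence `energies` should be non-increasing up to floating-point rounding, which is how the empirical free energy can be used to monitor convergence.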

Appendix C. Riemannian Gradient for Matrix Parameter Γ

The parameter flow for the matrix $Γ$ in (11) is given by

\begin{matrix} \frac{d Γ^{t}}{d t} = & Γ^{t} - E_{q^{0}} [\nabla_{x} φ (x^{t}) {(x^{0} - m^{0})}^{⊤}] Γ^{t} {(Γ^{t})}^{⊤} . \end{matrix}

This is easily rewritten in terms of the parameter gradient as

\frac{d Γ^{t}}{d t} = \frac{\partial F}{\partial Γ} Γ Γ^{⊤}

Similar to natural gradients, which are defined by the metric induced by the Fisher matrix, we can rewrite the parameter change in terms of a different Riemannian gradient. This gradient is the direction of change $d Γ = Γ (t + d t) - Γ (t)$, which yields the steepest descent of the free energy over a small time interval $d t$. As an extra condition, one keeps the length of $d Γ$ (measured by a 'natural' metric, which has specific invariance properties) fixed. This is defined by an inner product (the squared length) ${〈 d Γ, d Γ 〉}_{Γ}$ in the tangent space of small deviations $d Γ$ from the matrix $Γ$. Hence, $d Γ$ is found by minimising $F (Γ (t) + d Γ, m)$ (for small $d Γ$) under the condition that ${〈 d Γ, d Γ 〉}_{Γ (t)}$ is fixed. Following [20] (Theorem 6), a natural metric in the space of symmetric nonsingular matrices can be defined as

\begin{matrix} {〈 d Γ, d Γ 〉}_{Γ} ≐ tr ({(d Γ Γ^{- 1})}^{⊤} d Γ Γ^{- 1}) . \end{matrix}

This metric is invariant against multiplications of $Γ$ and $d Γ$ by matrices Y, i.e., ${〈 d Γ, d Γ 〉}_{Γ} = {〈 d Γ Y, d Γ Y 〉}_{Γ Y}$, and reduces to the Euclidean metric at the unit matrix $Γ = I$.

The direction of the natural gradient is obtained by expanding the free energy for small $d Γ$ and introducing a Lagrange multiplier $λ$ for the constraint. One ends up with the quadratic form

\begin{matrix} \frac{\partial F}{\partial Γ} d Γ + λ tr ({(d Γ Γ^{- 1})}^{⊤} d Γ Γ^{- 1}) \end{matrix}

to be minimised by $d Γ$. By taking the derivative with respect to $d Γ$, one finds that the direction of $d Γ$ agrees with the right-hand side of the flow (11).

Appendix D. Regularised Free Energy for N ≤ D

The problem of defining an empirical approximation for $N \leq D$ particles is that the empirical covariance becomes singular and typically has $N - 1$ nonzero eigenvalues, and thus $| C | = 0$. Note that the extra 0 eigenvalue is derived from the fact that the empirical sum of fluctuations must be zero, which provides an additional linear constraint.

We can regularise the log determinant term by replacing the zero eigenvalues of C: $λ_{i} = 0 \to {\tilde{λ}}_{i} = 1$. The log-determinant of the new covariance $\tilde{C}$ becomes

\begin{matrix} log | \tilde{C} | = \sum_{i : λ_{i} > 0} log λ_{i}, \end{matrix}

since $\log 1 = 0$. The dynamics of the particles stays the same. To rewrite this formally in terms of matrices, we define

\begin{matrix} \tilde{C} = C + C_{⊥} \end{matrix}

where

\begin{matrix} C_{⊥} = \sum_{i : λ_{i} = 0} e_{i} e_{i}^{⊤} \end{matrix}

and $e_{i}$ is the ith eigenvector of C. This replaces all 0 eigenvalues by 1. $C_{⊥}$ is a projector: $C_{⊥}^{2} = C_{⊥}$ and $C_{⊥} (I - C_{⊥}) = 0$. We also have $tr (C_{⊥}) = D - (N - 1)$. In the following, it is useful to introduce the $D \times N$ matrix of fluctuations Z, such that $C = Z Z^{⊤} / N$. The column vectors of Z span the subspace of eigenvectors $e_{i}$ with $λ_{i} > 0$. Hence, it follows that $C_{⊥} Z = 0$.

We want to show that the regularised free energy $\tilde{F}$ decreases under the particle dynamics for $N \leq D$. Since the part of the time derivative of $\tilde{F}$ that depends on $\frac{d m}{d t}$ is not changed, we will only discuss the fluctuation part in the following.

It is useful to introduce the matrix:

\begin{matrix} \tilde{A} ≐ I - C_{⊥} - g Z^{⊤} / N = A - C_{⊥}, \end{matrix}

where $g = \nabla_{x} φ (x)$ is the $D \times N$ matrix of the gradients.

\begin{matrix} E_{q} [g {(x)}^{⊤} \frac{d x}{d t}] = & tr (A) - tr (A^{⊤} A) \\ = & tr (\tilde{A} + C_{⊥}) - tr ({(\tilde{A} + C_{⊥})}^{⊤} (\tilde{A} + C_{⊥})) \\ = & tr (\tilde{A}) - tr ({\tilde{A}}^{⊤} \tilde{A}) . \end{matrix}

To obtain this result, we need

\begin{matrix} tr (C_{⊥} \tilde{A}) = & tr (C_{⊥} {\tilde{A}}^{⊤}) \\ = & tr (C_{⊥} (I - C_{⊥}) - C_{⊥} Z g^{⊤} / N) = 0 . \end{matrix}

We need to work out

\begin{matrix} - \frac{1}{2} \frac{d ln | \tilde{C} |}{d t} = & - \frac{1}{2} tr (\frac{d \tilde{C}}{d t} {\tilde{C}}^{- 1}) \\ = & - \frac{1}{2} tr (\frac{d C}{d t} {\tilde{C}}^{- 1}) \end{matrix}

where we have used the fact that the eigenvalues $λ_{i} = 1$ of $\tilde{C}$ have a zero time derivative and can be omitted. We use the linear dynamics $\frac{d Z}{d t} = A Z$ to obtain:

\begin{matrix} \frac{d C}{d t} = & C A^{⊤} + A C \\ = & (\tilde{C} - C_{⊥}) ({\tilde{A}}^{⊤} + C_{⊥}) + (\tilde{A} + C_{⊥}) (\tilde{C} - C_{⊥}) \\ = & \tilde{C} {\tilde{A}}^{⊤} + \tilde{A} \tilde{C} + C_{⊥} \tilde{C} + \tilde{C} C_{⊥} - \tilde{A} C_{⊥} - C_{⊥} {\tilde{A}}^{⊤} - 2 C_{⊥} \\ = & \tilde{C} {\tilde{A}}^{⊤} + \tilde{A} \tilde{C}, \end{matrix}

where we have used $C_{⊥}^{2} = C_{⊥}$ and $C_{⊥} {\tilde{A}}^{⊤} = 0$. Hence

\begin{matrix} - \frac{1}{2} tr (\frac{d \tilde{C}}{d t} {\tilde{C}}^{- 1}) = & - tr (\tilde{A}) . \end{matrix}

Finally, the temporal change in the free energy due to the fluctuations is given by

\begin{matrix} \frac{d \tilde{F}}{d t} = - tr ({\tilde{A}}^{⊤} \tilde{A}) \leq 0 . \end{matrix}

Note that this proof is not only valid for $N \leq D$, but also for $N > D$, as the overall computations are simplified with $C_{⊥} = 0$. A more detailed proof for $N > D$ is, furthermore, given in Appendix B.

Efficient Computation of $log | \tilde{C} |$

A practical way to compute $\log | \tilde{C} |$ without performing an eigenvector expansion is to define the $N \times N$ matrix

\begin{matrix} R ≐ Z^{⊤} Z / N + J_{N, N} / N, \end{matrix}

where

J_{N, N}

is the

N \times N

all-ones matrix.

Z^{⊤} Z / N

shares the

N - 1

nonzero eigenvalues with C and has an additional eigenvalue 0 corresponding to the constant eigenvector

{(e_{N})}_{i} = 1 / \sqrt{N}

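. Adding the scaled all-ones matrix leaves the existing nonzero eigenvalues unchanged and replaces the zero eigenvalue with 1 (its eigenvector is the constant vector), which contributes nothing to the log-determinant. This leads to the following result: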

\begin{matrix} - \frac{1}{2} log | R | = - \frac{1}{2} \sum_{i = 1}^{N - 1} log λ_{i} . \end{matrix}
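The identity above can be checked numerically. The following is a minimal sketch, assuming NumPy and particles stored as the columns of a (D, N) array; the function name and the random test data are hypothetical and not part of the original algorithm. It returns $log | R |$, so the free-energy term is $- \frac{1}{2}$ times the returned value.

```python
import numpy as np

def logdet_nonzero_spectrum(X):
    """Sum of the logs of the N-1 nonzero eigenvalues of the empirical covariance.

    X has shape (D, N): one particle per column (hypothetical layout).
    """
    D, N = X.shape
    Z = X - X.mean(axis=1, keepdims=True)      # centred particles, columns sum to 0
    R = Z.T @ Z / N + np.ones((N, N)) / N      # the N x N matrix R from the text
    sign, logdet = np.linalg.slogdet(R)
    return logdet

rng = np.random.default_rng(0)
D, N = 6, 4                                    # N <= D: C is rank-deficient
X = rng.normal(size=(D, N))

# Reference value from an explicit eigendecomposition of C = Z Z^T / N.
Z = X - X.mean(axis=1, keepdims=True)
eigvals = np.linalg.eigvalsh(Z @ Z.T / N)      # ascending order
reference = np.sum(np.log(eigvals[-(N - 1):])) # keep the N-1 nonzero eigenvalues

value = logdet_nonzero_spectrum(X)
```

The determinant is taken of an N x N matrix, so no eigendecomposition of the D x D covariance is needed when N is much smaller than D.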

Appendix E. Proof of Theorem 1: Fixed Points for a Gaussian Model (N > D)

Theorem A1

(1). If the target density

p (x)

is a D-dimensional multivariate Gaussian, only

D + 1

particles are needed for Algorithm 2 to converge to the exact target parameters.

The general fixed-point condition for the dynamics (13) of the position

x_{i}

for particle i is given by:

\begin{matrix} (I - E_{\hat{q}} [g (x) {(x - m)}^{⊤}]) (x_{i} - m) - E_{\hat{q}} [g (x)] = 0 . \end{matrix}

for

i = 1, \dots, N

. By taking the expectation over all particles, we obtain:

\begin{matrix} E_{\hat{q}} [g (x)] = 0, \end{matrix}

(A4)

where

\hat{q}

is the empirical distribution of the particles at the fixed point. Note that this result is independent of N, i.e., it is also valid for

N = 1

.

For a D-dimensional Gaussian target

p (x) = N (μ, Σ)

, we will show that the empirical mean and covariance given by the particle algorithm converge to the true mean and covariance matrix of the Gaussian when we use

N \geq D + 1

particles. In this setting, we have

φ (x) = \frac{1}{2} x^{⊤} Σ^{- 1} x - x^{⊤} Σ^{- 1} μ

. For simplification, we use the precision matrix

Λ = Σ^{- 1}

and get

\begin{matrix} φ (x) = \frac{1}{2} x^{⊤} Λ x - x^{⊤} Λ μ . \end{matrix}

The gradient

g (x)

becomes:

\begin{matrix} g (x) = Λ (x - μ) \end{matrix}

At the fixed points, we have that

\frac{d m}{d t}

and

\frac{d Γ}{d t}

are equal to 0. For the mean m:

\begin{matrix} \frac{d m}{d t} = - E_{\hat{q}} [g (x)] = & 0 \\ Λ E_{\hat{q}} [x - μ] = & 0 \\ Λ m = & Λ μ \\ m = & μ \end{matrix}

For the matrix

Γ

, we have

\begin{matrix} \frac{d Γ}{d t} = - A Γ = & 0 \\ Γ - E_{q_{0}} [g (x) {(x - m)}^{⊤}] Γ = & 0 \\ E_{q_{0}} [Λ (x - μ) {(x - m)}^{⊤}] Γ = & Γ \\ - 2 η_{2} E_{q_{0}} [(x - m) {(x - m)}^{⊤}] Γ = & Γ \\ Λ C Γ = & Γ \\ Λ C^{2} = & C \end{matrix}

where we used the result for the mean

m = μ

and, in the last step, right-multiplied by

Γ^{⊤}

, using

C = Γ Γ^{⊤}

. This simplifies to

C = Λ^{- 1} = Σ

only if C is non-singular, i.e., only if its rank equals D, which requires at least

D + 1

particles.
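The fixed point derived above can be checked numerically. The following is a minimal sketch, assuming a plain Euler discretisation of the particle dynamics (not the optimiser-based updates of Algorithm 2) and a hypothetical two-dimensional target; the values of `mu`, `Sigma`, the initial particles, and the step size are illustrative choices only.

```python
import numpy as np

# Hypothetical 2-D Gaussian target N(mu, Sigma).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Lam = np.linalg.inv(Sigma)                     # precision matrix

# D + 1 = 3 particles, one per column, chosen so that C is non-singular.
X = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
N = X.shape[1]

dt, n_steps = 0.01, 3000                       # Euler step, horizon t = 30
for _ in range(n_steps):
    m = X.mean(axis=1, keepdims=True)
    Z = X - m                                  # centred particles
    G = Lam @ (X - mu[:, None])                # g(x_i) = Lam (x_i - mu)
    A = G @ Z.T / N - np.eye(2)                # A = E[g(x) (x - m)^T] - I
    g_bar = G.mean(axis=1, keepdims=True)      # E[g(x)]
    X = X + dt * (-(A @ Z) - g_bar)            # dx_i/dt = -A (x_i - m) - E[g]

m_final = X.mean(axis=1)
Z_final = X - m_final[:, None]
C_final = Z_final @ Z_final.T / N              # empirical covariance at the end
```

With three particles in two dimensions the empirical covariance has full rank, so the run ends at the target mean and covariance up to the integration tolerance.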

Appendix F. Proof of Theorem 2: Rates of Convergence for Gaussian Targets

Theorem A2

(2). For a target

p (x) = N (x ∣ μ, Λ^{- 1})

, where

x \in R^{D}

, and

N \geq D + 1

particles, the continuous time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix:

\begin{matrix} m^{t} - μ = & e^{- Λ t} (m^{0} - μ), \\ tr ({(C^{t})}^{- 1} - Λ) = & e^{- 2 t} tr ({(C^{0})}^{- 1} - Λ), \end{matrix}

where

m^{t}

and

C^{t}

are the empirical mean and covariance matrix at time t and

exp (- Λ t)

is the matrix exponential.

In the following, we assume the target

p (x) = N (μ, Σ)

. We use the notation

Λ ≐ Σ^{- 1}

and

δ C^{t} = C^{t} - Σ

.

Appendix F.1. Convergence of the Mean

Given our target

p (x)

, similarly to Appendix E we have

g (x) = Λ (x - μ)

, where

η_{1} = Σ^{- 1} μ

and

η_{2} = - \frac{1}{2} Σ^{- 1}

. This transforms the first of Equations (11) into

\begin{matrix} \frac{d m}{d t} = & - Λ (E_{\hat{q}} [x] - μ) \\ = & - Λ (m - μ) \end{matrix}

If we now consider the error on m,

δ m = m - μ

, we obtain:

\begin{matrix} \frac{d δ m}{d t} = & \frac{d m}{d t} = - Λ (m - μ) \\ = & - Λ δ m . \end{matrix}

Therefore, the mean converges exponentially fast to the true mean. The asymptotic rate is governed by the smallest eigenvalue of

Λ

, i.e., the inverse of the largest eigenvalue

λ_{max}

of

Σ

, because the slowest-decaying error component dominates at large times.

Appendix F.2. Convergence of the Covariance Matrix

Let

z = x - m

, we have from Equation (5), that

\begin{matrix} \frac{d z}{d t} = - A z \end{matrix}

where

A = E_{q_{0}} [g (x) z^{⊤}] - I

. This expectation can further be simplified as

\begin{matrix} E_{\hat{q}} [Λ (x - μ) z^{⊤}] = & Λ C, \end{matrix}

(A5)

where

q \sim N (m, C)

. Hence, we have the exact result

\begin{matrix} \frac{d C}{d t} = (I - Λ C) C + C (I - C Λ) . \end{matrix}

(A6)

We know that the optimal target is

C = Σ

. Therefore, we define the error

δ C = C - Σ

. Linearizing Equation (A6) gives us

\begin{matrix} \frac{d δ C}{d t} = \frac{d C}{d t} = & (I - Λ (δ C + Σ)) (δ C + Σ) \\ & + (δ C + Σ) (I - (δ C + Σ) Λ) \\ = & - Λ δ C (δ C + Σ) - (δ C + Σ) δ C Λ \\ \approx & - Λ δ C Σ - Σ δ C Λ . \end{matrix}

We were not yet able to find a general solution of this equation, but we can obtain a simple result for the trace

y^{t} ≐ tr (δ C)

at time t:

\begin{matrix} \frac{d y^{t}}{d t} ≃ - 2 y^{t} . \end{matrix}

We, therefore, have an asymptotic linear convergence:

y^{t} \propto e^{- 2 t} y^{0}

which is independent of the parameters of the Gaussian model.

We can also equivalently obtain a non-asymptotic estimate of a specific error measure for the precision matrix. Using equation (A6), we have the following dynamics for the precision

C^{- 1}

:

\begin{matrix} \frac{d C^{- 1}}{d t} = & - C^{- 1} \frac{d C}{d t} C^{- 1} \\ = & - C^{- 1} (I - Λ C) - (I - C Λ) C^{- 1} . \end{matrix}

Taking the trace and using its cyclic property, we obtain

\begin{matrix} \frac{d tr (C^{- 1})}{d t} = & - 2 tr (C^{- 1}) + 2 tr (Λ) , \end{matrix}

or equivalently, since Λ is constant in time,

\begin{matrix} \frac{d tr (C^{- 1} - Λ)}{d t} = & - 2 tr (C^{- 1} - Λ) . \end{matrix}

Hence we get the following exact result:

\begin{matrix} tr ({(C^{t})}^{- 1} - Λ) = e^{- 2 t} tr ({(C^{0})}^{- 1} - Λ) \end{matrix}

which is again independent of the parameters of the Gaussian model.
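This exact decay law can be verified by integrating Equation (A6) directly. The following is a minimal sketch, assuming a classical fourth-order Runge-Kutta integrator and hypothetical values for `Sigma` and the initial covariance `C0`; none of these choices come from the original experiments.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # hypothetical target covariance
Lam = np.linalg.inv(Sigma)
I = np.eye(2)

def dC_dt(C):
    """Right-hand side of Equation (A6)."""
    return (I - Lam @ C) @ C + C @ (I - C @ Lam)

def rk4_step(C, dt):
    """One classical Runge-Kutta step for dC/dt = dC_dt(C)."""
    k1 = dC_dt(C)
    k2 = dC_dt(C + 0.5 * dt * k1)
    k3 = dC_dt(C + 0.5 * dt * k2)
    k4 = dC_dt(C + dt * k3)
    return C + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def precision_gap(C):
    """tr(C^{-1} - Lam), the error measure of Theorem 2."""
    return np.trace(np.linalg.inv(C) - Lam)

C0 = np.array([[1.0, 0.2],
               [0.2, 0.5]])                    # non-singular initial covariance
dt, T = 1e-3, 1.0
C = C0.copy()
for _ in range(int(round(T / dt))):
    C = rk4_step(C, dt)

simulated = precision_gap(C)
predicted = np.exp(-2 * T) * precision_gap(C0)
```

The two values agree to integrator accuracy, whatever target covariance is chosen, which is the sense in which the rate is independent of the model parameters.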

Additionally, this tells us that if the covariance C is non-singular at time

t = 0

, it will remain non-singular for all t (otherwise

tr (C^{- 1})

would become infinite). Hence, if we start with

N > D

particles with a non-singular empirical covariance, they cannot collapse to make C singular.

Appendix F.3. Convergence of the Trace of the Covariance

The asymptotic result on traces obtained previously can be turned into an exact inequality. We have

\begin{matrix} \frac{d δ C}{d t} = - Λ δ C Σ - Σ δ C Λ - Λ {(δ C)}^{2} - {(δ C)}^{2} Λ \end{matrix}

Taking the trace, we get

\begin{matrix} \frac{d tr (δ C)}{d t} = - 2 tr (δ C) - 2 tr (δ C Λ δ C) \end{matrix}

Since

δ C Λ δ C

is positive semi-definite, we have

- 2 tr (δ C Λ δ C) \leq 0

and thus

\begin{matrix} \frac{d tr (δ C)}{d t} \leq - 2 tr (δ C) \end{matrix}

leading to:

\begin{matrix} tr (δ C^{t}) \leq tr (δ C^{0}) e^{- 2 t} \end{matrix}

by using Grönwall’s lemma [46]:

Lemma A1

(Grönwall). For an interval

I_{0} = [0, \infty)

and a given function f differentiable everywhere in

I_{0}

and satisfying:

\begin{matrix} f^{'} (t) \leq β (t) f (t), t \in I_{0} \end{matrix}

then f is bounded by the solution of the corresponding differential equation

g^{'} (t) = β (t) g (t)

:

\begin{matrix} f (t) \leq f (0) exp (\int_{0}^{t} β (s) d s), t \in I_{0} . \end{matrix}

The bound is nontrivial only if

tr (δ C) \geq 0

. This would be a natural assumption for a Bayesian model, if

C^{0}

is the prior covariance and the eigenvalues of

C^{t}

at

t = \infty

(corresponding to the posterior) are reduced by the data.
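As a sanity check of the trace results, consider the special case (an illustrative assumption made here, not taken from the derivation above) in which the initial covariance is proportional to the target, $C^{0} = c_{0} Σ$ with $c_{0} > 0$. Equation (A6) then preserves the form $C^{t} = c (t) Σ$ and reduces to a scalar logistic equation with a closed-form solution:

\begin{matrix} \frac{d c}{d t} = & 2 c (1 - c), \\ \frac{1}{c (t)} - 1 = & e^{- 2 t} (\frac{1}{c_{0}} - 1), \\ tr (δ C^{t}) = & \frac{(c_{0} - 1) e^{- 2 t}}{c_{0} - (c_{0} - 1) e^{- 2 t}} tr (Σ) . \end{matrix}

The second line reproduces the exact precision result, since $tr ({(C^{t})}^{- 1} - Λ) = (1 / c (t) - 1) tr (Λ)$. For $c_{0} \geq 1$, the denominator in the third line is at least 1, so $tr (δ C^{t}) \leq (c_{0} - 1) e^{- 2 t} tr (Σ) = tr (δ C^{0}) e^{- 2 t}$, in agreement with the bound.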

Appendix F.4. Decay of Fluctuation Part of the Free Energy

Still focusing on the Gaussian model, we can further derive a bound on the free energy. It is easy to see that for the Gaussian case, the free energy in Equation (4) separates into a sum of two terms. The first one depends on the mean

m^{t}

only and the second one only on the fluctuations (i.e.,

C^{t}

).

We will consider the second, nontrivial part only. We assume that the covariance matrix is nonsingular (corresponding to

N > D

). The fluctuation part of the free energy (minus its minimum) is given by

\begin{matrix} F_{f l} = - \frac{1}{2} ln | I - B | - \frac{1}{2} tr (B) \end{matrix}

where we have introduced the matrix

B ≐ I - Λ C

. One can show that its eigenvalues are real and are upper bounded by 1. First, we can show from the equations of motion that

\begin{matrix} \frac{d F_{f l}}{d t} = - tr (B B^{⊤}) \end{matrix}

(A7)

Second, applying the elementary bound

- ln (1 - u) \leq \frac{u}{1 - u}

, valid for

u < 1

, to the eigenvalues of B yields

\begin{matrix} F_{f l} \leq & \frac{1}{2} tr (B {(I - B)}^{- 1} - B) \\ = & \frac{1}{2} tr (B {(I - B)}^{- 1} - B (I - B) {(I - B)}^{- 1}) \\ = & \frac{1}{2} tr (B^{2} {(I - B)}^{- 1}) \\ = & \frac{1}{2} tr (B^{2} C^{- 1} Λ^{- 1}) \leq \frac{1}{2} tr (B^{⊤} Λ^{- 1} B C^{- 1}) \end{matrix}

The last two equalities used the definition

B = I - Λ C

. Since

B^{⊤} Λ^{- 1} B

and

C^{- 1}

are both positive definite, we can bound the last term by (see ([47], Theorem 6.5))

\begin{matrix} F_{f l} \leq \frac{1}{2} tr (B^{⊤} Λ^{- 1} B) tr (C^{- 1}) \leq \\ \frac{1}{2} tr (B B^{⊤}) tr (Λ^{- 1}) tr (C^{- 1}), \end{matrix}

where, in the last line, we have bounded the trace of a product of p.d. matrices a second time.

Combining with Equation (A7) we show that

\begin{matrix} \frac{d F_{f l}}{d t} \leq - \frac{2 F_{f l}}{tr (Λ^{- 1}) tr (C^{- 1})} \end{matrix}

We can plug in our result from Theorem 2:

\begin{matrix} tr (C^{- 1}) = & tr (Λ) + tr (C^{- 1} - Λ) \\ = & tr (Λ) + e^{- 2 t} tr ({(C^{0})}^{- 1} - Λ) \\ \leq & tr (Λ) + e^{- 2 t} | tr ({(C^{0})}^{- 1} - Λ) | \\ \leq & tr (Λ) + | tr ({(C^{0})}^{- 1} - Λ) | \end{matrix}

We can plug this in and use Grönwall’s Lemma A1 to get an exponential bound

\begin{matrix} F_{f l} (C^{t}) \leq F_{f l} (C^{0}) e^{- [\frac{2 t}{tr (Λ^{- 1}) (tr (Λ) + | tr ({(C^{0})}^{- 1} - Λ) |)}]} . \end{matrix}

Appendix F.5. Asymptotic Decay of the Free Energy

For large times t, we can do better. Let us analyse the asymptotic decay constant

F_{f l} ≃ e^{- λ_{f r e e} t}

defined by

\begin{matrix} λ_{f r e e} ≐ - lim_{t \to \infty} \frac{d ln (F_{f l})}{d t} = - lim \frac{\frac{d F_{f l}}{d t}}{F_{f l}} \\ = lim \frac{tr (B B^{⊤})}{- \frac{1}{2} ln | I - B | - \frac{1}{2} tr (B)} \geq \\ lim \frac{tr (B^{2})}{- \frac{1}{2} ln | I - B | - \frac{1}{2} tr (B)} \end{matrix}

In the last inequality, we used

tr (B B^{⊤}) \geq tr (B^{2})

. Everything is expressed by traces of functions of B, and thus by its eigenvalues. Since

B \to 0

as

t \to \infty

(this applies also to its eigenvalues u), we can use Taylor’s expansion

ln (1 - u) + u = - u^{2} / 2 + O (u^{3})

to show that

\begin{matrix} λ_{f r e e} \geq 4 \end{matrix}

which is independent of

Λ

.
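The limit can be illustrated in the special case $C^{0} = c_{0} Σ$ (an assumption made here for illustration only), where Equation (A6) reduces to the scalar equation $d c / d t = 2 c (1 - c)$ for $C^{t} = c (t) Σ$, and every eigenvalue of B equals $u = 1 - c$. The sketch below evaluates the ratio $tr (B B^{⊤}) / F_{f l}$ from the closed-form solution; the function name and the value $c_{0} = 0.5$ are hypothetical.

```python
import math

def decay_rate(t, c0=0.5):
    """-d ln(F_fl)/dt at time t when C^0 = c0 * Sigma (illustrative special case)."""
    # Closed-form solution of dc/dt = 2 c (1 - c): 1/c - 1 = (1/c0 - 1) e^{-2t}.
    c = 1.0 / (1.0 + (1.0 / c0 - 1.0) * math.exp(-2.0 * t))
    u = 1.0 - c                                # every eigenvalue of B = I - Lam C
    # Per-dimension quantities; the common factor D cancels in the ratio.
    tr_BBt = u * u
    F_fl = 0.5 * (-math.log1p(-u) - u)
    return tr_BBt / F_fl

rates = [decay_rate(t) for t in (1.0, 2.0, 4.0)]
```

In this special case the ratio approaches 4 from below at finite times, so the statement concerns the limit of large t rather than every finite t.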

Appendix G. Proof of Theorem 3: Fixed-Points for Gaussian Model (N ≤ D)

Theorem A3

(3). Given a D-dimensional multivariate Gaussian target density

p (x) = N (x | μ, Σ)

, using Algorithm 2 with

N < D + 1

particles, the empirical mean converges to the exact mean μ. The

N - 1

non-zero eigenvalues of

C^{t}

converge to a subset of the spectrum of the target covariance Σ. Furthermore, the global minimum of the regularised version

\tilde{F}

of the free energy (17) corresponds to the largest eigenvalues of Σ.

Applying Equation (A4) to our fixed point equation, we obtain

\begin{matrix} (I - E_{\hat{q}} [g (x) {(x - m)}^{⊤}]) (x_{i} - m) = 0, \forall i = 1, \dots, N \end{matrix}

Hence, the centered positions of the particles, collected in the set

S = {\{x_{i} - m\}}_{i = 1}^{N}

, are all eigenvectors of the matrix

E_{\hat{q}} [g (x) {(x - m)}^{⊤}]

with eigenvalue 1. S spans a space of dimension

N - 1

(since the centered positions sum to zero:

\sum_{i = 1}^{N} (x_{i} - m) = 0

).

If we specialise to a Gaussian target

p (x) = N (x ∣ μ, Σ)

(with

Λ = Σ^{- 1}

), we have

g (x) = Λ (x - μ)

and can reuse the result from Equation (A5):

\begin{matrix} E_{\hat{q}} [g (x) {(x - m)}^{⊤}] = & Λ E_{\hat{q}} [(x - m) {(x - m)}^{⊤}] \\ = & Λ C . \end{matrix}

Using the equality above, we get:

\begin{matrix} Λ C (x_{i} - m) = & (x_{i} - m) \\ C (x_{i} - m) = & Σ (x_{i} - m), \forall i = 1, \dots, N \end{matrix}

which shows that the obtained low-rank covariance C and the target covariance

Σ

have

N - 1

eigenvectors and eigenvalues in common.

However, are these the largest ones? We look at the modified free energy (17) (ignoring the contribution of the mean):

\begin{matrix} min \tilde{F} = & min \{- \frac{1}{2} \sum_{i : λ_{i} > 0} ln λ_{i} + tr (Λ C)\} \end{matrix}

where

λ_{i}

are the eigenvalues of the empirical covariance C. We first note that

tr (Λ C) = N - 1

, independent of which eigenvalues are obtained at the fixed point. This is easily seen by the following argument: if we use the index set

I

for the common eigenvectors

e_{i}

and eigenvalues

λ_{i}

,

i \in I

, we can write

\begin{matrix} C = \sum_{i \in I} e_{i} λ_{i} e_{i}^{⊤} \\ Σ = \sum_{i} e_{i} λ_{i} e_{i}^{⊤} \end{matrix}

From this we obtain

\begin{matrix} tr (Λ C) = tr (\sum_{i \in I} e_{i} λ_{i}^{- 1} λ_{i} e_{i}^{⊤}) = N - 1 \end{matrix}

From this result we obtain

\begin{matrix} min \tilde{F} = & - max \{\frac{1}{2} \sum_{i : λ_{i} > 0} ln λ_{i}\} + (N - 1) . \end{matrix}

The term

N - 1

is a constant, but the first term makes a difference: The absolute minimum of

\tilde{F}

is achieved when the

λ_{i}

are the

N - 1

largest eigenvalues of

Σ

. Our simulations empirically show that the algorithm usually converges to the absolute minimum.
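The argument above can be illustrated by enumerating all candidate fixed points for a small problem. The following is a minimal sketch, assuming a randomly generated target covariance and the modified free energy exactly as written in the text (mean contribution ignored); the helper names and the sizes D = 5, N = 3 are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
D, N = 5, 3                                    # N <= D, so C has rank N - 1 = 2
M = rng.normal(size=(D, D))
Sigma = M @ M.T + np.eye(D)                    # hypothetical SPD target covariance
Lam = np.linalg.inv(Sigma)
eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order

def fixed_point_covariance(index_set):
    """C = sum_{i in I} lambda_i e_i e_i^T for a chosen index set I."""
    idx = list(index_set)
    E = eigvecs[:, idx]
    return E @ np.diag(eigvals[idx]) @ E.T

def modified_free_energy(index_set):
    """-1/2 * sum of log nonzero eigenvalues of C, plus tr(Lam C)."""
    C = fixed_point_covariance(index_set)
    log_term = -0.5 * np.sum(np.log(eigvals[list(index_set)]))
    return log_term + np.trace(Lam @ C)

subsets = list(itertools.combinations(range(D), N - 1))
traces = [np.trace(Lam @ fixed_point_covariance(s)) for s in subsets]
best = min(subsets, key=modified_free_energy)  # index set with the lowest value
```

Every candidate gives the same trace term, so the ranking is decided by the log-eigenvalue sum alone, and the minimiser selects the indices of the largest eigenvalues.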

Appendix H. Dimension-Wise Optimizers

Here, we list some of the most popular optimizers and their dimension-wise versions. In all algorithms, we denote by

φ

the matrix created by the concatenation of the flow of each particle:

φ = [φ_{1}, \dots, φ_{N}]

, where

φ_{n} = φ (x_{n})

. We additionally use the notation

φ_{n, i}

for the i-th dimension of the flow of the n-th particle. The main differences between the original algorithms and their modified version were put in red.

Appendix H.1. ADAM

The ADAM algorithm is given by:

Algorithm A1: ADAM

Input:

φ^{t}, m^{t - 1}, v^{t - 1}, β_{1}, β_{2}, η

Output:

Δ

m_{n, d}^{t} = β_{1} m_{n, d}^{t - 1} + (1 - β_{1}) φ_{n, d}^{t}

v_{n, d}^{t} = β_{2} v_{n, d}^{t - 1} + (1 - β_{2}) {(φ_{n, d}^{t})}^{2}

Δ_{n, d} = η \frac{m_{n, d}^{t}}{(1 - β_{1}^{t}) (\sqrt{v_{n, d}^{t} {(1 - β_{2}^{t})}^{- 1}} + ϵ)}

Algorithm A2: Dimension-wise ADAM

Input:

φ^{t}, m^{t - 1}, v^{t - 1}, β_{1}, β_{2}, η

Output:

Δ

m_{n, d}^{t} = β_{1} m_{n, d}^{t - 1} + (1 - β_{1}) φ_{n, d}^{t}

v_{d}^{t} = β_{2} v_{d}^{t - 1} + (1 - β_{2}) \frac{1}{N} \sum_{n = 1}^{N} {(φ_{n, d}^{t})}^{2}

Δ_{n, d} = η \frac{m_{n, d}^{t}}{(1 - β_{1}^{t}) (\sqrt{v_{d}^{t} {(1 - β_{2}^{t})}^{- 1}} + ϵ)}
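Algorithm A2 can be sketched in a few lines. The following is a minimal illustration, assuming NumPy, a (N, D) array layout with one row per particle, and the usual ADAM default hyperparameters; the function name, the layout, and the defaults are assumptions made here rather than details given in the text.

```python
import numpy as np

def dimwise_adam_step(phi, m, v, t, beta1=0.9, beta2=0.999, eta=1e-3, eps=1e-8):
    """One step of the dimension-wise ADAM variant.

    phi : (N, D) flow values, row n for particle n (hypothetical layout)
    m   : (N, D) first-moment estimate, one entry per particle and dimension
    v   : (D,)   second-moment estimate, shared by all particles
    t   : iteration counter, starting at 1
    """
    m = beta1 * m + (1 - beta1) * phi
    # The only change from ADAM: squares are averaged over the particles.
    v = beta2 * v + (1 - beta2) * np.mean(phi ** 2, axis=0)
    v_hat = v / (1 - beta2 ** t)
    delta = eta * m / ((1 - beta1 ** t) * (np.sqrt(v_hat) + eps))
    return delta, m, v

rng = np.random.default_rng(0)
N, D = 4, 3
phi = rng.normal(size=(N, D))
delta, m, v = dimwise_adam_step(phi, np.zeros((N, D)), np.zeros(D), t=1)
```

Because `v` has one entry per dimension, all particles are rescaled by the same factor in each dimension, so the step acts on the particle cloud as a common diagonal scaling.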

Appendix H.2. AdaGrad

The AdaGrad algorithm is given by:

Algorithm A3: AdaGrad

Input:

φ^{t}, v^{t - 1}, η

Output:

Δ

v_{n, d}^{t} = v_{n, d}^{t - 1} + {(φ_{n, d}^{t})}^{2}

Δ_{n, d} = η \frac{φ_{n, d}^{t}}{\sqrt{v_{n, d}^{t} + ϵ}}

Algorithm A4: Dimension-wise AdaGrad

Input:

φ^{t}, v^{t - 1}, η

Output:

Δ

v_{d}^{t} = v_{d}^{t - 1} + \frac{1}{N} \sum_{n = 1}^{N} {(φ_{n, d}^{t})}^{2}

Δ_{n, d} = η \frac{φ_{n, d}^{t}}{\sqrt{v_{d}^{t}} + ϵ}

Appendix H.3. RMSProp

The RMSProp algorithm is given by:

Algorithm A5: RMSProp

Input:

φ^{t}, v^{t - 1}, ρ, η

Output:

Δ

v_{n, d}^{t} = ρ v_{n, d}^{t - 1} + (1 - ρ) {(φ_{n, d}^{t})}^{2}

Δ_{n, d} = η \frac{φ_{n, d}^{t}}{\sqrt{v_{n, d}^{t}} + ϵ}

Algorithm A6: Dimension-wise RMSProp

Input:

φ^{t}, v^{t - 1}, ρ, η

Output:

Δ

v_{d}^{t} = ρ v_{d}^{t - 1} + (1 - ρ) \frac{1}{N} \sum_{n = 1}^{N} {(φ_{n, d}^{t})}^{2}

Δ_{n, d} = η \frac{φ_{n, d}^{t}}{\sqrt{v_{d}^{t}} + ϵ}
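The dimension-wise AdaGrad and RMSProp updates (Algorithms A4 and A6) differ only in how the shared accumulator is updated. The following is a minimal sketch, assuming NumPy and a (N, D) array layout with one row per particle; the function names and default hyperparameters are hypothetical.

```python
import numpy as np

def dimwise_adagrad_step(phi, v, eta=1e-2, eps=1e-8):
    """Dimension-wise AdaGrad: v (shape (D,)) accumulates particle-averaged squares."""
    v = v + np.mean(phi ** 2, axis=0)
    return eta * phi / (np.sqrt(v) + eps), v

def dimwise_rmsprop_step(phi, v, rho=0.9, eta=1e-2, eps=1e-8):
    """Dimension-wise RMSProp: the same idea with an exponential moving average."""
    v = rho * v + (1 - rho) * np.mean(phi ** 2, axis=0)
    return eta * phi / (np.sqrt(v) + eps), v

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))                  # (N, D): 4 particles, 3 dimensions
d_ada, v_ada = dimwise_adagrad_step(phi, np.zeros(3))
d_rms, v_rms = dimwise_rmsprop_step(phi, np.zeros(3))
```

In both cases the ratio between the step and the raw flow is identical for every particle within a dimension, in contrast to the per-particle accumulators of Algorithms A3 and A5.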

Appendix I. Additional Figures

Appendix I.1. Bayesian Logistic Regression

Similarly to the previous section, we also show results with the RMSProp optimizer with learning rate

1 \times 10^{- 4}

.

Figure A1. Similarly to Figure 6, we show the average negative log-likelihood on a test-set over 10 runs against training time on different datasets for a Bayesian logistic regression problem. The dashed curve represents the low-rank approximation with RMSProp for methods based on stochastic estimators.

Appendix I.2. Bayesian Neural Network

Figure A2. Convergence of the classification error and average negative log-likelihood as a function of time.

Figure A3. Accuracy vs. confidence. Every test sample is binned according to its highest predictive probability, and the accuracy within each bin is then computed. A perfectly calibrated estimator would return the identity.

References

1. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2016, 104, 148–175.
2. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin–Madison: Madison, WI, USA, 2009.
3. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
4. Bardenet, R.; Doucet, A.; Holmes, C. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 2017, 18, 1515–1557.
5. Cowles, M.K.; Carlin, B.P. Markov chain Monte Carlo convergence diagnostics: A comparative review. J. Am. Stat. Assoc. 1996, 91, 883–904.
6. Barber, D.; Bishop, C.M. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; pp. 395–401.
7. Graves, A. Practical Variational Inference for Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Volume 24, pp. 2348–2356.
8. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822.
9. Liu, Q.; Lee, J.; Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 276–284.
10. Liu, Q.; Wang, D. Stein variational gradient descent as moment matching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 32, pp. 8868–8877.
11. Zhuo, J.; Liu, C.; Shi, J.; Zhu, J.; Chen, N.; Zhang, B. Message Passing Stein Variational Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 6018–6027.
12. Opper, M.; Archambeau, C. The variational Gaussian approximation revisited. Neural Comput. 2009, 21, 786–792.
13. Challis, E.; Barber, D. Gaussian Kullback-Leibler approximate inference. J. Mach. Learn. Res. 2013, 14, 2239–2286.
14. Titsias, M.; Lázaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1971–1979.
15. Ong, V.M.H.; Nott, D.J.; Smith, M.S. Gaussian variational approximation with a factor covariance structure. J. Comput. Graph. Stat. 2018, 27, 465–478.
16. Tan, L.S.; Nott, D.J. Gaussian variational approximation with sparse precision matrices. Stat. Comput. 2018, 28, 259–275.
17. Lin, W.; Schmidt, M.; Khan, M.E. Handling the Positive-Definite Constraint in the Bayesian Learning Rule. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 6116–6126.
18. Hinton, G.E.; van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; COLT ’93; Association for Computing Machinery: New York, NY, USA, 1993; pp. 5–13.
19. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
20. Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276.
21. Khan, M.E.; Nielsen, D. Fast yet simple natural-gradient descent for variational inference in complex models. In Proceedings of the International Symposium on Information Theory and Its Applications (ISITA), Singapore, 28–31 October 2018; pp. 31–35.
22. Lin, W.; Khan, M.E.; Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3992–4002.
23. Salimbeni, H.; Eleftheriadis, S.; Hensman, J. Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Lanzarote, Canary Islands, 9–11 April 2018; pp. 689–697.
24. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv 2016, arXiv:1608.04471.
25. Ba, J.; Erdogdu, M.A.; Ghassemi, M.; Suzuki, T.; Sun, S.; Wu, D.; Zhang, T. Towards Characterizing the High-dimensional Bias of Kernel-based Particle Inference Algorithms. In Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, Vancouver, BC, Canada, 8 December 2019.
26. Tomczak, M.; Swaroop, S.; Turner, R. Efficient Low Rank Gaussian Variational Inference for Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33.
27. Maddox, W.J.; Izmailov, P.; Garipov, T.; Vetrov, D.P.; Wilson, A.G. A simple baseline for Bayesian uncertainty in deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13153–13164.
28. Evensen, G. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res. Oceans 1994, 99, 10143–10162.
29. Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1530–1538.
30. Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing, Montréal, QC, Canada, 3–8 December 2018; pp. 6572–6583.
31. Ingersoll, J.E. Theory of Financial Decision Making; Rowman & Littlefield: Lanham, MD, USA, 1987; Volume 3.
32. Barfoot, T.D.; Forbes, J.R.; Yoon, D.J. Exactly sparse Gaussian variational inference with application to derivative-free batch nonlinear state estimation. Int. J. Robot. Res. 2020, 39, 1473–1502.
33. Korba, A.; Salim, A.; Arbel, M.; Luise, G.; Gretton, A. A Non-Asymptotic Analysis for Stein Variational Gradient Descent. In Proceedings of the 32nd International Conference on Neural Information Processing, Virtual, 6–12 December 2020; Volume 33, pp. 4672–4682.
34. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
35. Zaki, N.; Galy-Fajou, T.; Opper, M. Evidence Estimation by Kullback-Leibler Integration for Flow-Based Methods. In Proceedings of the Third Symposium on Advances in Approximate Bayesian Inference, Virtual Event, January–February 2021.
36. Bezanson, J.; Edelman, A.; Karpinski, S.; Shah, V.B. Julia: A fresh approach to numerical computing. SIAM Rev. 2017, 59, 65–98.
37. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop, Coursera: Neural Networks for Machine Learning; Technical Report; University of Toronto: Toronto, ON, USA, 2012.
38. Zhang, G.; Li, L.; Nado, Z.; Martens, J.; Sachdeva, S.; Dahl, G.; Shallue, C.; Grosse, R.B. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 8196–8207.
39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
40. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets.php (accessed on 28 July 2021).
41. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375.
42. LeCun, Y. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 20 July 2021).
43. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
44. Liu, C.; Zhuo, J.; Cheng, P.; Zhang, R.; Zhu, J. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4082–4092.
45. Zhu, M.H.; Liu, C.; Zhu, J. Variance Reduction and Quasi-Newton for Particle-Based Variational Inference. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020.
46. Gronwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann. Math. 1919, 20, 292–296.
47. Zhang, F. Matrix Theory: Basic Results and Techniques; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.

Figure 1. Illustration of the Gaussian Particle Flow algorithm, with

q^{0} (x)

and

p (x)

representing the initial and target distribution respectively. Particles are iteratively moved according to the gradient flow starting from

q^{0} (x)

, approximating a new Gaussian distribution

q^{t} (x)

at each iteration t.

Figure 2.

L^{2}

norm of the difference between the target mean

μ

(left side) and target covariance

Σ

(right side) with the inferred variational parameters

m^{t}

and

C^{t}

against time for 20-dimensional Gaussian targets with condition number

κ

. We use

D + 1

particles/samples and show the mean over 10 runs as well as the 68% credible interval. Methods with dashed curves use natural gradients on the mean. Note that DSVI, GF and FCS are overlapping and are, at this scale, indistinguishable from one another.

Figure 3. Trace error for a Gaussian target with

D = 50

and condition numbers

κ

for a varying number of particles with GPF. Predictions from Theorem 3 are shown in dashed-black.

Figure 4. Convergence plot of low-rank methods for a 500-dimensional multivariate Gaussian target with effective rank

K \in \{10, 20, 30\}

. The rank of each method is fixed as 20. The difference in the starting point for the covariance is due to the initialization difference between each method. We show the mean over 10 runs for each method with shadowed areas representing the 68% credible interval.

Figure 5. Two-dimensional Banana distribution. Comparison of GPF using an increasing number of particles and a different optimizer (ADAM) with the standard VGA (rightmost plot).

Figure 6. Average negative log-likelihood on a test set over 10 runs against training time for a Bayesian logistic regression model applied to different datasets. Top plots use a mean-field approximation, while bottom plots use a low-rank structure for the covariance with rank

L = 100

.

Table 1. Negative Log-Likelihood (NLL), Accuracy (Acc), and Expected Calibration Error (ECE) for a Bayesian Neural Network (BNN) on the MNIST dataset. We varied the rank of the variational covariance from mean-field (all variables are independent) to a low-rank structure with

L \in \{5, 10\}

. Bold numbers indicate the best performance, and italic bold numbers indicate the best performance when restricted to VGA methods. Convergence and calibration plots can be found in Appendix I.

Alg.	Mean-Field			L = 5			L = 10
	NLL	Acc	ECE	NLL	Acc	ECE	NLL	Acc	ECE
GPF	0.183	0.95	0.0384	0.166	0.96	0.0918	0.172	0.955	0.0869
GPF (Layers)	-	-	-	0.147	0.958	0.0181	0.178	0.952	0.0395
GF	0.178	0.953	0.0706	0.185	0.956	0.136	0.171	0.952	0.0455
DSVI	0.204	0.945	0.11	-	-	-	-	-	-
SVGD (Sq. Exp)	-	-	-	0.139	0.965	0.0732	0.133	0.967	0.0879
SWAG	-	-	-	0.257	0.957	0.0662	0.287	0.956	0.0878
ELRGVI	-	-	-	0.453	0.901	0.53	0.537	0.882	0.777

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

