Geometric Variational Inference

Efficiently accessing the information contained in non-linear and high dimensional probability distributions remains a core challenge in modern statistics. Traditionally, estimators that go beyond point estimates are either categorized as Variational Inference (VI) or Markov-Chain Monte-Carlo (MCMC) techniques. While MCMC methods that utilize the geometric properties of continuous probability distributions to increase their efficiency have been proposed, VI methods rarely use the geometry. This work aims to fill this gap and proposes geometric Variational Inference (geoVI), a method based on Riemannian geometry and the Fisher information metric. It is used to construct a coordinate transformation that relates the Riemannian manifold associated with the metric to Euclidean space. The distribution, expressed in the coordinate system induced by the transformation, takes a particularly simple form that allows for an accurate variational approximation by a normal distribution. Furthermore, the algorithmic structure allows for an efficient implementation of geoVI which is demonstrated on multiple examples, ranging from low-dimensional illustrative ones to non-linear, hierarchical Bayesian inverse problems in thousands of dimensions.


Introduction
In modern statistical inference and machine learning it is of utmost importance to access the information contained in complex and high dimensional probability distributions. In particular in Bayesian inference, it remains one of the key challenges to approximate samples from the posterior distribution, or the distribution itself, in a computationally fast and accurate way. Traditionally, there have been two distinct approaches towards this problem: the direct construction of posterior samples based on Markov Chain Monte-Carlo (MCMC) methods [1][2][3], and the attempt to approximate the probability distribution with a different one, chosen from a family of simpler distributions, known as variational inference (VI) [4][5][6][7] or variational Bayes (VB) methods [8][9][10]. While MCMC methods are attractive due to their theoretical guarantees to reproduce the true distribution in the limit, they tend to be more expensive compared to variational alternatives. On the other hand, the family of distributions used in VI is typically chosen ad hoc. While VI aims to provide an appropriate approximation within the chosen family, the entire family may be a poor approximation to the true distribution.
In recent years, MCMC methods have been improved by incorporating geometric information of the posterior, especially by means of Riemannian manifold Hamiltonian Monte-Carlo (RMHMC) [11], a particular hybrid Monte-Carlo (HMC) [12,13] technique that constructs a Hamiltonian system on a Riemannian manifold with a metric tensor related to the Fisher information metric of the likelihood distribution and the curvature of the prior. For VI methods, however, the geometric structure of the true distribution has rarely been utilized to motivate and enhance the family of distributions used during optimization. One of the few examples is [14], where the Fisher metric has been used to reformulate the task of VI by means of α-divergences in the mean-field setting.
In addition, a powerful variational approximation technique for the family of normal distributions utilizing infinitesimal geometric properties of the posterior is Metric Gaussian Variational Inference (MGVI) [15]. In MGVI the family is parameterized in terms of the mean m, and the covariance matrix is set to the inverse of the metric tensor evaluated at m. This choice ensures that the true distribution and the approximation share the same geometric properties infinitesimally, i.e., at the location of the mean m. In this work we extend the geometric correspondence used by MGVI to be valid not only at m, but also in a local neighborhood of m. We achieve this extension by means of an invertible coordinate transformation from the coordinate system used within MGVI, in which the curvature of the prior is the identity, to a new coordinate system in which the metric of the posterior becomes (approximately) the Euclidean metric. We use a normal distribution in these coordinates as the approximation to the true distribution and thereby establish a non-Gaussian posterior in the MGVI coordinate system. The resulting algorithm, called geometric Variational Inference (geoVI), can be computed efficiently and is inherently similar to the implementation of MGVI. This is not by mere coincidence: to linear order, geoVI reproduces MGVI. In this sense, the geoVI algorithm is a non-linear generalization of MGVI that captures the geometric properties encoded in the posterior metric not only infinitesimally, but also in a local neighborhood of this point. We include an implementation of the proposed geoVI algorithm into the software package Numerical Information Field Theory (NIFTy [16]), a versatile library for signal inference algorithms.

Mathematical Setup
Throughout this work, we consider the joint distribution P(d, s) of observational data d ∈ Ω and the unknown, to be inferred, signal s. This distribution is factorized into the likelihood of observing the data given the signal, P(d|s), and the prior distribution P(s). In general, only a subset of the signal, denoted as s′, may be directly constrained by the likelihood, such that P(d|s) = P(d|s′), and therefore there may be additional hidden variables in s that are unobserved by the data but part of the prior model. Thus, the prior distribution P(s) may possess a hierarchical structure that summarizes our knowledge about the system prior to the measurement, and s represents everything in the system that is of interest to us, but about which our knowledge is uncertain a priori. We do not put any constraints on the functional form of P(s), and assume that the signal s solely consists of continuous, real valued variables, i.e., s ∈ X ⊂ R^M. This enables us to regard s as coordinates of the space on which P(s) is defined and to use geometric concepts such as coordinate transformations to represent probability distributions in different coordinate systems. Probability densities transform in a probability mass preserving fashion. Specifically, let f : R^M → X be an invertible function and let s = f(ξ). Then the distributions P(s) and P(ξ) relate via

P(s) ds = P(ξ) dξ .

This allows us to express P(s) by means of the pushforward of P(ξ) by f. We denote the pushforward as

P(s) = (f_* P(ξ))(s) = ∫ δ(s − f(ξ)) P(ξ) dξ = P(ξ) |df/dξ|^{−1} |_{ξ = f^{−1}(s)} .

Under mild regularity conditions on the prior distribution, there always exists an f that relates the complex hierarchical form of P(s) to a simple distribution P(ξ) [17]. We choose f such that P(ξ) takes the form of a normal distribution with zero mean and unit covariance, P(ξ) = N(ξ; 0, 1), and call such a distribution a standard distribution, where N(ξ; m, D) denotes a multivariate normal distribution in the random variables ξ with mean m and covariance D.
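The change of variables above can be illustrated numerically. The following sketch (a toy example, not part of the original text; the log-normal map f and the value of sigma are hypothetical choices) samples ξ from a standard distribution, pushes it through f, and compares the empirical density of s = f(ξ) against the analytic pushforward P(s) = P(ξ(s)) |df/dξ|^{−1}:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def f(xi):
    # hypothetical invertible generative map: standard normal -> log-normal signal
    return np.exp(sigma * xi)

def f_inv(s):
    return np.log(s) / sigma

def pushforward_density(s):
    # P(s) = P(xi(s)) |df/dxi|^{-1}, with P(xi) = N(xi; 0, 1)
    xi = f_inv(s)
    p_xi = np.exp(-0.5 * xi**2) / np.sqrt(2.0 * np.pi)
    df_dxi = sigma * np.exp(sigma * xi)
    return p_xi / df_dxi

samples = f(rng.standard_normal(200_000))
counts, edges = np.histogram(samples, bins=np.linspace(0.1, 5.0, 50))
hist = counts / (samples.size * np.diff(edges))      # empirical density on the grid
centers = 0.5 * (edges[:-1] + edges[1:])
max_err = np.max(np.abs(hist - pushforward_density(centers)))
```

The empirical histogram and the analytic pushforward agree up to sampling noise, which is the probability-mass-preserving property stated above.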
We may express the likelihood in terms of ξ as P(d|ξ) = P(d|s′ = f′(ξ)), where f′ is the part of f that maps onto the observed quantities s′. In general, f′ is a non-invertible function and is commonly referred to as the generative model or generative process, as it encodes all the information necessary to transform a standard distribution into the observed quantities, subject to our prior beliefs. Using Equation (4) we get, by means of Bayes' theorem, that the posterior takes the form

P(ξ|d) = P(ξ, d) / P(d) = P(d|ξ) N(ξ; 0, 1) / P(d) .
Using the push-forward of the posterior, we can recover the posterior statistics of s via P(s|d) = (f_* P(ξ|d))(s), which means that we can fully recover the posterior properties of s, which typically has a physical interpretation as opposed to ξ. In particular, Equation (6) implies that if we are able to draw samples from P(ξ|d), we can simply generate posterior samples for s, since s = f(ξ).

Geometric Properties of Posterior Distributions
In order to access the information contained in the posterior distribution P(ξ|d), in this work, we wish to exploit the geometric properties of the posterior, in particular with the help of Riemannian geometry. Specifically, we define a Riemannian manifold using a metric tensor, related to the Fisher Information metric of the likelihood and a metric for the prior, and establish a (local) isometry of this manifold to Euclidean space. The associated coordinate transformation gives rise to a coordinate system in which, hopefully, the posterior takes a simplified form despite the fact that probabilities do not transform in the same way as metric spaces do. As we will see, in cases where the isometry is global, and in addition the transformation is (almost) volume-preserving, the complexity of the posterior distribution can be absorbed (almost) entirely into this transformation.
To begin our discussion, we need to define an appropriate metric for posterior distributions. To this end, consider the negative logarithm of the posterior, sometimes also referred to as information Hamiltonian, which takes the form H(ξ|d) ≡ − log(P(ξ|d)) = H(d|ξ) + H(ξ) − H(d) .
A common choice to extract geometric information from this Hamiltonian is the Hessian C of H. Specifically,

C(ξ) ≡ ∂²H(ξ|d) / (∂ξ ∂ξ^T) = C_{d|ξ} + 1 ,

where C_{d|ξ} denotes the Hessian of the likelihood Hamiltonian H(d|ξ) and the identity matrix arises from the curvature of the prior (information Hamiltonian).
While C provides information about the local geometry, it turns out to be unsuited for our approach to construct a coordinate transformation, as it is not guaranteed to be positive definite for all ξ. An alternative, positive definite measure for the curvature can be obtained by replacing the Hessian of the likelihood with its Fisher information metric [18], defined as

M_{d|ξ} ≡ ⟨ (∂H(d|ξ)/∂ξ) (∂H(d|ξ)/∂ξ)^T ⟩_{P(d|ξ)} .

The Fisher information metric can be understood as a Riemannian metric defined over the statistical manifold associated with the likelihood [19], and is a core element in the field of information geometry [20], as it provides a distance measure between probability distributions [21]. Replacing C_{d|ξ} with M_{d|ξ} in Equation (8) we find

M(ξ) = M_{d|ξ} + 1 ,

which, from now on, we refer to as the metric M. As the Fisher metric of the likelihood is a symmetric, positive-semidefinite matrix, M is a symmetric, positive-definite matrix for all ξ. It is noteworthy that, upon insertion, the metric M is the expectation value of the Hessian of the posterior Hamiltonian C w.r.t. the likelihood P(d|ξ). Therefore, in some way, we may regard M as the measure for the curvature in case the observed data d is unknown and the only information given is the structure of the model itself, as encoded in P(d|ξ). This connection is only of qualitative nature, but it highlights a key limitation of M when used as the defining property of the posterior geometry. From a Bayesian perspective, only the data d that is actually observed is of relevance, as the posterior is conditioned on d. Therefore a curvature measure that arises from marginalization over the data must be sub-optimal compared to a measure conditioned on the data, as it ignores the local information that we gain from observing d. Nevertheless, we find that in many practical applications M encodes enough relevant information about the posterior geometry to provide us with a valuable metric to construct a coordinate transformation.
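The additive structure M = M_{d|ξ} + 1 can be made concrete with a minimal matrix-free sketch (a toy diagonal forward model with an assumed Gaussian likelihood; not the NIFTy implementation). The metric is evaluated purely through matrix-vector products, which is the access pattern needed later for high-dimensional problems:

```python
import numpy as np

# hypothetical nonlinear forward model s'(xi), acting componentwise
def forward(xi):
    return np.sin(xi) + 0.1 * xi**2

def jac(xi, v):
    # Jacobian-vector product of the (diagonal) forward model
    return (np.cos(xi) + 0.2 * xi) * v

noise_var = 0.3**2   # Gaussian noise variance N

def metric_mvp(xi, v):
    # M v = J^T N^{-1} J v + v: Fisher metric of the likelihood plus prior identity
    return jac(xi, jac(xi, v) / noise_var) + v

xi = np.array([0.5, -1.0, 2.0])
v = np.array([1.0, 2.0, -1.0])
Mv = metric_mvp(xi, v)
# for this diagonal model: M_ii = (cos(xi_i) + 0.2 xi_i)^2 / noise_var + 1
diag = (np.cos(xi) + 0.2 * xi)**2 / noise_var + 1.0
```

Because the model here is diagonal, the implicit MVP can be checked against the explicit diagonal of M.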
It is noteworthy that attempts have been made to resolve this issue via a more direct approach that recovers a positive definite matrix from the Hessian of the posterior while retaining the local information of the data. For example, in [22] the SoftAbs non-linearity is applied to the Hessian and the resulting positive definite matrix is used as a curvature measure. In our practical applications, however, we are particularly interested in solving very high dimensional problems, and applying a matrix non-linearity is currently too expensive to give rise to a scalable algorithm for our purposes. Therefore we rely on the metric M as a measure for the curvature of the posterior, and leave possible extensions to future research.

Coordinate Transformation
Our goal is to construct a coordinate system y and an associated transformation g, mapping from ξ to y, in which the posterior metric M takes the form of the identity matrix 1. The motivation is that if M captures the geometric properties of the posterior, a coordinate system in which this metric becomes trivial should also be a coordinate system in which the posterior distribution takes a particularly simple form. For an illustrative example see Figure 1. To do so, we require the Fisher metric of the likelihood M_{d|ξ} to be the pullback of the Euclidean metric. Specifically, we propose a function x(ξ) such that

M_{d|ξ} = (∂x/∂ξ)^T (∂x/∂ξ) ,

where T denotes the adjoint of a matrix. As outlined in Appendix A, for many practically relevant likelihoods such a decomposition is possible by means of an inexpensive-to-evaluate function x (here, "inexpensive" means that applying the function x(ξ) has a computational cost similar to applying the likelihood function P(d|ξ) to a specific ξ). Given x, we can rewrite the posterior metric M as

M(ξ) = (∂x/∂ξ)^T (∂x/∂ξ) + 1 .

In order to relate this metric to Euclidean space, we aim to find the isometry g that relates the Riemannian manifold associated with the metric M to Euclidean space. Specifically, we seek an invertible function g satisfying

M(ξ) = (∂g/∂ξ)^T (∂g/∂ξ) .   (13)

In general, i.e., for a general function x(ξ), this decomposition does not exist globally. Nevertheless, there exists a transformation g(ξ; ξ̄), based on an approximate Taylor series around an expansion point ξ̄, that results in a metric

M̃(ξ) ≡ (∂g(ξ; ξ̄)/∂ξ)^T (∂g(ξ; ξ̄)/∂ξ) ,   (14)

such that M̃(ξ) ≈ M(ξ) in the vicinity of ξ̄. This transformation g can be obtained, up to an integration constant, by Taylor expanding Equation (14) around ξ̄ and solving for the Taylor coefficients of g in increasing order. We express g in terms of its Taylor series using the Einstein summation convention, where repeated indices are summed over, a_{,i} denotes the partial derivative of a w.r.t. the ith component of ξ, and s̄ denotes a (tensor) field s(ξ) evaluated at the expansion point ξ̄.
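For a Gaussian likelihood N(d; f(ξ), σ_n²), one valid choice of the decomposition (a sketch under that assumption; the forward model below is hypothetical) is x(ξ) = f(ξ)/σ_n, since the Fisher metric J^T σ_n^{−2} J then equals (∂x/∂ξ)^T (∂x/∂ξ). A quick numerical check in one dimension:

```python
import numpy as np

noise_std = 0.3

def forward(xi):
    # hypothetical 1-d forward model f(xi)
    return np.exp(xi)

def x_of_xi(xi):
    # for a Gaussian likelihood N(d; f(xi), noise_std^2) one may choose
    # x(xi) = f(xi) / noise_std, so that M_{d|xi} = (dx/dxi)^T (dx/dxi)
    return forward(xi) / noise_std

def numerical_deriv(func, xi, eps=1e-6):
    # central finite difference
    return (func(xi + eps) - func(xi - eps)) / (2.0 * eps)

xi0 = 0.7
Jx = numerical_deriv(x_of_xi, xi0)
# 1-d Fisher metric of the Gaussian likelihood: J^T N^{-1} J with J = df/dxi
fisher = numerical_deriv(forward, xi0)**2 / noise_std**2
```

The squared Jacobian of x reproduces the Fisher metric, which is exactly the pullback property required of x.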
We begin by expanding Equation (14) around ξ̄, which at zeroth order fixes the product ḡ^k_{,i} ḡ^k_{,j} = M̄_{ij} (Equation (16)). Expanding Equation (14) to first order yields a condition on the second derivative tensor ḡ^i_{,jk} (Equation (17)), which is solved in Equation (18) in terms of M̄^{ij}, the components of the inverse of M̄. Thus, to first order in the metric (meaning to second order in the transformation), the expansion remains solvable for a general x. Proceeding with the second order, however, we find that the resulting condition (Equation (19)) does not exhibit a general solution for ḡ^i_{,jkl} in higher dimensions, due to the fact that the third derivative has to be invariant under arbitrary permutations of the latter three indices jkl. However, in analogy to Equation (18), we may choose ḡ^i_{,jkl} such that the first and the last term of Equation (19) cancel, and study the remaining error contribution (Equation (21)). Letting X^i_j ≡ x̄^i_{,j}, this error can be expressed in terms of the matrix M̄^{−1} = (1 + X^T X)^{−1} (Equation (23)). The impact of this error contribution can be qualitatively studied using the spectrum λ(M) of the matrix M. This spectrum may exhibit two extreme cases: a so-called likelihood dominated regime, where λ(X X^T) ≫ 1, and a prior dominated regime, where λ(X X^T) ≪ 1. In the likelihood dominated regime we get that λ(M) ≫ 1, and thus the contribution of the error is small, whereas in the prior dominated regime λ(M) ≈ 1, which yields an O(1) error. However, in the prior dominated regime the entire metric M is close to the identity, as we are in the standard coordinate system of the prior ξ, and therefore higher order derivatives of x are small. As a consequence, the error is of the order O(1) only in regimes where the third (and higher) order of the expansion is negligible compared to the first and second order. An exception occurs when the expansion point is close to a saddle point of x, i.e., in cases where the first derivative of x becomes small (and therefore the metric is close to the identity), but higher order derivatives of x may be large.
For the moment, we proceed under the assumption that the change of x, as a function of ξ, is sufficiently monotonic throughout the expansion regime. We discuss the implications of violating this assumption in Section 5.3.

Figure 1: Non-linear posterior distribution P(ξ|d) in the standard coordinate system of the prior distribution ξ (left) and the transformed distribution P(y|d) (right) in the coordinate system y, where the posterior metric becomes (approximately) the identity matrix. P(y|d) is obtained from P(ξ|d) via the push-forward through the transformation g which relates the two coordinate systems. The functional form of g is derived in Section 2.1 and depends on an expansion point ξ̄ (orange dot in the left image); g is set up such that ξ̄ coincides with the origin in y. To visualize the transformation, the coordinate lines of y (black mesh grid on the right) are transformed back into ξ-coordinates using the inverse coordinate transformation g^{−1} and are displayed as a black mesh in the original space on the left. In addition, note that while the transformed posterior P(y|d) arguably takes a simpler form compared to P(ξ|d), it does not become trivial (e.g., identical to a standard distribution), as there remain small asymmetries in the posterior density. There are multiple reasons for these deviations, which are discussed in more detail in Section 2.2 once we have established how the transformation g is constructed.
If we proceed to higher order expansions of Equation (14), a repetitive picture emerges: the leading order derivative tensor ḡ^i_{,j...} that appears in the expansion may be set in direct analogy to Equations (18) and (20), where ... denotes the higher order derivatives. The remaining error contributions at each order take a similar form as in Equation (23), where the matrix M reappears in between all possible combinations of the remaining derivatives of x that arise from the Leibniz rule. Note that for increasing order, the number of terms that contribute to the error also increases. Specifically, for the nth order expansion of Equation (14) we get m = ∑_{k=1}^{n−1} (n choose k) contributions to the error. Therefore, even if each individual contribution by means of M is small, the expansion error eventually becomes large once high order expansions become relevant. Therefore, the proposed approximation only remains locally valid around ξ̄.
Nevertheless, we may proceed to combine the derivative tensors of g determined above in order to obtain the Jacobian of the transformation g. From the zeroth order, Equation (16), we get that ḡ_{,} = √M̄ up to a unitary transformation, and we can sum up the Taylor series in x of Equation (26) to arrive at an index-free representation of the Jacobian:

∂g(ξ; ξ̄)/∂ξ = (√M̄)^{−1} (1 + (∂x/∂ξ |_{ξ̄})^T (∂x/∂ξ)) .

Upon integration, this yields a transformation

g(ξ; ξ̄) = (√M̄)^{−1} [ (ξ − ξ̄) + (∂x/∂ξ |_{ξ̄})^T (x(ξ) − x(ξ̄)) ] + const.   (28)

The resulting transformation takes an intuitive form: the approximation to the distance between a point g(ξ) and the transformed expansion point g(ξ̄) consists of the distance w.r.t. the prior measure, ξ − ξ̄, and the distance w.r.t. the likelihood measure, x(ξ) − x(ξ̄), back-projected into the prior domain using the local transformation at ξ̄. Finally, the metric at ξ̄ is used as a measure for the local curvature. Equation (28) is only defined up to an integration constant, and therefore, without loss of generality, we may set g(ξ̄) = 0 to obtain the final approximate coordinate transformation.
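In one dimension the pieces above can be assembled explicitly. The sketch below (a toy log-normal model with assumed values σ_p = 3 and σ_n = 0.3, echoing the later illustration in Section 3.1.2) implements g(ξ; ξ̄) and verifies that its Jacobian at the expansion point squares to the metric there, as the zeroth order condition demands:

```python
import numpy as np

sigma_p, sigma_n = 3.0, 0.3

def x(xi):
    # noise-weighted forward model of the toy log-normal example
    return np.exp(sigma_p * xi) / sigma_n

def dx(xi):
    return sigma_p * np.exp(sigma_p * xi) / sigma_n

def metric(xi):
    # M(xi) = (dx/dxi)^2 + 1
    return dx(xi)**2 + 1.0

def g(xi, xi_bar):
    # 1-d version of the transformation: prior distance plus back-projected
    # likelihood distance, rescaled by the inverse square root of the metric
    # at the expansion point
    return ((xi - xi_bar) + dx(xi_bar) * (x(xi) - x(xi_bar))) / np.sqrt(metric(xi_bar))

xi_bar, eps = -0.6, 1e-5
# the Jacobian of g at the expansion point should square to the metric there
dg = (g(xi_bar + eps, xi_bar) - g(xi_bar - eps, xi_bar)) / (2.0 * eps)
```

By construction, g(ξ̄; ξ̄) = 0, and (∂g/∂ξ)² evaluated at ξ̄ reproduces M(ξ̄).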

Basic Properties
In order to study a few basic properties of this transformation, for simplicity we first consider a posterior distribution with a metric that allows for an exact isometry g_iso to Euclidean space. Specifically, let g_iso be a coordinate transformation satisfying Equation (13). The posterior distribution in coordinates y = g_iso(ξ) is given via the push-forward of P(ξ|d) through g_iso as

P(y|d) = (g_iso* P(ξ|d))(y) = P(ξ|d) |∂g_iso/∂ξ|^{−1} |_{ξ = g_iso^{−1}(y)} ,

and the information Hamiltonian takes the form

H(y|d) = H(ξ|d) + log |∂g_iso/∂ξ| |_{ξ = g_iso^{−1}(y)} + H_0 ,

where H_0 denotes y-independent contributions. We may study the curvature of the posterior in coordinates y, which we can use to construct a metric M(y) in analogy to Equation (10) by taking the expectation value of the curvature w.r.t. the likelihood. This yields a decomposition of the form M(y) = 1 + R(y). The first term yields the identity, as this is the defining property of g_iso. Furthermore, in case R(y) is small compared to the identity, we notice that the quantity M(ξ) (Equation (10)), which we referred to as the posterior metric, approximately transforms like a proper metric under g_iso. In this case we find that the isometry g_iso between the Riemannian manifold associated with M(ξ) and the Euclidean space is also a transformation that removes the complex geometry of the posterior. To further study R(y), we consider its two contributions separately. The first part, the log-determinant (or logarithmic volume) term, becomes small compared to the identity if the curvature of the log-determinant of the metric is much smaller than the metric itself (Equation (34)). To study the second term of R(y), we may again split the discussion into a prior and a likelihood dominated regime, depending on the ξ at which we evaluate the expression. In a prior dominated regime the metric is close to the identity (and therefore ξ ≈ y), such that the curvature of the prior Hamiltonian remains close to the identity in y as well. In a likelihood dominated regime we get that H ≈ H(d|ξ), and the remaining contribution from the prior gradient ∂H(ξ)/∂ξ is subdominant.

So at least in a prior dominated regime, as well as in a likelihood dominated regime, the posterior Hamiltonian transforms in an analogous way as the manifold under the transformation g_iso, provided Equation (34) also holds true.
For a practical application, however, in all but the simplest cases the isometry g_iso is not accessible, or might not even exist. Therefore, in general, we have to use the approximation g(ξ; ξ̄), as defined in Equation (29), instead. We may express the transformation of the metric using g(ξ; ξ̄), now with ξ = g^{−1}(y; ξ̄) and analogously for ξ′, where R̃ is defined by replacing g_iso with g in the entire expression of R. We notice that this transformation does not yield the identity, except when evaluated at the expansion point ξ = ξ̄. Therefore, in addition to the error R̃, there is a deviation from the identity related to the expansion error as one moves away from ξ̄. At this point we would like to emphasize that the posterior Hamiltonian H and the Riemannian manifold constructed from M are only loosely connected, due to the errors described by R̃ and the additional expansion error. They are arguably small in many cases and in the vicinity of ξ̄, but we do not want to claim that this correspondence is valid in general (see Section 5.3). Nevertheless, we find that in many cases this correspondence works well in practice. Some illustrative examples are given in Section 3.1.2.

Posterior Approximation
Utilizing the derived coordinate transformation for posterior approximation is mainly based on the idea that in the transformed coordinate system, the posterior takes a simpler form. In particular we aim to remove parts (if not most) of the complex geometry of the posterior, such that a simple probability distribution, e.g., a Gaussian distribution, yields a good approximation.

Direct Approximation
Assuming that all the errors discussed in the previous section are small enough, we may attempt to directly approximate the posterior distribution with a unit Gaussian in the coordinates y, as in this case the transformed metric M(y) is close to the identity. As the coordinate transformation g, defined via Equation (29), is only known up to an integration constant by construction, the posterior approximation is achieved by a shifted unit Gaussian in y. This shift needs to be determined, which we can do by maximizing the transformed posterior distribution w.r.t. y, where g^{−1}(y; ξ̄) denotes the inverse of g(ξ; ξ̄) w.r.t. its first argument. Equivalently, we can minimize the information Hamiltonian H(y|d). Minimizing H(y|d) yields the maximum a posteriori solution y*, which, in case the posterior is close to a unit Gaussian in the coordinates y, can be identified with the shift in y. As g is an invertible function, we may instead minimize H̃ w.r.t. ξ and apply g to the result in order to obtain y*. Therefore we can circumvent the inversion of g at any point during optimization. Now suppose that we use a gradient based optimization scheme to minimize for ξ, starting from some initial position ξ_0. If we set the expansion point ξ̄, used to construct g, to be equal to ξ_0, we notice that M̃(ξ_0) = M(ξ_0), as the expansion of the metric is valid to first order by construction. Therefore, if we set the expansion point ξ̄ to the current estimate of ξ after every step, we can replace the approximated metric M̃ with the true metric M and arrive at an optimization objective of the form

ξ̄ = argmin_ξ [ H(ξ|d) + ½ log det M(ξ) ] .   (43)

Note that g(ξ̄; ξ̄) = 0 by construction, and therefore y* = 0, as there is a degeneracy between a shift in y and a change of the expansion point ξ̄. Once the optimal expansion point ξ̄ is found, we directly retrieve a generative process to sample from our approximation to the posterior distribution.
Specifically, a posterior sample is generated as ξ = g^{−1}(y; ξ̄) with y drawn from a unit Gaussian, where g^{−1} is only implicitly defined via Equation (29), and therefore its inverse application has to be approximated numerically in general.
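For the one dimensional log-normal example of Section 3.1.2, the expansion-point objective of Equation (43) can be minimized directly. This sketch (assumed toy values σ_p = 3, σ_n = 0.3, d = 0.5; scipy's bounded scalar minimizer stands in for the NewtonCG scheme used in NIFTy) finds ξ̄ and sanity-checks it against a brute-force grid:

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigma_p, sigma_n, d = 3.0, 0.3, 0.5

def hamiltonian(xi):
    # H(xi|d) up to constants: Gaussian likelihood on f(xi) = exp(sigma_p xi),
    # plus the standard-normal prior term
    return 0.5 * (d - np.exp(sigma_p * xi))**2 / sigma_n**2 + 0.5 * xi**2

def log_det_metric(xi):
    # log M(xi) with M = (dx/dxi)^2 + 1 (a scalar log-determinant in 1-d)
    return np.log((sigma_p * np.exp(sigma_p * xi) / sigma_n)**2 + 1.0)

def objective(xi):
    # posterior Hamiltonian plus half the log-determinant of the metric
    return hamiltonian(xi) + 0.5 * log_det_metric(xi)

res = minimize_scalar(objective, bounds=(-3.0, 1.0), method="bounded")
xi_bar = res.x

# brute-force comparison on a grid over the same interval
grid = np.linspace(-3.0, 1.0, 401)
grid_min = np.min(objective(grid))
```

The log-determinant term pulls ξ̄ away from the plain MAP estimate toward regions of smaller metric volume.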

Numerical Approximation to Sampling
Recall that a posterior sample is given by ξ = g^{−1}(y; ξ̄) with y drawn from a unit Gaussian. To generate a posterior sample for ξ we therefore have to draw a random realization for y and solve Equation (46), y = g(ξ; ξ̄), for ξ. To avoid the matrix square root of M̄, we may instead define

g̃(ξ) ≡ √M̄ g(ξ; ξ̄) = (ξ − ξ̄) + (∂x/∂ξ |_{ξ̄})^T (x(ξ) − x(ξ̄)) ≡ z .   (47)

Sampling from P(z) is much more convenient than constructing the matrix square root, since z = √M̄ y implies ⟨z z^T⟩ = M̄ = 1 + (∂x/∂ξ |_{ξ̄})^T (∂x/∂ξ |_{ξ̄}), and therefore a random realization may be generated using

z = η₁ + (∂x/∂ξ |_{ξ̄})^T η₂ , with η₁ ← N(η₁; 0, 1) and η₂ ← N(η₂; 0, 1) .   (49)

Finally, a posterior sample ξ is retrieved by inversion of Equation (47). We numerically approximate the inversion by minimizing the squared difference between z and g̃(ξ). Specifically,

ξ = argmin_{ξ′} ‖z − g̃(ξ′)‖² .   (50)

Note that if g is invertible then also g̃ is invertible, as M̄ is a symmetric positive definite matrix. Therefore the quadratic form of Equation (50) has a unique global optimum at zero, which corresponds to the inverse of Equation (47).
In practice, this optimum is typically only reached approximately. For an efficient numerical approximation, throughout this work, we employ a second order quasi-Newton method, named NewtonCG [23], as implemented in the NIFTy framework. Within the NewtonCG algorithm, we utilize the metricM(ξ) as a positive-definite approximation to the curvature of the quadratic form in Equation (50). Furthermore, its inverse application, required for the second order optimization step of NewtonCG, is approximated with the conjugate gradient (CG) [24] method, which requires the metric to be only implicitly available via matrix-vector products. In addition, in practice we find that the initial position ξ 0 of the minimization procedure can be set to be equal to the prior realization η 1 used to construct z (Equation (49)) in order to improve convergence as ξ = η 1 is the solution of Equation (47) for all degrees of freedom unconstrained by the likelihood. Alternatively, for weakly non-linear problems, initializing ξ 0 as the solution of the linearized problem can significantly improve the convergence. The full realization of the sampling procedure is summarized in Algorithm 1.
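The sampling procedure can be sketched end-to-end in one dimension, where the inverse application of g̃ reduces to a scalar root find (scipy's brentq replaces the NewtonCG minimization of Equation (50); the model and the values σ_p = 3, σ_n = 0.3, ξ̄ = −0.2 are assumptions taken from the toy log-normal example):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
sigma_p, sigma_n = 3.0, 0.3
xi_bar = -0.2

def x(xi):
    # noise-weighted forward model of the toy log-normal example
    return np.exp(sigma_p * xi) / sigma_n

def dx(xi):
    return sigma_p * np.exp(sigma_p * xi) / sigma_n

dxb = dx(xi_bar)

def g_tilde(xi):
    # sqrt(M_bar) g(xi; xi_bar): the square-root-free map of Eq. (47)
    return (xi - xi_bar) + dxb * (x(xi) - x(xi_bar))

def draw_posterior_sample():
    # Eq. (49): z = eta1 + dx(xi_bar) eta2 has covariance 1 + dx^2 = M_bar
    eta1, eta2 = rng.standard_normal(2)
    z = eta1 + dxb * eta2
    # invert Eq. (47); monotonicity of g_tilde in 1-d admits a bracketed root find
    xi = brentq(lambda q: g_tilde(q) - z, -100.0, 10.0)
    return xi, z

xi_sample, z = draw_posterior_sample()
```

Starting the inversion from η₁, as described above, corresponds here to choosing a bracket around the prior realization; the residual of the recovered sample vanishes up to solver tolerance.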

Algorithm 1:
Approximate posterior samples using inverse transformation

Properties
We may qualitatively study some basic properties of the coordinate transformation and the associated approximation using illustrative one and two dimensional examples. To this end, consider a one dimensional log-normal prior model with zero mean and standard deviation σ_p of the form s = f(ξ) = e^{σ_p ξ}, from which we obtain a measurement d subject to independent, additive Gaussian noise with standard deviation σ_n, such that the likelihood takes the form P(d|ξ) = N(d; f(ξ), σ_n²). The posterior distribution is given as P(ξ|d) ∝ N(d; f(ξ), σ_n²) N(ξ; 0, 1) (Equation (54)), and its metric takes the form

M(ξ) = (σ_p e^{σ_p ξ} / σ_n)² + 1 .   (55)

In this one dimensional example we can construct the exact transformation g_iso that maps from ξ to the transformed coordinates y by integrating the square root of Equation (55) over ξ. The resulting transformation can be seen in the central panel of Figure 2, for an example with σ_p = 3, σ_n = 0.3, and measured data d = 0.5. In addition, we depict the approximated transformation g(ξ; ξ̄) for multiple expansion points ξ̄ ∈ {−1, −0.6, −0.2}. We see that the approximation quality depends on the choice of the expansion point ξ̄, as the approximation error is smallest in the vicinity of ξ̄. In order to transform the posterior distribution P (Equation (54)) into the new coordinate system, not all parts of the transformation are equally relevant, and therefore different expansion points result in more or less complex transformed distributions (see top panel of Figure 2). Finally, if we use a standard distribution in the transformed coordinates y and transform it back using the inverse transformations g^{−1}(y; ξ̄), we find that the approximation quality of the resulting distributions Q_ξ̄ depends on ξ̄. The distributions are illustrated in the left panel of Figure 2, together with the Kullback-Leibler divergence KL between the true posterior distribution P and the approximations Q_ξ̄.
We also illustrate the "geometrically optimal" approximation using a standard distribution in y and the optimal transformation g_iso, and find that while the approximation error becomes minimal in this case, it remains non-zero. Considering the discussion in Section 2.2, this result is to be expected due to the error contribution from the change in volume associated with the transformation g. As a comparison we also depict the optimal linear approximation of P, that is, a normal distribution in the coordinates ξ with optimally chosen mean and standard deviation. We see that even the worst expansion point ξ̄ = −0.2, which is far away from the optimum, still yields a better approximation of the posterior.
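The exact transformation g_iso for this example can be obtained numerically by integrating √M(ξ) with the trapezoidal rule (a sketch; the grid range and resolution are arbitrary choices):

```python
import numpy as np

sigma_p, sigma_n = 3.0, 0.3
xi = np.linspace(-2.0, 1.0, 3001)

def metric(q):
    # M(xi) = (sigma_p exp(sigma_p xi) / sigma_n)^2 + 1, as in Eq. (55)
    return (sigma_p * np.exp(sigma_p * q) / sigma_n)**2 + 1.0

# exact 1-d isometry: g_iso(xi) = cumulative integral of sqrt(M(xi))
sqrt_m = np.sqrt(metric(xi))
g_iso = np.concatenate(
    [[0.0], np.cumsum(0.5 * (sqrt_m[1:] + sqrt_m[:-1]) * np.diff(xi))]
)

# verify the defining property (dg_iso/dxi)^2 = M away from the boundary
dg = np.gradient(g_iso, xi)
rel_err = np.max(np.abs(dg[5:-5]**2 / metric(xi[5:-5]) - 1.0))
```

Since √M > 0, the resulting g_iso is strictly monotonic and hence invertible, as required of a coordinate transformation.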
As a second example we consider the task of inferring the mean m and variance v of a single, real valued Gaussian random variable d. In terms of s = (m, v), the likelihood takes the form P(d|s) = N(d; m, v). Furthermore, we assume a prior model for s by means of a generative model (Equation (57)), where ξ₁ and ξ₂ follow standard distributions a priori. This artificial model results in a linear prior correlation between the mean and the log-variance and thus introduces a non-linear coupling between m and v. The resulting two dimensional posterior distribution P(ξ₁, ξ₂) can be seen in the left panel of Figure 3, together with the two marginals P(ξ₁) and P(ξ₂), for a given measurement d = 0. We approximate this posterior distribution following the direct approach described in Section 3.1, where the expansion point ξ̄ is obtained from minimizing the sum of the posterior Hamiltonian and the log-determinant of the metric (see Equation (43)). The resulting approximative distribution Q_D is shown in the right panel of Figure 3, where the location of ξ̄ is indicated as a blue cross. In comparison to the true distribution, we see that both the joint distribution and the marginals are in good qualitative agreement, which is also supported quantitatively by a small KL divergence between P and Q_D (see Figure 3). The difference between P and Q_D appears to increase in regions further away from the expansion point, which is to be expected due to the local nature of the approximation. However, non-linear features such as the sharp peak at the "bottom" of P (Figure 3) are also present in Q_D, although slightly less prominent. This demonstrates that relevant non-linear structure can, to some degree, be captured by the coordinate transformation g derived from the metric M of the posterior.
The true posterior P(ξ|d), displayed as the black solid line in the left panel, is transformed into the coordinate system y using the optimal transformation g iso (blue), as well as three approximations g thereof with expansion pointsξ ∈ {−1, −0.6, −0.2} (orange, green, red). The resulting distributions P(y|d) are displayed in the top panel of the figure as solid lines, color coded according the used transformation g (or g iso in case of blue). The black, dashed line in the top panel displays a standard distribution in y. The location of the expansion pointξ, and its associated point in y, is highlighted via the color coded, dotted lines. Finally, the direct approximations to the posterior associated with the transformations, meaning the push-forwards of the standard distribution in y using the inverse of the various transformations g −1 , are displayed in the left panel as dashed lines, color coded according to their used transformation. As a comparison, the "optimal linear approximation" (black dotted line in the central panel), which corresponds to the optimal approximation of the posterior with a normal distribution in ξ (black dotted line in left panel), is displayed as a comparison. To numerically quantify the information distance between the true distribution P and its approximations Q • , the Kullback-Leibler (KL) divergences between P and Q • are displayed in the top left of the image. The numerical values of the KL are given in nats (meaning the KL is evaluated in the basis of the natural logarithm).  (56) and (57)). The central panel shows the two dimensional density and the red dashed lines are logarithmically spaced contours. The top and left sub-panels display the marginal posterior distributions for ξ 1 and ξ 2 , respectively. Right: Approximation Q D to the posterior distribution using the direct method (Section 3.1). 
As a comparison, the contours (red dashed) and the marginal distributions (red solid) of the true posterior distribution P are displayed in addition to the approximation. The blue cross in the central panel marks the location of the expansion point used to construct Q_D. Above the panel we display the optimal (KL(P; Q_D)) and variational (KL(Q_D; P)) Kullback-Leibler divergences between P and Q_D.
Although these low-dimensional, illustrative examples appear promising, one central issue remains to be addressed before the approach can be applied to high-dimensional problems. In particular, the direct approach carries a substantial additional computational burden compared to, e.g., a maximum a posteriori (MAP) estimate in ξ, which is obtained by minimizing the posterior Hamiltonian H. For the direct approach, the optimization objective (Equation (43)) consists not only of H, but also of the log-determinant of the metric M. In all but the simplest examples this term cannot be computed directly but has to be approximated numerically, as in high dimensions an explicit representation of the matrix becomes infeasible and M is only implicitly accessible through matrix vector products (MVPs). There are a variety of stochastic log-determinant (more specifically trace-log) estimators based on combining Hutchinson's trace estimation [25] with approximations to the matrix logarithm using, e.g., Chebyshev polynomials [26], Krylov subspace methods [27], or moment constrained estimation based on maximum entropy [28]. While all these methods provide a significant improvement in performance compared to directly computing the determinant, they nevertheless typically require many MVPs in order to yield an accurate estimate. For large and complex problems, evaluating an MVP of M is dominated by applying the Jacobian of x, more precisely of the generative process s(ξ), and its adjoint to a vector. Similarly, evaluating the gradient of H is also dominated by an MVP that invokes applying the adjoint Jacobian of s(ξ). Therefore the computational overhead compared to a MAP estimate in ξ is, roughly, multiplicative in the number of MVPs. For large, non-linear problems, this quickly becomes infeasible, as non-linear optimization typically requires many steps to reach a sensible approximation to the optimum.
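The trace-estimation building block of these estimators can be illustrated in a few lines. The following sketch (hypothetical names; a small dense matrix stands in for the implicitly defined M) demonstrates Hutchinson's estimator tr(M) = E[z^T M z] with Rademacher probes, accessing M only through MVPs. A full trace-log estimator would additionally approximate log(M) by, e.g., a Chebyshev polynomial applied through repeated MVPs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# SPD stand-in for the metric M; in practice M is never built explicitly
A = rng.standard_normal((n, n))
M_mat = np.eye(n) + A @ A.T / n

def mvp(v):
    # the only access to M that a large-scale implementation would have
    return M_mat @ v

def hutchinson_trace(mvp, dim, num_probes=500, rng=rng):
    """Estimate tr(M) = E[z^T M z] using Rademacher probe vectors z."""
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ mvp(z)
    return total / num_probes

exact = np.trace(M_mat)
estimate = hutchinson_trace(mvp, n)
```

Each probe costs one MVP, which is why the overall overhead of the direct approach is multiplicative in the number of probes.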
Nevertheless there remain some important exceptions, in which a fast and scalable algorithm emerges. In particular recall that det(M) = det(1_ξ + J† M_x J) = det(1_x + M_x J J†), where J denotes the Jacobian of x w.r.t. ξ and the last equality arises from applying the matrix determinant lemma. Therefore, in cases where the dimensionality of the so-called data space (i.e., the target space of x) is much smaller than the dimensionality of the signal space (the domain of ξ), the latter representation of the metric is of much smaller dimension. Thus, in cases where either the signal- or the data-space is small, or in weakly non-linear cases (i.e., if M is close to the identity), the log-determinant may be approximated efficiently enough to give rise to a fast and scalable algorithm. For the (arguably most interesting) class of problems where neither of these assumptions is valid, however, the direct approach to obtain the optimal expansion point becomes too expensive for practical purposes, as none of the log-determinant estimators scale linearly with the size of the problem in general.
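This dimensionality reduction can be checked numerically. The sketch below (with the likelihood metric M_x set to the identity for simplicity, and a random stand-in Jacobian) verifies that the log-determinant over a large signal space equals that of a tiny data-space matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n_signal, n_data = 1000, 5   # signal space much larger than data space
J = rng.standard_normal((n_data, n_signal))  # stand-in Jacobian of x w.r.t. xi

# log det over the large signal space (M_x = identity here for simplicity) ...
logdet_signal = np.linalg.slogdet(np.eye(n_signal) + J.T @ J)[1]
# ... equals the log det of a small data-space matrix (matrix determinant lemma)
logdet_data = np.linalg.slogdet(np.eye(n_data) + J @ J.T)[1]
```

When n_data is small, the second form is computable exactly at negligible cost, which is the scenario in which the direct approach remains feasible.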

Geometric Variational Inference (geoVI)
As we shall see, it is possible to circumvent the need to compute the log-determinant of the metric at any point, if we employ a specific variant of a variational approximation to obtain the optimal expansion point. To this end, we start with a variational approximation to the posterior P, assuming that the approximating distribution Q̃ is given as the unit Gaussian in y transformed via g. Let Q̃(ξ|ξ̄) denote this approximation to the posterior, conditional on the expansion point ξ̄. The variationally optimal ξ̄ can be found by minimizing the forward Kullback-Leibler divergence between Q̃ and P,

KL(Q̃; P) = KL_0 + ⟨H(ξ|d) − H_Q̃(ξ|ξ̄)⟩_{Q̃(ξ|ξ̄)} ,

where KL_0 denotes contributions independent of ξ̄, and H(ξ|d) and H_Q̃ denote the Hamiltonians of the posterior and the approximation, respectively. We notice that in this form, a minimization of the KL w.r.t. ξ̄ does not circumvent a computation of the log-determinant of the metric. Within the KL, this term arises from the entropy of the approximation Q̃, and can be understood as a measure of the volume associated with the distribution. In order to avoid this term, our idea is to propose an alternative family of distributions Q_m(ξ|ξ̄), defined as a shifted version of Q̃. Specifically we let ξ → m + (ξ − ξ̄), such that the distribution may be written as

Q_m(ξ|ξ̄) = Q(r|ξ̄) with r = ξ − m ,

where we also introduced the residual r, which measures the deviations from m, and the associated distribution Q(r|ξ̄). In words, Q_m(ξ|ξ̄) is the distribution using the residual statistics of r around an expansion point ξ̄, but shifted to m. One can easily verify that the entropy related to Q_m is independent of m, as shifts are volume-preserving transformations. Therefore we may use some fixed expansion point ξ̄, and find the optimal shift m using the KL, which now may be written as

KL(Q_m; P) ≐ ⟨H(m + r|d)⟩_{Q(r|ξ̄)} ,

where ≐ denotes equality up to m-independent contributions. After optimization for m, we can update to a new expansion point, and use it to define a new family of distributions Q_m which is a more appropriate class of approximations.
In general, the expectation value in the KL cannot be computed analytically, but it can be approximated using a set of N samples {r*_i}_{i∈{1,...,N}} drawn from Q(r|ξ̄), which yields

KL(Q_m; P) ≐ (1/N) Σ_{i=1}^N H(m + r*_i|d) .

Sampling from Q(r|ξ̄) is defined as in Section 3.1.1, where the sampling procedure for Q̃(ξ|ξ̄) is described, with the addition that a sample r* is obtained from a sample ξ* as r* = ξ* − ξ̄.
Optimizing the KL w.r.t. m yields the variational optimum for the distribution Q_m(ξ|ξ̄), given a fixed, predetermined expansion point ξ̄. In order to move the expansion point ξ̄ towards the optimal point, its location is updated subsequently and the KL is re-estimated using novel samples from Q(r|ξ̄) with an updated ξ̄. Specifically, we initialize the optimization algorithm at some position m_0, set ξ̄ = m_0 to obtain a set of samples {r*_i^(0)}_{i∈{1,...,N}}, and use this set to approximate the KL. This approximation is then used to obtain an optimal shift m_1. Given this optimal shift, a new expansion point ξ̄ = m_1 is defined and used to obtain a novel set of samples {r*_i^(1)}_{i∈{1,...,N}}, which defines a new estimate for the KL. This estimate is in turn used to obtain a novel optimal m, and so on. An illustrative view of this procedure is given in Figure 4. Finally, the entire procedure of optimizing the KL for m and re-estimating the KL via a novel expansion point is repeated until the algorithm converges to an optimal point m* = ξ̄*. To optimize the KL for m, we again employ the Newton-CG algorithm, and use the sample average of the metric M as a proxy for the curvature of the KL to perform the optimization step. Specifically we use

(1/N) Σ_{i=1}^N M(m + r*_i)

as the metric of the KL. We call this algorithm the geometric Variational Inference (geoVI) method. A pseudo-code summary of geoVI is given in Algorithm 2. It is noteworthy that, as described in Section 3.1.1, an implementation of the proposed sampling procedure for the residual r, and as a result also of the geoVI method itself, inevitably relies on numerical approximations to realize a sample for r. To better understand the impact of such approximations, we have to consider their effect on the distribution Q(r|ξ̄). To this end, we denote with f the function that, given the expansion point ξ̄, turns two standard distributed random vectors η_1 and η_2 into a random realization of r.
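The alternating scheme above can be condensed into a short sketch. All concrete choices below (the 1-D Hamiltonian, the metric, and the Gaussian residual sampler, which is an MGVI-like simplification of the non-linear geoVI sampler of Section 3.1.1) are hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# hypothetical 1-D toy posterior Hamiltonian H and a positive stand-in metric M
H = lambda xi: 0.5 * xi**2 + 0.1 * xi**4 - 0.8 * xi
M = lambda xi: 1.0 + 1.2 * xi**2

N = 64        # samples per KL estimate
xi_bar = 0.0  # expansion point, initialized at m_0
for _ in range(20):
    # residual samples with statistics fixed at the current expansion point
    # (Gaussian here, i.e. a linearized simplification of the geoVI sampler)
    r = rng.standard_normal(N) / np.sqrt(M(xi_bar))
    # sampled KL up to m-independent terms: average of H(m + r_i)
    kl = lambda m: np.mean(H(m + r))
    m_opt = minimize(kl, xi_bar).x[0]
    xi_bar = m_opt  # the optimal shift becomes the new expansion point
```

Note that no log-determinant of M appears anywhere in the loop; the volume term has been traded for re-sampling of the residuals at each updated expansion point.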
Specifically, r = f(η_1, η_2; ξ̄), where the functional form of f is defined by the combination of Equations (49) and (50). Using f we may write the geoVI distribution Q as

Q(r|ξ̄) = ∫ δ(r − f(η_1, η_2; ξ̄)) N(η_1; 0, 1) N(η_2; 0, 1) dη_1 dη_2 .
Any numerical algorithm used to approximate the sampling, irrespective of its exact form, may be described by replacing the function f, which leads to exact sampling from Q, with some approximation f̄, which in turn leads to an approximation of the distribution for r that we denote as Q̄(r|ξ̄). Therefore, in a way, the geoVI result obtained using a numerical approximation for sampling can be understood as the variational optimum chosen from the family of distributions Q̄, rather than Q. Thus, even for a non-zero approximation error in f̄, the result remains a valid optimum of a variational approximation; it is simply the family of distributions used for the approximation that has changed. This finding is of great relevance in practice, as there is typically a trade-off between the numerical accuracy of the generated samples and the computational effort. We may therefore achieve faster convergence at a cost of accuracy in the approximation, but without completely detaching from the theoretical optimum, so long as f̄ remains sufficiently close to f. Nevertheless, as motivated in the introduction, it is important for the chosen family to contain distributions close to the true posterior, and therefore it remains important that the family Q̄ stays close to the family Q, as only for Q has the geometric correspondence to the posterior been established. A detailed study to further quantify this result is left to future work.

MGVI as a First Order Approximation
We can compare the geoVI algorithm to the aforementioned variational approximation technique called Metric Gaussian Variational Inference (MGVI), and notice some key similarities. In particular the optimization heuristic with repeated alternation between sampling of r* and optimization for m is entirely equivalent. The difference lies in the distribution Q(r|ξ̄) used for the approximation. In MGVI, Q is assumed to be a Gaussian distribution in r, as opposed to the Gaussian distribution in the transformed space y used in geoVI. Specifically

Q_MGVI(r|ξ̄) = N(r; 0, M(ξ̄)^{-1}) ,

where the inverse of the posterior metric M, evaluated at the expansion point ξ̄, is used as the covariance. As it turns out, the distribution Q_MGVI arises naturally as a first order approximation to the coordinate transformation used in the geoVI approach. Specifically, if we consider the geoVI distribution of r given in terms of a generative process

r = g^{-1}(y; ξ̄) − ξ̄ with y ∼ N(y; 0, 1) ,

and expand it around y = 0 to first order, we get that

r ≈ (∂g^{-1}(y; ξ̄)/∂y)|_{y=0} y = M(ξ̄)^{-1/2} y .

Therefore, to first order in y, r is distributed according to Q_MGVI(r|ξ̄). This correspondence shows that geoVI is a generalization of MGVI to non-linear cases. This is a welcome result, as numerous practical applications [29][30][31] have shown that already MGVI provides a sensible approximation to the posterior distribution. On the other hand, it provides further insight into the cases in which the MGVI approximation remains valid, and when it reaches its limitations. In particular, if

M(m + r) ≈ M(ξ̄) ∀ r = g^{-1}(y; ξ̄) − ξ̄ with y ∼ N(y; 0, 1) , (71)

the first order approximation of Equation (69) yields a close approximation of the inverse and geoVI reduces to the MGVI algorithm. In contrast, geoVI with its non-linear inversion only requires the log-determinant of the metric M to be approximately constant throughout the sampling regime. This is a much less restrictive requirement than Equation (71), as the variation of the eigenvalues of M is considered on a logarithmic scale, whereas it is considered on a linear scale in Equation (71).
Furthermore, the log-determinant is invariant under unitary transformations, which means that local rotations of the metric, and therefore changes in orientation as we move along the manifold, can be captured by the non-linear approach, whereas Equation (71) does not hold anymore if the orientation varies as a function of r. Therefore we expect the proposed approach to be applicable in a more general context, while still retaining the MGVI properties, as it reproduces MGVI in the linear limit.
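For a 1-D toy metric the first-order correspondence can be checked directly, since the isometry g is then available in closed form (the metric M(ξ) = exp(ξ) below is a hypothetical choice, not taken from the paper):

```python
import numpy as np

# hypothetical 1-D metric M(xi) = exp(xi); the exact isometry g follows from
# dy/dxi = sqrt(M(xi)), normalized such that g(xi_bar) = 0
xi_bar = 0.3
M = lambda xi: np.exp(xi)
g = lambda xi: 2.0 * (np.exp(xi / 2) - np.exp(xi_bar / 2))
g_inv = lambda y: 2.0 * np.log(y / 2 + np.exp(xi_bar / 2))

y = 1e-4  # a small standard coordinate
r_exact = g_inv(y) - xi_bar        # non-linear (geoVI) residual
r_linear = y / np.sqrt(M(xi_bar))  # first-order (MGVI) residual M^{-1/2} y
```

For small y the two residuals agree, while for large y the non-linear inverse departs from the linearization, which is exactly the regime in which geoVI improves on MGVI.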

Examples
We can visually compare the geoVI and the MGVI algorithm using the two-dimensional example previously introduced in Section 3.1.2. In analogy to Figure 3 we depict the approximation to the posterior density together with its two marginals in Figure 5. We see that geoVI yields a result similar to the direct approach here, while providing a significant improvement over the approximation capacity of MGVI. To conclude the illustrative examples, we consider a single observation of the product of a normal and a log-normal distributed quantity subject to independent, additive Gaussian noise. The full model consists of a likelihood and a prior of the form

P(d|ξ) = N(d; e^{ξ1} ξ2, σ_n²) and P(ξ) = N(ξ; 0, 1) .

This example should serve as an illustration of the challenges that arise when attempting a separation of non-linearly coupled quantities from a single observation. Such separation problems reappear in Section 4 in much more intertwined and high-dimensional examples, but much of the structural challenge can already be seen in this simple two-dimensional problem. Figure 6 displays the results of the direct approach as well as the geoVI and MGVI methods for a measurement setting of d = −0.3 and σ_n = 0.1. As a comparison, we also depict the results of a variational approximation using a normal distribution with a diagonal covariance, known as a mean-field approximation (MFVI), as well as an approximation with a normal distribution using a full-rank matrix as its covariance (FCVI). Both the diagonal and the full-rank covariance are considered parameters of the distribution, and have to be optimized for in addition to the mean of the normal distribution. An efficient implementation thereof is described in [7]. We notice that both the direct and the geoVI approach manage to approximate the true posterior distribution well, although the KL values indicate that the geoVI approximation is worse by ≈0.016 nats compared to the direct approach.
Here the passive update of the expansion point used in this approach reaches its limitations: in cases where the posterior distribution becomes increasingly narrow towards the optimal expansion point, the static sample statistics of r can get stuck during optimization, and increasingly frequent re-sampling becomes necessary as one moves closer to the optimum. Nevertheless, the geoVI approximation remains a good approximation to the true distribution, especially when compared to the approaches based on a normal distribution such as MGVI, MFVI, and FCVI.

Applications
To investigate the performance of the geoVI algorithm in high-dimensional imaging problems, we apply it to two mock data examples and compare it to the results using MGVI. In the first example, which serves as an illustration, the geoVI results are additionally compared to the results obtained from applying a Hamiltonian Monte-Carlo (HMC) sampler [12] to the mock example (see Section 5.2 for further information on HMC). The second example is an illustration of a typical problem encountered in astrophysical imaging. Both examples consist of hierarchical Bayesian models with multiple layers which are represented as a generative process. Particularly important for the class of problems at hand are statistically homogeneous and isotropic Gaussian processes with unknown power spectral density, for which a flexible generative model has been presented in [32]. This process is at the core of a variety of astrophysical imaging applications [32][33][34][35], and therefore an accurate posterior approximation of problems involving this model is crucial. To better understand the inference challenges that arise in problems using this particular model, we briefly summarize some of its key properties.

Gaussian Processes with Unknown Power Spectra
Consider a zero mean, square integrable random process s_x defined on an L-dimensional space subject to periodic boundary conditions which, for simplicity, we assume to have size one. Specifically let x ∈ Λ = [0, 1]^L and thus s ∈ L²(Λ). A Gaussian process with mean zero and covariance function S_xy is said to be statistically homogeneous and isotropic if S is a function of the Euclidean distance between two points only, i.e., S_xy = S(|x − y|). Furthermore, as implied by the Wiener-Khinchin theorem [36], the linear operator associated with S becomes diagonal in the Fourier space associated with Λ, and therefore s may be represented in terms of a Fourier series with coefficients s̃_k, where k labels the Fourier modes. These coefficients are independent, zero mean Gaussian random variables with variance ⟨|s̃_k|²⟩ = P_s(k), where P_s is known as the power spectrum of s. As P_s encodes the correlation structure of s, its functional form is crucial for the prior statistical properties of s. In [32], a flexible, non-parametric prior process for the power spectrum has been proposed by means of a Gauss-Markov process on log-log scale. This process models the spectrum as a straight line on log-log scale (resulting in a power law in |k| on linear scale) with possible continuous deviations thereof. These deviations are themselves defined as a Gauss-Markov process (specifically an integrated Wiener process), and their variance is one of several additional scalar parameters steering the properties of this prior process, which are also considered random variables that have to be inferred. These parameters are summarized in Table 1. A more formal derivation of this model in terms of a generative process relating standard distributed random variables ξ_p to a random realization P_s(ξ_p) of this prior model is given in Appendix B.
In order to use this prior within a larger inference model, the underlying space has to be discretized such that the solution of the resulting discrete problem remains consistent with the continuum. We achieve this discretization by means of a truncated Fourier series for s, such that s may be written as s = F† s̃ (with s̃ = F s), where F denotes a discrete Fourier transformation (DFT) and F† its back-transformation.
If we additionally evaluate s on a regular grid on Λ, we can replace the DFT with a fast Fourier transformation (FFT), which is numerically more efficient. For a detailed description of how the spatial discretization is constructed, please refer to [37,38]. In this work, however, we are primarily interested in evaluating the approximation quality of the proposed algorithm geoVI, and therefore, from now on, we regard all inference problems involving this random process as high, but finite, dimensional Bayesian inference problems and ignore the fact that they were constructed from a corresponding continuous, infinite dimensional, inference problem.
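A minimal sketch of drawing such a process on a regular grid, assuming a fixed (hypothetical) power-law spectrum instead of the non-parametric prior of [32], and ignoring normalization conventions and the reality condition of the Fourier coefficients (taking the real part is a common shortcut):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256  # grid points discretizing Lambda = [0, 1)

# hypothetical power-law power spectrum P_s(k), not the prior model of [32]
k = np.fft.fftfreq(n, d=1.0 / n)  # integer Fourier modes
P_s = (1.0 + np.abs(k)) ** -3.0

# independent, zero-mean Gaussian Fourier coefficients with variance P_s(k),
# transformed back to position space via the inverse FFT
s_tilde = np.sqrt(P_s) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
s = np.fft.ifft(s_tilde).real * np.sqrt(n)
```

In the inference problems below, both the field s and the spectrum P_s (through its generative parameters) are unknown and inferred jointly, which is what couples the parameters non-linearly.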

Log-normal Process with Noise Estimation
As a first example we consider a log-normal process e^s, defined over a one-dimensional space, with s a priori distributed according to the aforementioned Gaussian process prior with unknown power spectrum. The observed data d (see top panel of Figure 7) consist of a partially masked realization of this process subject to additive Gaussian noise with standard deviation σ_n. In addition to s and its power spectrum P_s, we also assume σ_n to be unknown prior to the observation and place a log-normal prior on it. Therefore the corresponding likelihood takes the form P(d|s, σ_n) = N(d; R e^s, σ_n² 1), where R masks the unobserved region. We apply the geoVI algorithm (Figure 7), the MGVI algorithm (Figure 8), and an HMC sampler (Figure 9) to this problem and construct a set of 3000 approximate posterior samples for all methods. The HMC results serve as the true reference here, as the true posterior distribution is too high dimensional to be accessible directly and HMC is known to reproduce the true posterior in the limit of infinitely many samples. Considering solely the reconstruction of e^s, we see that both methods, geoVI and MGVI, agree with the true signal largely within their corresponding uncertainties. Overall we find that the geoVI solution is slightly closer to the ground truth compared to MGVI and the posterior uncertainty is smaller for geoVI in most regions, with the exception of the unobserved region, where it is larger compared to MGVI (see residual plots of Figures 7 and 8). In this region MGVI appears to slightly underestimate the posterior uncertainty. In addition, in the bottom panels of Figures 7 and 8, we depict the posterior distribution of the noise standard deviation σ_n as well as the posterior mean of the power spectrum P_s, together with corresponding posterior samples. Here the difference between geoVI and MGVI becomes more clearly visible, which, to some degree, is to be expected due to the more non-linear coupling of σ_n and P_s to the data compared to e^s.
Indeed we find that the posterior distribution of σ_n recovered using MGVI overestimates the noise level of the reconstruction. The geoVI algorithm also slightly overestimates σ_n; however, for geoVI the posterior yields σ_n^geoVI = 0.220 ± 0.026, which places the true value of σ_n = 0.2 approximately 0.8 sigma away from the posterior mean. In contrast, for MGVI, we get σ_n^MGVI = 0.233 ± 0.011, for which the ground truth is almost a 3-sigma event. Considering the HMC results (bottom panel of Figure 9), the geoVI results appear to be closer to the HMC distribution compared to MGVI, although the HMC distribution for σ_n is broader and even closer to the true value than geoVI. In addition we find that the overall reconstruction quality of the power spectrum P_s is significantly increased when moving from MGVI to geoVI. While MGVI manages to recover the overall slope of the power law, it fails to reconstruct the deviations from this power law as well as the overall statistical properties of P_s as encoded in the parameters of Table 1. In contrast, the geoVI algorithm is able to pick up some of these features and recovers posterior statistical properties of the power spectrum similar to the ground truth. In addition the posterior uncertainty appears to be on a reasonable scale, as opposed to the MGVI reconstruction, which significantly underestimates the posterior uncertainty. The structures on the smallest scales (largest values of |k|), however, appear to be underestimated by the geoVI mean, although the posterior uncertainty increases significantly in these regimes. In comparison to HMC we find that the results are in agreement on the large scales, although the geoVI uncertainties appear to be slightly larger. On small scales, the methods deviate more strongly, and the underestimation of the spectrum seen for geoVI is absent in the HMC results.

Figure 8. Same setup as in Figure 7, but for the approximation using the MGVI algorithm.
Figure 9. Same setup as in Figure 7, but for the approximation using HMC sampling.
To further study the posterior distribution of the various scalar parameters that enter the power spectrum model (see Table 1), as well as the noise standard deviation σ_n, we depict the reconstructed marginal posterior distributions for all pairs of inferred scalar parameters. Figures 10-12 show the posterior distributions recovered using geoVI, MGVI, and HMC, respectively. All parameters are displayed in their corresponding standard coordinate system, i.e., they all follow a zero-mean, unit-variance normal distribution prior to the measurement. From an inference perspective, some of these parameters are very challenging to reconstruct, as their coupling to the observed data is highly non-linear and influenced by many other parameters of the model. In turn, their values are highly influential on the statistical properties of more interpretable quantities such as the observed signal e^s and its spectrum P_s. We see that despite these challenges the geoVI posterior appears to give reasonable results that are largely in agreement with the ground truth, within uncertainties. Thus, the algorithm appears to be able to pick up parts of the non-linear structure of the posterior, which is validated by comparison to the MGVI algorithm, as for these parameters the MGVI reconstruction (Figure 11) does not yield reliable results anymore. This is to be expected in case of significant non-linearity, as MGVI is the first order approximation of geoVI. In comparison to HMC (Figure 12), however, we find that there remain some differences in the recovered posterior distributions. The HMC results regarding the "fluctuations" and "noise std." parameters are more centered on the ground truth, and in particular the posterior distribution of the "slope" parameter is significantly different and more constrained, compared to the geoVI results.
These differences indicate that there remain some limitations to the recovered geoVI results in the regime of highly non-linear parameters of the model which we may associate to the theoretical limitations discussed in Section 2.2.

Separation of Diffuse Emission from Point Sources
In a second inference problem we consider the imaging task of separating diffuse, spatially extended and correlated emission e^s from bright, but uncorrelated, point sources p in an image. This problem is often encountered in astrophysical imaging [39], where the goal is to recover the emission of spatially extended structures such as gas or galactic dust. This emission usually gets superimposed by the bright emission of stars (point sources) in the image plane, and only their joint emission can be observed. In this example we assume that the emission is observed through a detection device that convolves the incoming emission with a spherically symmetric point spread function R and ultimately measures photon counts on a pixelated grid. Specifically we may define a Poisson process with count rate

λ_i = (R(e^s + p))_i ,

where i labels the pixels of the detector. We assume the diffuse emission to follow a statistically homogeneous and isotropic log-normal distribution with unknown prior power spectrum. Thus, s is again distributed according to the prior process given in Section 4.1. The point sources follow an inverse-gamma distribution at every point (x, y) of the image plane, given as

P(p_xy) = (q^α / Γ(α)) p_xy^{−α−1} e^{−q/p_xy} ,

where in this particular example we used (α, q) = (2, 3). A visualization of the problem setup with the various stages of the observation process is given in Figure 13.

Figure 10. Posterior distributions of the scalar parameters of the power spectrum model (Table 1), and the noise standard deviation. All parameters, including the noise parameter, are given in their corresponding prior standard coordinate system, i.e., they have a normal distribution with zero mean and variance one as a prior distribution. Each square panel corresponds to the joint posterior of the parameters in the respective row and column. In addition, for each row and each column the one-dimensional marginal posteriors of the corresponding parameter are displayed as blue lines. The red lines in the 1-D, and the red dots in the 2-D plots, denote the ground truth values used to realize the spectrum P_s, the signal e^s, and finally the observed data d.

Figure 11. Same setup as in Figure 10, but for the approximation using the MGVI algorithm.

Figure 12. Same setup as in Figure 10, but for the approximation using HMC sampling.
We employ the geoVI and MGVI algorithms to infer all (a priori standard distributed) degrees of freedom of the model and recover the power spectrum P_s of the diffuse emission together with its hyper-parameters, the realized emission e^s, and the point sources p, from the Poisson count data d. The reconstructed two-dimensional images of p and e^s are displayed in Figure 14 together with the recovered count rate λ, and compared to their respective ground truth. We find that in this example there is barely a visible difference between the reconstructed diffuse emission of MGVI and geoVI. Both reconstructions are in good agreement with the ground truth. For the point sources, we find that the brightest sources are well recovered by both algorithms, while geoVI manages to infer a few more of the fainter sources than MGVI. Nevertheless, for both algorithms, the posterior mean does not recover very faint sources present in the true source distribution. This can also be seen in Figure 15, where we plot the per-pixel flux values for all locations in the image against their reconstructed values, for both the diffuse emission and the point sources. We find that the MGVI and geoVI results are, on average, in very good agreement. It is noteworthy that the deviations between the true and the reconstructed flux values increase towards smaller values, which is to be expected due to the larger impact of the Poisson noise. For the spatially independent point sources, there appears to be a transition regime around a flux of ≈50, below which point sources become barely detectable. All in all, for the diffuse emission as well as the point sources, both reconstruction methods yield similar results, consistent with the ground truth. In addition, in Figure 16, we depict the inferred power spectra. We find that the overall shape is reconstructed well by both algorithms, but smaller, more detailed features can only be recovered using geoVI.
In addition we find that the statistical properties of the spectrum, as indicated by, e.g., the roughness of the spectrum as a function of the Fourier modes |k|, are well reconstructed by geoVI and in agreement with the true spectrum, whereas the MGVI reconstruction, including the posterior samples, appears to be systematically too smooth compared to the ground truth. As discussed in more detail in the previous example, the parameters that determine these properties of the spectrum are coupled to the observed data in a highly non-linear way, and therefore the linear approximation used in MGVI becomes, at some point, invalid.
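The generative structure of this example (correlated log-normal diffuse emission, inverse-gamma point sources, a spherically symmetric PSF, and Poisson counts) can be sketched as follows; the smoothed Gaussian field and the Gaussian blur are simplified stand-ins, not the paper's model, for the correlated prior of Section 4.1 and the response R:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(2)
shape = (64, 64)

# diffuse log-normal emission; a smoothed Gaussian field stands in for the
# correlated prior process with unknown power spectrum
s = gaussian_filter(rng.standard_normal(shape), sigma=4)
diffuse = np.exp(s)

# inverse-gamma point sources with (alpha, q) = (2, 3), as in the text:
# X ~ InvGamma(alpha, q) iff q / X ~ Gamma(alpha, 1)
alpha, q = 2.0, 3.0
points = q / rng.gamma(alpha, 1.0, size=shape)

# spherically symmetric PSF (here a Gaussian blur) and Poisson photon counts
lam = gaussian_filter(diffuse + points, sigma=1.5)
d = rng.poisson(lam)
```

The heavy tail of the inverse-gamma prior produces a few very bright pixels on top of the smooth background, which is what makes the separation problem well posed despite a single observation.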

Further Properties and Challenges
Aside from the apparent capacity to approximate non-linear and high-dimensional posterior distributions, there are some further properties that can be derived from geoVI and the associated coordinate transformation. In the following, we show how to obtain a lower bound on the evidence using the geoVI results. Furthermore, we outline a way to utilize the coordinate transformation in the context of Hamiltonian Monte-Carlo (HMC) sampling. Finally, some limitations remain to the approximation capacity of geoVI in its current form, which are discussed in Section 5.3.

Evidence Lower Bound (ELBO)
With the results of the variational approximation at hand, we can provide an evidence lower bound (ELBO). To this end consider the Hamiltonian H(ξ|d) of the posterior, which takes the form

H(ξ|d) = H(d, ξ) − H(d) ,

and the Hamiltonian of the approximation H_Q as a function of r, given as

H_Q(r|ξ̄) = ½ g(ξ̄ + r; ξ̄)^T g(ξ̄ + r; ξ̄) − ½ log(det(M̄(ξ̄ + r))) + c ,

with c an r-independent constant. Using these Hamiltonians, we may write the variational KL as

KL(Q; P) = ⟨H(ξ̄ + r|d) − H_Q(r|ξ̄)⟩_{Q(r|ξ̄)} .

As H(d) = −log(P(d)), we can derive a lower bound for the logarithmic evidence using the KL as

log(P(d)) ≥ log(P(d)) − KL(Q; P) ,

where the lower bound becomes maximal if the KL becomes minimal. Thus, we may use our final expansion point ξ̄ obtained from minimizing the KL together with the Hamiltonians (Equations (80) and (81)) to arrive at

log(P(d)) − KL(Q; P) ≈ (1/N) Σ_{i=1}^N [H_Q(r*_i|ξ̄) − H(d, ξ̄ + r*_i)] ,

where {r*_i}_{i∈{1,...,N}} is a set of samples drawn from the approximation Q(r|ξ̄). Under the assumption that the log-determinant of M̄ is approximately constant throughout the typical set reached by Q, we may replace its sample average with the value obtained at ξ̄ to arrive at

log(P(d)) − KL(Q; P) ≈ (1/N) Σ_{i=1}^N [½ g(ξ̄ + r*_i; ξ̄)^T g(ξ̄ + r*_i; ξ̄) − H(d, ξ̄ + r*_i)] − ½ log(det(M(ξ̄))) + c ,

where we also replaced the metric of the expansion M̄ with the metric of the posterior M, as they are identical when evaluated at the expansion point ξ̄ (see Equation (41)). The assumption that the log-determinant does not vary significantly within the typical set is also a requirement for the approximation Q to be a close match to the posterior P, and in turn it is a necessary condition for the ELBO to be a tight lower bound on the evidence, as only in this case can the KL become small. Therefore Equation (85) is a justified simplification whenever the variational approximation itself is justified. Nevertheless, it may be useful to compute the log-determinant also for the posterior samples if feasible, as it provides a valuable consistency check for the method itself.
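As a sanity check of the sampled ELBO expression, consider a conjugate 1-D toy model (hypothetical, not from the paper) in which prior, likelihood, posterior, and evidence are all Gaussian and analytic. If the approximation Q equals the exact posterior, the sample average of H_Q − H(d, ξ) reproduces the log-evidence exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 1.3, 0.5

# conjugate 1-D model: xi ~ N(0, 1), d | xi ~ N(xi, sigma^2)
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * d / sigma**2
log_evidence = (-0.5 * np.log(2 * np.pi * (1 + sigma**2))
                - 0.5 * d**2 / (1 + sigma**2))

# sampled bound: <H_Q - H(d, xi)> over samples of Q (here the exact posterior)
xi = post_mean + np.sqrt(post_var) * rng.standard_normal(100_000)
H_joint = (0.5 * xi**2 + 0.5 * (d - xi) ** 2 / sigma**2
           + 0.5 * np.log(2 * np.pi) + 0.5 * np.log(2 * np.pi * sigma**2))
H_Q = 0.5 * (xi - post_mean) ** 2 / post_var + 0.5 * np.log(2 * np.pi * post_var)
elbo = np.mean(H_Q - H_joint)
```

Whenever Q differs from the true posterior, the same expression falls below the log-evidence by exactly KL(Q; P), which is what makes it usable for model comparison.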

RMHMC with Metric Approximation
As initially discussed in the introduction, aside from variational inference methods there exist Markov chain Monte-Carlo (MCMC) methods that utilize the geometry of the posterior to increase sampling efficiency. A recently introduced Hamilton Monte-Carlo (HMC) method called Riemannian manifold HMC (RMHMC) utilizes the same posterior metric as discussed in this work in order to define a Riemannian manifold on which the HMC integration is performed. As one of the key results presented here yields an approximate isometry for this manifold, we would like to study the impact of the proposed coordinate transformation on RMHMC. To do so, recall that in HMC the random variable ξ ∈ R^M, which is distributed according to a posterior distribution P(ξ), is accompanied by another random variable p ∈ R^M, called the momentum, and their joint distribution P(ξ, p) is factorized into the posterior P(ξ) and the conditional distribution P(p|ξ). The main idea of HMC is to regard the joint Hamiltonian H(ξ, p) = −log(P(ξ, p)) as an artificial Hamiltonian system that can be used to construct a new posterior sample from a previous one by following trajectories of the Hamiltonian dynamics. In particular, suppose that we are given some random realization ξ_0 of P(ξ); we may use the conditional distribution P(p|ξ_0) to generate a random realization p_0. Given a pair (ξ_0, p_0), HMC solves the dynamical system associated with the Hamiltonian H(ξ, p) for some integration time t* to obtain a new pair (ξ*, p*). As Hamiltonian dynamics is both energy and volume preserving by construction, one can show that if (ξ_0, p_0) is a random realization of P(ξ, p), then so is (ξ*, p*). This procedure may be repeated until the desired number of posterior samples is collected. In practice, the performance of an HMC implementation for a specific distribution P(ξ) strongly depends on the choice of the conditional distribution P(p|ξ).
To simplify the Hamiltonian trajectories and enable a fast traversal of the posterior, RMHMC chooses the momentum distribution P(p|ξ) to be a normal distribution with the posterior metric as its covariance. In the coordinate system induced by the transformation constructed here, this metric becomes approximately the Euclidean metric, such that the momentum distribution reduces to a standard normal and the Hamiltonian dynamics simplify accordingly.
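To make the HMC procedure above concrete, here is a minimal sketch of a leapfrog-based HMC sampler for a one-dimensional standard-normal target, i.e., the situation RMHMC would face in coordinates where the metric is approximately the identity; the step size, trajectory length, and target are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Standard-normal target: H(xi) = xi^2 / 2, with unit-mass momentum, so that
# the joint Hamiltonian is H(xi, p) = xi^2 / 2 + p^2 / 2.
def grad_H(xi):
    return xi

def H(xi, p):
    return 0.5 * xi**2 + 0.5 * p**2

def hmc_step(xi, eps=0.1, n_leap=20):
    p = rng.standard_normal()                 # resample momentum from P(p|xi)
    xi_new, p_new = xi, p
    p_new -= 0.5 * eps * grad_H(xi_new)       # leapfrog: initial half kick
    for _ in range(n_leap - 1):
        xi_new += eps * p_new                 # drift
        p_new -= eps * grad_H(xi_new)         # full kick
    xi_new += eps * p_new
    p_new -= 0.5 * eps * grad_H(xi_new)       # final half kick
    # Metropolis correction for the discretization error of the integrator
    if rng.random() < np.exp(H(xi, p) - H(xi_new, p_new)):
        return xi_new
    return xi

samples = []
xi = 0.0
for _ in range(5000):
    xi = hmc_step(xi)
    samples.append(xi)
samples = np.asarray(samples)
print(samples.mean(), samples.var())  # close to 0 and 1 for this target
```

The energy- and volume-preservation argument from the text appears here as the Metropolis correction: for an exact integrator the acceptance probability would be one.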

Pathological Cases
As discussed in Section 2.1, one property that can violate our assumptions is a non-monotonic change in the metric. To this end, consider a sigmoid-normal distributed random variable and a measurement subject to additive, independent noise, i.e., d = σ(ξ) + n with ξ ~ N(0, σ_p²) and n ~ N(0, σ_n²), where σ(•) denotes the sigmoid function. The resulting posterior, its associated coordinate transformation, as well as its geoVI approximation are displayed in Figure 17 for a case with (σ_p, σ_n, d) = (3, 0.2, 0.2). We find that, similar to the one-dimensional log-normal example of Section 3.1.2, the approximation quality depends on the chosen expansion point. However, the changes in approximation quality are much more drastic than in the log-normal example. In particular, due to the sigmoid non-linearity, there exists a turning point in the coordinate transformation g, and if we choose an expansion point close to this point, the approximation to the transformation strongly deviates from the optimal transformation as we move away from it. As a result, the approximation to the posterior (left panel of Figure 17) acquires a heavy tail that is present neither in the true posterior nor in the approximation using the optimal transformation. Nevertheless, there may very well exist cases where such a heavy tail is present in the optimal approximation to the transformation. Even in the depicted case, where the tail is only present for sub-optimal choices of the expansion point, an optimization algorithm might have to traverse this sub-optimal region to reach the optimum. Thus, the heavy tail can lead to extreme samples for some intermediate approximations, and therefore the geoVI algorithm could become unstable.
Figure 17. As Figure 2, but for the sigmoid-normal distributed case. In addition to the exact isometry g_iso, the approximation using the optimal expansion point ξ̄ = −0.68 and a pathological heavy-tail example using ξ̄ = −0.1 are displayed.
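The non-monotonic behaviour of the metric can be made explicit numerically. The following sketch assumes a one-dimensional sigmoid model d = σ(ξ) + n with the stated parameters (σ_p, σ_n) = (3, 0.2) and evaluates a scalar metric built from the Fisher metric of the Gaussian likelihood plus the prior contribution; the exact functional form is an assumption for illustration:

```python
import numpy as np

# Assumed toy setup: xi ~ N(0, sigma_p^2), d = sigmoid(xi) + n, n ~ N(0, sigma_n^2).
sigma_p, sigma_n = 3.0, 0.2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def metric(xi):
    """Scalar metric: squared response derivative over the noise variance,
    plus the prior precision."""
    ds = sigmoid(xi) * (1.0 - sigmoid(xi))   # derivative of the sigmoid
    return ds**2 / sigma_n**2 + 1.0 / sigma_p**2

xi = np.linspace(-6.0, 6.0, 1201)
m = metric(xi)
# The metric rises towards xi = 0 and falls again: the non-monotonic change
# responsible for the turning point in the transformation g.
print(xi[np.argmax(m)])  # maximum at xi = 0
```

The peak of m(ξ) at the sigmoid's inflection point is exactly the turning point around which a badly placed expansion point produces the heavy-tailed approximation.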
In a second example, we consider a bi-modal posterior distribution generated from a Gaussian measurement of a polynomial. Specifically, we consider a Gaussian likelihood for a polynomial in ξ, with ξ being a priori standard normal distributed. As can be seen in Figure 18, this scenario leads to a bi-modal posterior distribution with two well-separated, asymmetric modes. We find that the geometrically optimal transformation g_iso also leads to a bi-modal distribution in the transformed coordinates; however, the local asymmetry and curvature of each mode have approximately been removed. Thus, while an approximation of the posterior by means of a single unit Gaussian distribution is apparently not possible, each mode may be approximated individually, at least if the modes are well separated. If we consider the approximation of the coordinate transformation used within geoVI and choose as expansion point the optimal point associated with one of the two modes, the approximation remains valid for the chosen mode and the transformation stays close to the optimal transformation. However, if we move away from this mode towards the other one, the approximation quickly deviates from g_iso and eventually becomes non-invertible. Therefore, only one of the modes can be approximated. Here, care must be taken, as in practical applications the inversion of g is performed numerically, and one has to ensure that the inversion does not end up on the second branch of the transformation. This summarizes the two main issues that may render a geoVI approximation of a posterior distribution invalid. The challenges related to multi-modality appear to be quite fundamental: in its current form, the geoVI method falls into the category of methods that utilize local information of the posterior, which all suffer from the inability to deal with more than a single mode.
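The branch issue in the numerical inversion of g can be illustrated with a generic Newton iteration: for a non-injective map, the branch reached depends entirely on the starting point. The function f below is a hypothetical stand-in for a multi-branch transformation, not the actual g:

```python
# Newton inversion of a non-injective map: depending on the initial guess, the
# iteration converges to different pre-images of the same target value y.
def f(x):
    return x**3 - 3.0 * x   # toy non-injective function

def df(x):
    return 3.0 * x**2 - 3.0

def newton_invert(y, x0, n_iter=50):
    x = x0
    for _ in range(n_iter):
        x = x - (f(x) - y) / df(x)
    return x

y = 1.0
root_a = newton_invert(y, x0=-2.0)   # lands on the left branch
root_b = newton_invert(y, x0=+2.0)   # lands on the right branch
print(root_a, root_b)  # two different pre-images of the same y
```

This is why, in the bi-modal case, the numerical inversion must be started close to the mode being approximated: otherwise it may silently return a point on the other branch.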
The problems related to turning points are more specific to geoVI, and their implications need to be studied further in order to generalize its range of applicability in the future. One promising finding is that this issue appears to be solely related to the local approximation of the transformation with a "bad" expansion point, as the geometrically optimal transformation g_iso apparently does not show such behavior. Therefore, an extension of the current approximation technique using, e.g., multiple expansion points, or identifying and excluding such "bad" expansion points during optimization, may provide a solution to this problem. At the current stage of development, however, it is unclear how to incorporate such ideas into the algorithm without losing the functional form of g that allows for the numerically efficient implementation at hand.

Summary and Outlook
In this work, we introduced a coordinate transformation for posterior distributions that yields a coordinate system in which the distributions take a geometrically simple form. In particular, we construct a metric as the sum of the Fisher information metric of the likelihood and the identity matrix for a standard prior distribution, and construct the transformation that relates this metric to the Euclidean metric. In this transformed coordinate system, we introduce geometric Variational Inference (geoVI), which performs a variational approximation of the transformed posterior distribution by a normal distribution with unit covariance. As the coordinate transformation is only approximately available and relies on an expansion point around which it is most accurate, the VI task reduces to finding the expansion point for which the variational KL between the true posterior and the approximation becomes minimal. A numerically efficient realization enables high-dimensional applications of geoVI: even though the transformation is not volume preserving, geoVI avoids computing the associated log-determinant of the Jacobian of the transformation at any point, since the expansion point used to generate intermediate samples is only passively updated. Furthermore, applying the constructed coordinate transformation costs about as much as computing the gradient of the posterior Hamiltonian. In addition, to generate random realizations, computing the matrix square root of the metric that would otherwise appear can be entirely avoided, and the inverse transformation is achieved implicitly by second-order numerical inversion.
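The sampling step without an explicit matrix square root can be sketched as follows for a toy metric M = I + J^T J, with a hypothetical dense Jacobian standing in for the true one (an actual implementation would use a conjugate-gradient solver instead of a dense solve):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sketch: draw samples with covariance M^{-1}, M = I + J^T J, without any
# matrix square root of M. Drawing y = xi + J^T eta with xi, eta standard
# normal gives Cov(y) = I + J^T J = M; solving M r = y then yields r with
# covariance M^{-1} M M^{-1} = M^{-1}.
dim_s, dim_d = 4, 3
J = rng.standard_normal((dim_d, dim_s))   # hypothetical Jacobian
M = np.eye(dim_s) + J.T @ J

def sample_inverse_metric():
    y = rng.standard_normal(dim_s) + J.T @ rng.standard_normal(dim_d)
    return np.linalg.solve(M, y)          # in practice: conjugate gradient

samples = np.stack([sample_inverse_metric() for _ in range(200_000)])
emp_cov = samples.T @ samples / len(samples)
print(np.max(np.abs(emp_cov - np.linalg.inv(M))))  # small up to MC noise
```

Only matrix-vector applications of J, J^T, and a linear solve are needed, which is what makes the scheme viable in thousands of dimensions.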
Despite being an approximation method, we find that geoVI is successfully applicable in non-linear but uni-modal settings, as demonstrated on multiple examples. In low-dimensional examples, we see that non-linear features of the posterior distribution are accurately captured by the coordinate transformation. This property appears to carry over to high dimensions, where geoVI improves the overall reconstruction quality compared to its linearized counterpart MGVI. Nevertheless, we also find remaining pathological cases in which further development is necessary to achieve a good approximation quality.
In addition to posterior approximation, geoVI results can be used to provide an evidence lower bound (ELBO), which in turn can be used for model comparison. Finally, we demonstrated the connection to another posterior sampling technique, Riemannian manifold HMC, a Hamilton Monte-Carlo (HMC) method that utilizes the same metric as geoVI.
All in all, the geoVI algorithm, and more generally the constructed approximate coordinate transformation, provide a fast and accurate way to approximate non-linear and high-dimensional posterior distributions.

Acknowledgments: The authors thank Jakob Knollmüller for the development of the MGVI algorithm and Martin Reinecke for his contributions to NIFTy.

Conflicts of Interest:
The authors declare no conflict of interest.

For some likelihoods, however, such a decomposition is not accessible in a simple form. One example used in this work is a normal distribution with unknown mean m and variance v. The Hamiltonian of a one-dimensional example takes the form of Equation (A4). While there is no simple decomposition by means of the Jacobian of some function x, an approximation is available for which x takes the form of Equation (A6). Note that, as opposed to the Fisher metric, this approximation depends on the observed data d. In fact, we can recover the Fisher metric from this approximation by taking the expectation value w.r.t. the likelihood; therefore, it may be regarded as a local approximation using the observed data. All examples in this work that use a normal distribution in which the variance is inferred in addition to the mean rely on this approximation.
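The relation between a data-dependent approximation and the Fisher metric can be checked numerically: averaging the outer product of the score of N(d; m, v) over the likelihood recovers the well-known Fisher metric diag(1/v, 1/(2v²)). The following Monte-Carlo sketch uses hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte-Carlo check of the Fisher metric of N(d; m, v) with unknown mean m
# and variance v: E[score score^T] = diag(1/v, 1/(2 v^2)).
m, v = 0.7, 2.0
d = m + np.sqrt(v) * rng.standard_normal(2_000_000)

# Score: gradient of log N(d; m, v) with respect to (m, v).
score_m = (d - m) / v
score_v = 0.5 * (d - m) ** 2 / v**2 - 0.5 / v
score = np.stack([score_m, score_v])

fisher_mc = score @ score.T / d.size        # sample average of outer products
fisher_exact = np.diag([1.0 / v, 0.5 / v**2])
print(fisher_mc)   # close to diag(1/v, 1/(2 v^2))
```

This is the expectation step described in the text: data-dependent curvature terms average out, leaving the data-independent Fisher metric.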

Multiple Likelihoods
In general, we may encounter measurement situations where multiple likelihoods are involved, e.g., if we aim to constrain s with multiple data-sets simultaneously. Specifically, consider a set of D data-sets {d_i}_{i∈{1,...,D}} and an associated, mutually independent set of likelihoods, such that the joint likelihood factorizes as P(d_1, ..., d_D|s) = ∏_{i=1}^{D} P(d_i|s). The corresponding Fisher metric then takes the form of a sum, M_{d_1,...,d_D|s} = Σ_{i=1}^{D} M_{d_i|s}. If we assume that, for every individual metric M_{d_i|s}, an associated transformation x_i(s) is available that satisfies Equation (A2), we can stack them together to form a combined transformation x(s) ≡ (x_1(s), ..., x_D(s))^T that automatically satisfies Equation (A2) for the joint metric M_{d_1,...,d_D|s}.
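The stacking construction can be verified directly: for (here hypothetical, linear) transformations with Jacobians J_1 and J_2, the Gram matrix of the stacked Jacobian equals the sum of the individual metrics:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two independent likelihoods with toy linear transformations x_1, x_2:
# stacking x = (x_1, x_2)^T gives a Jacobian whose Gram matrix is the sum
# of the individual Fisher metrics, M = J_1^T J_1 + J_2^T J_2.
dim_s = 3
J1 = rng.standard_normal((2, dim_s))   # Jacobian of x_1
J2 = rng.standard_normal((4, dim_s))   # Jacobian of x_2

J = np.vstack([J1, J2])                # Jacobian of the stacked transformation
M_stacked = J.T @ J
M_sum = J1.T @ J1 + J2.T @ J2
print(np.max(np.abs(M_stacked - M_sum)))  # zero up to float rounding
```

The same identity holds for non-linear x_i(s), since the Jacobian of the stacked map is the vertical concatenation of the individual Jacobians at every point s.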
variables (again see Table 1) that determine the variance and shape of the deviations of τ from a straight line (i.e., the deviations of A from a power-law).