Variational Bayesian Approximation (VBA): Implementation and Comparison of Different Optimization Algorithms

In any Bayesian computation, the first step is to derive the joint distribution of all the unknown variables given the observed data. Then, we have to carry out the computations. There are four general methods for performing these computations: joint MAP optimization; posterior expectation computations, which require integration methods; sampling-based methods, such as MCMC, slice sampling, nested sampling, etc., for generating samples and numerically computing expectations; and, finally, the Variational Bayesian Approximation (VBA). In this last method, which is the focus of this paper, the objective is to approximate the joint posterior with a simpler distribution that allows for analytical computations. The main tool in VBA is the Kullback–Leibler Divergence (KLD), used as a criterion to obtain that approximation. Even if, theoretically, this can be carried out formally, for practical reasons, we consider the case where the joint distribution is in the exponential family, and so is its approximation. In this case, the KLD becomes a function of the usual parameters or the natural parameters of the exponential family, and the problem becomes one of parametric optimization. Thus, we compare four optimization algorithms: general alternate functional optimization; parametric gradient-based optimization with the normal and the natural parameters; and the natural gradient algorithm. We study their relative performances on three examples to demonstrate the implementation of each algorithm and compare their efficiency.


Introduction
Bayesian inference starts by defining the joint posterior distribution of all the unknowns in the model given the data. Then, to obtain a point estimate, there are two main solutions: Joint Maximum A Posteriori (JMAP) and computation of posterior expectations. JMAP requires optimization to determine the unknown parameters that maximize the joint posterior distribution. Computing posterior expectations requires integration, i.e., calculating the expected values of the unknown parameters. These computations are possible analytically in the simple case of linear models with Normal priors, or for exponential families with conjugate priors. In all other cases, the multivariate integrations are too costly. Then, there are two main tools. First, sampling methods such as Markov Chain Monte Carlo (MCMC), slice sampling, nested sampling, etc., where we generate samples and then compute means, variances, covariances, etc. Second, analytic approximation methods, which first approximate the original posterior with a simpler one to facilitate analytical computations. The Variational Bayesian Approximation (VBA) is the main method of this kind, using the Kullback–Leibler divergence (KLD) as the criterion to obtain such an approximation. However, even if, theoretically, the functional optimization of the KLD is possible, it still needs integral computations, which are tractable only in particular cases. In VBA, the joint posterior p(x, y) is approximated by a separable q(x, y) = q1(x) q2(y) via minimizing KL(q1 q2 | p). When q is separable in all the variables of p, the approximation is also called the Mean Field Approximation (MFA).
To obtain the approximate marginals q1 and q2, we must minimize KL(q1 q2 | p). A first standard and general algorithm is the alternate analytic optimization of KL(q1 q2 | p) with respect to q1, and then with respect to q2. Finding the expressions of the functional derivatives of KL(q1 q2 | p) with respect to q1 and q2 and equating them to zero alternately, we obtain an iterative optimization algorithm. A second general approach is optimization on the Riemannian manifold, where we consider the case in which p is in the exponential family and so are q1 and q2. In that case, KL(q1 q2 | p) becomes a function of the parameters θ of the exponential family. Then, we can use any other optimization algorithm to obtain those parameters.
In this paper, we review the VBA technique and then compare four different VBA optimization algorithms in three examples: the standard alternate analytic optimization, gradient-based algorithms in the normal and natural parameter spaces [17,18], and a natural gradient algorithm [19–21]. The paper considers the first algorithm as our principal method and compares it with the three others. Building upon our initial proceedings paper [22], presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in 2022, this journal article provides more theoretical details, algorithms, and a comparison study.
In this article, we consider the case of exponential families, so that p and q are both in exponential families. Then, we write the expression of KL(q|p) and explore four different estimation algorithms for the unknown parameters of a model that incorporates prior information. The first iterative algorithm handles KL(•) directly, alternately optimizing it with respect to each marginal q_i. This is the standard optimization often used in VBA. The function to be optimized is the KLD. First, the gradient of the KLD with respect to all the unknown parameters needs to be found. Then, we can begin with initial values for the parameters, either estimated from the data or chosen deliberately, and repeat the iterative algorithm until it converges. If we denote the unknown normal parameters by θ̃, then the iterative optimization algorithm can be written as

θ̃^(k+1) = θ̃^(k) − γ ∇KL(θ̃^(k))

for the gradient-based algorithm, and the same with the natural parameters,

Λ̃^(k+1) = Λ̃^(k) − γ ∇KL(Λ̃^(k)),

with different values of γ. The natural gradient, as defined by Amari [23], is

∇̃h(θ) = F^{-1} ∇h(θ),

where F and h are the Fisher information matrix and the objective function, respectively; in our case, h is the KLD. To make Fisher's formula explicit, we change p(x, y) to p(x, y|θ) to show the parameters of p(x, y), and differentiate with respect to θ. The Fisher information matrix of p(x, y|θ) is given by

F = ⟨∇ ln p(x, y|θ) ∇ ln p(x, y|θ)^⊺⟩_{p(x,y)} = −⟨H_{ln p(x,y|θ)}⟩_{p(x,y)},

where H_{ln p(x,y|θ)} and ⟨•⟩_{p(x,y)} are the Hessian matrix of ln p(x, y|θ) with respect to θ and the expectation with respect to the p(x, y) distribution, respectively. An approximation of the Fisher information, introduced by Schraudolph [24], is called the Empirical Fisher:

F ≈ ⟨∇ ln p(x, y|θ) ∇ ln p(x, y|θ)^⊺⟩_{q(x,y)} ≈ (1/∥S_XY∥) Σ_{(x,y)∈S_XY} ∇ ln p(y|x, θ) ∇ ln p(y|x, θ)^⊺.
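To make these parametric updates concrete, the following minimal sketch (not taken from the paper) applies the plain gradient step and the natural gradient step, built from the Fisher matrix defined above, to a toy KLD objective; the objective, its gradient, and the Fisher matrix used here are illustrative stand-ins for the expressions derived later for each example.

```python
import numpy as np

# Toy objective: KL between q(theta) = N(mu, exp(2*s)) and a fixed target p = N(0, 1),
# parameterized by theta = (mu, s).  This is only a stand-in for the KL expressions of the paper.
def kl(theta):
    mu, s = theta
    v = np.exp(2 * s)                      # variance of q
    return 0.5 * (v + mu**2 - 1.0) - s     # KL(N(mu, v) || N(0, 1)) with ln v = 2 s

def grad_kl(theta):
    mu, s = theta
    v = np.exp(2 * s)
    return np.array([mu, v - 1.0])         # d/dmu, d/ds

def fisher(theta):
    # Fisher information of N(mu, exp(2 s)) in the (mu, s) parameterization
    mu, s = theta
    v = np.exp(2 * s)
    return np.array([[1.0 / v, 0.0], [0.0, 2.0]])

theta_gd = np.array([2.0, 1.0])            # illustrative starting point
theta_ng = theta_gd.copy()
gamma, rho = 0.1, 1.0                      # step sizes (rho = 1 as in Martens' suggestion)
for k in range(200):
    theta_gd = theta_gd - gamma * grad_kl(theta_gd)                        # plain gradient step
    theta_ng = theta_ng - rho * np.linalg.solve(fisher(theta_ng),
                                                grad_kl(theta_ng))         # natural gradient step
print("gradient descent :", theta_gd, "KL =", kl(theta_gd))
print("natural gradient :", theta_ng, "KL =", kl(theta_ng))
```

Both loops converge to the minimizer (mu, s) = (0, 0) here; the natural gradient version rescales the step by the local curvature of the parameterization.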
The fundamental natural gradient iteration algorithm is

θ̃^(k+1) = θ̃^(k) − ρ_k F^{-1} ∇KL(θ̃^(k)),

where ρ_k is a learning-rate schedule. Martens [25] suggested an optimal update, based on a particular second-order local approximation of h, with ρ_k = 1. We consider three examples (Normal-Inverse-Gamma, multivariate Normal, and a linear inverse problem) to assess the performance and convergence speed of the algorithms through multiple simulations. We propose the following organization of this paper: In Section 2, we present a brief explanation of the basic VBA analytical alternate optimization algorithm. In Section 3, we illustrate our first example, related to the Normal-Inverse-Gamma distribution, analytically and, in practice, explain the outcomes of the four algorithms for estimating the unknown parameters. In Section 4, we study a more complex example of a multivariate Normal distribution whose mean and variance-covariance matrix are unknown and have a Normal-Inverse-Wishart distribution. This section aims to obtain the marginal distributions of µ and Σ using a set of multivariate Normal observations generated with this mean and variance. In Section 5, the example is the closest to realistic situations and is a linear inverse problem. We simulate the model with different dimensions to see the changes in the performance of the algorithms. In Section 6, we summarize the work presented in the article and compare the four recursive algorithms on the three examples.

Variational Bayesian Approach (VBA)
In this paper, our focus is on MFA as a subset of VBA. In MFA, the posterior distribution p(x, y) is approximated by a separable q(x, y): q(x, y) = q1(x) q2(y). The Kullback–Leibler divergence [26] is an information measure of the discrepancy between two probability functions and one of the most widely used measures of divergence and disparity between two density functions [27]. One advantage of KL(q|p) is its effectiveness in solving distributionally robust optimization problems [28]. Despite its computational and theoretical benefits, KL(q|p) faces certain challenges. These include its asymmetry, which complicates optimal model selection [27], and the analytical complexity of KL(q|p) when comparing Normal mixture models, along with a lack of practical computational algorithms [29]. Let p(x) and q(x) be two density functions of a continuous random variable x with support set S_X. The KL(q|p) function is defined as:

KL(q|p) = ∫_{S_X} q(x) ln [ q(x) / p(x) ] dx.
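As a quick numerical illustration of this definition (not part of the original text), the following sketch compares a direct numerical evaluation of the integral with the known closed form of the KLD between two univariate Normal densities; the chosen means and variances are arbitrary.

```python
import numpy as np
from scipy.stats import norm

# Two univariate Normal densities: q = N(0, 1), p = N(1, 2^2)  (arbitrary illustrative choices)
q = norm(loc=0.0, scale=1.0)
p = norm(loc=1.0, scale=2.0)

# Numerical evaluation of KL(q|p) = \int q(x) ln[q(x)/p(x)] dx on a fine grid
x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]
kl_num = np.sum(q.pdf(x) * (q.logpdf(x) - p.logpdf(x))) * dx

# Closed form for two Normals: ln(s_p/s_q) + (s_q^2 + (m_q - m_p)^2) / (2 s_p^2) - 1/2
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0) ** 2) / (2 * 2.0**2) - 0.5

print(kl_num, kl_exact)   # both approximately 0.4431
```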
The basic structure of VBA is to minimize KL(q|p) and find an approximation of p whose density factorizes over all the variables, consisting of the state and hidden variables and all the unknown parameters. For instance, x and y represent the state and hidden variables, respectively. The unknown parameters refer to the parameters of the prior distributions of x and y.
For simplicity, we assume a bivariate distribution p(x, y) and want to approximate it via the alternate optimization; then, we have:

KL(q1 q2 | p) = −H(q1) − H(q2) − ⟨ln p(x, y)⟩_{q1 q2},   (5)

where:

H(q1) = −∫ q1(x) ln q1(x) dx  and  H(q2) = −∫ q2(y) ln q2(y) dy

are, respectively, the Shannon entropies of x and of y, and:

⟨ln p(x, y)⟩_{q1 q2} = ∫∫_{(x,y)∈S_XY} q1(x) q2(y) ln p(x, y) dx dy.
When optimizing with respect to one factor, the entropy of the other factor is a fixed term, so the minimization essentially acts on ⟨ln p(x, y)⟩_{q1 q2}. Now, differentiating Equation (5) with respect to q1, and then with respect to q2, and equating the derivatives to zero, we obtain:

q1(x) ∝ exp( ⟨ln p(x, y)⟩_{q2(y)} )  and  q2(y) ∝ exp( ⟨ln p(x, y)⟩_{q1(x)} ).   (6)

These results can easily be extended to more dimensions [10]. They do not have any closed form, because they depend on the expression of p(x, y) and those of q1 and q2. An interesting case is that of exponential families and conjugate priors where, writing p(x, y) = p(x|y) p(y) = p(y|x) p(x), we can consider p(y) as the prior, p(x|y) as the likelihood, and p(y|x) as the posterior distribution. We know that, if p(x, y) is in the exponential family, q1(x) and q2(y) are also in the exponential family, q1(x) is conjugate to p(y|x), and q2(y) is conjugate to p(x|y). To illustrate all these properties, we give the details of these expressions for a first simple example, the Normal-Inverse-Gamma p(x, y) = N(x|µ, y) IG(y|α, β), with q1(x) = N(x|µ̃′, ṽ) and q2(y) = IG(y|α̃, β̃). For this simple case, we first give the expression of KL(q|p) with q1(x) = N(x|µ̃′, ṽ) and q2(y) = IG(y|α̃, β̃) as a function of the parameters θ̃ = (µ̃′, ṽ, α̃, β̃), and then the expressions of the four above-mentioned algorithms, and we study their convergence. For a numerical comparison, we start by generating n = 100 samples from p(x, y) = N(x|0, y) IG(y|3, 1), so we know the true parameters (µ = 0, α = 3, β = 1). Then, for different initializations of θ̃ = (µ̃′, ṽ, α̃, β̃), we run the four algorithms and compare the results. The final resulting marginal is, in this case, the Student-t:

S(x|µ, α, β) = ∫ N(x|µ, y) IG(y|α, β) dy.

We can then compare the true S(x|µ, α, β) with the obtained S(x|µ̃, α̃, β̃), as well as with q(x, y|µ̃, ṽ, α̃, β̃) = N(x|µ̃, ṽ) IG(y|α̃, β̃). As we will see in the rest of the paper, the initialization is very important. When we have samples of (x, y), we can initialize the parameters by their empirical values, obtained by simple classical methods such as the method of moments. Another issue is the stopping criterion, which can be based either on the KLD criterion during successive iterations or on any distance between the parameters at successive iterations.
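The alternating updates in (6) can be illustrated with a minimal sketch on a discretized toy joint distribution; the correlated bivariate Normal target, the grid, and the number of iterations below are illustrative assumptions rather than the paper's example.

```python
import numpy as np

# Discretize a toy joint p(x, y) on a grid: a correlated bivariate Normal (illustrative target)
x = np.linspace(-4, 4, 200)
y = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(x, y, indexing="ij")
rho = 0.8
log_p = -(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2))   # unnormalized log-density

# Initialize the separable factors q1(x), q2(y) as uniform histograms over the grid
q1 = np.full_like(x, 1.0 / len(x))
q2 = np.full_like(y, 1.0 / len(y))

for _ in range(50):
    # q1(x) ∝ exp( <ln p(x, y)>_{q2(y)} ), then normalize on the grid
    log_q1 = log_p @ q2
    q1 = np.exp(log_q1 - log_q1.max()); q1 /= q1.sum()
    # q2(y) ∝ exp( <ln p(x, y)>_{q1(x)} )
    log_q2 = log_p.T @ q1
    q2 = np.exp(log_q2 - log_q2.max()); q2 /= q2.sum()

# For this Gaussian target, the converged factor q1 is Normal-shaped with variance 1 - rho^2
var_q1 = np.sum(q1 * x**2) - np.sum(q1 * x) ** 2
print("q1 variance ≈", var_q1, " (theory: 1 - rho^2 =", 1 - rho**2, ")")
```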

Normal-Inverse-Gamma Distribution Example
The purpose of this section is to explain in detail the process of performing the calculations in the alternate analytic algorithm. For this, we consider a simple Normal-Inverse-Gamma distribution, for which we have all the necessary expressions. This distribution has many applications in fields like finance, econometrics, engineering, and machine learning. The objective here is to compare the four different algorithms mentioned earlier.
The practical problem considered here is the following: suppose the data are z = (z_1, ..., z_N) and model them by Z = X + E, where E ∼ N(0, v), so that Z|X ∼ N(X, v). Furthermore, X ∼ N(µ, Y) and Y ∼ IG(α, β); then:

p(x, y|z) ∝ [ ∏_{i=1}^{N} N(z_i | x, v) ] N(x|µ, y) IG(y|α, β).

Putting all the elements together, we see that p(x, y|z) is an NIG-type model that is not separable in x and y. We would like to approximate it with a separable one, q(x, y) = q1(x) q2(y). In this simple case, assume we have a sensor that delivers a physical quantity X, N times, and we want to model these data. In a first step, we model them as N(x|µ′, v) with fixed µ′ and v. Then, it is easy to estimate the parameters (µ′, v) either by Maximum Likelihood or by a Bayesian strategy. If we assume that the model is Normal with unknown variance, call this variance y, and assign an IG prior to it, then we have an NIG model for p(x, y). NIG priors have been applied in the wavelet context with correlated structures because they are conjugate to Normal priors [30]. We choose the Normal-Inverse-Gamma distribution because of this conjugacy property and its ease of handling. We show that the marginals are Student-t and IG. Working directly with the Student-t is difficult, so we want to approximate it with a Normal q1(x). This is equivalent to approximating p(x, y) with q1(x) q2(y). Now, we want to find the parameters µ′, v, α, and β that minimize KL(q1 q2 | p). This process is called MFVBA. Then, we compare four algorithms to obtain the parameters that minimize KL(q1 q2 | p). KL(q1 q2 | p) is convex with respect to q1 if q2 is fixed, and convex with respect to q2 if q1 is fixed, so we hope that the iterative algorithm converges. However, KL(q1 q2 | p) may not be convex in the space of the parameters, so we have to study the shape of this criterion with respect to the parameters µ′, v, α, and β.
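The non-separability of p(x, y|z) can be checked numerically. The sketch below (with arbitrary data and hyper-parameter values) evaluates the joint posterior on a grid and computes the KLD between it and the product of its own marginals; this quantity is positive whenever x and y remain dependent under the posterior.

```python
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(1)
v, mu, alpha, beta = 0.5, 0.0, 3.0, 1.0             # illustrative hyper-parameters
z = rng.normal(loc=0.3, scale=np.sqrt(v), size=20)  # illustrative observations

# Joint posterior p(x, y | z) ∝ Π_i N(z_i | x, v) · N(x | mu, y) · IG(y | alpha, beta), on a grid
x = np.linspace(-2.0, 2.0, 200)
y = np.linspace(0.02, 4.0, 200)
X, Y = np.meshgrid(x, y, indexing="ij")
log_post = (norm.logpdf(z[None, None, :], loc=X[..., None], scale=np.sqrt(v)).sum(axis=-1)
            + norm.logpdf(X, loc=mu, scale=np.sqrt(Y))
            + invgamma.logpdf(Y, a=alpha, scale=beta))
post = np.exp(log_post - log_post.max())
post /= post.sum()

# If p(x, y | z) were separable, it would equal the product of its own marginals;
# the KLD between the two (the mutual information of x and y) is therefore > 0 here.
px, py = post.sum(axis=1), post.sum(axis=0)
prod = np.outer(px, py)
kl_sep = np.sum(post * (np.log(post + 1e-300) - np.log(prod + 1e-300)))
print("KL(p || product of its marginals) =", kl_sep)
```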
We want to find p(x). For this purpose, we assume a simple Normal model, but with unknown variance y, so that the forward model can be written as p(x, y) = N(x|µ, y) IG(y|α, β). In this simple example, we know that p(x) is a Student-t distribution obtained by:

p(x) = ∫_0^∞ N(x|µ, y) IG(y|α, β) dy = S(x|µ, α, β).

We estimate the three parameters θ = (µ, α, β) from the data x and find an approximating distribution q(x) for p(x).
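The following sketch (not from the paper) verifies this marginalization by Monte Carlo, using the standard identity that the Normal-Inverse-Gamma mixture over y is a Student-t with 2α degrees of freedom, location µ, and scale sqrt(β/α); the parameter values are those of the first numerical example.

```python
import numpy as np
from scipy.stats import invgamma, norm, t

mu, alpha, beta = 0.0, 3.0, 1.0
rng = np.random.default_rng(0)

# Monte Carlo samples from p(x) = ∫ N(x | mu, y) IG(y | alpha, beta) dy
y = invgamma.rvs(a=alpha, scale=beta, size=200_000, random_state=rng)
x = norm.rvs(loc=mu, scale=np.sqrt(y), random_state=rng)

# Standard identity: this marginal is Student-t with 2*alpha d.o.f., location mu, scale sqrt(beta/alpha)
grid = np.linspace(-4, 4, 9)
emp_cdf = np.array([(x <= g).mean() for g in grid])
thr_cdf = t.cdf(grid, df=2 * alpha, loc=mu, scale=np.sqrt(beta / alpha))
print(np.max(np.abs(emp_cdf - thr_cdf)))   # small, e.g. below 0.01
```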
The main idea is to find such a q1(x) q2(y) as an approximation of p(x, y). Here, we show the standard alternate analytic optimization, step by step. For this, we start by choosing the conjugate families q1(x) = N(x|µ̃′, ṽ) and q2(y) = IG(y|α̃, β̃). Note that ⟨x⟩_p = µ and ⟨x⟩_{q1} = µ̃′.
In the first step, we have to calculate ln p(x, y) mentioned earlier:

ln p(x, y) = −(1/2) ln y − (x − µ)²/(2y) − (α + 1) ln y − β/y + c,

where c is a constant term independent of x and y. To use the iterative algorithm given in (6), starting with q1(x) = N(x|µ̃′, ṽ), we have to find q2(y), so we start by finding q2(y). The expectation of ln p(x, y) is taken with respect to q1(x); thus, the function q2(y) is equivalent to an Inverse-Gamma distribution IG(y | α̃, β̃ + (ṽ + (µ − µ̃′)²)/2). To do so, we only use the q1 distribution. We then have to take the expectation of ln p(x, y) with respect to q2 to find q1. Note that the first term does not depend on x, and ⟨1/y⟩_{q2} = 2α̃ / (2β̃ + ṽ + (µ − µ̃′)²), so q1 is again a Normal distribution, but with updated parameters. Note that we have obtained the conjugacy property: if p(x|y) = N(x|µ, y) and p(y) = IG(y|α, β), then p(y|x) = IG(y | α + 1/2, β + (x − µ)²/2). In this case, we also know that p(x|α, β) = S(x|µ, α, β).
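The conjugacy property stated above can be verified numerically; the sketch below, with illustrative values only, compares the grid-normalized posterior p(y|x) with IG(y | α + 1/2, β + (x − µ)²/2).

```python
import numpy as np
from scipy.stats import norm, invgamma

mu, alpha, beta, x_obs = 0.0, 3.0, 1.0, 1.3          # illustrative values

# Grid over y > 0
y = np.linspace(1e-4, 20.0, 200_000)
dy = y[1] - y[0]

# Unnormalized posterior p(y | x) ∝ N(x | mu, y) IG(y | alpha, beta), then grid-normalized
post = norm.pdf(x_obs, loc=mu, scale=np.sqrt(y)) * invgamma.pdf(y, a=alpha, scale=beta)
post /= post.sum() * dy

# Claimed conjugate form: IG(y | alpha + 1/2, beta + (x - mu)^2 / 2)
conj = invgamma.pdf(y, a=alpha + 0.5, scale=beta + 0.5 * (x_obs - mu) ** 2)

print(np.max(np.abs(post - conj)))   # close to 0 up to discretization error
```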
In these calculations, we first computed the distribution q2 and then q1. If we do the opposite and first obtain q1 and then q2, the parameter ṽ is eliminated from the iterative calculations, there is no recursive relationship between β̃ and ṽ, and it is only necessary to compute the value of the parameter β̃.
1. Standard alternate optimization algorithm: this algorithm converges to β̃ = 0, and consequently ṽ = 0, which is a degenerate (very strange) convergence; that is why the alternate algorithm cannot work in this case. For the other algorithms based on the normal parameters, such as the gradient-based and natural gradient algorithms, it is necessary to find the expression of KL(q1 q2 | p) as a function of the normal parameters θ̃ = (α̃, β̃, µ̃′, ṽ), and then the expression of the gradient ∇KL(θ̃) with respect to θ̃. The details are available in Appendix A.
2. The gradient-based algorithm with normal parameters:

θ̃^(k+1) = θ̃^(k) − γ ∇KL(θ̃^(k)),

where γ is a fixed value for the gradient algorithm. We propose two values for γ: γ = 1 and γ = 1/∥∇KL(θ̃)∥. For the third algorithm based on the normal parameters, we need to calculate the Fisher information. Then, the natural gradient based on KL(θ̃) is calculated as

∇̃KL(θ̃) = F^{-1} ∇KL(θ̃),

and the corresponding algorithm is the following: 3. The natural gradient algorithm:

θ̃^(k+1) = θ̃^(k) − ρ_k F^{-1} ∇KL(θ̃^(k)).
We consider another sub-algorithm for gradient-based optimization with the natural parameters, explained in detail in Appendix A. The algorithm produces different results for the last two components.
We generate n = 100 samples from the model p(x, y) = N(x|1, y) IG(y|3, 1) for the numerical computations. Thus, we know the exact values of the unknown parameters, but they are not used in the algorithms. The estimated parameters obtained with the alternative algorithm, the gradient-based algorithm with γ = 1 and γ = 1/∥∇KL(θ̃)∥, the natural gradient algorithm, and the gradient algorithm with the natural parameters are reported in Table 1, along with their contour and surface plots in Figures 1 and 2, respectively. All four algorithms attempt to minimize the same criterion, so the objectives are consistent, but the number of steps in the recursive process may vary, and each of them must reach the minimum of KL(•). In this simple example of the Normal-Inverse-Gamma distribution with the model p(x, y) = N(x|1, y) IG(y|3, 1), the numbers of steps to convergence of the alternative algorithm, the gradient-based algorithm with γ = 1 and γ = 1/∥∇KL(•)∥, the gradient with the natural parameters, and the natural gradient algorithm are 1, 10, 8, 2, and 9, respectively, using Maximum Likelihood Estimation (MLE) initializations. One crucial point is that the process should not be repeated excessively, because we only need to find a local minimum of KL(•). In this model, the VBA and the gradient with respect to the natural parameters are the fastest, and the VBA provides the best approximation for the model, as depicted in Figure 2.
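A minimal sketch of this data generation, together with a data-driven (method-of-moments) initialization of the kind mentioned in Section 2, is given below; the random seed and the resulting starting values are illustrative and do not reproduce the exact initializations of Table 1.

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(2)
n = 100
y = invgamma.rvs(a=3.0, scale=1.0, size=n, random_state=rng)   # y ~ IG(3, 1)
x = norm.rvs(loc=1.0, scale=np.sqrt(y), random_state=rng)      # x | y ~ N(1, y)

# Data-driven initialization (empirical / method-of-moments), as suggested in the text
mu0, v0 = x.mean(), x.var()                   # initial mu', v from the x samples
m, s2 = y.mean(), y.var()                     # method of moments for IG(alpha, beta):
alpha0 = m**2 / s2 + 2.0                      #   E[y] = beta/(alpha-1), Var[y] = beta^2/((alpha-1)^2 (alpha-2))
beta0 = m * (alpha0 - 1.0)
print(mu0, v0, alpha0, beta0)                 # rough starting values near (1, ., 3, 1)
```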
We simulate more models in Table 1 with the same Normal-Inverse-Gamma joint distribution but different parameters. We draw all the final results in Figure 2 to compare their visual appearance. We start the recursive processes with two groups of initial values for the unknown variables: the MLE-based values and evidence-free values. The selection for µ and v is quite simple, because we have the data and can plot a histogram and approximately guess the correct values. The situation for α and β is different and somewhat problematic to guess correctly. If we assume outlying values for α and β, the algorithm finds a minimum of KL(•) in the first stage of the iterations, or another local minimum far from the truth. In either case, the KLD value may increase or decrease as the iteration loop is executed. Thus, the initializations are crucial to obtaining the closest results for the unknown parameters. For instance, in N(x|1, y) IG(y|4, 6) with the starting points µ = 1, v = 3.5, α = 3, and β = 3, the natural gradient analytic algorithm remains at the initial points, so these initializations are not suitable for the available data.
The algorithm with the lowest KLD value provides the more accurate estimate of the model parameters. As discussed earlier, for the model N(x|0, y) IG(y|3, 1), the first, fourth, and fifth algorithms estimate the parameters well, as shown in Figure 2 and Table 1. Remember that we use two values for γ in the gradient-based algorithm with normal parameters, so there are five estimates in total for each unknown parameter. The first three algorithms approximate the parameters of the model N(x|1, y) IG(y|4, 6) well, but the issue lies in fitting the center, resulting in a shifted approximation. All methods perform well for the models N(x|−1, y) IG(y|7, 10), N(x|2, y) IG(y|6, 10), and N(x|−2, y) IG(y|10, 11). The best options for N(x|−1, y) IG(y|7, 10) include the VBA, the gradient with γ = 1, and the gradient with respect to the natural parameters. Considering the model N(x|2, y) IG(y|6, 10), the best approach is to use the gradient algorithms. The best option for the final model is the alternative approximation.
Table 1. Simulation results for five different models using the Normal-Inverse-Gamma distribution. We use two groups of initialization points. The first column shows the MLE for each parameter. The second column contains initializations that we refer to as evidence-free; they are not based on any information and are arbitrary. It is evident that, when we apply data-derived points, the results are highly accurate, whereas, when we rely on evidence-free points, the results deviate significantly from the true models. We also calculate the KL(•) values for the initializations and for the final estimates of each method. Each algorithm optimizes the function KL(•) in a different way. The alternative provides acceptable estimations of the correct parameters, although the criterion is only to minimize KL(•). The corresponding plots can be found in Figure 2 to provide a comprehensive overview of the approximations and facilitate comparisons.

To gain an overview of the KL(•) optimization process, we plot its trend with respect to the number of iterations in Figure 3. The left column shows the results for the MLE initializations, where the minimization of KL(•) occurs more quickly than in the right column, which is associated with the evidence-free initializations. We do not plot the trend of the natural gradient's KL(•) for the second and the last two models, because KL(•) becomes NaN in the initial iterations; it is therefore necessary to start the recursive process with points estimated from the available data. The algorithms approximate the joint density function with a separable one, but with different accuracy. In the following section, we tackle a more complex model.

Multivariate Normal-Inverse-Wishart Example
In the previous section, we explained how to perform the VBA optimization methods to approximate a complicated joint distribution function by a tractable product of marginals in a simple case study. In this section, a multivariate Normal case p(x) = N(x|µ, Σ) is considered, which is approximated by q(x) = ∏_i N(x_i | µ_i, ν_i) for different shapes of the covariance matrix Σ.
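For a purely Gaussian target, this factorized approximation has a well-known closed form: the optimal q keeps the means of p and uses variances ν_i = 1/Λ_ii, where Λ = Σ^{-1} is the precision matrix. The sketch below (with an arbitrary 3×3 covariance) illustrates this and the familiar fact that the mean-field variances underestimate the true marginal variances.

```python
import numpy as np

# Target: p(x) = N(x | mu, Sigma) with correlated components (illustrative values)
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])

# Mean-field result for a Gaussian target: the optimal factorized q(x) = Π N(x_i | mu_i, nu_i)
# keeps the means of p and uses nu_i = 1 / Lambda_ii, where Lambda = Sigma^{-1}.
Lambda = np.linalg.inv(Sigma)
nu = 1.0 / np.diag(Lambda)

print("marginal variances of p :", np.diag(Sigma))
print("mean-field variances nu :", nu)        # systematically smaller: MFA underestimates variances
```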
We assume that the basic structure of an available data set is multivariate Normal with unknown mean vector µ and variance-covariance matrix Σ. Their joint prior distribution is a Normal-Inverse-Wishart distribution NIW(µ, Σ | µ_0, κ, Ψ, ν), defined by N(µ | µ_0, (1/κ) Σ) IW(Σ | Ψ, ν), which is the generalized form of the classical NIG. One of its applications is in image segmentation tasks. The posteriors are multivariate Normal for the mean vector and Inverse-Wishart for the variance-covariance matrix. The Inverse-Wishart distribution has many useful properties as a density over unknown covariance matrices, and there is a variety of references on it. For example, Bouriga and Féron [31] worked on covariance matrix estimation using the Inverse-Wishart distribution; in this regard, they studied posterior properties and Bayesian risk, and they applied the Daniels and Kass prior [32] to the hyper-parameters of Σ.
Before approximating the expression of p(x, µ, Σ), we first define the notation used here. Since the Normal-Inverse-Wishart distribution is a conjugate prior for the multivariate Normal, the posterior distribution of µ and Σ again belongs to the same family, and the corresponding marginal posteriors can be written in closed form, where n and x denote the sample size and the observations of x, respectively. The proof is shown in Appendix B.
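As a hedged illustration of this conjugacy (the exact expressions used in the paper are in Appendix B), the sketch below implements the standard conjugate Normal-Inverse-Wishart update in its usual parametrization, which may differ in notation from the appendix, and checks it on simulated data.

```python
import numpy as np

def niw_posterior(x, mu0, kappa, Psi, nu):
    """Standard conjugate Normal-Inverse-Wishart update for i.i.d. rows x ~ N(mu, Sigma)."""
    n, d = x.shape
    xbar = x.mean(axis=0)
    S = (x - xbar).T @ (x - xbar)                         # scatter matrix around the sample mean
    kappa_n = kappa + n
    mu_n = (kappa * mu0 + n * xbar) / kappa_n
    nu_n = nu + n
    Psi_n = Psi + S + (kappa * n / kappa_n) * np.outer(xbar - mu0, xbar - mu0)
    return mu_n, kappa_n, Psi_n, nu_n

# Illustrative data and prior (not the paper's simulation settings)
rng = np.random.default_rng(3)
x = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[1.0, 0.4], [0.4, 0.8]], size=200)
mu_n, kappa_n, Psi_n, nu_n = niw_posterior(x, mu0=np.zeros(2), kappa=1.0, Psi=np.eye(2), nu=4.0)
print(mu_n)                      # posterior mean of mu, close to [1, -1]
print(Psi_n / (nu_n - 2 - 1))    # posterior mean of Sigma (IW mean: Psi_n / (nu_n - d - 1)), close to the true covariance
```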
For the other algorithms, we define Θ̃ = (κ̃, µ̃_0, Λ̃, ν̃, Ψ̃) for the calculation of KL(Θ̃). After some tedious computation, available in Appendix C, the expression of KL(Θ̃) is obtained, and the corresponding gradient-based and natural gradient algorithms are determined similarly to Section 3, based on ∇KL(Θ̃); they are also available in Appendix C.
To present the performance of the four algorithms, we work on a data set of observations x ∼ N(x|µ, Σ) whose parameters (µ, Σ) follow a Normal-Inverse-Wishart density structure. We use only the data x in the estimation processes. The results of the algorithms are given in Table 2 and drawn in Figure 4, along with the true contour plot of the model. In this figure, we have two different rows of results. In the first row, we use the MLE of the mean and covariance parameters as our prior information to start the algorithms and find the best local minimum of KL(•); in this regard, we put in some guessed values for κ and ν. At the bottom of the figure, we use uninformative initializations. This choice is crucial, because an evidence-free starting point may end in a local optimum far from the region we need; using the MLE prevents this problem to a great extent. Although the result of the alternative optimization is good with evidence-free starting points in Figure 4, it does not always work this way and depends on luck. The other gradient-based and natural gradient algorithms do not work well in this example; one reason can be that the more dimensions there are, the worse the results.
In Table 2, we report more models with a variety of dimensions (2, 3, and 5), applying the two initialization groups, MLE and uninformative starters. In all cases, the standard alternative algorithm gives the estimate closest to the real model in terms of KL(•) minimization. The larger KL(•) is, the greater the distance between the true and estimated posterior distributions. Therefore, the gradient-based and natural gradient algorithms are not worthwhile practices for extracting the posterior distribution when the data have more dimensions, while the alternative algorithm behaves better in this situation. In Table 3, we provide a Normal-Inverse-Wishart model with 10 dimensions. In comparison, the alternative method approximates this distribution, with its many parameters, more precisely than the other methods; it is easy to use with any large-scale dataset, and the result is suitable enough.

Table 3.
We provide an example of running the algorithms on a large-scale dataset. The standard alternative algorithm is outstanding here.
Linear Inverse Problem Example

In the inverse problem, we have the same multivariate case as before, except that, this time, the sensor does not deliver f directly, but g, such that g = H f + ϵ, where ϵ is again Normal with known or unknown variance. In the first step, we assume that the variance v_ϵ and H are known. H is called the transfer function of the sensor. So, this time, we have p(g, f, v) = N(g|f) N(f|v) IG(v), and we approximate it by q1(f|g, ṽ) q2(ṽ). The main reason for this choice of q1(f|g, ṽ) is that, when g and ṽ are known, it becomes a Normal q1(f|g, ṽ) = N(f|µ̂, Σ̂), where the expressions of µ̂ and Σ̂ are given in the paper. Now, we have a multivariate NIG problem. Thus, the joint distribution of g, f, and v is approximated by VBA as q(g, f, ṽ) = q1(g|f) q2(f) q3(ṽ). Although we can mathematically compute the marginals in this particular example, we wish to approximate them via the iterative alternative algorithm and compare it with the gradient and natural gradient-based algorithms. The objective is the estimation of q2(f), but, in the recursive process, q1(g|f) and q3(ṽ) are updated too. For simplicity, we suppose that the transfer matrix H is the identity matrix I. The final outputs, with the details and the algorithm, are given in Appendix D. The corresponding KL(θ̃) for θ̃ = (µ̃_g, ṽ_ϵ, ṽ_f, α̃, β̃), along with its gradient and the other algorithms, is available in Appendix E. We choose a model to assess the performance of these marginals and compare them with the gradient-based algorithms (with the normal and the natural parameters) and the natural gradient algorithm. The selected model is g = H f + ϵ with known parameter values; in the assessment procedure, we do not apply this information. The outputs of the algorithms are shown in Figure 5, as well as the actual contour plot. The KL(•) values for the five algorithm variants, starting from 108.91, are 4.93, 108.91, 19.57, 108.91, and 108.91, respectively. In this example, the best diagnosis is from the alternative algorithm, with the minimum KL(•).
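The expressions of µ̂ and Σ̂ mentioned above are detailed in the appendices; as a hedged sketch under the standard linear-Gaussian assumptions (zero-mean prior f ∼ N(0, ṽ_f I), noise ϵ ∼ N(0, ṽ_ϵ I), both variances known), the conditional posterior of f given g is Normal with Σ̂ = (HᵀH/ṽ_ϵ + I/ṽ_f)^{-1} and µ̂ = Σ̂ Hᵀ g/ṽ_ϵ, as illustrated below with H = I.

```python
import numpy as np

def gaussian_posterior(g, H, v_eps, v_f):
    """Posterior of f for g = H f + eps with eps ~ N(0, v_eps I) and prior f ~ N(0, v_f I)."""
    d = H.shape[1]
    Sigma_hat = np.linalg.inv(H.T @ H / v_eps + np.eye(d) / v_f)   # posterior covariance
    mu_hat = Sigma_hat @ H.T @ g / v_eps                           # posterior mean
    return mu_hat, Sigma_hat

# Illustrative simulation with H = I, as assumed in the text (values are arbitrary)
rng = np.random.default_rng(4)
d, v_eps, v_f = 4, 0.1, 1.0
f_true = rng.normal(scale=np.sqrt(v_f), size=d)
g = f_true + rng.normal(scale=np.sqrt(v_eps), size=d)              # H = I
mu_hat, Sigma_hat = gaussian_posterior(g, np.eye(d), v_eps, v_f)
print(f_true)
print(mu_hat)            # shrinks g toward 0 by the factor v_f / (v_f + v_eps)
```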
The objective of the inverse problem is to approximate the distribution of f. We use the distribution of g in Figure 5 to show the accuracy of the methods. In Figure 5, we can observe improved approximations for the second and the last two plots. In these algorithms, the key choice is the initialization; otherwise, the distribution keeps a higher KL(•) through the repetitions. Here are the estimates of the standard alternative algorithm, the gradient-based algorithm with γ = 1, γ = 1/∥∇KL(θ̃)∥, and with the natural parameters, and the natural gradient algorithm, respectively. We simulate additional g = H f + ϵ models with varying dimensions in Table 4. Based on the results, the optimal posterior is obtained from the standard alternative algorithm when we increase the dimensions, as it exhibits lower values of KL(•). In the inverse problem, the primary focus is on recovering f, so another way to compare could be the accuracy of v_f. When it comes to the variance-covariance matrix of f, the most effective algorithm is the parametric natural gradient optimization for approximating the posterior distribution of f, because it yields results that align closely with the data of f. We recall that we do not use the f data in the recursive processes; we only have the g data. This corresponds to real-world situations; since this is a simulated example, we do have the f data, but we do not utilize it. In the example above, we observe that the alternative method yields a lower KL(•) in the two-dimensional dataset, too. As the number of dimensions increases, so does the value of KL(•); however, the accuracy of the variance-covariance of f is not reduced. The initializations are calculated from guesses based on the available data g. Although we do not need the values of µ_f and v_f for the algorithm implementations, we include two columns for them. The purpose of the inverse problem is to extract f from the dataset of g. v_f is a diagonal matrix, which we represent here by its main diagonal components as a vector.

Conclusions
This paper presents four approximation algorithms, within the VBA framework, for estimating the density functions of hidden variables. We consider three examples: the Normal-Inverse-Gamma, the Normal-Inverse-Wishart, and a linear inverse problem. We provide the details of the first model here and include the details of the two other examples in the appendices. In all three models, the parameters are unobserved and estimated using recursive algorithms. We approximate the complex joint distribution by a product of marginal factors, resembling the independent case. We compare the performance and accuracy of the VBA algorithms. The standard alternative analytic optimization algorithm demonstrates the highest robustness in minimizing the KLD, particularly as the number of dimensions increases, and converges fairly quickly. The numerical computation cost is negligible in our examples, because we find the explicit form for each parameter and the algorithms are well-formulated. The main difference between the algorithms lies in the accuracy of the results. They estimate the complex joint distribution using separable ones. In the linear inverse problem, the standard alternative algorithm yields lower KL(•) values, while the parametric gradient algorithm with γ = ∥∇KL(•)∥^{-1} produces variance-covariance matrices that are closest to the real ones. The overall performance of the alternative algorithm is the most satisfactory, especially in high dimensions.

Thus, the desired function KL(θ̃) is as follows, where ψ_0(•) is the polygamma function of order 0, also called the digamma function. The expression of the gradient with respect to θ̃ = (α̃, β̃, µ̃′, ṽ) follows, where ψ_1(•) is the polygamma function of order 1. We substitute (A6) into the gradient-based formulas, so the algorithm is as follows: 2. The gradient-based algorithm with normal parameters, where γ is a fixed value. For the natural gradient algorithm, we need to calculate the gradient of ln p(x, y|θ̃) in (9). Remember that we change the notation p(x, y) to p(x, y|θ̃) for the natural gradient algorithm explained in (2). Since Equation (9) is a function of α̃, β̃, and µ̃′, the resulting algorithm carries no information about ṽ. Therefore, we replace ln p(x, y|θ̃) with ln q(x, y|θ̃), and we obtain the corresponding Fisher information matrix by using ln q(x, y|θ̃).
Another gradient algorithm to consider is based on the natural parameters Λ̃. We use q(x, y|Λ̃) instead of q(x, y) to clarify the computations, with Λ̃ = (λ̃_1, λ̃_2, λ̃_3, λ̃_4). So, f is separable thanks to the diag function. We summarize everything in a multivariate Normal distribution with µ̃_g = µ̃_f, where all the mathematical operations on vectors are componentwise. The recursive algorithm keeps ṽ_ϵ and α̃ fixed, as they can be estimated from the data set. The alternative algorithm for the other parameters is the following: 1. Standard alternate optimization algorithm.
To build the last two algorithms, we have to differentiate KL(θ̃) with respect to θ̃, where 1 is the all-ones vector and all operations are componentwise. The gradient algorithm is as follows: 2. The gradient-based algorithm with normal parameters. Another algorithm we consider here is the natural gradient algorithm with respect to the redefined θ̃. We fix ṽ_ϵ and define θ̃ = (µ̃_g, ṽ_f, α̃, β̃), because the empirical Fisher information with respect to ṽ_ϵ is 0. It is necessary to calculate the Fisher information of ln q(g, f, ṽ|θ̃), and the corresponding Fisher information matrix follows.

Figure 4.
The corresponding estimation values are in Table 2. From left to right, the columns show the contour plots with the MLE and evidence-free initializations, respectively. The first row shows the result of the standard alternative optimization, and the other two rows show the gradient and natural gradient algorithms, respectively. In this example, the standard alternative method gives the most fitted result. (a) Model (18); (b) Estimations of the model.

Figure 5.
(a) True contour plot of the model (5), almost separable; (b) Estimations of the model based on the standard alternative analytic approximation, the gradient-based algorithm with γ = 1, γ = ∥∇KL(θ̃)∥^{-1}, and with respect to the natural parameters, and the natural gradient algorithm, respectively, from left to right.

Table 2.
Several models and their estimates obtained via the four optimization algorithms with two different initializations. The lower the KL(•), the more accurate the fitted posterior distribution.

Table 4.
We have additional simulation results for the model g = H f + ϵ with 4, 6, and 10 dimensions.