Variationally Inferred Sampling through a Refined Bound

In this work, a framework to boost the efficiency of Bayesian inference in probabilistic models is introduced by embedding a Markov chain sampler within a variational posterior approximation. We call this framework “refined variational approximation”. Its strengths are its ease of implementation and the automatic tuning of sampler parameters, leading to a faster mixing time through automatic differentiation. Several strategies to approximate evidence lower bound (ELBO) computation are also introduced. Its efficient performance is showcased experimentally using state-space models for time-series data, a variational encoder for density estimation and a conditional variational autoencoder as a deep Bayes classifier.


Introduction
Bayesian inference and prediction in large, complex models, such as in deep neural networks or stochastic processes, remains an elusive problem [1][2][3]. Variational approximations (e.g., automatic differentiation variational inference (ADVI) [4]) tend to be biased and underestimate uncertainty [5]. On the other hand, depending on the target distribution, Markov Chain Monte Carlo (MCMC) [6] methods, such as Hamiltonian Monte Carlo (HMC) [7]), tend to be exceedingly slow [8] in large scale settings with large amounts of data points and/or parameters. For this reason, in recent years, there has been increasing interest in developing more efficient posterior approximations [9][10][11] and inference techniques that aim to be as general and flexible as possible so that they can be easily used with any probabilistic model [12,13].
It is well known that the performance of a sampling method depends heavily on the parameterization used [14]. This work proposes a framework to automatically tune the parameters of a MCMC sampler with the aim of adapting the shape of the posterior, thus boosting the Bayesian inference efficiency. We deal with a case in which the latent variables or parameters are continuous. Our framework can also be regarded as a principled way to enhance the flexibility of variational posterior approximation in search of an optimally tuned MCMC sampler; thus the proposed name of our framework is the variationally inferred sampler (VIS).
The idea of preconditioning the posterior distribution to speed up the mixing time of a MCMC sampler has been explored recently in [15,16], where a parameterization was learned before sampling via HMC. Both papers extend seminal work in [17] by learning an efficient and expressive deep, non-linear transformation instead of a polynomial regression. However, they do not account for tuning the parameters of the sampler, as introduced in Section 3, where a fully, end-to-end differentiable sampling scheme is proposed.
The work presented in [18] introduced a general framework for constructing more flexible variational distributions, called normalizing flows. These transformations are one of the main techniques used to improve the flexibility of current variational inference (VI) approaches and have recently pervaded the approximate Bayesian inference literature with developments such as continuous-time normalizing flows [19] (which extend an initial simple variational posterior with a discretization of Langevin dynamics) or householder flow for mixtures of Gaussian distributions [20]. However, they require a generative adversarial network (GAN) [21] to learn the posterior, which can be unstable in highdimensional spaces. We overcome this problem with our novel formulation; moreover, our framework is also compatible with different optimizers, rather than only those derived from Langevin dynamics [22]. Other recent proposals create more flexible variational posteriors based on implicit approaches typically requiring a GAN, as presented in [23] and including unbiased implicit variational inference (UIVI) [24] or semi-implicit variational inference (SIVI) [25]. Our variational approximation is also implicit but uses a sampling algorithm to drive the evolution of the density, combined with a Dirac delta approximation to derive an efficient variational approximation, as reported through extensive experiments in Section 5.
Closely related to our framework is the work presented in [26], where a variational autoencoder (VAE) is learned using HMC. We use a similar compound distribution as the variational approximation, yet our approach allows any stochastic gradient MCMC to be embedded, as well as facilitating the tuning of sampler parameters via gradient descent. Our work also relates to the recent idea of sampler amortization [27]. A common problem with these approaches is that they incur in an additional error-the amortization gap [28]which we alleviate by evolving a set of particles through a stochastic process in the latent space after learning a good initial distribution, meaning that the initial approximation bias can be significantly reduced. A recent related article was presented in [29], which also defined a compound distribution. However, our focus is on efficient approximation using the reverse KL divergence, which allows sampler parameters to be tuned and achieves superior results. Apart from optimizing this kind of divergence, the main point is that we can compute the gradients of sampler parameters (Section 3.3), whereas in [29] the authors only consider a parameterless sampler: thus, our framework allows for greater flexibility, helping the user to tune sampler hyperparameters. In the Coupled Variational Bayes (CVB) [30] approach, optimization is in the dual space, whereas we optimize the standard evidence lower bound (ELBO). Note that even if the optimization was exact, the solutions would coincide, and it is not clear yet what happens in the truncated optimization case,other than performing empirical experiments on given datasets. We thus feel that there is room for implicit methods that perform optimization in the primal space (besides this, they are easier to implement). Moreover, the previous dual optimization approach requires the use of an additional neural network (see the paper on the Coupled Variational Bayes (CVB) approach or [31]). This adds a large number of parameters and requires another architecture decision. With VIS, we do not need to introduce an auxiliary network, since we perform a "non-parametric" approach by back-propagating instead through several iterations of SGLD. Moreover, the lack of an auxiliary network simplifies the design choices.
Thus, our contributions include a flexible and consistent variational approximation to the posterior, embedding an initial variational approximation within a stochastic process; an analysis of its key properties; the provision of several strategies for ELBO optimization using the previous approximation; and finally, an illustration of its power through relevant complex examples.

Background
Consider a probabilistic model p(x|z) and a prior distribution p(z), where x denotes the observations and z ∈ R d the unobserved latent variables or parameters, depending on the context. Whenever necessary for disambiguation purposes, we shall distinguish between z for latent variables and θ for parameters. Our interest is in performing inference regarding the unobserved z by approximating its posterior distribution The integral assessing the evidence p(x) = p(z)p(x|z)dz is typically intractable. Thus, several techniques have been proposed to perform approximate posterior inference [3].

Inference as Optimization
Variational inference (VI) [4] tackles the problem of approximating the posterior p(z|x) with a tractable parameterized distribution q φ (z|x). The goal is to find the parameters φ so that the variational distribution q φ (z|x) (also referred to as variational guide or variational approximation) can be as close as possible to the actual posterior. Closeness is typically measured through the Kullback-Leibler divergence KL(q φ ||p), reformulated into the ELBO as follows: This is the objective to be optimized, typically through stochastic gradient descent techniques. To enhance flexibility, a standard choice for q φ (z|x) is a Gaussian distribution N (µ φ (x), σ φ (x)), with the mean and covariance matrix defined through a deep, non-linear model conditioned on observation x.

Inference as Sampling
HMC [7] is an effective sampling method for models whose probability is pointwise computable and differentiable. When scalability is an issue, as proposed by the authors in [32], a formulation of a continuous-time Markov process that converges to the target distribution p(z|x) can be used, which is based on the Euler-Maruyama discretization of Langevin dynamics where η t is the step size at time period t, and I is the identity matrix. The required gradient ∇ log p(z t , x) can be estimated using mini-batches of data. Several extensions of the original Langevin sampler have been proposed to increase its mixing speed, such as in [33][34][35][36]. We refer to these extensions as stochastic gradient MCMC samplers (SG-MCMC) [37].

A Variationally Inferred Sampling Framework
In standard VI, the variational approximation is analytically tractable and typically chosen as a factorized Gaussian, as mentioned above. However, it is important to note that other distributions can be adopted as long as they are easily sampled and their log-density and entropy values evaluated. However, in the rest of this paper, we focus on the Gaussian case, as the usual choice in the Bayesian deep learning community. Stemming from this variational approximation, we introduce several elements to construct the VIS.
Our first major modification of standard VI proposes the use of a more flexible distribution, approximating the posterior by embedding a sampler through where q 0,φ (z|x) is the initial and tractable density q φ (z|x) (i.e., the starting state for the sampler). We designate this as refined variational approximation. The conditional distribution Q η,T (z|z 0 ) refers to a stochastic process parameterized by η and used to evolve the original density q 0,φ (z|x) for T periods, so as to achieve greater flexibility. Specific forms for Q η,T (z|z 0 ) are described in Section 3.1. Observe that when T = 0, no refinement steps are performed and the refined variational approximation coincides with the original one; on the other hand, as T increases, the approximation will be closer to the exact posterior, assuming that Q η,T is a valid MCMC sampler in the sense of [37]. We next maximize a refined ELBO objective, replacing in Equation (1) the original q φ by q φ,η : This is done to optimize the divergence KL(q φ,η (z|x)||p(z|x)). The first term of Equation (4) requires only being able to sample from q φ,η (z|x); however, the second term, the entropy −E q φ,η (z|x) log q φ,η (z|x) , also requires the evaluation of the evolving, implicit density. Section 3.2 describes efficient methods to approximate this evaluation. As a consequence, performing variational inference with the refined variational approximation can be regarded as using the original variational guide while optimizing an alternative, tighter ELBO, as Section 4.2 shows. The above facilitates a framework for learning the sampler parameters φ, η using gradient-based optimization, with the help of automatic differentiation [38]. For this, the approach operates in two phases. First, in a refinement phase, the sampler parameters are learned in an optimization loop that maximizes the ELBO with the new posterior. After several iterations, the second phase, focused on inference, starts. We allow the tuned sampler to run for sufficient iterations, as in SG-MCMC samplers. This is expressed algorithmically as follows.
Refinement phase: Repeat the following until convergence: 1.

4.
Perform automatic differentiation on the objective wrt parameters φ, η to update them.

1.
Sample an initial set of particles, Since the sampler can be run for a different number of steps depending on the phase, we use the following notation when necessary: VIS-X-Y denotes T = X iterations during the refining phase and T = Y iterations during the inference phase.
Let us specify now the key elements.

The Sampler Q η,T (Z|Z 0 )
As the latent variables z are continuous, we evolve the original density q 0,φ (z|x) through a stochastic diffusion process [39]. To make it tractable, we discretize the Langevin dynamics using the Euler-Maruyama scheme, arriving at the stochastic gradient Langevin dynamics (SGLD) sampler (2). We then follow the process Q η,T (z|z 0 ), which represents T iterations of the MCMC sampler.
As an example, for the SGLD sampler z t = z t−1 + η∇ log p(x, z t−1 ) + ξ t , where t iterates from 1 to T. In this case, the only parameter is the learning rate η and the noise is ξ t ∼ N (0, 2η I). The initial variational distribution q 0,φ (z|x) is a Gaussian parameterized by a deep neural network (NN). Then, after T iterations of the sampler Q are parameterized by η, we arrive at q φ,η .
An alternative arises by ignoring the noise ξ [22], thus refining the initial variational approximation using only the stochastic gradient descent (SGD). Moreover, we can use Stein variational gradient descent (SVGD) [40] or a stochastic version [36] to apply repulsion between particles and promote more extensive explorations of the latent space.

Approximating the Entropy Term
We propose four approaches for the ELBO optimization which take structural advantage of the refined variational approximation.

Particle Approximation (VIS-P)
In this approach, we approximate the posterior q φ,η (z|x) by a mixture of Dirac deltas (i.e., we approximate it with a finite set of particles), by sampling z (1) , . . . , z (M) ∼ q φ,η (z|x) and setting In this approximation, the entropy term in (4) is set to zero. Consequently, the sample converges to the maximum posterior (MAP). This may be undesirable when training generative models, as the generated samples usually have little diversity. Thus, in subsequent computations, we add to the refined ELBO the entropy of the initial variational approximation, E q 0,φ (z|x) log q 0,φ (z|x) , which serves as a regularizer alleviating the previous problem. When using SGD as the sampler, the resulting ELBO is tighter than that without refinement, as shown in Section 4.2.

MC Approximation (VIS-MC)
Instead of performing the full marginalization in Equation (3), we approximate it with , we consider the joint distribution for the refinement. However, in inference we only keep the z T values. The entropy for each factor in this approximation is straightforward to compute. For example, for the SGLD case, we have This approximation tracks a better estimate of the entropy than VIS-P, as we are not completely discarding it; rather, for each t, we marginalize out the corresponding z t using one sample.

Gaussian Approximation (VIS-G)
This approach is targeted at settings in which it could be helpful to have a posterior approximation that places density over the whole z space. In the specific case of using SGD as the inner kernel, we have By treating the gradient terms as points, the refined variational approximation can be computed as q φ,η (z|x) = N (z|z T , σ φ (x)). Observe that there is an implicit dependence on η through z T .

Fokker-Planck Approximation (VIS-FP)
Using the Fokker-Planck equation, we derive a deterministic sampler via iterations of the form Then, we approximate the density q φ,η (z|x) using a mixture of Dirac deltas. A detailed derivation of this approximation is given in Appendix A.

Back-Propagating through the Sampler
In standard VI, the variational approximation q(z|x; φ) is parameterized by φ. The parameters are learned employing SGD, or variants such as Adam [41], using the gradient ∇ φ ELBO(q). We have shown how to embed a sampler inside the variational guide. It is therefore also possible to compute a gradient of the objective with respect to the sampler parameters η (see Section 3.1). For instance, we can compute a gradient ∇ η ELBO(q) with respect to the learning rate η from the SGLD or SGD processes to search for an optimal step size at every VI iteration. This is an additional step apart from using the gradient ∇ φ ELBO(q) which is used to learn a good initial sampling distribution.

Analysis of Vis
Below, we highlight key properties of the proposed framework.

Consistency
The VIS framework is geared towards SG-MCMC samplers, where we can compute the gradients of sampler hyperparameters to speed up mixing time (a common major drawback in MCMC [42]). After back-propagating for a few iterations through the SG-MCMC sampler and learning a good initial distribution, one can resort to the learned sampler in the second phase, so standard consistency results from SG-MCMC apply as T → ∞ [43].

Refinement of ELBO
Note that, for a refined guide using the VIS-P approximation and M = 1 samples, the refined objective function can be written as noting that z = z 0 + η∇ log p(x, z 0 ) when using SGD for T = 1 iterations. This is equivalent to the refined ELBO in (4). Since we are perturbing the latent variables in the steepest direction, we show easily that, for a moderate η, the previous bound is tighter , the one for the original variational guide q(z 0 |x). This reformulation of ELBO is also convenient since it provides a clear way of implementing our refined variational inference framework in any probabilistic programming language (PPL) supporting algorithmic differentiation.
Respectively, for the VIS-FP case, we find that its deterministic flow follows the same trajectories as SGLD: based on standard results of MCMC samplers [44], we have A similar reasoning applies to the VIS-MC approximation; however, it does not hold for VIS-G since it assumes that the posterior is Gaussian.

Taylor Expansion
This analysis applies only to VIS-P and VIS-FP. As stated in Section 4.2, within the VIS framework, optimizing the ELBO resorts to the performance of max z log p(x, z + ∆z), where ∆z is one iteration of the sampler; i.e., ∆z = η∇ log p(x, z) in the SGD case (VIS-P), or ∆z = η∇(log p(x, z) − log q(z)) in the VIS-FP case. For notational clarity, we consider the case T = 1, although a similar analysis follows in a straightforward manner if more refinement steps are performed.
Consider a first-order Taylor expansion of the refined objective log p(x, z + ∆z) ≈ log p(x, z) + (∆z) ∇ log p(x, z).
Taking gradients with respect to the latent variables z, we arrive at ∇ z log p(x, z + ∆z) ≈ ∇ z log p(x, z) + η∇ z log p(x, z) ∇ 2 z log p(x, z), where we have not computed the gradient through the ∆z term (i.e., we treated it as a constant for simplification). Then, the refined gradient can be deemed to be the original gradient plus a second order correction. Instead of being modulated by a constant learning rate, this correction is adapted by the chosen sampler. The experiments in Section 5.4 show that this is beneficial for the optimization as it typically takes fewer iterations than the original variant to achieve lower losses. By further taking gradients through the ∆z term, we may tune the sampler parameters such as the learning rate as presented in Section 3.3. Consequently, the next subsection describes two differentiation modes.

Two Automatic Differentiation Modes for Refined ELBO Optimization
For the first variant, remember that the original variant can be rewritten (which we term Full AD) as We now define a stop gradient operator ⊥ (which corresponds to detach in Pytorch or stop_gradient in tensorflow) that sets the gradient of its operand to zero-i.e., ∇ x ⊥(x) = 0-whereas in a forward pass, it acts as the identity function-that is, ⊥(x) = x. With this, a variant of the ELBO objective (which we term Fast AD) is Full AD ELBO enables a gradient to be computed with respect to the sampler parameters inside ∆z at the cost of a slight increase in computational burden. On the other hand, the Fast AD variant may be useful in numerous scenarios, as illustrated in the experiments.

Complexity
Since we need to back propagate through T iterations of an SG-MCMC scheme, using standard results of meta-learning and automatic differentiation [45], the time complexity of our more intensive approach (Full-AD) is O(mT), where m is the dimension of the hyperparameters (the learning rate of SG-MCMC and the latent dimension). Since for most use cases, the hyperparameters lie in a low-dimensional space, the approach is therefore scalable.

Experiments
The following experiments showcase the power of our approach as well as illustrating the the impact of various parameters on its performance, guiding their choice in practice. We also present a comparison with standard VIS and other recent variants, showing that the increased computational complexity of computing gradients through sampling steps is worth the gains in flexibility. Moreover, the proposed framework is compatible with other structured inference techniques, such as the sum-product algorithm, as well as serving to support other tasks such as classification.
Within the spirit of reproducible research, the code for VIS has been released at https://github.com/vicgalle/vis. The VIS framework is implemented with Pytorch [46], although we have also released a notebook for the first experiment using Jax to highlight the simple implementation of VIS. In any case, we emphasize that the approach facilitates rapid iterations over a large class of models.

Funnel Density
We first tested the framework on a synthetic yet complex target distribution. This experiment assessed whether VIS is suitable for modeling complex distributions. The target bi-dimensional density was defined through z 1 ∼ N (0, 1.35) z 2 ∼ N (0, exp(z 1 )). We adopted the usual diagonal Gaussian distribution as the variational approximation. For VIS, we used the VIS-P approximation and refined it for T = 1 steps using SGLD. Figure 1 top shows the trajectories of the lower bound for up to 50 iterations of variational optimization with Adam: our refined version achieved a tighter bound. The bottom figures present contour curves of the learned variational approximations. Observe that the VIS variant was placed closer to the mean of the true distribution and was more disperse than the original variational approximation, illustrating the fact that the refinement step helps in attaining more flexible posterior approximations.

State-Space Markov Models
We tested our variational approximation on two state-space models: one for discrete data and another for continuous observations. These experiments also demonstrated that the framework is compatible with standard inference techniques such as the sum-product scheme from the Baum-Welch algorithm or Kalman filter. In both models, we performed inference on their parameters θ. All the experiments in this subsection used the Fast AD version (Section 4.4) as it was not necessary to further tune the sampler parameters to obtain competitive results. p(x t |z t , θ em )p(z t |z t−1 , θ tr )p(θ), where each conditional is a categorical distribution taking five different classes. The prior is p(θ) = p(θ em )p(θ tr ) based on two Dirichlet distributions that sample the observation and state transition probabilities, respectively. Dynamic Linear Model (DLM): The model equations are as in (7), although the conditional distributions are now Gaussian and the parameters θ refer to the observation and transition variances.
For each model, we generated a synthetic dataset and used the refined variational approximation with T = 0, 1, 2. For the original variational approximation to the parameters θ, we used a Dirac delta. Performing VI with this approximation corresponded to MAP estimation using the Baum-Welch algorithm in the HMM case [47] and the Kalman filter in the DLM case [48], as we marginalized out the latent variables z 1:τ . We used the VIS-P variant since it was sufficient to show performance gains in this case. Figure 2 shows the results. The first row reports the experiments related to the HMM, the second row those for the DLM. We report the evolution of the log-likelihood during inference in all graphs; the first column reports the number of ELBO iterations, and the second column portrays clock times as the optimization takes place. They confirm that VIS (T > 0) achieved better results than standard VI (T = 0) for a comparable amount of time. Note also that there was not as much gain when changing from T = 1 to T = 2 as there is from T = 0 to T = 1, suggesting the need to carefully monitor this parameter. Finally, the top-right graph for the case T = 0 is shorter as it requires less clock time.

Prediction with an HMM
With the aim of assessing whether ELBO optimization helps in attaining better auxiliary scores, results in a prediction task are also reported. We generated a synthetic time series of alternating values of 0 and 1 for τ = 105 timesteps. We trained the previous HMM model on the first 100 points and report in Table 1 the accuracy of the predictive distribution p(y t ) averaged over the final five time-steps. We also report the predictive entropy as it helps in assessing the confidence of the model in its predictions, as a strictly proper scoring rule [49]. To guarantee the same computational budget time and a fair comparison, the model without refinement was run for 50 epochs (an epoch was a full iteration over the training dataset), whereas the model with refinement was run for 20 epochs. It can be observed that the refined model achieved higher accuracy than its counterpart. In addition, it was more correctly confident in its predictions.

Prediction with a DLM
We tested the VIS framework on Mauna Loa monthly CO 2 time-series data [50]. We used the first 10 years as a training set, and we tested over the next 2 years. We used a DLM composed of a local linear trend plus a seasonal block of periodicity 12. Data were standardized to a mean of zero and standard deviation of one. To guarantee the same computational budget time, the model without refining was run for 10 epochs, whereas the model with refinement was run for 4 epochs. Table 2 reports the mean absolute error (MAE) and predictive entropy. In addition, we computed the interval score [49], as a strictly proper scoring rule. As can be seen, for similar clock times, the refined model not only achieved a lower MAE, but also its predictive intervals were narrower than the non-refined counterpart.

Variational Autoencoder
The third batch of experiments showed that VIS was competitive with respect to other algorithms from the recent literature, including unbiased implicit variational inference (UIVI [24]), semi-implicit variational inference (SIVI [25]), variational contrastive divergence (VCD [29]), and the HMC variant from [26], showing that our framework can outperform those approaches in similar experimental settings.
To this end, we tested the approach with a variational autoencoder (VAE) model [51]. The VAE defines a conditional distribution p θ (x|z), generating an observation x from a latent variable z using parameters θ. For this task, our interest was in modeling the 28 × 28 image distributions underlying the MNIST [52] and the fashion-MNIST [53] datasets. To perform inference (i.e., to learn the parameters θ) the VAE introduces a variational approximation q φ (z|x). In the standard setting, this distribution is Gaussian; we instead used the refined variational approximation comparing various values of T. We used the VIS-MC approximation (although we achieved similar results with VIS-G) with the Full AD variant given in Section 4.4.
For the experimental setup, we reproduced the setting in [24]. For p θ (x|z), we used a factorized Bernoulli distribution parameterized by a two layer feed-forward network with 200 units in each layer and relu activations, except for a final sigmoid activation. As a variational approximation q φ (z|x), we used a Gaussian with mean and (diagonal) covariance matrix parameterized by two distinct neural networks with the same structure as previously used, except for sigmoid activation for the mean and a softplus activation for the covariance matrix.
Results are reported in Table 3. To guarantee fair comparison, we trained the VIS-5-10 variant for 10 epochs, whereas all the other variants were trained for 15 epochs (fMNIST) or 20 epochs (MNIST), so that the VAE's performance was comparable to that reported in [24]. Although VIS was trained for fewer epochs, by increasing the number T of MCMC iterations, we dramatically improved the test log-likelihood. In terms of computational complexity, the average time per epoch using T = 5 was 10.46 s, whereas with no refinement (T = 0), the time was 6.10 s (which was the reason behind our decision to train the refined variant for fewer epochs): a moderate increase in computing time may be worth the dramatic increase in log-likelihood while not introducing new parameters into the model, except for the learning rate η. Finally, as a visual inspection of the VAE reconstruction quality trained with VIS, Figures 3 and 4, respectively, display 10 random samples of each dataset.

Variational Autoencoder as a Deep Bayes Classifier
In the final experiments, we investigated whether VIS can deal with more general probabilistic graphical models and also perform well in other inference tasks such as classification. We explored the flexibility of the proposed scheme to solve inference problems in an experiment with a classification task in a high-dimensional setting with the MNIST dataset. More concretely, we extended the VAE model, conditioning it on a discrete variable y ∈ Y = {0, 1, . . . , 9}, leading to a conditional VAE (cVAE). The cVAE defined a decoder distribution p θ (x|z, y) on an input space x ∈ R D given a class label y ∈ Y, latent variables z ∈ R d and parameters θ. Figure 5 depicts the corresponding probabilistic graphic model. Additional details regarding the model architecture and hyperparameters are given in Appendix B. To perform inference, a variational posterior was learned as an encoder q φ (z|x, y) from a prior p(z) ∼ N (0, I). Leveraging the conditional structure on y, we used the generative model as a classifier using the Bayes rule, where we used M Monte Carlo samples z (m) ∼ q φ (z|x, y). In the experiments, we set M = 5. Given a test sample x, the labelŷ with the highest probability p(y|x) is predicted. For comparison, we performed several experiments changing T in the transition distribution Q η,T of the refined variational approximation. The results are given in Table 4, which reports the test accuracy at end of the refinement phase. Note that we are comparing different values of T depending on their use in refinement or inference phases (in the latter, the model and variational parameters were kept frozen). The model with T re f = 5 was trained for 10 epochs, whereas the other settings were for 15 epochs, to give all settings a similar training time. Results were averaged over three runs with different random seeds. In all settings, we used the VIS-MC approximation for the entropy term. From the results, it is clear that the effect of using the refined variational approximation (the cases when T > 0) is crucially beneficial to achieve higher accuracy. The effect of learning a good initial distribution and inner learning rate by using the gradients ∇ φ ELBO(q) and ∇ η ELBO(q) has a highly positive impact in the accuracy obtained.
On a final note, we have not included the case of only using an SGD or an SGLD sampler (i.e., without learning an initial distribution q 0,φ (z|x)) since the results were much worse than those in Table 4 for a comparable computational budget. This strongly suggests that, for inference in high-dimensional, continuous latent spaces, learning a good initial distribution through VIS may accelerate mixing time dramatically.

Conclusions
In this work, we have proposed a flexible and efficient framework to perform largescale Bayesian inference in probabilistic models. The scheme benefits from useful properties and can be employed to efficiently perform inference with a wide class of models such as state-space time series, variational autoencoders and variants such as the conditioned VAE for classification tasks, defined through continuous, high-dimensional distributions.
The framework can be seen as a general approach to tuning MCMC sampler parameters, adapting the initial distributions and learning rate. Key to the success and applicability of the VIS framework are the ELBO approximations based on the introduced refined variational approximation, which are computationally cheap but convenient.
Better estimates of the refined density and its gradient may be a fruitful line of research, such as the spectral estimator used in [54]. Another alternative is to use a deterministic flow (such as SGD or SVGD), keeping track of the change in entropy at each iteration using the change of the variable formula, as in [55]. However, this requires a costly Jacobian computation, making it unfeasible to combine with our approach of back-propagation through the sampler (Section 3.3) for moderately complex problems. We leave this for future exploration. Another interesting and useful line of further research would be to tackle the case in which the latent variables z are discrete. This would entail adapting the automatic differentiation techniques to be able to back-propagate the gradients through the sequences of acceptance steps necessary in Metropolis-Hastings samplers.
In order to deal with the implicit variational density, it may be worthwhile to consider optimizing the Fenchel dual of the KL divergence, as in [31]. However, this requires the use of an auxiliary neural network, which may entail a large computational price compared with our simpler particle approximation.
Lastly, probabilistic programming offers powerful tools for Bayesian modeling. A PPL can be viewed as a programming language extended with random sampling and Bayesian conditioning capabilities, complemented with an inference engine that produces answers to inference, prediction and decision-making queries. Examples include WinBUGS [56], Stan [57] or the recent Edward [58] and Pyro [59] languages. We plan to adapt VIS into several PPLs to facilitate the adoption of the framework.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Fokker-Planck Approximation (Vis-Fp)
The Fokker-Planck equation is a PDE that describes the temporal evolution of the density of a random variable under a (stochastic) gradient flow [39]. For a given SDE the corresponding Fokker-Planck equation is We are interested in converting the SGLD dynamics to a deterministic gradient flow.
Proposition A1. The SGLD dynamics, given by the SDE have an equivalent deterministic flow, written as the ODE dz = (∇ log p(z) − ∇ log q t (z))dt.
Proof. Let us write the Fokker-Planck equation for the respective flows. For the Langevin SDE, it is ∂ ∂t . On the other hand, the Fokker-Planck equation for the deterministic gradient flow is given by The result immediately follows since ∂ ∂z [∇ log q t (z)q t (z)] = ∂ 2 ∂z 2 [q t (z)]. Given that both flows are equivalent, we restrict our attention to the deterministic flow. Its discretization leads to iterations of the form z t = z t−1 + η(∇ log p(z t−1 ) − ∇ log q t−1 (z t−1 )).
In order to tackle the last term, we make the following particle approximation. Using a variational formulation, we have Then, we smooth the true density q convolving it with a kernel K, typically the rbf kernel, K(z, z ) = exp{−γ z − z 2 }, where γ is the bandwidth hyperparameter, leading to If we consider a mixture of Dirac deltas, q(z) = 1 M ∑ M m=1 δ(z − z m ), then the approximation is given by −∇ log q(z) ≈ − ∑ k ∇ z m K(z m , z n ) ∑ n K(z m , z n ) − ∑ l ∇ z m K(z m , z l ) ∑ n K(z n , z l ) , which can be inserted into Equation (A1). Finally, note that it is possible to back-propagate through this equation; i.e., the gradients of K can be explicitly computed.

. Prediction Task in a DLM
The DLM model comprises a linear trend component plus a seasonal block with a period of 12. The trend is specified as x t = z level,t + t t ∼ N (0, σ obs ) z level,t = z level,t−1 + z slope,t−1 + t t ∼ N (0, σ level ) z slope,t = z slope,t−1 + t t ∼ N (0, σ slope ).