Sampling the Variational Posterior with Local Refinement

Variational inference is an optimization-based method for approximating the posterior distribution of the parameters in Bayesian probabilistic models. A key challenge of variational inference is to approximate the posterior with a distribution that is computationally tractable yet sufficiently expressive. We propose a novel method for generating samples from a highly flexible variational approximation. The method starts with a coarse initial approximation and generates samples by refining it in selected, local regions. This allows the samples to capture dependencies and multi-modality in the posterior, even when these are absent from the initial approximation. We demonstrate theoretically that our method always improves the quality of the approximation (as measured by the evidence lower bound). In experiments, our method consistently outperforms recent variational inference methods in terms of log-likelihood and ELBO across three example tasks: the Eight-Schools example (an inference task in a hierarchical model), training a ResNet-20 (Bayesian inference in a large neural network), and the Mushroom task (posterior sampling in a contextual bandit problem).


Introduction
Uncertainty plays a crucial role in a multitude of machine learning applications, ranging from weather prediction to drug discovery. Poor predictive uncertainty risks potentially poor outcomes, especially in domains such as medical diagnosis or autonomous vehicles, where high confidence errors may be especially costly [1]. Thus, it is tremendously important that the underlying model provides high quality uncertainty estimates along with its predictions. By marginalizing over a posterior distribution over the parameters given the training data, Bayesian inference provides a principled approach to capturing uncertainty. Unfortunately, exact Bayesian inference is not generally tractable. Variational inference (VI) instead approximates the true posterior with a simpler distribution. VI is appealing since it reduces the problem of inference to an optimization problem, where the goal is to minimize the discrepancy between the true posterior and the variational posterior.
The key challenge, however, is training expressive posterior approximations that can capture the true posterior without significantly increasing computational and memory costs. The most widely used approximation is the mean-field approximation, where the posterior is represented by an independent Gaussian distribution over the model parameters. The mean-field approximation is easy to train, but it fails to capture dependencies and multi-modality in the true posterior.
This paper describes a novel method for generating samples from a highly flexible posterior approximation. The idea is to start with a coarse, mean-field approximation and make a series of inexpensive, local refinements to it. At the end, we draw a sample from the refined region. We show that through this process, we can generate samples that capture both dependencies and multi-modality in the true posterior.
The refinements take place at gradually decreasing scales, starting with large-scale changes and moving towards small-scale adjustments. The regions of these adjustments are determined by sampling the values of additive auxiliary variables. Formally, we express the model parameters w using a number of additive auxiliary variables w = a_1 + ... + a_K (Figure 1 left) that leave the marginal distribution unchanged. The refinement process takes place over K optimization steps. In each step, we sample the value of an auxiliary variable according to the current variational approximation, a_k ~ q(a_k), and optimize the approximation by conditioning on the newly sampled value, q(w) ≈ p(w|x, y, a_1:k) (k = 1 ... K). At the end, we obtain a sample w = a_1 + ... + a_K from the refined posterior q_ref(w). To obtain further samples, we must go back to our initial, coarse approximation and repeat the K-step process. We refer to the refinements as local because, after sampling each auxiliary variable, the process moves towards smaller-scale adjustments until it reaches w.
The refined posterior is a highly flexible approximation to the true posterior. It is able to capture dependencies and multi-modality even when these are absent from the initial variational approximation. We demonstrate the multi-modality of the refined posterior on a synthetic example, and we show how the refined posterior is able to capture dependencies in a hierarchical inference problem.
We theoretically show that the refined posterior improves the ELBO over the initial variational approximation. We also demonstrate this empirically by applying the method to Bayesian neural networks on common regression and image classification benchmarks.
Generating each sample requires a series of optimization steps that come with associated computational costs. We found that in a deep neural network, the computational overhead of generating a small set of samples for prediction amounts to ∼30% of the cost of training the initial variational approximation; thus, the refinement process is able to generate a set of high-quality posterior samples at the cost of a small computational overhead (compared to training a standard mean-field approximation).
An ideal application of our method is generating posterior samples for Thompson sampling, a popular approach to contextual bandit tasks. Thompson sampling works by sampling a random hypothesis from the posterior to decide on each action. In this scenario, computational cost is not a key consideration, so we can spend further computation on generating high quality posterior samples. We show that the high quality samples generated by refining the posterior improve the performance of Thompson sampling in a contextual bandit task, as measured by the cumulative regret. Figure 1 (right) illustrates the refinement process: in each iteration, the value of an auxiliary variable is fixed and the posterior is locally adjusted; in the final iteration, a sample is drawn from q(w). Through the iterations, the variational distribution is able to approximate the true posterior well in a local region.

Organization of the Paper
In Section 2, we start by introducing the notation and giving an overview of variational inference. Then, we present our proposed algorithm for generating samples from a refined variational distribution. Through two examples, we show that the refined posterior can capture both dependencies and multi-modality. In Section 3, we provide theoretical guarantees that the refinement step always improves the quality of the variational distribution (measured by the ELBO) under mild conditions. In Section 4, we evaluate the effectiveness of the method on Bayesian neural networks on a set of UCI regression and image classification benchmarks. We observe that our method consistently improves the quality of the approximation, as evidenced by a higher ELBO and likelihood of the samples. We also demonstrate that the high-quality posterior samples can be used in Thompson sampling to reduce the cumulative regret in a contextual bandit task. In Section 5, we discuss related work and place our method in context.

Materials and Methods
In this section, we first describe standard variational inference (VI), followed by a detailed description of our proposed sample generation method that refines the variational posterior. The inputs and labels are denoted by x ∈ X and y ∈ Y, respectively, and w denotes the model parameters.

Variational Inference
Exact Bayesian inference is often intractable and is NP-hard in the worst case. Variational inference attempts to approximate the true posterior p(w|x, y) with an approximate posterior q_φ(w), typically from a simple family of distributions, for example independent Gaussians over the weights, i.e., the mean-field approximation. To ensure that the approximate posterior is close to the true posterior, the parameters φ of q_φ(w) are optimized to minimize the Kullback-Leibler divergence KL(q_φ(w) || p(w|x, y)). In the limit KL(q_φ(w) || p(w|x, y)) = 0, the approximate posterior exactly captures the true posterior, although this might not be achievable if p(w|x, y) lies outside the distribution family of q_φ(w).
In order to minimize the KL divergence, variational inference maximizes the evidence lower bound (ELBO) w.r.t. φ (denoted L(φ)), which is a lower bound to the log marginal likelihood log p(y|x). The marginal log-likelihood decomposes as log p(y|x) = L(φ) + KL(q_φ(w) || p(w|x, y)), so maximizing the ELBO is equivalent to minimizing the KL divergence, and the non-negativity of the KL divergence ensures that L(φ) is indeed a lower bound.
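This decomposition can be verified numerically in a conjugate one-dimensional model (our illustrative setup, not from the paper): with prior w ~ N(0, 1) and likelihood y|w ~ N(w, 1), the posterior, the marginal likelihood, the ELBO, and the KL divergence all have closed forms.

```python
import math

# Conjugate 1-D check of log p(y) = ELBO(q) + KL(q || posterior).
# Prior w ~ N(0,1), likelihood y|w ~ N(w,1), one observation y; q = N(m, s2)
# is an arbitrary Gaussian variational approximation.
y = 1.3
m, s2 = 0.7, 0.4                                               # arbitrary q

# ELBO(q) = E_q[log p(y|w) + log p(w) - log q(w)]
elbo = (-0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s2)   # E_q[log p(y|w)]
        - 0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)        # E_q[log p(w)]
        + 0.5 * math.log(2 * math.pi * math.e * s2))               # entropy of q

# Exact posterior is N(y/2, 1/2); KL between two Gaussians in closed form
mu_p, s2_p = y / 2, 0.5
kl = 0.5 * math.log(s2_p / s2) + (s2 + (m - mu_p) ** 2) / (2 * s2_p) - 0.5

log_marginal = -0.5 * math.log(2 * math.pi * 2) - y ** 2 / 4       # y ~ N(0, 2)
print(abs(elbo + kl - log_marginal) < 1e-9)  # True: the decomposition holds
```

Because the decomposition is an identity, it holds for any choice of m and s2, not just the optimum.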
Following the optimization of φ, the model can be used to make predictions on unseen data. For an input x*, the predictive distribution p(y*|x*, x, y) can be approximated by drawing a small number of model parameter samples w_1:M ~ q_φ(w) and averaging their predictions as an ensemble: p(y*|x*, x, y) ≈ (1/M) ∑_{i=1}^M p(y*|x*, w_i).
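The averaging step can be sketched as follows; the `predict_proba` model below is a hypothetical stand-in (a softmax linear model), not the paper's network.

```python
import numpy as np

# Hedged sketch of ensemble prediction from M posterior samples. We assume a
# stand-in model: class probabilities from a softmax over a linear map.
def predict_proba(x, w):
    logits = x @ w
    e = np.exp(logits - logits.max())  # stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(0)
x_star = rng.normal(size=5)                              # a test input x*
samples = [rng.normal(size=(5, 3)) for _ in range(10)]   # w_1:M ~ q_phi(w)

# p(y*|x*) ~= (1/M) sum_i p(y*|x*, w_i)
p = np.mean([predict_proba(x_star, w) for w in samples], axis=0)
print(p.sum())  # ~= 1.0: averaging class probabilities keeps a valid distribution
```

Averaging probabilities (rather than logits) is what the Monte Carlo approximation of the predictive distribution prescribes.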

Refining the Variational Posterior
The main issue with variational inference is the inflexibility of the posterior approximation. The most widely used variant of variational inference, mean-field variational inference, approximates the posterior with independent Gaussians across all dimensions. This approximation is too simplistic to capture the complexities of the posterior for complicated models. With our proposed method, it is feasible to generate samples from a detailed posterior by starting with a mean-field approximation and refining it in selected, local regions. Note that the method does not yield an analytic form for the detailed posterior; it generates a set of samples w_1:M from it.
The graphical model is augmented with a finite number of auxiliary variables a_1:K as shown in Figure 1. The constraints are that (x, y) must be conditionally independent of the auxiliary variables given w and that they must not affect the prior distribution p(w). These constraints ensure that the marginal likelihood log p(y|x) is unchanged, enabling us to train the augmented model with the same ELBO as the unaugmented model; thus, the model is unaffected by the presence of the auxiliary variables. Their purpose is solely to aid the inference procedure. Given a Gaussian prior N(w|0, σ²_w I) over w, we express w as a sum of independent auxiliary variables w = ∑_{k=1}^K a_k with p(a_k) = N(a_k|0, σ²_k I) and ∑_{k=1}^K σ²_k = σ²_w. (Although we focus on one specific definition of the auxiliary variables, additive auxiliary variables, all of our results straightforwardly generalize to arbitrary joint distributions p(w, a_1:K) that meet the constraints.)
We refine the approximate posterior to generate each sample w_1:M. Specifically, this refers to iteratively sampling the values of the auxiliary variables a_1:K and then approximating the posterior of w conditional on the sampled values; that is, q_{φ_k}(w) approximates p(w|x, y, a_1:k) for iterations k = 1 ... K (φ_k depends on a_1:k), as shown in Algorithm 1.

1. Sample the value of a_k using the current variational approximation and fix its value. A sample can be obtained by first sampling w ~ q_{φ_{k−1}}(w) followed by a_k ~ p(a_k|a_1:k−1, w). This is straightforward for exponential families and factorized distributions. The closed form of q_{φ_{k−1}}(a_k) is provided in Appendix A.

2. Optimize the variational approximation conditional on the sampled a_k by minimizing KL(q_{φ_k}(w) || p(w|x, y, a_1:k)). This optimization is very fast in practice if φ_k is initialized using the solution from the previous iteration. Analogous to VI, the KL divergence in step 2 is minimized by maximizing the conditional ELBO L_{|a_1:k}(φ_k) = E_{q_{φ_k}(w)}[log p(y|x, w) + log p(w|a_1:k) − log q_{φ_k}(w)], where p(w|a_1:k) is the conditional prior over w. Note that, when k = K, the numerical minimization of KL(q_{φ_k}(w) || p(w|x, y, a_1:k)) is unnecessary, since in this case the optimal q_{φ_K}(w) is a delta function located at the sum of the sampled a_1:K.

After the final iteration, we obtain the sample w = ∑_{k=1}^K a_k. In order to generate M independent samples w_1:M from the refined posterior, this process has to be repeated M times, sampling new values for a_1:K each time.
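The two-step loop can be sketched in a toy setting of our own construction (not the paper's implementation): in one dimension with a Gaussian target posterior, both the sampling step and the conditional "optimize" step have closed forms, so we can run the full K-step refinement and check that the resulting samples follow the target.

```python
import numpy as np

# 1-D toy sketch of the refinement loop. Assumed setup (ours): prior w ~ N(0, 1)
# and an exact Gaussian posterior p(w|D) = N(1.0, 0.3), so every conditional is
# Gaussian and step 2 ("optimize") is exact rather than gradient-based.
rng = np.random.default_rng(0)
sigma2_w = 1.0
mu_star, var_star = 1.0, 0.3               # exact posterior p(w|D)
lam = 1.0 / var_star - 1.0 / sigma2_w      # effective likelihood precision
mu_lik = mu_star / (var_star * lam)        # effective likelihood mean

K = 4
var_a, rem = [], sigma2_w                  # geometric variance schedule
for _ in range(K - 1):
    var_a.append(0.5 * rem)
    rem -= var_a[-1]
var_a.append(rem)                          # [0.5, 0.25, 0.125, 0.125]

def refine_once():
    b = 0.0                                # running sum a_1 + ... + a_k
    nu, rho2 = mu_star, var_star           # q_{phi_0}(w): initial approximation
    for k in range(K):
        s2_tail = sum(var_a[k + 1:])       # Var[a_{k+1} + ... + a_K]
        # Step 1: sample w ~ q(w), then a_k ~ p(a_k | a_1:k-1, w)
        w = rng.normal(nu, np.sqrt(rho2))
        if s2_tail > 0:
            c = var_a[k] / (var_a[k] + s2_tail)
            a_k = rng.normal(c * (w - b), np.sqrt(c * s2_tail))
            b += a_k
            # Step 2: set q(w) to p(w | D, a_1:k) (closed form in this toy)
            prec = lam + 1.0 / s2_tail
            nu, rho2 = (lam * mu_lik + b / s2_tail) / prec, 1.0 / prec
        else:
            b = w                          # k = K: a_K is fully determined by w
    return b                               # the refined sample w = a_1 + ... + a_K

samples = np.array([refine_once() for _ in range(20000)])
print(samples.mean(), samples.var())       # both close to (1.0, 0.3)
```

Because the conditionals are exact here, the loop reduces to ancestral sampling from p(a_1:K|D), so the samples match the true posterior; with a neural network, step 2 is instead a short gradient-based optimization.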

Multi-Modal Toy Example
We use a synthetic toy example to demonstrate the procedure and to show that, through the refinement steps, the approach is able to capture multiple posterior modes. In this example, we have a single weight w with prior p(w) = N(w|0, 1) and a complex posterior with four modes. Figure 2b shows that a Gaussian approximation fails to capture the multi-modal nature of the true posterior. We express w as the sum of K = 2 auxiliary variables: w = a_1 + a_2 with p(a_1) = N(a_1|0, 0.8) and p(a_2) = N(a_2|0, 0.2), which recovers p(w) = N(w|0, 1) as per the constraint. The first step of the refinement process is sampling a_1 ~ q_{φ_0}(a_1) = ∫ p(a_1|w) q_{φ_0}(w) dw, where q_{φ_0}(w) is an initial mean-field approximation to the posterior. Then, the variational posterior is optimized conditional on the sampled a_1, i.e., q_{φ_1}(w) ≈ p(w|x, y, a_1). Figure 2c shows that the conditional variational posterior is able to fit one of the posterior modes. Over many runs, the different values of a_1 force the conditional posterior to fit different posterior modes, thus allowing the refined posterior to capture the multi-modal nature of the true posterior, as shown in Figure 2d. Clearly, the refined posterior is a much better approximation to the true posterior than the Gaussian approximation, though we note that the true posterior is not recovered exactly.
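The variance constraint of this decomposition is easy to check numerically; a quick Monte Carlo sketch (ours):

```python
import numpy as np

# Check of the toy decomposition: with p(a1) = N(0, 0.8) and p(a2) = N(0, 0.2),
# the sum a1 + a2 recovers the prior p(w) = N(0, 1), and conditioning on a
# sampled a1 restricts w to a local region of variance 0.2 around a1.
rng = np.random.default_rng(0)
a1 = rng.normal(0, np.sqrt(0.8), size=200_000)
a2 = rng.normal(0, np.sqrt(0.2), size=200_000)
w = a1 + a2
print(w.var())             # close to 1.0, matching p(w) = N(0, 1)
print((w - a1).var())      # close to 0.2: the scale of the local refinement
```

The second variance is the width of the region in which the conditional posterior fits a single mode.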

Capturing Dependencies: A Hierarchical Example
In this section, we use the eight-schools example from STAN [2,3] to show how the refined posterior can capture dependencies among the hidden variables and to discuss the effect of the number of auxiliary variables on the quality of the posterior approximation.
The eight-schools example studies the coaching effects of 8 schools. Each school reports the mean y_i and standard error σ_i of its coaching effect, where i = 1, ..., 8. There is no prior reason to believe that any school was more effective than another, so the model is stated in a hierarchical manner:

µ ~ N(0, 5), τ ~ HalfCauchy(5), θ_i ~ N(µ, τ), y_i ~ N(θ_i, σ_i) for i = 1, ..., 8,

where the HalfCauchy distribution refers to a Cauchy distribution supported only on positive values (i.e., a symmetric half of the Cauchy distribution). Factorized approximations perform poorly on this problem due to the dependency of θ on τ (for an excellent analysis of this problem, see [4]). In fact, the MAP solution is at τ = 0, which is distant from the mean-field approximation that STAN uses for variational inference (ADVI, [5]). We show that our method can capture the dependencies between θ and τ by introducing additive auxiliary variables for the latent variables, analogously to the weight decomposition described above; as required by the constraints, the auxiliary variables leave the model unchanged. Figure 3 (left) shows the approximate posterior for various values of K. At K = 1, the model is equivalent to ADVI, and as K increases, we can see that the refined posterior is able to capture the dependencies between τ and θ_1, resulting in a non-Gaussian form. The ground truth samples were obtained using the NUTS sampler in PyMC3 [6,7]. The density plots were generated using kernel density estimation.

Limit as K → ∞
A natural question to ask is what happens as the number of auxiliary variables grows to infinity. We can estimate the KL divergence between the refined posterior and the true posterior in the eight-schools example using kernel density estimation based on the samples generated from the refined posterior. We see that it monotonically decreases (Figure 3 middle). Indeed, we show theoretically that each auxiliary variable increases the ELBO and hence decreases the KL divergence to the true posterior. However, the precise condition for convergence to the true posterior remains an open question.
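A KDE-based KL estimate of this kind can be sketched as follows; the one-dimensional Gaussians here are our illustration (the paper's estimate is over the eight-schools posterior), chosen so that the exact KL is known.

```python
import numpy as np

# Sketch of a sample-based KL estimate: given samples x_i from q and a tractable
# density p, KL(q || p) ~= (1/N) sum_i [log q_hat(x_i) - log p(x_i)], where
# q_hat is a Gaussian kernel density estimate. Here q = N(0,1), p = N(0.5, 1),
# so the exact KL is 0.5^2 / 2 = 0.125.
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=2000)            # samples from q

def log_kde(x, samples, bw):
    # log of a Gaussian kernel density estimate evaluated at points x
    d = (x[:, None] - samples[None, :]) / bw
    log_k = -0.5 * d**2 - 0.5 * np.log(2 * np.pi) - np.log(bw)
    return np.logaddexp.reduce(log_k, axis=1) - np.log(len(samples))

bw = 1.06 * xs.std() * len(xs) ** (-1 / 5)      # Silverman's rule of thumb
log_p = -0.5 * (xs - 0.5) ** 2 - 0.5 * np.log(2 * np.pi)   # density of N(0.5, 1)
kl_est = np.mean(log_kde(xs, xs, bw) - log_p)
print(kl_est)  # close to the exact value 0.125
```

The KDE bandwidth introduces a small bias, which is why the estimate is treated as a diagnostic trend in Figure 3 (middle) rather than an exact quantity.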

Theoretical Results
We claim that the refinement process must improve the variational approximation over the initial mean-field approximation as measured by the ELBO.
This claim is formalized in the following proposition.

Proposition 1.
Let ELBO_ref denote the ELBO of the refined posterior q_ref (the distribution that our process generates samples from), let ELBO_aux denote the ELBO of the variational approximation over the model augmented with the auxiliary variables, and let ELBO_init = L(φ_0) denote the ELBO of the initial variational approximation. Then, the following inequalities hold: ELBO_ref ≥ ELBO_aux ≥ ELBO_init. Thus, ELBO_ref, the ELBO of the distribution that we are generating samples from, is greater than or equal to ELBO_init, the ELBO of the initial mean-field approximation.

Proof of ELBO ref ≥ ELBO aux
This is a consequence of the fact that a 1:K fully determines w.
Proof. Note that the joint refined approximation factorizes as q_ref(w, a_1:K) = q_ref(a_1:K) q_ref(w|a_1:K) and that q_ref(w|a_1:K) = p(w|a_1:K), since a_1:K fully determines w. The proof is concluded using the non-negativity of the KL divergence. Note that ELBO_ref is a valid ELBO: it is a lower bound to the marginal likelihood, log p(y|x) ≥ ELBO_ref. Therefore, optimizing ELBO_ref through our sampling procedure decreases the KL divergence between q_ref and the true posterior.

Proof of ELBO aux ≥ ELBO init
We prove this by showing that, under the assumption that the conditional variational posterior q_{φ_{k−1}}(w|a_k) lies within the variational family of q_{φ_k}(w), each refinement step cannot decrease the ELBO. The central idea is that by initializing φ_k at φ*_k (the parameters for which q_{φ*_k}(w) = q_{φ_{k−1}}(w|a_k)), the variational distribution remains unchanged, and therefore ELBO_aux = ELBO_init. Then, as we optimize φ_k, we are optimizing the corresponding terms in ELBO_aux through L_{|a_1:k}(φ_k). Therefore, ELBO_aux ≥ ELBO_init.
Proof. We prove ELBO_aux ≥ ELBO_init by demonstrating that improvement in the ELBO can be guaranteed in our method under the assumption that the conditional variational posterior q_{φ_{k−1}}(w|a_k) is within the variational family of q_{φ_k}(w), i.e., that for every a_1:k there exists a φ*_k such that

q_{φ*_k}(w) = q_{φ_{k−1}}(w|a_k). (5)

This assumption holds for all exponential families of distributions. The objective being optimized in each refinement step is the conditional ELBO

L_{|a_1:k}(φ_k) = E_{q_{φ_k}(w)}[log p(y|x, w) + log p(w|a_1:k) − log q_{φ_k}(w)]. (6)

From our assumption in Equation (5), it follows that

L_{|a_1:k}(φ_k) ≥ L_{|a_1:k}(φ*_k) (7)

when we reach the global optimum φ_k ← arg max_{φ_k} L_{|a_1:k}(φ_k). Even in the case when the optimizer is unable to find the global maximum, it is reasonable to assume that Equation (7) holds, given that we initialize φ_k at φ*_k.

The proof is based on mathematical induction on l. We show that for l = 0 ... K,

ELBO_init ≤ E_{q(a_1:l)}[L_{|a_1:l}(φ_l)] + ∑_{k=1}^{l} E_{q(a_1:k)}[log p(a_k|a_1:k−1) − log q_{φ_{k−1}}(a_k)], (8)

which holds at l = 0, since L_|(φ_0) = ELBO_init. Notice that for k = 0 ... K − 1,

E_{q(a_1:k+1)}[L_{|a_1:k+1}(φ_{k+1})]
≥ E_{q(a_1:k+1)}[L_{|a_1:k+1}(φ*_{k+1})]
= E_{q(a_1:k+1)} E_{q_{φ_k}(w|a_{k+1})}[log p(y|x, w) + log p(w|a_1:k+1) − log q_{φ_k}(w|a_{k+1})]
= E_{q(a_1:k)}[L_{|a_1:k}(φ_k)] − E_{q(a_1:k+1)}[log p(a_{k+1}|a_1:k) − log q_{φ_k}(a_{k+1})], (9)

where line 1 follows using Equation (7) and line 3 follows using Bayes' theorem:

q_{φ_k}(w|a_{k+1}) = p(a_{k+1}|w, a_1:k) q_{φ_k}(w) / q_{φ_k}(a_{k+1}) (10)

and

p(w|a_1:k+1) = p(a_{k+1}|w, a_1:k) p(w|a_1:k) / p(a_{k+1}|a_1:k). (11)

After rearranging,

E_{q(a_1:k)}[L_{|a_1:k}(φ_k)] ≤ E_{q(a_1:k+1)}[L_{|a_1:k+1}(φ_{k+1})] + E_{q(a_1:k+1)}[log p(a_{k+1}|a_1:k) − log q_{φ_k}(a_{k+1})]. (12)

Substituting this into the inductive hypothesis at k = l proves the inductive step. To finish the proof, examine the case l = K. Notice that since a_1:K fully determines w, i.e., q_{φ_K}(w) = p(w|a_1:K) = δ_Dirac(w − ∑_{k=1}^K a_k), the terms log p(w|a_1:K) and log q_{φ_K}(w) cancel, and L_{|a_1:K}(φ_K) = log p(y|x, w = ∑_{k=1}^K a_k). Substituting this in at l = K yields the desired result:

ELBO_init ≤ E_{q(a_1:K)}[log p(y|x, ∑_{k=1}^K a_k)] + ∑_{k=1}^{K} E_{q(a_1:k)}[log p(a_k|a_1:k−1) − log q_{φ_{k−1}}(a_k)] = ELBO_aux, (13)

concluding the proof.
Note that this result implies that ELBO_aux must grow with each auxiliary variable. We demonstrate this empirically by estimating ELBO_aux as we sample the auxiliary variables in a neural network. The result is shown in Figure 4. We see that ELBO_aux grows after each iteration, exhibiting a stair pattern.

Experimental Results
We showcase our method on two example tasks: inference in a Bayesian neural network and posterior sampling in a contextual bandit task.

Inference in Deep Neural Networks
The goal of this experiment is twofold. First, we empirically confirm the improvement in the ELBO, and second, we quantify the improvement in the uncertainty estimates due to the refinement. We conduct experiments on regression and classification benchmarks using Bayesian neural networks as the underlying model. We look at the marginal log-likelihood (MLL) of the predictions, as well as accuracy in classification tasks.
We used three baseline models for comparison: mean-field variational inference, multiplicative normalizing flows (MNF), and deep ensemble models. For all methods, we used a batch size of 256 and the Adam optimizer with the default learning rate of 0.001. The hyperparameters of each baseline were tuned using a Bayesian optimization package. We found batch size and learning rate to be consistent across methods.
First, variational inference (VI, [8,9]). Naturally, we investigate the improvement of our method over variational inference with a mean-field Gaussian posterior approximation. We perform inference over all weights and biases with a Gaussian prior centered at 0, whose variance is tuned through empirical Bayes; the model is trained for 30,000 iterations.
Second, Multiplicative normalizing flows (MNF, [10]). In this work, the posterior means are augmented with a multiplier from a flexible distribution parameterized by the masked RealNVP. This model is trained with the default flow parameters for 30,000 iterations.
Third, deep ensemble models [11]. Deep ensemble models are shown to be surprisingly effective at quantifying uncertainty. For the regression datasets, we used adversarial training (ε = 0.01), whereas in classification we did not (since adversarial training did not give an improvement on the classification benchmarks). For each dataset, we trained 10 ensemble members for 5000 iterations each.
Finally, our method, refined VI. After training the initial mean-field approximation, we generate M = 10 refined samples w_1:M, each with K = 5 auxiliary variables. The means of the prior distributions of the auxiliary variables are fixed at 0, and their prior variances form a geometric series (the intuition is that the auxiliary variables carry roughly equal information this way): σ²_{a_k} = 0.7 (σ²_w − ∑_{l=1}^{k−1} σ²_{a_l}) for k = 1 ... K. We experimented with different ratios between 0 and 1 for the geometric series, and we found that 0.7 worked well. In each refinement iteration, we optimized the posterior with Adam [12] for 200 iterations. To keep the training stable, we kept the learning rate proportional to the standard deviation of the conditional posterior: in iteration k, lr = 0.001 × 0.3^{k/2}. Our code is available at https://github.com/google/edward2/experimental/auxiliary_sampling.
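One reading of this schedule, sketched in code: each auxiliary variable takes a 0.7 fraction of the still-unexplained prior variance, and (our assumption, to satisfy the constraint that the variances sum to σ²_w exactly) the last variable absorbs the remainder.

```python
# Geometric prior-variance schedule for the auxiliary variables (sketch).
# Each variable takes `ratio` of the variance not yet explained; the final
# variable takes the remainder so the variances sum to sigma2_w exactly
# (the remainder convention is our reading, matching the K = 2 examples).
def geometric_schedule(sigma2_w, K, ratio=0.7):
    variances, remainder = [], sigma2_w
    for _ in range(K - 1):
        variances.append(ratio * remainder)
        remainder -= variances[-1]
    variances.append(remainder)          # final variable takes what is left
    return variances

v = geometric_schedule(1.0, 5)
print([round(x, 4) for x in v])  # [0.7, 0.21, 0.063, 0.0189, 0.0081]
print(round(sum(v), 10))         # 1.0
```

With this convention, the K = 2 cases used elsewhere in the paper (0.8/0.2 and 0.7/0.3) are recovered directly.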
Following [13], we evaluate the methods on a set of UCI regression benchmarks using a feed-forward neural network with a single hidden layer containing 50 units with a ReLU activation function (Table 1). The datasets use a random 80-20 train-test split, and we utilize the local reparametrization trick [14]. On these benchmarks, refined VI consistently improves both the ELBO and the MLL estimates over VI. For refined VI, ELBO_ref cannot be calculated exactly, but ELBO_aux provides a lower bound to it, which we can estimate using Equation (13). Note that the gains in MLL are small in this case. Nevertheless, refined VI is one of the best performing approaches on 7 out of the 9 datasets.
We examine the performance on commonly used image classification benchmarks (Table 2) using the LeNet-5 architecture [15]. We use the local reparametrization trick [14] for the dense layers and Flipout [16] for the convolutional layers to reduce the gradient noise. We do not use data augmentation in order to stay consistent with the Bayesian framework. Table 2. Refining improves the ELBO across all image classification benchmarks. Results on image classification benchmarks with the LeNet-5 architecture, without data augmentation. Metrics: marginal log-likelihood (MLL, higher is better), accuracy (Acc, higher is better), and the evidence lower bound (ELBO, higher is better). Means and standard deviations are shown. Bolded numbers indicate the highest ELBO (ELBO_aux is a lower bound to ELBO_ref, which is the true ELBO), and underlined numbers indicate the highest MLL. On the classification benchmarks, we again confirm that the refinement step consistently improves both the ELBO and the MLL over VI, with the MLL differences being more significant here than in the previous experiments. Refined VI is unable to outperform deep ensembles in classification accuracy, but it does outperform them in MLL on the largest dataset, CIFAR10.

Deep Residual Networks
To demonstrate the performance on larger scale models, we apply the refining algorithm to residual networks [17] with 20 layers (based on Keras's ResNet implementation).
We look at two models: a standard ResNet, where inference is done over every residual block, and a hybrid model (ResNet Hybrid [18]), where inference is only done over the final layer of each residual block and every other layer is treated as a regular layer. For this model, we used a batch size of 256, and we decayed the learning rate starting from 0.001 over 200 epochs. We used 10 auxiliary variables, each reducing the prior variance by a factor of 0.5. Results are shown in Table 3. Table 3. Results on CIFAR10 with the ResNet architecture, without data augmentation. We observe that our method not only improves significantly in MLL over the VI baseline, but also significantly improves in accuracy over the strong ensemble baseline. Metrics: marginal log-likelihood (MLL, higher is better), accuracy (Acc, higher is better), and the evidence lower bound (ELBO, higher is better). Note that the non-hybrid and hybrid models are equivalent when trained deterministically. The best MLL result is highlighted in bold. Batch normalization [19] provides a substantial improvement for VI, though this improvement interestingly disappears for the hybrid model. The refined hybrid model outperforms the recently proposed natural gradient VI method of [20] in both MLL and accuracy, but it is still behind some non-Bayesian uncertainty estimation methods [21].

Computational Costs
When introducing a novel algorithm for variational inference, we must discuss the computational costs. The computational complexity grows linearly with both K and M, resulting in an overall O(KM) runtime. The memory requirement is O(M) as it grows linearly with M. For the neural network models, the computational cost of generating the posterior samples is ∼30% of the cost of training the initial mean-field approximation (LeNet-5/CIFAR10 on an NVIDIA P100 GPU using TensorFlow). In practice, we recommend tuning the number of auxiliary variables for the given application; using more auxiliary variables always improves the posterior approximation, but they come with additional computational overhead.

Thompson Sampling
Generating posterior samples for Thompson sampling [22,23] in a contextual bandit problem is an ideal use case for the refinement algorithm. Refinement allows one to trade off computational cost for a higher quality approximation to the posterior. This is ideal for Thompson sampling, where more expensive objectives often warrant spending time computing better approximations.
Thompson sampling works by sampling a hypothesis from the approximate posterior to decide on each action. This balances exploration and exploitation, since probable hypotheses are tested more frequently than improbable ones. In each step:
1. Sample model parameters from the approximate posterior, w ~ q_φ(w);
2. Take action arg max_a E_{p(r|c,a,w)}[r], where r is the reward, which is determined by the context c, the action a taken, and the unobserved model parameters w;
3. Observe reward r and update the approximate posterior q_φ(w).

We look at the mushroom task [9,24], where the agent is presented with a sequence of mushrooms that it can choose to eat or pass. The mushrooms are either edible or poisonous. Eating an edible mushroom always yields a reward of 5, while eating a poisonous mushroom yields a reward of 5 with probability 50% and −35 with probability 50% (an expected reward of −15). Passing a mushroom gives no reward.
To predict the distribution of the rewards, the agent uses a neural network with 23 inputs and two outputs. The inputs are the 22 observed attributes of the mushrooms and the proposed action (1 for eating and 0 for passing). The output is the mean expected reward. The network has a standard feed-forward architecture with two hidden layers containing 100 hidden units each, with ReLU activations throughout. For the prior, we used a standard Gaussian distribution over the weights.
For the variational posterior, we use a mean-field Gaussian approximation that we update for 500 iterations after observing each new reward. For the updates, we use batches of 64 randomly sampled rewards with an Adam optimizer with learning rate 10 −3 . In refined sampling, we used two auxiliary variables: w = a 1 + a 2 with p(a 1 ) = N (0, 0.7) and p(a 2 ) = N (0, 0.3). To obtain a high quality sample for prediction, we first draw a 1 using the main variational approximation and then refine the posterior over a 2 for 500 iterations. After using the refined sample for prediction, we discard it and update the main variational approximation using the newly observed reward (for 500 iterations). In our experiments, we used three posterior samples to calculate the expected reward, which helps to emphasize exploitation compared to using a single sample.
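The sample-act-update cycle can be illustrated with a minimal Thompson sampling loop; the Bernoulli bandit with Beta posteriors below is our simplified stand-in for the mushroom task, where the paper instead samples (and refines) neural network weights in step 1.

```python
import numpy as np

# Minimal Thompson sampling loop on a two-armed Bernoulli bandit with exact
# Beta posteriors. This is our illustration of the three-step cycle above, not
# the paper's agent; in the paper, step 1 draws a refined sample of network
# weights instead of a conjugate posterior sample.
rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.6])            # unknown reward probabilities per arm
alpha = np.ones(2)                       # Beta(1, 1) posterior per arm
beta = np.ones(2)

rewards = 0
for t in range(2000):
    theta = rng.beta(alpha, beta)        # step 1: sample a hypothesis
    a = int(np.argmax(theta))            # step 2: act greedily under the sample
    r = rng.random() < true_p[a]         # step 3: observe the reward...
    alpha[a] += r                        # ...and update the posterior
    beta[a] += 1 - r
    rewards += r

print(rewards / 2000)  # average reward; should approach 0.6, the best arm
```

As the posteriors tighten, the sampled hypotheses concentrate on the better arm, which is exactly the exploration-to-exploitation shift observed in Figure 5.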
As baselines, we include the commonly used ε-greedy algorithm, where the agent takes the action with the highest expected reward according to the maximum-likelihood solution with probability 1 − ε, and takes a random action with probability ε.
We measure the performance using the cumulative regret. The cumulative regret measures the difference between our agent and an omniscient agent that makes the optimal choice each time. Lower regret indicates better performance. Figure 5 depicts the results. We see that the refined agent has lower regret throughout, which shows that the higher quality posterior samples translate to improved performance. Until about 3000 iterations, the ε-greedy algorithms perform well, but they are overtaken by Thompson sampling as the posterior tightens and the agent shifts focus to exploitation.

Related Works
Although, in theory, the Bayesian approach can accurately capture uncertainty, in practice, we find that exact inference is computationally infeasible in most scenarios, and thus, we have to resort to approximate inference methods. There is a wealth of research on approximate inference methods; here, we focus on works closely related to this paper.
Variational inference [25] tries to approximate the true posterior distribution over parameters with a variational posterior from a simple family of distributions. Mean-field VI, which for neural networks traces back to [26], uses independent Gaussian distributions over the parameters to try to capture the posterior. The advantage of the mean-field approximation is that the network can be efficiently trained using the reparameterization trick [27], and the variational posterior has a proper density over the parameter space, which then can be used across tasks, such as continual learning [20,28] and contextual bandits [29]. Recently, [10] showed that normalizing flows can be used to further increase the flexibility of the variational posterior. [30] provide a detailed survey of recent advances in VI.
Our method is a novel variant of the auxiliary variable approaches to VI [31,32] that increase the flexibility of the variational posterior through the use of auxiliary variables.
The key distinction, however, is that instead of trying to train a complex variational approximation over the joint distribution, we iteratively train simple mean-field approximations at the sampled values of the auxiliary variables. Although this poses an O(MK) overhead (K is the number of auxiliary variables and M is the number of posterior samples) over mean-field VI, the training itself remains straightforward and efficient. The introduction of every new auxiliary variable increases the flexibility of the posterior approximation. Unlike for MCMC methods, however, it is unclear whether the algorithm approaches the true posterior in the limit of infinitely many auxiliary variables.
There are also numerous methods that start with an initial variational approximation and refine it through a few MCMC steps [33-35]. The distinction from our algorithm is that we refine the posterior starting at a large scale and iteratively move towards smaller scale refinements, whereas these methods only refine the posterior at the scale of the MCMC steps. The authors of [36-38] used boosting to refine the variational posterior, iteratively adding parameters, such as mixture components, to minimize the residual of the ELBO. Our method does not add parameters at training time but instead iteratively refines the samples through the introduction of auxiliary variables. We do not include these methods in our baselines since they have yet to be applied to Bayesian multi-layer neural networks.
Further related works include methods that iteratively refine the posterior in latent variable models [39][40][41][42]. These methods, however, focus on reducing the amortization gap and do not increase the flexibility of the variational approximation.
Lastly, there are non-Bayesian strategies for estimating epistemic uncertainty in deep learning. Bootstrapping [43] and deep ensembles [11] may be the most promising. Deep ensembles, in particular, have been demonstrated to achieve strong performance on benchmark regression and classification problems and on uncertainty benchmarks including out-of-distribution detection [11] and prediction under distribution shift [18]. Both methods rely on constructing a set of independently trained models to estimate the uncertainty. Intuitively, the amount of disagreement across models reflects the uncertainty in the ensemble prediction. In order to induce diversity among the ensemble members, bootstrapping subsamples the training set for each member, while deep ensembles use the randomness in weight initialization and mini-batch sampling.

Conclusions
In this work, we investigated a novel method for generating samples from a highly flexible posterior approximation, which works by starting with a mean-field approximation and locally refining it in selected regions. We demonstrated that the samples are able to capture dependencies and multi-modality. Furthermore, we showed both theoretically and empirically that the method always improves the ELBO of the initial mean-field approximation, and demonstrated its improvement on a hierarchical inference problem, a deep learning benchmark, and a contextual bandit task.

Funding: This research was funded by EPSRC.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A. Analytical Forms of q_{φ_{k−1}}(a_k) and q_{φ_{k−1}}(w|a_k)
For a diagonal Gaussian prior distribution p(w) = N(w|0, σ² I) (0 denotes the d_w-dimensional zero vector and I denotes the d_w × d_w identity matrix, where d_w is the dimensionality of w), we have w = ∑_{k=1}^K a_k with p(a_k) = N(a_k|0, σ²_k I) for k ∈ {1, ..., K} such that ∑_{k=1}^K σ²_k = σ². The approximate posterior over the auxiliary variables q_{φ_{k−1}}(a_k) and the conditionals q_{φ_{k−1}}(w|a_k) can be computed in closed form. We only derive the result in the univariate case; extending to the diagonal covariance case is straightforward.
First, let p(a_k) = N(a_k | µ_k, σ²_k). Now, define b_k = ∑_{i=1}^k a_i, m_k = ∑_{i=k+1}^K µ_i, and s²_k = ∑_{i=k+1}^K σ²_i. For Gaussian random variables X, Y with means µ_x, µ_y and variances σ²_x, σ²_y and Z = X + Y, the conditional p(x|z) is normally distributed with mean µ_x + (σ²_x/(σ²_x + σ²_y))(z − µ_x − µ_y) and variance σ²_x σ²_y/(σ²_x + σ²_y). Applying this with X = a_k and Y = ∑_{i=k+1}^K a_i (so that Z = w − b_{k−1}) gives

p(a_k|a_1:k−1, w) = N(a_k | µ_k + c_k (w − b_{k−1} − µ_k − m_k), σ²_k s²_k/(σ²_k + s²_k)), where c_k = σ²_k/(σ²_k + s²_k).

Recall that q_{φ_{k−1}}(a_k) = ∫ p(a_k|a_1:k−1, w) q_{φ_{k−1}}(w) dw, and assume that we have already calculated q_{φ_{k−1}}(w) = N(w | ν_{k−1}, ρ²_{k−1}). The quantity of interest is an integral of Gaussian densities, and hence, after some algebraic manipulation, we obtain

q_{φ_{k−1}}(a_k) = N(a_k | µ_k + c_k (ν_{k−1} − b_{k−1} − m_{k−1}), c²_k ρ²_{k−1} + σ²_k s²_k/(σ²_k + s²_k)),

using µ_k + m_k = m_{k−1}. Regarding q_{φ_{k−1}}(w|a_k), we have q_{φ_{k−1}}(w|a_k) ∝ p(a_k|a_1:k−1, w) q_{φ_{k−1}}(w) using Bayes' rule. Again, the desired quantity is a product of Gaussians, which we can derive to arrive at q_{φ_{k−1}}(w|a_k) = N(w | ν', ρ'²) with

ρ'² = (1/ρ²_{k−1} + c²_k/v_k)^{−1}, ν' = ρ'² (ν_{k−1}/ρ²_{k−1} + c_k (a_k − µ_k + c_k (b_{k−1} + m_{k−1}))/v_k),

where v_k = σ²_k s²_k/(σ²_k + s²_k).
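The closed form for q_{φ_{k−1}}(a_k) can be sanity-checked by Monte Carlo in the zero-mean case; the values below (K = 2, σ²_1 = 0.7, σ²_2 = 0.3) are chosen for illustration.

```python
import numpy as np

# Monte Carlo check of the closed form for q_{phi_{k-1}}(a_k), zero-mean case,
# K = 2, sigma_1^2 = 0.7, sigma_2^2 = 0.3 (illustrative values, ours). With
# q(w) = N(nu, rho^2), the formula above gives
#   q(a_1) = N(c * nu, c^2 * rho^2 + 0.7 * 0.3 / (0.7 + 0.3)),  c = 0.7.
rng = np.random.default_rng(0)
nu, rho2 = 0.5, 0.4
c = 0.7 / (0.7 + 0.3)                     # sigma_1^2 / (sigma_1^2 + s_1^2)

w = rng.normal(nu, np.sqrt(rho2), size=200_000)                  # w ~ q(w)
a1 = rng.normal(c * w, np.sqrt(0.7 * 0.3 / (0.7 + 0.3)))         # a_1 ~ p(a_1|w)

print(a1.mean())  # close to c * nu = 0.35
print(a1.var())   # close to c^2 * rho2 + 0.21 = 0.406
```

This is exactly the two-stage sampling used in step 1 of the algorithm (sample w, then a_k given w), so agreement with the closed form confirms that either route can be used.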