We showcase our method on two example tasks: inference in a Bayesian neural network and posterior sampling in a contextual bandit task.
4.1. Inference in Deep Neural Networks
The goal of this experiment is twofold. First, we empirically confirm the improvement in the ELBO, and second, we quantify the improvement in the uncertainty estimates due to the refinement. We conduct experiments on regression and classification benchmarks using Bayesian neural networks as the underlying model. We look at the marginal log-likelihood (MLL) of the predictions, as well as accuracy in classification tasks.
We used three baseline models for comparison: mean-field variational inference, multiplicative normalizing flows (MNF), and deep ensemble models. For all methods, we used a batch size of 256 and the Adam optimizer with the default learning rate of 0.001. The hyperparameters of each baseline were tuned using a Bayesian optimization package; the tuned batch size and learning rate turned out to be consistent across methods.
First, Variational inference (VI, [8,9]). Naturally, we investigate the improvement of our method over variational inference with a mean-field Gaussian posterior approximation. We do inference over all weights and biases with a Gaussian prior centered at 0, tune the prior variance through empirical Bayes, and train the model for 30,000 iterations.
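Concretely, the objective maximized for this baseline is the standard mean-field ELBO; written in our own notation (not copied from the paper's numbered equations),
\[
\mathcal{L}(q) \;=\; \mathbb{E}_{q(w)}\!\big[\log p(\mathcal{D}\mid w)\big] \;-\; \mathrm{KL}\big(q(w)\,\|\,p(w)\big),
\qquad
q(w) \;=\; \prod_i \mathcal{N}\!\big(w_i \mid \mu_i, \sigma_i^2\big),
\]
with the prior $p(w) = \prod_i \mathcal{N}(w_i \mid 0, \sigma_p^2)$ and $\sigma_p^2$ chosen by empirical Bayes.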
Second, Multiplicative normalizing flows (MNF, [10]). In this work, the posterior means are augmented with a multiplier drawn from a flexible distribution parameterized by masked RealNVP. This model is trained with the default flow parameters for 30,000 iterations.
Third, Deep ensemble models [11]. Deep ensemble models have been shown to be surprisingly effective at quantifying uncertainty. For the regression datasets, we used adversarial training, whereas in classification we did not, since adversarial training did not give an improvement on the classification benchmarks. For each dataset, we trained 10 ensemble members for 5000 iterations each.
Finally, our work, Refined VI. After training the initial mean-field approximation, we generate the refined samples, each constructed with a sequence of auxiliary variables. The means of the prior distributions for the auxiliary variables are fixed at 0, and their prior variances form a geometric series, $\sigma_k^2 \propto \rho^{\,k}$ for $k = 1, \dots, K$ (the intuition is that the auxiliary variables carry roughly equal information this way). We experimented with different ratios $\rho$ between 0 and 1 for the geometric series and found that 0.7 worked well. In each refinement iteration, we optimized the posterior with Adam [12] for 200 iterations. To keep the training stable, we kept the learning rate proportional to the standard deviation of the conditional posterior: in iteration $k$, the learning rate scales with $\sigma_k$. Our code is available at https://github.com/google/edward2/experimental/auxiliary_sampling.
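As an illustration of the schedule just described, the sketch below builds the geometric series of prior variances and the matching per-iteration learning rates. The function names, the base variance, the base learning rate, and the number of auxiliary variables are our own placeholders rather than values taken from the released code.

```python
import numpy as np

def auxiliary_prior_variances(base_variance, num_aux, ratio=0.7):
    """Prior variances of the auxiliary variables: a geometric series,
    so that each auxiliary variable carries roughly equal information."""
    return np.array([base_variance * ratio ** k for k in range(1, num_aux + 1)])

def refinement_learning_rates(prior_variances, base_lr=1e-3):
    """Keep the learning rate proportional to the standard deviation of the
    conditional posterior in each refinement iteration."""
    stds = np.sqrt(prior_variances)
    return base_lr * stds / stds[0]

# Example: 5 auxiliary variables on top of a unit-variance prior.
variances = auxiliary_prior_variances(base_variance=1.0, num_aux=5)
lrs = refinement_learning_rates(variances)
print(variances)  # 0.7, 0.49, 0.343, 0.2401, 0.16807
print(lrs)        # learning rate shrinks with the conditional std
```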
Following [13], we evaluate the methods on a set of UCI regression benchmarks with a feed-forward neural network with a single hidden layer containing 50 units and a ReLU activation function (Table 1). Each dataset is split randomly into 80% training and 20% test data, and we utilize the local reparametrization trick [14].
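As a reminder of what the trick does for a single dense layer with a factorized Gaussian weight posterior: instead of sampling the weights, one samples the pre-activations, whose moments are available in closed form, which reduces the variance of the gradient estimates. The NumPy sketch below is our own minimal illustration, not the implementation used in the experiments.

```python
import numpy as np

def dense_local_reparam(x, w_mean, w_logvar, rng):
    """Dense layer with the local reparametrization trick.

    x:        (batch, in_dim) inputs
    w_mean:   (in_dim, out_dim) posterior means of the weights
    w_logvar: (in_dim, out_dim) posterior log-variances of the weights

    The pre-activation b = x @ W is Gaussian with
      mean = x @ w_mean  and  var = x**2 @ exp(w_logvar),
    so we sample b directly instead of sampling W.
    """
    act_mean = x @ w_mean
    act_var = (x ** 2) @ np.exp(w_logvar)
    eps = rng.standard_normal(act_mean.shape)
    return act_mean + np.sqrt(act_var) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 50))
w_mean = rng.standard_normal((50, 50)) * 0.1
w_logvar = np.full((50, 50), -6.0)
h = np.maximum(dense_local_reparam(x, w_mean, w_logvar, rng), 0.0)  # ReLU hidden layer
```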
On these benchmarks, refined VI consistently improves both the ELBO and the MLL estimates over VI. For refined VI, the ELBO cannot be calculated exactly, but a lower bound to it can be estimated using Equation (13). Note that the gains in MLL are small in this case. Nevertheless, refined VI is one of the best-performing approaches on 7 out of the 9 datasets.
We examine the performance on commonly used image classification benchmarks (Table 2) using the LeNet5 architecture [15]. We use the local reparametrization trick [14] for the dense layers and Flipout [16] for the convolutional layers to reduce the gradient noise. We do not use data augmentation in order to stay consistent with the Bayesian framework.
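To give a sense of how Flipout decorrelates the weight perturbations across examples in a batch, below is a minimal dense-layer version in NumPy; the experiments apply Flipout to the convolutional layers, and the names and shapes here are our own simplification.

```python
import numpy as np

def dense_flipout(x, w_mean, w_std, rng):
    """Dense layer with the Flipout gradient estimator.

    A single weight perturbation delta_w is shared across the batch, but it is
    decorrelated between examples by per-example random sign vectors r and s:
        y_n = x_n @ w_mean + ((x_n * s_n) @ delta_w) * r_n
    """
    batch, in_dim = x.shape
    out_dim = w_mean.shape[1]

    delta_w = w_std * rng.standard_normal(w_mean.shape)   # shared perturbation
    s = rng.choice([-1.0, 1.0], size=(batch, in_dim))      # input-side signs
    r = rng.choice([-1.0, 1.0], size=(batch, out_dim))     # output-side signs

    return x @ w_mean + ((x * s) @ delta_w) * r

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w_mean = rng.standard_normal((16, 4)) * 0.1
w_std = np.full((16, 4), 0.05)
y = dense_flipout(x, w_mean, w_std, rng)  # (8, 4) pre-activations
```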
On the classification benchmarks, we again are able to confirm that the refinement step consistently improves both the ELBO and the MLL over VI, with the MLL differences being more significant here than in the previous experiments. Refined VI is unable to outperform deep ensembles in classification accuracy, but it does outperform them in MLL on the largest dataset, CIFAR10.
To demonstrate the performance on larger-scale models, we apply the refining algorithm to residual networks [17] with 20 layers (based on Keras’s ResNet implementation). We look at two models: a standard ResNet, where inference is done over every residual block, and a hybrid model (ResNet Hybrid [18]), where inference is only done over the final layer of each residual block and every other layer is treated as a regular layer. For this model, we used a batch size of 256 and decayed the learning rate starting from 0.001 over 200 epochs. We used 10 auxiliary variables, each reducing the prior variance by a factor of 0.5. Results are shown in Table 3.
Batch normalization [19] provides a substantial improvement for VI, though, interestingly, this improvement disappears for the hybrid model. The refined hybrid model outperforms the recently proposed natural-gradient VI method of [20] in both MLL and accuracy, but it is still behind some non-Bayesian uncertainty estimation methods [21].
4.3. Thompson Sampling
Generating posterior samples for Thompson sampling [22,23] in a contextual bandit problem is an ideal use case for the refinement algorithm. Refinement allows one to trade off computational complexity for a higher-quality approximation to the posterior, which suits Thompson sampling well, since more expensive objectives often warrant spending time computing better approximations.
Thompson sampling works by sampling a hypothesis from the approximate posterior to decide on each action. This balances exploration and exploitation, since probable hypotheses are tested more frequently than improbable ones. In each step (a minimal sketch of this loop is given after the list of steps),
Sample $\theta \sim q(\theta)$;
Take action $a = \arg\max_{a'} \mathbb{E}[r \mid c, a', \theta]$, where $r$ is the reward, determined by the context $c$, the action $a$ taken, and the unobserved model parameters $\theta$;
Observe the reward $r$ and update the approximate posterior $q(\theta)$.
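The loop above could be organized as in the following sketch; `env`, `posterior`, and `predict_reward` are hypothetical placeholders for the bandit environment, the variational approximation, and the reward model, and the refinement of individual samples described below is omitted for brevity.

```python
def thompson_sampling(env, posterior, predict_reward, num_steps, rng):
    """Thompson sampling with an approximate posterior over model parameters.

    Hypothetical interfaces assumed here:
      env.context()                -> current context c
      env.actions                  -> available actions
      env.step(c, a)               -> observed reward r
      posterior.sample(rng)        -> parameters theta ~ q(theta)
      predict_reward(theta, c, a)  -> E[r | c, a, theta]
      posterior.update(data)       -> refit the approximate posterior
    """
    data = []
    for _ in range(num_steps):
        c = env.context()
        theta = posterior.sample(rng)          # one hypothesis per decision
        a = max(env.actions,
                key=lambda act: predict_reward(theta, c, act))
        r = env.step(c, a)                     # observe the reward for (c, a)
        data.append((c, a, r))
        posterior.update(data)                 # update on all observed rewards
    return data
```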
We look at the mushroom task [9,24], where the agent is presented with a sequence of mushrooms that it can choose to eat or pass. The mushrooms are either edible or poisonous. Eating an edible mushroom always yields a reward of 5, while eating a poisonous mushroom yields a reward of 5 with probability 50% and −35 with probability 50%. Passing a mushroom gives no reward.
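The reward structure is simple enough to state directly in code; the sketch below is our own encoding of the rules above, i.e., of what the agent has to learn from experience.

```python
import numpy as np

EAT, PASS = 1, 0

def mushroom_reward(edible, action, rng):
    """Reward rules of the mushroom bandit: passing gives 0, eating an edible
    mushroom gives +5, and eating a poisonous one gives +5 or -35 with equal
    probability."""
    if action == PASS:
        return 0.0
    if edible:
        return 5.0
    return 5.0 if rng.random() < 0.5 else -35.0

rng = np.random.default_rng(0)
print(mushroom_reward(edible=False, action=EAT, rng=rng))
```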
To predict the distribution of the rewards, the agent uses a neural network with 23 inputs and two outputs. The inputs are the 22 observed attributes of the mushrooms and the proposed action (1 for eating and 0 for passing). The output is the mean expected reward. The network has a standard feed-forward architecture with two hidden layers containing 100 hidden units each, with ReLU activations throughout. For the prior, we used a standard Gaussian distribution over the weights.
For the variational posterior, we use a mean-field Gaussian approximation that we update for 500 iterations after observing each new reward. For the updates, we use batches of 64 randomly sampled rewards and the Adam optimizer. In refined sampling, we used two auxiliary variables. To obtain a high-quality sample for prediction, we first draw from the main variational approximation and then refine the posterior for 500 iterations. After using the refined sample for prediction, we discard it and update the main variational approximation using the newly observed reward (for 500 iterations). In our experiments, we used three posterior samples to calculate the expected reward, which helps to emphasize exploitation compared to using a single sample.
As baselines, we show the commonly used $\epsilon$-greedy algorithm, where the agent takes the action with the highest expected reward according to the maximum-likelihood solution with probability $1-\epsilon$, and takes a random action with probability $\epsilon$.
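For completeness, the action selection of this baseline can be sketched as follows; the names and the reward-estimate interface are our own.

```python
import numpy as np

def epsilon_greedy_action(actions, estimated_reward, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, otherwise the
    action with the highest estimated reward (maximum-likelihood point estimate)."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    return max(actions, key=estimated_reward)

rng = np.random.default_rng(0)
action = epsilon_greedy_action([PASS, EAT],
                               estimated_reward=lambda a: [0.0, 4.2][a],
                               epsilon=0.05, rng=rng)
```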
We measure the performance using the cumulative regret, which is the total difference between the rewards collected by an omniscient agent that makes the optimal choice each time and the rewards collected by our agent. Lower regret indicates better performance.
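Written out in our own notation, after $T$ steps the cumulative regret is
\[
R_T \;=\; \sum_{t=1}^{T} \big( r_t^{\mathrm{oracle}} - r_t \big),
\]
where $r_t^{\mathrm{oracle}}$ is the reward of the omniscient agent at step $t$ and $r_t$ is the reward obtained by our agent.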
Figure 5 depicts the results. We see that the refined agent has lower regret throughout, which shows that the higher-quality posterior samples translate to improved performance. Until about 3000 iterations, the $\epsilon$-greedy algorithms perform well, but they are overtaken by Thompson sampling as the posterior tightens and the agent shifts focus to exploitation.