# A Neural Network MCMC Sampler That Maximizes Proposal Entropy


## Abstract


## 1. Introduction

1. Here, we employed the entropy-based objective in a neural network MCMC sampler to optimize exploration speed. To build the model, we designed a novel, flexible proposal distribution for which optimizing the entropy objective is tractable.
2. Inspired by HMC and the L2HMC algorithm [10], the proposed sampler uses a special architecture that utilizes the gradient of the target distribution to aid sampling.
3. We demonstrate a significant improvement in sampling efficiency over previous techniques, sometimes by an order of magnitude. We also demonstrate energy-based model training with the proposed sampler, showing faster exploration and higher resultant sample quality.
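The tractability requirement in the first contribution can be made concrete with a standard Metropolis–Hastings step: the acceptance ratio needs the proposal log-density $\log q(x^{\prime}|x)$ in both directions, so a learned proposal must expose it. The sketch below uses a simple Gaussian drift as a stand-in for a learned network (this is an illustration, not the paper's actual model); since $q$ is Gaussian with scale `step`, its proposal entropy is also available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    """Log-density of the target (standard Gaussian), up to a constant."""
    return -0.5 * np.sum(x ** 2)

# Stand-in for a learned proposal: a Gaussian whose mean is a simple drift
# of x. In the paper's setting the drift would be a neural network; what
# matters here is that log q(x'|x) stays tractable in both directions.
def drift(x, step):
    return x * (1.0 - step ** 2 / 2.0)

def propose(x, step):
    return drift(x, step) + step * rng.standard_normal(x.shape)

def log_q(x_to, x_from, step):
    d = x_to - drift(x_from, step)
    return -0.5 * np.sum(d ** 2) / step ** 2  # up to a constant

def mh_step(x, step=0.8):
    """One Metropolis-Hastings step with the asymmetric proposal above."""
    x_new = propose(x, step)
    log_alpha = (log_p(x_new) + log_q(x, x_new, step)
                 - log_p(x) - log_q(x_new, x, step))
    if np.log(rng.uniform()) < log_alpha:
        return x_new, True
    return x, False

# The MH correction makes the chain exact for any such proposal; run it
# and the empirical moments approach those of the target.
x = np.zeros(2)
samples = []
for _ in range(20000):
    x, _ = mh_step(x)
    samples.append(x)
samples = np.array(samples)
```

Because `log_q` is available, an entropy term (for a Gaussian proposal, $\tfrac{d}{2}\log(2\pi e\,\mathrm{step}^2)$ in closed form) can be added to the training objective, which is the lever the proposed method optimizes.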

## 2. Preliminaries: MCMC Methods from Vanilla to Learned

## 3. A Gradient-Based Sampler with Tractable Proposal Probability

#### 3.1. Proposal Model and How to Use Gradient Information

#### 3.2. Model Formulation

#### 3.3. Optimizing the Proposal Entropy Objective

## 4. Related Work: Other Samplers Inspired by HMC

**A-NICE-MC** [9], which was generalized in [20], uses the same acceptance probability as HMC but replaces the Hamiltonian dynamics with a flexible volume-preserving flow [21]. A-NICE-MC matches samples from $q({x}^{\prime}|x)$ directly to samples from $p(x)$ using an adversarial loss. This permits training the sampler on empirical distributions, i.e., in cases where only samples, but not the density function, are available. The problem with this method is that samples from the resulting sampler can be highly correlated, because the adversarial objective only optimizes for the quality of the proposed sample: if the sampler produces a high-quality sample x, the learning objective does not encourage the next sample ${x}^{\prime}$ to be substantially different from x. The authors used a pairwise discriminator that empirically mitigated this issue, but the benefit in exploration speed is limited.

**Neural transport MCMC** [12,13,22] fits a distribution defined by a flow model $p_g(x)$ to the target distribution using $\mathrm{KL}\left[p_g(x)\,\|\,p(x)\right]$. Sampling is then performed with HMC in the latent space of the flow model. Due to the invariance of the KL-divergence with respect to a change of variables, the “transported distribution” in z space, $p_{g^{-1}}(z)$, will be fitted to resemble the Gaussian prior $p_{\mathcal{N}}(z)$. Samples of x can then be obtained by passing z through the transport map. Neural transport MCMC improves sampling efficiency compared to sampling in the original space because a distribution closer to a Gaussian is easier to sample. However, the sampling cost is not a monotonic function of the KL-divergence used to optimize the transport map [23].
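The transport idea can be illustrated in one dimension, where the KL objective has a closed form. The toy below (all names and the affine map are illustrative, not the cited methods' actual flows) fits $g(z)=az+b$ so that $p_g=\mathcal{N}(b,a^2)$ matches a Gaussian target $\mathcal{N}(\mu,s^2)$ by gradient descent on $\mathrm{KL}[p_g\,\|\,p]$; after fitting, pushing $z\sim\mathcal{N}(0,1)$ through $g$ yields near-target samples, and the pulled-back target in z-space is close to a standard Gaussian, which HMC explores easily.

```python
import math

# Target N(mu, s^2) and affine transport map g(z) = a*z + b.
mu, s = 3.0, 2.0
a, b = 1.0, 0.0
lr = 0.05

def kl(a, b):
    """Closed-form KL[N(b, a^2) || N(mu, s^2)]."""
    return math.log(s / a) + (a ** 2 + (b - mu) ** 2) / (2 * s ** 2) - 0.5

# Gradient descent on the KL; gradients are derived from kl() above.
for _ in range(500):
    grad_a = -1.0 / a + a / s ** 2   # d KL / d a
    grad_b = (b - mu) / s ** 2       # d KL / d b
    a -= lr * grad_a
    b -= lr * grad_b

# At the optimum a -> s and b -> mu, and KL -> 0: the transported
# distribution in z-space coincides with the standard Gaussian prior.
```

In the real methods the affine map is replaced by a neural flow and the KL is estimated by Monte Carlo, but the fixed point has the same interpretation; the caveat in the text is that a smaller KL does not monotonically translate into cheaper sampling.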

**Normalizing flow Langevin MC** (NFLMC) [11] also uses a KL-divergence loss. Strictly speaking, this model is a normalizing flow and not an MCMC method. We compare our method to it because its architecture, like ours, uses the gradient of the target distribution.

**L2HMC** [10,29] does encourage fast exploration of the state space by employing a variant of the expected squared jump objective [30]: $L(x)=\int \mathrm{d}x^{\prime}\, q(x^{\prime}|x)\, A(x^{\prime},x)\, \|x^{\prime}-x\|^{2}$. This objective provides a learning signal even when x is drawn from the exact target distribution $p(x)$. L2HMC generalizes the Hamiltonian dynamics with a flexible non-volume-preserving transformation [17]. The architecture of L2HMC is very flexible and uses the gradient of the target distribution. However, the L2 expected jump objective in L2HMC improves exploration speed only in well-structured distributions (see Figure 1).
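The trade-off captured by $L(x)$ can be checked numerically. The sketch below (a toy with a plain random-walk proposal, not L2HMC's learned transformation) Monte Carlo estimates $L(x)$ on a 1d Gaussian target: tiny steps are almost always accepted but barely move, huge steps are almost always rejected, and the objective is maximized in between.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    """Standard 1d Gaussian target, up to a constant."""
    return -0.5 * x ** 2

def expected_sq_jump(x, step, n=200_000):
    """Monte Carlo estimate of L(x) = E_{x'~q(.|x)}[A(x', x) (x' - x)^2]
    for a symmetric random-walk proposal q(x'|x) = N(x, step^2), with
    Metropolis acceptance probability A = min(1, p(x')/p(x))."""
    xp = x + step * rng.standard_normal(n)
    accept = np.minimum(1.0, np.exp(log_p(xp) - log_p(x)))
    return float(np.mean(accept * (xp - x) ** 2))

small = expected_sq_jump(0.0, 0.1)   # accepted but short jumps
mid = expected_sq_jump(0.0, 2.0)     # the sweet spot
large = expected_sq_jump(0.0, 50.0)  # long jumps, mostly rejected
```

On this well-structured target the objective behaves as intended; the paper's point is that on badly structured distributions maximizing the L2 jump alone can still leave exploration poor.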

## 5. Experimental Results

#### 5.1. Synthetic Dataset and Bayesian Logistic Regression

#### 5.2. Training a Convergent Deep Energy-Based Model

## 6. Discussion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Sample Availability

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| MCMC | Markov Chain Monte Carlo |
| HMC | Hamiltonian Monte Carlo |
| EBM | Energy-based Model |
| MALA | Metropolis-Adjusted Langevin Algorithm |
| FID | Fréchet Inception Distance |
| GAN | Generative Adversarial Network |
| ESS | Effective Sample Size |
| MH | Metropolis-Hastings |
| A-NICE-MC | Adversarial Nonlinear Independent Component Estimation Monte Carlo |
| L2HMC | Learn to HMC |
| NFLMC | Normalizing Flow Langevin Monte Carlo |
| RealNVP | Real-Valued Non-Volume-Preserving flow |

## Appendix A

#### Appendix A.1. Ablation Study: Effect of Gradient Information

**Figure A1.** Illustrating the role of gradient information. (**a**) Comparison of the proposal entropy (up to a common constant) during training. Full: full model with gradient; LD: model variant 2; no grad: model variant 1. (**b**–**d**) Example proposal distributions of the full, LD and no grad models.

#### Appendix A.2. Experimental Details

**Table A1.** Hyperparameters used for the synthetic datasets and Bayesian logistic regression. Width: width of the MLP networks. Steps: number of update steps in the invertible model f. AR: target acceptance rate. LR: learning rate. Min LR: terminal learning rate of the cosine annealing schedule.

| Dataset | Width | Steps | AR | LR | Min LR |
|---|---|---|---|---|---|
| 50d ICG | 256 | 1 | 0.9 | ${10}^{-3}$ | ${10}^{-5}$ |
| 2d SCG | 32 | 1 | 0.9 | ${10}^{-3}$ | ${10}^{-5}$ |
| 100d Funnel-1 | 512 | 3 | 0.7 | ${10}^{-3}$ | ${10}^{-5}$ |
| 20d Funnel-3 | 1024 | 4 | 0.6 | $5\times {10}^{-4}$ | ${10}^{-7}$ |
| German | 128 | 1 | 0.7 | ${10}^{-3}$ | ${10}^{-5}$ |
| Australian | 128 | 1 | 0.8 | ${10}^{-3}$ | ${10}^{-5}$ |
| Heart | 128 | 1 | 0.9 | ${10}^{-3}$ | ${10}^{-5}$ |

#### Appendix A.3. Additional Experimental Results

**Table A2.** Comparing ESS/s between the learned sampler and HMC on the Bayesian logistic regression task. The learned sampler is significantly more efficient.

| Dataset (Measure) | HMC | Ours |
|---|---|---|
| German (ESS/s) | 772.7 | 3651 |
| Australian (ESS/s) | 127.2 | 3433 |
| Heart (ESS/s) | 997.1 | 4235 |

**Figure A2.**Visualizations of proposal distributions learned on the Funnel-3 distribution. Yellow dot: x. Blue dots: accepted ${x}^{\prime}$. Black dots: rejected ${x}^{\prime}$. The sampler has an accept rate of 0.6. Although not perfectly covering the target distribution, the proposed samples travel far from the previous sample and in a manner that complies with the geometry of the target distribution.

**Figure A3.** Further results for the deep EBM. (**a**–**c**) Proposal entropy, Fréchet Inception Distance (FID) of the replay buffer, and energy difference during training. Results for MALA are also included in (**a**,**b**). (**a**) shows that the learned sampler achieves better proposal entropy early during training. (**b**) shows that the learned sampler converges faster than MALA. (**c**) shows that the EBM remains stable with a mixture of positive and negative energy differences. (**d**) Comparison of the L2 expected jump of MALA and the learned sampler, plotted in log scale; it has almost exactly the same shape as the proposal entropy plot in the main text. (**e**) More samples from a sampling process of 100,000 steps with the learned sampler. (**f**,**g**) Samples from the replay buffer and the corresponding visualization of the pixel-wise variance of the displacement vector z evaluated at those samples. Images in (**f**,**g**) are arranged in the same order. Image-like structures that depend on the sample of origin are clearly visible in (**g**); a MALA sampler would give uniform variance.

**Figure A4.** Checking the correctness of samples and the EBM training process. (**a**) Comparison of the dimension-wise means, standard deviations and fourth moments of samples obtained from HMC and from the learned sampler on the Bayesian logistic regression datasets. The moments match very closely, indicating that the learned sampler generates samples from the correct target distribution. (**b**) One hundred thousand sampling steps by a MALA sampler on an EBM energy function trained with the adaptive sampler; samples are initialized from the replay buffer. Samples look plausible throughout the sampling process, indicating that stable attractor basins are formed that are not specific to the learned sampler, and that EBM training is not biased by the adaptive sampler.

## References

1. Noé, F.; Olsson, S.; Köhler, J.; Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. *Science* **2019**, *365*, eaaw1147.
2. Nijkamp, E.; Hill, M.; Han, T.; Zhu, S.C.; Wu, Y.N. On the Anatomy of MCMC-based Maximum Likelihood Learning of Energy-Based Models. In Proceedings of the Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020.
3. Neal, R.M. Probabilistic Inference Using Markov Chain Monte Carlo Methods; Department of Computer Science, University of Toronto: Toronto, ON, Canada, 1993.
4. Neal, R.M. MCMC using Hamiltonian dynamics. *Handb. Markov Chain Monte Carlo* **2011**, *2*, 2.
5. Radivojević, T.; Akhmatskaya, E. Modified Hamiltonian Monte Carlo for Bayesian Inference. *Stat. Comput.* **2020**, *30*, 377–404.
6. Beskos, A.; Pillai, N.; Roberts, G.; Sanz-Serna, J.M.; Stuart, A. Optimal tuning of the hybrid Monte Carlo algorithm. *Bernoulli* **2013**, *19*, 1501–1534.
7. Betancourt, M.; Byrne, S.; Livingstone, S.; Girolami, M. The geometric foundations of Hamiltonian Monte Carlo. *Bernoulli* **2017**, *23*, 2257–2298.
8. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. *J. R. Stat. Soc. Ser. B (Stat. Methodol.)* **2011**, *73*, 123–214.
9. Song, J.; Zhao, S.; Ermon, S. A-NICE-MC: Adversarial training for MCMC. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5140–5150.
10. Levy, D.; Hoffman, M.D.; Sohl-Dickstein, J. Generalizing Hamiltonian Monte Carlo with Neural Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
11. Gu, M.; Sun, S.; Liu, Y. Dynamical Sampling with Langevin Normalization Flows. *Entropy* **2019**, *21*, 1096.
12. Hoffman, M.; Sountsov, P.; Dillon, J.V.; Langmore, I.; Tran, D.; Vasudevan, S. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. *arXiv* **2019**, arXiv:1903.03704.
13. Nijkamp, E.; Gao, R.; Sountsov, P.; Vasudevan, S.; Pang, B.; Zhu, S.C.; Wu, Y.N. Learning Energy-based Model with Flow-based Backbone by Neural Transport MCMC. *arXiv* **2020**, arXiv:2006.06897.
14. Titsias, M.; Dellaportas, P. Gradient-based Adaptive Markov Chain Monte Carlo. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 15730–15739.
15. Hastings, W. Monte Carlo sampling methods using Markov chains and their applications. *Biometrika* **1970**, *57*, 97–109.
16. Sohl-Dickstein, J.; Mudigonda, M.; DeWeese, M.R. Hamiltonian Monte Carlo without detailed balance. *arXiv* **2014**, arXiv:1409.5191.
17. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. *arXiv* **2016**, arXiv:1605.08803.
18. Kobyzev, I.; Prince, S.; Brubaker, M.A. Normalizing flows: Introduction and ideas. *arXiv* **2019**, arXiv:1908.09257.
19. Papamakarios, G.; Nalisnick, E.; Rezende, D.J.; Mohamed, S.; Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. *arXiv* **2019**, arXiv:1912.02762.
20. Spanbauer, S.; Freer, C.; Mansinghka, V. Deep Involutive Generative Models for Neural MCMC. *arXiv* **2020**, arXiv:2006.15167.
21. Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-linear independent components estimation. *arXiv* **2014**, arXiv:1410.8516.
22. Marzouk, Y.; Moselhy, T.; Parno, M.; Spantini, A. An introduction to sampling via measure transport. *arXiv* **2016**, arXiv:1602.05023.
23. Langmore, I.; Dikovsky, M.; Geraedts, S.; Norgaard, P.; Von Behren, R. A Condition Number for Hamiltonian Monte Carlo. *arXiv* **2019**, arXiv:1905.09813.
24. Salimans, T.; Kingma, D.; Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1218–1226.
25. Zhang, Y.; Hernández-Lobato, J.M.; Ghahramani, Z. Ergodic measure preserving flows. *arXiv* **2018**, arXiv:1805.10377.
26. Postorino, M.N.; Versaci, M. A geometric fuzzy-based approach for airport clustering. *Adv. Fuzzy Syst.* **2014**, *2014*, 201243.
27. Tkachenko, R.; Izonin, I.; Kryvinska, N.; Dronyuk, I.; Zub, K. An approach towards increasing prediction accuracy for the recovery of missing IoT data based on the GRNN-SGTM ensemble. *Sensors* **2020**, *20*, 2625.
28. Neklyudov, K.; Egorov, E.; Shvechikov, P.; Vetrov, D. Metropolis-Hastings view on variational inference and adversarial training. *arXiv* **2018**, arXiv:1810.07151.
29. Thin, A.; Kotelevskii, N.; Durmus, A.; Panov, M.; Moulines, E. Metropolized Flow: From Invertible Flow to MCMC. In Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, virtual event, 12–18 July 2020.
30. Pasarica, C.; Gelman, A. Adaptively scaling the Metropolis algorithm using expected squared jumped distance. *Stat. Sin.* **2010**, *20*, 343–364.
31. Poole, B.; Ozair, S.; Oord, A.V.d.; Alemi, A.A.; Tucker, G. On variational bounds of mutual information. *arXiv* **2019**, arXiv:1905.06922.
32. Song, J.; Ermon, S. Understanding the limitations of variational mutual information estimators. *arXiv* **2019**, arXiv:1910.06222.
33. Neal, R.M. Slice sampling. *Ann. Stat.* **2003**, *31*, 705–741.
34. Hoffman, M.D.; Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. *J. Mach. Learn. Res.* **2014**, *15*, 1593–1623.
35. Betancourt, M. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In Lecture Notes in Computer Science, Proceedings of the International Conference on Geometric Science of Information, Paris, France, 28–30 August 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 327–334.
36. Xie, J.; Lu, Y.; Zhu, S.C.; Wu, Y. A theory of generative ConvNet. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 2635–2644.
37. Du, Y.; Mordatch, I. Implicit generation and generalization in energy-based models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019.
38. Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1064–1071.
39. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6626–6637.
40. Hoffman, M.D. Learning deep latent Gaussian models with Markov chain Monte Carlo. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1510–1519.
41. Che, T.; Zhang, R.; Sohl-Dickstein, J.; Larochelle, H.; Paull, L.; Cao, Y.; Bengio, Y. Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling. *arXiv* **2020**, arXiv:2003.06060.
42. Yu, L.; Song, Y.; Song, J.; Ermon, S. Training Deep Energy-Based Models with f-Divergence Minimization. In Proceedings of the International Conference on Machine Learning, Virtual Event, Vienna, Austria, 12–18 July 2020.
43. Grathwohl, W.; Wang, K.C.; Jacobsen, J.H.; Duvenaud, D.; Norouzi, M.; Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.

**Figure 1.**Illustration of learning for exploring a state space. Left panel: a Langevin sampler that has poor exploration. Middle panel: our proposed method—samples travel far within the target distribution. Right panel: a sampler with a higher L2 jump than ours—the exploration is still worse. In each panel, the yellow dot on the top left is the initial point x; blue and black dots are accepted and rejected samples, respectively.

**Figure 2.** Comparison of our method with Hamiltonian Monte Carlo (HMC) on the 20d Funnel-3 distribution. (**a**) Chain and samples of ${x}_{0}$ (from neck to base direction) for HMC. (**b**) Same as (**a**) but for our learned sampler. Note that the samples in (**a**) look significantly more correlated than those in (**b**), although they are plotted over a longer time scale.

**Figure 3.** Training of the convergent energy-based model (EBM) with pixel-space sampling. (**a**) Samples from the replay buffer after training. (**b**) Proposal entropy of the trained sampler vs. the Metropolis-adjusted Langevin algorithm (MALA) early during training; note that the entropy of the learned sampler is significantly higher. (**c**) Samples from 100,000 sampling steps by the learned sampler, initialized at samples from the replay buffer. Large transitions like the one in the first row are rare; this atypical example was selected for display.

**Table 1.** Performance comparisons. SCG: strongly correlated Gaussian. ICG: ill-conditioned Gaussian. German, Australian, Heart: datasets for Bayesian logistic regression. ESS: effective sample size (a correlation measure).

| Dataset (Measure) | L2HMC | Ours |
|---|---|---|
| 50d ICG (ESS/MH) | 0.783 | 0.86 |
| 2d SCG (ESS/MH) | 0.497 | 0.89 |
| 50d ICG (ESS/grad) | $7.83\times {10}^{-2}$ | $\mathbf{2.15\times {10}^{-1}}$ |
| 2d SCG (ESS/grad) | $2.32\times {10}^{-2}$ | $\mathbf{2.2\times {10}^{-1}}$ |

| Dataset (Measure) | NeuTra | Ours |
|---|---|---|
| Funnel-1 ${x}_{0}$ (ESS/grad) | $8.9\times {10}^{-3}$ | $\mathbf{3.7\times {10}^{-2}}$ |
| Funnel-1 ${x}_{1\cdots 99}$ (ESS/grad) | $4.9\times {10}^{-2}$ | $\mathbf{7.2\times {10}^{-2}}$ |

| Dataset (Measure) | A-NICE-MC | NFLMC | Ours |
|---|---|---|---|
| German (ESS/5k) | 926.49 | 1176.8 | 3150 |
| Australian (ESS/5k) | 1015.75 | 1586.4 | 2950 |
| Heart (ESS/5k) | 1251.16 | 2000 | 3600 |
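The ESS figures in these tables summarize chain autocorrelation: ESS is roughly the chain length divided by its integrated autocorrelation time. As a rough illustration of how such a number can be computed, here is a crude estimator using Geyer's initial-positive-sequence truncation; this is a sketch under simplifying assumptions, not the estimator used in the paper, and production code (e.g. in common MCMC libraries) is more careful.

```python
import numpy as np

def autocorr(x, k):
    """Lag-k autocorrelation of a centered 1d array."""
    return float(np.mean(x[: len(x) - k] * x[k:]) / np.mean(x * x))

def ess(chain):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations),
    where the sum is truncated once a consecutive pair of autocorrelations
    goes negative (Geyer's initial positive sequence)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    tau = 1.0          # integrated autocorrelation time
    k = 1
    while k + 1 < n // 2:
        pair = autocorr(x, k) + autocorr(x, k + 1)
        if pair < 0.0:
            break
        tau += 2.0 * pair
        k += 2
    return n / tau

# Sanity check: independent draws give ESS ~ n, while a strongly
# autocorrelated AR(1) chain gives a much smaller ESS.
rng = np.random.default_rng(0)
iid = rng.standard_normal(20000)
ar = np.empty(20000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the AR(1) chain the true integrated autocorrelation time is $(1+\rho)/(1-\rho)=19$ at $\rho=0.9$, so its ESS is roughly 1/19 of the chain length, matching the intuition behind the ESS/MH and ESS/grad columns.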

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Li, Z.; Chen, Y.; Sommer, F.T.
A Neural Network MCMC Sampler That Maximizes Proposal Entropy. *Entropy* **2021**, *23*, 269.
https://doi.org/10.3390/e23030269
