1. Introduction
Generating samples from many distributions encountered in Bayesian inference and machine learning is difficult. Markov chain Monte Carlo (MCMC) provides a robust framework for generating samples from complex target distributions. By constructing suitable Markov chains, MCMC methods converge to the correct target distribution as the chains evolve. MCMC now plays an essential role in artificial intelligence applications and probabilistic inference, especially for estimating expectations of target functions [1].
Sampling methods based on dynamics are one of the most popular MCMC methods. The most commonly used dynamics in MCMC are Langevin dynamics and Hamiltonian dynamics. Hamiltonian Monte Carlo (HMC) [
2,
3] has become one of the most popular MCMC algorithms in Bayesian inference and machine learning. Unlike previous MCMC algorithms [
4], HMC takes advantage of the gradient information to explore the continuous probability density function (PDF), which makes HMC more efficient to converge to the target distribution. In particular, HMC transforms the PDF into the potential energy function and adds the kinetic energy function to simulate the motion of the particle in the particular phase space, and thus HMC can satisfy the ergodic property. In practice, HMC exploits the Hamiltonian equation to calculate the new state of the proposed points in the phase space. To maintain a detailed balance, the Metropolis–Hasting (MH) technique is adopted [
5]. Gradient information helps to discover and explore the phase space more efficiently, and there is much further research on how to better leverage gradients for HMC [
6,
7].
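To make these mechanics concrete, the following minimal Python sketch implements one generic HMC transition: leapfrog integration of Hamiltonian dynamics followed by an MH correction. The `log_prob` and `grad_log_prob` callables, step size, and leapfrog count are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np

def hmc_step(theta, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20):
    """One generic HMC transition: leapfrog simulation plus MH correction."""
    r = np.random.randn(*theta.shape)                 # momentum ~ N(0, I)
    theta_new, r_new = theta.copy(), r.copy()

    # Leapfrog integration of Hamiltonian dynamics.
    r_new += 0.5 * step_size * grad_log_prob(theta_new)
    for i in range(n_leapfrog):
        theta_new += step_size * r_new
        if i < n_leapfrog - 1:
            r_new += step_size * grad_log_prob(theta_new)
    r_new += 0.5 * step_size * grad_log_prob(theta_new)

    # Hamiltonian H = U + K with U = -log p(theta) and K = ||r||^2 / 2.
    h_old = -log_prob(theta) + 0.5 * np.dot(r, r)
    h_new = -log_prob(theta_new) + 0.5 * np.dot(r_new, r_new)

    # Metropolis-Hastings accept/reject preserves detailed balance.
    if np.log(np.random.rand()) < h_old - h_new:
        return theta_new
    return theta
```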
Nevertheless, samplers based on dynamics still have some deficiencies. The traditional dynamics samplers [8] and their extensions [7,9,10,11,12] perform excellently on unimodal distributions. However, on multi-modal distributions these algorithms run into problems, especially when the modes are far away from each other. When the modes are close together, the momentum variable in dynamics samplers offers chances for a sample to jump between modes. When the modes are isolated, the momentum variable cannot carry the sampler out of the current mode, because the region between two modes has tremendous potential energy. Generally, trajectories tend to stay in low-energy regions, which are also referred to as high-probability regions. Although we could enlarge the momentum to pass through high-potential-energy regions, the momentum would have to be exponentially large, which causes a rapid decrease in sampler performance. Several studies have addressed the multi-modal sampling problem [13,14,15,16]. The authors in [13] introduced temperature into the simulation: simulated annealing gradually reduces the temperature from a high initial value to the value at which we wish to sample. Parallel tempering [17] is a variant of simulated annealing that exchanges states among multiple Markov chains, letting the sampler explore the state space more freely without changing the resulting sample distribution. Sminchisescu and Welling [14] proposed a new dynamics sampler based on a darting algorithm [18]. Umbrella sampling [19] is another precise sampling method, which divides the reaction coordinate space into several windows with bias potentials to calculate the unbiased free energy in each window. However, in high dimensions these algorithms can require elaborate set-ups and incur large computational complexity, resulting in low efficiency. Lan et al. [15] used the natural gradient of the target distribution to establish paths between different modes so that samples can jump through low-probability regions; this method may suffer from a low effective sample size (ESS) [3], meaning that neighboring samples are highly correlated. Tripuraneni et al. [16] introduced the concept of magnetic fields: by constructing a dynamics system based on a magnetic field, their method achieves good performance in multi-modal sampling. However, setting the magnetic field parameter is difficult, and this method may also suffer from low ESS on multi-modal sampling problems.
In this paper, we introduce a novel dynamics MCMC method called variational hybrid Monte Carlo (VHMC). We first improve Hamiltonian dynamics with Langevin dynamics to reduce the autocorrelation of samples and accelerate the convergence of the dynamics sampler. Furthermore, we exploit a variational distribution [20] of the target distribution to help the dynamics sampler find new modes. A new Metropolis–Hastings criterion is proposed to satisfy the detailed balance condition [5]. VHMC overcomes the distant multi-modal sampling problem because dynamics-based samplers handle unimodal distributions well, while the variational distribution guides the sampler to jump between modes. Finally, a detailed proof is given to demonstrate that our algorithm converges to the target distribution.
Both synthetic and real data experiments are conducted to verify our theory. We sample points from seven different Gaussian mixture distributions whose dimensions range from 2 to 256. We apply our method to two-class classification using Bayesian logistic regression [21] to test the performance of VHMC on real datasets. Evaluation indices such as maximum mean discrepancy [22] and autocorrelation are calculated to assess the quality of the samples. Experimental results illustrate that the proposed method can sample from distant multi-modal distributions while obtaining better performance than other state-of-the-art methods [16,23].
The main contributions of this work can be summarized as follows. We propose a novel sampler called Langevin Hamiltonian Monte Carlo (LHMC), which achieves lower autocorrelation and faster convergence than the HMC sampler. Since LHMC introduces random factors, the total energy of the system changes during the simulation, so we design a new Metropolis–Hastings procedure to keep the detailed balance. In addition, to remedy the poor performance of the LHMC sampler on multi-modal distributions, we propose VHMC, which uses a variational distribution of the target distribution to guide the sampler across different modes, as sketched below. We use Adam [24] to find the local points with the highest probability density values, and we use these points to construct a mixture of Gaussians as the variational distribution of the target distribution. A detailed proof of the correctness of our method is provided.
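As a rough illustration of this construction, the sketch below locates local modes by gradient ascent from random restarts (plain gradient ascent stands in for Adam here) and builds an equally weighted Gaussian mixture on them. The restart count, learning rate, and component scale `sigma` are assumed values, not the paper's settings.

```python
import numpy as np

def build_variational_mixture(grad_log_prob, dim, n_restarts=10,
                              lr=0.05, n_steps=500, sigma=1.0):
    """Find local modes of the target, then return them together with a
    sampler for an equally weighted Gaussian mixture centred on them."""
    modes = []
    for _ in range(n_restarts):
        x = np.random.randn(dim) * 10.0          # wide random initialisation
        for _ in range(n_steps):
            x += lr * grad_log_prob(x)           # ascend the log density
        # Keep only modes not already found (simple de-duplication).
        if not any(np.linalg.norm(x - m) < 0.1 for m in modes):
            modes.append(x.copy())

    def sample():
        # Draw from the variational mixture: pick a mode, add Gaussian noise.
        m = modes[np.random.randint(len(modes))]
        return m + sigma * np.random.randn(dim)

    return modes, sample
```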
The rest of this article is organized as follows. In Section 2, we review the preliminaries of our study, including Hamiltonian Monte Carlo and Langevin dynamics. In Section 3, we introduce our LHMC sampler and present the objective function. In Section 4, we propose variational hybrid Monte Carlo, which addresses the problem of multi-modal sampling. Experiments are reported in Section 5. Discussion and conclusions are summarized in Section 6 and Section 7.
3. Langevin Hamiltonian Monte Carlo
HMC exploits Hamiltonian dynamics to propose new samples. However, the HMC sampler may exhibit large autocorrelation because each new sample is obtained by a deterministic calculation from the last sample. Specifically, Equation (3) defines the process of calculating a new state from the old state. HMC can reduce autocorrelation by increasing the leapfrog size, but this makes the sampler inefficient.
To further reduce the autocorrelation of the HMC sampler, we propose the Langevin Hamiltonian Monte Carlo (LHMC). The main idea of LHMC is to take advantage of Langevin dynamics to add randomness to the proposed state. In Langevin dynamics, we consider the total energy to consist of the potential energy, kinetic energy, and internal energy:
$$E_{\mathrm{total}} = U(\theta) + K(r) + Q,$$
where $Q$ represents the internal energy. The random thermal motion consumes the internal energy, which is finally transformed into kinetic energy, as described below. We use the Metropolis–Hastings criterion to accept samples, with the acceptance rate computed from the change in total energy.
The detailed steps of LHMC are given in Algorithm 2. Given the target distribution, LHMC exploits Langevin dynamics and Hamiltonian dynamics to explore the state space via the discretization of LHMC, which can be summarized as the three substages shown in Algorithm 3. The first substage is Langevin dynamics, which takes the form of (15)–(17). The second substage is Hamiltonian dynamics, which takes the form of (3), and the last substage is again Langevin dynamics. Given an initial state, a half update of the Langevin dynamics is written as (15), the random thermal motion of molecules takes the form of (16), and the other half update of the Langevin dynamics is written as (17).
Algorithm 2: Langevin Hamiltonian Monte Carlo (LHMC)
Input: step size, leapfrog size L, starting point, sample number N.
Output: samples.
for n = 1 to N do
    Propose a new state via DLHMC (see Algorithm 3).
    if the Metropolis–Hastings criterion accepts the proposal then
        keep the proposed state as the next sample
    else
        retain the previous state
    end if
end for
Algorithm 3: Discretization for Langevin Hamiltonian Monte Carlo (DLHMC)
Input: step size, leapfrog size L, starting point.
Output: updated position and momentum.
1. Obtain the half update of the Langevin dynamics through (15).
2. Obtain the thermal motion of molecules through (16).
3. Obtain the updated momentum through (17).
4. Simulate Hamiltonian dynamics through (3).
5. Obtain the half update of the Langevin dynamics through (15).
6. Obtain the thermal motion of molecules through (16).
7. Obtain the updated momentum through (17).
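A Python sketch of one discretized LHMC transition is given below. The Langevin substages are written as a standard Ornstein–Uhlenbeck partial momentum refreshment, which is only an assumed stand-in for the exact updates (15)–(17); the friction coefficient, step size, and leapfrog count are illustrative.

```python
import numpy as np

def dlhmc_step(theta, r, grad_log_prob,
               step_size=0.05, n_leapfrog=20, friction=1.0):
    """One discretized LHMC transition: Langevin half-step, Hamiltonian
    leapfrog, Langevin half-step (the substages of Algorithm 3)."""
    a = np.exp(-friction * step_size / 2.0)          # momentum decay factor

    # Substage 1: Langevin dynamics -- friction plus random thermal motion.
    r = a * r + np.sqrt(1.0 - a * a) * np.random.randn(*r.shape)

    # Substage 2: Hamiltonian dynamics via leapfrog, cf. Equation (3).
    r = r + 0.5 * step_size * grad_log_prob(theta)
    for i in range(n_leapfrog):
        theta = theta + step_size * r
        if i < n_leapfrog - 1:
            r = r + step_size * grad_log_prob(theta)
    r = r + 0.5 * step_size * grad_log_prob(theta)

    # Substage 3: Langevin dynamics again.
    r = a * r + np.sqrt(1.0 - a * a) * np.random.randn(*r.shape)
    return theta, r
```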
We demonstrate the performance of LHMC on a strongly correlated Gaussian, obtained by rotating a Gaussian with highly unequal variances, which is an extreme case from Brooks et al. [3]. Each method in Figure 1 is run for 10,000 iterations with 1000 burn-in samples. We ran the experiment 100 times and calculated the mean and variance of the autocorrelation and maximum mean discrepancy. The mass matrix is set to an identity matrix with the same dimension as the sample distribution, together with fixed values of the step size, friction coefficient, and leapfrog size. As Figure 1 illustrates, LHMC achieves lower autocorrelation and a faster convergence rate than HMC and the Metropolis-adjusted Langevin algorithm (MALA) [29].
5. Experiments
In this section, we investigate the performance of VHMC on multi-modal distributions and real datasets and compare our method with state-of-the-art algorithms. All our experiments are conducted on a standard computer with a 4.0 GHz Intel Core i7 CPU. First, we introduce the performance indices used in the following parts.
Effective sample size. The variance of a Monte Carlo sampler is determined by its effective sample size (ESS) [3], which is defined as
$$\mathrm{ESS} = \frac{N}{1 + 2\sum_{s=1}^{\infty} \rho_s},$$
where $N$ represents the total number of samples and $\rho_s$ represents the $s$-step autocorrelation, an index that measures the correlation between two samples. Let $X$ be a set of samples and $t$ be the (integer) iteration index; then $X_t$ is the sample of $X$ at time $t$. The autocorrelation between times $s$ and $t$ is defined as
$$\rho_{s,t} = \frac{\mathbb{E}\left[(X_t - \mu_t)(X_s - \mu_s)\right]}{\sigma_t \sigma_s},$$
where $\mathbb{E}$ is the expected value operator and $\mu$ and $\sigma$ denote the mean and standard deviation at the corresponding times. The correlation between two nearby samples can be measured with autocorrelation: the lower the autocorrelation, the more independent the samples.
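For reference, the following sketch estimates the $s$-step autocorrelation and the resulting ESS for a one-dimensional chain; truncating the sum at the first negative autocorrelation (or at `max_lag`) is a common heuristic and an assumption of ours, not a rule from the paper.

```python
import numpy as np

def autocorrelation(x, s):
    """Sample s-step autocorrelation of a 1-D chain x."""
    x = np.asarray(x, dtype=float)
    if s == 0:
        return 1.0
    xc = x - x.mean()
    return np.dot(xc[:-s], xc[s:]) / np.dot(xc, xc)

def effective_sample_size(x, max_lag=200):
    """ESS = N / (1 + 2 * sum_s rho_s), with the sum truncated at the
    first negative autocorrelation or at max_lag."""
    rho_sum = 0.0
    for s in range(1, min(max_lag, len(x) - 1)):
        rho = autocorrelation(x, s)
        if rho < 0:
            break
        rho_sum += rho
    return len(x) / (1.0 + 2.0 * rho_sum)
```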
Maximum mean discrepancy. The difference between samples drawn from two distributions can be measured by the maximum mean discrepancy (MMD) [22], which takes the following form:
$$\mathrm{MMD}^2(X, Y) = \frac{1}{M^2}\sum_{i,j} k(x_i, x_j) - \frac{2}{MN}\sum_{i,j} k(x_i, y_j) + \frac{1}{N^2}\sum_{i,j} k(y_i, y_j),$$
where $M$ represents the number of samples in $X$, $N$ represents the number of samples in $Y$, and $k$ represents the kernel function. By calculating the MMD value, we can analyze the convergence rate of the proposed methods.
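A minimal estimator of the squared MMD is sketched below. The quadratic dot-product kernel $k(x, y) = \langle x, y \rangle^2$ is assumed here to match the kernel described in Section 5.1, and the biased V-statistic form is an implementation choice of ours.

```python
import numpy as np

def mmd_squared(X, Y):
    """Biased squared-MMD estimate between sample sets X (M x d) and
    Y (N x d) under the quadratic kernel k(x, y) = <x, y>^2."""
    Kxx = (X @ X.T) ** 2
    Kyy = (Y @ Y.T) ** 2
    Kxy = (X @ Y.T) ** 2
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```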
Relative error of mean. This summarizes the error in approximating the expectations of the variables across all dimensions [33], and is computed as
$$\mathrm{REM}(t) = \frac{\sum_{i=1}^{d} \left|\bar{x}_i^t - \mu_i\right|}{\sum_{i=1}^{d} \left|\mu_i\right|},$$
where $\bar{x}_i^t$ is the average of the $i$-th variable at time $t$, $\mu_i$ is the actual mean value, $d$ denotes the dimension of the sampling distribution, and the denominator is the sum of $|\mu_i|$ under the true distribution.
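Following the definition above, REM reduces to a few lines of NumPy; the inputs `samples` (the first $t$ draws) and `mu_true` (the true per-dimension means) are assumed names for illustration.

```python
import numpy as np

def relative_error_of_mean(samples, mu_true):
    """REM at time t: sum_i |mean_i - mu_i| / sum_i |mu_i|,
    where `samples` has shape [t, d] and `mu_true` has shape [d]."""
    est = samples.mean(axis=0)          # running mean of each variable
    return np.abs(est - mu_true).sum() / np.abs(mu_true).sum()
```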
5.1. Mixture of Isotropic Gaussians
We conduct our first experiment on two multi-modal distributions: two simple 2D Gaussian mixtures whose densities are analytically available. First, we consider a Gaussian mixture whose modes are close to each other, and then a Gaussian mixture whose modes are isolated and far away from each other. The distributions are two-component isotropic Gaussian mixtures whose mode separation is small in the first case and large in the second. The experimental setting is the same as in Tripuraneni et al. [16]. The goal of the experiments is to correctly draw independent and identically distributed samples from the multi-modal distribution.
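For concreteness, a target of this kind can be set up as below. The means at $\pm(\mu, \mu)$, equal weights, and unit variance are illustrative placeholders; the exact values follow Tripuraneni et al. [16] and the settings described in the text.

```python
import numpy as np

def make_mixture_log_prob(mu=3.0, var=1.0):
    """Log density of an equally weighted two-component isotropic 2D
    Gaussian mixture with means at +(mu, mu) and -(mu, mu)."""
    m1, m2 = np.array([mu, mu]), np.array([-mu, -mu])

    def log_prob(x):
        d1 = -np.sum((x - m1) ** 2) / (2.0 * var)
        d2 = -np.sum((x - m2) ** 2) / (2.0 * var)
        # log(0.5 * N1 + 0.5 * N2), including the shared Gaussian normalizer.
        return np.logaddexp(d1, d2) + np.log(0.5) - np.log(2.0 * np.pi * var)

    return log_prob
```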
In this experiment, we compare MHMC, HMC, MGHMC [23], and parallel tempering (SAMCMC) [17] against VHMC. First, we compare the sampling results of these methods visually. Then, averaged autocorrelation and MMD are used to compare the performance of each method further. Each method is run for 10,000 iterations with 1000 burn-in samples. The number of leapfrog steps is drawn uniformly, as suggested by Livingstone et al. [34], and the step size, friction coefficient, and initial position are fixed. The authors in Tripuraneni et al. [16] indicated that the multi-modal problem is a challenge for HMC samplers. However, we find that HMC samplers can sample points from the multi-modal distribution, especially when the modes are close to each other.
Figure 4 clearly shows that when the modes are close to each other, all three methods can sample the multi-modal distribution. Nevertheless, there are differences between them. MHMC can sample from this Gaussian mixture, but it updates its position only through random-walk proposals, so it rarely jumps to the other mode from far away. The HMC sampler changes its sampling mode more frequently thanks to the gradient information of the target distribution. With the combined help of guide points, Brownian motion, and gradient information, VHMC changes its mode much more frequently than MHMC and HMC. From these results, we also conclude that when the modes are close to each other, MHMC and HMC can sample this multi-modal distribution. However, when the modes are isolated and far away from each other, both MHMC and HMC fail to sample from the target distribution, as shown in the second row of Figure 4. This is because random-walk proposals and gradient information alone cannot carry them across large low-probability regions. Our method still performs well by taking advantage of samples generated from the variational distribution to explore the phase space of different modes freely.
When the modes are far away from each other, HMC hardly changes its mode, so it converges to the target distribution slowly. On the contrary, VHMC changes its sampling mode very frequently and thus converges to the target distribution quickly. To compare the convergence rate and sample independence with state-of-the-art sampling methods, we use MMD and autocorrelation to quantify performance when sampling the Gaussian mixture.
The MMD between exact samples generated from the target density and the generated samples is used to describe the convergence performance of the samplers. We use a quadratic kernel [35] based on the dot product, averaged over 100 runs of the Markov chains. Figure 5 demonstrates that our method achieves the best convergence rate and autocorrelation after the burn-in period. Since our method converges to the target distribution quickly, we also zoom in on the first 500 samples. In general, comparing Figure 4 and Figure 5, we conclude that the convergence rate is inversely proportional to the MMD and autocorrelation for MHMC, HMC, and VHMC.
To test the performance of the proposed method on high-dimensional multi-modal distributions, we conduct experiments on 2 to 128 dimensions. The target distribution is an n-dimensional Gaussian mixture, where n equals the number of dimensions. Figure 6 shows that the proposed method has lower REM in high dimensions, which indicates that VHMC can sample from high-dimensional distant multi-modal distributions.
5.2. Mixture of Heterogeneous Gaussians
In the first experiment, we discussed Gaussian mixtures in which the variance of each mode is the same. In practice, real data distributions often have modes with different variances and probabilities. To demonstrate the stability of our method, we construct two new Gaussian mixtures with different variances and mode probabilities. The first is given with specified means, variances, and mixture weights; the second takes a similar form with its own parameter settings. As in the previous experiment, our method runs 10,000 iterations with 1000 burn-in samples.
Figure 7 shows that VHMC is highly stable: even when the variance becomes tiny, our method still performs well. From the second column of Figure 7, we can also observe that the HMC sampler can sample a multi-modal distribution when it has a chance to jump out of one mode. Although the distance between the left and middle modes is the same as the distance between the middle and right modes, the differing variances confine the HMC sampler to the left two modes.
5.3. Bayesian Logistic Regression
Logistic regression (LR) [36] is a traditional method for classification. Its parameters are optimized by maximizing the logistic likelihood function, and the fitted parameters are used to predict the class of new data.
To verify performance on real datasets, we apply the proposed method to Bayesian logistic regression (BLR) [21] and compare it with logistic regression (LR), variational Bayesian logistic regression (VBLR), and HMC.
The likelihood function of a two-class classification problem can be defined as
$$p(y \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^{\top}\mathbf{x})^{y}\left(1 - \sigma(\mathbf{w}^{\top}\mathbf{x})\right)^{1-y}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$
where $y$ represents the label of the data and $\sigma(\mathbf{w}^{\top}\mathbf{x})$ represents the predicted value. We obtain the class of a datum by integrating the logistic function over the posterior distribution.
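The corresponding log posterior that a dynamics sampler would target is sketched below; the zero-mean Gaussian prior with variance `prior_var` is an assumed setting for illustration.

```python
import numpy as np

def blr_log_posterior(w, X, y, prior_var=1.0):
    """Log posterior of Bayesian logistic regression: Bernoulli likelihood
    with a sigmoid link plus a zero-mean Gaussian prior on the weights.
    X has shape [n, d]; y takes values in {0, 1}."""
    logits = X @ w
    # Stable form of sum_i [y_i*log(sigma_i) + (1 - y_i)*log(1 - sigma_i)].
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.sum(w ** 2) / prior_var
    return log_lik + log_prior
```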
We evaluate our method on eight real-world datasets from the UCI repository [37] using Bayesian logistic regression: Pima Indian (PI), Haberman (HA), Mammographic (MA), Blood (BL), Cryotherapy (CR), Immunotherapy (IM), Indian (IN), and Diabetic (DI). The eight datasets are normalized to zero mean and unit variance. We place a Gaussian prior on the parameters. In each experiment, we run 10,000 iterations with 2000 burn-in samples. Leapfrog steps are drawn from a uniform distribution, the step size and mass matrix are fixed, and a uniform distribution serves as the variational distribution. We run each dataset 100 times to calculate the mean and standard deviation.
Results in terms of prediction accuracy and the area under the ROC curve (AUC) [38] are summarized in Table 1 and Table 2. The results show that on these eight datasets, VHMC achieves better performance in classification accuracy than in AUC. Compared with the other methods, VHMC outperforms HMC and performs similarly to VBLR, which indicates that the proposed method can sample the actual posterior distribution.