Dynamical Sampling with Langevin Normalization Flows

In Bayesian machine learning, sampling methods provide asymptotically unbiased estimates for inference over complex probability distributions, and Markov chain Monte Carlo (MCMC) is one of the most popular sampling methods. However, MCMC can suffer from high autocorrelation of samples or poor performance on some complex distributions. In this paper, we introduce Langevin diffusions into normalization flows to construct a new dynamical sampling method. We propose a modified Kullback-Leibler divergence as the loss function to train the sampler, which ensures that the samples generated by the proposed method converge to the target distribution. Because the gradient of the target distribution enters the calculation of the modified Kullback-Leibler divergence, its integral is intractable, so we approximate it with a Monte Carlo estimator. We also discuss the situation in which the target distribution is unnormalized. We illustrate the properties and performance of the proposed method on a variety of complex distributions and real datasets. The experiments indicate that the proposed method not only takes advantage of the flexibility of neural networks but also exploits the rapid convergence to the target distribution of the underlying dynamics, and it demonstrates superior performance compared with dynamics-based MCMC samplers.


Introduction
In machine learning, Bayesian inference [1] and Bayesian optimization [2], complex probabilistic models typically require the calculation of intractable, high-dimensional integrals. For example, for a classification task, we need to predict the class of instances. We assume that p(y*|x*, D) = ∫ p(y*|x*, θ) p(θ|D) dθ is the prediction model, where x* represents the instance, y* the class, D the data, p(y*|x*, θ) the likelihood function and p(θ|D) the posterior distribution. When the probabilistic model becomes complex, this integral is intractable. Generally, two kinds of methods are used to approximate the integral: Markov chain Monte Carlo (MCMC) [3,4] and variational inference (VI) [5,6]. MCMC is a powerful framework which is widely used to deal with complex and intractable probabilistic models [7][8][9]. MCMC methods approximate complex probability distributions with a large number of samples drawn iteratively from a Markov chain. They serve as a fundamental approach in probabilistic inference, providing asymptotically unbiased estimates for probabilistic models, while VI gives a deterministic approximation of the target distribution [10].
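As a concrete illustration (not from the paper), the predictive integral above can be approximated by averaging the likelihood over posterior samples. The 1-D logistic model and the synthetic "posterior" below are hypothetical stand-ins for p(y*|x*, θ) and p(θ|D):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D logistic model: p(y=1 | x, theta) = sigmoid(theta * x).
def likelihood(y, x, theta):
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    return p1 if y == 1 else 1.0 - p1

# Pretend these are posterior samples theta ~ p(theta | D),
# e.g. produced by an MCMC sampler.
posterior_samples = rng.normal(loc=1.5, scale=0.3, size=10_000)

# Monte Carlo estimate of p(y* | x*, D) = E_{p(theta|D)}[ p(y* | x*, theta) ].
x_star = 2.0
p_pred = np.mean([likelihood(1, x_star, t) for t in posterior_samples])
print(p_pred)
```

The quality of the estimate hinges entirely on how well the θ samples represent the posterior, which is exactly what samplers such as MCMC (and the method proposed here) aim to guarantee.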
Recent MCMC methods can be divided into two classes: slice sampling [11] and dynamical sampling [12,13]. The main problem of the slice sampler is that, when sampling from high-dimensional distributions, solving for the slice interval can be very difficult. Utilizing a dynamics system to construct an efficient Markov chain is therefore commonly employed [14][15][16]. Hamiltonian Monte Carlo (HMC) [14] is one of the dynamics-based methods, with attractive properties concerning rapid exploration of the state space and a high acceptance rate. HMC exploits Hamiltonian dynamics to construct an efficient Markov chain and has become increasingly popular in machine learning and statistics. Since HMC uses the gradient information of the target distribution, it can explore the state space much more efficiently than random-walk proposals [17], which ensures rapid convergence of the sampler. Because it conserves volume, HMC is able to propose large moves with a high acceptance rate.
HMC and its further developments [18][19][20][21][22][23] exploit the gradient information of the target distribution to explore the state space. Nevertheless, since the leapfrog step size is difficult to choose, neighboring samples are correlated and high autocorrelation may occur. Although enlarging the leapfrog step size can reduce this correlation, it wastes considerable computational resources. Moreover, these methods tend to fail when the target distributions are multi-modal [21,[24][25][26]. They usually fail to move from one mode to another because such a move requires passing through low-probability regions, where large boundary gradients prevent samplers from traveling between the modes. Therefore, designing an effective sampler for multi-modal distributions remains a significant challenge.
The disadvantages of current methods motivate us to design a powerful sampler that achieves both low autocorrelation and accurate estimation of the target distribution. In this paper, a new sampling method called Langevin normalization flows Monte Carlo (NFLMC) is proposed. We introduce Langevin diffusions into normalization flows (NFs) [27] to construct a new sampler. The main idea of this method is to train a variational distribution, whose parameters are determined by neural networks, to approximate the target distribution. With the idea of Langevin diffusions, we design new transformation functions for NFs which converge rapidly to the target distribution and approximate it more accurately. Since we exploit the gradient information of the target distribution, the integral in the Kullback-Leibler (KL) divergence is intractable, so we use a Monte Carlo estimator to calculate it. However, the KL divergence calculated by the Monte Carlo estimator may be negative during training, which would mislead the final results, so we propose a new loss function to train the NFLMC sampler.
The main contributions of this paper can be summarized as follows.
(1) We introduce Langevin diffusions into normalization flows to construct a novel Monte Carlo sampler. (2) We propose the modified KL divergence as the loss function to train the sampler, which ensures that the proposed method converges to the target distribution. (3) The proposed method achieves better performance in multi-modal sampling and on a variety of complex distributions. (4) Unlike MCMC samplers, the proposed method does not need a Metropolis-Hastings procedure [28] to correct the sampler. (5) A number of experiments verify the theoretical results and practical value. We apply the proposed method to a variety of distributions and to supervised classification tasks using Bayesian logistic regression. The proposed method is compared with state-of-the-art dynamics-based MCMC methods [24,29,30] in terms of autocorrelation and convergence speed. The experiments demonstrate that NFLMC has superior performance in sampling complex posterior distributions.
The rest of this article is organized as follows. In Section 2, we review the preliminary of our study, including the introduction of variational inference with normalization flows and Langevin diffusions. In Section 3, we introduce our Langevin normalization flows and describe the transformation functions. In Section 4, we propose the Langevin normalization flows Monte Carlo sampler. Experiments and analysis are given in Section 5. In Section 6, we conclude this paper and discuss the future work.

Normalization Flows
The normalization flows [27] were first introduced to deal with flexible and complex posterior distributions in the context of variational inference. They are a powerful approach to generate arbitrary posterior distributions through a sequence of invertible transformations. In other words, the initial density is transformed into a valid probability distribution by iteratively applying the normalization flows. Given the observed data x, the normalization flows start with an initial variable z_0 generated from a simple distribution q with an analytical probability density, and then repeatedly apply an invertible transformation function f_θ parameterized by θ. After a sequence of T iterations, a complex and flexible distribution of z_T is obtained:

z_T = f_θ^T ∘ f_θ^{T−1} ∘ · · · ∘ f_θ^1 (z_0). (1)

Since the Jacobian determinant of each transformation f_θ can be calculated, we can obtain the final distribution π_{u_T} through the following equation:

ln π_{u_T}(z_T) = ln q(z_0 | x) − Σ_{t=1}^{T} ln |det(∂f_θ^t / ∂z_{t−1})|. (2)
To make Equation (2) tractable, each transformation function f_θ should be carefully designed to satisfy two main properties. First, the transformation function f_θ should be easy to invert. Second, its Jacobian determinant should be tractable. We assume that z_0 comes from a simple distribution q(z_0|x) and z_T = f_θ(z_0). When calculating the probability of z_T in Equation (2), we need to compute the Jacobian determinant and use f_θ^{−1}(z_T) to recover z_0, so both properties are required. Generally, the invertible transformation function f_θ with known Jacobian determinant [27] is defined as:

f_θ(z_{t−1}) = z_{t−1} + m h(w^T z_{t−1} + b), (3)

where h(·) represents a nonlinear function, m = [m_1, m_2, . . . , m_n] and w = [w_1, w_2, . . . , w_n] are parameter vectors, b is a scalar and n is the dimension of the parameter vectors. So m h(w^T z_{t−1} + b) can be viewed as a multi-layer perceptron with one hidden layer and a single unit, which is demonstrated in Figure 1. Real-valued non-volume preserving (RNVP) [31] develops a new transformation function, which makes the model more flexible. The main idea of RNVP is to construct the normalization flows from coupling layers. Assume that x is the original D-dimensional variable. A coupling layer can be defined as:

y_{1:d} = x_{1:d},
y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}), (4)

where the function s represents the scale and t represents the translation; both are neural networks. RNVP provides a more powerful and flexible posterior distribution for density estimation.
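The RNVP coupling layer can be sketched in a few lines. The toy scale and translation functions s and t below are hypothetical stand-ins for the neural networks; the key point is that the layer is exactly invertible and its log-Jacobian-determinant is just the sum of the scale outputs:

```python
import numpy as np

def coupling_forward(x, d, s, t):
    """One RNVP coupling layer: keep x[:d], transform x[d:]."""
    y = x.copy()
    y[d:] = x[d:] * np.exp(s(x[:d])) + t(x[:d])
    # The Jacobian is triangular, so its log-determinant is sum(s(x[:d])).
    log_det = np.sum(s(x[:d]))
    return y, log_det

def coupling_inverse(y, d, s, t):
    x = y.copy()
    x[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))
    return x

# Toy scale/translation functions (stand-ins for trained networks).
s = lambda h: 0.5 * np.tanh(h)   # log-scale
t = lambda h: 0.1 * h            # shift

x = np.array([1.0, -2.0, 0.5, 3.0])
y, log_det = coupling_forward(x, 2, s, t)
x_back = coupling_inverse(y, 2, s, t)
print(np.allclose(x, x_back))  # True: the layer is exactly invertible
```

Stacking several such layers, alternating which half is frozen, yields the full RNVP flow.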

Langevin Diffusions
Langevin dynamics is a common method to model molecular dynamics systems. A D-dimensional Langevin diffusion is a time-indexed stochastic process x = (x_t), t ≥ 0, with stochastic sample paths, defined as a solution to a stochastic differential equation of the form:

dx_t = b(x_t) dt + σ(x_t) dW_t, (5)

where b(x) represents the drift vector, σ(x) represents the volatility matrix and W = (W_t), t ≥ 0, represents a standard Wiener process [38]. Equation (5) gives the evolution of a random variable under Langevin diffusions, but the evolution of the probability density function is described by the Fokker-Planck equation [39]. We assume that u(x, t) represents the evolving probability density function, x = [x_1, x_2, . . . , x_D]^T and V(x) = σ(x)σ(x)^T. We set b_i(x) to be the i-th entry of the vector b(x) and V_ij(x) to be the entry in the i-th row and j-th column of the matrix V(x). The Fokker-Planck equation can then be written as:

∂u(x, t)/∂t = −Σ_{i=1}^{D} ∂/∂x_i [b_i(x) u(x, t)] + (1/2) Σ_{i=1}^{D} Σ_{j=1}^{D} ∂²/(∂x_i ∂x_j) [V_ij(x) u(x, t)]. (6)

If u(x, t) = π_g(x) for all t, then this process is stationary and π_g can be viewed as the stationary distribution of the diffusion, which means that if x_t ∼ π_g(x), then x_{t+s} ∼ π_g(x) for all s > 0 [40]. A Langevin diffusion with stationary distribution π_g can be defined by the stochastic differential equation [23]:

dx_t = (1/2) ∇ ln π_g(x_t) dt + dW_t. (7)

This setting of b, σ and u(x, t) in Equation (7) makes ∂u/∂t = 0, which shows that the invariant measure of the Langevin diffusion is π_g(x) [40].
Generally, solving stochastic differential equations exactly is intractable. Since stochastic differential equations usually exhibit strong coupling and nonlinearity, it is difficult to obtain a closed-form expression for their solutions, so numerical discretization must be used to approximate them. Euler-Maruyama discretization [41] is one of the common approaches, which takes the form:

x_{t+1} = x_t + (ε²/2) ∇ ln π_g(x_t) + ε z_t, (8)

where z_t ∼ N(z|0, I) and ε represents the step size.
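A minimal sketch of the Euler-Maruyama update in Equation (8), assuming a standard-normal target so that ∇ ln π_g(x) = −x. This is the unadjusted Langevin algorithm: the chain converges to a slightly biased version of the target, with bias shrinking as ε decreases:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: standard 1-D Gaussian, pi_g(x) ∝ exp(-x^2 / 2),
# so grad log pi_g(x) = -x.
def grad_log_pi(x):
    return -x

eps = 0.3          # step size (epsilon in Equation (8))
x = 5.0            # deliberately bad initial state
samples = []
for _ in range(50_000):
    z = rng.standard_normal()
    # Euler-Maruyama step: x + (eps^2/2) * grad log pi + eps * z
    x = x + 0.5 * eps**2 * grad_log_pi(x) + eps * z
    samples.append(x)

samples = np.array(samples[5_000:])    # discard burn-in
print(samples.mean(), samples.var())   # ≈ 0 and ≈ 1, up to discretization bias
```

For this linear drift the discretized chain is an AR(1) process, so its stationary variance can be checked analytically: 1/(1 − ε²/4) ≈ 1.02 for ε = 0.3.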
It is noted that Langevin diffusions take advantage of the gradient information of the target distribution, which makes them explore the state space efficiently. Moreover, Langevin diffusions contain a Wiener process that can be viewed as a random walk, which helps to explore the state space extensively. The idea of Langevin diffusions is widely used in MCMC methods. The Metropolis adjusted Langevin algorithm (MALA) [40] is one such application. The main idea of MALA is to generate the proposed state through the Langevin diffusion of Equation (8). MALA exploits the Metropolis-Hastings correction [28] to satisfy detailed balance [42], which ensures that the samples converge to the target distribution. It is the gradient information of the target distribution that accelerates the convergence of the Markov chain to its stationary distribution. Although MALA does provide an efficient way for MCMC to sample from the target distribution, the autocorrelation among samples remains high.
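MALA as described above can be sketched as follows. The 1-D Gaussian target is an illustrative assumption; the two proposal-density terms implement the Metropolis-Hastings correction for the asymmetric Langevin proposal, which removes the discretization bias of the plain Euler-Maruyama chain:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_pi(x):        # unnormalized log-density of a 1-D target N(3, 1)
    return -0.5 * (x - 3.0) ** 2

def grad_log_pi(x):
    return -(x - 3.0)

eps = 0.8  # step size

def mala_step(x):
    # Langevin proposal (Equation (8)).
    mean_x = x + 0.5 * eps**2 * grad_log_pi(x)
    y = mean_x + eps * rng.standard_normal()
    # Metropolis-Hastings correction for the asymmetric Gaussian proposal.
    mean_y = y + 0.5 * eps**2 * grad_log_pi(y)
    log_q_xy = -0.5 * (y - mean_x) ** 2 / eps**2   # ln q(y | x), up to a constant
    log_q_yx = -0.5 * (x - mean_y) ** 2 / eps**2   # ln q(x | y), up to a constant
    log_alpha = log_pi(y) + log_q_yx - log_pi(x) - log_q_xy
    return y if np.log(rng.random()) < log_alpha else x

x, chain = 0.0, []
for _ in range(30_000):
    x = mala_step(x)
    chain.append(x)
post = np.array(chain[5_000:])
print(post.mean())  # ≈ 3.0
```

Because the accept/reject step makes the chain exactly invariant for π, MALA targets the distribution exactly, at the cost of correlated samples, which is the drawback the paper aims to remove.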
Since NFs provide a powerful and flexible posterior distribution for density estimation and MALA achieves rapid convergence to the target distribution, we combine their advantages to develop a new sampler with an appropriate training strategy, which can accurately sample from the target distribution with low autocorrelation.

Main Idea
Normalization flows [27,31] approximate the target distributions through a series of transformation functions. In order to approximate the target distributions efficiently and accurately, we utilize information about the target distributions. Exploiting the efficient exploration of Langevin diffusions, we propose a new normalization flow called Langevin normalization flows (NFL). We redesign the transformation functions around the gradient information of the target distribution, which helps us approximate the target distributions precisely and efficiently.
Constructing the Langevin normalization flows has to satisfy two primary conditions. The first one is that the update of each step of the transformation function should be approximately invertible. The second one is that the determinant of the Jacobian and the inverse Jacobian of the transformation function must be tractable. In this way, we can ensure that the distribution obtained through the flows is able to converge to the target distribution.
We now describe the details of the proposed transformation functions for a single Langevin step. We assume that x_{1:D} is the initial sample, where D is the dimension of the sample. We first update one half of the sample. The transformation functions are as follows:

y_{1:d} = x_{1:d},
y_{d+1:D} = (x_{d+1:D} + (ε²/2) · ∇U(x_{1:D})_{d+1:D} + ε · exp(σ(x_{1:d}))) ⊙ exp(S(x_{1:d})) + T(x_{1:d}), (9)

where σ(x) can be viewed as the Wiener process in Langevin diffusions, S(x) represents the logarithmic scale of the sample, which rescales its position, and T(x) is the shift of the sample.
σ(x), S(x) and T(x) are all controlled by neural networks, whose parameters are W_σ, W_S and W_T respectively. U is the energy function of the probability density function, and ε represents the step size of the Langevin diffusions. It is noted that in Equation (9), we first utilize Langevin diffusions to generate samples and then use neural networks to further adjust them. Since Equation (9) only updates x_{d+1:D} and y_{1:D} is an intermediate variable, x_{1:d} should be updated next. This takes the form:

z_{d+1:D} = y_{d+1:D},
z_{1:d} = (y_{1:d} + (ε²/2) · ∇U(y_{1:D})_{1:d} + ε · exp(σ(y_{d+1:D}))) ⊙ exp(S(y_{d+1:D})) + T(y_{d+1:D}), (10)

where z_{1:D} represents the final state obtained after applying the above transformation functions to y_{1:D}. The advantage of dividing x into two parts is that Equation (9) updates only x_{d+1:D} while Equation (10) updates only y_{1:d}. At the same time, the determinant of the Jacobian is tractable, which relies on the fact that:

∂y_{1:d}/∂x_{d+1:D} = 0 and ∂z_{d+1:D}/∂y_{1:d} = 0. (11)

Consequently, the Jacobian matrices of the two transformation functions are block-triangular (upper triangular and lower triangular, respectively), which simplifies the calculation of the Jacobian determinants: each determinant reduces to the determinant of a single diagonal block. In order to calculate the logarithmic probability of the transformed distribution, we need the inverse transformation functions and the inverse logarithmic Jacobian determinants. The logarithmic probability can be computed as:

ln π_u(z_{1:D}) = ln q(f_θ^{−1}(z_{1:D})) + ln |det(∂f_θ^{−1}/∂z_{1:D})|,

where q represents the initial distribution. The inverse transformation function f_θ^{−1} of Equation (10) takes the form:

y_{d+1:D} = z_{d+1:D},
y_{1:d} = (z_{1:d} − T(y_{d+1:D})) ⊙ exp(−S(y_{d+1:D})) − (ε²/2) · ∇U(y_{1:D})_{1:d} − ε · exp(σ(y_{d+1:D})). (12)

It is noted that Equation (12) is only approximately invertible. Since we introduce gradient information into the transformation functions, the exact inverse transformation function f_θ^{−1} is difficult to obtain. For instance, in Equation (12), z_{1:D} is known and we wish to use it to calculate y_{1:D}. Although we can easily obtain y_{d+1:D} from the first line of Equation (12), calculating y_{1:d} requires evaluating ∇U(y_{1:D})_{1:d}, which itself depends on y_{1:d}. Obtaining the closed-form solution of y_{1:d} = −(ε²/2) · ∇U(y_{1:D})_{1:d} + const is difficult, especially when the gradient function is complex, where const = (z_{1:d} − T(y_{d+1:D})) ⊙ exp(−S(y_{d+1:D})) − ε · exp(σ(y_{d+1:D})). In order to calculate y_{1:d}, we additionally introduce a variable t_1 into the inverse transformation: we set y_{1:D} in ∇U(y_{1:D})_{1:d} to be (t_1, y_{d+1:D}), where t_1 is computed without the gradient term, and finally we update y_{1:d} through ∇U((t_1, y_{d+1:D}))_{1:d}. The error of this approximation is (ε²/2)[∇U(y_{1:D})_{1:d} − ∇U((t_1, y_{d+1:D}))_{1:d}], which depends on the product of ε²/2 and ∇∇U((ξ, y_{d+1:D})), ξ ∈ (y_{1:d}, t_1). The same approach is exploited in the inversion of Equation (9), which recovers x_{1:D}:

x_{1:d} = y_{1:d},
x_{d+1:D} = (y_{d+1:D} − T(x_{1:d})) ⊙ exp(−S(x_{1:d})) − (ε²/2) · ∇U((x_{1:d}, t_2))_{d+1:D} − ε · exp(σ(x_{1:d})), (13)

where t_2 is the corresponding auxiliary variable computed without the gradient term. In order to calculate the logarithmic probability of the transformed distribution, we also have to compute the inverse logarithmic Jacobian determinants (Equation (14)); they are the negatives of the forward logarithmic Jacobian determinants, evaluated at the recovered states. In particular, we introduce Langevin diffusions into normalization flows to construct the transformation function. Since the Langevin diffusions exploit the gradient of the target distribution, the transformation function is able to explore the state space efficiently.
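A sketch of one forward Langevin-flow transformation under our reading of Equations (9) and (10). The drift is written with ∇ ln π (sign conventions for the energy U vary), and σ, S, T are fixed toy functions standing in for the trainable networks, so this illustrates only the structure of the half-updates, not a trained sampler:

```python
import numpy as np

rng = np.random.default_rng(3)
D, d, eps = 4, 2, 0.1

def grad_log_pi(x):            # standard-normal target: grad log pi(x) = -x
    return -x

# Toy stand-ins for the networks sigma, S, T (trainable in the real model).
sigma = lambda h: 0.1 * np.tanh(h)
S = lambda h: 0.2 * np.tanh(h)
T = lambda h: 0.05 * h

def langevin_flow_step(x):
    # First half-update (structure of Equation (9)): move x[d:] by a
    # Langevin-style drift, then rescale/shift it with functions of the
    # frozen half x[:d].
    y = x.copy()
    drift = 0.5 * eps**2 * grad_log_pi(x)[d:]
    y[d:] = (x[d:] + drift + eps * np.exp(sigma(x[:d]))) * np.exp(S(x[:d])) + T(x[:d])
    # Second half-update (structure of Equation (10)): symmetric update
    # of the other half, conditioned on the already-updated y[d:].
    z = y.copy()
    drift = 0.5 * eps**2 * grad_log_pi(y)[:d]
    z[:d] = (y[:d] + drift + eps * np.exp(sigma(y[d:]))) * np.exp(S(y[d:])) + T(y[d:])
    return z

x = rng.standard_normal(D)
out = langevin_flow_step(x)
print(out)
```

Each half-update leaves the conditioning half untouched, which is what makes the Jacobian block-triangular and its determinant cheap to evaluate.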
Hamiltonian dynamics introduces an auxiliary momentum variable to explore the state space efficiently. Through the exchange between potential energy and kinetic energy, the total energy remains unchanged. Since the change of state is tied to this transformation of energy, designing normalization flows based on Hamiltonian dynamics becomes complex, which we leave for future work.

Difference between Normalization Flows and Langevin Normalization Flows
There are two main differences between normalization flows and Langevin normalization flows. First, NFL incorporates Langevin diffusions to construct a more efficient and accurate approximation of the target distributions than standard normalization flows. Second, when approximating the target distribution, normalization flows are trained to minimize KL(q|p), where q represents the approximating distribution and p represents the target distribution. Since their transformation functions are exactly invertible, the integral in KL(q|p) can be calculated precisely. For NFL, however, the transformation functions in Equations (12) and (13) are only approximately invertible because of the use of the gradient information of the target distribution. Since the precise value of KL(q|p) cannot be obtained through integration, Monte Carlo estimation is used to calculate KL(q|p).

Dynamical Sampling Using Langevin Normalization Flows
Probabilistic inference involving multi-modal distributions is very difficult for dynamics-based MCMC samplers. Besides, samples generated from these samplers are still highly autocorrelated. In order to solve these problems, we develop a new Monte Carlo sampler using Langevin normalization flows, called Langevin normalization flows Monte Carlo (NFLMC). Given the target distribution and the initial distribution, NFLMC learns the parameters of the transformation that converts the initial distribution into the target distribution. In the following subsections, we describe the main idea of the method and how it works, and then give the loss function of the training procedure and the algorithm. When the value of the loss function converges, NFLMC can precisely sample from the target distribution.

Main Idea
The procedure of NFLMC is elaborated here. Assume that the target distribution is denoted π_t, the initial distribution π_q, θ represents the parameters of the transformation functions, π_u represents the transformation distribution and Ls represents the Langevin step length. First, we generate N samples X = {x^{(t)}}_{t=0}^{N}, x ∈ R^D, from π_q and initialize the parameters θ of the transformation functions. For each sample x ∈ X, the update equation takes the form:

x_{d+1:D} ← x_{d+1:D} + (ε²/2) · ∇U(x_{1:D})_{d+1:D} + ε · z, z ∼ N(z|0, I), (15)

where ε is the step size of the Langevin diffusions and ∇U(x_{1:D})_{d+1:D} is the gradient of the energy function of the target distribution. We apply Equation (15) Ls times to update x_{d+1:D}. It is noted that Equation (15) is similar to Equation (8). After applying Ls steps of Langevin diffusions, we rescale x_{d+1:D} through Equation (9) and obtain y_{1:D}, which is a half-update of x_{1:D}. We then update x_{1:d}, which takes the form:

x_{1:d} ← x_{1:d} + (ε²/2) · ∇U(x_{1:D})_{1:d} + ε · z, z ∼ N(z|0, I). (16)

We likewise apply Equation (16) Ls times to update x_{1:d}. After that, we rescale x_{1:d} through Equation (10) and finally obtain z_{1:D} = f_θ(x_{1:D}), where f_θ denotes the whole transformation function. We now have samples z_{1:D} ∼ π_u. To bring π_u close to the target distribution, we optimize the parameters θ of f_θ by minimizing KL(π_u|π_t). Since the integral in KL(π_u|π_t) is intractable for the Langevin normalization flow, we use a Monte Carlo estimator:

KL(π_u|π_t) ≈ (1/N) Σ_{i=1}^{N} [ln π_u(z^{(i)}_{1:D}) − ln π_t(z^{(i)}_{1:D})], z^{(i)}_{1:D} ∼ π_u. (17)

As Equation (17) shows, we need the samples generated from π_u and the probability of each sample to calculate the loss function. Since we already have z_{1:D} generated from π_u, we only need to calculate the logarithmic probability ln π_u(z_{1:D}), which takes the form:

ln π_u(z_{1:D}) = ln π_q(f_θ^{−1}(z_{1:D})) + ln |det(∂f_θ^{−1}/∂z_{1:D})|, (18)

where f_θ^{−1} is the inverse transformation function, which can be calculated through Equations (12) and (13).
Since the update of x_{1:D} is divided into two parts, ln |det(∂f_θ^{−1}/∂z_{1:D})| decomposes into the sum of the inverse logarithmic Jacobian determinants of the two half-updates, where y_{1:d}, t_1 and t_2 can be calculated through Equations (12) and (13). However, while optimizing Equation (17), we find that the estimated KL divergence may not be strictly non-negative because of the Monte Carlo integral, so we introduce a new objective function to overcome this problem. The details are discussed in the next subsection.
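The Monte Carlo estimator of Equation (17) can be checked on a case where the KL divergence is known in closed form; both densities here are illustrative Gaussians, not the paper's flow distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo estimate of KL(pi_u | pi_t) ≈ (1/N) Σ [ln pi_u(z_i) - ln pi_t(z_i)],
# with z_i ~ pi_u.  Both densities are Gaussians, so the exact KL is known.
def log_gauss(x, mu, var):
    return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))

mu_u, var_u = 0.5, 1.0          # pi_u
mu_t, var_t = 0.0, 1.0          # pi_t
z = rng.normal(mu_u, np.sqrt(var_u), size=200_000)
kl_mc = np.mean(log_gauss(z, mu_u, var_u) - log_gauss(z, mu_t, var_t))

kl_exact = 0.5 * (mu_u - mu_t) ** 2 / var_t   # closed form for equal variances
print(kl_mc, kl_exact)   # both ≈ 0.125
```

With many samples the estimator is accurate, but its fluctuations around the true value are exactly what can push it below zero near convergence, motivating the modified loss of the next subsection.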

Loss Function of the Training Procedure
Having described the transformation functions of the Langevin normalization flows, we need a criterion that ensures the final transformation distribution π_u converges to the target distribution π_t. In order to train the parameters θ that control the functions σ, S and T, we minimize KL(π_u|π_t) as the loss function, using Monte Carlo sampling to approximate the integral in the KL divergence. Although the KL divergence is non-negative in theory, its Monte Carlo estimate may be negative. In that case, minimizing Equation (17) keeps driving the loss lower and the transformation distribution will not converge in the correct direction. To address this problem, we propose a new loss function, defined as follows:

L_{π_u→π_t} = ∫ π_u(x) ln²(π_u(x)/π_t(x)) dx = E_{π_u}[ln²(π_u(x)/π_t(x))]. (21)

Since every term ln²(π_u(x^{(i)})/π_t(x^{(i)})) is non-negative, the Monte Carlo estimate of Equation (21) is non-negative as well, and it attains its minimum of zero exactly when π_u = π_t, so minimizing it achieves the same purpose as minimizing the KL divergence.
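The motivation for the squared-logarithm loss can be seen numerically: on small minibatches the Monte Carlo KL estimate frequently dips below zero, while the squared version is non-negative by construction. This toy check (not from the paper) uses two nearly identical Gaussians:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_gauss(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

# pi_u = N(0.1, 1), pi_t = N(0, 1): true KL = 0.005, i.e. nearly converged.
negatives = 0
for _ in range(1000):
    z = rng.normal(0.1, 1.0, size=50)                 # small minibatch
    log_ratio = log_gauss(z, 0.1) - log_gauss(z, 0.0)
    kl_mc = np.mean(log_ratio)                        # can dip below zero
    loss_mc = np.mean(log_ratio ** 2)                 # squared-log loss: >= 0
    negatives += kl_mc < 0
    assert loss_mc >= 0.0
print(negatives)  # a substantial fraction of minibatches give a negative KL estimate
```

A gradient step that "improves" a negative KL estimate is pushing in an arbitrary direction; the squared loss removes this failure mode while keeping the same minimizer π_u = π_t.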

Unnormalized Probability Distributions
In Bayesian machine learning, we generally need to sample from the posterior distribution to approximate a complex probabilistic model. Since p(θ|D) ∝ p(D|θ)p(θ), the posterior is available only as an unnormalized distribution, so we discuss unnormalized probability distributions in this section. We assume that the unnormalized distribution p_unt(x) equals π_t(x)Z, where Z is the true normalization constant and π_t is the normalized probability density function. Applying Equation (21) with p_unt in place of π_t, we observe that:

L_{π_u→p_unt} = ∫ π_u(x) ln²(π_u(x)/(π_t(x)Z)) dx = ∫ π_u(x) ln²(π_u(x)/π_t(x)) dx − 2 ln Z · ∫ π_u(x) ln(π_u(x)/π_t(x)) dx + ln² Z. (22)

The cross term in Equation (22) can be simplified using:

∫ π_u(x) ln(π_u(x)/π_t(x)) dx = KL(π_u|π_t). (23)

Since ln² Z is a constant, minimizing L_{π_u→p_unt} is equivalent to minimizing ∫ π_u(x) ln²(π_u(x)/π_t(x)) dx − 2 ln Z · KL(π_u|π_t). Since the KL divergence is non-negative, if Z > 1, then −2 ln Z · KL(π_u|π_t) is negative, and minimizing it amounts to maximizing KL(π_u|π_t), which can mislead the direction of the optimization.
To solve this problem, we introduce a scale parameter γ as an estimate of Z and divide it out of the unnormalized density, replacing p_unt(x) with p_unt(x)/γ. The loss function can then be written as:

L = ∫ π_u(x) ln²(γ π_u(x)/p_unt(x)) dx = ∫ π_u(x) ln²(π_u(x)/π_t(x)) dx − 2 ln(Z/γ) · E_{π_u}[ln(π_u(x)/π_t(x))] + ln²(Z/γ). (24)

As Equation (24) shows, the loss is composed of three terms. The first term is the same as Equation (21), the second is the scaling term and the last is a constant. If γ = Z, we recover Equation (21) exactly. If γ is chosen so that the scaling term −2 ln(Z/γ) · E_{π_u}[ln(π_u/π_t)] is non-negative, this not only ensures that the loss function optimizes in the right direction but also retains the information of the KL divergence. In addition, the parameter γ controls the weight placed on the KL-divergence term.
Furthermore, it is noted that the gradient of the loss function with respect to θ is:

∇_θ L_{π_u→π_t} = E_{π_u}[2 ln(π_u(x)/π_t(x)) · ∇_θ ln(π_u(x)/π_t(x))], (25)

so, for each sample x^{(i)}, the gradient is the KL-divergence gradient term rescaled by the factor 2 ln(π_u(x^{(i)})/π_t(x^{(i)})). The loss gradient can thus be viewed as a rescaling of the gradient of the KL divergence, which supports the correctness of the loss function. The complete algorithm is given in Algorithm 1.

Algorithm 1 Training NFLMC
Input: target distribution π t , step size , learning rate β, scale parameter γ, Langevin step length Ls, number of iterations K iters , sample number N, the initial distribution π q , the transformation distribution π u , the energy function U, the gradient of energy function ∇U and the second order gradient ∇∇U. Output: the parameters θ = (W σ , W S , W T ) of the sampler.
Initialize the parameters θ of the neural networks.
for k = 1 to K_iters do
  Sample N samples from the proposal distribution π_q.
  Transform each sample through f_θ (Equations (15), (9), (16) and (10)) to obtain z ∼ π_u.
  Compute the loss of Equation (24) using the inverse transformations (Equations (12) and (13)).
  Update θ by gradient descent with learning rate β.
end for
In practice, there are several important points about the implementation of Algorithm 1. First, the proposal distribution π_q should be a simple distribution that is easy to analyze; we suggest using a Gaussian distribution. Second, the number of samples N should be set to a large value; in our experiments, we set N = 8000. Third, the scale parameter γ can be estimated through importance sampling: γ ≈ (1/N) Σ_{x∼q} Zπ_t(x)/q(x), where Zπ_t represents the unnormalized target distribution and q(x) represents the proposal distribution. We have built a demo program which is available at: https://github.com/Emcc81/NFLMC.
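The importance-sampling estimate of γ can be sketched as follows, with a hypothetical unnormalized target whose true constant is known so the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(6)

# Unnormalized target p_unt(x) = Z * pi_t(x), with pi_t = N(0, 1) and Z = 0.3.
Z_true = 0.3
p_unt = lambda x: Z_true * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Proposal q = N(0, 2^2); importance-sampling estimate of the constant:
#   Z = E_q[ p_unt(x) / q(x) ]
x = rng.normal(0.0, 2.0, size=200_000)
q = np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
gamma = np.mean(p_unt(x) / q)
print(gamma)  # ≈ 0.3
```

The proposal is deliberately wider than the target so the importance weights are bounded, which keeps the variance of the estimate low.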

Applicability of NFLMC
In this section, we demonstrate the performance of NFLMC. We present a detailed analysis of our trained sampler on a variety of target distributions. First, we compare the proposed sampler with RNVP and HMC on five different distributions: the ring (a ring-shaped density), the ill-conditioned Gaussian, the strongly correlated Gaussian, the Gaussian funnel and the rough well. After that, we present the results on two multi-modal distributions. Finally, we demonstrate the results on a machine learning task: Bayesian logistic regression.
All our experiments are conducted on a standard computer with eight Nvidia RTX2080Ti GPUs. Each layer of the neural networks has 512 nodes with ReLU as the activation function, and the networks have 3 layers. The number of Langevin steps is set to 2 to 5. The number of transformation functions is set to 8. The learning rate is set to 0.05 and the maximum number of iterations to 10,000. We estimate the scale parameter γ through importance sampling. We now introduce the performance indices used in the following parts.
Effective sample size: the variance of a Monte Carlo sampler is determined by its effective sample size (ESS) [14], defined as:

ESS = N / (1 + 2 Σ_{s=1}^{M} ρ(s)),

where N represents the total number of samples, M is set to 30 in our experiments and ρ(s) represents the s-step autocorrelation. Autocorrelation is an index that measures the correlation between two samples. Let X be a set of samples and t the iteration number (an integer), so that X_t is the sample of X at time t. The autocorrelation between times s and t is defined as:

R(s, t) = E[(X_t − µ_t)(X_s − µ_s)] / (σ_t σ_s),

where E is the expectation operator. Autocorrelation measures the correlation between nearby samples: if the autocorrelation is high, the samples are far from independent, and vice versa. Maximum mean discrepancy: the difference between samples drawn from two distributions can be measured by the maximum mean discrepancy (MMD) [43], defined as:

MMD²(X, Y) = (1/M²) Σ_{i,j} k(x_i, x_j) − (2/(MN)) Σ_{i,j} k(x_i, y_j) + (1/N²) Σ_{i,j} k(y_i, y_j),

where M represents the number of samples in X, N the number of samples in Y and k the kernel function. Through MMD, we can analyze the convergence speed of the proposed methods.
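The indices above can be computed as follows; the AR(1) chain and the shifted sample set are synthetic examples for checking that ESS drops for correlated chains and MMD grows for mismatched distributions (the Gaussian kernel bandwidth is an assumption):

```python
import numpy as np

def autocorr(x, s):
    """Sample autocorrelation of a 1-D chain at lag s."""
    x = np.asarray(x) - np.mean(x)
    return np.dot(x[:-s], x[s:]) / np.dot(x, x) if s > 0 else 1.0

def ess(x, M=30):
    """Effective sample size: N / (1 + 2 * sum_{s=1}^{M} rho(s))."""
    return len(x) / (1.0 + 2.0 * sum(autocorr(x, s) for s in range(1, M + 1)))

def mmd2(X, Y, bw=1.0):
    """Biased squared-MMD estimate with a Gaussian kernel."""
    k = lambda A, B: np.exp(-np.sum((A[:, None, :] - B[None, :, :]) ** 2, -1)
                            / (2 * bw**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(7)
iid = rng.standard_normal(5_000)        # independent draws
ar = np.empty(5_000)                    # AR(1) chain: highly autocorrelated
ar[0] = 0.0
for t in range(1, 5_000):
    ar[t] = 0.9 * ar[t - 1] + np.sqrt(1 - 0.81) * rng.standard_normal()
print(ess(iid), ess(ar))                # ~5000 vs far fewer effective samples

X = rng.standard_normal((300, 2))
Y = rng.standard_normal((300, 2))       # same distribution as X
Y_shift = Y + 2.0                       # mismatched distribution
print(mmd2(X, Y), mmd2(X, Y_shift))     # near zero vs clearly larger
```

Both quantities are computed purely from samples, which is why they are usable as black-box diagnostics for any of the samplers compared in this section.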

Varieties of Unimodal Distributions
Since RNVP performs well in density estimation, we utilize the loss function proposed in Equation (21) to train RNVP to sample from the target distribution. This kind of method is called the naive normalization flows Monte Carlo (NNFMC). We then compare the NFLMC with NNFMC and HMC on convergence rate and autocorrelation, respectively. In each experiment, we set the same learning rate for NFLMC and NNFMC. The initial distributions are all set to be the standard normal distribution. We next introduce the distributions used in the experiment.
Ring: the ring-shaped target density, whose energy function is available in analytic form. Ill-conditioned Gaussian: a Gaussian distribution with diagonal covariance whose entries are spaced log-linearly between 10^{−2} and 10^{2}.
Strongly correlated Gaussian: we rotate a diagonal Gaussian with variances [10², 10^{−2}] by π/4. This is an extreme version of an example from Brooks [14].
Gaussian funnel: we conduct our experiment on a 2-D funnel, whose energy function takes the form:

U(x) = x_1²/(2σ²) + x_2²/(2 exp(x_1)) + (1/2) ln(2π · exp(x_1)),

and we set σ = 1.0. As Figure 2 illustrates, our method performs better on all these distributions in terms of convergence rate. On the ill-conditioned Gaussian, the rough well, the Gaussian funnel and the strongly correlated Gaussian with µ = [0, 0], NFLMC converges fast, which indicates that the Langevin diffusions do help the normalization flows find the correct direction. On the strongly correlated Gaussian with µ = [10, 10], NNFMC is unable to converge to the target distribution, since the loss remains high during training. It is the gradient information of the target distribution that helps NFLMC converge to the target distribution rapidly.
On the ring-shaped distribution, NNFMC fluctuates significantly during training, while NFLMC converges rapidly, which shows the stability of NFLMC. Since the loss of NNFMC fluctuates strongly, we carefully tune the learning rate for NNFMC. As Figure 3 illustrates, NFLMC converges to the target distribution more quickly than NNFMC. Besides, NNFMC has a large error while NFLMC is able to sample from the target distribution precisely.
So what causes the large fluctuation of NNFMC? We believe that the lack of strong guidance when exploring the state space makes it difficult for NNFMC to converge. Since the initial distribution is a standard normal distribution, samples from the initial distribution are far from the samples of the ring-shaped distribution, so it is challenging for NNFMC to explore the state space and the value of the loss function fluctuates strongly. In order to verify this, we enlarge the radius of the ring distribution and observe that NNFMC fails to sample from this distribution while NFLMC can still converge to the target distribution. As Figure 3 shows, NNFMC cannot find the target distribution, while NFLMC still performs well, since NFLMC utilizes the gradient information of the target distribution.
We then compare our method with HMC in terms of autocorrelation on five different distributions. Figure 4 demonstrates that NFLMC obtains lower autocorrelation, which indicates that NFLMC overcomes this defect of MCMC samplers. HMC (0.05) and HMC (0.1) denote the HMC sampler with different step sizes.
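The autocorrelation metric used in comparisons like this is the standard sample autocorrelation of a chain at increasing lags; a minimal sketch for a 1-D chain (function name is ours):

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    chain = np.asarray(chain, dtype=float)
    centered = chain - chain.mean()
    var = centered @ centered  # unnormalized variance; cancels in the ratio
    return np.array([
        centered[: len(chain) - k] @ centered[k:] / var
        for k in range(max_lag + 1)
    ])
```

For an MCMC sampler, slow decay of this curve means successive samples are highly dependent, so the effective sample size is far below the nominal chain length; the lower curves reported for NFLMC correspond to more nearly independent samples.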

Mixtures of Gaussian Distributions
We conduct our second experiment on two multi-modal distributions, namely two simple 2-D mixtures of Gaussian distributions (MOG) whose probability density functions are analytically available. First, we consider a MOG whose modes have the same probability, and then a MOG whose modes have different probabilities and a larger separation. The first distribution is defined as p(x) = (1/2) N(x|µ, I) + (1/2) N(x|−µ, I), where µ = (2.5, −2.5). The second distribution is defined as p(x) = 0.88 N(x|µ, I) + 0.12 N(x|−µ, I), where µ = (4, −4). The experiment settings are the same as those of Tripuraneni et al. [24]. The purpose of the experiment is to draw i.i.d. samples from these multi-modal distributions correctly.
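Both target densities are two-component isotropic Gaussian mixtures, so they can be evaluated in closed form; a minimal sketch (function name is ours):

```python
import numpy as np

def mog_density(x, mu, weight):
    """Density of p(x) = w * N(x|mu, I) + (1 - w) * N(x|-mu, I) in 2-D."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    norm = 1.0 / (2 * np.pi)  # normalizer of N(., I) in 2 dimensions
    comp = lambda m: norm * np.exp(-0.5 * np.sum((x - m) ** 2))
    return weight * comp(mu) + (1 - weight) * comp(-mu)
```

The first target corresponds to `mog_density(x, (2.5, -2.5), 0.5)` and the second to `mog_density(x, (4, -4), 0.88)`; with µ = (4, −4) the modes are separated by roughly 11.3 standard deviations, which is what makes mode-hopping hard for dynamics-based samplers.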
We compare HMC [14], MHMC [24], MGHMC [30] and NICE-MC [29] against NFLMC. First, we compare the MMD of these methods, and then the averaged autocorrelation is used to compare the performance of each method further. Each MCMC method is run 32 times for 20,000 iterations with 11,000 burn-in samples. The number of leap-frog steps is uniformly drawn from (100 − l, 100 + l) with l = 20, as suggested by Livingstone et al. [45]. We set the step size to 0.05 and the initial position to x = (0, 0). The initial distribution for NFLMC is a Gaussian with µ = [0, 0] and diag(σ) = [2, 2]. As Figure 5 illustrates, NFLMC obtains excellent performance compared with MHMC, HMC and MGHMC in terms of MMD and autocorrelation. In addition, NFLMC has a smaller variance of MMD than NICE-MC. When it comes to autocorrelation, NICE-MC fluctuates strongly, while NFLMC remains steady, which demonstrates the stability of NFLMC.

We then discuss the circumstance in which the modes are far from each other and have different probabilities, i.e., when µ in the MOG becomes larger, for instance µ = (4, −4). In Hamiltonian dynamics, there exists a significant force in the low-probability regions that hinders samplers from jumping out of the current mode. In other words, the gradients in the boundary regions are tremendous, and the momentum keeps decreasing until it changes direction, which makes it challenging for HMC and MHMC to sample from the target distribution. So we compare NFLMC with parallel HMC and NICE-MC. The scatter diagrams of parallel HMC and NFLMC are shown in Figure 6. We observe that parallel HMC can sample from the multi-modal distribution, but it cannot precisely estimate the probability of each mode: under parallel HMC, the two modes appear to have the same probability, whereas the true probabilities are π_1 = 0.12 and π_2 = 0.88.
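The MMD comparison above measures the discrepancy between the sampler's output and ground-truth samples from the target. A minimal sketch of the (biased) squared-MMD estimator with an RBF kernel; the kernel and bandwidth here are illustrative assumptions, since this excerpt does not specify the paper's exact kernel:

```python
import numpy as np

def mmd_rbf(x, y, bandwidth=1.0):
    """Biased squared-MMD estimate between sample arrays x, y of shape (n, d)."""
    def kernel(a, b):
        # pairwise squared distances via broadcasting, then RBF kernel values
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```

The estimate is near zero when the two sample sets come from the same distribution and grows as they diverge, which is why it is a natural figure of merit for comparing samplers against ground-truth draws.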
As Figure 7 illustrates, compared with NICE-MC, NFLMC converges quickly to the target distribution while attaining lower autocorrelation. NFLMC takes advantage of the neural networks to explore the phase space, which results in this good performance.

Bayesian Logistic Regression
Logistic regression (LR) [46] is a traditional method for classification. By maximizing the logistic likelihood function, we obtain the optimized parameters, through which we can predict the class of the data. Bayesian logistic regression [47] is also a classic model for classification, which takes the logistic sigmoid function as the likelihood. For two-class classification, the likelihood function is defined as p(t|w) = ∏_{n=1}^{N} y_n^{t_n} [1 − y_n]^{1−t_n}, where t = (t_1, . . . , t_N) and y_n = p(C_1|φ_n) = σ(w^T φ_n). Here t_n denotes the category of the n-th data point and y_n the probability that it belongs to class C_1. By integrating the logistic function against the posterior distribution, we obtain the class of the data. However, this integral is often intractable.
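The Bernoulli likelihood above is what a sampler must evaluate (in log form) at each proposed w; a minimal, numerically stable sketch, where `Phi` is the N×D design matrix of features φ_n (names are ours):

```python
import numpy as np

def log_likelihood(w, Phi, t):
    """log p(t|w) = sum_n [ t_n * log(y_n) + (1 - t_n) * log(1 - y_n) ],
    with y_n = sigmoid(w^T phi_n), computed via logaddexp for stability."""
    a = Phi @ w
    # log(sigmoid(a)) = -logaddexp(0, -a);  log(1 - sigmoid(a)) = -logaddexp(0, a)
    return np.sum(t * (-np.logaddexp(0.0, -a)) + (1 - t) * (-np.logaddexp(0.0, a)))
```

Using `logaddexp` avoids the overflow/underflow that a naive `log(sigmoid(a))` suffers for large |a|, which matters when a sampler explores the tails of the posterior.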

Discussion and Conclusions
In this study, we propose Langevin normalization flows and develop Langevin normalization flows Monte Carlo, a novel scalable sampling algorithm that exploits the flexibility of neural networks and the efficient exploration of Langevin diffusions. We design an appropriate loss function to train the sampler, ensuring that it converges to the target distribution. We also discuss unnormalized probability distributions and propose an appropriate loss function for them. The experiments conducted on synthetic and real datasets suggest that our method is able to sample from the target distributions precisely and independently.
Although HMC has various advantages, it is difficult to design a model based on HMC, because the auxiliary momentum variable must be carefully handled in the transformation function of NFs. In the future, we plan to design a neural network sampler based on Hamiltonian dynamics.

Conflicts of Interest:
The authors declare no conflict of interest.