Abstract
Langevin dynamics (LD) has been extensively studied theoretically and practically as a basic sampling technique. Recently, the incorporation of non-reversible dynamics into LD has been attracting attention because it accelerates the mixing speed of LD. Popular choices for non-reversible dynamics include underdamped Langevin dynamics (ULD), which uses second-order dynamics, and perturbations with skew-symmetric matrices. Although ULD has been widely used in practice, the application of skew acceleration remains limited, even though it is theoretically expected to show superior performance. Current work lacks a theoretical understanding of issues that are important to practitioners, including the selection criteria for skew-symmetric matrices, quantitative evaluations of acceleration, and the large memory cost of storing skew matrices. In this study, we theoretically and numerically clarify these problems by analyzing acceleration with a focus on how the skew-symmetric matrix perturbs the Hessian matrix of potential functions. We also present a practical algorithm that accelerates standard LD and ULD using novel memory-efficient skew-symmetric matrices under parallel-chain Monte Carlo settings.
1. Introduction
Sampling is one of the most widely used techniques for approximating posterior distributions in Bayesian inference []. Markov chain Monte Carlo (MCMC) is widely used to obtain samples. Among MCMC methods, Langevin dynamics (LD) is a popular choice for sampling from high-dimensional distributions. Each sample in LD moves in the gradient direction with added Gaussian noise. LD efficiently explores around a mode of a target distribution using the gradient information without being trapped by local minima, thanks to the added Gaussian noise. Many previous studies have theoretically and numerically demonstrated LD's superior performance [,,,]. Since non-reversible dynamics generally improves mixing performance [,], research on introducing non-reversible dynamics into LD for better sampling performance is attracting attention [].
There are two widely known non-reversible dynamics for LD. One is underdamped Langevin dynamics (ULD) [], which uses second-order dynamics. The other introduces a perturbation that multiplies the gradient by a skew-symmetric matrix []. Here, we refer to such matrices as skew matrices for simplicity and to this perturbation technique as skew acceleration. Much theoretical research has been done on ULD [,,], and ULD is widely used in practice, where it is also known as stochastic gradient Hamiltonian Monte Carlo []. In contrast, the application of skew acceleration to standard Bayesian models is quite limited, even though it is theoretically expected to show superior performance [].
For example, skew acceleration has been analyzed focusing on sampling from Gaussian distributions [,,,,], although assuming Gaussian distributions in Bayesian models is restrictive in practice. A recent study [] theoretically showed that skew acceleration accelerates the dynamics around the local minima and saddle points for non-convex functions. Another work [] clarified that the skew acceleration theoretically and numerically improves mixing speed when used as interactions between chains in parallel sampling schemes for non-convex Bayesian models.
Compared to ULD, what seems to be lacking for skew acceleration is a theoretical understanding of issues that are important to practitioners. The most significant problem is that no theory exists for selecting skew matrices. In existing studies, introducing a skew matrix into LD results in equal or faster convergence, meaning that a bad choice of skew matrix may yield no acceleration at all. Thus, choosing appropriate skew matrices is critical. Furthermore, although ULD's acceleration has been analyzed quantitatively, existing studies have only analyzed skew acceleration qualitatively. Thus, it is difficult to justify the usefulness of skew acceleration in practice compared to ULD. Another issue is that introducing skew matrices requires a vast memory cost in many practical Bayesian models.
The purpose of this study is to solve these problems from theoretical and numerical viewpoints and establish a practical algorithm for skew acceleration. The following are the two major contributions of this work.
Our contribution 1: We present a convergence analysis of skew acceleration for standard Bayesian model settings, including non-convex potential functions using Poincaré constants []. The major advantage of Poincaré constants is that we can analyze skew acceleration through a Hessian matrix and its eigenvalues and develop a practical theory about the selection of J and the quantitative assessment of skew acceleration.
Furthermore, we propose skew acceleration for ULD and present convergence analysis for the first time. Since ULD shows faster convergence than LD, combining skew acceleration with ULD is promising.
Our contribution 2: We develop a practical skew accelerated sampling algorithm for a parallel sampling setting with novel memory-efficient skew matrices. Since a naive implementation of skew acceleration requires a large memory cost to store skew matrices, memory-efficiency is critical in practice. We also present a non-asymptotic theoretical analysis for our algorithm in both LD and ULD settings under a stochastic gradient and Euler discretization. We clarify that introducing skew matrices accelerates the convergence of continuous dynamics, although it increases the discretization and stochastic gradient error. Then to the best of our knowledge, we propose the first algorithm that adaptively controls this trade-off using the empirical distribution of the parallel sampling scheme.
Finally, we verify our algorithm and theory in practical Bayesian problems and compare it with other sampling methods.
Notations: denotes an identity matrix. Capital letters such as X represent random variables, and lowercase letters such as x represent non-random real values. ·, and denote Euclidean inner products, distances, and absolute values, respectively.
2. Preliminaries
In this section, we briefly introduce the basic settings of LD and non-reversible dynamics for the posterior distribution sampling in Bayesian inference.
2.1. LD and Stochastic Gradient LD
First, we introduce the notations and the basic settings of LD and stochastic gradient LD (SGLD), which is a practical extension of LD. Here denotes a data point in space , denotes the total number of data points, and corresponds to the parameters of a given model, from which we want to sample. Our goal is to sample from the target distribution with density , where the potential function is the summation of , i.e., . Function is continuous and non-convex. The explicit assumptions made for it are discussed in Section 3.1. The SGLD algorithm [,] is given as a recursion:
where is a step size, is a standard Gaussian random vector, is a temperature parameter of , and is a conditionally unbiased estimator of the true gradient . This unbiased estimate is suitable for large-scale data sets since, instead of the full gradient, we can use a stochastic version obtained from a randomly chosen subset of the data at each time step, which reduces the computational cost of the gradient evaluation.
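As a concrete illustration, the SGLD recursion above can be sketched in a few lines. The quadratic potential, step size, and all other parameter values below are illustrative choices, not taken from the paper.

```python
import numpy as np

def sgld_step(x, stoch_grad, h, beta=1.0, rng=None):
    """One SGLD update: x <- x - h * g(x) + sqrt(2h / beta) * N(0, I),
    where g is any conditionally unbiased estimator of grad U (in
    practice, a minibatch gradient)."""
    rng = rng if rng is not None else np.random.default_rng()
    return x - h * stoch_grad(x) + np.sqrt(2.0 * h / beta) * rng.standard_normal(x.shape)

# Illustrative target N(0, 1): U(x) = x^2 / 2, so grad U(x) = x.
rng = np.random.default_rng(0)
x = np.array([5.0])
for _ in range(5000):
    x = sgld_step(x, lambda z: z, h=0.01, rng=rng)
# After many steps, x is approximately a draw from the stationary N(0, 1).
```

In a real model, `stoch_grad` would evaluate the gradient of the potential on a random minibatch of data rather than the full gradient.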
2.2. Poincaré Inequality and Convergence Speed
In sampling, we are interested in the convergence speed to the stationary measure. The speed is often characterized by the generator associated with Equation (2), defined as:
where denotes a standard Laplacian on and and denote the domain. This is a self-adjoint operator with a discrete spectrum (eigenvalues). with has a spectral gap if the smallest eigenvalue of (other than 0) is positive. We refer to it as . This spectral gap is closely related to the Poincaré inequality. Internal energy is defined:
Please note that is satisfied. Then, with satisfies the Poincaré inequality with constant c if, for any ,
The spectral gap characterizes this constant , for which holds (see Appendix A.2 for details). We refer to the best constant c as the Poincaré constant []. For notational simplicity, we define and refer to this as the Poincaré constant.
In sampling, crucially, Poincaré inequality dominates the convergence speed in divergence:
where denotes the measure at time t induced by Equation (2) and is the initial measure (see Appendix A.3 for details). Thus, the larger the Poincaré constant is, the faster the convergence.
2.3. Non-Reversible Dynamics
In this section, we introduce the non-reversible dynamics. with is reversible if for any test function , with satisfies
If this is not satisfied, with is non-reversible [].
We introduce two non-reversible dynamics for LD. The first is ULD, which is given as
where is an auxiliary random variable, is a positive constant, and is the variance of the stationary distribution of the auxiliary random variable V. The stationary distribution is , where denotes a Gaussian distribution. The superior performance of ULD compared with LD has been studied rigorously [,,]. ULD's convergence speed is also characterized by the Poincaré constant []. In practice, we use discretization and the stochastic gradient for ULD, which is called stochastic gradient Hamiltonian Monte Carlo (SGHMC) []. The second non-reversible dynamics is the skew acceleration given as
where J is a real-valued skew matrix and is a positive constant. We call this dynamics S-LD. The stationary distribution of S-LD is still , and S-LD shows faster convergence and smaller asymptotic variance [,,,].
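A minimal sketch of one Euler-Maruyama step of S-LD, assuming the common form of the perturbed drift, -(I + γJ)∇U(x), with J skew-symmetric; the 2-d Gaussian target, the particular J, and all constants below are illustrative.

```python
import numpy as np

def sld_step(x, grad_u, J, gamma, h, beta=1.0, rng=None):
    """One Euler-Maruyama step of S-LD with perturbed drift
    -(I + gamma * J) grad U(x); for skew-symmetric J the stationary
    measure is unchanged."""
    rng = rng if rng is not None else np.random.default_rng()
    drift = -(np.eye(len(x)) + gamma * J) @ grad_u(x)
    return x + h * drift + np.sqrt(2.0 * h / beta) * rng.standard_normal(x.shape)

# Illustrative 2-d Gaussian target: U(x) = ||x||^2 / 2, grad U(x) = x.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # rotation-like skew matrix
assert np.allclose(J.T, -J)
rng = np.random.default_rng(1)
x = np.array([3.0, -3.0])
for _ in range(2000):
    x = sld_step(x, lambda z: z, J, gamma=0.5, h=0.01, rng=rng)
```

The skew term rotates the drift around level sets of the potential; it adds no work beyond one extra matrix-vector product per step.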
3. Theoretical Analysis of Skew Acceleration
In this section, we present a theoretical analysis of skew acceleration in LD and ULD in standard Bayesian settings. We analyze acceleration through the Poincaré constant and connect it with the eigenvalues of the Hessian matrix, which allows us to obtain a practical criterion for choosing skew matrices and to quantitatively evaluate acceleration. In this section, we focus on the setting where a continuous SDE and the full gradient of the potential function are used. The discretized SDE and stochastic gradient are discussed in Section 4.
3.1. Acceleration Characterization by the Poincaré Constant
First, we introduce the same four assumptions as a previous work [], which showed the existence of the Poincaré constant for LD (see Appendix C for details).
Assumption 1.
(Upper bound of the potential function at the origin) Function u takes nonnegative real values and is twice continuously differentiable on , and constants A and B exist such that for all ,
Assumption 2.
(Smoothness) Function u has a Lipschitz continuous gradient; a positive constant M exists such that for all ,
Assumption 3.
(Dissipative condition) Function u satisfies the (m,b)-dissipative condition; and exist such that for all ,
Assumption 4.
(Initial condition) Initial probability distribution of has a bounded and strictly positive density , and for all ,
Please note that these assumptions allow us to consider the non-convex potential functions, which are common in practical Bayesian models. Furthermore, we make the following assumption about J.
Assumption 5.
The operator norm of J is bounded:
This means that the largest singular value of J is below 1.
Under these assumptions, we present the convergence behavior of skew acceleration using the Poincaré constant. First, we present the following S-LD result.
Theorem 1.
Under Assumptions 1–5, the S-LD of Equation (9) has exponential convergence,
where is the measure at time t induced by S-LD and is the Poincaré constant of S-LD defined by its generator
Furthermore, satisfies .
The proof is shown in Appendix C. This theorem states that introducing the skew matrix accelerates the convergence of LD by improving the convergence rate from to . Although [] obtained a similar result, we use the Poincaré constant and derive an explicit criterion for when holds, as discussed below.
Next, we also introduce skew acceleration in ULD. Since ULD shows faster convergence than LD in standard Bayesian settings [,], it is promising to combine skew acceleration with ULD to obtain a more efficient sampling algorithm. For that purpose, we propose the following SDE:
where and are real-valued skew matrices and and are positive constants. We assume that and satisfy Assumption 5. We refer to this method as skew underdamped Langevin dynamics (S-ULD), whose stationary distribution is . See Appendix B for details, including discussions of other combinations of skew matrices. For S-ULD, we need an additional assumption about the initial condition of :
Assumption 6.
(Initial condition) Initial probability distribution of has a bounded and strictly positive density that satisfies,
We then provide the following convergence theorem that resembles S-LD.
Theorem 2.
Under Assumptions 1–3, 5, and 6, S-ULD has exponential convergence in divergence, and its convergence rate is also characterized by as defined in Theorem 1. S-ULD's convergence speed equals or exceeds that of ULD, whose convergence rate is characterized by .
See Appendix C.2 for details. From these theorems, we confirm that skew acceleration is effective for both S-LD and S-ULD, and the convergence speed is characterized by the Poincaré constant defined by Equation (16).
3.2. Skew Acceleration from the Hessian Matrix
Our goal is to clarify which choices of J induce , which leads to acceleration. Therefore, we discuss how the Poincaré constant is connected to the eigenvalues and eigenvectors of the perturbed Hessian matrix . Next, we introduce the notations. We express the Hessian of as and the perturbed Hessian matrix as . Please note that H is a real symmetric matrix, which has real eigenvalues and is diagonalizable. On the other hand, since is not symmetric, it may have complex eigenvalues, and diagonalizability is not assured (see Appendix E). We express the pairs of eigenvectors and eigenvalues of as , ordered as . Here, denotes the real part of a complex value and denotes the imaginary part. We express those of as and order them as .
3.2.1. Strongly Convex Potential Function
Assume that U is an m-strongly convex function; that is, for all , holds. The Poincaré constant of LD satisfies []. For skew acceleration, the Poincaré constant satisfies , where is the best constant that satisfies for all x (see Appendix D.1). Therefore, studying the Poincaré constant is equivalent to studying the smallest (real part of the) eigenvalue of the Hessian matrix. Thus, the relation between and must be studied. The following theorem describes how the skew matrix changes the smallest eigenvalue.
Theorem 3.
For all , the real parts of the eigenvalues of satisfy
The condition of is shown in Remark 1.
Remark 1.
Denote the set of the eigenvectors of eigenvalue as . If and , then holds. If the cardinality of set is larger than 1, and vectors exist, such that and , then holds.
Refer to Appendix F for the proof. This is an extension of previous work [,]. If is satisfied for all x, we have , i.e., acceleration occurs. We discuss how to construct J such that holds in Section 3.3.
3.2.2. Non-Convex Potential Function
The previous work [] clarified that the Poincaré constant of a non-convex function is characterized by the negative eigenvalue at a saddle point. As shown in Figure 1, denote as the global minimum and as the local minimum that has the second smallest value in . We express the saddle point with index one (i.e., with exactly one negative eigenvalue) between and as . This means that the eigenvalues of satisfy . Ref. [] clarified that the saddle point characterizes the Poincaré constant as
When skew matrices are introduced, [] clarified the following relation:
Theorem 4.
([]) and equality holds only if .

Figure 1.
Double-potential example: Poincaré constant is related to the eigenvalue at .
Note that is not a complex number. Thus, skew acceleration reduces the negative eigenvalue, leads to a larger Poincaré constant (see Appendix D.2), and results in faster convergence.
In conclusion, introducing the skew matrix changes the Hessian's eigenvalues and increases the Poincaré constant. If is satisfied, this leads to faster convergence for both convex and non-convex potential functions.
3.3. Choosing J
In this section, we present a method for choosing J that leads to , ensuring acceleration, based on the equality conditions in Theorems 3 and 4. Combining these theorems, we obtain the following criterion:
Remark 2.
Given a point x, holds if either of the following conditions is satisfied: when , is satisfied; when , holds for any , and for any , and are not satisfied.
The first condition is easily satisfied if we choose J such that . On the other hand, the second condition is difficult to verify since H and its eigenvalues and eigenvectors generally depend on the current position . Instead of evaluating the eigenvalues and eigenvectors of H and directly, we use the random matrix property shown in the next theorem.
Theorem 5.
Suppose the upper triangular entries of J follow a probability distribution that is absolutely continuous with respect to the Lebesgue measure. If is satisfied, then given a point , holds with probability 1.
The proof is given in Appendix G.1. From this theorem, we simply generate J from some probability distribution, such as the Gaussian distribution. Then, we check whether holds. If it does not hold, we generate a new random matrix J.
The above theorem is valid only at a given evaluation point x. We can extend it to all points along the path of the discretized dynamics (see Appendix G.3). With this procedure, we can theoretically ensure that acceleration occurs with probability one for the discretized dynamics.
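A sketch of this sampling procedure, under the hypotheses of Theorem 5 (upper-triangular entries drawn from a distribution that is absolutely continuous with respect to the Lebesgue measure, here Gaussian) and Assumption 5 (operator norm below 1); the 1.01 rescaling safety factor is an illustrative choice.

```python
import numpy as np

def random_skew(d, rng):
    """Draw a skew-symmetric J with i.i.d. Gaussian upper-triangular
    entries (as in Theorem 5), then rescale so its operator norm is
    below 1 (Assumption 5)."""
    A = np.triu(rng.standard_normal((d, d)), k=1)
    J = A - A.T                        # skew-symmetry: J.T == -J
    s = np.linalg.norm(J, ord=2)       # largest singular value
    return J / (1.01 * s) if s > 0 else J

rng = np.random.default_rng(0)
J = random_skew(4, rng)
assert np.allclose(J.T, -J)            # skew-symmetric
assert np.linalg.norm(J, ord=2) < 1.0  # Assumption 5 holds
```

If the acceleration condition fails at the current point, the procedure simply draws a fresh J with another call to `random_skew`.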
3.4. Quantitative Evaluation of the Acceleration
So far, we have discussed skew acceleration qualitatively but not quantitatively. Although acceleration’s quantitative evaluation is critical for practical purposes, to the best of our knowledge, no existing work has addressed it. In this section, we present a formula that quantitatively assesses skew acceleration by analyzing the eigenvalues of the Hessian matrix.
Theorem 6.
With the identical notation as in Theorem 3, for all x, we have
In particular, at saddle point , we have
The proofs are shown in Appendix H. Focusing on Equation (22), if is a strongly convex function, then for all , holds and the second term in Equation (22) is positive. From this, holds. A similar relation holds for . In Equation (23), holds. Thus, the changes of the Poincaré constants are proportional to . With these formulas, we can quantitatively evaluate the acceleration. We present numerical experiments confirming our theoretical findings in Section 6.1.
4. Practical Algorithm for Skew Acceleration
In this section, we discuss skew acceleration in more practical settings compared to Section 3. First, we discuss the memory issue for storing J and the discretization of SDE and the stochastic gradient, which are widely used techniques in Bayesian inference. Finally, we present a practical algorithm for skew acceleration.
4.1. Memory Issue of Skew Acceleration and Ensemble Sampling
For d-dimensional Bayesian models, we need memory space to store the skew matrix J, which is difficult for high-dimensional models. Instead of storing J, we could randomly generate a new J at each time step following Theorem 5. However, we experimentally confirmed that using a different J at each step does not accelerate convergence (see Section 6). Thus, we need to use a fixed J across iterations.
As discussed below, we found that the previously proposed accelerated parallel sampling [] can serve as a practical algorithm that resolves this memory issue. In that method, we simultaneously update N samples of the model's parameters with correlation. Because a correlation exists among the multiple Markov chains, such a parallel sampling scheme is more efficient than a naive parallel-chain MCMC, where the chains are independent. We express the n-th sample at time t as and the joint state of all samples at time t as . We express the joint stationary measure as and the sum of the potential functions as . We then consider the following dynamics:
We call this dynamics skew parallel LD (S-PLD); it couples N independent parallel LD chains (PLD) through the skew matrix. Since each chain in PLD is independent of the others, the Poincaré constant of PLD is also . Ref. [] argued that the Poincaré constant of S-PLD, , satisfies . This means that S-PLD shows faster convergence than PLD. As discussed in Section 3.2, these Poincaré constants are characterized by the smallest eigenvalues of the Hessian matrices and , where . We denote these smallest eigenvalues as and . As discussed in Section 3.2, acceleration occurs if is satisfied.
The authors of [] did not specify how to choose J, and a naive construction of J requires a large memory cost. To reduce the memory cost, we propose the following skew matrix:
where is a skew matrix and ⊗ is a Kronecker product. We then have the following lemma:
Lemma 1.
If is generated based on Theorem 5 and is satisfied, then given a point , J does not satisfy the equality condition in Theorems 3,4, which means with probability 1.
See Appendix G.2 for the proof. Thus, from this lemma, we only need to prepare and store , which requires memory that does not depend on d. In practical settings, this is a significant reduction of the memory size, since the number of parallel chains is typically much smaller than the dimension of the model. Please note that we can also ensure acceleration with this J.
Lemma 2.
Under Assumptions 1–5, assume J satisfies the condition of Lemma 1. Then S-PLD shows
where is the measure at time t induced by S-PLD, and is the initial measure defined as the product measure of .
See Appendix I.1 for the proof. Thus, by Lemma 2, S-PLD converges faster than PLD. We also consider the ensemble version of ULD (parallel ULD (PULD)) and its skew-accelerated version:
where and are real-valued skew-symmetric matrices, and and are positive constants and . We refer to this dynamics as skew PULD (S-PULD), whose faster convergence can be assured similarly to Lemma 2, as shown in Appendix I.2.
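The exact form of the proposed memory-efficient skew matrix is elided above; a natural reading consistent with Lemma 1 is a Kronecker product J = J_N ⊗ I_d with an N × N skew matrix J_N, which couples the chains while acting as the identity within each chain. Under that assumption, J can be applied without ever forming the Nd × Nd matrix:

```python
import numpy as np

def apply_kron_skew(J_N, G):
    """Apply J = J_N kron I_d to the stacked gradient without
    materializing the (N d) x (N d) matrix. G has shape (N, d): row n is
    the gradient at chain n. Under row-major stacking,
    (J_N kron I_d) vec(G) = vec(J_N @ G), so only the N x N matrix J_N
    is ever stored."""
    return J_N @ G                     # O(N^2 d) time, O(N^2) extra memory

# Check against the explicit Kronecker product on a tiny example.
rng = np.random.default_rng(0)
N, d = 3, 2
A = np.triu(rng.standard_normal((N, N)), 1)
J_N = A - A.T                          # N x N skew matrix
G = rng.standard_normal((N, d))
full = np.kron(J_N, np.eye(d)) @ G.reshape(-1)
assert np.allclose(full, apply_kron_skew(J_N, G).reshape(-1))
```

Since (J_N ⊗ I_d)ᵀ = J_Nᵀ ⊗ I_d = -(J_N ⊗ I_d), the Kronecker product of a skew matrix with the identity is itself skew-symmetric, as the construction requires.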
4.2. Discussion of the Discretization of SDE and Stochastic Gradient and Practical Algorithm
In this section, we further consider practical settings for S-PLD and S-PULD. We discretize these continuous dynamics, e.g., by the Euler–Maruyama method, and approximate the gradient by the stochastic gradient. Although introducing skew matrices accelerates the convergence of the continuous dynamics, it simultaneously increases the discretization and stochastic gradient error, resulting in a trade-off. We present a practical algorithm that controls this trade-off.
4.2.1. Trade-off Caused by Discretization and Stochastic Gradient
We consider the following discretization and stochastic gradient for S-PLD and S-PULD:
and
where is a standard Gaussian random vector and is an unbiased estimator of the gradient . We refer to Equation (29) as skew-SGLD and to Equation (30) as skew-SGHMC. For skew-SGHMC, we dropped of S-PULD to decrease the number of parameters, as shown in Appendix B. Please note that skew-SGLD is identical to the previous dynamics []. We introduce an assumption about the stochastic gradient:
Assumption 7.
(Stochastic gradient) There exists a constant such that
Given a Lipschitz test function f, we approximate by skew-SGLD or skew-SGHMC with the estimator . The bias of skew-SGLD is upper-bounded as follows.
Theorem 7.
Under Assumptions 1–7, for any and any obeying and , we have
where and depend on the constants of Assumptions 1–7; for the details, see Appendix J.
We present a tighter bias bound in Section 4.3 under a stronger assumption. We can show a similar upper bound for skew-SGHMC using the same proof strategy. This bound resembles a previous one []; ours shows improved dependency on . The previous results of [] are also limited to LD and do not cover skew-SGHMC.
Please note that corresponds to the discretization and stochastic gradient error and corresponds to the convergence behavior of S-PLD, i.e., the continuous dynamics. Since , skew acceleration increases the discretization and stochastic gradient error. On the other hand, since , the convergence of the continuous dynamics is accelerated. Thus, skew acceleration causes a trade-off. When is sufficiently small, we derive the explicit dependency of this trade-off on from an asymptotic expansion. Using the quantitative evaluation of skew acceleration in Theorem 6, we obtain
where to are positive constants obtained by the asymptotic expansion. See Appendix K for the details. In the above expression, and correspond to and of Equation (32). Thus, by choosing appropriate , we can control the trade-off.
4.2.2. Practical Algorithm Controlling the Trade-off
Since calculating the optimal that minimizes Equation (33) at each step is computationally demanding, we adaptively tune the value of by measuring the acceleration with the kernelized Stein discrepancy (KSD) []. Our idea is to update samples under different and and compare the KSD between the stationary and empirical distributions at these different interaction strengths. Here, is a small increment of . We denote the samples at the th step obtained by Equation (29) as (or by Equation (30) as ). We denote the samples obtained by replacing the above with as . We denote the KSD between the measure of and the stationary measure as and estimate the differences of the empirical KSD:
where KSD is estimated by
where l denotes a kernel; we use an RBF kernel. If , the empirical distribution of is closer to the stationary distribution than that of ; thus, we should increase the interaction strength from to . If , we decrease it to . We also update to , where . The overall process is shown in Algorithm 1. Detailed discussions of the algorithm, including how to select , , and c, are shown in Appendix L.
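A sketch of the empirical KSD estimate with an RBF kernel, written as the standard Stein-kernel V-statistic; this is one common way to estimate KSD, not necessarily the paper's exact implementation, and the bandwidth and sample sizes below are illustrative.

```python
import numpy as np

def ksd_rbf(samples, score, sigma=1.0):
    """V-statistic estimate of the kernelized Stein discrepancy between
    the empirical measure of `samples` (shape (n, d)) and a target given
    through its score s(x) = grad log p(x), with the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    n, d = samples.shape
    S = np.array([score(x) for x in samples])        # (n, d) scores
    diff = samples[:, None, :] - samples[None, :, :] # (n, n, d) pairwise x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)                  # (n, n) squared distances
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Stein kernel u_p(x_i, x_j), term by term:
    t1 = (S @ S.T) * K                                     # s_i . s_j k
    t2 = np.einsum('id,ijd->ij', S, diff) / sigma ** 2 * K   # s_i . grad_y k
    t3 = -np.einsum('jd,ijd->ij', S, diff) / sigma ** 2 * K  # s_j . grad_x k
    t4 = (d / sigma ** 2 - sq / sigma ** 4) * K              # tr(grad_x grad_y k)
    return np.sqrt(max((t1 + t2 + t3 + t4).mean(), 0.0))

# Samples near the target N(0, 1) give a smaller KSD than shifted samples.
rng = np.random.default_rng(0)
good = rng.standard_normal((200, 1))
score = lambda x: -x                  # score of the standard Gaussian
assert ksd_rbf(good, score) < ksd_rbf(good + 5.0, score)
```

Because KSD only needs the score at each sample, the gradients already computed for the sampler update can be re-used, which is why the comparison between interaction strengths is cheap.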
Algorithm 1 Tuning |
Input: Output:
|
Finally, we present Algorithm 2, which describes the whole process. We update the value of once every steps. Please note that its computational cost is not much larger than that of Equation (30). We only calculate the eigenvalues of J once, which requires . The calculation of the different KSDs is computationally inexpensive since we can re-use the gradient, which is the most computationally demanding part.
Algorithm 2 Proposed algorithm |
Input: Output: |
4.3. Refined Analysis for the Bias of Skew-SGLD
When using a constant step size for skew-SGLD, the bound in Theorem 7 becomes meaningless since the first term of Equation (32) diverges. Here, following [], we present a tighter bound for the bias of skew-SGLD under a stronger assumption.
Theorem 8.
Under Assumptions 1–7, for any and any obeying and , we have
where
and constants and depend on the constants of Assumptions 1–7. Moreover, satisfies . For the details, see Appendix M.
The proof is shown in Appendix M. Please note that even if we use a constant step size for skew-SGLD, the bound in Theorem 8 does not diverge. Here we need a stronger assumption about the step size compared to Theorem 7. From Equation (37), the convergence behavior is characterized by , and the bias bound becomes smaller as becomes larger. From the definition of , the larger is, the larger we obtain. Thus, as we have seen so far, introducing the skew matrix leads to a larger Poincaré constant, and this in turn leads to a larger .
Previous work [] clarified that if is sufficiently small, introducing skew matrices improves the Poincaré constant by a constant factor, which means that we have , where depends on the eigenvectors and eigenvalues of the generator . On the other hand, from Theorem 8, for any , to achieve a bias smaller than , it suffices to run skew-SGLD for at least iterations using an appropriate step size h, under the assumption that and are small enough (see Appendix M.2 for details). Combining these observations, introducing skew matrices into SGLD improves the computational complexity by a constant factor. Our numerical experiments show that even a constant-factor improvement results in faster convergence in practical Bayesian models.
5. Related Work
In this section, we discuss the relationship between our method and other sampling methods.
5.1. Relation to Non-Reversible Methods
As discussed in Section 1, our work extends the existing analysis of non-reversible dynamics [,] and presents a practical algorithm. Compared to those previous works, we focus on the practical setting of Bayesian sampling and derive an explicit condition on J for acceleration. We also derive a formula to quantitatively evaluate skew acceleration based on the asymptotic expansion of the eigenvalues of the perturbed Hessian matrix. A previous work [], which derived the optimal skew matrices when the target distribution is Gaussian, requires a large computational cost to derive them, and it is unclear whether it works for non-convex potential functions. In contrast, our construction method for skew matrices is simple, computationally cheap, and applicable to general Bayesian models.
Our work analyzes skew acceleration for ULD, which is more effective than LD in practical problems; other works [,] only analyzed skew acceleration for LD. A previous work [] combined a non-reversible drift term with ULD. Unlike our method, that work's purpose was to reduce the asymptotic variance of the expectation of a test function, and it mainly focuses on sampling from Gaussian distributions.
To the best of our knowledge, our work is the first to focus on the memory issue of skew acceleration and develop a memory-efficient skew matrix for ensemble sampling. Our work also presents an algorithm that controls the trade-off for the first time. Another work [] identified the trade-off and handled it by cross-validation, which is computationally inefficient, unfortunately.
Finally, we point out an interesting connection between our skew-SGHMC and the magnetic HMC (M-HMC) []. M-HMC accelerates HMC's mixing time by introducing a "magnetic" term into the Hamiltonian. That magnetic term is expressed by special skew matrices. Although a previous work [] argued that M-HMC is numerically superior to standard HMC, its theoretical properties remain unclear. Thus, our framework can be used to analyze the theoretical behavior of M-HMC.
5.2. Relation to Ensemble Methods
Our proposed algorithm is based on ensemble sampling []. Ensemble sampling, in which multiple samples are simultaneously updated with interaction, has been attracting attention numerically and theoretically because of improvements in memory size, computational power, and parallel processing computation schemes []. There are successful, widely used ensemble methods, including SVGD [] and SPOS [], with which we compare our proposed method numerically in Section 6. Although both show numerically good performance, it is unclear how the interaction term theoretically accelerates convergence, since they are formulated as a McKean–Vlasov process, whose non-linear dynamics complicate establishing a finite-sample convergence rate. Our algorithm is an extension of another work [], where the interaction is composed of a skew-acceleration term and can be rigorously analyzed. Compared to that previous work [], we analyze skew acceleration with a focus on the Hessian matrix, develop practical algorithms, as discussed in Section 4.2, and derive the explicit condition under which acceleration occurs, which was unclear in [].
Another difference from SPOS, SVGD, and [] is that they use first-order methods, whereas our approach also uses a second-order method. Little work has been done on ensemble sampling for second-order dynamics. Recently, a second-order ensemble method was proposed [] based on gradient flow analysis. Although the method showed good numerical performance, its theoretical properties for finite samples remain unclear since the scheme is a finite-sample approximation of the gradient flow. In contrast, our proposed method is a valid sampling scheme with a non-asymptotic guarantee.
6. Numerical Experiments
The purpose of our numerical experiments is to confirm the acceleration of our algorithm proposed in Section 4 in various commonly used Bayesian models including Gaussian distribution (toy data), latent Dirichlet allocation (LDA), and Bayesian neural net regression and classification (BNN). We compared our algorithm’s performance with other ensemble sampling methods: SVGD, SPOS, standard SGLD, and SGHMC. In all the experiments, the values and the error bars are the mean and the standard deviation of repeated trials. For all the experiments we set and for SGHMC and Skew-SGHMC. As for the hyperparameters of our proposed algorithm, the selection criterion is discussed in Appendix L.
6.1. Toy Data Experiment
The target distribution is the multivariate Gaussian distribution, where we generated and each element of is drawn from the standard Gaussian distribution. The dimension of the target distribution is , and we approximate it by 20 samples using the proposed ensemble methods. We tested these toy data because the LD for this target distribution is the well-known Ornstein–Uhlenbeck process, whose theoretical properties have been studied extensively, e.g., []. Thus, by studying the convergence behavior on these toy data, we can understand our proposed method more clearly.
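To make the setup concrete, the following is a minimal sketch (not the paper's experimental code) of a discretized skew-LD chain on a Gaussian target. The quadratic potential, the random skew matrix J, and all numerical values (d, gamma, eta, step counts) are illustrative assumptions; the update is the Euler–Maruyama discretization of dX = -(I + γJ)∇U(X)dt + √2 dW.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
# Hypothetical quadratic potential U(x) = 0.5 * x^T A x, i.e., a Gaussian target
B = rng.standard_normal((d, d))
A = B @ B.T / d + np.eye(d)        # well-conditioned positive definite matrix

# Random skew-symmetric perturbation J (J^T = -J) and strength gamma (assumed values)
M = rng.standard_normal((d, d))
J = (M - M.T) / np.sqrt(d)
gamma, eta, n_steps = 0.5, 1e-3, 5000

P = np.eye(d) + gamma * J          # skew-perturbed drift matrix
x = np.full(d, 5.0)                # start far from the mode at the origin
for _ in range(n_steps):
    # Euler-Maruyama step of dX = -(I + gamma*J) grad U(X) dt + sqrt(2) dW
    x = x - eta * P @ (A @ x) + np.sqrt(2 * eta) * rng.standard_normal(d)

# The chain should now be near the mode; the stationary scale is about sqrt(tr(A^-1))
print(np.linalg.norm(x))
```

Because the added drift γJ∇U is divergence-free, the chain still targets exp(-U); only the speed of approach changes.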
First, we confirmed how the skew-symmetric matrix affects the eigenvalues of the Hessian matrix, as discussed in Section 3, where we only showed the asymptotic expansion for the smallest real part of the eigenvalues and saddle point. Here we can show a similar expansion for the largest real part:
holds.
Then we observed how the largest and smallest real parts of the eigenvalues of depend on . The results are shown in Figure 2, where we averaged 10 trials over a randomly generated J with fixed A. The upper-left, upper-right, and lower figures show , , and . These behaviors are consistent with Theorem 3. When is small, the behavior is close to the quadratic function proved in Theorem 3.

Figure 2.
Eigenvalue changes (averaged over ten trials).
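The confinement of the perturbed spectrum can also be checked directly. The snippet below (an illustrative check with assumed matrix sizes, not the paper's code) verifies that for a random symmetric positive definite H and random skew-symmetric J, the real parts of the eigenvalues of (I + γJ)H stay within [λ_min(H), λ_max(H)] for every tested γ, consistent with Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
B = rng.standard_normal((d, d))
H = B @ B.T + np.eye(d)          # symmetric positive definite "Hessian"
M = rng.standard_normal((d, d))
J = M - M.T                      # random skew-symmetric matrix

w = np.linalg.eigvalsh(H)
lam_min, lam_max = w[0], w[-1]
tol = 1e-6 * lam_max             # numerical tolerance for eigenvalue error
for gamma in [0.0, 0.1, 1.0, 10.0]:
    eig = np.linalg.eigvals((np.eye(d) + gamma * J) @ H)
    # Real parts of the perturbed spectrum are confined to [lam_min, lam_max]
    assert eig.real.min() >= lam_min - tol
    assert eig.real.max() <= lam_max + tol
print("real parts confined for all tested gamma")
```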
Next, we observed the convergence behavior of skew-SGLD and skew-SGHMC. We measured the convergence by maximum mean discrepancy (MMD) [] between the empirical and stationary distributions. For MMD, we used 2000 samples for the target distribution, and we used the Gaussian kernel whose bandwidth is set to the median distance of these 2000 samples. We used gradient descent (GD), with step size . The results are shown in Figure 3. The proposed method shows faster convergence than naive parallel sampling, which is consistent with Table 2.

Figure 3.
Convergence behavior of toy data in MMD (averaged over ten trials).
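For reference, a self-contained version of this kind of MMD evaluation (Gaussian kernel with median-distance bandwidth) might look as follows. The sample sizes and the shift of the `far` set are illustrative choices, not the experiment's values.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth):
    # Biased (V-statistic) squared MMD estimate between sample sets X and Y
    return (gaussian_kernel(X, X, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean())

rng = np.random.default_rng(2)
target = rng.standard_normal((2000, 2))   # samples representing the target
# Median heuristic: bandwidth = median pairwise distance of the target samples
d2 = ((target[:, None, :] - target[None, :, :]) ** 2).sum(-1)
h = np.sqrt(np.median(d2[d2 > 0]))

close = rng.standard_normal((200, 2))     # samples drawn from the target
far = rng.standard_normal((200, 2)) + 3.0 # shifted samples
print(mmd2(close, target, h), mmd2(far, target, h))
```

The biased estimator is nonnegative and shrinks as the empirical distribution approaches the target, which is how the convergence curves in Figure 3 are read.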
6.2. LDA Experiment
We tested with an LDA model using the ICML dataset [] following the same setting as []. We used 20 samples for all the methods, a minibatch size of 100, and step size . First, we confirmed the effectiveness of our proposed Algorithm 1, which adaptively tunes values. For that purpose, we compared the final performance obtained by our methods with a previous method [], in which is selected by cross-validation (CV). Here, instead of CV, we simply fixed during the sampling; we refer to this as fixed . We also tested the case when J is generated randomly at each step with fixed , as discussed in Section 4.1. We refer to it as random J. The results for skew-SGLD are shown in Figure 4. We found that our method showed performance competitive with the best fixed . For the computational cost, we used in Algorithm 2, and our method needed twice the wall-clock time of each fixed run. This means that our algorithm greatly reduces the total computational time, since CV over fixed requires trying more than two values of . We also found that using a different J at each step did not accelerate convergence; thus, J must be stored and kept fixed during the sampling for acceleration. Next, we compared our method with other ensemble sampling schemes and observed the convergence speed. The result is shown in Figure 5. Skew-SGLD and skew-SGHMC outperformed SGLD and SGHMC, which is consistent with our theory.

Figure 4.
Final performances of LDA under different values of (averaged over ten trials).

Figure 5.
LDA experiments (Averaged over 10 trials).
6.3. BNN Regression and Classification
We tested with the BNN regression task using the UCI dataset [], following the setting of Liu and Wang []. We used a one-hidden-layer neural network model with ReLU activation and 100 hidden units, 10 samples for all the methods, a minibatch size of 100, and step size . The results are shown in Table 1 and Table 2. We also tested on the BNN classification task using the MNIST dataset; the result is shown in Figure 6. We used a one-hidden-layer neural network model with ReLU activation and 100 hidden units, a batch size of 500, and step size . Our proposed methods outperformed the other ensemble methods. Please note that skew-SGHMC and skew-SGLD consistently outperformed SGHMC and SGLD.

Table 1.
Benchmark results on test RMSE for regression task.

Table 2.
Benchmark results on test negative log likelihood for regression task.

Figure 6.
MNIST classification (Averaged over ten trials).
7. Conclusions
We studied skew acceleration for LD and ULD from practical viewpoints, concluded that the improved eigenvalues of the perturbed Hessian matrix cause acceleration, and derived the explicit condition for acceleration. We described a novel ensemble sampling method that couples multiple SGLD or SGHMC chains with memory-efficient skew matrices. We also proposed a practical algorithm that controls the trade-off between faster convergence and larger discretization and stochastic gradient error, and numerically confirmed its effectiveness.
Author Contributions
Conceptualization, F.F. and T.I.; methodology, F.F. and T.I.; software, F.F.; validation, F.F., T.I., N.U. and I.S.; formal analysis, F.F. and I.S.; writing—original draft preparation, F.F.; project administration, F.F.; funding acquisition, F.F. All authors have read and agreed to the published version of the manuscript.
Funding
JST ACT-X: Grant Number JPMJAX190R.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml (accessed on 21 June 2021).
Acknowledgments
FF was supported by JST ACT-X Grant Number JPMJAX190R.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
LD | Langevin Dynamics |
MCMC | Markov Chain Monte Carlo |
ULD | Underdamped Langevin Dynamics |
SGLD | Stochastic Gradient Langevin Dynamics |
SGHMC | Stochastic Gradient Hamilton Monte Carlo |
PLD | Parallel Langevin Dynamics |
PULD | Parallel Underdamped Langevin Dynamics |
SLD | Skew Langevin Dynamics |
S-ULD | Skew Underdamped Langevin Dynamics |
S-PLD | Skew Parallel Langevin Dynamics |
S-PULD | Skew Parallel Underdamped Langevin Dynamics |
KSD | Kernelized Stein Discrepancy |
Appendix A. Additional Backgrounds
We introduce additional background material used in our proofs.
Appendix A.1. Wasserstein Distance and Kullback–Leibler Divergence
In this paper, we use the Wasserstein distance, which we now define. Let be a metric space (an appropriate space such as a Polish space) with field , where is -measurable. Let and be probability measures on E, and . The Wasserstein distance of order p with cost function d between and is defined as
where is the set of all joint probability measures on with marginals and . In this paper, we work on the space . As for the distance, we use the Euclidean distance, . For simplicity, we express the p-Wasserstein distance with the Euclidean distance as . The various properties of the Wasserstein distance are summarized in []. We define the Kullback–Leibler (KL) divergence as
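As a concrete illustration (not from the paper): in one dimension, the p-Wasserstein distance between two empirical measures with the same number of atoms reduces to sorting, since the optimal coupling pairs order statistics.

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    # In 1-D, the optimal coupling between two empirical measures with the
    # same number of atoms matches the sorted samples in order.
    return (np.abs(np.sort(x) - np.sort(y)) ** p).mean() ** (1.0 / p)

rng = np.random.default_rng(3)
a = rng.standard_normal(20000)
b = rng.standard_normal(20000) + 2.0
# W_p(N(0,1), N(2,1)) = 2 for every p >= 1 (a pure translation), so the
# empirical estimate should be close to 2
print(wasserstein_p_1d(a, b, p=2))
```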
Appendix A.2. Markov Diffusion and Generator
Here we give an additional explanation of the generator of a Markov diffusion process. Given an SDE,
and we denote the corresponding Markov semigroup as and define the Kolmogorov operator as , which is defined as , where is some bounded test function in . The property is called the Markov property. A probability measure is the stationary distribution when it satisfies, for all bounded measurable functions f and all t, .
We denote the infinitesimal generator of the associated Markov semigroup as and call it a generator for simplicity. The linearity of the operators together with the semigroup property indicates that is the derivative of as
where is the identity map. In addition, taking , we have . From the Hille–Yosida theory [], there exists a dense linear subspace of on which exists. We refer to it as . If the Markov semigroup is associated with the SDE of Equation (A3), the generator can be written as
where is the Laplacian in the standard Euclidean space. The generator satisfies .
Appendix A.3. Poincaré Inequality
We use the Poincaré inequality to measure the speed of convergence to the stationary distribution. In this section, we summarize definitions and useful properties; see [] for more details. We define the Dirichlet form for all bounded functions , where denotes the domain of , as
is satisfied. By integration by parts, we have . We define a Dirichlet domain, , which is the set of functions that satisfy .
We say that with satisfies a Poincaré inequality with a positive constant c if for any , with satisfies,
This constant c is closely related to the spectral gap. If the smallest eigenvalue of , , is greater than 0, it is called the spectral gap. If the spectral gap exists, it is written as
From this, any constant c satisfying also satisfies the Poincaré inequality. To check the existence of the spectral gap, one approach is to use the Lyapunov function method developed by Bakry et al. [].
We can also express the Poincaré inequality via chi divergence. Let us define the divergence for as
Then, we express the Poincaré inequality with a constant c for all as
We obtain the following exponential convergence results from the above functional inequalities for measures.
Theorem A1.
(Exponential convergence in the variance, Theorem 4.2.5 in []) When π satisfies the Poincaré inequality with a constant c, it implies the exponential convergence in the variance with a rate , i.e., for every bounded function ,
where .
We also introduce an important property of the Poincaré inequality for product measures. This relation plays an important role in our analysis.
Theorem A2.
(Stability under the product, Proposition 4.3.1 in []) If and on satisfy the Poincaré inequalities with a constant and , then the product on satisfies the Poincaré inequality with the constant .
Appendix B. Generator of the Underdamped Langevin Dynamics (ULD)
Following [], we define the infinitesimal generator of the ULD as
Then, we define the generator of S-ULD as
where the second line corresponds to the interaction terms. Then it is easy to confirm that , where . We can prove this simply by using integration by parts and the property of the skew-symmetric matrix. Thus, the stationary distribution of S-ULD is .
We consider other combinations of the skew matrices with ULD. For example, we can consider the following more general combination:
Compared to S-ULD, two new terms are included. We can also derive the infinitesimal generator of this Markov process, which we express as . Then we calculate the infinitesimal change of the expectation of f
which suggests that the stationary distribution of Equation (A14) is different from .
It is widely known that underdamped Langevin dynamics converges to (overdamped) Langevin dynamics. Here we observe that S-ULD converges to Skew-LD in []. The limiting procedure is widely known, for example, see [,,]. We cite Proposition 1 in []; given a stochastic process
and we rescale it by introducing which expresses the small mass limit as
and by taking the limit , the dynamics converges to
See Proposition 1 in [] for the precise statement. Please note that the term related to acts as preconditioning. Thus, if we set , the obtained dynamics are equivalent to the continuous dynamics of skew-SGLD. Thus, our skew-SGHMC is a natural extension of skew-SGLD.
Appendix C. Proof of Theorem 1
Appendix C.1. Proof for S-LD
First, under Assumptions 1–5, LD has a spectral gap, and its Poincaré constant is upper bounded as
and this is derived in [].
Next, we introduce the generator of S-LD
where .
The proof is similar to that of Theorem 12 in [].
Proof of Theorem 1.
Since the generator is self-adjoint and satisfies a suitable growth condition, the spectrum of is discrete []. We denote the spectrum of as and the corresponding normalized eigenvectors as , which are real-valued functions. We order the spectrum as . Thus, .
As for , although it is not a self-adjoint operator, from Proposition 1 in Franke et al. [], it has a discrete complex spectrum. We denote the spectrum of as , where , and the corresponding normalized eigenvectors as , where are real-valued functions, and then we have
From this definition, by checking the real and complex parts, the following relations are derived
Due to the divergence-free drift property, for any bounded real-valued test function ,
where we used integration by parts. This means that for any bounded real function ,
(This only holds for real functions.) Then, we can evaluate the real part of the eigenvalue as follows,
Then, by expanding the eigenfunction in terms of the eigenfunctions ,
Thus, the real part of the eigenvalue of is smaller than the smallest eigenvalue of . This means that the spectral gap of is larger than that of , i.e., holds. □
Appendix C.2. Proof of Theorem 2 (S-ULD)
Proof of Theorem 2.
To prove the statement for S-ULD, we use the result of [], which characterizes the convergence of ULD via the Poincaré constant. Let us denote by the measure induced by ULD. Then from Theorem 1 of [], if with has the Poincaré constant , we have
where and is given as follows.
where
where is an arbitrary, sufficiently small positive value such that is satisfied. As for , if there exists a positive constant K such that , then . In our assumption, this corresponds to , thus . From the above definitions, we can see that the larger is, i.e., the larger the Poincaré constant is, the faster ULD converges.
This can also be confirmed numerically; see Figure A1, which shows how changes under different . We set . From the figure, the larger the Poincaré constant is, the larger becomes.

Figure A1.
The convergence rate of ULD under the different Poincaré constants.
So far, we have confirmed that the convergence speed of S-ULD is characterized by the Poincaré constant of . When we consider S-ULD, we simply add the skew matrix term to the generator of the ULD in the proof of Proposition 1 in []. This means that we simply replace the Poincaré constant from to in the proof of Proposition 1 in []. Then, is replaced with , which indicates faster convergence. □
Appendix D. Eigenvalue and Poincaré Constant
In this section, we discuss the relation between eigenvalues of the Hessian matrix and Poincaré constant.
Appendix D.1. Strongly Convex Potential Function
When we consider LD with an m-strongly convex potential function, the Poincaré constant is m, which implies exponential convergence with rate m (see [] for details).
We then consider S-LD with an m-strongly convex function. In this setting, using the synchronous coupling technique [], we can show that the variance decays exponentially with the rate of the smallest real part of the eigenvalue. To see this, we prepare two S-LDs given as
Then we evaluate the behavior of . From Itô's lemma and the synchronous coupling, we obtain
where is the constant that satisfies for all x; see Appendix E for details. This means that the variance decays exponentially with rate . From the fundamental property of the Poincaré constant (Theorem 4.2.5 in []), is the Poincaré constant. Thus, the imaginary part has no effect on the continuous dynamics, and the Poincaré constant is given by the smallest real part of the eigenvalues of the perturbed Hessian matrix.
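This contraction can be illustrated numerically (a sketch with assumed names and sizes, not the paper's code): for a quadratic potential with Hessian H, the shared Brownian motion cancels in the difference of the two chains, so X_t − Y_t = exp(−(I + γJ)H t)(X_0 − Y_0) decays at the rate of the smallest real part of the perturbed spectrum.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
d = 8
B = rng.standard_normal((d, d))
H = B @ B.T / d + np.eye(d)       # Hessian of a strongly convex quadratic potential
M = rng.standard_normal((d, d))
J = (M - M.T) / np.sqrt(d)        # random skew-symmetric matrix
gamma = 1.0
P = (np.eye(d) + gamma * J) @ H   # skew-perturbed drift matrix

# Under synchronous coupling the difference of the two S-LD chains satisfies
# d(X - Y)/dt = -P (X - Y), hence X_t - Y_t = expm(-P t) (X_0 - Y_0).
lam = np.linalg.eigvals(P).real.min()   # contraction rate: smallest real part
v = rng.standard_normal(d)
for t in [1.0, 2.0, 4.0]:
    print(t, np.linalg.norm(expm(-P * t) @ v), np.exp(-lam * t))
```

The imaginary parts of the spectrum only rotate the difference vector; the decay of its norm is governed by the real parts, matching the discussion above.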
Appendix D.2. Non-Convex Potential Function
As we discussed in Section 3.1, [] derived a sharper estimate of the Poincaré constant for non-convex potential functions. It is easy to verify that their assumptions are satisfied under our Assumptions 1–5. Following the main paper, we denote the global minimum as , and is the local minimum with the second smallest value in . We express the saddle point between and as . To be more precise, the saddle point that characterizes the Poincaré constant is the critical point with index one, defined as
and has one negative eigenvalue and positive eigenvalues. We express them as .
Ref. [] studied the Poincaré constant by decomposing the non-convex potential with a focus on attractors, showing that the non-convex potential can be decomposed into a sum of approximately Gaussian distributions. They proved that the Poincaré constant is characterized by local Poincaré constants, derived from the approximate Gaussian distributions on the attractors and their surrounding regions. In addition, they proved that the dominant term of the Poincaré constant is specified by the saddle point between the global minimum and the point that takes the second smallest value of . From Theorem 2.12 and Corollary 2.15 in [], the Poincaré constant is characterized by
where Z is the normalizing constant of .
Next, we discuss how this estimate changes when skew matrices are applied. When skew matrices are introduced, from Lemma A.1 in [], at the saddle point there exists a unique negative real eigenvalue of the perturbed Hessian matrix even if is not a symmetric matrix.
Then from Proposition 5 in [], that negative eigenvalue of the perturbed Hessian is smaller than that of the un-perturbed Hessian matrix at the saddle point. This means that holds.
Finally, from Theorem 5.1 in [] and Theorem 2.12 in [], this improvement of the negative eigenvalue of the saddle point directly leads to the larger Poincaré constant.
Appendix E. Properties of a Skew-Symmetric Matrix
Here, we introduce the basic properties of skew-symmetric matrices. Assume that the matrix is diagonalizable and has l real eigenvalues and complex eigenvalues, . Thus, . We denote the corresponding eigenvectors as for the real eigenvalues and for the complex eigenvalues, with for the corresponding conjugate eigenvalues. Then, let us define a matrix V as
Then, we can decompose into a block diagonal matrix [];
Thus, . Then, from the Taylor expansion and expressing its residual by integral, by defining we have
Then, let us apply the Jordan canonical form. If is diagonalizable, it is decomposable by the Jordan canonical form shown in Equation (A40). Then, we can decompose as
Then, we obtain
where is the constant that satisfies for all x. Thus, the imaginary part never appears in the upper bound, and we only need to focus on the largest real part of the eigenvalues if the matrix is diagonalizable. The next subsection describes when a non-symmetric matrix is diagonalizable, focusing on random matrices.
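These basic properties are easy to verify numerically; the following standalone check (illustrative, not part of the paper) confirms that a real skew-symmetric matrix has a purely imaginary spectrum and that its matrix exponential is an orthogonal, norm-preserving rotation.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
d = 6
M = rng.standard_normal((d, d))
J = M - M.T                      # skew-symmetric: J^T = -J

# The spectrum of a real skew-symmetric matrix is purely imaginary
assert np.allclose(np.linalg.eigvals(J).real, 0.0, atol=1e-10)

# exp(J) is orthogonal, so the flow generated by J preserves the norm
Q = expm(J)
assert np.allclose(Q.T @ Q, np.eye(d), atol=1e-10)
print("purely imaginary spectrum; exp(J) is orthogonal")
```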
Appendix F. Proof of Theorem 3
Proof.
Since the potential function is m-strongly convex, the smallest eigenvalue of the Hessian matrix H is m, which is larger than 0. Thus, H and are regular matrices. With this in mind, we consider as a similar matrix of . This is easily confirmed by
This means that to study the eigenvalues of , we only need to study the similar matrix . In this way, A is composed of a symmetric and a skew-symmetric matrix, which is easier to treat than , where the term is difficult to analyze. For simplicity, we omit the dependency of H and on x in this section.
Remark A1.
Please note that we can eliminate the strong convexity of U if H is a regular matrix, i.e., H does not have 0 as an eigenvalue.
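The similarity argument can be verified numerically. The sketch below (illustrative, with assumed variable names) checks that (I + γJ)H and H^{1/2}(I + γJ)H^{1/2} share the same spectrum, and that the perturbation H^{1/2}JH^{1/2} stays skew-symmetric, so the transformed matrix is indeed a symmetric part plus a skew-symmetric part.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 12
B = rng.standard_normal((d, d))
H = B @ B.T + np.eye(d)                   # positive definite Hessian
M = rng.standard_normal((d, d))
J = M - M.T                               # skew-symmetric
gamma = 0.7

# Symmetric square root of H via its eigendecomposition
w, V = np.linalg.eigh(H)
H_half = V @ np.diag(np.sqrt(w)) @ V.T

# A = H^{1/2}(I + gamma J)H^{1/2} = H + gamma H^{1/2} J H^{1/2}:
# symmetric part H plus a skew-symmetric perturbation
A = H_half @ (np.eye(d) + gamma * J) @ H_half
K = A - H
assert np.allclose(K.T, -K, atol=1e-8)    # perturbation is skew-symmetric

# Similar matrices share the same spectrum (compare as multisets)
eigA = np.linalg.eigvals(A)
eigPH = np.linalg.eigvals((np.eye(d) + gamma * J) @ H)
for e in eigA:
    assert np.abs(e - eigPH).min() < 1e-5
print("spectra of A and (I + gamma J)H coincide")
```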
For simplicity, we assume that the dimension d is an even number. We assume that the eigenvalues and eigenvectors of A are expressed as
and is ordered as . In this section, we only consider the setting where all the eigenvalues and eigenvectors are imaginary, for notational simplicity. The extension to the general setting, as in Appendix E, and to the setting where d is odd is straightforward.
We denote the eigenvalues and eigenvectors of H as and s are linearly independent. In addition, we assume that . From this definition, by checking the real parts and complex parts, the following relations are derived
thus, by the skew-symmetric property
and in the third equality, we used the property
since is a skew-symmetric matrix. Then, we expand and by as
since the s are eigenvectors of H, which can be used as a basis for . Then we substitute this into Equation (A50), and we have
This means that the real part of any eigenvalue of A is larger than , which is the smallest eigenvalue of H. Thus, if is the smallest real part of the eigenvalues of A, it is larger than the smallest eigenvalue of H. This concludes the proof.
In the same way,
which means any real part of the eigenvalues of A is smaller than the largest eigenvalue of H. Thus, if is the largest real part of the eigenvalues of A, it is smaller than the largest eigenvalue of H.
Equality condition:
Next, we discuss when the equality holds for . First, we assume that eigenvalues of H are distinct, thus, there is only one eigenvector for . Later, we discuss if eigenvalues are not distinct. From Equation (A54), we have
in general. Please note that if and does not correspond to , then must appear in the summation and equality never holds. So, the condition is
must hold for the equality.
Based on this, let us assume that where . We consider the case . Then we need to solve the simultaneous equations
this is obtained by the definition of the eigenvalue of A and
this is obtained from the definition of eigenvalues of H. Then multiplying from the left, we obtain and . Thus, . means from the property of the complex eigenvectors. Thus, we obtain for . Then, the following relation holds,
Since and has the inverse matrix, this condition indicates that
This is the condition that holds. The same relation can be derived for .
Next, we assume that the eigenvalues of H are not distinct. Let us denote the set of eigenvectors of the eigenvalue as . Please note that if and is not included in , then must appear and equality never holds. Thus
must hold for equality. Based on this, let us assume that where . We consider the case . Then
then we obtain the condition
□
Appendix G. Proofs of Random Matrices
Appendix G.1. Proof of Theorem 5
Proof.
The proof is a straightforward consequence of a lemma in [], that is
Lemma in ([]). If is a polynomial in real variables , which is not identically zero, then the subset of the Euclidean m-space has Lebesgue measure zero.
We use this lemma to prove that the probability of is 0 by showing that the probability mass of has Lebesgue measure zero.
We use the same notation as in Appendix F. Recall Equation (A64), which is the condition of equality about . We express the elements of and as and . Then the equality condition can be written as
Then we define the polynomial about
To apply the lemma of [], we must confirm that is not identically 0. This is clear from the definition of f since we generate randomly from a distribution that is absolutely continuous with respect to the Lebesgue measure and and and either .
Then, given an evaluation point x, from the lemma of [], the subset of that satisfies has Lebesgue measure zero. Thus, if we generate from a probability measure that is absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), holds with probability 0. This concludes the proof. □
Appendix G.2. Proof of Lemma 1
Proof.
We first discuss the condition about . Since , we denote the set of eigenvalues of as . In general, the eigenvalues of the Kronecker product of two matrices A and B are given by the products of the eigenvalues of A and B []. Thus, since J is the Kronecker product of and , if does not have 0 as an eigenvalue, then J does not have 0 as an eigenvalue.
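This spectral property of the Kronecker product is easy to confirm; below is a small standalone check with illustrative matrix sizes.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))

# The spectrum of A (x) B equals, as a multiset, all pairwise products
# lambda_i(A) * mu_j(B) of the eigenvalues of A and B
eig_kron = np.linalg.eigvals(np.kron(A, B))
products = np.array([a * b for a in np.linalg.eigvals(A)
                     for b in np.linalg.eigvals(B)])
for e in eig_kron:
    assert np.abs(e - products).min() < 1e-8
print("Kronecker spectrum = pairwise eigenvalue products")
```

In particular, 0 is an eigenvalue of the product only if it is an eigenvalue of one of the factors, which is the fact used above.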
Next, we discuss the other equality condition. We use similar notation as in Appendix F, but now the dimension of the matrix J is . We express the eigenvalue that has the smallest real part as and its eigenvector as , and the elements of and as and . We also express these as , where .
We use the Kronecker product property:
where indicates the element in the i-th row and j-th column of , and we use the property of the Kronecker product and the Vec operator in the second equality [].
The proof is almost the same as in Appendix G.1. Then the equality condition can be written as
where is the d-dimensional Euclidean norm since . Then we define the polynomial about
By a discussion similar to Appendix G.1, it is clear that f is not identically 0. Thus, given an evaluation point x, from the lemma of [], the subset of that satisfies has Lebesgue measure zero. Thus, if we generate from a probability measure that is absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), holds with probability 0. This concludes the proof. □
Appendix G.3. Extending the Theorem to the Path
Theorem 5 and Lemma 1 hold when we fix an evaluation point x. To ensure the acceleration, we need to extend them from a single evaluation point to the path of the stochastic process for S-LD, S-PLD, S-ULD, and S-PULD.
First, the condition on is not related to the evaluation point. Thus, we need to consider the equality condition for . As for this condition, as we saw in Theorem 5 and Lemma 1, if we generate a random matrix J that is absolutely continuous with respect to the Lebesgue measure, then the equality condition is not satisfied with probability 1 at the given evaluation point. The important point in those proofs is that the event where equality holds has Lebesgue measure 0 at the given evaluation point, by the lemma of [].
Let us consider the case when two evaluation points are given (e.g., , ), and we check whether the random matrix J satisfies the above equality condition. We can easily prove that at each evaluation point, such an event (expressed as and ) has Lebesgue measure 0 using the lemma of [] (we write this as and , where P is the law induced by generating the random matrix with independent elements). So, the measure of the union of and is also 0 (). By repeating this procedure, given a finite number of evaluation points , the probability of the union of such events is 0 (this indicates ).
When we consider the discretized dynamics of S-LD, S-PLD, and so on, and update samples up to k iterations, there exist k evaluation points. So, by applying the above discussion, we can ensure that along the path of the discretized dynamics, the equality condition does not hold with probability 1. On the other hand, for the continuous dynamics, the evaluation points are infinite; thus, we cannot conclude that the probability that equality does not hold is 1.
Appendix H. Proof of Theorem 6
We use the same notation as in Appendix F. We consider the expansion concerning and we consider the following setting,
which indicates that by introducing the skew-acceleration terms, the pairs of eigenvalues and eigenvectors of are expressed as small perturbations of the eigenvalues and eigenvectors of H. Since the s are eigenvectors of H, they can be used as an orthogonal basis, and we expand by this basis. We obtain
where .
Appendix H.1. Asymptotic Expansion When the Smallest Eigenvalue of H(x) Is Positive
We work on the similar matrix of , that is, , where . See Appendix G.1 for details. Please note that this similar matrix only exists when the smallest eigenvalue of is positive. Thus, the following discussion cannot be applied at the saddle point, where negative eigenvalues appear; we discuss the saddle-point expansion later.
From the definition, we have
We rearrange this equation as
First, we focus on the first-order expansion. This means we neglect high-order terms. Then, we have
By multiplying to Equation (A76) from the left-hand side, we have
Since , due to the skew-symmetric property of V, we have
up to the first-order expansion. Then we substitute this into Equation (A76) and, multiplying by , where , we have
Then we have
Then we obtain
We substitute this into Equation (A75), and multiplying , we have
Since and and , we have
Thus, we have
Thus, by taking the real part and noting that , we have
This concludes the proof.
Appendix H.2. Expansion of the Eigenvalue at the Saddle Point
Here we derive the formula of the expansion of the eigenvalue at the saddle point. Since the smallest eigenvalue is negative, we cannot use the similar matrix as shown above. Instead, we use the relation,
where we used the definition of the eigenvalues and eigenvectors. Here, we express and its pairs of eigenvalues and eigenvectors as . As introduced above, we substitute the expansion into Equation (A86) and obtain
Then, in the same way as above, since the s are eigenvectors of H, they can be used as an orthogonal basis, and we expand by this basis. This means
where . By multiplying Equation (A87) by , where , from the left-hand side and neglecting high-order terms, we have
Next, by multiplying Equation (A87) by from the left-hand side, we have
Then, by substituting the coefficient from Equation (A89), we have
This concludes the proof.
Appendix I. Convergence Rate of Parallel Sampling Schemes
Appendix I.1. Proof of Lemma 2
First, we introduce the notation. We express the random variables of S-PLD as , and the measure induced by S-PLD, which uses as an interaction term, as . We express the measure of PLD as , which we can decompose into marginals. We also denote the marginal measure of S-PLD for . Please note that the initial distribution is and its marginals are , as defined in Assumption 4.
Please note that the marginal measures of PLD are the same as those of LD if the initial measures are all the same; thus, each marginal satisfies the Poincaré inequality with constant . This is also a consequence of the tensorization property of the spectral gap (Proposition 4.3.1 in Bakry et al. []).
As for the initial condition, from the fact that the divergence is a special case of the Rényi divergence (), and from the tensorization property of the Rényi divergence (see Theorem 28 in []), we have
Then we have
If the skew acceleration is applied, from the same discussion as for S-LD (see Appendix C.1), S-PLD has a Poincaré constant larger than , which we express as . Then we have
At first sight, since the constant N appears in the convergence bound, the bound seems not useful. However, as we discuss below, the bound is meaningful when we bound the bias or variance. For example, let us consider approximating the true expectation by the ensemble samples . Then we are interested in bounding the error
For this purpose, we can bound this by 2-Wasserstein distance as
where we assumed that f is Lipschitz and used the fact that is Lipschitz.
To bound the distance, we use the basic relation
where is the Poincaré constant. This is established by the definition of the Wasserstein distance and the -divergence; see [] for details. Then, combined with the above relations, we obtain the bias bound of S-PLD as
In the same way, we obtain the bias bound of PLD as
Thus, although the explicit dependency on N disappears, S-PLD shows faster convergence through the relation . Moreover, if we use skew matrices that do not satisfy the equality condition, we have .
Appendix I.2. Proof for S-ULD
We can characterize the convergence rate in almost the same way as in Appendix C.2. The derivation is the same as above, so we only state the result
where and are given as follows.
and
where is an arbitrary, sufficiently small positive value such that is satisfied, and
Appendix J. Proof of Theorem 7
We state our theorem again with explicit constants.
Theorem A3.
Under Assumptions 1–7, for any and any obeying and , we have
where
The obtained bound is , which is independent of N. Thus, this result is much better than that in []. Additionally, note that we can derive a similar bias bound for skew-SGHMC in the same way as for skew-SGLD.
Proof.
For notational simplicity, we express the random variables of skew-SGLD, which uses as the interaction term, as , and those of S-PLD as . In this section, for simplicity, we express them as and . We denote the measures of and as and , and the marginal measures of and as and .
Then, we first decompose the bias as
where we used Jensen's inequality for the first term in the last inequality and moved outside the . In addition, each expectation in the first term depends only on the marginal measures, and we used the property of the 2-Wasserstein (2-W) distance. Furthermore, we decompose the first term as
where denotes the measure induced by , which is the naive parallel sampling without a skew-symmetric interaction.
In conclusion, our task is to bound each of the terms , , and above. Bounding was already discussed in Appendix I.1.
Next, we work on and . Following [], we use the weighted Csiszár–Kullback–Pinsker (CKP) inequality to bound the 2-W distance. From Bolley and Villani [], we can bound each 2-W distance by the relative entropy (KL divergence). The weighted CKP inequality states that
with
and
with
We point out that it is important to use , not and , in the weighted CKP inequality. This is because is a constant based on the parallel-chain Monte Carlo without the skew-symmetric term, so the parallel chain can be decomposed into independent chains. Thus, does not depend on i, does not depend on N, and shows dependency. However, and show , which depends linearly on N, since the interaction term couples the parallel chains and we cannot decompose them easily. This would result in an unsatisfactory dependency on N, which is why we introduced in our theoretical analysis.
Please note that since is induced by the naive parallel chain, the marginals are mutually independent and share the same measure if the initial measures are the same. Thus, . From now on, we express the marginal as for simplicity, so .
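The tensorization of the KL divergence for product measures used here (the divergence between N-fold products equals N times the coordinate-wise divergence) can be sanity-checked with Gaussians; the means and standard deviations below are arbitrary illustrations.

```python
import math

def kl_gauss_1d(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def kl_gauss_diag(mu1, s1, mu2, s2):
    """KL between N-dimensional Gaussians with diagonal covariances (given as lists)."""
    n = len(mu1)
    trace = sum((s / t) ** 2 for s, t in zip(s1, s2))
    quad = sum((a - b) ** 2 / t ** 2 for a, b, t in zip(mu1, mu2, s2))
    logdet = sum(2 * math.log(t / s) for s, t in zip(s1, s2))
    return 0.5 * (trace + quad - n + logdet)

N = 5
mu1, s1 = [0.0] * N, [1.0] * N   # product of N identical 1-D Gaussians
mu2, s2 = [0.3] * N, [1.2] * N
joint = kl_gauss_diag(mu1, s1, mu2, s2)
tensorized = N * kl_gauss_1d(0.0, 1.0, 0.3, 1.2)
print(abs(joint - tensorized) < 1e-10)  # True: KL tensorizes over independent chains
```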
Then, substituting the above weighted CKP inequalities and using Jensen's inequality, we obtain
To analyze the discretization error, we use the following key lemma:
Lemma A1.
Assume that there exist random variables and . We denote the product space as . Let us introduce and , express their joint probability measures as and , and denote the marginal measures of the Xs and Ys as and . If holds, we have
Combining these results with the above bias bound, we obtain
Thus, we need to bound , , and . We can upper-bound them using the results of []. For that purpose, we need to replace the constants in [] as shown below. Here, we discuss how the constants in the assumptions change in the ensemble scheme. We define
First, we focus on the smoothness condition. From Assumption 2 and Lemma 8 in [], we have
where the norm in the right-hand side is the Euclidean norm in .
Next, we discuss the dissipativity condition. Define . Then, let , and under Assumptions 1 to 6, we have
Next, we check the condition on the drift function at the origin: . We can calculate it in the same way as for the smoothness condition. Then we have
Next, we study the condition on the stochastic gradient: . This can easily be modified to
Finally, we discuss the initial condition: . We assume that the initial probability distribution is , which means that all the marginal probabilities are the same. Then
In this way, the constants in the assumptions are modified and expressed in terms of N and . Then, combined with the results of [], we can derive the following relations
where
This concludes the proof. □
Appendix J.1. Proof of Lemma A1
Proof.
We prove this lemma using the Donsker–Varadhan representation of the relative entropy []. The relative entropy admits the dual representation
where the supremum is taken over all functions T for which the expectations of and T are finite. We then restrict the function class to one where the expectations of and are finite. Then, by definition,
Then we have
□
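The Donsker–Varadhan representation used in this proof can be checked numerically on a discrete example: the witness T* = log(p/q) attains the supremum because E_q[e^{T*}] = 1, and any other T gives a smaller value. The distributions and the alternative witness below are arbitrary.

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# KL divergence directly from the definition
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Donsker-Varadhan value at the optimal witness T* = log(p/q)
T = [math.log(pi / qi) for pi, qi in zip(p, q)]
dv = (sum(pi * ti for pi, ti in zip(p, T))
      - math.log(sum(qi * math.exp(ti) for qi, ti in zip(q, T))))
print(abs(kl - dv) < 1e-12)  # True, since E_q[exp(T*)] = 1

# any other witness only lower-bounds the KL divergence
T2 = [0.1, -0.2, 0.3]
dv2 = (sum(pi * ti for pi, ti in zip(p, T2))
       - math.log(sum(qi * math.exp(ti) for qi, ti in zip(q, T2))))
print(dv2 <= kl)  # True
```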
Appendix K. Order Expansion
Bias Expansion for S-PLD
Recall that the bias of S-PLD is
where
First, we discuss the convergence of the continuous dynamics. Using the eigenvalue expansion in Theorem 6, with some positive constant , we have
Then, assuming that is small enough and considering the Taylor expansion, we have
As for the discretization and stochastic gradient error, using the Taylor expansion, there exist positive constants and such that
Combining these terms, we have
Thus, there exists an optimal that minimizes the bias. Please note that at , acceleration always occurs. As k goes to infinity, the second and third terms go to 0, so the first term becomes dominant, which means we have a larger discretization and stochastic gradient error.
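The trade-off described above can be illustrated with a toy model of the bias; the functional form and all constants below are our own illustrative assumptions, not the paper's constants. The continuous-dynamics term decays exponentially at a rate that improves with the interaction strength, while the discretization and stochastic gradient term grows with it, so their sum has an interior minimizer.

```python
import math

def bias(gamma, k=500, h=0.01, A=1.0, lam0=0.5, B=0.05, C=0.2):
    """Toy bias model: accelerated exponential decay plus a gamma-dependent
    discretization / stochastic gradient error (all constants are assumptions)."""
    lam = lam0 * (1.0 + gamma)                     # skew term speeds up the decay rate
    return A * math.exp(-lam * h * k) + (B + C * gamma ** 2) * h

gammas = [i / 100.0 for i in range(0, 301)]        # grid search over [0, 3]
best = min(gammas, key=bias)
print(0.0 < best < 3.0)          # True: an interior optimum exists in this toy model
print(bias(best) < bias(0.0))    # True: some acceleration always beats gamma = 0 here
```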
Appendix L. Hyperparameters of the Proposed Algorithm
Here, we discuss how to set the hyperparameters in the algorithm. There are three hyperparameters: , , and c. We numerically found that setting works well for real datasets, including the LDA experiment and the Bayesian neural network regression and classification. For the toy dataset, we set .
As for and , we empirically found that the following scaling trick works well for real datasets, including the LDA experiment and the Bayesian neural network regression and classification,
and using . The intuition is that the magnitude of the gradient can differ greatly across dimensions, so we introduce scaling by the gradient. We also multiply by h so that the stochastic gradient and discretization error of the skew term do not dominate the usual gradient term. Finally, we multiply by a constant so that does not become too small.
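A minimal sketch of one update with this scaling trick follows. The update form, the per-dimension scaling c * h * gamma / |grad_i|, and the function names are our own assumptions reconstructed from the description above, not the paper's exact implementation.

```python
import math
import random

def skew_sgld_step(theta, grad, h, gamma, c, J_apply, noise):
    """One assumed skew-SGLD update: theta <- theta - h * (grad + s * (J grad)) + noise,
    where s_i = c * h * gamma / (|grad_i| + eps) is the per-dimension scaling
    of the skew-interaction strength described above (assumed form)."""
    eps = 1e-8                                       # guards against zero gradients
    scale = [c * h * gamma / (abs(g) + eps) for g in grad]
    skew = J_apply(grad)        # user-supplied application of a skew-symmetric matrix J
    return [t - h * (g + s * j) + n
            for t, g, s, j, n in zip(theta, grad, scale, skew, noise)]

# usage on a 2-D toy problem with J = [[0, -1], [1, 0]] (skew-symmetric)
J_apply = lambda g: [-g[1], g[0]]
h = 0.01
random.seed(0)
noise = [math.sqrt(2.0 * h) * random.gauss(0.0, 1.0) for _ in range(2)]
theta = skew_sgld_step([0.0, 0.0], [1.0, -2.0], h, gamma=0.5, c=1.0,
                       J_apply=J_apply, noise=noise)
```

With gamma set to 0, the update reduces to a plain (stochastic gradient) Langevin step, which is a useful sanity check.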
Appendix M. Proof of Theorem 8
In this section, we derive the upper bound of the bias of skew-SGLD based on []. This approach requires us to use the logarithmic Sobolev inequality [], which is stronger than the Poincaré inequality. First, we present the definition of the logarithmic Sobolev inequality. We say that on with satisfies the logarithmic Sobolev inequality with constant if, for all functions f on with ,
This logarithmic Sobolev inequality is stronger than the Poincaré inequality and induces convergence in KL divergence. See [] for details. It was proved in [,] that our dynamics (LD, S-LD, PLD, S-PLD, and skew-SGLD) satisfy the logarithmic Sobolev inequality under our assumptions. We express the constant of the logarithmic Sobolev inequality for skew-SGLD as . This constant depends on the skew matrices and the Poincaré constant. We estimate this constant in Appendix M.1.
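In standard notation (the symbols below are the classical generic form, since the displayed inequality appears in the main text), the logarithmic Sobolev inequality for a measure \(\mu\) with constant \(\alpha\) reads

```latex
\mathrm{Ent}_{\mu}\!\left(f^{2}\right)
  := \mathbb{E}_{\mu}\!\left[f^{2}\log f^{2}\right]
     - \mathbb{E}_{\mu}\!\left[f^{2}\right]\log\mathbb{E}_{\mu}\!\left[f^{2}\right]
  \;\le\; \frac{2}{\alpha}\,\mathbb{E}_{\mu}\!\left[\|\nabla f\|^{2}\right],
```

and linearizing around constant functions (setting \(f = 1 + \epsilon g\) and letting \(\epsilon \to 0\)) recovers the Poincaré inequality \(\mathrm{Var}_{\mu}(g) \le \frac{1}{\alpha}\,\mathbb{E}_{\mu}[\|\nabla g\|^{2}]\), which is why the logarithmic Sobolev inequality is the stronger condition.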
To upper-bound the bias, we control the KL divergence here. We denote the law of skew-SGLD at iteration k with interaction strength as . We upper-bound the bias by the 2-Wasserstein distance
Then, from the transportation inequality [],
Thus, we upper-bound the KL divergence using the technique in []. However, the original proof uses a full gradient, so we replace it with the stochastic gradient. Moreover, we introduce the skew interaction term.
First, Lemma 11 in [] is modified to
Then Lemma 12 in [] is modified to
for any integrable .
Hereinafter, we drop from , , and for notational simplicity. Focusing on skew-SGLD at iteration k, we consider the following SDE for
where is the stochastic gradient conditioned on . The solution of this SDE is
We would like to derive the continuity equation corresponding to Equation (A154). Following [], we express as and as for simplicity. Let denote the joint distribution of . Then, the conditional and marginal relations are written as
The conditional density follows the Fokker–Planck (FP) equation
Then following [], to derive the evolution of , we take the expectation over
Then, we take the expectation with respect to for the stochastic gradient in the above equation and absorb it into for notational simplicity. Following the discussion of Lemma 3 in [], we obtain
where and
Then, from [], we can upper-bound the second term by
and the third term is upper-bounded by
where we used Lemmas 2 and 7 in []. Finally, from the original proof of [], we obtain
Then, in conclusion, for obeying and , we obtain
For simplicity, we assume that and ; then we obtain
Then using , we obtain
If , we obtain
From this one-step inequality, we obtain
Then, finally we obtain
where
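The passage from the one-step inequality to the k-step bound follows the standard unrolling of a contraction. In generic form (a and b below are placeholders for the step-wise contraction factor and additive error, not the paper's constants), iterating x_{j+1} ≤ (1 − a) x_j + b gives x_k ≤ (1 − a)^k x_0 + b/a, which the following snippet verifies numerically for the equality case:

```python
def unroll(x0, a, b, k):
    """Iterate x_{j+1} = (1 - a) * x_j + b for k steps."""
    x = x0
    for _ in range(k):
        x = (1.0 - a) * x + b
    return x

x0, a, b, k = 5.0, 0.1, 0.2, 50
# closed form from the geometric series: (1-a)^k x0 + b * (1 - (1-a)^k) / a
closed = (1.0 - a) ** k * x0 + (b / a) * (1.0 - (1.0 - a) ** k)
print(abs(unroll(x0, a, b, k) - closed) < 1e-9)  # True
```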
Appendix M.1. Estimation of the Logarithmic Sobolev Constant
In this section, we estimate the logarithmic Sobolev constants using the restricted logarithmic Sobolev inequality technique introduced in [].
The technique of [] estimates the constant of the logarithmic Sobolev inequality as follows. Assume that on with satisfies the Poincaré inequality with constant m. Then, for any function u on that satisfies
we find a constant b that satisfies
Then the logarithmic Sobolev constant is larger than . Thus, we only need to focus on the restricted function class to estimate the constant b. We slightly modify Lemma 3.2 of [], which estimates the constant b in Equation (A176), to apply it in our setting. In Lemma 3.2 of [], it was proved that if u on satisfies the conditions in Equation (A175), then for any , we have
where we assume that satisfies the Poincaré inequality with constant m. If there exists a constant C such that
then by setting , we can show that
Thus, the constant b in Equation (A176) is , and the logarithmic Sobolev constant is .
Thus, we analyze the constant C. The first term of the integral in Equation (A178) is lower-bounded by
where we used the property of , see [] for details. As for the second term, it is lower-bounded by
Thus, by setting
we can estimate the logarithmic Sobolev constant as .
In our setting, this is modified to
where
Finally, if we increase , increases. Thus, since , we obtain .
Appendix M.2. Computational Complexity
To derive the computational complexity, we assume for simplicity that and . We also set . This means that the variance of the stochastic gradient is small enough and we use a small . Then the bias is
where
Then we define
and use the step size that satisfies . Then when we use
we have
References
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
- Raginsky, M.; Rakhlin, A.; Telgarsky, M. Non-convex learning via Stochastic Gradient Langevin Dynamics: A nonasymptotic analysis. In Proceedings of the Conference on Learning Theory, Amsterdam, The Netherlands, 7–10 July 2017; pp. 1674–1703. [Google Scholar]
- Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning, Washington, DC, USA, 28 June–2 July 2011; pp. 681–688. [Google Scholar]
- Livingstone, S.; Girolami, M. Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions. Entropy 2014, 16, 3074–3102. [Google Scholar] [CrossRef]
- Hartmann, C.; Richter, L.; Schütte, C.; Zhang, W. Variational Characterization of Free Energy: Theory and Algorithms. Entropy 2017, 19, 626. [Google Scholar] [CrossRef] [Green Version]
- Neal, R.M. Improving asymptotic variance of MCMC estimators: Non-reversible chains are better. arXiv 2004, arXiv:math/0407281. [Google Scholar]
- Neklyudov, K.; Welling, M.; Egorov, E.; Vetrov, D. Involutive mcmc: A unifying framework. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 13–18 July 2020; pp. 7273–7282. [Google Scholar]
- Gao, X.; Gurbuzbalaban, M.; Zhu, L. Breaking Reversibility Accelerates Langevin Dynamics for Non-Convex Optimization. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 17850–17862. [Google Scholar]
- Eberle, A.; Guillin, A.; Zimmer, R. Couplings and quantitative contraction rates for Langevin dynamics. Ann. Probab. 2019, 47, 1982–2010. [Google Scholar] [CrossRef] [Green Version]
- Gao, X.; Gürbüzbalaban, M.; Zhu, L. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv 2018, arXiv:1809.04618. [Google Scholar]
- Cheng, X.; Chatterji, N.S.; Abbasi-Yadkori, Y.; Bartlett, P.L.; Jordan, M.I. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv 2018, arXiv:1805.01648. [Google Scholar]
- Chen, T.; Fox, E.; Guestrin, C. Stochastic gradient hamiltonian monte carlo. In Proceedings of the International conference on machine learning, Beijing, China, 21–26 June 2014; pp. 1683–1691. [Google Scholar]
- Hwang, C.R.; Hwang-Ma, S.Y.; Sheu, S.J. Accelerating gaussian diffusions. Ann. Appl. Probab. 1993, 3, 897–913. [Google Scholar] [CrossRef]
- Hwang, C.R.; Hwang-Ma, S.Y.; Sheu, S.J. Accelerating diffusions. Ann. Appl. Probab. 2005, 15, 1433–1444. [Google Scholar] [CrossRef] [Green Version]
- Hwang, C.R.; Normand, R.; Wu, S.J. Variance reduction for diffusions. Stoch. Process. Their Appl. 2015, 125, 3522–3540. [Google Scholar] [CrossRef]
- Duncan, A.B.; Lelièvre, T.; Pavliotis, G.A. Variance Reduction Using Nonreversible Langevin Samplers. J. Stat. Phys. 2016, 163, 457–491. [Google Scholar] [CrossRef] [Green Version]
- Duncan, A.B.; Nüsken, N.; Pavliotis, G.A. Using Perturbed Underdamped Langevin Dynamics to Efficiently Sample from Probability Distributions. J. Stat. Phys. 2017, 169, 1098–1131. [Google Scholar] [CrossRef] [Green Version]
- Futami, F.; Sato, I.; Sugiyama, M. Accelerating the diffusion-based ensemble sampling by non-reversible dynamics. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 13–18 July 2020; pp. 3337–3347. [Google Scholar]
- Bakry, D.; Gentil, I.; Ledoux, M. Analysis and Geometry of Markov Diffusion Operators; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 348. [Google Scholar]
- Roussel, J.; Stoltz, G. Spectral methods for Langevin dynamics and associated error estimates. ESAIM Math. Model. Numer. Anal. 2018, 52, 1051–1083. [Google Scholar] [CrossRef] [Green Version]
- Menz, G.; Schlichting, A. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab. 2014, 42, 1809–1884. [Google Scholar] [CrossRef] [Green Version]
- Liu, Q.; Lee, J.; Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 24–26 June 2016; pp. 276–284. [Google Scholar]
- Vempala, S.; Wibisono, A. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8094–8106. [Google Scholar]
- Lelièvre, T.; Nier, F.; Pavliotis, G.A. Optimal non-reversible linear drift for the convergence to equilibrium of a diffusion. J. Stat. Phys. 2013, 152, 237–274. [Google Scholar] [CrossRef] [Green Version]
- Tripuraneni, N.; Rowland, M.; Ghahramani, Z.; Turner, R. Magnetic Hamiltonian Monte Carlo. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3453–3461. [Google Scholar]
- Nusken, N.; Pavliotis, G. Constructing sampling schemes via coupling: Markov semigroups and optimal transport. SIAM/ASA J. Uncertain. Quantif. 2019, 7, 324–382. [Google Scholar] [CrossRef]
- Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Proceedings of the Advances In Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2378–2386. [Google Scholar]
- Zhang, J.; Zhang, R.; Chen, C. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv 2018, arXiv:1809.01293. [Google Scholar]
- Wang, Y.; Li, W. Information Newton’s flow: Second-order optimization method in probability space. arXiv 2020, arXiv:2001.04341. [Google Scholar]
- Wibisono, A. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the Conference On Learning Theory, Stockholm, Sweden, 6–9 July 2018; pp. 2093–3027. [Google Scholar]
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
- Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R.D.; Neven, H. Bayesian sampling using stochastic gradient thermostats. In Proceedings of the Advances in neural information processing systems, Montreal, QC, Canada, 8–11 December 2014; pp. 3203–3211. [Google Scholar]
- Patterson, S.; Teh, Y.W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3102–3110. [Google Scholar]
- Dua, D.; Graff, C. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 21 July 2021).
- Villani, C. Optimal transportation, dissipative PDE’s and functional inequalities. In Optimal Transportation and Applications; Springer: Berlin/Heidelberg, Germany, 2003; pp. 53–89. [Google Scholar]
- Bakry, D.; Barthe, F.; Cattiaux, P.; Guillin, A. A simple proof of the Poincaré inequality for a large class of probability measures including the log-concave case. Electron. Commun. Probab 2008, 13, 21. [Google Scholar] [CrossRef]
- Nelson, E. Dynamical Theories of Brownian Motion; Princeton University Press: Princeton, NJ, USA, 1967; Volume 3. [Google Scholar]
- Pavliotis, G.A. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations; Springer: Berlin/Heidelberg, Germany, 2014; Volume 60. [Google Scholar]
- Franke, B.; Hwang, C.R.; Pai, H.M.; Sheu, S.J. The behavior of the spectral gap under growing drift. Trans. Am. Math. Soc. 2010, 362, 1325–1350. [Google Scholar] [CrossRef] [Green Version]
- Landim, C.; Seo, I. Metastability of Nonreversible Random Walks in a Potential Field and the Eyring-Kramers Transition Rate Formula. Commun. Pure Appl. Math. 2018, 71, 203–266. [Google Scholar] [CrossRef]
- Landim, C.; Mariani, M.; Seo, I. Dirichlet’s and Thomson’s principles for non-selfadjoint elliptic operators with application to non-reversible metastable diffusion processes. Arch. Ration. Mech. Anal. 2019, 231, 887–938. [Google Scholar] [CrossRef] [Green Version]
- Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2012; Volume 3. [Google Scholar]
- Okamoto, M. Distinctness of the Eigenvalues of a Quadratic form in a Multivariate Sample. Ann. Statist. 1973, 1, 763–765. [Google Scholar] [CrossRef]
- Petersen, K.B.; Pedersen, M.S. The Matrix Cookbook; Technical University of Denmark: Lynby, Denmark, 2012; Available online: http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html (accessed on 21 July 2021).
- Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef] [Green Version]
- Chewi, S.; Le Gouic, T.; Lu, C.; Maunu, T.; Rigollet, P.; Stromme, A. Exponential ergodicity of mirror-Langevin diffusions. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 19573–19585. [Google Scholar]
- Bolley, F.; Villani, C. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des Sciences de Toulouse: Mathématiques; Université Paul Sabatier: Toulouse, France, 2005; Volume 14, pp. 331–352. [Google Scholar]
- Donsker, M.D.; Varadhan, S.S. Asymptotic evaluation of certain Markov process expectations for large time. IV. Commun. Pure Appl. Math. 1983, 36, 183–212. [Google Scholar] [CrossRef]
- Carlen, E.; Loss, M. Logarithmic Sobolev inequalities and spectral gaps. Contemp. Math. 2004, 353, 53–60. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).