Article

Accelerated Diffusion-Based Sampling by the Non-Reversible Dynamics with Skew-Symmetric Matrices

1 Communication Science Laboratories, NTT, Hikaridai, Seika-cho, “Keihanna Science City”, Kyoto 619-0237, Japan
2 Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
* Author to whom correspondence should be addressed.
Entropy 2021, 23(8), 993; https://doi.org/10.3390/e23080993
Submission received: 21 June 2021 / Revised: 14 July 2021 / Accepted: 27 July 2021 / Published: 30 July 2021
(This article belongs to the Special Issue Approximate Bayesian Inference)

Abstract:
Langevin dynamics (LD) has been extensively studied theoretically and practically as a basic sampling technique. Recently, the incorporation of non-reversible dynamics into LD has been attracting attention because it accelerates the mixing speed of LD. Popular choices for non-reversible dynamics include underdamped Langevin dynamics (ULD), which uses second-order dynamics, and perturbations with skew-symmetric matrices. Although ULD has been widely used in practice, the application of skew acceleration is limited, even though it is expected to show superior performance theoretically. Current work lacks a theoretical understanding of issues that are important to practitioners, including the selection criteria for skew-symmetric matrices, quantitative evaluations of acceleration, and the large memory cost of storing skew matrices. In this study, we theoretically and numerically clarify these problems by analyzing the acceleration, focusing on how the skew-symmetric matrix perturbs the Hessian matrix of potential functions. We also present a practical algorithm that accelerates the standard LD and ULD, using novel memory-efficient skew-symmetric matrices under parallel-chain Monte Carlo settings.

1. Introduction

Sampling is one of the most widely used techniques for approximating posterior distributions in Bayesian inference [1]. Markov chain Monte Carlo (MCMC) is widely used to obtain samples, and within MCMC, Langevin dynamics (LD) is a popular choice for sampling from high-dimensional distributions. Each sample in LD moves in the gradient direction with added Gaussian noise. LD efficiently explores around a mode of a target distribution using the gradient information without being trapped in local minima, thanks to the added Gaussian noise. Many previous studies proved LD’s superior performance theoretically and numerically [2,3,4,5]. Since non-reversible dynamics generally improves mixing performance [6,7], research on introducing non-reversible dynamics into LD for better sampling performance is attracting attention [8].
There are two widely known non-reversible dynamics for LD. One is underdamped Langevin dynamics (ULD) [9], which uses second-order dynamics. The other introduces a perturbation that multiplies the gradient by a skew-symmetric matrix [8]. Here we refer to such matrices as skew matrices for simplicity and to this perturbation technique as skew acceleration. Much theoretical research has been done on ULD [9,10,11], and ULD, also known as the stochastic gradient Hamiltonian Monte Carlo [12], is widely used in practice. In contrast, the application of skew acceleration to standard Bayesian models is quite limited, even though it is expected to show superior performance theoretically [8].
For example, skew acceleration has been analyzed with a focus on sampling from Gaussian distributions [13,14,15,16,17], although assuming Gaussian distributions in Bayesian models is restrictive in practice. A recent study [8] theoretically showed that skew acceleration accelerates the dynamics around the local minima and saddle points of non-convex functions. Another work [18] clarified theoretically and numerically that skew acceleration improves mixing speed when used as interactions between chains in parallel sampling schemes for non-convex Bayesian models.
Compared to ULD, what is lacking for skew acceleration is a theoretical understanding of issues that are important to practitioners. The most significant problem is that no theory exists for selecting skew matrices. Existing studies only show that introducing a skew matrix into LD results in equal or faster convergence, meaning that a poor choice of skew matrix yields no acceleration at all. Thus, choosing appropriate skew matrices is critical. Furthermore, although ULD’s acceleration has been analyzed quantitatively, existing studies have only analyzed skew acceleration qualitatively. This makes it difficult to justify the usefulness of skew acceleration in practice compared to ULD. Another issue is that introducing skew matrices requires a vast memory cost in many practical Bayesian models.
The purpose of this study is to solve these problems from theoretical and numerical viewpoints and establish a practical algorithm for skew acceleration. The following are the two major contributions of this work.
Our contribution 1: We present a convergence analysis of skew acceleration for standard Bayesian model settings, including non-convex potential functions, using Poincaré constants [19]. The major advantage of Poincaré constants is that we can analyze skew acceleration through the Hessian matrix and its eigenvalues and develop a practical theory about the selection of the skew matrix J and the quantitative assessment of skew acceleration.
Furthermore, we propose skew acceleration for ULD and present its convergence analysis for the first time. Since ULD shows faster convergence than LD, combining skew acceleration with ULD is promising.
Our contribution 2: We develop a practical skew-accelerated sampling algorithm for a parallel sampling setting with novel memory-efficient skew matrices. Since a naive implementation of skew acceleration requires a large memory cost to store skew matrices, memory efficiency is critical in practice. We also present a non-asymptotic theoretical analysis of our algorithm in both the LD and ULD settings under a stochastic gradient and Euler discretization. We clarify that introducing skew matrices accelerates the convergence of the continuous dynamics, although it increases the discretization and stochastic gradient error. Then, to the best of our knowledge, we propose the first algorithm that adaptively controls this trade-off using the empirical distribution of the parallel sampling scheme.
Finally, we verify our algorithm and theory in practical Bayesian problems and compare it with other sampling methods.
Notations: $I_d$ denotes the $d \times d$ identity matrix. Capital letters such as $X$ represent random variables, and lowercase letters such as $x$ represent non-random real values. $\langle\cdot,\cdot\rangle$, $\|\cdot\|$, and $|\cdot|$ denote the Euclidean inner product, distance, and absolute value, respectively.

2. Preliminaries

In this section, we briefly introduce the basic settings of LD and non-reversible dynamics for posterior distribution sampling in Bayesian inference.

2.1. LD and Stochastic Gradient LD

First, we introduce the notations and the basic settings of LD and stochastic gradient LD (SGLD), which is a practical extension of LD. Here $z_i$ denotes a data point in space $Z$, $|Z|$ denotes the total number of data points, and $x \in \mathbb{R}^d$ corresponds to the parameters of a given model, from which we want to sample. Our goal is to sample from the target distribution with density $d\pi(x) \propto e^{-\beta U(x)}dx$, where the potential function $U(x)$ is the empirical average of $u : \mathbb{R}^d \times Z \to \mathbb{R}$, i.e., $U(x) = \frac{1}{|Z|}\sum_{i=1}^{|Z|} u(x, z_i)$. Function $u(\cdot,\cdot)$ is continuous and non-convex; the explicit assumptions made for it are discussed in Section 3.1. The SGLD algorithm [2,3] is given as a recursion:
$$X_{k+1} = X_k - h \nabla \hat{U}(X_k) + \sqrt{2h\beta^{-1}}\,\epsilon_k,$$
where $h \in \mathbb{R}_+$ is a step size, $\epsilon_k \in \mathbb{R}^d$ is a standard Gaussian random vector, $\beta$ is a temperature parameter of $\pi$, and $\nabla\hat{U}(X_k)$ is a conditionally unbiased estimator of the true gradient $\nabla U(X_k)$. This unbiased estimate of the true gradient is suitable for large-scale data sets since, instead of the full gradient, we can use a stochastic version obtained from a randomly chosen subset of the data at each time step. This reduces the computational cost of calculating the gradient at each time step.
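As a concrete reference point, one step of the recursion in Equation (1) can be sketched as follows. The standard Gaussian target, step size, and chain length are illustrative assumptions; a true stochastic gradient would replace the full gradient with a minibatch estimate.

```python
import numpy as np

def sgld_step(x, grad_u_hat, h, beta, rng):
    """One (SG)LD update: x_{k+1} = x_k - h * grad + sqrt(2 h / beta) * noise."""
    noise = rng.standard_normal(x.shape)
    return x - h * grad_u_hat(x) + np.sqrt(2.0 * h / beta) * noise

# Toy target: pi = N(0, 1), i.e., U(x) = x^2 / 2 and grad U(x) = x.
rng = np.random.default_rng(0)
x = np.zeros(1)
samples = []
for k in range(20000):
    x = sgld_step(x, lambda y: y, h=0.01, beta=1.0, rng=rng)
    samples.append(x[0])
samples = np.array(samples[5000:])  # discard burn-in
```

For this quadratic potential, the empirical mean and variance of the retained samples approach 0 and 1, up to $O(h)$ discretization bias.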
The discrete-time Markov process in Equation (1) is the discretization of the continuous-time LD [2]:
$$dX_t = -\nabla U(X_t)\,dt + \sqrt{2\beta^{-1}}\,dw_t,$$
where $w_t$ denotes the standard Brownian motion in $\mathbb{R}^d$. The stationary measure of Equation (2) is $d\pi(x) \propto e^{-\beta U(x)}dx$.

2.2. Poincaré Inequality and Convergence Speed

In sampling, we are interested in the convergence speed to the stationary measure. The speed is often characterized by the generator associated with Equation (2), defined as:
$$\mathcal{L}f(X_t) := \lim_{s\to 0^+} \frac{\mathbb{E}(f(X_{t+s})\,|\,X_t) - f(X_t)}{s} = \left(-\nabla U(X_t)\cdot\nabla + \beta^{-1}\Delta\right) f(X_t),$$
where $\Delta$ denotes the standard Laplacian on $\mathbb{R}^d$, $f \in \mathcal{D}(\mathcal{L})$, and $\mathcal{D}(\mathcal{L}) \subset L^2(\pi)$ denotes the domain of $\mathcal{L}$. This $\mathcal{L}$ is a self-adjoint operator that has only a discrete spectrum (eigenvalues). We say that $\pi$ with $\mathcal{L}$ has a spectral gap if the smallest eigenvalue of $-\mathcal{L}$ (other than 0) is positive; we refer to it as $\rho_0\,(>0)$. This spectral gap is closely related to the Poincaré inequality. The internal energy is defined as:
$$\mathcal{E}(f) := -\int_{\mathbb{R}^d} f\,\mathcal{L}f\,d\pi.$$
Please note that $\mathcal{E}(f) \ge 0$ is satisfied. Then $\pi$ with $\mathcal{L}$ satisfies the Poincaré inequality with constant $c$ if, for any $f \in \mathcal{D}(\mathcal{L})$, $\pi$ with $\mathcal{L}$ satisfies:
$$\int f^2\,d\pi - \left(\int f\,d\pi\right)^2 \le c\,\mathcal{E}(f).$$
The spectral gap characterizes this constant: $c \le \frac{1}{\rho_0}$ holds (see Appendix A.2 for details). We refer to the best constant $c$ as the Poincaré constant [19]. For notational simplicity, we define $m_0 := \frac{1}{c}$ and also refer to this $m_0$ as the Poincaré constant.
In sampling, crucially, the Poincaré inequality dominates the convergence speed in $\chi^2$ divergence:
$$\int \left(\frac{d\mu_t}{d\pi} - 1\right)^2 d\pi =: \chi^2(\mu_t \,\|\, \pi) \le e^{-2 m_0 \beta^{-1} t}\,\chi^2(\mu_0 \,\|\, \pi),$$
where $\mu_t$ denotes the measure at time $t$ induced by Equation (2) and $\mu_0$ is the initial measure (see Appendix A.3 for details). Thus, the larger the Poincaré constant $m_0$ is, the faster the convergence.

2.3. Non-Reversible Dynamics

In this section, we introduce non-reversible dynamics. $\pi$ with $\mathcal{L}$ is reversible if, for any test functions $f, g \in \mathcal{D}(\mathcal{L})$, $\pi$ with $\mathcal{L}$ satisfies
$$\int_{\mathbb{R}^d} f\,\mathcal{L}g\,d\pi = \int_{\mathbb{R}^d} g\,\mathcal{L}f\,d\pi.$$
If this is not satisfied, $\pi$ with $\mathcal{L}$ is non-reversible [19].
We introduce two non-reversible dynamics for LD. The first is ULD, which is given as
$$dX_t = \Sigma^{-1} V_t\,dt,\qquad dV_t = -\nabla U(X_t)\,dt - \gamma \Sigma^{-1} V_t\,dt + \sqrt{2\gamma\beta^{-1}}\,dw_t,$$
where $V \in \mathbb{R}^d$ is an auxiliary random variable, $\gamma \in \mathbb{R}$ is a positive constant, and $\Sigma$ is the variance of the stationary distribution of the auxiliary random variable $V$. The stationary distribution is $\tilde{\pi} := \pi \otimes \mathcal{N}(0,\Sigma) \propto e^{-\beta U(x) - \frac{1}{2} v^\top \Sigma^{-1} v}$, where $\mathcal{N}$ denotes a Gaussian distribution. The superior performance of ULD compared with LD has been studied rigorously [9,10,11]. ULD’s convergence speed is also characterized by the Poincaré constant [20]. In practice, we use discretization and the stochastic gradient for ULD, which is called the stochastic gradient Hamiltonian Monte Carlo (SGHMC) [10]. The second non-reversible dynamics is the skew acceleration given as
$$dX_t = -(I + \alpha J)\nabla U(X_t)\,dt + \sqrt{2\beta^{-1}}\,dw_t,$$
where $J$ is a real-valued skew matrix and $\alpha \in \mathbb{R}_+$ is a positive constant. We call this dynamics S-LD. The stationary distribution of S-LD is still $\pi$, and S-LD shows faster convergence and a smaller asymptotic variance [13,14,15,18].
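A minimal Euler discretized sketch of Equation (9) for an assumed two-dimensional quadratic potential (the Hessian, step size, and perturbation strength below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha, h, beta = 2, 0.9, 0.01, 1.0

# A 2-D skew matrix (J^T = -J) with operator norm 1, satisfying Assumption 5.
J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
H = np.diag([1.0, 4.0])            # Hessian of the quadratic potential U(x) = x^T H x / 2
grad_U = lambda x: H @ x
drift = np.eye(d) + alpha * J      # the non-reversible perturbation (I + alpha J)

x = np.array([3.0, 3.0])
for k in range(50000):
    x = x - h * drift @ grad_U(x) + np.sqrt(2.0 * h / beta) * rng.standard_normal(d)
```

Because $(I + \alpha J)$ only rotates the gradient flow, the invariant measure is unchanged while the mixing can only improve, as analyzed in Section 3.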

3. Theoretical Analysis of Skew Acceleration

In this section, we present a theoretical analysis of skew acceleration in LD and ULD in standard Bayesian settings. We analyze the acceleration through the Poincaré constant and connect it with the eigenvalues of the Hessian matrix, which allows us to obtain a practical criterion for choosing skew matrices and to quantitatively evaluate the acceleration. In this section, we focus on a setting where a continuous SDE and the full gradient of the potential function are used. The discretized SDE and stochastic gradient are discussed in Section 4.

3.1. Acceleration Characterization by the Poincaré Constant

First, we introduce the same four assumptions as a previous work [2], which showed the existence of the Poincaré constant $m_0$ for LD (see Appendix C for details).
Assumption 1. 
(Upper bound of the potential function at the origin) Function u takes nonnegative real values and is twice continuously differentiable on $\mathbb{R}^d$, and constants $A$ and $B$ exist such that for all $z \in Z$,
$$|u(0,z)| \le A, \qquad \|\nabla u(0,z)\| \le B.$$
Assumption 2. 
(Smoothness) Function u has Lipschitz continuous gradients; for all $z \in Z$, a positive constant $M$ exists such that for all $x, y \in \mathbb{R}^d$,
$$\|\nabla u(x,z) - \nabla u(y,z)\| \le M \|x - y\|.$$
Assumption 3. 
(Dissipative condition) Function u satisfies the (m,b)-dissipative condition for all $z \in Z$; for all $x \in \mathbb{R}^d$, $m > 0$ and $b \ge 0$ exist such that
$$x \cdot \nabla u(x,z) \ge m\|x\|^2 - b.$$
Assumption 4. 
(Initial condition) The initial probability distribution $\mu_0$ of $X_0$ has a bounded and strictly positive density $p_0$ that satisfies
$$\kappa_0 := \log \int_{\mathbb{R}^d} e^{\|x\|^2} p_0(x)\,dx < \infty.$$
Please note that these assumptions allow us to consider non-convex potential functions, which are common in practical Bayesian models. Furthermore, we make the following assumption about $J$.
Assumption 5. 
The operator norm of J is bounded:
$$\|J\|_2 \le 1.$$
This means that the largest singular value of $J$ is at most 1.
Under these assumptions, we present the convergence behavior of skew acceleration using the Poincaré constant. First, we present the following S-LD result.
Theorem 1.
Under Assumptions 1–5, the S-LD of Equation (9) has exponential convergence:
$$\chi^2(\mu_t^\alpha \,\|\, \pi) \le e^{-2 m(\alpha) \beta^{-1} t}\,\chi^2(\mu_0 \,\|\, \pi),$$
where $\mu_t^\alpha$ is the measure at time $t$ induced by S-LD and $m(\alpha)$ is the Poincaré constant of S-LD defined by its generator
$$\mathcal{L}_\alpha f(x) := \left(-(I + \alpha J)\nabla U(x)\cdot\nabla + \beta^{-1}\Delta\right) f(x).$$
Furthermore, $m(\alpha)$ satisfies $m(\alpha) \ge m_0$.
The proof is shown in Appendix C. This theorem states that introducing the skew matrix accelerates the convergence of LD by improving the convergence rate from $m_0$ to $m(\alpha)$. Although [18] obtained a similar result, we use the Poincaré constant and derive an explicit criterion for when $m(\alpha) = m_0$ holds, as discussed below.
Next, we also introduce skew acceleration into ULD. Since ULD shows faster convergence than LD in standard Bayesian settings [10,11], combining skew acceleration with ULD is a promising way to obtain a more efficient sampling algorithm. For that purpose, we propose the following SDE:
$$dX_t = \Sigma^{-1} V_t\,dt + \alpha_1 J_1 \nabla U(X_t)\,dt,$$
$$dV_t = -\nabla U(X_t)\,dt - \gamma(\Sigma^{-1} + \alpha_2 J_2) V_t\,dt + \sqrt{2\gamma\beta^{-1}}\,dw_t,$$
where $J_1$ and $J_2$ are real-valued skew matrices and $\alpha_1$ and $\alpha_2$ are positive constants. We assume that $J_1$ and $J_2$ satisfy Assumption 5. We refer to this method as skew underdamped Langevin dynamics (S-ULD); its stationary distribution is $\tilde{\pi} = \pi \otimes \mathcal{N}(0,\Sigma)$. See Appendix B for details, which include discussions of other combinations of skew matrices. As for S-ULD, we need an additional assumption about the initial condition of $V_0$:
Assumption 6. 
(Initial condition) The initial probability distribution $\mu_0(x,v)$ of $(X_0, V_0)$ has a bounded and strictly positive density $p_0$ that satisfies
$$\kappa_0 := \log \int_{\mathbb{R}^{2d}} e^{\|x\|^2 + \|v\|^2} p_0(x,v)\,dx\,dv < \infty.$$
We then provide the following convergence theorem that resembles S-LD.
Theorem 2.
Under Assumptions 1–3, 5, and 6, S-ULD has exponential convergence in $\chi^2$ divergence, and its convergence rate is also characterized by $m(\alpha)$ as defined in Theorem 1. Thus, S-ULD’s convergence equals or exceeds that of ULD, whose convergence rate is characterized by $m_0$.
See Appendix C.2 for details. From these theorems, we confirm that skew acceleration is effective for both S-LD and S-ULD, and that the convergence speed is characterized by the Poincaré constant $m(\alpha)$ defined via Equation (16).
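For reference, a naive Euler–Maruyama sketch of S-ULD with $\Sigma = I$, $J_2$ dropped, and a standard Gaussian stand-in target (all illustrative assumptions; the discretization actually analyzed is given in Section 4):

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, gamma, beta, alpha1 = 2, 0.01, 1.0, 1.0, 0.5

J1 = np.array([[0.0, 1.0],
               [-1.0, 0.0]])        # skew matrix for the position update
grad_U = lambda x: x                # standard Gaussian target: U(x) = ||x||^2 / 2

x, v = np.ones(d), np.zeros(d)
for k in range(20000):
    eps = rng.standard_normal(d)
    # dX = Sigma^{-1} V dt + alpha_1 J_1 grad U dt   (Sigma = I here)
    x = x + h * v + h * alpha1 * (J1 @ grad_U(x))
    # dV = -grad U dt - gamma Sigma^{-1} V dt + sqrt(2 gamma / beta) dw
    v = v - h * grad_U(x) - h * gamma * v + np.sqrt(2.0 * gamma * h / beta) * eps
```

The skew term only enters the position update here; since $-J_1$ is also skew, the sign of the perturbation does not affect stationarity.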

3.2. Skew Acceleration from the Hessian Matrix

Our goal is to clarify which choices of $J$ induce $m(\alpha) > m_0$, i.e., acceleration. Therefore, we discuss how the Poincaré constant $m(\alpha)$ is connected to the eigenvalues and eigenvectors of the perturbed Hessian matrix $(I + \alpha J)\nabla^2 U(x)$. We first introduce the notations. We express the Hessian of $U(x)$ as $H(x)$ and the perturbed Hessian matrix as $H^\alpha(x) := (I + \alpha J) H(x)$. Please note that $H$ is a real symmetric matrix, which has real eigenvalues and is diagonalizable. On the other hand, since $H^\alpha$ is not symmetric, it can have complex eigenvalues, and its diagonalizability is not assured (see Appendix E). We express the pairs of eigenvectors and eigenvalues of $H^\alpha(x)$ as $\{(v_i^\alpha(x), \lambda_i^\alpha(x))\}_{i=1}^d$, ordered as $\mathrm{Re}(\lambda_1^\alpha(x)) \le \cdots \le \mathrm{Re}(\lambda_d^\alpha(x))$. Here, $\mathrm{Re}(\lambda_1^\alpha(x))$ expresses the real part of the complex value $\lambda_1^\alpha$, and $\mathrm{Im}$ denotes the imaginary part. We express those of $H(x)$ as $\{(v_i^0(x), \lambda_i^0(x))\}_{i=1}^d$ and order them as $\lambda_1^0(x) \le \cdots \le \lambda_d^0(x)$.

3.2.1. Strongly Convex Potential Function

Assume that $U$ is an $m$-strongly convex function; then for all $x \in \mathbb{R}^d$, $m \le \lambda_1^0(x)$ holds. The Poincaré constant $m_0$ of LD satisfies $m_0 = m$ [19]. For skew acceleration, the Poincaré constant $m(\alpha)$ is the best constant that satisfies $m(\alpha) \le \mathrm{Re}\,\lambda_1^\alpha(x)$ for all $x$ (see Appendix D.1). Therefore, studying the Poincaré constant is equivalent to studying the smallest real part of the eigenvalues of the Hessian matrix. Thus, the relation between $\lambda_1^0(x)$ and $\mathrm{Re}\,\lambda_1^\alpha(x)$ must be studied. The following theorem describes how the skew matrix changes the smallest eigenvalue.
Theorem 3.
For all $x \in \mathbb{R}^d$, the real parts of the eigenvalues of the perturbed Hessian $(I + \alpha J)H(x)$ satisfy
$$m \le \lambda_1^0(x) \le \mathrm{Re}\,\lambda_1^\alpha(x) \le \mathrm{Re}\,\lambda_d^\alpha(x) \le \lambda_d^0(x).$$
The condition for $\lambda_1^0(x) = \mathrm{Re}\,\lambda_1^\alpha(x)$ is shown in Remark 1.
Remark 1.
Denote the set of the eigenvectors of eigenvalue $\lambda_1^0(x)$ as $V_1^0$. If $V_1^0 = \{v\}$ and $Jv = 0$, then $\lambda_1^0(x) = \mathrm{Re}\,\lambda_1^\alpha(x)$ holds. If the cardinality of the set $V_1^0$ is larger than 1, and vectors $v, v' \in V_1^0$ exist such that $\lambda_1^0 \alpha J v = -(\mathrm{Im}\,\lambda_1^\alpha)\,v'$ and $\lambda_1^0 \alpha J v' = (\mathrm{Im}\,\lambda_1^\alpha)\,v$, then $\lambda_1^0(x) = \mathrm{Re}\,\lambda_1^\alpha(x)$ holds.
Refer to Appendix F for the proof. This is an extension of previous work [8,13]. If $\lambda_1^0(x) < \mathrm{Re}\,\lambda_1^\alpha(x)$ is satisfied for all $x$, we have $m_0 < m(\alpha)$, i.e., acceleration occurs. We discuss how to construct $J$ such that $\lambda_1^0(x) < \mathrm{Re}\,\lambda_1^\alpha(x)$ holds in Section 3.3.

3.2.2. Non-Convex Potential Function

Previous work [21] clarified that the Poincaré constant of a non-convex function is characterized by the negative eigenvalue at a saddle point. As shown in Figure 1, denote $x_1$ as the global minimum and $x_2$ as the local minimum with the second smallest value of $U(x)$. We express the saddle point with index one between $x_1$ and $x_2$, i.e., the point with exactly one negative eigenvalue, as $x^\star$. This means that the eigenvalues of $H(x^\star)$ satisfy $\lambda_1^0(x^\star) < 0 < \lambda_2^0(x^\star) \le \cdots \le \lambda_d^0(x^\star)$. Ref. [21] clarified that the saddle point $x^\star$ characterizes the Poincaré constant as
$$m_0^{-1} \approx \frac{1}{|\lambda_1^0(x^\star)|}\, e^{\beta\left(U(x^\star) - U(x_1) - U(x_2)\right)}.$$
When skew matrices are introduced, [8] clarified the following relation:
Theorem 4. 
([8]) $\lambda_1^\alpha(x^\star) \le \lambda_1^0(x^\star) < 0$, and equality holds only if $J v_1^\alpha(x^\star) = 0$.
Note that $\lambda_1^\alpha(x^\star)$ is not a complex number. Thus, skew acceleration decreases the negative eigenvalue, which leads to a larger Poincaré constant (see Appendix D.2) and results in faster convergence.
In conclusion, introducing the skew matrix changes the Hessian’s eigenvalues and increases the Poincaré constant. If $\lambda_1^0(x) < \mathrm{Re}\,\lambda_1^\alpha(x)$ is satisfied, this leads to faster convergence for both convex and non-convex potential functions.

3.3. Choosing J

In this section, we present a method for choosing $J$ that leads to $\lambda_1^0(x) < \mathrm{Re}\,\lambda_1^\alpha(x)$, which ensures acceleration, based on the equality conditions in Theorems 3 and 4. Combining these theorems, we obtain the following criterion:
Remark 2.
Given a point x, $\lambda_1^0(x) < \mathrm{Re}\,\lambda_1^\alpha(x)$ holds if either of the following conditions is satisfied: (i) when $V_1^0 = \{v\}$, $Jv \ne 0$ is satisfied; (ii) when $|V_1^0| > 1$, $Jv \ne 0$ holds for any $v \in V_1^0$, and for any $v, v' \in V_1^0$, $\lambda_1^0 \alpha J v = -(\mathrm{Im}\,\lambda_1^\alpha)\,v'$ and $\lambda_1^0 \alpha J v' = (\mathrm{Im}\,\lambda_1^\alpha)\,v$ are not satisfied.
The first condition (i) is easily satisfied if we choose $J$ such that $\mathrm{Ker}\,J = \{0\}$. On the other hand, the second condition (ii) is difficult to verify since $H$ and its eigenvalues and eigenvectors generally depend on the current position of $X_t$. Instead of evaluating the eigenvalues and eigenvectors of $H(x)$ and $(I + \alpha J)H(x)$ directly, we use the random matrix property shown in the next theorem.
Theorem 5.
Suppose the upper triangular entries of $J$ follow a probability distribution that is absolutely continuous with respect to the Lebesgue measure. If $\mathrm{Ker}\,J = \{0\}$ is satisfied, then given a point $x \in \mathbb{R}^d$, $\lambda_1^0(x) < \mathrm{Re}\,\lambda_1^\alpha(x)$ holds with probability 1.
The proof is given in Appendix G.1. Following this theorem, we simply generate $J$ from some probability distribution, such as the Gaussian distribution, and then check whether $\mathrm{Ker}\,J = \{0\}$ holds. If it does not, we generate a random matrix $J$ again.
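This generate-and-check procedure can be sketched as follows (the dimension and distribution are illustrative). Note that a real skew matrix of odd dimension always has $\det J = 0$, so $\mathrm{Ker}\,J = \{0\}$ can only hold for even $d$.

```python
import numpy as np

def random_skew(d, rng):
    """Skew-symmetric J with i.i.d. Gaussian entries above the diagonal."""
    upper = np.triu(rng.standard_normal((d, d)), k=1)
    return upper - upper.T

def has_trivial_kernel(J):
    """Ker J = {0} iff J has full rank."""
    return np.linalg.matrix_rank(J) == J.shape[0]

rng = np.random.default_rng(0)
d = 4                         # must be even: det J = 0 whenever d is odd
J = random_skew(d, rng)
while not has_trivial_kernel(J):
    J = random_skew(d, rng)   # regenerate, as in the procedure above
```

A Gaussian skew matrix of even dimension has full rank with probability 1, so the loop almost never iterates in practice.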
The above theorem is valid only at a given evaluation point x. We can extend the above theorem to all the points over the path of the discretized dynamics (see Appendix G.3). With this procedure, we can theoretically ensure that acceleration occurs with probability one for discretized dynamics.

3.4. Quantitative Evaluation of the Acceleration

So far, we have discussed skew acceleration qualitatively but not quantitatively. Although a quantitative evaluation of the acceleration is critical for practical purposes, to the best of our knowledge, no existing work has addressed it. In this section, we present a formula that quantitatively assesses skew acceleration by analyzing the eigenvalues of the Hessian matrix.
Theorem 6.
With the identical notation as in Theorem 3, for all x, we have
$$\mathrm{Re}\,\lambda_1^\alpha(x) = \lambda_1^0(x) + \alpha^2 \sum_{k=2}^{d} \frac{\lambda_1^0(x)\,\lambda_k^0(x)\,\left| v_k^0(x)^\top J v_1^0(x) \right|^2}{\lambda_k^0(x) - \lambda_1^0(x)} + O(\alpha^3).$$
In particular, at the saddle point $x^\star$, we have
$$\lambda_1^\alpha(x^\star) = \lambda_1^0(x^\star) + \alpha^2 \sum_{k=2}^{d} \frac{\lambda_1^0(x^\star)\,\lambda_k^0(x^\star)\,\left| v_k^0(x^\star)^\top J v_1^0(x^\star) \right|^2}{\lambda_k^0(x^\star) - \lambda_1^0(x^\star)} + O(\alpha^3).$$
The proofs are shown in Appendix H. Focusing on Equation (22), if $U(x)$ is a strongly convex function, then for all $k > 1$, $\lambda_k(x) > \lambda_1(x) > 0$ holds, and the second term in Equation (22) is positive. From this, $\mathrm{Re}\,\lambda_1^\alpha(x) > \lambda_1^0(x)$ holds. A similar relation holds for $\mathrm{Re}(\lambda_d^\alpha(x))$. In Equation (23), $\lambda_1^\alpha(x^\star) < \lambda_1^0(x^\star) < 0$ holds. Thus, the changes of the Poincaré constant are proportional to $\alpha^2$. With these formulas, we can quantitatively evaluate the acceleration. We present numerical experiments confirming our theoretical findings in Section 6.1.
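The second-order formula can be checked numerically on a synthetic strongly convex Hessian; the eigenvalues, the random orthonormal basis, and the small value of $\alpha$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 4, 1e-3

# Synthetic symmetric Hessian with known, distinct eigenvalues.
lam = np.array([1.0, 2.0, 3.5, 5.0])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # columns: eigenvectors of H
H = Q @ np.diag(lam) @ Q.T

upper = np.triu(rng.standard_normal((d, d)), k=1)
J = upper - upper.T                                # skew perturbation direction

# Exact smallest real part of the spectrum of the perturbed Hessian (I + alpha J) H.
exact = np.min(np.linalg.eigvals((np.eye(d) + alpha * J) @ H).real)

# Second-order prediction of the expansion above.
pred = lam[0] + alpha**2 * sum(
    lam[0] * lam[k] * abs(Q[:, k] @ J @ Q[:, 0])**2 / (lam[k] - lam[0])
    for k in range(1, d)
)
```

Shrinking $\alpha$ by a factor of 10 shrinks the gap $\mathrm{Re}\,\lambda_1^\alpha - \lambda_1^0$ by roughly 100, consistent with the $\alpha^2$ scaling.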

4. Practical Algorithm for Skew Acceleration

In this section, we discuss skew acceleration in more practical settings compared to Section 3. First, we discuss the memory issue for storing J and the discretization of SDE and the stochastic gradient, which are widely used techniques in Bayesian inference. Finally, we present a practical algorithm for skew acceleration.

4.1. Memory Issue of Skew Acceleration and Ensemble Sampling

For $d$-dimensional Bayesian models, we need $O(d^2)$ memory to store a skew matrix $J$, which is difficult for high-dimensional models. Instead of storing $J$, we could randomly generate a new $J$ at each time step following Theorem 5. However, we experimentally confirmed that using a different $J$ at each step does not accelerate the convergence (see Section 6). Thus, we need to use a fixed $J$ during the iterations.
As discussed below, we found that the previously proposed accelerated parallel sampling [18] can serve as a practical way to resolve this memory issue. In that method, we simultaneously update $N$ samples of the model’s parameters with correlation. Since a correlation exists among the multiple Markov chains in such a parallel sampling scheme, it is more efficient than a naive parallel-chain MCMC, where the samples are independent. We express the $n$-th sample at time $t$ as $X_t^{(n)} \in \mathbb{R}^d$ and the joint state of all samples at time $t$ as $X_t^N := (X_t^{(1)}, \ldots, X_t^{(N)}) \in \mathbb{R}^{dN}$. We express the joint stationary measure as $\pi^N := \pi \otimes \cdots \otimes \pi \propto e^{-\beta \sum_{i=1}^N U(x^{(i)})}$ and the sum of the potential functions as $U^N := \sum_{i=1}^N U(x^{(i)})$. We then consider the following dynamics:
$$dX_t^N = -(I_{dN} + \alpha J)\nabla U^N(X_t^N)\,dt + \sqrt{2\beta^{-1}}\,dw_t,$$
$$\nabla U^N(X_t^N) := \left(\nabla U(X_t^{(1)}), \ldots, \nabla U(X_t^{(N)})\right).$$
We call this dynamics skew parallel LD (S-PLD); it couples $N$-independent parallel LD (PLD) through the skew matrix. Since each chain in PLD is independent of the others, the Poincaré constant of PLD is also $m_0$. Ref. [18] argued that the Poincaré constant of S-PLD, $m(\alpha, N)$, satisfies $m(\alpha, N) \ge m_0$, which means that S-PLD shows faster convergence than PLD. As discussed in Section 3.2, these Poincaré constants are characterized by the smallest eigenvalues of the Hessian matrices $\nabla^2 U^N(x^N)$ and $(I_{dN} + \alpha J)\nabla^2 U^N(x^N)$, where $x^N \in \mathbb{R}^{dN}$. We denote these smallest eigenvalues as $\lambda_1^0(x^N)$ and $\mathrm{Re}\,\lambda_1^\alpha(x^N)$; acceleration occurs if $\lambda_1^0(x^N) < \mathrm{Re}\,\lambda_1^\alpha(x^N)$ is satisfied.
Ref. [18] did not specify the choice of $J$, and a naive construction of $J$ requires $O(d^2 N^2)$ memory. To reduce the memory cost, we propose the following skew matrix:
$$J := J_0 \otimes I_d,$$
where $J_0$ is an $N \times N$ skew matrix and $\otimes$ is the Kronecker product. We then have the following lemma:
Lemma 1.
If $J_0$ is generated based on Theorem 5 and $\mathrm{Ker}\,J_0 = \{0\}$ is satisfied, then given a point $x^N$, $J$ does not satisfy the equality conditions in Theorems 3 and 4, which means that $\lambda_1^0(x^N) < \mathrm{Re}\,\lambda_1^\alpha(x^N)$ holds with probability 1.
See Appendix G.2 for the proof. Thus, from this lemma, we only need to prepare and store $J_0$, which requires $O(N^2)$ memory, independent of $d$. In practical settings, this is a significant reduction of the memory size since the number of parallel chains is typically much smaller than the dimension of the model. Please note that we can ensure the acceleration with this $J$.
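Because $(J_0 \otimes I_d)$ acts block-wise on the stacked chain vector, the perturbation can be applied without ever materializing the $dN \times dN$ matrix; a sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 1000                       # number of chains x model dimension

upper = np.triu(rng.standard_normal((N, N)), k=1)
J0 = upper - upper.T                 # N x N skew matrix: O(N^2) memory, not O(d^2 N^2)

G = rng.standard_normal((N, d))      # row n holds the gradient of chain n

# Applying J = J0 (x) I_d to the stacked gradients is just a small matmul:
# (J0 (x) I_d) vec(G) == vec(J0 @ G) when chains are stacked row-wise.
skew_term = J0 @ G                   # shape (N, d)
```

The identity can be verified against an explicit Kronecker product on a small slice, which is exactly what the test below does.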
Lemma 2.
Under Assumptions 1–5, assume $J$ satisfies the condition of Lemma 1. Then S-PLD shows
$$\chi^2(\mu_t^{\alpha,N} \,\|\, \pi^N) \le e^{-2 m(\alpha,N) \beta^{-1} t}\, \chi^2(\mu_0^N \,\|\, \pi^N),$$
where $\mu_t^{\alpha,N}$ is the measure at time $t$ induced by S-PLD, and $\mu_0^N$ is the initial measure defined as the product measure of $\mu_0$.
See Appendix I.1 for the proofs. Thus, combined with Lemma 1, S-PLD converges faster than PLD. We also consider the ensemble version of ULD (parallel ULD (PULD)) and its skew-accelerated version:
$$dX_t^N = \Sigma^{-1} V_t^N\,dt + \alpha_1 J_1 \nabla U^N(X_t^N)\,dt,\qquad dV_t^N = -\nabla U^N(X_t^N)\,dt - \gamma(\Sigma^{-1} + \alpha_2 J_2) V_t^N\,dt + \sqrt{2\gamma\beta^{-1}}\,dw_t,$$
where $J_1, J_2 \in \mathbb{R}^{dN \times dN}$ are real-valued skew-symmetric matrices, $\alpha_1, \alpha_2 \in \mathbb{R}_+$ are positive constants, and $V_t^N = (V_t^{(1)}, \ldots, V_t^{(N)}) \in \mathbb{R}^{dN}$. We refer to this dynamics as skew PULD (S-PULD); its faster convergence can be assured similarly to Lemma 2, as shown in Appendix I.2.

4.2. Discussion of the Discretization of SDE and Stochastic Gradient and Practical Algorithm

In this section, we consider further practical settings for S-PLD and S-PULD. We discretize these continuous dynamics, e.g., by the Euler–Maruyama method, and approximate the gradient by the stochastic gradient. Although introducing skew matrices accelerates the convergence of the continuous dynamics, it simultaneously increases the discretization and stochastic gradient error, resulting in a trade-off. We present a practical algorithm that controls this trade-off.

4.2.1. Trade-off Caused by Discretization and Stochastic Gradient

We consider the following discretizations with stochastic gradients for S-PLD and S-PULD:
$$X_{k+1}^N = X_k^N - h (I_{dN} + \alpha J)\nabla\hat{U}^N(X_k^N) + \sqrt{2h\beta^{-1}}\,\epsilon_k,$$
and
$$X_{k+1}^N = X_k^N + \Sigma^{-1} V_k^N h + \alpha J \nabla\hat{U}^N(X_k^N) h,\qquad V_{k+1}^N = V_k^N - \nabla\hat{U}^N(X_k^N) h - \gamma \Sigma^{-1} V_k^N h + \sqrt{2\gamma\beta^{-1}h}\,\epsilon_k,$$
where $\epsilon_k \in \mathbb{R}^{dN}$ is a standard Gaussian random vector and $\nabla\hat{U}^N(X^N)$ is an unbiased estimator of the gradient $\nabla U^N(X^N)$. We refer to Equation (29) as skew-SGLD and to Equation (30) as skew-SGHMC. For skew-SGHMC, we dropped $J_2$ of S-PULD to decrease the number of parameters, as shown in Appendix B. Please note that skew-SGLD is identical to the previous dynamics [18]. We introduce an assumption about the stochastic gradient:
Assumption 7. 
(Stochastic gradient) There exists a constant $\delta \in [0,1)$ such that
$$\mathbb{E}\left[\|\nabla\hat{U}(x) - \nabla U(x)\|^2\right] \le 2\delta\left(M^2\|x\|^2 + B^2\right).$$
Given a test function $f$ that is $L_f$-Lipschitz, we approximate $\int f\,d\pi$ by skew-SGLD or skew-SGHMC with the estimator $\frac{1}{N}\sum_{n=1}^N f(X_k^{(n)})$. The bias of skew-SGLD is upper-bounded as follows:
Theorem 7.
Under Assumptions 1–7, for any $k \in \mathbb{N}$ and any $h \in \left(0, 1 \wedge \frac{m}{4M^2}\right)$ obeying $kh \ge 1$ and $\beta m \ge 2$, we have
$$\left| \mathbb{E}\,\frac{1}{N}\sum_{n=1}^N f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi \right| \le L_f \Big( \underbrace{C_1(\alpha)\, kh}_{(i)} + \underbrace{C_2\, e^{-\beta^{-1} m(\alpha,N) kh}}_{(ii)} \Big),$$
where $C_1$ and $C_2$ depend on the constants in Assumptions 1–7; for the details, see Appendix J.
We present a tighter bias bound in Section 4.3 under a stronger assumption. A similar upper bound can be shown for skew-SGHMC using the same proof strategy. This bound resembles a previous one [18], but ours shows an improved dependency on $kh$. The previous results of [18] are also limited to LD and do not cover skew-SGHMC.
Please note that $(i)$ corresponds to the discretization and stochastic gradient error, and $(ii)$ corresponds to the convergence behavior of S-PLD, the continuous dynamics. Since $C_1(\alpha) \ge C_1(\alpha = 0)$, skew acceleration increases the discretization and stochastic gradient error. On the other hand, since $m(\alpha, N) \ge m_0$, the convergence of the continuous dynamics is accelerated. Thus, skew acceleration causes a trade-off. When $\alpha$ is sufficiently small, we can derive the explicit dependency of this trade-off on $\alpha$ from an asymptotic expansion. Using the quantitative evaluation of skew acceleration in Theorem 6, we obtain
$$\left| \mathbb{E}\,\frac{1}{N}\sum_{n=1}^N f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi \right| \le \underbrace{(d_1\alpha + d_2\alpha^2)\,kh}_{(i)} - \underbrace{\alpha^2 d_0\, e^{-\beta^{-1} m_0 kh}}_{(ii)} + O(\alpha^3) + \mathrm{const},$$
where $d_0$ to $d_2$ are positive constants obtained by the asymptotic expansion; see Appendix K for the details. In the above expression, $(i)$ and $(ii)$ correspond to $(i)$ and $(ii)$ of Equation (32). Thus, by choosing an appropriate $\alpha$, we can control the trade-off.

4.2.2. Practical Algorithm Controlling the Trade-off

Since calculating the optimal $\alpha$ that minimizes Equation (33) at each step is computationally demanding, we adaptively tune the value of $\alpha$ by measuring the acceleration with the kernelized Stein discrepancy (KSD) [22]. Our idea is to update samples under the different strengths $\alpha$ and $\alpha + \eta$ and to compare the KSD between the stationary and empirical distributions for these different interaction strengths. Here, $\eta \in \mathbb{R}_+$ is a small increment of $\alpha$. We denote the samples at the $(k+1)$-th step obtained by Equation (29) as $X_{k+1,\alpha}^N := X_{k,\alpha}^N - h(I_{dN} + \alpha J)\nabla\hat{U}^N(X_{k,\alpha}^N) + \sqrt{2h\beta^{-1}}\,\epsilon_k$ (or by Equation (30) as $X_{k+1,\alpha}^N := X_k^N + \Sigma^{-1} V_k^N h + \alpha J \nabla\hat{U}^N(X_k^N) h$). We denote the samples obtained by replacing the above $\alpha$ with $\alpha + \eta$ as $X_{k+1,\alpha+\eta}^N$. We denote the KSD between the measure of $X_{k+1,\alpha}^N$ and the stationary measure $\pi$ as $KSD(k+1, \alpha)$ and estimate the difference of the empirical KSDs:
$$\Delta := \widehat{KSD}(k+1, \alpha) - \widehat{KSD}(k+1, \alpha+\eta),$$
where KSD is estimated by
K S D ^ ( k , α ) = 1 N ( N 1 ) i = 1 N u q ( X k , α ( i ) , X k , α ( j ) ) ,
$$u_q(x,x') := \nabla_x\log\pi(x)^\top l(x,x')\,\nabla_{x'}\log\pi(x') + \nabla_x\log\pi(x)^\top\nabla_{x'} l(x,x') + \nabla_x l(x,x')^\top\nabla_{x'}\log\pi(x') + \mathrm{Tr}\bigl(\nabla_{x,x'} l(x,x')\bigr),$$
where $l$ denotes a kernel; we use an RBF kernel. $\Delta > 0$ indicates that the empirical distribution of $X_{k+1,\alpha+\eta}^N$ is closer to the stationary distribution than that of $X_{k+1,\alpha}^N$; thus, we increase the interaction strength from $\alpha$ to $\alpha+\eta$. If $\Delta < 0$, we decrease it to $\alpha-\eta$ and also shrink $\eta$ to $c\eta$, where $c \in (0,1]$. The overall process is shown in Algorithm 1. Detailed discussions of the algorithm, including how to select $\alpha_0$, $\eta_0$, and $c$, are given in Appendix L.
Algorithm 1 Tuning α
Input:  X k N , η k , α k , c
Output:  α k + 1 , η k + 1
1: Calculate $X_{k+1,\alpha_k}^N$ and $X_{k+1,\alpha_k+\eta_k}^N$.
2: Calculate $\Delta := \widehat{\mathrm{KSD}}(k+1,\alpha_k) - \widehat{\mathrm{KSD}}(k+1,\alpha_k+\eta_k)$
3: if $\Delta > 0$ then
4:   Update $\alpha_{k+1} = \alpha_k + \eta_k$
5:   Update $\eta_{k+1} = \eta_k$
6: else
7:   Update $\alpha_{k+1} = |\alpha_k - \eta_k|$
8:   Update $\eta_{k+1} = c\,\eta_k$
9: end if
Finally, we present Algorithm 2, which describes the whole procedure. We update the value of $\alpha$ once every $k'$ steps. Please note that its computational cost is not much larger than that of Equation (30): we calculate the eigenvalues of $J$ only once, which requires $O(N^3)$, and evaluating the different KSDs is computationally inexpensive since we can re-use the gradients, which are the most computationally demanding part.
Algorithm 2 Proposed algorithm
Input:  $X_0^N, h, \alpha_0, \eta, k', K, c, (V_0^N, \gamma, \Sigma^{-1})$
Output:  X K N
1: Make an $N \times N$ random matrix $J_0$ and check $\ker J_0 = \{0\}$
2: Set $J = J_0 \otimes I_d$
3: for $k = 0$ to $K$ do
4:   if $k \bmod k' = 0$ then
5:     Update $\alpha$ by Algorithm 1
6:   end if
7:   Update $X_k^N$ by Equation (29) (for skew-SGLD)
8:   (Update $(X_k^N, V_k^N)$ by Equation (30) for skew-SGHMC)
9: end for
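The memory-efficient structure $J = J_0 \otimes I_d$ means the full $dN \times dN$ skew matrix never needs to be formed: applying it to the stacked per-chain gradients reduces to multiplying the $N \times d$ gradient matrix by $J_0$. The sketch below is our own illustrative code (with a full gradient of a quadratic potential standing in for the stochastic gradient $\nabla\hat{U}^N$), not the authors' implementation.

```python
import numpy as np

def make_skew_J0(N, rng):
    """Memory-efficient skew matrix: only the N x N factor J0 is stored;
    the full J = J0 (Kronecker) I_d is never materialized."""
    A = rng.standard_normal((N, N))
    return A - A.T                      # skew-symmetric: J0^T = -J0

def skew_sgld_step(X, grad_U, J0, alpha, h, beta, rng):
    """One skew-SGLD step for N parallel chains (Equation (29) style):
    X <- X - h (I + alpha J) grad U(X) + sqrt(2 h / beta) * noise.
    With row-wise stacking, (J0 kron I_d) vec(G) = vec(J0 @ G)."""
    G = np.stack([grad_U(x) for x in X])          # (N, d) gradients
    drift = G + alpha * (J0 @ G)                  # (I + alpha J) grad U
    noise = rng.standard_normal(X.shape)
    return X - h * drift + np.sqrt(2 * h / beta) * noise

# Usage sketch on the quadratic potential U(x) = ||x||^2 / 2.
rng = np.random.default_rng(1)
N, d = 20, 5
J0 = make_skew_J0(N, rng)
X = rng.standard_normal((N, d))
for _ in range(200):
    X = skew_sgld_step(X, lambda x: x, J0, alpha=0.1, h=1e-2, beta=1.0, rng=rng)
```

Storing only $J_0$ reduces the memory cost from $O(N^2 d^2)$ for a dense $J$ to $O(N^2)$.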

4.3. Refined Analysis for the Bias of Skew-SGLD

When using a constant step size for skew-SGLD, the bound in Theorem 7 is meaningless since the first term of Equation (32) will diverge. Here, following [23], we present a tighter bound for the bias of skew-SGLD under a stronger assumption.
Theorem 8.
Under Assumptions 1–7, for any $k \in \mathbb{N}$ and any $h \in \bigl(0,\ 1 \wedge \frac{\lambda(\alpha,N)}{4\sqrt{2}M^2} \wedge \frac{m}{4M^2}\bigr)$ obeying $kh \ge 1$ and $\beta m \ge 2$, we have
$$\mathbb{E}\left|\frac{1}{N}\sum_{n=1}^{N} f\bigl(X_k^{(n)}\bigr) - \int_{\mathbb{R}^d} f\,d\pi\right| \le L_f\sqrt{\frac{2}{\lambda(\alpha,N)}\,e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0\|\pi)} + \frac{C_3(\alpha)}{\lambda(\alpha,N)},$$
where
$$\lambda(\alpha,N) := \left[\bigl(1 + m(\alpha,N)^{-1}\beta\,C(m_0)\bigr)\left(\frac{\sqrt{2\pi e}}{2} + \frac{3}{2}\,m(\alpha,N)^{-1}\right)\right]^{-1}$$
and constants C 3 ( α ) and C ( m 0 ) depend on the constants of Assumptions 1–7. Moreover, λ ( α , N ) satisfies λ ( α , N ) λ ( α = 0 , N ) . For the details, see Appendix M.
The proof is shown in Appendix M. Please note that even if we use a constant step size for skew-SGLD, the bound in Theorem 8 does not diverge, although we need a stronger assumption on the step size than in Theorem 7. From Equation (37), the convergence behavior is characterized by $\lambda(\alpha,N)$, and the bias bound becomes smaller as $\lambda(\alpha,N)$ becomes larger. From the definition of $\lambda(\alpha,N)$, the larger $m(\alpha,N)$ is, the larger $\lambda(\alpha,N)$ we obtain. Thus, as we have seen so far, introducing the skew matrices leads to a larger Poincaré constant and hence to a larger $\lambda(\alpha,N)$.
Previous work [18] clarified that if $\alpha$ is sufficiently small, introducing skew matrices improves the Poincaré constant by a constant factor, i.e., $m(\alpha,N) = m_0 + O(\alpha^2)$, where the $O(\alpha^2)$ term depends on the eigenvectors and eigenvalues of the generator $\mathcal{L}$. On the other hand, from Theorem 8, for any $\xi > 0$, to achieve a bias smaller than $\xi$ it suffices to run skew-SGLD for at least $k \ge \frac{2}{\lambda(\alpha,N)h}\ln\bigl(\frac{L_f}{\xi}\sqrt{\frac{2\,\mathrm{KL}(\mu_0\|\pi)}{\lambda(\alpha,N)}}\bigr)$ iterations with an appropriate step size $h$, under the assumption that $\delta$ and $\alpha$ are small enough (see Appendix M.2 for details). Combining these observations, introducing skew matrices into SGLD improves the computational complexity by a constant order. Our numerical experiments show that even this constant-order improvement results in faster convergence in practical Bayesian models.

5. Related Work

In this section, we discuss the relationship between our method and other sampling methods.

5.1. Relation to Non-Reversible Methods

As discussed in Section 1, our work extends the existing analyses of non-reversible dynamics [8,18] and presents a practical algorithm. Compared to those previous works, we focus on the practical setting of Bayesian sampling and derive an explicit condition on $J$ for acceleration. We also derived a formula that quantitatively evaluates skew acceleration, based on an asymptotic expansion of the eigenvalues of the perturbed Hessian matrix. A previous work [24] derived the optimal skew matrices when the target distribution is Gaussian; however, it requires $O(d^3)$ computational cost to obtain them, and it is unclear whether it works for non-convex potential functions. In contrast, our construction method for skew matrices is simple, computationally cheap, and applicable to general Bayesian models.
Our work analyzes skew acceleration for ULD, which is more effective than LD in practical problems, whereas previous works [8,18] only analyzed skew acceleration for LD. A previous work [17] combined a non-reversible drift term with ULD; unlike ours, its purpose was to reduce the asymptotic variance of the expectation of a test function, and it mainly focused on sampling from Gaussian distributions.
To the best of our knowledge, our work is the first to focus on the memory issue of skew acceleration and to develop a memory-efficient skew matrix for ensemble sampling. Our work is also the first to present an algorithm that controls the trade-off. A previous work [18] identified the trade-off and handled it by cross-validation, which is unfortunately computationally inefficient.
Finally, we point out an interesting connection between our skew-SGHMC and the magnetic HMC (M-HMC) [25]. M-HMC accelerates HMC's mixing by introducing a "magnetic" term into the Hamiltonian, and that magnetic term is expressed by special skew matrices. Although a previous work [25] argued that M-HMC is numerically superior to standard HMC, its theoretical properties remain unclear. Thus, our work can be used to analyze the theoretical behavior of M-HMC.

5.2. Relation to Ensemble Methods

Our proposed algorithm is based on ensemble sampling [26]. Ensemble sampling, in which multiple samples are simultaneously updated with interaction, has been attracting attention numerically and theoretically because of improvements in memory size, computational power, and parallel processing schemes [26]. Successful, widely used ensemble methods include SVGD [27] and SPOS [28], with which we compare our proposed method numerically in Section 6. Although both show good numerical performance, it is unclear how the interaction term theoretically accelerates convergence, since they are formulated as McKean–Vlasov processes, which are non-linear dynamics, complicating the establishment of a finite-sample convergence rate. Our algorithm is an extension of another work [18], in which the interaction is composed of a skew-acceleration term and can be rigorously analyzed. Compared to that previous work [18], we analyzed skew acceleration with a focus on the Hessian matrix, developed practical algorithms as discussed in Section 4.2, and derived the explicit condition under which acceleration occurs, which was previously unclear [18].
Another difference among SPOS, SVGD, and [18] is that they use first-order methods, whereas our approach uses a second-order method. Little work has been done on ensemble sampling for second-order dynamics. Recently, a second-order ensemble method was proposed [29] based on gradient-flow analysis. Although it showed good numerical performance, its theoretical properties for finite samples remain unclear, since the scheme is a finite-sample approximation of the gradient flow. In contrast, our proposed method is a valid sampling scheme with a non-asymptotic guarantee.

6. Numerical Experiments

The purpose of our numerical experiments is to confirm the acceleration of the algorithm proposed in Section 4 on various commonly used Bayesian models, including a Gaussian distribution (toy data), latent Dirichlet allocation (LDA), and Bayesian neural network regression and classification (BNN). We compared our algorithm's performance with other ensemble sampling methods: SVGD, SPOS, standard SGLD, and SGHMC. In all the experiments, the values and error bars are the means and standard deviations over repeated trials. For all the experiments, we set $\gamma = 1$ and $\Sigma^{-1} = 300$ for SGHMC and skew-SGHMC. The selection criteria for the hyperparameters of our proposed algorithm are discussed in Appendix L.

6.1. Toy Data Experiment

The target distribution is a multivariate Gaussian, $\pi = \mathcal{N}(\mu, \Omega)$, where we generated $\Omega^{-1} = A^\top A$ with each element of $A \in \mathbb{R}^{2d\times d}$ drawn from the standard Gaussian distribution. The dimension of the target distribution is $d = 50$, which we approximate by 20 samples using the proposed ensemble methods. We chose this toy example because LD for this target distribution is the Ornstein–Uhlenbeck process, whose theoretical properties have been studied extensively, e.g., [30]. Thus, by studying the convergence behavior on these toy data, we can understand our proposed method more clearly.
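The toy target can be constructed as follows (an illustrative sketch; we assume $\mu = 0$, since the paper does not specify the mean):

```python
import numpy as np

# Construct the toy target N(mu, Omega) with Omega^{-1} = A^T A,
# A in R^{2d x d} with i.i.d. standard Gaussian entries (Section 6.1).
rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((2 * d, d))
precision = A.T @ A                 # Omega^{-1}, almost surely positive definite
mu = np.zeros(d)                    # mean (assumed zero here for illustration)

def grad_log_pi(x):
    """Score of the Gaussian target: grad log pi(x) = -Omega^{-1} (x - mu)."""
    return -precision @ (x - mu)
```

Since $A$ has more rows than columns, $A^\top A$ is full-rank with probability one, so the precision matrix is positive definite.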
First, we confirmed how the skew-symmetric matrix affects the eigenvalues of the Hessian matrix, as discussed in Section 3, where we only showed the asymptotic expansion for the smallest real part of the eigenvalues and for the saddle point. Here we show a similar expansion for the largest real part:
$$\mathrm{Re}\,\lambda_{dN}^{\alpha} = \lambda_{dN}^{0} + \alpha^{2}\sum_{k=1}^{dN-1}\frac{\lambda_{dN}^{0}\lambda_{k}^{0}\bigl|v_{k}^{0\top} J v_{dN}^{0}\bigr|^{2}}{\lambda_{k}^{0} - \lambda_{dN}^{0}} + O(\alpha^{3}).$$
Since $\lambda_k^0 \le \lambda_{dN}^0$ for $k < dN$, each summand is non-positive, so $\mathrm{Re}\,\lambda_{dN}^{\alpha} \le \lambda_{dN}^{0}$ holds.
Then we observed how the largest and smallest real parts of the eigenvalues of $(I + \alpha J)\Omega^{-1}$ depend on $\alpha$. The results are shown in Figure 2, where we averaged over 10 trials with randomly generated $J$ and fixed $A$. The upper-left, upper-right, and lower panels show $\mathrm{Re}(\lambda_1(\alpha))$, $\mathrm{Re}(\lambda_{dN}(\alpha))$, and $\mathrm{Re}(\lambda_1(\alpha))/\mathrm{Re}(\lambda_{dN}(\alpha))$, respectively. These behaviors are consistent with Theorem 3: when $\alpha$ is small, the behavior is close to the quadratic function proved in Theorem 3.
Next, we observed the convergence behavior of skew-SGLD and skew-SGHMC. We measured convergence by the maximum mean discrepancy (MMD) [31] between the empirical and stationary distributions. For the MMD, we used 2000 samples from the target distribution and a Gaussian kernel whose bandwidth is set to the median distance among these 2000 samples. We used gradient descent (GD) with step size $h = 1\times10^{-4}$. The results are shown in Figure 3. The proposed method converges faster than naive parallel sampling, which is consistent with Table 2.
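The MMD with a median-heuristic Gaussian kernel can be estimated as below. This is an illustrative sketch of the evaluation metric, not the paper's evaluation code; the biased (V-statistic) estimator is used so the value is always non-negative.

```python
import numpy as np

def mmd_gaussian(X, Y, bandwidth=None):
    """Biased MMD^2 estimate with a Gaussian kernel; the bandwidth defaults
    to the median pairwise distance of the target samples Y (median heuristic)."""
    def sqdist(P, Q):
        return np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        D = np.sqrt(sqdist(Y, Y))
        bandwidth = np.median(D[D > 0])     # median heuristic on the target
    k = lambda P, Q: np.exp(-sqdist(P, Q) / (2 * bandwidth ** 2))
    # ||mean embedding of X - mean embedding of Y||^2 in the RKHS
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

A value near zero indicates that the sampler's empirical distribution matches the target; larger values indicate a mismatch.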

6.2. LDA Experiment

We tested an LDA model on the ICML dataset [32], following the same setting as [33]. We used 20 samples for all the methods, a minibatch size of 100, and a step size $h = 5\times10^{-4}$. First, we confirmed the effectiveness of our proposed Algorithm 1, which adaptively tunes $\alpha$. For that purpose, we compared the final performance of our method with a previous method [18] in which $\alpha$ is selected by cross-validation (CV); here, instead of CV, we simply fixed $\alpha$ during sampling and refer to this as fixed $\alpha$. We also tested the case where $J$ is generated randomly at each step with fixed $\alpha$, as discussed in Section 4.1, and refer to it as random $J$. The results with skew-SGLD are shown in Figure 4. Our method was competitive with the best fixed-$\alpha$ performance. As for computational cost, we used $k' = 2$ in Algorithm 2, and our method needed twice the wall-clock time of each fixed-$\alpha$ run; this still greatly reduces the total computational time, since CV over fixed $\alpha$ requires trying more than two values of $\alpha$. We also found that using a different $J$ at each step did not accelerate convergence, so $J$ must be stored and fixed during sampling for acceleration. Next, we compared our method with other ensemble sampling schemes and observed the convergence speed. The result is shown in Figure 5: skew-SGLD and skew-SGHMC outperformed SGLD and SGHMC, which is consistent with our theory.

6.3. BNN Regression and Classification

We tested the BNN regression task on the UCI datasets [34], following the setting of Liu and Wang [27]. We used a one-hidden-layer neural network with ReLU activation and 100 hidden units, 10 samples for all the methods, a minibatch size of 100, and a step size $h = 5\times10^{-5}$. The results are shown in Table 1 and Table 2. We also tested the BNN classification task on the MNIST dataset; the result is shown in Figure 6. Here we again used a one-hidden-layer neural network with ReLU activation and 100 hidden units, a batch size of 500, and a step size $h = 5\times10^{-5}$. Our proposed methods outperformed the other ensemble methods; in particular, skew-SGHMC and skew-SGLD consistently outperformed SGHMC and SGLD.

7. Conclusions

We studied skew acceleration for LD and ULD from a practical viewpoint, showed that the improved eigenvalues of the perturbed Hessian matrix cause the acceleration, and derived an explicit condition for acceleration. We described a novel ensemble sampling method that couples multiple SGLD or SGHMC chains with memory-efficient skew matrices. We also proposed a practical algorithm that controls the trade-off between faster convergence and larger discretization and stochastic-gradient error, and numerically confirmed its effectiveness.

Author Contributions

Conceptualization, F.F. and T.I.; methodology, F.F. and T.I.; software, F.F.; validation, F.F., T.I., N.U. and I.S.; formal analysis, F.F. and I.S.; writing—original draft preparation, F.F.; project administration, F.F.; funding acquisition, F.F. All authors have read and agreed to the published version of the manuscript.

Funding

JST ACT-X: Grant Number JPMJAX190R.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml (accessed on 21 June 2021).

Acknowledgments

FF was supported by JST ACT-X Grant Number JPMJAX190R.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
LD: Langevin Dynamics
MCMC: Markov Chain Monte Carlo
ULD: Underdamped Langevin Dynamics
SGLD: Stochastic Gradient Langevin Dynamics
SGHMC: Stochastic Gradient Hamiltonian Monte Carlo
PLD: Parallel Langevin Dynamics
PULD: Parallel Underdamped Langevin Dynamics
S-LD: Skew Langevin Dynamics
S-ULD: Skew Underdamped Langevin Dynamics
S-PLD: Skew Parallel Langevin Dynamics
S-PULD: Skew Parallel Underdamped Langevin Dynamics
KSD: Kernelized Stein Discrepancy

Appendix A. Additional Backgrounds

We introduce additional background material used in our proofs.

Appendix A.1. Wasserstein Distance and Kullback–Leibler Divergence

In this paper, we use the Wasserstein distance, defined as follows. Let $(E, d)$ be a metric space (an appropriate space, such as a Polish space) with $\sigma$-field $\mathcal{A}$, where $d(\cdot,\cdot)$ is $\mathcal{A}\times\mathcal{A}$-measurable. Let $\mu, \nu$ be probability measures on $E$, and let $p \ge 1$. The Wasserstein distance of order $p$ with cost function $d$ between $\mu$ and $\nu$ is defined as
$$W_p^d(\mu,\nu) = \left(\inf_{\pi\in\Pi(\mu,\nu)} \int d(x,y)^p\, d\pi(x,y)\right)^{1/p},$$
where $\Pi(\mu,\nu)$ is the set of all joint probability measures (couplings) on $E\times E$ with marginals $\mu$ and $\nu$. In this paper, we work on the space $\mathbb{R}^d$ and use the Euclidean distance $\|\cdot\|$ as the cost. For simplicity, we write the $p$-Wasserstein distance with the Euclidean distance as $W_p$. The various properties of the Wasserstein distance are summarized in [35]. We define the Kullback–Leibler (KL) divergence as
$$\mathrm{KL}(\nu\|\mu) = \begin{cases}\displaystyle\int \log\frac{d\nu}{d\mu}\,d\nu, & \nu \ll \mu,\\[4pt] +\infty, & \text{otherwise}.\end{cases}$$

Appendix A.2. Markov Diffusion and Generator

Here we give an additional explanation of the generator of a Markov diffusion process. Given the SDE
$$dX_t = -\nabla U(X_t)\,dt + \sqrt{2\beta^{-1}}\,dw(t),$$
we denote the corresponding Markov semigroup by $P = \{P_t\}_{t>0}$, where the Kolmogorov operator $P_s$ is defined by $P_s f(X_t) = \mathbb{E}[f(X_{t+s}) \mid X_t]$ for bounded test functions $f:\mathbb{R}^d\to\mathbb{R}$ in $L^2(\mu)$. The property $P_{s+t} = P_s P_t$ is called the semigroup (Markov) property. A probability measure $\pi$ is the stationary distribution when, for every bounded measurable function $f$ and all $t$, $\int_{\mathbb{R}^d} P_t f\,d\pi = \int_{\mathbb{R}^d} f\,d\pi$.
We denote the infinitesimal generator of the associated Markov semigroup by $\mathcal{L}$ and call it the generator for simplicity. The linearity of the operators $P_t$ together with the semigroup property implies that $\mathcal{L}$ is the derivative of $P_t$:
$$\frac{1}{h}(P_{t+h} - P_t) = P_t\,\frac{1}{h}(P_h - \mathrm{Id}) = \frac{1}{h}(P_h - \mathrm{Id})\,P_t,$$
where $\mathrm{Id}$ is the identity map. Taking $h \to 0$, we have $\partial_t P_t = \mathcal{L}P_t = P_t\mathcal{L}$. By the Hille–Yosida theory [19], there exists a dense linear subspace of $L^2(\pi)$ on which $\mathcal{L}$ is defined; we refer to it as $D(\mathcal{L})$. If the Markov semigroup is associated with the SDE of Equation (A3), the generator can be written as
$$\mathcal{L}f(X_t) := \lim_{h\to 0^+}\frac{\mathbb{E}[f(X_{t+h})\mid X_t]-f(X_t)}{h} = -\nabla U(X_t)\cdot\nabla f(X_t) + \beta^{-1}\Delta f(X_t),$$
where $\Delta$ is the Laplacian in the standard Euclidean space. The generator satisfies $\mathcal{L}1 = 0$ and $\int_{\mathbb{R}^d}\mathcal{L}f\,d\pi = 0$.

Appendix A.3. Poincaré Inequality

We use the Poincaré inequality to measure the speed of convergence to the stationary distribution. In this section, we summarize definitions and useful properties; see [19] for more details. We define the Dirichlet form $\mathcal{E}(f)$, for all bounded functions $f \in D(\mathcal{L})$, where $D(\mathcal{L})$ denotes the domain of $\mathcal{L}$, as
$$\mathcal{E}(f) := -\int_{\mathbb{R}^d} f\,\mathcal{L}f\,d\pi.$$
$\mathcal{E}(f) \ge 0$ is always satisfied: by integration by parts, $\mathcal{E}(f) = -\int_{\mathbb{R}^d} f\,\mathcal{L}f\,d\pi = \frac{1}{\beta}\int_{\mathbb{R}^d}\|\nabla f\|^2\,d\pi$. We define the Dirichlet domain $D(\mathcal{E})$ as the set of functions $f \in L^2(\pi)$ satisfying $\mathcal{E}(f) < \infty$.
We say that $\pi$ with $\mathcal{L}$ satisfies a Poincaré inequality with a positive constant $c$ if, for any $f \in D(\mathcal{E})$,
$$\int f^2\,d\pi - \left(\int f\,d\pi\right)^2 \le c\,\mathcal{E}(f).$$
This constant $c$ is closely related to the spectral gap. If the smallest nonzero eigenvalue $\lambda$ of $-\mathcal{L}$ is greater than 0, it is called the spectral gap and can be written as
$$\lambda := \inf_{f\in D(\mathcal{E})}\left\{\frac{\mathcal{E}(f)}{\int f^2\,d\pi} : f \ne 0,\ \int f\,d\pi = 0\right\}.$$
Consequently, any constant $c \ge 1/\lambda$ satisfies the Poincaré inequality. To check the existence of the spectral gap, one approach is to use a Lyapunov function, as developed by Bakry et al. [36].
We can also express the Poincaré inequality via the $\chi^2$ divergence. For $\mu \ll \pi$, define
$$\chi^2(\mu\|\pi) := \left\|\frac{d\mu}{d\pi} - 1\right\|_{L^2(\pi)}^2 = \int_{\mathbb{R}^d}\left(\frac{d\mu}{d\pi} - 1\right)^2 d\pi.$$
Then the Poincaré inequality with a constant $c$ reads, for all $\mu \ll \pi$,
$$\chi^2(\mu\|\pi) \le c\,\mathcal{E}\!\left(\frac{d\mu}{d\pi}\right).$$
We obtain the following exponential convergence results from the above functional inequalities for measures.
Theorem A1. 
(Exponential convergence of the variance, Theorem 4.2.5 in [19]) If $\pi$ satisfies the Poincaré inequality with a constant $c$, then the variance converges exponentially with rate $2/c$; i.e., for every bounded function $f:\mathbb{R}^d\to\mathbb{R}$,
$$\mathrm{Var}_\pi(P_t f) \le e^{-2t/c}\,\mathrm{Var}_\pi(f),$$
where $\mathrm{Var}_\pi(f) := \int_{\mathbb{R}^d} f^2\,d\pi - \left(\int_{\mathbb{R}^d} f\,d\pi\right)^2$.
We also introduce an important property of the Poincaré inequality for product measures, which plays an important role in our analysis.
Theorem A2. 
(Stability under products, Proposition 4.3.1 in [19]) If $\mu_1$ and $\mu_2$ on $\mathbb{R}^d$ satisfy Poincaré inequalities with constants $c_1$ and $c_2$, respectively, then the product measure $\mu_1\otimes\mu_2$ on $\mathbb{R}^d\times\mathbb{R}^d$ satisfies the Poincaré inequality with the constant $\max(c_1, c_2)$.

Appendix B. Generator of the Underdamped Langevin Dynamics (ULD)

Following [10], we define the infinitesimal generator of ULD as
$$\mathcal{L}f(x,v) := -(\gamma v + \nabla U(x))\cdot\nabla_v f(x,v) + \gamma\beta^{-1}\Delta_v f(x,v) + v\cdot\nabla_x f(x,v).$$
Then, we define the generator of S-ULD as
$$\mathcal{L}'f(x,v) := -(\gamma v + \nabla U(x))\cdot\nabla_v f(x,v) + \gamma\beta^{-1}\Delta_v f(x,v) + v\cdot\nabla_x f(x,v) - \alpha_1 J_1\nabla U(x)\cdot\nabla_x f(x,v) - \alpha_2 J_2\Sigma^{-1}v\cdot\nabla_v f(x,v),$$
where the last two terms correspond to the interaction terms. It is then easy to confirm that $\int_{\mathbb{R}^{2d}} \mathcal{L}'f(x,v)\,d\tilde\pi = 0$, where $\tilde\pi := \pi\otimes\mathcal{N}(0,\Sigma) \propto e^{-\beta U(x) - \frac{1}{2}v^\top\Sigma^{-1}v}$. This can be shown simply by integration by parts and the skew-symmetry of the matrices. Thus, the stationary distribution of S-ULD is $\tilde\pi$.
We now consider other combinations of the skew matrices with ULD. For example, consider the following more general dynamics:
$$dX_t = \Sigma^{-1}V_t\,dt + \alpha_1 J_1\nabla U(X_t)\,dt + \alpha_2\Sigma^{-1}J_2 V_t\,dt,$$
$$dV_t = -\nabla U(X_t)\,dt - \gamma\Sigma^{-1}V_t\,dt + \alpha_3 J_3 V_t\,dt + \alpha_4 J_4\nabla U(X_t)\,dt + \sqrt{2\gamma\beta^{-1}}\,dw_t;$$
compared to S-ULD, two new terms are included. We can also derive the infinitesimal generator of this Markov process, which we denote by $\tilde{\mathcal{L}}$. Calculating the infinitesimal change of the expectation of $f$, we find
$$\int_{\mathbb{R}^{2d}}\tilde{\mathcal{L}}f(x,v)\,d\tilde\pi \ne 0,$$
which suggests that the stationary distribution of Equation (A14) is different from $\tilde\pi$.
It is widely known that underdamped Langevin dynamics converges to (overdamped) Langevin dynamics in the small-mass limit. Here we observe that S-ULD converges to the skew-LD of [18]. The limiting procedure is standard; see, for example, [17,37,38]. We cite Proposition 1 in [17]: given the stochastic process
$$dX_t = \Sigma^{-1}V_t\,dt - \alpha_1 J_1\nabla U(X_t)\,dt,\qquad dV_t = -\nabla U(X_t)\,dt - \gamma\Sigma^{-1}V_t\,dt - \alpha_2\Sigma^{-1}J_2 V_t\,dt + \sqrt{2\gamma}\,dw_t,$$
we rescale it by introducing $\epsilon$, which expresses the small-mass limit, as
$$dX_t = \frac{1}{\epsilon}\Sigma^{-1}V_t\,dt - \alpha_1 J_1\nabla U(X_t)\,dt,\qquad dV_t = -\frac{1}{\epsilon}\nabla U(X_t)\,dt - \frac{1}{\epsilon^2}\gamma\Sigma^{-1}V_t\,dt - \frac{1}{\epsilon^2}\alpha_2\Sigma^{-1}J_2 V_t\,dt + \frac{1}{\epsilon}\sqrt{2\gamma}\,dw_t,$$
and by taking the limit $\epsilon\to 0$, the dynamics converge to
$$dX_t = -(\alpha_2 J_2 + \gamma)^{-1}\nabla U(X_t)\,dt - \alpha_1 J_1\nabla U(X_t)\,dt + (\alpha_2 J_2+\gamma)^{-1}\sqrt{2\gamma}\,dw_t.$$
See Proposition 1 in [17] for the precise statement. Please note that the term related to $J_2$ acts as preconditioning. Thus, if we set $\alpha_2 J_2 = 0$, the obtained dynamics are equivalent to the continuous dynamics of skew-SGLD; hence, our skew-SGHMC is a natural extension of skew-SGLD.

Appendix C. Proof of Theorem 1

Appendix C.1. Proof for S-LD

First, under Assumptions 1–5, LD has a spectral gap, and its Poincaré constant is bounded as
$$\frac{1}{m_0} \le \frac{2C(d+b\beta)}{m\beta}\exp\left(\frac{2}{m}\bigl((M+B)(b\beta+d)+\beta(A+B)\bigr)\right) + \frac{1}{m\beta}(d+b\beta),$$
which is derived in [2].
Next, we introduce the generator of S-LD,
$$\mathcal{L}_\alpha f(x) = -\nabla U_\alpha(x)\cdot\nabla f(x) + \beta^{-1}\Delta f(x),$$
where $\nabla U_\alpha(x) := \nabla U(x) + \alpha J\nabla U(x)$.
The proof closely follows that of Theorem 12 in [18].
Proof of Theorem 1. 
Since the generator $\mathcal{L}_{\alpha=0}$ is self-adjoint and satisfies a suitable growth condition, its spectrum is discrete [19]. We denote the spectrum of $\mathcal{L}_{\alpha=0}$ by $\{\lambda_k\}_{k=0}^\infty \subset \mathbb{R}$ and the corresponding normalized eigenvectors, which are real functions, by $\{e_k\}_{k=0}^\infty$. We order the nonzero spectrum as $0 > \lambda_0 > \lambda_1 > \cdots$; thus, $m_0 = -\lambda_0$.
As for $\mathcal{L}_\alpha$, although it is not a self-adjoint operator, Proposition 1 in Franke et al. [39] shows that it has a discrete complex spectrum. Denote an eigenvalue of $\mathcal{L}_\alpha$ by $\lambda + i\mu \in \mathbb{C}$ with $\lambda,\mu\in\mathbb{R}$, and the corresponding normalized eigenvector by $u + iv$, where $u, v$ are real functions; then
$$\mathcal{L}_\alpha(u+iv) = (\lambda+i\mu)(u+iv).$$
Separating the real and imaginary parts, the following relations are derived:
$$\mathcal{L}_\alpha u = \lambda u - \mu v,\qquad \mathcal{L}_\alpha v = \lambda v + \mu u.$$
Due to the divergence-free drift property, for any bounded real-valued test function $g(x)$,
$$\int g(\mathcal{L}_{\alpha=0}-\mathcal{L}_\alpha)g\,d\pi = \alpha\int g\,\gamma\cdot\nabla g\,d\pi = -\alpha\int g\,\gamma\cdot\nabla g\,d\pi = 0,$$
where $\gamma$ denotes the divergence-free part of the drift and we used integration by parts. This means that for any bounded real function $g(x)$,
$$\int g\,\mathcal{L}_{\alpha=0}\,g\,d\pi = \int g\,\mathcal{L}_\alpha\,g\,d\pi.$$
(This only holds for real functions.) Then, we can evaluate the real part $\lambda$ of the eigenvalue as follows:
$$\int u\,\mathcal{L}_{\alpha=0}u\,d\pi + \int v\,\mathcal{L}_{\alpha=0}v\,d\pi = \int u\,\mathcal{L}_\alpha u\,d\pi + \int v\,\mathcal{L}_\alpha v\,d\pi = \lambda\left(\int u^2\,d\pi + \int v^2\,d\pi\right) = \lambda.$$
Then, expanding the eigenfunctions $u, v$ in the eigenfunctions $\{e_k\}$,
$$\lambda = \int u\,\mathcal{L}_{\alpha=0}u\,d\pi + \int v\,\mathcal{L}_{\alpha=0}v\,d\pi = \sum_k \lambda_k\left[\left(\int u e_k\,d\pi\right)^2 + \left(\int v e_k\,d\pi\right)^2\right] \le \lambda_0\sum_k\left[\left(\int u e_k\,d\pi\right)^2 + \left(\int v e_k\,d\pi\right)^2\right] \le \lambda_0.$$
Thus, the real part of any nonzero eigenvalue of $\mathcal{L}_\alpha$ is at most $\lambda_0$, the largest nonzero eigenvalue of $\mathcal{L}_{\alpha=0}$. This means that the spectral gap of $\mathcal{L}_\alpha$ is at least as large as that of $\mathcal{L}_{\alpha=0}$, i.e., $m(\alpha) \ge m_0$. □

Appendix C.2. Proof of Theorem 2 (S-ULD)

Proof of Theorem 2. 
To analyze S-ULD, we use the result of [20], which characterizes the convergence of ULD via the Poincaré constant. Let $\tilde\mu_t$ denote the measure induced by ULD at time $t$. Then, from Theorem 1 of [20], if $\pi$ with $\mathcal{L}$ has Poincaré constant $m_0$, we have
$$\chi^2(\tilde\mu_t\|\tilde\pi) \le \frac{1+\bar\epsilon}{1-\bar\epsilon}\,e^{-\lambda_\gamma t}\,\chi^2(\tilde\mu_0\|\tilde\pi),$$
where $\bar\epsilon$ and $\lambda_\gamma$ are given as follows:
$$\lambda_\gamma = \frac{\Lambda\bigl(\gamma,\ \bar\epsilon\min(\gamma,\gamma^{-1})\bigr)}{1 + \bar\epsilon\min(\gamma,\gamma^{-1})},$$
where
Λ ( γ , ϵ ) = γ Σ 1 1 1 + m 0 Σ 1 β 2 1 2 ( S S + + ) 2 + ( S + ) 2 ,
S = ϵ λ h a m ,
S + = ϵ ( R h a m + γ Σ 1 / 2 ) ,
S + + = γ Σ 1 ϵ ,
λ h a m = 1 1 + m 0 Σ 1 β 1 ,
ϵ = ϵ ¯ min ( γ , γ 1 ) ,
where $\bar\epsilon$ is an arbitrary, sufficiently small positive value such that $\Lambda(\gamma,\bar\epsilon\min(\gamma,\gamma^{-1})) > 0$ is satisfied. As for $R_{ham}$, if there exists a positive constant $K$ such that $\nabla^2 U \preceq KI$, then $R_{ham} \le \max\{K, 2\}$. Under our assumptions, this corresponds to $K = \beta M$; thus $R_{ham} \le \max\{\beta M, 2\}$. From the above definitions, we can see that the larger the Poincaré constant $m_0$ is, the faster ULD converges.
This can also be confirmed numerically: Figure A1 shows how $\Lambda$ changes under different $m_0$ (we set $\Sigma^{-1} = 100$). The larger the Poincaré constant is, the larger $\Lambda$ becomes.
Figure A1. The convergence rate of ULD under the different Poincaré constants.
So far, we have confirmed that the convergence speed of ULD is characterized by the Poincaré constant of $\mathcal{L}$. For S-ULD, we simply add the skew-matrix terms to the generator of ULD in the proof of Proposition 1 in [20]; this replaces the Poincaré constant $m_0$ with $m(\alpha)$, which indicates faster convergence. □

Appendix D. Eigenvalue and Poincaré Constant

In this section, we discuss the relation between eigenvalues of the Hessian matrix and Poincaré constant.

Appendix D.1. Strongly Convex Potential Function

When we consider LD with an $m$-strongly convex potential function, the Poincaré constant is $m$, which implies exponential convergence with rate $m$ (see [19] for details).
We then consider S-LD with an $m$-strongly convex potential. In this setting, using the synchronous coupling technique [11], we can show that the variance decays exponentially with a rate given by the smallest real part of the eigenvalues. To see this, prepare two copies of S-LD, $(X_t, Y_t)$, driven by the same Brownian motion:
$$dX_t = -(I+\alpha J)\nabla U(X_t)\,dt + \sqrt{2\beta^{-1}}\,dw_t,\qquad dY_t = -(I+\alpha J)\nabla U(Y_t)\,dt + \sqrt{2\beta^{-1}}\,dw_t.$$
We then evaluate the behavior of $\|X_t - Y_t\|^2$. From Itô's lemma and the synchronous coupling, we obtain
$$\frac{d}{dt}\|X_t-Y_t\|^2 = -2(X_t-Y_t)\cdot(I+\alpha J)\beta\bigl(\nabla U(X_t)-\nabla U(Y_t)\bigr) \le -2m(\alpha)\beta\|X_t-Y_t\|^2,$$
where $m(\alpha)$ is a constant satisfying $m(\alpha) \le \mathrm{Re}\,\lambda_1^\alpha(x)$ for all $x$; see Appendix E for details. This means that the distance decays exponentially with rate $2m(\alpha)\beta$. By the fundamental property of the Poincaré constant (Theorem 4.2.5 in [19]), $m(\alpha)$ is the Poincaré constant, and the imaginary parts have no effect on the continuous dynamics. Thus, the Poincaré constant is given by the smallest real part of the eigenvalues of the perturbed Hessian matrix.

Appendix D.2. Non-Convex Potential Function

As we discussed in Section 3.1, [21] derived a sharper estimate of the Poincaré constant for non-convex potential functions. It is easy to verify that their assumptions are satisfied under our Assumptions 1–5. Following the main paper, we denote by $x_1$ the global minimum and by $x_2$ the local minimum attaining the second smallest value of $U(x)$. We express the saddle point between $x_1$ and $x_2$ as $x^*$. More precisely, the saddle point that characterizes the Poincaré constant is the critical point of index one, defined by
$$U(x^*) = \inf\left\{\max_{s\in[0,1]} U(\gamma(s)) : \gamma\in C([0,1],\mathbb{R}^d),\ \gamma(0)=x_1,\ \gamma(1)=x_2\right\},$$
and the Hessian $\nabla^2 U(x^*)$ has one negative eigenvalue and $d-1$ positive eigenvalues, which we write as $\lambda_1(x^*) < 0 < \lambda_2(x^*) \le \cdots \le \lambda_d(x^*)$.
Ref. [21] studied the Poincaré constant by decomposing the non-convex potential with a focus on attractors, showing that it can be decomposed into a sum of approximately Gaussian distributions. They proved that the Poincaré constant is characterized by local Poincaré constants derived from the approximate Gaussian distributions on the attractors and their surrounding regions, and that its dominant term is determined by the saddle point between the global minimum and the point attaining the second smallest value of $U(x)$. From Theorem 2.12 and Corollary 2.15 in [21], the Poincaré constant is characterized by
$$m_0^{-1} \lesssim \frac{\sqrt{\det H(x^*)}}{Z\,|\lambda_1(x^*)|\sqrt{\det H(x_1)\det H(x_2)}}\,e^{\beta(U(x^*)-U(x_1)-U(x_2))} \propto \frac{1}{|\lambda_1(x^*)|}\,e^{\beta(U(x^*)-U(x_1)-U(x_2))},$$
where $Z$ is the normalizing constant of $e^{-\beta U(x)}$.
Next, we discuss how this estimate changes when skew matrices are applied. When the skew matrices are introduced, from Lemma A.1 in [40], at the saddle point there still exists a unique negative real eigenvalue $\lambda_1^\alpha(x^*) < 0$ of the perturbed Hessian matrix, even though $(I+\alpha J)H$ is not a symmetric matrix. Then, from Proposition 5 in [8], this negative eigenvalue of the perturbed Hessian is smaller than that of the unperturbed Hessian at the saddle point; that is, $\lambda_1^\alpha(x^*) \le \lambda_1(x^*) < 0$ holds. Finally, from Theorem 5.1 in [41] and Theorem 2.12 in [21], this improvement of the negative eigenvalue at the saddle point directly leads to a larger Poincaré constant.

Appendix E. Properties of a Skew-Symmetric Matrix

Here, we introduce the basic properties of skew-symmetric matrices. Assume that the $d\times d$ matrix $H' := (I + \alpha J)H$ is diagonalizable, and that $H'$ has $l$ real eigenvalues $\lambda_1,\ldots,\lambda_l$ and $m$ conjugate pairs of complex eigenvalues $\mu_1 = \alpha_1 \pm i\beta_1, \ldots, \mu_m = \alpha_m \pm i\beta_m$; thus $d = l + 2m$. We denote the corresponding eigenvectors by $\{v_j\}_{j=1}^l$ for the real eigenvalues and $\{w_j = a_j + i b_j\}_{j=1}^m$ for the complex eigenvalues $\{\mu_j\}_{j=1}^m$, with $\{\bar w_j\}$ for the conjugate eigenvalues. Then, let us define a $d\times d$ matrix $V$ as
V = [ v 1 , , v l , a 1 , b 1 , , a m , b m ] .
Then, we can decompose $H'$ into a block-diagonal form [42], $H'V = VD$, where
$$D = \underbrace{\mathrm{diag}\bigl(\lambda_1,\ldots,\lambda_l,\ \alpha_1,\alpha_1,\ldots,\alpha_m,\alpha_m\bigr)}_{=:A} + \underbrace{\mathrm{diag}\left(0,\ldots,0,\ \begin{pmatrix}0&\beta_1\\-\beta_1&0\end{pmatrix},\ldots,\begin{pmatrix}0&\beta_m\\-\beta_m&0\end{pmatrix}\right)}_{=:B}.$$
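The decomposition $H'V = VD$ can be verified numerically. The sketch below is our own illustrative code: it builds $V$ from the real eigenvectors and from the real and imaginary parts of the complex eigenvector pairs of a randomly perturbed Hessian, and checks the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
Hs = np.diag(np.linspace(1.0, 2.0, d))     # symmetric Hessian H
A0 = rng.standard_normal((d, d))
J = A0 - A0.T                              # skew-symmetric
Hp = (np.eye(d) + 0.4 * J) @ Hs            # H' = (I + alpha J) H, alpha = 0.4
vals, vecs = np.linalg.eig(Hp)

cols, blocks, used = [], [], np.zeros(d, dtype=bool)
for j in range(d):
    if used[j]:
        continue
    lam, w = vals[j], vecs[:, j]
    used[j] = True
    if abs(lam.imag) < 1e-9:               # real eigenvalue: 1 x 1 block
        # rotate away the arbitrary complex phase so w is real
        w = w * np.exp(-1j * np.angle(w[np.argmax(np.abs(w))]))
        cols.append(w.real)
        blocks.append(np.array([[lam.real]]))
    else:                                  # complex pair alpha_j +/- i beta_j
        k = next(i for i in range(d)
                 if not used[i] and abs(vals[i] - lam.conjugate()) < 1e-7)
        used[k] = True
        cols.extend([w.real, w.imag])
        a, b = lam.real, lam.imag
        blocks.append(np.array([[a, b], [-b, a]]))   # combined A + B block

V = np.column_stack(cols)
D = np.zeros((d, d))
i = 0
for blk in blocks:
    n = blk.shape[0]
    D[i:i + n, i:i + n] = blk
    i += n
```

The check relies on the identities $H'a_j = \alpha_j a_j - \beta_j b_j$ and $H'b_j = \beta_j a_j + \alpha_j b_j$, obtained by separating the real and imaginary parts of $H'w_j = \mu_j w_j$.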
Thus, D : = A + B . Then, from the Taylor expansion and expressing its residual by integral, by defining H ( x ) : = 2 U ( x ) we have
( x y ) ( I + α J ) U ( x ) U ( y ) = ( x y ) 0 1 ( I + α J ) H ( y + τ ( x y ) ) ( x y ) d τ .
Then, let us apply the Jordan canonical form here. If ( I + α J ) H is diagonalizable, and it is decomposable by the Jordan canonical form shown in Equation (A40). Then, we can decompose ( I + α J ) H as
( I + α J ) H ( x + τ ( x ( t ) x ) ) = V D V 1 .
Then, we obtain
( x y ) ( I + α J ) U ( x ) U ( y ) = ( x y ) 0 1 ( I + α J ) H ( y + τ ( x y ) ) ( x y ) d τ = 0 1 ( x y ) V ( A + B ) V 1 x y d t = 0 1 ( x y ) V A V 1 x ( t ) x d t m ( α ) x ( t ) x 2 .
where m(α) is a constant that satisfies m(α) ≤ min{λ_1, …, λ_l, α_1, …, α_m} for all x. Thus, the imaginary parts never appear in the bound, and we only need to focus on the real parts of the eigenvalues when the matrix is diagonalizable. The next section describes when the non-symmetric matrix H′ is diagonalizable by focusing on random matrices.

Appendix F. Proof of Theorem 3

Proof. 
Since the potential function is m-strongly convex, the smallest eigenvalue of the Hessian matrix H is at least m > 0. Thus, H and H^{1/2} are regular matrices. With this in mind, we consider H + H^{1/2} J H^{1/2} as a matrix similar to H′ := (I + J)H. This is easily confirmed by
H^{−1/2} (H + H^{1/2} J H^{1/2}) H^{1/2} = H′.
This means that to study the eigenvalues of H′, we only need to study the similar matrix A := H + H^{1/2} J H^{1/2}. The matrix A is composed of a symmetric and a skew-symmetric part, which is easier to treat than H′, where the term JH is difficult to analyze. For simplicity, we omit the dependency of H and H′ on x in this section.
Remark A1. 
Please note that we can eliminate the strong convexity of U if H is a regular matrix, i.e., if H does not have 0 as an eigenvalue.
For simplicity, we assume that the dimension d is an even number. We assume that the eigenvalues and eigenvectors of A are expressed as
A w_j = μ_j w_j, that is, A(a_j + i b_j) = (α_j + i β_j)(a_j + i b_j),
and the α_j are ordered as α_1 ≤ α_2 ≤ ⋯. In this section, we only consider the setting where all the eigenvalues and eigenvectors are complex, for notational simplicity. The extension to the general setting of Appendix E and to odd d is straightforward.
We denote the eigenvalues and eigenvectors of H by {λ_j, v_j}_{j=1}^d, where the v_j are linearly independent. In addition, we assume that λ_1 ≤ ⋯ ≤ λ_d. From this definition, by comparing the real and imaginary parts, the following relations are derived:
A a_j = α_j a_j − β_j b_j,
A b_j = α_j b_j + β_j a_j.
Thus, by the skew-symmetric property and the normalization ‖a_j‖² + ‖b_j‖² = 1,
a_j^⊤ A a_j + b_j^⊤ A b_j = α_j (‖a_j‖² + ‖b_j‖²) = α_j
= a_j^⊤ H a_j + b_j^⊤ H b_j,
and in the last equality, we used the property
a_j^⊤ H^{1/2} J H^{1/2} a_j = b_j^⊤ H^{1/2} J H^{1/2} b_j = 0,
since H^{1/2} J H^{1/2} is a skew-symmetric matrix. Then, we expand a_j and b_j in terms of the v_j as
a_k = Σ_{j=1}^d (a_k^⊤ v_j) v_j,
b_k = Σ_{j=1}^d (b_k^⊤ v_j) v_j,
since the v_j are eigenvectors of H and can be used as a basis of R^d. Then we substitute this into Equation (A50) and obtain
α_k = Σ_{j=1}^d λ_j (a_k^⊤ v_j)² + Σ_{j=1}^d λ_j (b_k^⊤ v_j)² ≥ λ_1 Σ_{j=1}^d ((a_k^⊤ v_j)² + (b_k^⊤ v_j)²) = λ_1.
This means that the real part of any eigenvalue of A is at least λ_1, the smallest eigenvalue of H. Thus, α_1, the smallest real part among the eigenvalues of A, is at least the smallest eigenvalue of H. This proves the lower bound.
In the same way,
α_k = Σ_{j=1}^d λ_j (a_k^⊤ v_j)² + Σ_{j=1}^d λ_j (b_k^⊤ v_j)² ≤ λ_d Σ_{j=1}^d ((a_k^⊤ v_j)² + (b_k^⊤ v_j)²) = λ_d,
which means that the real part of any eigenvalue of A is at most the largest eigenvalue of H. Thus, α_d, the largest real part among the eigenvalues of A, is at most the largest eigenvalue of H.
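This sandwich bound is easy to check numerically. The following sketch (an illustrative check with numpy, under the strongly convex setting of the theorem) draws a random positive definite H and a random skew-symmetric J, and verifies that every real part of an eigenvalue of (I + αJ)H stays within [λ_1, λ_d]:

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 6, 0.7

# m-strongly convex case: H symmetric positive definite.
B = rng.standard_normal((d, d))
H = B @ B.T + 0.5 * np.eye(d)
lam = np.linalg.eigvalsh(H)              # ascending eigenvalues of H

A = rng.standard_normal((d, d))
J = A - A.T                              # skew-symmetric
real_parts = np.linalg.eigvals((np.eye(d) + alpha * J) @ H).real

# lambda_1 <= Re(mu_j) <= lambda_d for every eigenvalue mu_j of (I + alpha J) H.
print(lam[0] - 1e-9 <= real_parts.min())
print(real_parts.max() <= lam[-1] + 1e-9)
```

The bound holds for any α; only the placement of the real parts inside the interval changes with the strength of the skew perturbation.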
Equality condition:
Next, we discuss when the equality α_1 = λ_1 holds. First, we assume that the eigenvalues of H are distinct, so there is only one eigenvector for λ_1; we discuss non-distinct eigenvalues later. From Equation (A54), we have
α_1 = Σ_{j=1}^d λ_j (a_1^⊤ v_j)² + Σ_{j=1}^d λ_j (b_1^⊤ v_j)² ≥ λ_1 Σ_{j=1}^d ((a_1^⊤ v_j)² + (b_1^⊤ v_j)²) = λ_1
in general. Please note that if a_1 and b_1 do not correspond to v_1, then some λ_j > λ_1 must appear in the summation, and the equality never holds. So, the condition
a_1, b_1 ∈ span{v_1}
must hold for the equality.
Based on this, let us assume that w_1 = c a_1 + i c′ b_1, where c² + c′² = 1, and consider the case a_1 = b_1 = v_1. Then we need to solve the simultaneous equations
A(c v_1 + i c′ v_1) = (λ_1 + i β_1)(c v_1 + i c′ v_1) = (λ_1 c − c′ β_1) v_1 + i (c β_1 + λ_1 c′) v_1,
which is obtained from the definition of the eigenvalues of A, and
A(c v_1 + i c′ v_1) = λ_1^{1/2} c (λ_1^{1/2} I + α H^{1/2} J) v_1 + i λ_1^{1/2} c′ (λ_1^{1/2} I + α H^{1/2} J) v_1,
which is obtained from the definition of the eigenvalues of H. Then, multiplying by v_1^⊤ from the left, we obtain c′ β_1 = 0 and c β_1 = 0; thus β_1 = 0. Moreover, β_1 = 0 implies b_1 = 0 from the property of complex eigenvectors. Thus, we obtain w_1 = a_1 = v_1 when λ_1 = α_1. Then, the following relation holds:
λ_1 v_1 = A v_1 = H v_1 + α H^{1/2} J H^{1/2} v_1 = λ_1 v_1 + α λ_1^{1/2} H^{1/2} J v_1.
Since λ_1 ≠ 0 and H^{1/2} is invertible, this condition indicates that
α J v_1 = 0.
This is the condition under which λ_1 = α_1 holds. The same relation can be derived for λ_d = α_d.
Next, we assume that the eigenvalues of H are not distinct. Let us denote the eigenspace of the eigenvalue λ_1 by V_1^0. Please note that if a_1 and b_1 are not included in V_1^0, then some λ_j > λ_1 must appear in the summation, and the equality never holds. Thus,
a_1, b_1 ∈ V_1^0
must hold for equality. Based on this, let us assume that w_1 = c a_1 + i c′ b_1, where c² + c′² = 1, and consider the case a_1 ≠ b_1. Then,
H^{−1/2} A (c a_1 + i c′ b_1) = λ_1^{−1/2} (λ_1 + i β_1)(c a_1 + i c′ b_1),
H^{−1/2} (H + α H^{1/2} J H^{1/2})(c a_1 + i c′ b_1) = λ_1^{1/2} c (I + α J) a_1 + i λ_1^{1/2} c′ (I + α J) b_1,
and by comparing the real and imaginary parts, we obtain the conditions
λ_1 c α J a_1 = −β_1 c′ b_1,
λ_1 c′ α J b_1 = β_1 c a_1.

Appendix G. Proofs of Random Matrices

Appendix G.1. Proof of Theorem 5

Proof. 
The proof is a straightforward consequence of a lemma in [43]:
Lemma ([43]). If f(x_1, …, x_m) is a polynomial in real variables x_1, …, x_m which is not identically zero, then the subset N_m = {(x_1, …, x_m) | f(x_1, …, x_m) = 0} of the Euclidean m-space R^m has Lebesgue measure zero.
We use this lemma to prove that the probability of λ_1 = α_1 is 0 by showing that the set of skew matrices for which λ_1 = α_1 holds has Lebesgue measure zero.
We use the same notation as in Appendix F. Recall Equation (A64), the equality condition for λ_1 = α_1. We express the elements of a_1 and b_1 as a_1 = (a_1^1, …, a_1^d) and b_1 = (b_1^1, …, b_1^d). Then the equality condition can be written as
Σ_{i=1}^d (Σ_{j=1}^d λ_1 c α J_{i,j} a_1^j + β_1 c′ b_1^i)² + Σ_{i=1}^d (Σ_{j=1}^d λ_1 c′ α J_{i,j} b_1^j − β_1 c a_1^i)² = 0.
Then we define the polynomial in {J_{i,j}}:
f(J_{1,2}, …, J_{d−1,d}) = Σ_{i=1}^d (Σ_{j=1}^d λ_1 c α J_{i,j} a_1^j + β_1 c′ b_1^i)² + Σ_{i=1}^d (Σ_{j=1}^d λ_1 c′ α J_{i,j} b_1^j − β_1 c a_1^i)².
To apply the lemma of [43], we must confirm that f(J_{1,2}, …, J_{d−1,d}) is not identically 0. This is clear from the definition of f, since we generate J_{1,2}, …, J_{d−1,d} randomly from a distribution that is absolutely continuous with respect to the Lebesgue measure, λ_1 ≠ 0, c² + c′² = 1, and at least one of a_1, b_1 is nonzero.
Then, given an evaluation point x, from the lemma of [43], the subset of {J_{i,j}} ∈ R^{d(d−1)/2} that satisfies f(J_{1,2}, …, J_{d−1,d}) = 0 has Lebesgue measure zero. Thus, if we generate {J_{i,j}} from a probability measure that is absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), f(J_{1,2}, …, J_{d−1,d}) = 0 holds with probability 0. This concludes the proof. □

Appendix G.2. Proof of Lemma 1

Proof. 
We first discuss the condition Ker J_0 = {0}. Since J = J_0 ⊗ I_d, we denote the set of eigenvalues of J_0 as {ω_i}. In general, the eigenvalues of a matrix composed of the Kronecker product of two matrices, e.g., A and B, are given as the products of the eigenvalues of A and B [44]. Thus, since J is the Kronecker product of J_0 and I_d, if J_0 does not have 0 as an eigenvalue, then J does not have 0 as an eigenvalue.
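The Kronecker-product facts used here can be checked directly; the sketch below (illustrative, with numpy) confirms that the eigenvalues of J = J_0 ⊗ I_d are the eigenvalues of J_0 repeated d times, so that Ker J_0 = {0} implies J has no zero eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 4, 3

A = rng.standard_normal((N, N))
J0 = A - A.T                         # skew-symmetric interaction between N chains
J = np.kron(J0, np.eye(d))           # J = J0 (x) I_d, a dN x dN matrix

w0 = np.linalg.eigvals(J0)           # purely imaginary pairs +/- i beta
w = np.linalg.eigvals(J)

# Each eigenvalue of J0 appears d times in J (I_d only contributes eigenvalue 1).
print(np.allclose(np.sort(np.repeat(w0, d).imag), np.sort(w.imag)))

# For even N, a generic skew-symmetric J0 has Ker J0 = {0}; then J is invertible.
print(np.min(np.abs(w)) > 1e-8)
```

The eigenvalues are compared through their imaginary parts because a real skew-symmetric matrix has purely imaginary spectrum.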
Next, we discuss the other equality condition. We use similar notation as in Appendix F, but now the dimension of the matrix J is dN. We express the eigenvalue with the smallest real part as λ_1^α and its eigenvector as w_1^α = a_1 + i b_1, with elements a_1 = (a_1^1, …, a_1^{dN}) ∈ R^{dN} and b_1 = (b_1^1, …, b_1^{dN}). We also express these chain-wise as a_1 = (a_1^{(1)}, …, a_1^{(N)}) ∈ R^{dN}, where a_1^{(i)} = (a_1^{(i−1)d+1}, …, a_1^{id}) ∈ R^d.
We use the Kronecker product property:
J a_1 = (J_0 ⊗ I_d) a_1 = (Σ_{i=1}^N J_0|_{i,1} a_1^{(i)}, …, Σ_{i=1}^N J_0|_{i,N} a_1^{(i)}),
where J_0|_{i,j} indicates the element in the i-th row and j-th column of J_0, and in the second equality we used the property of the Kronecker product and the vec operator [44].
The proof is almost the same as in Appendix G.1. The equality condition can be written as
Σ_{n=1}^N ‖λ_1 c α Σ_i J_0|_{i,n} a_1^{(i)} + β_1 c′ b_1^{(n)}‖² + Σ_{n=1}^N ‖λ_1 c′ α Σ_i J_0|_{i,n} b_1^{(i)} − β_1 c a_1^{(n)}‖² = 0,
where ‖·‖ is the d-dimensional Euclidean norm, since a_1^{(n)}, b_1^{(n)} ∈ R^d. Then we define the polynomial in {J_0|_{i,j}}:
f(J_{1,2}, …, J_{N−1,N}) = Σ_{n=1}^N ‖λ_1 c α Σ_i J_0|_{i,n} a_1^{(i)} + β_1 c′ b_1^{(n)}‖² + Σ_{n=1}^N ‖λ_1 c′ α Σ_i J_0|_{i,n} b_1^{(i)} − β_1 c a_1^{(n)}‖².
By a similar discussion to Appendix G.1, it is clear that f is not identically 0. Thus, given an evaluation point x, from the lemma of [43], the subset of {J_{i,j}} ∈ R^{N(N−1)/2} that satisfies f(J_{1,2}, …, J_{N−1,N}) = 0 has Lebesgue measure zero. Thus, if we generate {J_{i,j}} from a probability measure that is absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), f(J_{1,2}, …, J_{N−1,N}) = 0 holds with probability 0. This concludes the proof. □

Appendix G.3. Extending the Theorem to the Path

Theorem 5 and Lemma 1 hold when we fix an evaluation point x. To ensure the acceleration, we need to extend them from a single evaluation point to the path of the stochastic process for S-LD, S-PLD, S-ULD, and S-PULD.
First, the condition Ker J_0 = {0} is not related to the evaluation point. Thus, we only need to consider the equality condition for Re λ_1^α = λ_1. As for this condition, as seen in Theorem 5 and Lemma 1, if we generate the random matrix J from a distribution that is absolutely continuous with respect to the Lebesgue measure, then the equality condition fails with probability 1 at a given evaluation point. The important point in those proofs is that the event where the equality holds has Lebesgue measure 0 at the given evaluation point, by the lemma of [43].
Let us consider the case where two evaluation points are given (e.g., x_1, x_2), and check whether the random matrix J satisfies the above equality condition. At each evaluation point, such an event (denoted S_1 and S_2) has Lebesgue measure 0 by the lemma of [43]; we write P(S_1) = 0 and P(S_2) = 0, where P is the law induced by generating the random matrix with d(d−1)/2 independent elements. Hence, the union of the events also has probability 0, i.e., P(S_1 ∪ S_2) = 0. By repeating this procedure, for a finite number of evaluation points (x_1, …, x_k), the probability of the union is 0, i.e., P(S_1 ∪ S_2 ∪ ⋯ ∪ S_k) = 0.
When we consider the discretized dynamics of S-LD, S-PLD, and so on, and update samples up to k iterations, there exist k evaluation points. By the above discussion, we can ensure that along the path of the discretized dynamics, the equality condition fails with probability 1. On the other hand, for the continuous dynamics, the number of evaluation points is infinite; thus, we cannot conclude that the equality condition fails with probability 1.

Appendix H. Proof of Theorem 6

We use the same notation as in Appendix F. We consider the expansion with respect to α in the following setting:
w j : = v j + δ v j
μ j : = λ j + δ λ j ,
which indicates that, by introducing the skew-acceleration term, the pairs of eigenvalues and eigenvectors of H′ are expressed as small perturbations of the eigenvalues and eigenvectors of H. Since {v_j}_{j=1}^d are the eigenvectors of H and can be used as an orthogonal basis, we expand δv_j in this basis and obtain
δv_j = Σ_{k≠j} c_{jk} v_k,
where c_{jk} = δv_j^⊤ v_k.

Appendix H.1. Asymptotic Expansion When the Smallest Eigenvalue of H(x) Is Positive

We work on the matrix similar to H′, that is, H + αV, where V := H^{1/2} J H^{1/2}; see Appendix F for the details. Please note that this similar matrix only exists when the smallest eigenvalue of H(x) is positive. Thus, the following discussion cannot be applied at a saddle point, where negative eigenvalues appear; we discuss the expansion at the saddle point later.
From the definition, we have
H′ w_j = H w_j + α V w_j = μ_j w_j = (λ_j + δλ_j)(v_j + δv_j).
We rearrange this equation as
H v_j + H δv_j + α V v_j + α V δv_j = λ_j v_j + δλ_j v_j + λ_j δv_j + δλ_j δv_j.
First, we focus on the first-order expansion. This means we neglect high-order terms. Then, we have
H v_j + H δv_j + α V v_j = λ_j v_j + δλ_j v_j + λ_j δv_j.
By multiplying Equation (A76) by v_j^⊤ from the left-hand side, we have
λ_j + λ_j v_j^⊤ δv_j + α v_j^⊤ V v_j = λ_j + δλ_j + λ_j v_j^⊤ δv_j,
and v_j^⊤ V v_j = 0 by the skew-symmetric property of V. Thus, we have
δ λ j = 0 ,
up to the first-order expansion. Then we substitute this into Equation (A76), and multiplying by v_i^⊤ with i ≠ j, we have
λ_i c_{ji} + α v_i^⊤ V v_j = λ_j c_{ji}.
Then we have
c_{ji} = α v_i^⊤ V v_j / (λ_j − λ_i).
Then we obtain
δv_j = α Σ_{i≠j} (v_i^⊤ V v_j)/(λ_j − λ_i) v_i.
We substitute this into Equation (A75), and multiplying by v_j^⊤, we have
v_j^⊤ H (α Σ_{i≠j} (v_i^⊤ V v_j)/(λ_j − λ_i) v_i) + α v_j^⊤ V v_j + α² v_j^⊤ V (Σ_{i≠j} (v_i^⊤ V v_j)/(λ_j − λ_i) v_i) = δλ_j v_j^⊤ v_j + λ_j v_j^⊤ (α Σ_{i≠j} (v_i^⊤ V v_j)/(λ_j − λ_i) v_i) + δλ_j v_j^⊤ (α Σ_{i≠j} (v_i^⊤ V v_j)/(λ_j − λ_i) v_i).
Since v_j^⊤ V v_j = 0, v_j^⊤ v_i = 0 for i ≠ j, and v_j^⊤ v_j = 1, we have
α² Σ_{i≠j} (v_i^⊤ V v_j)(v_j^⊤ V v_i)/(λ_j − λ_i) = δλ_j.
Thus, using v_j^⊤ V v_i = −v_i^⊤ V v_j, we have
μ_j − λ_j = α_j + i β_j − λ_j = α² Σ_{i≠j} (v_i^⊤ V v_j)²/(λ_i − λ_j).
Thus, by taking the real part, and noting that Re λ_j(α) = α_j and (v_i^⊤ V v_j)² = λ_i λ_j (v_i^⊤ J v_j)², we have
Re λ_j(α) − λ_j = α² Re Σ_{i≠j} (v_i^⊤ V v_j)²/(λ_i − λ_j) + O(α³) = α² Σ_{i≠j} λ_i λ_j (v_i^⊤ J v_j)²/(λ_i − λ_j) + O(α³).
This concludes the proof.
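The second-order formula above can be validated numerically. The following sketch (our own illustrative check with numpy) compares the exact real parts of the eigenvalues of (I + αJ)H against the prediction λ_j + α² Σ_{i≠j} λ_iλ_j(v_i^⊤Jv_j)²/(λ_i − λ_j) for a small α:

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 5, 1e-3

lam = np.array([0.5, 1.0, 2.0, 3.5, 5.0])      # distinct positive eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(lam) @ Q.T                      # columns of Q are eigenvectors v_j

A = rng.standard_normal((d, d))
J = A - A.T                                     # skew-symmetric

exact = np.sort(np.linalg.eigvals((np.eye(d) + alpha * J) @ H).real)

# Second-order prediction for Re(lambda_j(alpha)).
JV = Q.T @ J @ Q                                # entries v_i^T J v_j
pred = np.array([
    lam[j] + alpha**2 * sum(
        lam[i] * lam[j] * JV[i, j]**2 / (lam[i] - lam[j])
        for i in range(d) if i != j)
    for j in range(d)])

# Agreement up to the neglected O(alpha^3) remainder.
print(np.max(np.abs(exact - np.sort(pred))) < 1e-5)
```

With α = 10⁻³, the O(α²) shift is resolved while the O(α³) remainder is negligible, so the exact and predicted real parts agree to high accuracy.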

Appendix H.2. Expansion of the Eigenvalue at the Saddle Point

Here we derive the formula for the expansion of the eigenvalue at the saddle point. Since the smallest eigenvalue is negative, we cannot use the similar matrix as above. Instead, we use the relation
μ_j H w_j = H (μ_j w_j) = H (I + α J) H w_j,
where we used the definition of the eigenvalues and eigenvectors. Here, we express H′ := (I + αJ)H and its pairs of eigenvalues and eigenvectors as {(μ_i, w_i)}_{i=1}^d. As introduced above, we substitute the expansion into Equation (A86) and obtain
(λ_j + δλ_j) H (v_j + δv_j) = H (I + α J) H (v_j + δv_j).
Then, in the same way as above, since {v_j}_{j=1}^d are the eigenvectors of H and can be used as an orthogonal basis, we expand δv_j in this basis:
δv_j = Σ_{k=1}^d c_{jk} v_k,
where c_{jk} = δv_j^⊤ v_k. By multiplying Equation (A87) by v_i^⊤ with i ≠ j from the left-hand side and neglecting higher-order terms, we have
c_{ji} = λ_j/(λ_j − λ_i) (α v_i^⊤ J v_j).
Next, by multiplying Equation (A87) by v_j^⊤ from the left-hand side, we have
v_j^⊤ H (α J) H δv_j = δλ_j (λ_j + λ_j v_j^⊤ δv_j).
Then, by substituting δv_j with the coefficients of Equation (A89), we have
δλ_j = α² Σ_{i≠j} λ_i λ_j (v_i^⊤ J v_j)²/(λ_i − λ_j) + O(α³).
This concludes the proof.

Appendix I. Convergence Rate of Parallel Sampling Schemes

Appendix I.1. Proof of Lemma 2

First, we introduce the notation. We express the random variables of S-PLD as Y_t^N, and the measure induced by S-PLD, which uses αJ as the interaction term, as μ_t^N(α). We express the measure of PLD as μ_t^N(0); since PLD has no interaction term, this measure can be decomposed into its marginals. We also denote the marginal measure of S-PLD for Y_t^{(n)} as ν_t^{(n)}(α). Please note that the initial distribution is μ_0^N and its marginals are μ_0, as defined in Assumption 4.
Please note that the marginal measures of PLD are the same as those of LD if the initial measures are all the same; thus, each marginal satisfies the Poincaré inequality with constant m_0. This is also a consequence of the tensorization property of the spectral gap (Proposition 4.3.1 in Bakry et al. [19]).
As for the initial condition, from the fact that the χ² divergence is a special case of the Rényi divergence (of order 2), and from the tensorization property of the Rényi divergence (see Theorem 28 in [45]), we have
χ²(μ_t^N(0), π^N) ≤ e^{−2β^{−1} m_0 t} χ²(μ_0^N, π^N) = N e^{−2β^{−1} m_0 t} χ²(μ_0, π).
If the skew acceleration is applied, by the same discussion as for S-LD (see Appendix C.1), S-PLD has a Poincaré constant that is at least m_0; we express it as m(α, N) (≥ m_0). Then we have
χ²(μ_t^N(α), π^N) ≤ N e^{−2β^{−1} m(α,N) t} χ²(μ_0, π).
At first sight, the factor N in the convergence bound seems to make it useless. However, as we discuss below, the bound is meaningful when we bound the bias or the variance. For example, let us consider approximating the true expectation ∫ f(x) dπ(x) by the ensemble average (1/N) Σ_{n=1}^N f(X_t^{(n)}). Then we are interested in bounding the error
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ|.
For this purpose, we can bound the error by the 2-Wasserstein distance as
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ (L_f/√N) W_2(μ_kh^N(α), π^N),
where we assumed that f is L_f-Lipschitz and used the fact that (1/N) Σ_{n=1}^N f(x^{(n)}) is (L_f/√N)-Lipschitz.
To bound the distance, we use the basic relation
W_2²(μ_kh^N(α), π^N) ≤ (2/m(α, N)) χ²(μ_kh^N(α), π^N),
where m(α, N) is the Poincaré constant. This follows from the definitions of the Wasserstein distance and the χ²-divergence; see [46] for the details. Then, combined with the above relations, we obtain the bias bound of S-PLD as
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ L_f √(2/m(α, N)) e^{−β^{−1} m(α,N) kh} χ²(μ_0, π)^{1/2}.
In the same way, we obtain the bias bound of PLD as
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ L_f √(2/m_0) e^{−β^{−1} m_0 kh} χ²(μ_0, π)^{1/2}.
Thus, while the explicit dependency on N disappears, S-PLD shows faster convergence through the relation m(α, N) ≥ m_0. Moreover, if we use skew matrices that do not satisfy the equality condition, we have m(α, N) > m_0.
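To make the S-PLD scheme analyzed here concrete, the following sketch (an illustrative toy implementation, not the paper's experimental code) runs N parallel chains with the memory-efficient interaction J = J_0 ⊗ I_d on a standard Gaussian target and checks the ensemble estimate of E_π‖x‖² = d:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 8, 2                      # N parallel chains in R^d
h, alpha, beta = 0.01, 1.0, 1.0
steps = 2000

def grad_U(x):                   # U(x) = 0.5 ||x||^2, so pi = N(0, I_d)
    return x

A = rng.standard_normal((N, N))
J0 = A - A.T                     # chain-wise skew interaction; J = J0 (x) I_d

x = rng.standard_normal((N, d)) + 5.0        # start far from the target
acc = []
for t in range(steps):
    g = grad_U(x)                            # rows: per-chain gradients
    drift = g + alpha * (J0 @ g)             # (I + alpha J) applied block-wise
    x = x - h * drift + np.sqrt(2 * h / beta) * rng.standard_normal((N, d))
    if t >= steps // 2:                      # time-average after burn-in
        acc.append(np.mean(np.sum(x**2, axis=1)))

est = np.mean(acc)                           # ensemble estimate of E ||x||^2 = d
print(abs(est - d) < 1.0)
```

Note that `J0 @ g` applies J_0 ⊗ I_d without ever materializing the dN × dN matrix, which is the memory saving behind the proposed parallel-chain construction.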

Appendix I.2. Proof for S-ULD

We can characterize the convergence rate in almost the same way as in Appendix C.2. The derivation is the same as above, so we only show the result:
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ L_f √(2/m(α, N)) √((1 + ε̄)/(1 − ε̄)) e^{−(λ_γ/2) kh} χ²(ν_0, π)^{1/2},
where ε̄ and λ_γ are given as follows:
λ γ = Λ ( γ , ϵ ¯ min ( γ , γ 1 ) ) 1 + ϵ ¯ min ( γ , γ 1 ) ,
and
Λ ( γ , ϵ ) = γ Σ 1 1 1 + m 0 Σ 1 β 2 1 2 ( S S + + ) 2 + ( S + ) 2 ,
S = ϵ λ h a m ,
S + = ϵ ( R h a m + γ Σ 1 / 2 ) ,
S + + = γ Σ 1 ϵ ,
λ h a m = 1 1 + m ( α , N ) Σ 1 β 1 ,
ϵ = ϵ ¯ min ( γ , γ 1 ) ,
where ε̄ is an arbitrary, sufficiently small positive value such that Λ(γ, ε̄ min(γ, γ^{−1})) > 0 is satisfied, and
R_ham ≤ max{M, 2}.

Appendix J. Proof of Theorem 7

We restate our theorem with explicit constants.
Theorem A3.
Under Assumptions 1–7, for any k ∈ N and any h ∈ (0, 1 ∧ m/(4M²)) obeying kh ≥ 1 and βm ≥ 2, we have
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ L_f (C̃_0 √(2δ) + C̃_1 √(2h)) √(kη) + L_f √(2/m(α, N)) χ²(μ_0, π)^{1/2} e^{−β^{−1} m(α,N) kh},
where
C ˜ 0 2 = 12 + 8 κ 0 + 2 b + 2 d β β C 0 + β C 0 ,
C ˜ 1 2 = 12 + 8 κ 0 + 2 b + 2 d β C 1 + C 1
C 0 = ( 1 + α ) 2 M 2 κ 0 + 2 1 1 m b + 2 ( 1 + α ) 2 B 2 + d β + B 2 ,
C 1 = 6 ( 1 + α 2 ) M 2 ( β C 0 + d ) ,
The obtained bound is O(kh · h^{1/4}), which is independent of N. Thus, this result is much better than that in [18]. Additionally, note that we can derive a similar bias bound for skew-SGHMC in the same way as for skew-SGLD.
Proof. 
For notational simplicity, we express the random variables of skew-SGLD, which uses αJ as the interaction term, as X_k^N, and those of S-PLD as Y_k^N; in this section, we abbreviate them as X_k and Y_k. We denote the measures of X_k and Y_k as ν_kh^N(α) and μ_kh^N(α), and the marginal measures of X_k^{(n)} and Y_k^{(n)} as ν_kh^{(n)}(α) and μ_kh^{(n)}(α).
Then, we first decompose the bias as
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − (1/N) Σ_{n=1}^N f(Y_k^{(n)})| + E|(1/N) Σ_{n=1}^N f(Y_k^{(n)}) − ∫_{R^d} f dπ| ≤ (L_f/N) Σ_{n=1}^N W_2(ν_kh^{(n)}(α), μ_kh^{(n)}(α)) + (L_f/√N) W_2(μ_kh^N(α), π^N),
where the last term is referred to as (i). We used the Jensen inequality for the first term in the last inequality and moved (1/N) Σ_n outside |·|; each expectation in the first term depends only on the marginal measures, and we used the property of the 2-Wasserstein (2-W) distance.
Furthermore, we decompose the first term as
(L_f/N) Σ_{n=1}^N W_2(μ_kh^{(n)}(α), ν_kh^{(n)}(α)) ≤ (L_f/N) Σ_{n=1}^N [W_2(ν_kh^{(n)}(α), μ_kh^{(n)}(0)) + W_2(μ_kh^{(n)}(α), μ_kh^{(n)}(0))],
where the two terms in the bracket are referred to as (ii) and (iii), respectively, and μ_kh^{(n)}(0) denotes the measure induced by PLD, the naive parallel sampling without a skew-symmetric interaction.
In conclusion, our task is to bound each of the terms (i), (ii), and (iii). Bounding (i) was already discussed in Appendix I.1.
Next, we work on (ii) and (iii). Following [10], we use the weighted Csiszár–Kullback–Pinsker (CKP) inequality to bound the 2-W distance. From Bolley and Villani [47], the weighted CKP inequality bounds each 2-W distance by the relative entropy (KL divergence):
W_2(ν_kh^{(n)}(α), μ_kh^{(n)}(0)) ≤ C_{μ_kh^{(n)}(0)} [KL(ν_kh^{(n)}(α) | μ_kh^{(n)}(0))^{1/2} + (KL(ν_kh^{(n)}(α) | μ_kh^{(n)}(0))/2)^{1/4}],
with
C_{μ_kh^{(n)}(0)} = 2 inf_{λ>0} ((1/λ)(3/2 + log ∫_{R^d} e^{λ‖x^{(n)}‖²} dμ_kh^{(n)}(0)))^{1/2},
and, similarly,
W_2(μ_kh^{(n)}(α), μ_kh^{(n)}(0)) ≤ C_{μ_kh^{(n)}(0)} [KL(μ_kh^{(n)}(α) | μ_kh^{(n)}(0))^{1/2} + (KL(μ_kh^{(n)}(α) | μ_kh^{(n)}(0))/2)^{1/4}],
with the same constant C_{μ_kh^{(n)}(0)}.
We point out that using C_{μ_kh^{(n)}(0)}, rather than C_{ν_kh^{(n)}(α)} or C_{μ_kh^{(n)}(α)}, in the weighted CKP inequality is important. Since μ_kh^{(n)}(0) is induced by the parallel-chain Monte Carlo without the skew-symmetric term, the parallel chain decomposes into independent chains; hence, C_{μ_kh^{(n)}(0)} does not depend on n or N and shows O(d) dependency. In contrast, C_{ν_kh^{(n)}(α)} and C_{μ_kh^{(n)}(α)} show O(dN) dependency, i.e., linear dependency on N, because the interaction term couples the parallel chains and we cannot decompose them easily. This would result in an unsatisfactory dependency on N, which is why we introduced μ_kh^{(n)}(0) in our theoretical analysis.
Please note that, since μ_kh^{(n)}(0) is induced by the naive parallel chain, the marginals are independent of each other and share the same measure if the initial measures are the same. Thus, μ_kh^{(1)}(0) = ⋯ = μ_kh^{(N)}(0); from now on, we express this common marginal as μ_kh(0). Consequently, C_{μ_kh^{(1)}(0)} = ⋯ = C_{μ_kh^{(N)}(0)} = C_{μ_kh(0)}.
Then, substituting the above weighted CKP inequalities and using the Jensen inequality, we obtain
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − (1/N) Σ_{n=1}^N f(Y_k^{(n)})| ≤ L_f C_{μ_kh(0)} (1/N) Σ_{n=1}^N [KL(ν_kh^{(n)}(α) | μ_kh(0))^{1/2} + (KL(ν_kh^{(n)}(α) | μ_kh(0))/2)^{1/4} + KL(μ_kh^{(n)}(α) | μ_kh(0))^{1/2} + (KL(μ_kh^{(n)}(α) | μ_kh(0))/2)^{1/4}] ≤ L_f C_{μ_kh(0)} [(Σ_{n=1}^N KL(ν_kh^{(n)}(α) | μ_kh(0))/N)^{1/2} + (Σ_{n=1}^N KL(ν_kh^{(n)}(α) | μ_kh(0))/(2N))^{1/4} + (Σ_{n=1}^N KL(μ_kh^{(n)}(α) | μ_kh(0))/N)^{1/2} + (Σ_{n=1}^N KL(μ_kh^{(n)}(α) | μ_kh(0))/(2N))^{1/4}].
To analyze the discretization error, we use the following key lemma:
Lemma A1. 
Assume that there exist random variables {X_i ∈ Ω_i}_{i=1}^N and {Y_i ∈ Ω_i}_{i=1}^N. We denote the product space as Ω^N := Ω_1 × ⋯ × Ω_N, and introduce X = (X_1, …, X_N) ∈ Ω^N and Y = (Y_1, …, Y_N) ∈ Ω^N. Let us express their joint probability measures as P(X) := P(X_1, …, X_N) and Q(Y) := Q(Y_1, …, Y_N), and denote the marginal measures of the X_i s and Y_i s as {P_i(X_i)}_{i=1}^N and {Q_i(Y_i)}_{i=1}^N. If P_i ≪ Q_i holds, we have
Σ_{i=1}^N KL(P_i(X_i) ‖ Q_i(Y_i)) ≤ KL(P(X) ‖ Q(Y)).
A proof is given in Appendix J.1. We apply this lemma as
Σ_{n=1}^N KL(ν_kh^{(n)}(α) | μ_kh(0)) ≤ KL(ν_kh^N(α) | μ_kh^N(0)),
Σ_{n=1}^N KL(μ_kh^{(n)}(α) | μ_kh(0)) ≤ KL(μ_kh^N(α) | μ_kh^N(0)).
Combining these results with the above bias bound, we obtain
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − (1/N) Σ_{n=1}^N f(Y_k^{(n)})| ≤ L_f C_{μ_kh(0)} [(KL(ν_kh^N(α) | μ_kh^N(0))/N)^{1/2} + (KL(ν_kh^N(α) | μ_kh^N(0))/(2N))^{1/4} + (KL(μ_kh^N(α) | μ_kh^N(0))/N)^{1/2} + (KL(μ_kh^N(α) | μ_kh^N(0))/(2N))^{1/4}].
Thus, we need to bound KL(μ_kh^N(α) | μ_kh^N(0)), KL(ν_kh^N(α) | μ_kh^N(0)), and C_{μ_kh(0)}. We can upper-bound them using the results of [2]; for that purpose, we need to replace the constants in [2] as shown below. Here, we discuss how the constants in the assumptions change in the ensemble scheme. We define
∇u^N(x^N) := (∇u(x^{(1)}), …, ∇u(x^{(N)})).
First, we focus on the smoothness condition. From Assumption 2 and Lemma 8 in [18], we have
‖(I + αJ) ∇u^N(x^N, z) − (I + αJ) ∇u^N(y^N, z)‖ ≤ M(1 + α) ‖x^N − y^N‖,
where the norm on the right-hand side is the Euclidean norm in R^{dN}.
Next, we discuss the dissipativity condition. Define ∇U_α(x^N) := ∇U^N(x^N) + αJ ∇U^N(x^N). Then, for x^N ∈ R^{dN}, under Assumptions 1 to 6, we have
x^N · ∇U_α(x^N) ≥ m ‖x^N‖² − bN.
Next, we check the condition on the drift function at the origin, ‖∇u(0, z)‖ ≤ B. We can calculate in the same way as for the smoothness condition and obtain
‖(I + αJ) ∇U^N(0^N)‖ ≤ B√N (1 + α).
Next, we study the condition on the stochastic gradient, E[‖∇Û(x) − ∇U(x)‖²] ≤ 2δ(M²‖x‖² + B²). This is easily modified to
E[‖(I + αJ) ∇Û^N(x^N) − (I + αJ) ∇U^N(x^N)‖²] ≤ (1 + α)² E[‖∇Û^N(x^N) − ∇U^N(x^N)‖²] ≤ (1 + α)² Σ_{i=1}^N E[‖∇Û(x^{(i)}) − ∇U(x^{(i)})‖²] ≤ (1 + α)² Σ_{i=1}^N 2δ(M²‖x^{(i)}‖² + B²) ≤ 2δ(1 + α)² (M²‖x^N‖² + NB²).
Finally, we discuss the initial condition, κ_0 := log ∫_{R^d} e^{‖x‖²} p_0(x) dx < ∞. We assume that the initial probability distribution is μ_0^N(X_0^N) = μ_0(X_0^{(1)}) × ⋯ × μ_0(X_0^{(N)}), which means that all the marginal probabilities are the same. Then,
κ_0^N := log ∫_{R^{dN}} e^{‖x^N‖²} μ_0^N(x^N) dx^N = log Π_{n=1}^N ∫_{R^d} e^{‖x^{(n)}‖²} μ_0(x^{(n)}) dx^{(n)} = N κ_0.
In this way, the constants in the assumptions are modified and expressed in terms of N and α. Then, combined with the results of [2], we can derive the following relations:
C ν k h ( 0 ) 12 + 8 κ 0 + 2 b + 2 d β ,
KL ( ν k h N | μ k h N ( 0 ) ) N ( C 0 β δ + C 1 η ) k η ,
KL ( μ k h N ( α ) | μ k h N ( 0 ) ) N β 2 α 2 M 2 ( κ 0 + b + d / β m ) k η ,
where
C 0 = ( 1 + α ) 2 M 2 κ 0 + 2 1 1 m b + 2 ( 1 + α ) 2 B 2 + d β + B 2 ,
C 1 = 6 ( 1 + α 2 ) M 2 ( β C 0 + d ) .
This concludes the proof. □

Appendix J.1. Proof of Lemma A1

Proof. 
We prove this lemma using the Donsker–Varadhan representation of the relative entropy [48], which admits the dual representation
KL(P(X) ‖ Q(Y)) = sup_{T: Ω^N → R} { E_{P(X)}[T] − log E_{Q(Y)}[e^T] },
where the supremum is taken over all functions T for which the expectations of e^T and T are finite. We then restrict the function class to F := {T | T_i: Ω_i → R s.t. T(X) = Σ_{i=1}^N T_i(X_i)}, where each expectation of e^{T_i} and T_i is finite. Then, by definition,
KL(P(X) ‖ Q(Y)) = sup_{T: Ω^N → R} { E_{P(X)}[T] − log E_{Q(Y)}[e^T] } ≥ sup_{T ∈ F} { E_{P(X)}[Σ_i T_i] − log E_{Q(Y)}[e^{Σ_i T_i}] }.
Then we have
KL(P(X) ‖ Q(Y)) ≥ sup_{T ∈ F} { Σ_i E_{P_i(X_i)}[T_i] − log Π_i E_{Q_i(Y_i)}[e^{T_i}] } = sup_{T ∈ F} Σ_i { E_{P_i(X_i)}[T_i] − log E_{Q_i(Y_i)}[e^{T_i}] } = Σ_i sup_{T_i: Ω_i → R} { E_{P_i(X_i)}[T_i] − log E_{Q_i(Y_i)}[e^{T_i}] } = Σ_{i=1}^N KL(P_i(X_i) ‖ Q_i(Y_i)). □
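Lemma A1 can be sanity-checked in a case where every KL divergence is available in closed form. The sketch below (illustrative, using the standard closed-form KL between Gaussians) compares the sum of marginal KLs against the joint KL for a correlated P and a product measure Q:

```python
import numpy as np

def kl_gauss(m1, S1, m2, S2):
    """Closed-form KL( N(m1,S1) || N(m2,S2) )."""
    k = len(m1)
    iS2 = np.linalg.inv(S2)
    dm = m2 - m1
    return 0.5 * (np.trace(iS2 @ S1) + dm @ iS2 @ dm - k
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

# P: correlated 2-d Gaussian; Q: product measure with independent components.
mP, SP = np.array([1.0, -1.0]), np.array([[1.0, 0.6], [0.6, 2.0]])
mQ, SQ = np.zeros(2), np.diag([1.5, 0.8])

joint = kl_gauss(mP, SP, mQ, SQ)
marginals = sum(
    kl_gauss(mP[i:i + 1], SP[i:i + 1, i:i + 1], mQ[i:i + 1], SQ[i:i + 1, i:i + 1])
    for i in range(2))

print(marginals <= joint + 1e-12)    # Lemma A1: sum of marginal KLs <= joint KL
```

The gap between the two sides is exactly the KL divergence between P and the product of its marginals, so equality holds if and only if P is itself a product measure.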

Appendix K. Order Expansion

Bias Expansion for S-PLD

Recall that the bias of S-PLD satisfies
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ L_f (C̃_0 √(2δ) + C̃_1 √(2h)) √(kη) + L_f √(2/m(α, N)) χ²(μ_0, π)^{1/2} e^{−β^{−1} m(α,N) kh},
where
C ˜ 0 2 = 12 + 8 κ 0 + 2 b + 2 d β β C 0 + β C 0 ,
C ˜ 1 2 = 12 + 8 κ 0 + 2 b + 2 d β C 1 + C 1
C 0 = ( 1 + α ) 2 M 2 κ 0 + 2 1 1 m b + 2 ( 1 + α ) 2 B 2 + d β + B 2 ,
C 1 = 6 ( 1 + α 2 ) M 2 ( β C 0 + d ) ,
First, we discuss the convergence of the continuous dynamics. Using the eigenvalue expansion in Theorem 6, with some positive constant d_0, we have
m(α, N) ≥ m_0 + α² d_0 + O(α³).
Then, assuming α² is small enough and considering the Taylor expansion, we have
L_f √(2/m(α, N)) χ²(μ_0, π)^{1/2} e^{−β^{−1} m(α,N) t} ≤ L_f χ²(μ_0, π)^{1/2} √2 (1/√m_0 − (d_0/(2 m_0^{3/2})) α²) e^{−β^{−1} m_0 t}.
As for the discretization and stochastic gradient error, using the Taylor expansion, there exist positive constants d_1 and d_2 such that
L_f (C̃_0 √(2δ) + C̃_1 √(2h)) √(kη) ≤ (d_1 α + d_2 α² + Const) kh.
Combining these terms, we have
E|(1/N) Σ_{n=1}^N f(X_k^{(n)}) − ∫_{R^d} f dπ| ≤ (d_1 α + d_2 α²) kh − α² L_f χ²(μ_0, π)^{1/2} (√2 d_0/(2 m_0^{3/2})) e^{−β^{−1} m_0 t} + Const.
Thus, there exists an optimal α that minimizes the bias. Please note that at k = 0, acceleration always occurs. As k goes to infinity, the second term goes to 0; thus, the first term becomes dominant, which means that we suffer larger discretization and stochastic gradient errors.

Appendix L. Hyperparameters of the Proposed Algorithm

Here we discuss how to set the hyperparameters of the algorithm. There are three hyperparameters: α_0, η, and c. We numerically found that setting c = 0.95 works well for real datasets, including the LDA experiment and Bayesian neural network regression and classification. For toy datasets, we set c = 0.9.
As for α_0 and η, we empirically found that the following scaling trick works well for real datasets, including the LDA experiment and Bayesian neural network regression and classification:
α 0 1 1 N 2 n U ( x 0 ( n ) ) 2 N h .
and we use η = 0.1 α_0. The intuition is that the magnitude of the gradient can be very different in each dimension, so we introduce scaling by the gradient. We also multiply by h so that the stochastic gradient and discretization error of the skew term will not be dominant compared with the usual gradient term. Finally, we multiply by a constant so that α_0 will not be too small.

Appendix M. Proof of Theorem 8

In this section, we derive the upper bound on the bias of skew-SGLD based on [23]. This approach requires the logarithmic Sobolev inequality [19], which is stronger than the Poincaré inequality. First, we present its definition: we say that π on R^d with generator L satisfies the logarithmic Sobolev inequality with constant λ if, for all functions f on R^d with ∫_{R^d} f² dπ = 1,
∫_{R^d} f² ln f² dπ ≤ −(2/λ) ∫_{R^d} f L f dπ.
The logarithmic Sobolev inequality is stronger than the Poincaré inequality and induces convergence in KL divergence; see [19] for details. It was proved in [2,18] that our dynamics, LD, S-LD, PLD, S-PLD, and skew-SGLD, satisfy logarithmic Sobolev inequalities under our assumptions. We express the logarithmic Sobolev constant of skew-SGLD as λ(α, N); it depends on the skew matrices and the Poincaré constant, and we estimate it in Appendix M.1.
To upper-bound the bias, we control the KL divergence. We denote the law of skew-SGLD at iteration $k$ with interaction strength $\alpha$ by $\mu_{kh}^N(\alpha)$. The bias is first upper-bounded via the 2-Wasserstein distance,
$$\left|\mathbb{E}\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f \, d\pi\right| \le \frac{\|f\|_{\mathrm{Lip}}}{\sqrt{N}}\, W_2\big(\mu_{kh}^N(\alpha), \pi^N\big).$$
Then, from the transportation inequality [19],
$$W_2\big(\mu_{kh}^N(\alpha), \pi^N\big) \le \sqrt{\frac{2}{\lambda(\alpha,N)}\,\mathrm{KL}\big(\mu_{kh}^N(\alpha) \,|\, \pi^N\big)}.$$
Thus, we upper-bound the KL divergence using the technique in [23]. In the original proof, the full gradient $\nabla U$ is used, so we replace it with the stochastic gradient; moreover, we introduce the skew interaction term.
First, Lemma 11 in [23] is modified to
$$\mathbb{E}_{\pi^N}\|\nabla U^N\|^2 \le \frac{dNM}{\beta}.$$
Then, Lemma 12 in [23] is modified to
$$\mathbb{E}_{\mu}\|\nabla U^N\|^2 \le \frac{4M^2}{\lambda}\,\mathrm{KL}(\mu \,|\, \pi^N) + \frac{2dNM}{\beta},$$
for any integrable $\mu$.
Hereinafter, we drop $N$ from $X^N$, $U^N$, and $\hat{U}^N$ for notational simplicity. Focusing on skew-SGLD at iteration $k$, we consider the following SDE for $t \in (kh, (k+1)h]$:
$$dX_t = -(I + \alpha J)\nabla\tilde{U}(X_k)\,dt + \sqrt{2\beta^{-1}}\,dw_t,$$
where $\nabla\tilde{U}(X_k)$ is the stochastic gradient conditioned on $X_k$. The solution of this SDE after one step is
$$X_{k+1} = X_k - (I + \alpha J)\nabla\tilde{U}(X_k)\,h + \sqrt{2h\beta^{-1}}\,\epsilon,$$
where $\epsilon \sim \mathcal{N}(0, I)$.
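As a concrete illustration, one discretized skew-SGLD step of this form might be sketched as follows (the Gaussian example, the particular choice of $J$, and all names here are ours, not the paper's):

```python
import numpy as np

def skew_sgld_step(x, grad_u, h, alpha, J, beta=1.0, rng=None):
    """One Euler-Maruyama step of skew-SGLD:
        x_{k+1} = x_k - (I + alpha * J) grad_u * h + sqrt(2 h / beta) * eps,
    where J is a skew-symmetric matrix (J.T == -J) and eps ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    drift = (np.eye(d) + alpha * J) @ grad_u   # non-reversible perturbation of the gradient
    noise = np.sqrt(2.0 * h / beta) * rng.standard_normal(d)
    return x - h * drift + noise

# Example: standard Gaussian target U(x) = ||x||^2 / 2, so grad U(x) = x,
# with the 2-d rotation generator as the skew-symmetric perturbation.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
x = np.array([1.0, 0.0])
x = skew_sgld_step(x, grad_u=x, h=0.1, alpha=0.5, J=J)
```

Setting `alpha=0` recovers the plain (reversible) SGLD update.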
We would like to derive the continuity equation corresponding to Equation (A154). Following [23], we write $X_t$ as $x_t$ and $X_k$ as $x_0$ for simplicity. Let $\rho_{0t}(x_0, x_t)$ denote the joint distribution of $(x_0, x_t)$. Then, the conditional and marginal relations are written as
$$\rho_{0t}(x_0, x_t) = \rho_0(x_0)\,\rho_{t|0}(x_t|x_0) = \rho_t(x_t)\,\rho_{0|t}(x_0|x_t).$$
The conditional density $\rho_{t|0}(x_t|x_0)$ follows the FP equation
$$\frac{\partial \rho_{t|0}(x_t|x_0)}{\partial t} = \nabla\cdot\big(\rho_{t|0}(x_t|x_0)\,(I+\alpha J)\nabla\tilde{U}(x_0)\big) + \beta^{-1}\Delta\rho_{t|0}(x_t|x_0).$$
Then, following [23], to derive the evolution of $\rho_t$, we take the expectation over $\rho_0(x_0)$:
$$\frac{\partial\rho_t(x)}{\partial t} = \int_{\mathbb{R}^d}\frac{\partial\rho_{t|0}(x_t|x_0)}{\partial t}\,\rho_0(x_0)\,dx_0 = \nabla\cdot\big(\rho_t(x_t)\,\mathbb{E}_{\rho_{0|t}}[(I+\alpha J)\nabla\tilde{U}(x_0)\,|\,x_t = x]\big) + \beta^{-1}\Delta\rho_t(x).$$
Then, we take the expectation with respect to the stochastic gradient in the above equation and absorb it into $\mathbb{E}_{\rho_{0|t}}$ for notational simplicity. Following the discussion of Lemma 3 in [23], we obtain
$$\frac{\partial\,\mathrm{KL}(\mu_t\,|\,\pi)}{\partial t} \le -\frac{3}{4}I(\mu_t^N\,|\,\pi^N) + 2\,\mathbb{E}_{\rho_{0t}}\big[\|\nabla U(X_t) - \nabla U(X_0)\|^2\big] + 2(1+\alpha)^2\,\mathbb{E}_{\rho_{0t}}\big[\|\nabla U(X_0) - \nabla\tilde{U}(X_0)\|^2\big] + 2\alpha^2\,\mathbb{E}_{\rho_{0t}}\big[\|\nabla U(X_0)\|^2\big],$$
where $t \in (kh, (k+1)h]$ and
$$X_t = X_k - t(I+\alpha J)\nabla U(X_k) + \sqrt{2t\beta^{-1}}\,\epsilon.$$
Then, from [18], we can upper-bound the second term as
$$\mathbb{E}_{\rho_{0t}}\big[\|\nabla U(X_0) - \nabla\tilde{U}(X_0)\|^2\big] \le NC_0\delta,$$
$$C_0 := 2M^2\kappa_0 + 2\max\Big\{1, \frac{1}{m}\Big\}b + 2(1+\alpha)^2B^2 + \frac{d}{\beta} + B^2,$$
and the third term is upper-bounded as
$$\mathbb{E}_{\rho_{0t}}\big[\|\nabla U(X_0)\|^2\big] \le 2M^2\,\mathbb{E}_{\rho_{0t}}\big[\|X_0\|^2\big] + 2NB^2 \le NC_0,$$
where we used Lemmas 2 and 7 in [2]. Finally, from the original proof of [23], we obtain
$$2\,\mathbb{E}_{\rho_{0t}}\big[\|\nabla U(X_t) - \nabla U(X_0)\|^2\big] \le \frac{8t^2M^4}{\lambda}\,\mathrm{KL}(\mu_k^N\,|\,\pi^N) + \frac{4t^2dNM^3}{\beta} + \frac{4tdNM^2}{\beta}.$$
Then, in conclusion, under $h \in (0, 1 \wedge \frac{m}{4M^2})$ with $kh \ge 1$ and $\beta \ge \frac{m}{2}$, we obtain
$$\frac{d}{dt}\mathrm{KL}(\mu_t^N\,|\,\pi^N) \le -\frac{3}{4}I(\mu_t^N\,|\,\pi^N) + \frac{8t^2M^4}{\lambda(\alpha,N)}\,\mathrm{KL}(\mu_k^N\,|\,\pi^N) + \frac{4t^2dNM^3}{\beta} + \frac{4tdNM^2}{\beta} + 2NC_0\big(\delta(1+\alpha)^2 + \alpha^2\big).$$
For simplicity, we assume that $h \in (0, \frac{m}{4M^2})$ and $\frac{m}{4M^2} < 1$; then we obtain
$$\frac{d}{dt}\mathrm{KL}(\mu_t^N\,|\,\pi^N) \le -\frac{3}{4}I(\mu_t^N\,|\,\pi^N) + \frac{8t^2M^4}{\lambda(\alpha,N)}\,\mathrm{KL}(\mu_k^N\,|\,\pi^N) + t\,\frac{2dNM}{\beta}(m+4M) + 2NC_0\big(\delta(1+\alpha)^2+\alpha^2\big).$$
Then, using $t \in (kh, (k+1)h]$, we obtain
$$\mathrm{KL}(\mu_{k+1}^N\,|\,\pi^N) \le e^{-\frac{3}{2}\lambda(\alpha,N)h}\left(1 + \frac{16h^3M^4}{\lambda}\right)\mathrm{KL}(\mu_k^N\,|\,\pi^N) + e^{-\frac{3}{2}\lambda(\alpha,N)h}\left(\frac{2h^2dNM}{\beta}(m+4M) + 8hNC_0\big(\delta(1+\alpha)^2+\alpha^2\big)\right).$$
If $h \in (0, \frac{\lambda(\alpha,N)}{4\sqrt{2}M^2})$, we obtain
$$\mathrm{KL}(\mu_{k+1}^N\,|\,\pi^N) \le e^{-\lambda(\alpha,N)h}\,\mathrm{KL}(\mu_k^N\,|\,\pi^N) + \frac{2h^2dNM}{\beta}(m+4M) + 8hNC_0\big(\delta(1+\alpha)^2+\alpha^2\big).$$
From this one-step inequality, we obtain
$$\mathrm{KL}(\mu_k^N\,|\,\pi^N) \le e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0^N\,|\,\pi^N) + \frac{1}{1-e^{-\lambda(\alpha,N)h}}\left(\frac{2h^2dNM}{\beta}(m+4M) + 8hNC_0\big(\delta(1+\alpha)^2+\alpha^2\big)\right) \le e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0^N\,|\,\pi^N) + \frac{2N}{\lambda(\alpha,N)}\left(\frac{hdM}{\beta}(m+4M) + 4C_0\big(\delta(1+\alpha)^2+\alpha^2\big)\right).$$
Then, finally, we obtain
$$\left|\mathbb{E}\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi\right| \le \frac{\|f\|_{\mathrm{Lip}}}{\sqrt{N}}\sqrt{\frac{2}{\lambda(\alpha,N)}\,\mathrm{KL}\big(\mu_{kh}^N(\alpha)\,|\,\pi^N\big)} \le \|f\|_{\mathrm{Lip}}\sqrt{\frac{2}{\lambda(\alpha,N)}\left(e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0\,|\,\pi) + \frac{2}{\lambda(\alpha,N)}\left(\frac{hdM}{\beta}(m+4M) + 4C_0\big(\delta(1+\alpha)^2+\alpha^2\big)\right)\right)} \le \|f\|_{\mathrm{Lip}}\sqrt{\frac{2}{\lambda(\alpha,N)}\left(e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0\,|\,\pi) + \frac{C_3(\alpha)}{\lambda(\alpha,N)}\right)},$$
where
$$C_3(\alpha) := \frac{2hdM}{\beta}(m+4M) + 8C_0\big(\delta(1+\alpha)^2+\alpha^2\big),$$
$$C_0 := 2M^2\kappa_0 + 2\max\Big\{1, \frac{1}{m}\Big\}b + 2(1+\alpha)^2B^2 + \frac{d}{\beta} + B^2.$$
Moreover, from Appendix M.1, the logarithmic Sobolev constant is
$$\lambda(\alpha,N) := \left[\left(1 + \frac{\beta}{m(\alpha,N)}|C(m_0)|\right)\frac{\pi e^2}{2} + \frac{3}{2m(\alpha,N)}\right]^{-1},$$
where
$$C(m_0) := \mathbb{E}_{\pi^N}\big[\|\nabla U^N(x)\|^2\big]^{1/2} + \sqrt{\frac{8}{m_0}}\,\mathbb{E}_{\pi^N}\big[\|\nabla U^N(x)\|^2\big]^{1/2}.$$

Appendix M.1. Estimation of the Logarithmic Sobolev Constant

In this section, we estimate the logarithmic Sobolev constant using the technique of the restricted logarithmic Sobolev inequality introduced in [49].
The technique of [49] estimates the constant of the logarithmic Sobolev inequality as follows. Assume that $\pi$ on $\mathbb{R}^d$ with $L$ satisfies the Poincaré inequality with constant $m$. Then, for any function $u$ on $\mathbb{R}^d$ that satisfies
$$\int_{\mathbb{R}^d} u\,d\pi = 0 \quad\text{and}\quad \int_{\mathbb{R}^d} u^2\,d\pi = 1,$$
we find a constant $b$ that satisfies
$$\int_{\mathbb{R}^d} u^2 \ln u^2\,d\pi \le b \int_{\mathbb{R}^d} u L u\,d\pi.$$
Then the logarithmic Sobolev constant is at least $2(b + \frac{3}{m})^{-1}$. Thus, we only need to focus on this restricted function class to estimate the constant $b$. We slightly modify Lemma 3.2 of [49], which estimates the constant $b$ in Equation (A176), to apply it in our setting. In Lemma 3.2 of [49], it was proved that if $u$ on $\mathbb{R}^d$ satisfies the conditions in Equation (A175), then for any $t \in (0,1)$, we have
$$\int_{\mathbb{R}^d} u L u\,d\pi \ge \frac{t}{\pi e^2}\int_{\mathbb{R}^d} u^2 \ln u^2\,d\pi + (1-t)m + \frac{t}{\beta}\int_{\mathbb{R}^d}\left(\frac{1}{2}LU(x) - \pi e^2\,\|\nabla U(x)\|\right)u^2\,d\pi,$$
where we assume that $\pi \propto e^{-\beta U(x)}$ satisfies the Poincaré inequality with constant $m$. If there exists a constant $C > -\infty$ such that
$$\frac{1}{\beta}\int_{\mathbb{R}^d}\left(\frac{1}{2}LU(x) - \pi e^2\,\|\nabla U(x)\|\right)u^2\,d\pi \ge C,$$
then by setting $t = m/(m+|C|)$, we can show that
$$\int_{\mathbb{R}^d} u L u\,d\pi \ge \frac{m/(m+|C|)}{\pi e^2}\int_{\mathbb{R}^d} u^2 \ln u^2\,d\pi.$$
Thus, with $t = m/(m+|C|)$, the constant $b$ in Equation (A176) is $b = \pi e^2/t = \pi e^2(m+|C|)/m$, and the logarithmic Sobolev constant is at least $2\big(\pi e^2(m+|C|)/m + 3/m\big)^{-1}$.
We now analyze the constant $C$. The first term of the integral in Equation (A178) is lower-bounded by
$$\mathbb{E}_\pi[LU(x)\,u^2] \ge -\big|\mathbb{E}_\pi[U(x)LU(x)]\big|^{1/2}\big|\mathbb{E}_\pi[u^2Lu^2]\big|^{1/2} \ge -2\,\mathbb{E}_\pi\big[\|\nabla U(x)\|^2\big]^{1/2},$$
where we used the properties of $L$; see [19] for details. As for the second term, it is lower-bounded by
$$-\big|\mathbb{E}_\pi[\|\nabla U(x)\|\,u^2]\big| \ge -\mathbb{E}_\pi\big[\|\nabla U(x)\|^2u^2\big]^{1/2} \ge -\sqrt{\frac{1}{m}\big|\mathbb{E}_\pi\big[(\|\nabla U(x)\|\,|u|)\,L(\|\nabla U(x)\|\,|u|)\big]\big|} \ge -\sqrt{\frac{8}{m}}\,\mathbb{E}_\pi\big[\|\nabla U(x)\|^2\big]^{1/2}.$$
Thus, by setting
$$C := \mathbb{E}_\pi\big[\|\nabla U(x)\|^2\big]^{1/2} + \sqrt{\frac{8}{m_0}}\,\mathbb{E}_\pi\big[\|\nabla U(x)\|^2\big]^{1/2},$$
we can estimate the logarithmic Sobolev constant as $2\big(\pi e^2(m+|C|)/m + 3/m\big)^{-1}$.
In our setting, this is modified to
$$\lambda(\alpha,N) = \left[\left(1 + \frac{\beta}{m(\alpha,N)}|C(m_0)|\right)\frac{\pi e^2}{2} + \frac{3}{2m(\alpha,N)}\right]^{-1},$$
where
$$C(m_0) := \mathbb{E}_{\pi^N}\big[\|\nabla U^N(x)\|^2\big]^{1/2} + \sqrt{\frac{8}{m_0}}\,\mathbb{E}_{\pi^N}\big[\|\nabla U^N(x)\|^2\big]^{1/2}.$$
Finally, if we increase $m(\alpha,N)$, then $\lambda(\alpha,N)$ increases. Thus, since $m(\alpha,N) \ge m(\alpha=0,N)$, we obtain $\lambda(\alpha,N) \ge \lambda(\alpha=0,N)$.

Appendix M.2. Computational Complexity

To derive the computational complexity, for simplicity, we assume that $\delta \le h$ and also set $\alpha^2 = h$. This means that the variance of the stochastic gradient is small enough and that we use a small $\alpha$. Then the bias is
$$\left|\mathbb{E}\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi\right| \le \|f\|_{\mathrm{Lip}}\sqrt{\frac{2}{\lambda(\alpha,N)}\left(e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0\,|\,\pi) + \frac{C_3(\alpha)}{\lambda(\alpha,N)}\right)} \le \|f\|_{\mathrm{Lip}}\sqrt{\frac{2}{\lambda(\alpha,N)}}\left(\sqrt{e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0\,|\,\pi)} + \sqrt{\frac{C_3(\alpha)}{\lambda(\alpha,N)}}\right),$$
where
$$C_3(\alpha) := h\left(\frac{2dM}{\beta}(m+4M) + 8C_0\big((1+h^{1/2})^2 + 1\big)\right),$$
$$C_0 := 2M^2\kappa_0 + 2\max\Big\{1,\frac{1}{m}\Big\}b + 2(1+h^{1/2})^2B^2 + \frac{d}{\beta} + B^2.$$
Then we define
$$C_3 := \frac{2dM}{\beta}(m+4M) + 8C_0\big((1+h^{1/2})^2+1\big),$$
and use a step size that satisfies $h = \frac{\lambda(\alpha,N)\xi^2}{2C_3\|f\|_{\mathrm{Lip}}^2}$. Then, when we use
$$k \ge \frac{2}{\lambda(\alpha,N)h}\ln\left(\frac{\|f\|_{\mathrm{Lip}}}{\xi}\sqrt{\frac{2\,\mathrm{KL}(\mu_0\,|\,\pi)}{\lambda(\alpha,N)}}\right),$$
we have
$$\left|\mathbb{E}\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi\right| \le \frac{\xi}{2} + \frac{\xi}{2} = \xi.$$
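This argument gives a concrete recipe for choosing the step size and iteration count given a target accuracy $\xi$. A minimal sketch, assuming the constants $\lambda(\alpha,N)$, $C_3$, $\mathrm{KL}(\mu_0\,|\,\pi)$, and the Lipschitz constant of $f$ are known (the function and argument names here are ours):

```python
import math

def complexity_for_accuracy(xi, lam, C3, kl0, f_lip=1.0):
    """Step size h and iteration count k suggested by the bound:
        h = lam * xi^2 / (2 * C3 * f_lip^2),
        k >= (2 / (lam * h)) * ln((f_lip / xi) * sqrt(2 * kl0 / lam)),
    where lam = lambda(alpha, N), kl0 = KL(mu_0 | pi), and f_lip is the
    Lipschitz constant of the test function f."""
    h = lam * xi ** 2 / (2.0 * C3 * f_lip ** 2)
    k = (2.0 / (lam * h)) * math.log((f_lip / xi) * math.sqrt(2.0 * kl0 / lam))
    return h, max(math.ceil(k), 0)
```

Note that $kh$ scales like $\frac{2}{\lambda(\alpha,N)}\ln(\cdot)$, so increasing the logarithmic Sobolev constant via the skew perturbation directly reduces the required computation.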

References

1. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
2. Raginsky, M.; Rakhlin, A.; Telgarsky, M. Non-convex learning via Stochastic Gradient Langevin Dynamics: A nonasymptotic analysis. In Proceedings of the Conference on Learning Theory, Amsterdam, The Netherlands, 7–10 July 2017; pp. 1674–1703.
3. Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning, Washington, DC, USA, 28 June–2 July 2011; pp. 681–688.
4. Livingstone, S.; Girolami, M. Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions. Entropy 2014, 16, 3074–3102.
5. Hartmann, C.; Richter, L.; Schütte, C.; Zhang, W. Variational Characterization of Free Energy: Theory and Algorithms. Entropy 2017, 19, 626.
6. Neal, R.M. Improving asymptotic variance of MCMC estimators: Non-reversible chains are better. arXiv 2004, arXiv:math/0407281.
7. Neklyudov, K.; Welling, M.; Egorov, E.; Vetrov, D. Involutive MCMC: A unifying framework. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 13–18 July 2020; pp. 7273–7282.
8. Gao, X.; Gurbuzbalaban, M.; Zhu, L. Breaking Reversibility Accelerates Langevin Dynamics for Non-Convex Optimization. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 17850–17862.
9. Eberle, A.; Guillin, A.; Zimmer, R. Couplings and quantitative contraction rates for Langevin dynamics. Ann. Probab. 2019, 47, 1982–2010.
10. Gao, X.; Gürbüzbalaban, M.; Zhu, L. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv 2018, arXiv:1809.04618.
11. Cheng, X.; Chatterji, N.S.; Abbasi-Yadkori, Y.; Bartlett, P.L.; Jordan, M.I. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv 2018, arXiv:1805.01648.
12. Chen, T.; Fox, E.; Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1683–1691.
13. Hwang, C.R.; Hwang-Ma, S.Y.; Sheu, S.J. Accelerating Gaussian diffusions. Ann. Appl. Probab. 1993, 3, 897–913.
14. Hwang, C.R.; Hwang-Ma, S.Y.; Sheu, S.J. Accelerating diffusions. Ann. Appl. Probab. 2005, 15, 1433–1444.
15. Hwang, C.R.; Normand, R.; Wu, S.J. Variance reduction for diffusions. Stoch. Process. Their Appl. 2015, 125, 3522–3540.
16. Duncan, A.B.; Lelièvre, T.; Pavliotis, G.A. Variance Reduction Using Nonreversible Langevin Samplers. J. Stat. Phys. 2016, 163, 457–491.
17. Duncan, A.B.; Nüsken, N.; Pavliotis, G.A. Using Perturbed Underdamped Langevin Dynamics to Efficiently Sample from Probability Distributions. J. Stat. Phys. 2017, 169, 1098–1131.
18. Futami, F.; Sato, I.; Sugiyama, M. Accelerating the diffusion-based ensemble sampling by non-reversible dynamics. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 13–18 July 2020; pp. 3337–3347.
19. Bakry, D.; Gentil, I.; Ledoux, M. Analysis and Geometry of Markov Diffusion Operators; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 348.
20. Roussel, J.; Stoltz, G. Spectral methods for Langevin dynamics and associated error estimates. ESAIM Math. Model. Numer. Anal. 2018, 52, 1051–1083.
21. Menz, G.; Schlichting, A. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab. 2014, 42, 1809–1884.
22. Liu, Q.; Lee, J.; Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 24–26 June 2016; pp. 276–284.
23. Vempala, S.; Wibisono, A. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8094–8106.
24. Lelièvre, T.; Nier, F.; Pavliotis, G.A. Optimal non-reversible linear drift for the convergence to equilibrium of a diffusion. J. Stat. Phys. 2013, 152, 237–274.
25. Tripuraneni, N.; Rowland, M.; Ghahramani, Z.; Turner, R. Magnetic Hamiltonian Monte Carlo. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3453–3461.
26. Nusken, N.; Pavliotis, G. Constructing sampling schemes via coupling: Markov semigroups and optimal transport. SIAM/ASA J. Uncertain. Quantif. 2019, 7, 324–382.
27. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2378–2386.
28. Zhang, J.; Zhang, R.; Chen, C. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv 2018, arXiv:1809.01293.
29. Wang, Y.; Li, W. Information Newton's flow: Second-order optimization method in probability space. arXiv 2020, arXiv:2001.04341.
30. Wibisono, A. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; pp. 2093–3027.
31. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
32. Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R.D.; Neven, H. Bayesian sampling using stochastic gradient thermostats. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–11 December 2014; pp. 3203–3211.
33. Patterson, S.; Teh, Y.W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3102–3110.
34. Dua, D.; Graff, C. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 21 July 2021).
35. Villani, C. Optimal transportation, dissipative PDE's and functional inequalities. In Optimal Transportation and Applications; Springer: Berlin/Heidelberg, Germany, 2003; pp. 53–89.
36. Bakry, D.; Barthe, F.; Cattiaux, P.; Guillin, A. A simple proof of the Poincaré inequality for a large class of probability measures including the log-concave case. Electron. Commun. Probab. 2008, 13, 21.
37. Nelson, E. Dynamical Theories of Brownian Motion; Princeton University Press: Princeton, NJ, USA, 1967; Volume 3.
38. Pavliotis, G.A. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations; Springer: Berlin/Heidelberg, Germany, 2014; Volume 60.
39. Franke, B.; Hwang, C.R.; Pai, H.M.; Sheu, S.J. The behavior of the spectral gap under growing drift. Trans. Am. Math. Soc. 2010, 362, 1325–1350.
40. Landim, C.; Seo, I. Metastability of Nonreversible Random Walks in a Potential Field and the Eyring-Kramers Transition Rate Formula. Commun. Pure Appl. Math. 2018, 71, 203–266.
41. Landim, C.; Mariani, M.; Seo, I. Dirichlet's and Thomson's principles for non-selfadjoint elliptic operators with application to non-reversible metastable diffusion processes. Arch. Ration. Mech. Anal. 2019, 231, 887–938.
42. Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2012; Volume 3.
43. Okamoto, M. Distinctness of the Eigenvalues of a Quadratic form in a Multivariate Sample. Ann. Statist. 1973, 1, 763–765.
44. Petersen, K.B.; Pedersen, M.S. The Matrix Cookbook; Technical University of Denmark: Lyngby, Denmark, 2012. Available online: http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html (accessed on 21 July 2021).
45. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
46. Chewi, S.; Le Gouic, T.; Lu, C.; Maunu, T.; Rigollet, P.; Stromme, A. Exponential ergodicity of mirror-Langevin diffusions. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 19573–19585.
47. Bolley, F.; Villani, C. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des Sciences de Toulouse: Mathématiques; Université Paul Sabatier: Toulouse, France, 2005; Volume 14, pp. 331–352.
48. Donsker, M.D.; Varadhan, S.S. Asymptotic evaluation of certain Markov process expectations for large time. IV. Commun. Pure Appl. Math. 1983, 36, 183–212.
49. Carlen, E.; Loss, M. Logarithmic Sobolev inequalities and spectral gaps. Contemp. Math. 2004, 353, 53–60.
Figure 1. Double-potential example: Poincaré constant is related to the eigenvalue at x.
Figure 2. Eigenvalue changes (averaged over ten trials).
Figure 3. Convergence behavior of toy data in MMD (averaged over ten trials).
Figure 4. Final performances of LDA under different values of α (averaged over ten trials).
Figure 5. LDA experiments (averaged over ten trials).
Figure 6. MNIST classification (averaged over ten trials).
Table 1. Benchmark results on test RMSE for regression task.

| Dataset | SVGD | SPOS | SGLD | Skew-SGLD | SGHMC | Skew-SGHMC |
|---|---|---|---|---|---|---|
| Concrete | 5.709 ± 0.040 | 5.239 ± 0.199 | 5.009 ± 0.091 | 4.973 ± 0.057 | 4.949 ± 0.144 | 4.790 ± 0.081 |
| Kin8nm | 0.0731 ± 0.0006 | 0.0688 ± 0.0003 | 0.0693 ± 0.0006 | 0.0689 ± 0.0005 | 0.0687 ± 0.0001 | 0.0683 ± 0.0003 |
| Energy | 0.520 ± 0.060 | 0.456 ± 0.030 | 0.428 ± 0.045 | 0.412 ± 0.045 | 0.406 ± 0.019 | 0.403 ± 0.008 |
| Bostonhousing | 3.306 ± 0.005 | 3.107 ± 0.173 | 2.948 ± 0.084 | 2.930 ± 0.095 | 3.053 ± 0.093 | 2.986 ± 0.143 |
| Winequality | 0.619 ± 0.001 | 0.618 ± 0.007 | 0.641 ± 0.003 | 0.634 ± 0.004 | 0.614 ± 0.004 | 0.613 ± 0.004 |
| PowerPlant | 4.219 ± 0.012 | 4.160 ± 0.009 | 4.129 ± 0.002 | 4.118 ± 0.006 | 4.112 ± 0.009 | 4.105 ± 0.008 |
| Yacht | 0.475 ± 0.049 | 0.467 ± 0.110 | 0.464 ± 0.058 | 0.442 ± 0.046 | 0.464 ± 0.078 | 0.432 ± 0.051 |

Table 2. Benchmark results on test negative log likelihood for regression task.

| Dataset | SVGD | SPOS | SGLD | Skew-SGLD | SGHMC | Skew-SGHMC |
|---|---|---|---|---|---|---|
| Concrete | −3.157 ± 0.008 | −3.124 ± 0.025 | −3.052 ± 0.009 | −3.049 ± 0.012 | −3.046 ± 0.025 | −3.033 ± 0.021 |
| Kin8nm | 1.153 ± 0.0084 | 1.212 ± 0.008 | 1.223 ± 0.002 | 1.223 ± 0.005 | 1.230 ± 0.0015 | 1.235 ± 0.0025 |
| Energy | −0.816 ± 0.102 | −0.976 ± 0.079 | −0.867 ± 0.056 | −0.845 ± 0.021 | −0.843 ± 0.045 | −0.844 ± 0.041 |
| Bostonhousing | −2.98 ± 0.000 | −2.644 ± 0.027 | −2.548 ± 0.016 | −2.539 ± 0.002 | −2.574 ± 0.019 | −2.561 ± 0.017 |
| Winequality | −1.012 ± 0.000 | −0.959 ± 0.007 | −0.976 ± 0.006 | −0.968 ± 0.005 | −0.941 ± 0.007 | −0.938 ± 0.005 |
| PowerPlant | −2.871 ± 0.004 | −2.850 ± 0.004 | −2.844 ± 0.002 | −2.842 ± 0.001 | −2.838 ± 0.004 | −2.835 ± 0.003 |
| Yacht | −1.184 ± 0.06 | −1.372 ± 0.07 | −1.077 ± 0.066 | −1.078 ± 0.030 | −1.083 ± 0.030 | −1.079 ± 0.051 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
