Accelerated Diffusion-Based Sampling by the Non-Reversible Dynamics with Skew-Symmetric Matrices

Langevin dynamics (LD) has been studied extensively, both theoretically and practically, as a basic sampling technique. Recently, the incorporation of non-reversible dynamics into LD has attracted attention because it accelerates the mixing speed of LD. Popular choices of non-reversible dynamics include underdamped Langevin dynamics (ULD), which uses second-order dynamics, and perturbation with skew-symmetric matrices. Although ULD has been widely used in practice, the application of skew acceleration is limited, even though it is theoretically expected to show superior performance. Current work lacks a theoretical understanding of issues that are important to practitioners, including the selection criteria for skew-symmetric matrices, quantitative evaluations of acceleration, and the large memory cost of storing skew matrices. In this study, we theoretically and numerically clarify these problems by analyzing acceleration, focusing on how the skew-symmetric matrix perturbs the Hessian matrix of potential functions. We also present a practical algorithm that accelerates standard LD and ULD, which uses novel memory-efficient skew-symmetric matrices under parallel-chain Monte Carlo settings.


Introduction
Sampling is one of the most widely used techniques for approximating posterior distributions in Bayesian inference [1]. Markov chain Monte Carlo (MCMC) is widely used to obtain samples. Within MCMC, Langevin dynamics (LD) is a popular choice for sampling from high-dimensional distributions. Each sample in LD moves along the gradient direction with added Gaussian noise. LD efficiently explores the neighborhood of a mode of a target distribution using gradient information without being trapped by local minima, thanks to the added Gaussian noise. Many previous studies have demonstrated LD's superior performance theoretically and numerically [2][3][4][5]. Since non-reversible dynamics generally improves mixing performance [6,7], research on introducing non-reversible dynamics into LD for better sampling performance has attracted attention [8].
There are two widely known non-reversible dynamics for LD. One is underdamped Langevin dynamics (ULD) [9], which uses second-order dynamics. The other introduces a perturbation that multiplies the gradient by a skew-symmetric matrix [8].
Here, we refer to such matrices as skew matrices for simplicity and to this perturbation technique as skew acceleration. Much theoretical research has been done on ULD [9][10][11], and ULD is widely used in practice, where it is also known as stochastic gradient Hamiltonian Monte Carlo [12]. In contrast, the application of skew acceleration to standard Bayesian models is quite limited, even though it is theoretically expected to show superior performance [8].
For example, skew acceleration has been analyzed focusing on sampling from Gaussian distributions [13][14][15][16][17], although assuming Gaussian distributions in Bayesian models is restrictive in practice. A recent study [8] theoretically showed that skew acceleration accelerates the dynamics around the local minima and saddle points for non-convex functions. Another work [18] clarified that the skew acceleration theoretically and numerically improves mixing speed when used as interactions between chains in parallel sampling schemes for non-convex Bayesian models.
Compared to ULD, what is lacking for skew acceleration is a theoretical understanding of issues that are important to practitioners. The most significant problem is that no theory exists for selecting skew matrices. In existing studies, introducing a skew matrix into LD results in equal or faster convergence, meaning that a bad choice of skew matrix yields no acceleration at all. Thus, choosing appropriate skew matrices is critical. Furthermore, although ULD's acceleration has been analyzed quantitatively, existing studies have only analyzed skew acceleration qualitatively, which makes it difficult to justify the usefulness of skew acceleration in practice compared to ULD. Another issue is that introducing skew matrices incurs a vast memory cost in many practical Bayesian models.
The purpose of this study is to solve these problems from theoretical and numerical viewpoints and establish a practical algorithm for skew acceleration. The following are the two major contributions of this work.
Our contribution 1: We present a convergence analysis of skew acceleration for standard Bayesian model settings, including non-convex potential functions using Poincaré constants [19]. The major advantage of Poincaré constants is that we can analyze skew acceleration through a Hessian matrix and its eigenvalues and develop a practical theory about the selection of J and the quantitative assessment of skew acceleration.
Furthermore, we propose skew acceleration for ULD and present convergence analysis for the first time. Since ULD shows faster convergence than LD, combining skew acceleration with ULD is promising.
Our contribution 2: We develop a practical skew-accelerated sampling algorithm for a parallel sampling setting with novel memory-efficient skew matrices. Since a naive implementation of skew acceleration requires a large memory cost to store skew matrices, memory efficiency is critical in practice. We also present a non-asymptotic theoretical analysis of our algorithm in both LD and ULD settings under a stochastic gradient and Euler discretization. We clarify that introducing skew matrices accelerates the convergence of the continuous dynamics, although it increases the discretization and stochastic gradient error. To the best of our knowledge, we propose the first algorithm that adaptively controls this trade-off, using the empirical distribution of the parallel sampling scheme.
Finally, we verify our algorithm and theory in practical Bayesian problems and compare it with other sampling methods.
Notations: I_d denotes the d × d identity matrix. Capital letters such as X represent random variables, and lowercase letters such as x represent non-random real values. ⟨·, ·⟩ and |·| denote the Euclidean inner product and the Euclidean norm (or absolute value), respectively.

Preliminaries
In this section, we briefly introduce the basic settings of LD and non-reversible dynamics for the posterior distribution sampling in Bayesian inference.

LD and Stochastic Gradient LD
First, we introduce the notation and the basic settings of LD and stochastic gradient LD (SGLD), which is a practical extension of LD. Here, z_i denotes a data point in space Z, |Z| denotes the total number of data points, and x ∈ R^d corresponds to the parameters of a given model, which we want to sample. Our goal is to sample from the target distribution with density dπ(x) ∝ e^{−βU(x)}dx, where the potential function U(x) = Σ_{i=1}^{|Z|} u(x, z_i). Function u(·, ·) is continuous and non-convex; the explicit assumptions made for it are discussed in Section 3.1. The SGLD algorithm [2,3] is given by the recursion

X_{k+1} = X_k − h∇Û(X_k) + √(2hβ^{−1}) ε_k,  (1)

where h ∈ R_+ is a step size, ε_k ∈ R^d is a standard Gaussian random vector, β is a temperature parameter of π, and ∇Û(X_k) is a conditionally unbiased estimator of the true gradient ∇U(X_k). This unbiased estimate of the true gradient is suitable for large-scale data sets, since instead of the full gradient we can use a stochastic version obtained from a randomly chosen subset of the data at each time step, which reduces the computational cost of each gradient evaluation. The discrete-time Markov process in Equation (1) is a discretization of the continuous-time LD [2]:

dX_t = −∇U(X_t)dt + √(2β^{−1}) dw_t,  (2)

where w_t denotes the standard Brownian motion in R^d. The stationary measure of Equation (2) is dπ(x) ∝ e^{−βU(x)}dx.
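As an illustration, the SGLD recursion above can be sketched in a few lines; the step size, inverse temperature, and standard-Gaussian target below are illustrative choices, and `grad_fn` stands in for the (stochastic) gradient ∇Û.

```python
import numpy as np

def sgld_step(x, grad_fn, h, beta, rng):
    """One SGLD update: X <- X - h * grad(X) + sqrt(2h/beta) * standard normal."""
    noise = np.sqrt(2.0 * h / beta) * rng.standard_normal(x.shape)
    return x - h * grad_fn(x) + noise

# Illustrative target: standard Gaussian, U(x) = |x|^2 / 2, so grad U(x) = x.
rng = np.random.default_rng(0)
x = np.full(2, 5.0)
for _ in range(5000):
    x = sgld_step(x, lambda v: v, h=1e-2, beta=1.0, rng=rng)
# After burn-in the iterate fluctuates near the mode at the origin.
```

Starting far from the mode, the gradient term pulls the iterate toward the origin while the injected noise keeps it exploring the target's bulk.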

Poincaré Inequality and Convergence Speed
In sampling, we are interested in the convergence speed to the stationary measure. This speed is often characterized by the generator associated with Equation (2), defined as

L f := −⟨∇U, ∇f⟩ + β^{−1}∆f,

where ∆ denotes the standard Laplacian on R^d, f ∈ D(L), and D(L) ⊂ L²(π) denotes the domain of L. The operator −L is self-adjoint and has only a discrete spectrum (eigenvalues). π with L has a spectral gap if the smallest eigenvalue of −L (other than 0) is positive; we refer to it as ρ_0 (> 0). This spectral gap is closely related to the Poincaré inequality. The internal energy is defined as

E(f) := −∫ f L f dπ = β^{−1} ∫ |∇f|² dπ.

Please note that E(f) ≥ 0. Then π with L satisfies the Poincaré inequality with constant c if, for any f ∈ D(L),

∫ (f − ∫ f dπ)² dπ ≤ c E(f).

The spectral gap characterizes this constant: c ≤ 1/ρ_0 holds (see Appendix A.2 for details). We refer to the best constant c as the Poincaré constant [19]. For notational simplicity, we define m_0 := 1/c and also refer to this m_0 as the Poincaré constant. Crucially for sampling, the Poincaré inequality dominates the convergence speed in χ² divergence:

χ²(μ_t∥π) ≤ e^{−2m_0 t} χ²(μ_0∥π),

where μ_t denotes the measure at time t induced by Equation (2) and μ_0 is the initial measure (see Appendix A.3 for details). Thus, the larger the Poincaré constant m_0 is, the faster the convergence.

Non-Reversible Dynamics
In this section, we introduce non-reversible dynamics. π with L is reversible if, for any test functions f, g ∈ D(L),

∫ f L g dπ = ∫ g L f dπ.

If this is not satisfied, π with L is non-reversible [19]. We introduce two non-reversible dynamics for LD. The first is ULD, a second-order dynamics in which the state X_t is augmented with an auxiliary random variable V ∈ R^d; γ ∈ R_+ is a positive friction constant and Σ is the variance of the stationary distribution of the auxiliary variable V. The stationary distribution is π̃ := π ⊗ N(0, Σ) ∝ e^{−βU(x) − (1/2)v^⊤Σ^{−1}v}, where N denotes a Gaussian distribution. The superior performance of ULD compared with LD has been studied rigorously [9][10][11], and ULD's convergence speed is also characterized by the Poincaré constant [20]. In practice, ULD is used with discretization and the stochastic gradient, which is called stochastic gradient Hamiltonian Monte Carlo (SGHMC) [10]. The second non-reversible dynamics is skew acceleration, given as

dX_t = −(I_d + αJ)∇U(X_t)dt + √(2β^{−1}) dw_t,  (9)

where J is a real skew matrix and α ∈ R_+ is a positive constant. We call this dynamics S-LD. The stationary distribution of S-LD is still π, and S-LD shows faster convergence and smaller asymptotic variance [13][14][15][18].
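The S-LD drift simply multiplies the gradient by (I + αJ) with J skew-symmetric, which leaves the stationary distribution unchanged. Below is a minimal Euler sketch with a standard Gaussian target and a randomly chosen J; all names and constants are illustrative, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
J = 0.5 * (M - M.T)                  # skew-symmetric: J.T == -J

def sld_step(x, grad_fn, J, alpha, h, beta, rng):
    """One Euler step of S-LD: X <- X - h*(I + alpha*J) grad U(X) + noise."""
    drift = (np.eye(len(x)) + alpha * J) @ grad_fn(x)
    noise = np.sqrt(2.0 * h / beta) * rng.standard_normal(len(x))
    return x - h * drift + noise

# Standard Gaussian target: grad U(x) = x; the skew term leaves pi unchanged.
x = np.full(d, 5.0)
for _ in range(3000):
    x = sld_step(x, lambda v: v, J, alpha=0.5, h=1e-2, beta=1.0, rng=rng)
```

The skew term adds a rotational component to the drift; because J∇U is divergence-free with respect to π, the invariant measure is the same as for plain LD.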

Theoretical Analysis of Skew Acceleration
In this section, we present a theoretical analysis of skew acceleration in LD and ULD in standard Bayesian settings. We analyze acceleration through the Poincaré constant and connect it with the eigenvalues of the Hessian matrix, which allows us to obtain a practical criterion for choosing skew matrices and to evaluate the acceleration quantitatively. In this section, we focus on the setting where the continuous SDE and the full gradient of the potential function are used; the discretized SDE and the stochastic gradient are discussed in Section 4.

Acceleration Characterization by the Poincaré Constant
First, we introduce the same four assumptions as a previous work [2], which established the existence of the Poincaré constant m_0 for LD (see Appendix C for details).

Assumption 1. (Upper bound of the potential function at the origin) Function u takes nonnegative real values and is twice continuously differentiable on R^d, and constants A and B exist such that for all z ∈ Z, |u(0, z)| ≤ A and |∇u(0, z)| ≤ B.

Assumption 2. (Smoothness) A constant M > 0 exists such that for all x, y ∈ R^d and all z ∈ Z, |∇u(x, z) − ∇u(y, z)| ≤ M|x − y|.

Assumption 3. (Dissipative condition) Function u satisfies the (m, b)-dissipative condition for all z ∈ Z; that is, m > 0 and b ≥ 0 exist such that for all x ∈ R^d, ⟨x, ∇u(x, z)⟩ ≥ m|x|² − b.

Assumption 4. (Initial condition) The initial probability distribution μ_0 of X_0 has a bounded and strictly positive density p_0 with finite exponential moments.

Please note that these assumptions allow us to consider non-convex potential functions, which are common in practical Bayesian models. Furthermore, we make the following assumption about J.

Assumption 5. The operator norm of J is bounded: ∥J∥ ≤ 1. This means that the largest singular value of J does not exceed 1.
Under these assumptions, we present the convergence behavior of skew acceleration using the Poincaré constant. First, we present the following S-LD result.

Theorem 1. Under Assumptions 1-5, the S-LD of Equation (9) has exponential convergence:

χ²(μ^α_t∥π) ≤ e^{−2m(α)t} χ²(μ_0∥π),

where μ^α_t is the measure at time t induced by S-LD and m(α) is the Poincaré constant of S-LD defined through its generator

L_α f := −⟨(I_d + αJ)∇U, ∇f⟩ + β^{−1}∆f.  (16)

Furthermore, m(α) satisfies m(α) ≥ m_0.
The proof is shown in Appendix C. This theorem states that introducing skew matrices accelerates the convergence of LD by improving the convergence rate from m_0 to m(α). Although [18] obtained a similar result, we use the Poincaré constant and derive an explicit criterion for when m(α) = m_0 holds, as discussed below.
Next, we introduce skew acceleration in ULD. Since ULD shows faster convergence than LD in standard Bayesian settings [10,11], combining skew acceleration with ULD is a promising route to a more efficient sampling algorithm. For that purpose, we propose an SDE that perturbs both the position and momentum updates of ULD with skew matrices, where J_1 and J_2 are real skew matrices satisfying Assumption 5 and α_1 and α_2 are positive constants. We refer to this method as skew underdamped Langevin dynamics (S-ULD); its stationary distribution is π̃ = π ⊗ N(0, Σ) ∝ e^{−βU(x) − (1/2)v^⊤Σ^{−1}v}. See Appendix B for details, including discussions of other combinations of skew matrices. For S-ULD, we need an additional assumption about the initial condition of V_0:

Assumption 6. (Initial condition) The initial probability distribution μ_0(x, v) of (X_0, V_0) has a bounded and strictly positive density p_0 with finite exponential moments.

We then provide the following convergence theorem, which resembles that of S-LD.

Theorem 2.
Under Assumptions 1-3, 5, and 6, S-ULD has exponential convergence in χ² divergence, and its convergence rate is also characterized by m(α) as defined in Theorem 1. S-ULD's convergence thus equals or exceeds that of ULD, whose convergence rate is characterized by m_0.
See Appendix C.2 for details. From these theorems, we confirm that skew acceleration is effective for both S-LD and S-ULD, and that the convergence speed is characterized by the Poincaré constant m(α) defined by Equation (16).

Skew Acceleration from the Hessian Matrix
Our goal is to clarify which choices of J induce m(α) > m_0, i.e., acceleration. Therefore, we discuss how the Poincaré constant m(α) is connected to the eigenvalues and eigenvectors of the perturbed Hessian matrix (I + αJ)∇²U(x). We first introduce the notation. We express the Hessian of U(x) as H(x) and the perturbed Hessian matrix as H̃(x) := (I + αJ)H(x). Please note that H is a real symmetric matrix, which has real eigenvalues and is diagonalizable. On the other hand, since H̃ is not symmetric, it may have complex eigenvalues, and its diagonalizability is not assured (see Appendix E). We express the eigenvalues of H̃(x) as λ^α_i(x) and order them as Re(λ^α_1(x)) ≤ ··· ≤ Re(λ^α_d(x)), where Re(·) and Im(·) denote the real and imaginary parts of a complex value. We express the eigenvalues of H(x) as λ^0_i(x) and order them as λ^0_1(x) ≤ ··· ≤ λ^0_d(x).

Strongly Convex Potential Function
Assume that U is an m-strongly convex function, so that m ≤ λ^0_1(x) holds for all x ∈ R^d. The Poincaré constant m_0 of LD then satisfies m_0 = m [19]. For skew acceleration, the Poincaré constant satisfies m(α) = m′(α), where m′(α) is the best constant satisfying m′(α) ≤ Re(λ^α_1(x)) for all x (see Appendix D.1). Therefore, studying the Poincaré constant is equivalent to studying the smallest real part of the eigenvalues of the perturbed Hessian matrix, and the relation between λ^0_1(x) and Re(λ^α_1(x)) must be studied. The following theorem describes how the skew matrices change the smallest eigenvalue.

Theorem 3.
For all x ∈ R^d, the real parts of the eigenvalues of H̃(x) satisfy

λ^0_1(x) ≤ Re(λ^α_1(x)) and Re(λ^α_d(x)) ≤ λ^0_d(x).

The condition under which λ^0_1(x) = Re(λ^α_1(x)) holds is shown in Remark 1.
Refer to Appendix F for the proof. This is an extension of previous work [8,13].
If λ^0_1(x) < Re(λ^α_1(x)) is satisfied for all x, we have m_0 < m(α), i.e., acceleration occurs. We discuss how to construct J such that λ^0_1(x) < Re(λ^α_1(x)) holds in Section 3.3.
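The bounds of Theorem 3 are easy to probe numerically: for a symmetric positive definite H and a random skew J, the smallest real part of the eigenvalues of (I + αJ)H never falls below the smallest eigenvalue of H, and the largest real part never rises above the largest. A small check with arbitrary, illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d))
H = A @ A.T + 0.1 * np.eye(d)       # symmetric positive definite "Hessian"
M = rng.standard_normal((d, d))
J = 0.5 * (M - M.T)                 # skew-symmetric perturbation

alpha = 0.3
lam0 = np.linalg.eigvalsh(H)                              # real spectrum of H
lam_a = np.linalg.eigvals((np.eye(d) + alpha * J) @ H)    # complex in general
# Smallest real part moves up, largest real part moves down (Theorem 3).
```

Repeating this with other seeds or values of α keeps the real parts of the perturbed spectrum inside the interval [λ^0_1, λ^0_d].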

Non-Convex Potential Function
The previous work [21] clarified that the Poincaré constant of a non-convex function is characterized by the negative eigenvalue at a saddle point. As shown in Figure 1, let x_1 denote the global minimum and x_2 the local minimum with the second smallest value of U(x). We write x* for the saddle point with index one between x_1 and x_2, i.e., the point at which H(x*) has exactly one negative eigenvalue, λ^0_1(x*) < 0. Ref. [21] clarified that this negative eigenvalue at the saddle point x* characterizes the Poincaré constant. When skew matrices are introduced, [8] clarified that λ^α_1(x*) ≤ λ^0_1(x*) < 0 and that λ^α_1(x*) is not a complex number. Thus, skew acceleration makes the negative eigenvalue more negative, which leads to a larger Poincaré constant (see Appendix D.2) and results in faster convergence. In conclusion, introducing the skew matrix changes the Hessian's eigenvalues and increases the Poincaré constant. If λ^0_1(x) ≠ Re(λ^α_1(x)) is satisfied, this leads to faster convergence for both convex and non-convex potential functions.
The first condition (i) is easily satisfied if we choose J such that Ker J = {0}. On the other hand, the second condition (ii) is difficult to verify, since H̃ and its eigenvalues and eigenvectors generally depend on the current position X_t. Instead of evaluating the eigenvalues and eigenvectors of H and H̃ directly, we use the random matrix property shown in the next theorem.

Theorem 5.
Suppose the upper-triangular entries of J follow a probability distribution that is absolutely continuous with respect to the Lebesgue measure. If Ker J = {0} is satisfied, then for a given point x ∈ R^d, λ^0_1(x) ≠ Re(λ^α_1(x)) holds with probability 1.
The proof is given in Appendix G.1. Following this theorem, we simply generate J from some probability distribution, such as the Gaussian distribution, and then check whether Ker J = {0} holds. If it does not, we generate a random matrix J again.
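A minimal sketch of this sampling procedure follows; the Gaussian draws and the dimension are illustrative. Note that a real skew matrix of odd dimension is always singular, so Ker J = {0} can only hold for even dimensions.

```python
import numpy as np

def random_skew(d, rng):
    """Draw strictly-upper-triangular entries i.i.d. Gaussian; reflect so J.T == -J."""
    M = np.triu(rng.standard_normal((d, d)), k=1)
    return M - M.T

def has_trivial_kernel(J, tol=1e-10):
    # Ker J = {0} iff J has full rank; a real skew matrix of odd size is
    # always singular, so this can only succeed for even dimensions.
    return np.linalg.matrix_rank(J, tol=tol) == J.shape[0]

rng = np.random.default_rng(3)
J = random_skew(4, rng)
while not has_trivial_kernel(J):    # resample on the measure-zero failure event
    J = random_skew(4, rng)
```

In practice the resampling loop almost never triggers, since a Gaussian skew matrix of even dimension is singular only on a set of measure zero.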
The above theorem is valid only at a given evaluation point x. We can extend the above theorem to all the points over the path of the discretized dynamics (see Appendix G.3). With this procedure, we can theoretically ensure that acceleration occurs with probability one for discretized dynamics.

Quantitative Evaluation of the Acceleration
So far, we have discussed skew acceleration qualitatively but not quantitatively. Although a quantitative evaluation of the acceleration is critical for practical purposes, to the best of our knowledge, no existing work has addressed it. In this section, we present a formula that quantitatively assesses skew acceleration by analyzing the eigenvalues of the Hessian matrix. Theorem 6, with the identical notation as in Theorem 3, gives an asymptotic expansion of Re(λ^α_1(x)) for all x (Equation (22)) and, in particular, of λ^α_1(x*) at the saddle point x* (Equation (23)). The proofs are shown in Appendix H. Focusing on Equation (22), if U(x) is a strongly convex function, then since λ^0_k(x) > λ^0_1(x) > 0 holds for all k > 1, the second term in Equation (22) is positive, and hence Re(λ^α_1(x)) > λ^0_1(x) holds. A similar relation holds for Re(λ^α_d(x)). In Equation (23), λ^α_1(x*) < λ^0_1(x*) < 0 holds. In both cases the leading correction is of order α², so the change of the Poincaré constant is proportional to α². With these formulas, we can quantitatively evaluate the acceleration. We present numerical experiments confirming our theoretical findings in Section 6.1.
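The quadratic dependence on α can be verified numerically: halving α should roughly quarter the increase of the smallest real eigenvalue. A sketch with an arbitrary, illustrative strongly convex Hessian:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)              # Hessian of a strongly convex potential
M = rng.standard_normal((d, d))
J = 0.5 * (M - M.T)
lam0_min = np.linalg.eigvalsh(H).min()

def shift(alpha):
    """Increase of the smallest real eigenvalue under the skew perturbation."""
    lam = np.linalg.eigvals((np.eye(d) + alpha * J) @ H)
    return lam.real.min() - lam0_min

# An O(alpha^2) law predicts that halving alpha roughly quarters the shift.
ratio = shift(2e-2) / shift(1e-2)
```

For small α the computed ratio sits close to 4, in line with the α² expansion; for larger α, higher-order terms start to matter.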

Practical Algorithm for Skew Acceleration
In this section, we discuss skew acceleration in more practical settings than in Section 3. First, we discuss the memory issue of storing J; we then discuss the discretization of the SDE and the stochastic gradient, which are widely used techniques in Bayesian inference. Finally, we present a practical algorithm for skew acceleration.

Memory Issue of Skew Acceleration and Ensemble Sampling
For d-dimensional Bayesian models, we need O(d²) memory to store a skew matrix J, which is prohibitive for high-dimensional models. Instead of storing J, we could randomly generate a new J at each time step following Theorem 5. However, we experimentally confirmed that using a different J at each step does not accelerate the convergence (see Section 6). Thus, we need to use a fixed J during the iterations.
As discussed below, we found that the previously proposed accelerated parallel sampling [18] can serve as a practical algorithm that resolves this memory issue. In that method, N samples of the model's parameters are updated simultaneously with correlation. Because a correlation exists among the multiple Markov chains in such a parallel sampling scheme, it is more efficient than naive parallel-chain MCMC, where the chains are independent.
We express the n-th sample at time t as X^{(n)}_t and write x^{⊗N} := (x^{(1)}, …, x^{(N)}) ∈ R^{dN}. We express the joint stationary measure as π^{⊗N} and the sum of the potential functions as U^{⊗N}(x^{⊗N}) := Σ_{n=1}^N U(x^{(n)}). We then consider the following dynamics:

dX^{⊗N}_t = −(I_{dN} + αJ)∇U^{⊗N}(X^{⊗N}_t)dt + √(2β^{−1}) dw_t,

which we call skew parallel LD (S-PLD); that is, N independent parallel LD (PLD) chains are coupled through the skew matrix. Since each chain in PLD is independent of the others, the Poincaré constant of PLD is also m_0. Ref. [18] argued that the Poincaré constant of S-PLD, m(α, N), satisfies m(α, N) ≥ m_0, which means that S-PLD converges faster than PLD. As discussed in Section 3.2, these Poincaré constants are characterized by the smallest eigenvalues of the Hessian matrix ∇²U^{⊗N}(x^{⊗N}) and of the perturbed matrix (I_{dN} + αJ)∇²U^{⊗N}(x^{⊗N}), which we denote λ^0_1(x^{⊗N}) and Re(λ^α_1(x^{⊗N})), respectively. As discussed in Section 3.2, acceleration occurs if λ^0_1(x^{⊗N}) ≠ Re(λ^α_1(x^{⊗N})) is satisfied. The work [18] did not specify a choice of J, and a naive construction of J requires O(d²N²) memory. To reduce the memory cost, we propose the following skew matrix:

J = J_0 ⊗ I_d,

where J_0 is an N × N skew matrix and ⊗ is the Kronecker product. We then have Lemma 1 (see Appendix G.2 for the statement and proof). Thus, from this lemma, we only need to prepare and store J_0, which requires O(N²) memory, independent of d. In practical settings, this is a significant reduction of the memory size, since the number of parallel chains is usually much smaller than the dimension of the model. Please note that we can still ensure the acceleration with this J.
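The Kronecker structure J_0 ⊗ I_d means the full dN × dN matrix never has to be formed: applying it to the stacked gradients reduces to multiplying the N × d matrix of per-chain gradients by J_0. A small consistency check (shapes and seeds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 4, 3
M = rng.standard_normal((N, N))
J0 = 0.5 * (M - M.T)                 # N x N skew matrix: O(N^2) memory

G = rng.standard_normal((N, d))      # stacked gradients, one row per chain

# Applying J = J0 kron I_d to the vec-stacked gradients is just J0 @ G.
out_cheap = J0 @ G                                 # no dN x dN matrix needed
J_full = np.kron(J0, np.eye(d))                    # built only for the check
out_full = (J_full @ G.reshape(-1)).reshape(N, d)
```

The cheap path costs O(N²d) flops and O(N²) extra memory, versus O(d²N²) memory for the dense matrix; the dense construction here exists only to verify the identity.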

Lemma 2.
Under Assumptions 1-5, assume J satisfies the condition of Lemma 1. Then S-PLD shows

χ²(μ^{α,⊗N}_t∥π^{⊗N}) ≤ e^{−2m(α,N)t} χ²(μ^{⊗N}_0∥π^{⊗N}),

where μ^{α,⊗N}_t is the measure at time t induced by S-PLD and μ^{⊗N}_0 is the initial measure, defined as the product measure of μ_0. See Appendix I.1 for the proofs. Thus, combined with Lemma 1, S-PLD converges faster than PLD. We also consider the ensemble version of ULD (parallel ULD, PULD) and its skew-accelerated version, in which J_1, J_2 ∈ R^{dN×dN} are real skew-symmetric matrices and α_1, α_2 ∈ R_+ are positive constants. We refer to this dynamics as skew PULD (S-PULD); its faster convergence can be assured in the same way as in Lemma 2, as shown in Appendix I.2.

Discussion of the Discretization of SDE and Stochastic Gradient and Practical Algorithm
In this section, we further consider practical settings for S-PLD and S-PULD. We discretize these continuous dynamics, e.g., by the Euler-Maruyama method, and approximate the gradient by the stochastic gradient. Although introducing skew matrices accelerates the convergence of continuous dynamics, it simultaneously increases the discretization and stochastic gradient error, resulting in a trade-off. We present a practical algorithm that controls this trade-off.

Trade-Off Caused by Discretization and Stochastic Gradient
We consider the following discretization and stochastic gradient for S-PLD:

X^{⊗N}_{k+1} = X^{⊗N}_k − h(I_{dN} + αJ)∇Û^{⊗N}(X^{⊗N}_k) + √(2hβ^{−1}) ε_k,  (29)

and the corresponding discretization of S-PULD, Equation (30), where ε_k ∈ R^{dN} is a standard Gaussian random vector and ∇Û^{⊗N}(X^{⊗N}) is an unbiased estimator of the gradient ∇U^{⊗N}(X^{⊗N}). We refer to Equation (29) as skew-SGLD and to Equation (30) as skew-SGHMC. For skew-SGHMC, we drop J_2 of S-PULD to decrease the number of parameters, as shown in Appendix B. Please note that skew-SGLD is identical to the previous dynamics [18]. We introduce an assumption about the stochastic gradient:

Assumption 7. (Stochastic gradient) The stochastic gradient is conditionally unbiased, and its variance is bounded by a constant δ (see Appendix J for the precise statement).

Given a test function f that is L_f-Lipschitz, we approximate ∫ f dπ by skew-SGLD or skew-SGHMC, with the empirical average over the iterates and chains as the estimator. The bias of skew-SGLD is upper bounded as follows.

Theorem 7. Under Assumptions 1-7, for any k ∈ N and any h ∈ (0, 1 ∧ m/(4M²)) obeying kh ≥ 1 and βm ≥ 2, the bias obeys the bound of Equation (32), which is the sum of a discretization and stochastic gradient error term (i) with coefficient C_1 and an exponentially decaying convergence term (ii) with coefficient C_2, where C_1 and C_2 depend on the constants of Assumptions 1-7; see Appendix J for the details.
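A sketch of a skew-SGLD step in this parallel setting, exploiting the memory-efficient J = J_0 ⊗ I_d from Section 4.1; the target, step size, and α below are illustrative, and `grad_fn` plays the role of the stochastic gradient.

```python
import numpy as np

def skew_sgld_step(X, grad_fn, J0, alpha, h, beta, rng):
    """One skew-SGLD step on N parallel chains; X has shape (N, d).

    X <- X - h * (I + alpha * (J0 kron I_d)) grad U(X) + sqrt(2h/beta) * noise,
    applied via the Kronecker structure so no dN x dN matrix is formed.
    """
    G = np.stack([grad_fn(x) for x in X])      # per-chain gradients, shape (N, d)
    drift = G + alpha * (J0 @ G)               # (I + alpha * J) applied cheaply
    noise = np.sqrt(2.0 * h / beta) * rng.standard_normal(X.shape)
    return X - h * drift + noise

rng = np.random.default_rng(6)
N, d = 8, 2
M = rng.standard_normal((N, N))
J0 = 0.5 * (M - M.T)                           # memory-efficient N x N skew matrix
X = rng.standard_normal((N, d)) + 5.0          # chains start far from the mode
for _ in range(3000):
    X = skew_sgld_step(X, lambda v: v, J0, alpha=0.1, h=1e-2, beta=1.0, rng=rng)
```

With a standard Gaussian target (grad U(x) = x), the coupled chains drift from their offset initialization back toward the origin while the skew term mixes information across chains.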
We present a tighter bias bound in Section 4.3 under a stronger assumption. A similar upper bound can be shown for skew-SGHMC using the same proof strategy. This bound resembles a previous one [18], but ours shows improved dependency on kh. The previous results of [18] are also limited to LD and do not include skew-SGHMC.
Please note that (i) corresponds to the discretization and stochastic gradient error and (ii) corresponds to the convergence behavior of S-PLD, i.e., of the continuous dynamics. Since C_1(α) ≥ C_1(α = 0), skew acceleration increases the discretization and stochastic gradient error. On the other hand, since m(α, N) ≥ m_0, the convergence of the continuous dynamics is accelerated. Thus, skew acceleration causes a trade-off. When α is sufficiently small, we can derive the explicit dependency of this trade-off on α from an asymptotic expansion. Using the quantitative evaluation of skew acceleration in Theorem 6, we obtain the expansion of Equation (33), where d_0 to d_2 are positive constants obtained by the asymptotic expansion; see Appendix K for the details. In that expression, (i) and (ii) correspond to (i) and (ii) of Equation (32). Thus, by choosing an appropriate α, we can control the trade-off.

Practical Algorithm Controlling the Trade-Off
Since calculating the optimal α that minimizes Equation (33) at each step is computationally demanding, we adaptively tune the value of α by measuring the acceleration with the kernelized Stein discrepancy (KSD) [22]. Our idea is to update samples under the two interaction strengths α and α + η and to compare the KSD between the stationary distribution and the empirical distributions obtained under these two strengths. Here, η ∈ R_+ is a small increment of α. We denote the samples at the (k + 1)-th step obtained by Equation (29) as X^{⊗N}_{k+1,α}, and the samples obtained by replacing α with α + η as X^{⊗N}_{k+1,α+η}. We denote the KSD between the measure of X^{⊗N}_{k+1,α} and the stationary measure π as KSD(k + 1, α) and estimate the difference of the empirical KSDs:

∆ := KSD(k + 1, α) − KSD(k + 1, α + η),  (34)

where the KSD is estimated empirically with a kernel l, for which we use an RBF kernel. If ∆ > 0, the empirical distribution of X^{⊗N}_{k+1,α+η} is closer to the stationary distribution than that of X^{⊗N}_{k+1,α}; thus, we increase the interaction strength from α to α + η. If ∆ < 0, we decrease it to α − η and update η to cη, where c ∈ (0, 1]. The overall process is shown in Algorithm 1. Detailed discussions of the algorithm, including how to select α_0, η_0, and c, are given in Appendix L.
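A V-statistic estimator of the squared KSD with an RBF kernel can be written compactly; the bandwidth, sample sizes, and Gaussian example target below are illustrative choices, not the paper's exact estimator.

```python
import numpy as np

def ksd_rbf(X, score_fn, sigma=1.0):
    """V-statistic estimate of the squared KSD with an RBF kernel k(x, y).

    score_fn(x) must return grad log pi(x); for the target N(0, I_d) it is -x.
    """
    n, d = X.shape
    S = np.stack([score_fn(x) for x in X])           # score values, shape (n, d)
    diff = X[:, None, :] - X[None, :, :]             # diff[i, j] = x_i - x_j
    r2 = (diff ** 2).sum(-1)
    K = np.exp(-r2 / (2.0 * sigma ** 2))
    t1 = (S @ S.T) * K                                       # <s_i, s_j> k
    t2 = np.einsum('id,ijd->ij', S, diff) / sigma**2 * K     # <s_i, grad_y k>
    t3 = -np.einsum('jd,ijd->ij', S, diff) / sigma**2 * K    # <s_j, grad_x k>
    t4 = (d / sigma**2 - r2 / sigma**4) * K                  # trace(grad_x grad_y k)
    return float((t1 + t2 + t3 + t4).mean())

rng = np.random.default_rng(8)
near = rng.standard_normal((100, 2))     # samples roughly from the target N(0, I)
far = near + 3.0                         # the same cloud shifted off the target
score = lambda x: -x                     # score of N(0, I)
```

As expected, the shifted cloud yields a larger KSD than the on-target one, which is exactly the signal the adaptive rule compares across interaction strengths.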

Algorithm 1 Tuning α
1: Compute ∆ by Equation (34)
2: if ∆ > 0 then
3:   Update α_{k+1} = α_k + η_k
4:   Update η_{k+1} = η_k
5: else
6:   Update α_{k+1} = |α_k − η_k|
7:   Update η_{k+1} = cη_k
8: end if

Finally, we present Algorithm 2, which describes the whole process: we run skew-SGLD (or skew-SGHMC) and update the value of α once every k steps by Algorithm 1. Please note that its computational cost is not much larger than that of the underlying dynamics. We only calculate the eigenvalues of J once, which requires O(N³). The calculation of the different KSDs is computationally inexpensive, since we can re-use the gradient, which is the most computationally demanding part.
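The α-update rule itself is only a few lines; the KSD values passed in below are hypothetical readings used solely to exercise the two branches.

```python
def tune_alpha(alpha, eta, ksd_alpha, ksd_alpha_eta, c=0.9):
    """One step of the adaptive rule: compare empirical KSDs at alpha and alpha + eta.

    ksd_alpha and ksd_alpha_eta are KSD estimates for chains run with
    interaction strengths alpha and alpha + eta, respectively.
    """
    delta = ksd_alpha - ksd_alpha_eta     # delta > 0: alpha + eta is closer to pi
    if delta > 0:
        return alpha + eta, eta           # strengthen the interaction, keep eta
    return abs(alpha - eta), c * eta      # weaken the interaction, shrink eta

# Hypothetical KSD readings in which the stronger interaction looks better:
a_next, eta_next = tune_alpha(0.1, 0.05, ksd_alpha=0.8, ksd_alpha_eta=0.5)
```

The shrinking factor c damps the increment over time, so the interaction strength settles once neither direction improves the empirical KSD.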

Refined Analysis for the Bias of Skew-SGLD
When using a constant step size in skew-SGLD, the bound in Theorem 7 becomes vacuous, since the first term of Equation (32) diverges. Here, following [23], we present a tighter bound for the bias of skew-SGLD under a stronger assumption.
Theorem 8 and its proof are given in Appendix M. Please note that even if we use a constant step size for skew-SGLD, the bound in Theorem 8 does not diverge; in exchange, we need a stronger assumption on the step size than in Theorem 7. From Equation (37), the convergence behavior is characterized by λ(α, N), and the bias bound becomes smaller as λ(α, N) becomes larger. From the definition of λ(α, N), the larger m(α, N) is, the larger λ(α, N) we obtain. Thus, as we have seen so far, introducing skew matrices leads to a larger Poincaré constant and, in turn, to a larger λ(α, N).
Previous work [18] clarified that if α is sufficiently small, introducing skew matrices improves the Poincaré constant by a constant factor, i.e., m(α, N) − m_0 ≈ O(α²), where the O(α²) term depends on the eigenvectors and eigenvalues of the generator L. On the other hand, from Theorem 8, for any ξ > 0, to achieve a bias smaller than ξ it suffices to run skew-SGLD for a number of iterations k inversely proportional to λ(α, N), using an appropriate step size h and assuming that δ and α are small enough (see Appendix M.2 for details). Combining these observations, introducing skew matrices into SGLD improves the computational complexity by a constant factor. Our numerical experiments show that even a constant-factor improvement results in faster convergence in practical Bayesian models.

Related Work
In this section, we discuss the relationship between our method and other sampling methods.

Relation to Non-Reversible Methods
As we discussed in Section 1, our work extends the existing analyses of non-reversible dynamics [8,18] and presents a practical algorithm. Compared to those previous works, we focus on the practical setting of Bayesian sampling and derive an explicit condition on J for acceleration. We also derive a formula to quantitatively evaluate skew acceleration based on an asymptotic expansion of the eigenvalues of the perturbed Hessian matrix. A previous work [24] derived the optimal skew matrices when the target distribution is Gaussian, but it requires O(d³) computational cost, and it is unclear whether it works for non-convex potential functions. In contrast, our construction method for skew matrices is simple, computationally cheap, and applicable to general Bayesian models.
Our work analyzes skew acceleration for ULD, which is more effective than LD in practical problems, whereas previous works [8,18] analyzed skew acceleration only for LD. A previous work [17] combined a non-reversible drift term with ULD; unlike our method, its purpose was to reduce the asymptotic variance of the expectation of a test function, and it mainly focuses on sampling from Gaussian distributions.
To the best of our knowledge, our work is the first to focus on the memory issue of skew acceleration and develop a memory-efficient skew matrix for ensemble sampling. Our work also presents an algorithm that controls the trade-off for the first time. Another work [18] identified the trade-off and handled it by cross-validation, which is computationally inefficient, unfortunately.
Finally, we point out an interesting connection between our skew-SGHMC and the magnetic HMC (M-HMC) [25]. M-HMC accelerates HMC's mixing time by introducing a "magnetic" term into the Hamiltonian; that magnetic term is expressed by special skew matrices. Although a previous work [25] argued that M-HMC is numerically superior to standard HMC, its theoretical properties remain unclear. Thus, our analysis can shed light on the theoretical behavior of M-HMC.

Relation to Ensemble Methods
Our proposed algorithm is based on ensemble sampling [26]. Ensemble sampling, in which multiple samples are updated simultaneously with interaction, has attracted attention both numerically and theoretically because of improvements in memory size, computational power, and parallel processing schemes [26]. Successful, widely used ensemble methods include SVGD [27] and SPOS [28], with which we compare our proposed method numerically in Section 6. Although both show good numerical performance, it is unclear how the interaction term theoretically accelerates the convergence, since they are formulated as McKean-Vlasov processes, i.e., non-linear dynamics, which complicates establishing a finite-sample convergence rate. Our algorithm is an extension of another work [18], in which the interaction is composed of a skew-acceleration term and can be rigorously analyzed. Compared to that previous work [18], we analyze skew acceleration by focusing on the Hessian matrix, develop practical algorithms as discussed in Section 4.2, and derive the explicit condition under which acceleration occurs, which was previously unclear [18].
Another difference among SPOS, SVGD, and [18] is that they use first-order methods, whereas our approach uses second-order dynamics. Little work has been done on ensemble sampling for second-order dynamics. Recently, a second-order ensemble method was proposed [29], based on gradient flow analysis. Although it showed good numerical performance, its theoretical properties for finite samples remain unclear, because the scheme was proposed as a finite-sample approximation of the gradient flow. In contrast, our proposed method is a valid sampling scheme with a non-asymptotic guarantee.

Numerical Experiments
The purpose of our numerical experiments is to confirm the acceleration of the algorithm proposed in Section 4 on commonly used Bayesian models, including a Gaussian distribution (toy data), latent Dirichlet allocation (LDA), and Bayesian neural network regression and classification (BNN). We compared our algorithm's performance with other ensemble sampling methods: SVGD, SPOS, standard SGLD, and SGHMC. In all the experiments, the reported values and error bars are the mean and the standard deviation over repeated trials. For all the experiments we set γ = 1 and Σ^{-1} = 300 for SGHMC and skew-SGHMC. The selection criterion for the hyperparameters of our proposed algorithm is discussed in Appendix L.
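As a point of reference for the methods compared below, one ensemble update of skew-SGLD can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, parameters, and the use of the memory-efficient coupling J = J_0 ⊗ I_d (applied as an N × N matrix product over the chain index, so the dN × dN matrix is never formed) are assumptions consistent with the algorithm described in this paper.

```python
import numpy as np

def skew_sgld_step(X, grad_U, J0, alpha, h, beta=1.0, rng=None):
    """One ensemble update of skew-SGLD (illustrative sketch).

    X      : (N, d) array, one row per parallel chain.
    grad_U : callable mapping (N, d) -> (N, d) rowwise gradients of U.
    J0     : (N, N) skew-symmetric coupling matrix (J = J0 kron I_d).
    """
    rng = rng or np.random.default_rng()
    G = grad_U(X)
    # Applying (I + alpha * (J0 kron I_d)) to the stacked gradient reduces
    # to a small N x N matrix product: no dN x dN matrix is ever stored.
    drift = G + alpha * (J0 @ G)
    noise = rng.standard_normal(X.shape)
    return X - h * drift + np.sqrt(2.0 * h / beta) * noise

# Toy usage: standard Gaussian target, U(x) = ||x||^2 / 2, so grad_U = id.
rng = np.random.default_rng(0)
N, d = 8, 5
A = rng.standard_normal((N, N))
J0 = A - A.T                       # skew-symmetric by construction
X = rng.standard_normal((N, d))
for _ in range(1000):
    X = skew_sgld_step(X, lambda x: x, J0, alpha=0.1, h=1e-2, rng=rng)
```

Replacing `grad_U` with a minibatch stochastic gradient gives the stochastic-gradient variant, and adding a momentum variable gives the skew-SGHMC analogue.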

Toy Data Experiment
The target distribution is the multivariate Gaussian distribution π = N(µ, Ω), where we generated Ω^{-1} = AᵀA and each element of A ∈ R^{2d×d} is drawn from the standard Gaussian distribution. The dimension of the target distribution is d = 50, and we approximate it by 20 samples using the proposed ensemble methods. We chose this toy target because the LD for a Gaussian target is the Ornstein-Uhlenbeck process, whose theoretical properties have been studied extensively, e.g., [30]. Thus, by studying the convergence behavior on this toy data, we can understand our proposed method more clearly.
First, we confirmed how the skew-symmetric matrix affects the eigenvalues of the Hessian matrix, as discussed in Section 3, where we only showed the asymptotic expansion for the smallest real part of the eigenvalues and for the saddle point. A similar expansion holds for the largest real part: Re λ^α_{dN} ≤ λ_{dN}. We then observed how the largest and smallest real parts of the eigenvalues of (I + αJ)Ω^{-1} depend on α. The results are shown in Figure 2, where we averaged 10 trials over a randomly generated J with fixed A. The upper-left, upper-right, and lower panels show Re(λ_1(α)), Re(λ_{dN}(α)), and Re(λ_1(α))/Re(λ_{dN}(α)), respectively. These behaviors are consistent with Theorem 3: when α is small, the behavior is close to the quadratic function proved in Theorem 3.
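The eigenvalue behavior described here can be reproduced in a few lines. The following is an illustrative sketch under this subsection's setup (Ω^{-1} = AᵀA, a random skew-symmetric J), not the exact experimental code; the dimension and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((2 * d, d))
Omega_inv = A.T @ A                    # Hessian of the Gaussian potential
B = rng.standard_normal((d, d))
J = B - B.T                            # random skew-symmetric matrix

def real_part_range(alpha):
    """Smallest and largest real parts of the eigenvalues of (I+aJ)H."""
    ev = np.linalg.eigvals((np.eye(d) + alpha * J) @ Omega_inv)
    return ev.real.min(), ev.real.max()

lo0, hi0 = real_part_range(0.0)        # unperturbed spectrum endpoints
lo1, hi1 = real_part_range(0.5)
# Consistent with Theorem 3: the smallest real part does not decrease
# and the largest real part does not increase under the perturbation.
print(lo1 >= lo0 - 1e-6, hi1 <= hi0 + 1e-6)
```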
Next, we observed the convergence behavior of skew-SGLD and skew-SGHMC. We measured convergence by the maximum mean discrepancy (MMD) [31] between the empirical and stationary distributions. For the MMD, we used 2000 samples from the target distribution and a Gaussian kernel whose bandwidth is set to the median distance among these 2000 samples. We used gradient descent (GD) with step size h = 1 × 10^{-4}. The results are shown in Figure 3. The proposed method shows faster convergence than naive parallel sampling, which is consistent with Table 2.
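A minimal sketch of the MMD estimator with a Gaussian kernel and the median-distance bandwidth heuristic, as described above; the function name and sample sizes are illustrative, not the experimental code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd2(X, Y):
    """Biased (V-statistic) squared MMD between samples X (n,d) and Y (m,d),
    with a Gaussian kernel whose bandwidth is the median pairwise distance
    among the target draws Y."""
    sigma = np.median(cdist(Y, Y))
    k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
# Matching samples should give a smaller discrepancy than shifted ones.
same = mmd2(rng.standard_normal((200, 5)), rng.standard_normal((2000, 5)))
far = mmd2(rng.standard_normal((200, 5)) + 3.0, rng.standard_normal((2000, 5)))
print(same < far)
```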

LDA Experiment
We tested with an LDA model using the ICML dataset [32], following the same setting as [33]. We used 20 samples for all the methods, a minibatch size of 100, and step size h = 5 × 10^{-4}. First, we confirmed the effectiveness of our proposed Algorithm 1, which adaptively tunes α. For that purpose, we compared the final performance obtained by our method with a previous method [18], in which α is selected by cross-validation (CV). Here, instead of CV, we simply fixed α during sampling and refer to this as fixed α. We also tested the case where J is generated randomly at each step with fixed α, as discussed in Section 4.1; we refer to this as random J. The results are shown in Figure 4, where skew-SGLD was used. Our method showed performance competitive with the best fixed α. For the computational cost, we used k = 2 in Algorithm 2, and our method needed twice the wall-clock time of each fixed-α run. This means that our algorithm greatly reduces the total computational time, since CV over fixed α requires trying more than two values of α. We also found that using a different J at each step did not accelerate convergence; thus, J must be stored and kept fixed during sampling for acceleration. Next, we compared our method with other ensemble sampling schemes and observed the convergence speed. The result is shown in Figure 5. Skew-SGLD and skew-SGHMC outperformed SGLD and SGHMC, which is consistent with our theory.

BNN Regression and Classification
We tested with the BNN regression task using the UCI dataset [34], following the setting of Liu and Wang [27]. We used a one-hidden-layer neural network with ReLU activation and 100 hidden units, 10 samples for all the methods, a minibatch size of 100, and step size h = 5 × 10^{-5}. The results are shown in Tables 1 and 2. We also tested on the BNN classification task using the MNIST dataset; the result is shown in Figure 6 (averaged over ten trials). We again used a one-hidden-layer neural network with ReLU activation and 100 hidden units, a batch size of 500, and step size h = 5 × 10^{-5}. Our proposed methods outperformed the other ensemble methods; in particular, skew-SGHMC and skew-SGLD consistently outperformed SGHMC and SGLD.

Conclusions
We studied skew acceleration for LD and ULD from practical viewpoints, concluded that the improved eigenvalues of the perturbed Hessian matrix cause the acceleration, and derived the explicit condition for acceleration. We described a novel ensemble sampling method that couples multiple SGLD or SGHMC chains with memory-efficient skew matrices. We also proposed a practical algorithm that controls the trade-off between faster convergence and larger discretization and stochastic-gradient error, and numerically confirmed its effectiveness.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A.1. Wasserstein Distance
In this paper, we use the Wasserstein distance, defined as follows. Let (E, d) be a metric space (an appropriate space such as a Polish space) with σ-field A, where d(·, ·) is A × A-measurable. Let µ, ν be probability measures on E, and let p ≥ 1. The Wasserstein distance of order p with cost function d between µ and ν is defined as
W_p(µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{E×E} d(x, y)^p dπ(x, y) )^{1/p},
where Π(µ, ν) is the set of all joint probability measures on E × E with marginals µ and ν.
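For intuition, the one-dimensional p = 1 case can be computed in closed form from empirical samples. The following check (not part of the paper; sample sizes and seed are assumptions) uses SciPy's 1-D implementation:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, 20000)
nu_samples = rng.normal(2.0, 1.0, 20000)
# For two 1-D Gaussians with equal variance, the optimal coupling is a
# translation, so W_1 equals the gap between the means (here, 2).
w1 = wasserstein_distance(mu_samples, nu_samples)
print(abs(w1 - 2.0) < 0.1)
```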
In this paper, we work on the space R^d. As for the distance, we use the Euclidean distance ‖·‖. For simplicity, we write the p-Wasserstein distance with the Euclidean distance as W_p. Various properties of the Wasserstein distance are summarized in [35]. We define the Kullback-Leibler (KL) divergence as
KL(µ|ν) = ∫ log(dµ/dν) dµ for µ ≪ ν, and +∞ otherwise.

Appendix A.2. Markov Diffusion and Generator
Here we introduce an additional explanation of the generator of a Markov diffusion process. Given an SDE, we denote the corresponding Markov semigroup as P = {P_t}_{t>0}, where the Kolmogorov operator P_s is defined as
P_s f(x) = E[f(X_s) | X_0 = x],
where f is some bounded test function in L²(µ). The property P_{s+t} = P_s ∘ P_t is called the Markov (semigroup) property. A probability measure π is the stationary distribution when it satisfies, for all measurable bounded functions f and all t, ∫_{R^d} P_t f dπ = ∫_{R^d} f dπ.
We denote the infinitesimal generator of the associated Markov semigroup as L and call it a generator for simplicity. The linearity of the operators P_t together with the semigroup property indicates that L is the derivative of P_t at t = 0:
L f = lim_{h→0} (P_h − Id) f / h,
where Id is the identity map. In addition, taking h → 0, we have ∂_t P_t = LP_t = P_t L. From the Hille-Yosida theory [19], there exists a dense linear subspace of L²(π) on which L exists; we refer to it as D(L). If the Markov semigroup is associated with the SDE of Equation (A3), the generator can be written as
L f = −∇U · ∇f + β^{-1} Δf,
where Δ is the Laplacian in the standard Euclidean space. The generator satisfies ∫ L f dπ = 0 for all f ∈ D(L).

Appendix A.3. Poincaré Inequality
We use the Poincaré inequality to measure the speed of convergence to the stationary distribution. In this section, we summarize definitions and useful properties; see [19] for more details. We define the Dirichlet form E(f) for all bounded functions f ∈ D(L), where D(L) denotes the domain of L, as
E(f) = −∫ f L f dπ.
We define the Dirichlet domain D(E) as the set of functions f ∈ L²(π) satisfying E(f) < ∞.
We say that π with L satisfies a Poincaré inequality with a positive constant c if, for any f ∈ D(E),
Var_π(f) ≤ c E(f).
This constant c is closely related to a spectral gap. If the smallest nonzero eigenvalue λ of −L is greater than 0, it is called the spectral gap, and it can be written as
λ = inf { E(f)/Var_π(f) : f ∈ D(E), Var_π(f) ≠ 0 }.
From this, any constant c satisfying c ≥ 1/λ satisfies the Poincaré inequality.
To check the existence of the spectral gap, one approach is to use a Lyapunov function, as developed by Bakry et al. [36]. We can also express the Poincaré inequality via the χ² divergence. Let us define the χ² divergence for µ ≪ π as
χ²(µ|π) = ∫ (dµ/dπ − 1)² dπ.
Then, the Poincaré inequality with constant c can be expressed, for all µ ≪ π, as
χ²(µ|π) ≤ c E(dµ/dπ).
We obtain the following exponential convergence results from the above functional inequalities for measures.
Theorem A1. (Exponential convergence in the variance, Theorem 4.2.5 in [19]) When π satisfies the Poincaré inequality with a constant c, it implies exponential convergence in the variance with rate 2/c, i.e., for every bounded function f : R^d → R,
Var_π(P_t f) ≤ e^{−2t/c} Var_π(f).
We also note the important tensorization property of the Poincaré inequality for product measures: if each π_i satisfies a Poincaré inequality with constant c_i, then the product measure ⊗_i π_i satisfies one with constant max_i c_i. These relations play important roles in our analysis.
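For completeness, the exponential decay in Theorem A1 follows from a short Grönwall argument, using that the mean ∫ P_t f dπ is constant in t:

```latex
\frac{d}{dt}\,\mathrm{Var}_\pi(P_t f)
  = 2\int P_t f \, L P_t f \, d\pi
  = -2\,\mathcal{E}(P_t f)
  \le -\frac{2}{c}\,\mathrm{Var}_\pi(P_t f)
\;\Longrightarrow\;
\mathrm{Var}_\pi(P_t f) \le e^{-2t/c}\,\mathrm{Var}_\pi(f).
```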

Appendix B. Generator of the Underdamped Langevin Dynamics (ULD)
Following [10], we define the infinitesimal generator of the ULD. Then, we define the generator of S-ULD, in which the second part corresponds to the interaction terms. It is then easy to confirm that the stationary distribution of S-ULD is π̄; we can prove this simply by integration by parts and the skew-symmetry of the matrix.
We also consider other combinations of skew matrices with ULD. For example, we can consider the following more general combination; compared with S-ULD, two new terms are included. We can also derive the infinitesimal generator of this Markov process, which we express as L̃. Then, we calculate the infinitesimal change of the expectation of f, which suggests that the stationary distribution of Equation (A14) is different from π̄. It is widely known that underdamped Langevin dynamics converges to (overdamped) Langevin dynamics in the small-mass limit. Here we observe that S-ULD converges to the skew-LD of [18]. The limiting procedure is standard; see, for example, [17,37,38]. We cite Proposition 1 in [17]: given a stochastic process, we rescale it by introducing a small mass ε, and by taking the limit ε → 0, the dynamics converges to the overdamped limit; see Proposition 1 in [17] for the precise statement. Please note that the term related to J₂ works as preconditioning. Thus, if we set α₂J₂ = 0, the obtained dynamics are equivalent to the continuous dynamics of skew-SGLD. Thus, our skew-SGHMC is a natural extension of skew-SGLD.

Appendix C. Proof of Theorem 1
Appendix C.1. Proof for S-LD
First, under Assumptions 1-5, LD has a spectral gap, and its Poincaré constant is upper bounded as derived in [2]. Next, we introduce the generator of S-LD. The proof is similar to that of Theorem 12 in [18].
As for L_α, although it is not a self-adjoint operator, by Proposition 1 in Franke et al. [39] it has a discrete complex spectrum. We denote a spectral point of L_α as λ + iµ ∈ C, where λ, µ ∈ R, and the corresponding normalized eigenvector as u + iv, where u, v are real functions; then we have L_α(u + iv) = (λ + iµ)(u + iv).
(A20) From this definition, by matching the real and imaginary parts, the following relations are derived. Due to the divergence-free drift property, integration by parts shows that, for any bounded real-valued test function g(x),
∫ g L_α g dπ = ∫ g L_{α=0} g dπ.
(This only holds for real functions.) Then, we can evaluate the real part of the eigenvalue λ as follows. Expanding the eigenfunctions u, v in the eigenfunctions {e_k} of L_{α=0}, we find that the real part of any nonzero eigenvalue of L_α is no larger than the largest nonzero eigenvalue of L_{α=0}. This means that the spectral gap of L_α is at least that of L_{α=0}, i.e., m(α) ≥ m_0 holds.

Appendix C.2. Proof of Theorem 2 (S-ULD)
Proof of Theorem 2. To prove the result for S-ULD, we use the result of [20], which characterizes the convergence of ULD via the Poincaré constant. Let μ̄_t denote the measure induced by ULD. Then, from Theorem 1 of [20], if π with L has Poincaré constant m_0, we obtain a convergence bound with rate λ_γ given in terms of Λ(γ, ε̄ min(γ, γ^{-1})), where ε̄ is an arbitrary, sufficiently small positive value such that Λ(γ, ε̄ min(γ, γ^{-1})) > 0 is satisfied. As for R_ham, if there exists a positive constant K such that ∇²U ⪰ −KI, then R_ham ≤ max{K, 2}. Under our assumptions, this corresponds to βM; thus R_ham ≤ max{βM, 2}. From the above definitions, we can see that the larger m_0 is, i.e., the larger the Poincaré constant is, the faster the convergence of ULD.
This can also be confirmed numerically; see Figure A1, which shows how Λ changes under different m_0. We set Σ^{-1} = 100. From the figure, the larger the Poincaré constant is, the larger Λ becomes. So far, we have confirmed that the convergence speed of S-ULD is characterized by the Poincaré constant of L. For S-ULD, we simply add the skew-matrix term to the generator of the ULD in the proof of Proposition 1 in [20]. This amounts to replacing the Poincaré constant m_0 with m(α) in that proof; since m(α) ≥ m_0, this indicates faster convergence.

Appendix D. Eigenvalue and Poincaré Constant
In this section, we discuss the relation between eigenvalues of the Hessian matrix and Poincaré constant.

Appendix D.1. Strongly Convex Potential Function
When we consider LD with an m-strongly convex potential function, the Poincaré constant is m, which means exponential convergence with rate m (see [19] for details).
We then consider S-LD with an m-strongly convex potential. In this setting, by the synchronous coupling technique [11], we can show that the variance decays exponentially with the rate given by the smallest real part of the eigenvalues. To see this, we prepare two copies of S-LD, (X_t, Y_t), driven by the same Brownian motion, and evaluate the behavior of ‖X_t − Y_t‖². From the Ito lemma and the synchronous coupling, we obtain exponential decay with rate 2m(α)/β, where m(α) is the constant satisfying m(α) ≤ Re λ^α_1(x) for all x; see Appendix E for details. From the fundamental property of the Poincaré constant (Theorem 4.2.5 in [19]), m(α) is the Poincaré constant. Thus, the imaginary parts have no effect on the continuous dynamics, and the Poincaré constant is given by the smallest real part of the eigenvalues of the perturbed Hessian matrix.
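The synchronous-coupling argument can be illustrated numerically. The sketch below (an illustration under assumed parameters, not the paper's code) couples two discretized S-LD chains for a quadratic potential U(x) = xᵀHx/2 by sharing the Brownian increments, so the noise cancels in the difference process and ‖X_t − Y_t‖ contracts at a rate governed by the smallest real part of the eigenvalues of (I + αJ)H.

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha, h = 10, 0.5, 1e-3
M = rng.standard_normal((2 * d, d))
H = M.T @ M                         # Hessian of the quadratic potential
B = rng.standard_normal((d, d))
J = B - B.T                         # skew-symmetric perturbation
P = (np.eye(d) + alpha * J) @ H     # perturbed drift matrix

X = rng.standard_normal(d) + 5.0
Y = rng.standard_normal(d) - 5.0
gaps = []
for _ in range(2000):
    xi = rng.standard_normal(d)     # shared noise: synchronous coupling
    X = X - h * (P @ X) + np.sqrt(2 * h) * xi
    Y = Y - h * (P @ Y) + np.sqrt(2 * h) * xi
    gaps.append(np.linalg.norm(X - Y))

# The gap decays roughly like exp(-m(alpha) t), where m(alpha) is the
# smallest real part of the eigenvalues of (I + alpha J) H.
print(gaps[-1] < gaps[0])
```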

Appendix D.2. Non-Convex Potential Function
As we discussed in Section 3.1, [21] derived a sharper estimate of the Poincaré constant for non-convex potential functions. It is easy to verify that their assumptions are satisfied under our Assumptions 1-5. Following the main paper, we denote by x_1 the global minimum and by x_2 the local minimum attaining the second smallest value of U(x). We express the saddle point between x_1 and x_2 as x*. More precisely, the saddle point that characterizes the Poincaré constant is the critical point with index one, i.e., ∇U(x*) = 0 and ∇²U(x*) has one negative eigenvalue and d − 1 positive eigenvalues. We express them as λ_1(x*) < 0 < λ_2(x*) ≤ … ≤ λ_d(x*).
Ref. [21] studied the Poincaré constant by decomposing the non-convex potential, focusing on attractors. They showed that the non-convex potential can be decomposed into a sum of approximately Gaussian distributions, and proved that the Poincaré constant is characterized by local Poincaré constants derived from the approximate Gaussians on the attractors and their surrounding regions. In addition, they proved that the dominant term of the Poincaré constant is determined by the saddle point between the global minimum and the point attaining the second smallest value of U(x). From Theorem 2.12 and Corollary 2.15 in [21], the Poincaré constant is characterized by a formula involving Z, the normalizing constant of e^{−βU(x)}. Next, we discuss how this estimate changes when skew matrices are applied. When skew matrices are introduced, from Lemma A.1 in [40], at the saddle point there exists a unique negative real eigenvalue λ^α_1(x*) < 0 of the perturbed Hessian matrix, even though (I + αJ)H is not a symmetric matrix.
Then, from Proposition 5 in [8], the negative eigenvalue of the perturbed Hessian is smaller than that of the unperturbed Hessian matrix at the saddle point. This means that λ^α_1(x*) ≤ λ_1(x*) < 0 holds.
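This behavior can be spot-checked numerically on a synthetic index-one saddle. The sketch below is illustrative (the eigenvalues, seed, and α are assumptions, and the inequality is the cited Proposition 5, not something re-proved here):

```python
import numpy as np

rng = np.random.default_rng(5)
d, alpha = 6, 0.2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
# Index-one saddle: exactly one negative Hessian eigenvalue.
lam = np.array([-1.5, 0.5, 1.0, 2.0, 3.0, 4.0])
H = Q @ np.diag(lam) @ Q.T
B = rng.standard_normal((d, d))
J = B - B.T

ev = np.linalg.eigvals((np.eye(d) + alpha * J) @ H)
# det(I + alpha J) > 0 and det(H) < 0, so at least one real eigenvalue
# of the perturbed matrix is negative; it should not exceed lam_1.
neg = ev.real[(np.abs(ev.imag) < 1e-9) & (ev.real < 0)]
print(len(neg) >= 1, neg.min() <= lam.min() + 1e-8)
```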
Finally, from Theorem 5.1 in [41] and Theorem 2.12 in [21], this improvement of the negative eigenvalue at the saddle point directly leads to a larger Poincaré constant.

Appendix E. Properties of a Skew-Symmetric Matrix
Here, we introduce basic properties of skew-symmetric matrices. Assume that the d × d matrix H′ := (I + αJ)H is diagonalizable, and that H′ has l real eigenvalues λ_1, …, λ_l and 2m complex eigenvalues µ_1 = α_1 ± iβ_1, …, µ_m = α_m ± iβ_m, so that d = l + 2m. We denote the corresponding eigenvectors as {v_j}^l_{j=1} for the real eigenvalues and {w_j = a_j + ib_j}^m_{j=1} for the complex eigenvalues {µ_j}^m_{j=1}, with {w̄_j} for the conjugate eigenvalues. Then, define the d × d matrix
V = [v_1, …, v_l, a_1, b_1, …, a_m, b_m]. (A39)
Then, we can decompose H′ into a block-diagonal matrix [42], whose blocks are the real eigenvalues λ_j and the 2 × 2 blocks (α_j, β_j; −β_j, α_j); we denote this block-diagonal matrix by D (A40, A41). Then, from the Taylor expansion, expressing its residual in integral form and defining H(x) := ∇²U(x), we apply this canonical form: if (I + αJ)H is diagonalizable, it is decomposable as in Equation (A40), and we obtain a decay bound in which m(α) is a constant satisfying m(α) ≤ min{λ_1, …, λ_l, α_1, …, α_m} for all x. Thus, the imaginary parts never appear in the upper bound, and we only need to focus on the smallest real part of the eigenvalues when the matrix is diagonalizable. The next subsection describes when the non-symmetric matrix H′ is diagonalizable, focusing on random matrices.

Appendix F. Proof of Theorem 3
Proof. Since the potential function is m-strongly convex, the smallest eigenvalue of the Hessian matrix H is at least m > 0. Thus, H and H^{1/2} are regular (invertible) matrices. With this in mind, we consider A := H + H^{1/2}JH^{1/2} as a similar matrix of H′ := (I + J)H. This is easily confirmed by
H^{-1/2} A H^{1/2} = H^{-1/2}(H + H^{1/2}JH^{1/2})H^{1/2} = H + JH = (I + J)H.
This means that to study the eigenvalues of H′, we only need to study the similar matrix A. The advantage is that A is composed of a symmetric and a skew-symmetric part, which are easier to treat than H′, where the term JH is difficult to analyze. For simplicity, we omit the dependency of H and H′ on x in this section.
Remark A1. Please note that we can eliminate the strong convexity of U if H is a regular matrix, i.e., if H does not have 0 as an eigenvalue.
For simplicity, we assume that the dimension d is an even number. We assume that the eigenvalues and eigenvectors of A are expressed as {α_j ± iβ_j, a_j ± ib_j}, with the α_j ordered as α_1 ≤ α_2 ≤ ⋯. In this section, for notational simplicity, we only consider the setting where all the eigenvalues and eigenvectors are complex; the extension to the general setting (similar to Appendix E) and to odd d is straightforward. We denote the eigenvalues and eigenvectors of H as {λ_j, v_j}^d_{j=1}, where the v_j are linearly independent, and we assume λ_1 ≤ ⋯ ≤ λ_d. From this definition, by matching the real and imaginary parts, the following relations are derived, using the skew-symmetric property (in the third equality, we used the fact that H^{1/2}JH^{1/2} is a skew-symmetric matrix). We then expand a_j and b_j in the basis {v_j}, since the v_j are eigenvectors of H and can be used as a basis of R^d. Substituting this expansion into Equation (A50), we find that the real part of any eigenvalue of A is at least λ_1, the smallest eigenvalue of H. Thus, if α_1 is the smallest real part of the eigenvalues of A, then α_1 ≥ λ_1. This concludes the proof.
In the same way, the real part of any eigenvalue of A is at most the largest eigenvalue of H. Thus, if α_d is the largest real part of the eigenvalues of A, it is smaller than or equal to the largest eigenvalue of H.
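Both the similarity used in the proof and the resulting bounds on the real parts are easy to verify numerically. This is an illustrative sketch (random instance, seed assumed), not part of the proof:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
d = 6
M = rng.standard_normal((2 * d, d))
H = M.T @ M                                  # positive definite Hessian
B = rng.standard_normal((d, d))
J = B - B.T
Hs = np.real(sqrtm(H))                       # symmetric square root H^{1/2}

ev_orig = np.linalg.eigvals((np.eye(d) + J) @ H)
ev_similar = np.linalg.eigvals(H + Hs @ J @ Hs)
# Similar matrices share their spectrum (compare sorted real/imag parts).
same_real = np.allclose(np.sort(ev_orig.real), np.sort(ev_similar.real), atol=1e-7)
same_imag = np.allclose(np.sort(ev_orig.imag), np.sort(ev_similar.imag), atol=1e-7)
lam = np.linalg.eigvalsh(H)
# Real parts of the perturbed spectrum stay inside [lam_min, lam_max].
inside = (ev_orig.real.min() >= lam.min() - 1e-7) and (ev_orig.real.max() <= lam.max() + 1e-7)
print(same_real and same_imag and inside)
```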
Equality condition: Next, we discuss when the equality α_1 = λ_1 holds. First, we assume that the eigenvalues of H are distinct, so there is only one eigenvector for λ_1; we discuss non-distinct eigenvalues later. From Equation (A54), equality fails in general: if a_1 and b_1 do not correspond to v_1, then some λ_{j≠1} > λ_1 must appear in the summation and equality never holds. So, for equality, a_1 and b_1 must correspond to v_1. Based on this, let us assume that w_1 = c a_1 + ic′ b_1, where c² + c′² = 1, and consider the case a_1 = b_1 = v_1. Then we need to solve the simultaneous equations obtained from the definition of the eigenvalues of A and from the definition of the eigenvalues of H. Multiplying by v_1ᵀ from the left, we obtain cβ_1 = 0 and c′β_1 = 0; thus β_1 = 0, which implies b_1 = 0 from the property of complex eigenvectors. Thus, we obtain w_1 = a_1 = v_1 for λ_1 = α_1. Then, since H^{1/2} has an inverse matrix, this condition indicates that JH^{1/2}v_1 = 0. This is the condition under which λ_1 = α_1 holds; the same relation can be derived for λ_d = α_d. Next, we assume that the eigenvalues of H are not distinct. Let us denote the set of eigenvectors of the eigenvalue λ^0_1 as V^0_1. Note that if a_1 and b_1 are not included in V^0_1, then some λ_{j≠1} > λ_1 must appear and equality never holds; thus a_1, b_1 ∈ V^0_1 must hold for equality. Based on this, let us assume that w_1 = c a_1 + ic′ b_1, where c² + c′² = 1, and consider the case a_1 ≠ b_1; then we obtain the corresponding condition.

Appendix G. Proofs of Random Matrices
Appendix G.1. Proof of Theorem 5
Proof. The proof is a straightforward consequence of the following lemma from [43]: if f(x_1, …, x_m) is a polynomial in real variables x_1, …, x_m that is not identically zero, then the subset N_m = {(x_1, …, x_m) | f(x_1, …, x_m) = 0} of the Euclidean m-space R^m has Lebesgue measure zero.
We use this lemma to prove that the probability of λ_1 = α_1 is 0 by showing that the event λ_1 = α_1 has Lebesgue measure zero. We use the same notation as in Appendix F. Recall Equation (A64), the equality condition for λ_1 = α_1. We express the elements of a_1 and b_1 as a_1 = (a^1_1, …, a^d_1)ᵀ and b_1 = (b^1_1, …, b^d_1)ᵀ. Then the equality condition can be written as a polynomial equation in the {J_{i,j}}, f(J_{1,2}, …, J_{d−1,d}) = 0. To apply the lemma of [43], we must confirm that f(J_{1,2}, …, J_{d−1,d}) is not identically zero. This is clear from the definition of f, since we generate J_{1,2}, …, J_{d−1,d} randomly from a distribution that is absolutely continuous with respect to the Lebesgue measure, and λ_1 ≠ 0, c² + c′² = 1, and at least one of a_1, b_1 ≠ 0.

Appendix G.2. Proof of Lemma 1
Proof. We first discuss the condition Ker J_0 = {0}. Since J = J_0 ⊗ I_d, we denote the set of eigenvalues of J_0 as {ω_i}. In general, the eigenvalues of a matrix composed of the Kronecker product of two matrices A and B are given by the products of the eigenvalues of A and B [44]. Thus, since J is the Kronecker product of J_0 and I_d, if J_0 does not have 0 as an eigenvalue, then J does not have 0 as an eigenvalue.
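The Kronecker eigenvalue property used here is easy to check numerically; the sketch below (with assumed small dimensions and seed) verifies that the spectrum of J = J_0 ⊗ I_d is the spectrum of J_0 with multiplicity d, so J inherits its kernel from J_0:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 4, 3
B = rng.standard_normal((N, N))
J0 = B - B.T                                  # skew-symmetric chain coupling
J = np.kron(J0, np.eye(d))                    # full (dN x dN) interaction

# Skew matrices have purely imaginary eigenvalues; eigenvalues of
# kron(J0, I_d) are those of J0, each repeated d times.
ev_J0 = np.linalg.eigvals(J0)
ev_J = np.linalg.eigvals(J)
match = np.allclose(np.sort(ev_J.imag),
                    np.sort(np.repeat(ev_J0.imag, d)), atol=1e-8)
print(match, np.abs(ev_J).min() > 1e-8)       # generic J0: no zero eigenvalue
```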
Next, we discuss the other equality condition. We use similar notation as in Appendix F, but now the dimension of the matrix J is dN. We express the eigenvalue with the smallest real part as λ^α_1 and its eigenvector as w^α_1 = a_1 + ib_1, with elements a_1 = (a^1_1, …, a^{dN}_1)ᵀ ∈ R^{dN} and b_1 = (b^1_1, …, b^{dN}_1)ᵀ. We also write a_1 = (a^{(1)}, …, a^{(N)}), with each a^{(n)} ∈ R^d. We use the Kronecker product property, where J_{0|i,j} denotes the element in the i-th row and j-th column of J_0, together with the Vec-operator identity for Kronecker products [44]. The proof is then almost the same as in Appendix G.1: the equality condition can be written as a polynomial equation f(J_{1,2}, …, J_{N−1,N}) = 0, where ‖·‖ is the d-dimensional Euclidean norm. By a similar discussion to Appendix G.1, it is clear that f is not identically zero. Thus, given an evaluation point x, by the lemma of [43], the subset of {J_{i,j}} ∈ R^{N(N−1)/2} satisfying f(J_{1,2}, …, J_{N−1,N}) = 0 has Lebesgue measure zero. Hence, if we generate {J_{i,j}} from a probability measure absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), f(J_{1,2}, …, J_{N−1,N}) = 0 holds with probability 0. This concludes the proof. Note first that the condition Ker J_0 = {0} does not depend on the evaluation point; thus we only need to consider the equality condition for Re λ^α_1 = λ^0_1. As we saw in Theorem 5 and Lemma 1, if we generate a random matrix J that is absolutely continuous with respect to the Lebesgue measure, the equality condition fails with probability 1 at the given evaluation point. The key point of those proofs is that the event where equality holds has Lebesgue measure 0 at the given evaluation point, by the lemma of [43].
Let us consider the case when two evaluation points are given (e.g., x_1, x_2), and we check whether the random matrix J satisfies the above equality condition. At each evaluation point, such an event (denote them S_1 and S_2) has Lebesgue measure 0 by the lemma of [43] (i.e., P(S_1) = 0 and P(S_2) = 0, where P is the law induced by generating the random matrix with its d(d − 1)/2 independent elements). Hence the union also has probability 0: P(S_1 ∪ S_2) = 0. By repeating this procedure for a finite number of evaluation points (x_1, …, x_k), the union of such events still has probability 0, i.e., P(S_1 ∪ S_2 ∪ ⋯ ∪ S_k) = 0.
When we consider the discretized dynamics of S-LD, S-PLD, and so on, and update samples up to k iterations, there exist k evaluation points. By the above discussion, along the path of the discretized dynamics, the equality condition fails with probability 1. On the other hand, for the continuous dynamics, the number of evaluation points is infinite, so we cannot conclude that the equality condition fails with probability 1.

Appendix H. Proof of Theorem 6
We use the same notation as in Appendix F. We consider the expansion with respect to α in the following setting, which indicates that, by introducing the skew-acceleration terms, the pairs of eigenvalues and eigenvectors of H′ are expressed as small perturbations of the eigenvalues and eigenvectors of H. Since {v_j}^d_{j=1} are the eigenvectors of H and can be used as an orthogonal basis, we expand δv_j in this basis:
δv_j = Σ_k c_{jk} v_k, where c_{jk} = δv_jᵀ v_k.

Appendix H.1. Asymptotic Expansion When the Smallest Eigenvalue of H(x) Is Positive
We work on the similar matrix of H′, that is, H + αV, where V := H^{1/2}JH^{1/2}; see Appendix G.1 for the details. Please note that this similar matrix only exists when the smallest eigenvalue of H(x) is positive. Thus, the following discussion cannot be applied at a saddle point, where negative eigenvalues appear; we discuss the saddle-point expansion later.
From the definition, we have
(H + αV)(v_j + δv_j) = (λ_j + δλ_j)(v_j + δv_j). (A75)
We rearrange this equation. First, we focus on the first-order expansion, i.e., we neglect higher-order terms. Then, we have
Hv_j + Hδv_j + αVv_j = λ_j v_j + δλ_j v_j + λ_j δv_j. (A76)
Multiplying Equation (A76) by v_jᵀ from the left-hand side, and using v_jᵀVv_j = 0 (by the skew-symmetry of V), we obtain δλ_j = 0 up to the first order. Substituting this into Equation (A76) and multiplying by v_iᵀ with i ≠ j, we obtain
c_{ji} = α v_iᵀVv_j / (λ_j − λ_i),
and hence
δv_j = α Σ_{i≠j} (v_iᵀVv_j / (λ_j − λ_i)) v_i.
We substitute this into Equation (A75), multiply by v_jᵀ, and use v_jᵀVv_j = 0, v_jᵀv_i = 0, and v_jᵀv_j = 1, which yields, up to second order,
δλ_j = α² Σ_{i≠j} (v_iᵀVv_j)² / (λ_i − λ_j).
Thus, taking the real part and noting Re λ_j(α) = α_j, we have
Re λ_j(α) = λ_j + α² Σ_{i≠j} (v_iᵀVv_j)² / (λ_i − λ_j) + O(α³).
This concludes the proof.
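As a numerical sanity check, the sketch below compares the smallest real eigenvalue of H + αV against the standard second-order perturbation prediction λ_1 + α² Σ_{i>1} (v_iᵀVv_1)²/(λ_i − λ_1). The instance, seed, and step α are assumptions chosen for illustration:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(4)
d = 8
M = rng.standard_normal((2 * d, d))
H = M.T @ M                                   # positive definite Hessian
B = rng.standard_normal((d, d))
J = B - B.T
Hs = np.real(sqrtm(H))
V = Hs @ J @ Hs                               # skew-symmetric perturbation

lam, vecs = np.linalg.eigh(H)                 # ascending eigenvalues of H
# Second-order prediction for the smallest real part:
#   Re lam_1(alpha) ~ lam_1 + alpha^2 * sum_{i>1} (v_i^T V v_1)^2 / (lam_i - lam_1)
coef = sum((vecs[:, i] @ V @ vecs[:, 0]) ** 2 / (lam[i] - lam[0])
           for i in range(1, d))

alpha = 1e-4
shift = np.linalg.eigvals(H + alpha * V).real.min() - lam[0]
rel_err = abs(shift / (alpha ** 2 * coef) - 1.0)
print(rel_err < 0.05)                         # quadratic growth, positive coef
```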

Appendix H.2. Expansion of the Eigenvalue at the Saddle Point
Here we derive the formula for the expansion of the eigenvalue at the saddle point. Since the smallest eigenvalue is negative, we cannot use the similar matrix as above. Instead, we use the relation
µ_j H w_j = H µ_j w_j = H(I + αJ)H w_j, (A86)
where we used the definition of the eigenvalues and eigenvectors. Here, we express H′ := (I + αJ)H and its pairs of eigenvalues and eigenvectors as {(µ_i, w_i)}^d_{i=1}. As introduced above, we substitute the expansion into Equation (A86) and obtain Equation (A87). Then, in the same way as above, since {v_j}^d_{j=1} are the eigenvectors of H and can be used as an orthogonal basis, we expand δv_j in this basis:
δv_j = Σ_k c_{jk} v_k, where c_{jk} = δv_jᵀ v_k.
Multiplying Equation (A87) by v_iᵀ with i ≠ j from the left-hand side and neglecting higher-order terms, we obtain the coefficients in Equation (A89). Then, multiplying Equation (A87) by v_jᵀ from the left-hand side, we have
v_jᵀ H(αJ)H δv_j = (δλ_j)(λ_j + λ_j v_jᵀ δv_j). (A90)
Substituting δv_j with the coefficients of Equation (A89), we obtain the result. This concludes the proof.

Appendix I. Convergence Rate of Parallel Sampling Schemes
Appendix I.1. Proof of Lemma 2
First, we introduce the notation. We express the random variables of S-PLD as Y^{⊗N}_t and the measure induced by S-PLD as µ^{⊗N}_t(α), which uses αJ as the interaction term. The measure of PLD is µ^{⊗N}_{kh}(0), and we can decompose this measure into its marginals. We also denote the marginal measure of S-PLD for Y^{(n)}_t as ν^{(n)}_t(α). Please note that the initial distribution is µ^{⊗N}_0 and its marginals are µ_0, as defined in Assumption 4. Please also note that the marginal measures of PLD are the same as those of LD if the initial measures are all the same; thus each marginal satisfies a Poincaré inequality with constant m_0. This also follows from the tensorization property of the spectral gap (Proposition 4.3.1 in Bakry et al. [19]).
As for the initial condition, from the fact that the χ² divergence is a special case of the Rényi divergence (of order 2), and from the tensorization property of the Rényi divergence (see Theorem 28 in [45]), we obtain the stated bound. If skew acceleration is applied, by the same argument as for S-LD (see Appendix C.1), S-PLD has a Poincaré constant at least m_0, which we write as m(α, N) (≥ m_0), and we obtain the corresponding bound. At first sight, since the constant N appears in the convergence bound, the bound seems not useful; however, as we discuss below, it is meaningful when bounding the bias or variance. For example, consider approximating the true expectation ∫ f dπ. We can bound the error by the 2-Wasserstein distance as
|E[f(Y)] − ∫ f dπ| ≤ L_f W_2,
where we assume that f is L_f-Lipschitz. To bound the distance, we use the basic relation between the 2-Wasserstein distance and the χ² divergence under a Poincaré inequality with constant m(α, N); this follows from the definitions of the Wasserstein distance and the χ² divergence, see [46] for the details. Combining the above relations, we obtain the bias bound of S-PLD, and, in the same way, the bias bound of PLD. Thus, while the explicit dependency on N disappears, S-PLD shows faster convergence through the relation m(α, N) ≥ m_0. Moreover, if we use skew matrices that do not satisfy the equality condition, we have m(α, N) > m_0.

Appendix I.2. Proof for S-ULD
We can characterize the convergence rate in almost the same way as Appendix C.2. The derivation is the same as above, so we only show the result, where ε̄ and λ_γ are given as follows, ε̄ being an arbitrary, sufficiently small positive value such that Λ(γ, ε̄ min(γ, γ^{-1})) > 0 is satisfied, and R_ham ≤ max{M, 2}. (A108)

Appendix J. Proof of Theorem 7
We state our theorem again with explicit constants.

Theorem A3. Under Assumptions 1–7, for any k ∈ N and any h ∈ (0, 1 ∧ m/(4M²)) obeying kh ≥ 1 and βm ≥ 2, we have

The obtained bound is O(kh · h^{1/4}), which is independent of N. Thus, this result is much better than those in [18]. Additionally, note that we can derive a similar bias bound for skew-SGHMC in the same way as for skew-SGLD.
Proof. For notational simplicity, we express the random variables of skew-SGLD, which uses αJ as an interaction term, as X_k^{⊗N}, and those of S-PLD as Y_k^{⊗N}. In this section, for simplicity, we write them as X_k and Y_k. We denote the measures of X_k and Y_k by ν_{kh}^{⊗N} and µ_{kh}^{⊗N}, respectively, together with the corresponding marginal measures. Then, we first decompose the bias as

where we used the Jensen inequality for the first term in the last inequality and moved (1/N)∑_{i=1}^N outside | · |. In addition, each expectation depends only on the marginal measures µ^{(i)} in the first term, and we use the property of the 2-Wasserstein (2-W) distance. Furthermore, we decompose the first term as

where µ_{kh}^{(n)}(0) denotes the measure induced by PLD, i.e., naive parallel sampling without a skew-symmetric interaction.
In conclusion, our task is to bound each of the terms (i), (ii), and (iii) above. Bounding (i) was already discussed in Appendix I.1.
Next, we work on (ii) and (iii). Following [10], we use the weighted CKP inequality to bound the 2-W distance. From Bolley and Villani [47], using the weighted CKP inequality, we can bound each 2-W distance by the relative entropy (KL divergence). This weighted CKP inequality indicates that

and

We point out that the choice of the constant C in the weighted CKP inequality is important. The reason is that C_{µ_{kh}^{⊗N}(α)} is O(dN), i.e., it depends linearly on N, because the interaction term couples the parallel chains and we cannot easily decompose the joint measure; this would result in an unsatisfactory dependency on N. This is why we introduced µ_{kh}^{(i)}(0) in our theoretical analysis. Note that since µ_{kh}^{(n)}(0) is induced by the naive parallel chain, its marginals are independent of each other and take the same measure if the initial measures are the same. Thus, µ_{kh}^{(1)}(0) = · · · = µ_{kh}^{(N)}(0). From now on, we write this marginal as µ_{kh}(0) for simplicity, and likewise write C_{µ_{kh}(0)}. Substituting the above weighted CKP inequalities and using the Jensen inequality, we obtain

To analyze the discretization error, we use the following key lemma. We denote the product space as Ω^{⊗N} := Ω_1 × · · · × Ω_N. Let us introduce X = (X_1, . . . , X_N) ∈ Ω^{⊗N} and Y = (Y_1, . . . , Y_N) ∈ Ω^{⊗N}, express their joint probability measures as P(X) := P(X_1, . . . , X_N) and Q(Y) := Q(Y_1, . . . , Y_N), and denote the marginal measures of each X and Y accordingly.

A proof is given in Appendix J.1. We apply this lemma as

Combining these results with the above bias bound, we obtain

Thus, we need to bound KL(µ_{kh}^{(i)}(α)‖µ_{kh}^{⊗N}(0)), KL(ν_{kh}^{⊗N}(α)‖µ_{kh}^{⊗N}(0)), and C_{µ_{kh}(0)}. We can upper-bound them using the results of [2]. For that purpose, we need to replace the constants in [2] as shown below; here, we discuss how the constants in the assumptions change in the ensemble scheme. We define

∇u^{⊗N}(x^{⊗N}) := (∇u(x^{(1)}), . . . , ∇u(x^{(N)})). (A125)

First, we focus on the smoothness condition. From Assumption 2 and Lemma 8 in [18], we have

where the norm on the right-hand side is the Euclidean norm in R^{dN}.
Next, we discuss the smoothness condition for the skew-perturbed drift. Define ∇U_α(x^{⊗N}) := ∇U^{⊗N}(x^{⊗N}) + αJ∇U^{⊗N}(x^{⊗N}). Then, for x^{⊗N} ∈ R^{dN}, under Assumptions 1 to 6, we have

Next, we check the condition on the drift function at the origin, ‖∇u(0, z)‖ ≤ B. We can calculate it in the same way as the smoothness condition, and we have

Next, we study the condition on the stochastic gradient:

This can easily be modified to

Finally, we discuss the initial condition, κ_0 := log ∫_{R^d} e^{‖x‖²} p_0(x) dx < ∞. We assume that the initial probability distribution is a product measure whose marginals are all the same. Then

In this way, the constants in the assumptions are modified and expressed with N and α. Combined with the results of [2], we can derive the following relations,

where

This concludes the proof.
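As a quick numerical sanity check of how the smoothness constant changes under the skew perturbation, the sketch below verifies the operator-norm identity ‖I + αJ‖₂ = √(1 + α²‖J‖₂²) for a skew-symmetric J, which is the factor by which the Lipschitz constant of the drift is scaled; the matrix and constants are illustrative:

```python
import numpy as np

# Sketch (assumed, not the paper's exact constants): if ∇U is M-Lipschitz,
# the skew-perturbed drift (I + aJ)∇U is Lipschitz with constant
# ||I + aJ||_2 * M, and for skew-symmetric J one has
# ||I + aJ||_2 = sqrt(1 + a^2 ||J||_2^2), since I + aJ is a normal matrix
# with singular values sqrt(1 + a^2 mu_j^2), where ±i mu_j are the
# eigenvalues of J.
rng = np.random.default_rng(2)
d, alpha = 4, 0.5
B = rng.standard_normal((d, d))
J = B - B.T                              # skew-symmetric
P = np.eye(d) + alpha * J

op_norm = np.linalg.norm(P, 2)           # spectral norm
predicted = np.sqrt(1.0 + alpha**2 * np.linalg.norm(J, 2)**2)
assert np.isclose(op_norm, predicted)
```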
Appendix J.1. Proof of Lemma A1

Proof. We prove this lemma using the Donsker–Varadhan representation of the relative entropy [48]. The relative entropy admits the dual representation

where the supremum is taken over all functions T for which the expectations of e^T and T are finite. We then restrict the function class to F(T) = {T(X) | ∃T_i : T(X) = ∑_{i=1}^N T_i(X_i)}. Then we have
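The restricted class of sums T(X) = ∑_i T_i(X_i) is tight for product measures, where the relative entropy tensorizes. A small numerical illustration with Gaussians (not part of the proof):

```python
import numpy as np

# For product measures, KL(P^{⊗N} || Q^{⊗N}) = N * KL(P || Q), which is
# what the restricted Donsker-Varadhan class of separable test functions
# recovers exactly in that case.
def kl_gauss(mu1, s1, mu2, s2):
    """KL divergence between 1-D Gaussians N(mu1, s1^2) and N(mu2, s2^2)."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

N = 5
marginal = kl_gauss(1.0, 1.0, 0.0, 2.0)

# KL between the N-fold product Gaussians, via the closed form for
# multivariate Gaussians:
mu1, mu2 = np.ones(N), np.zeros(N)
cov1, cov2 = np.eye(N), 4.0 * np.eye(N)
kl_joint = 0.5 * (np.trace(np.linalg.inv(cov2) @ cov1)
                  + (mu2 - mu1) @ np.linalg.inv(cov2) @ (mu2 - mu1)
                  - N + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))
assert np.isclose(kl_joint, N * marginal)
```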

Appendix K. Bias Expansion for S-PLD
Recall that the bias of S-PLD is

First, we discuss the convergence of the continuous dynamics. Using the eigenvalue expansion in Theorem 6, with some positive constant d_0, we have

Then, assuming α² is small enough and considering the Taylor expansion, we have

As for the discretization and stochastic gradient error, using the Taylor expansion, there exist positive constants d_1 and d_2 such that

Combining these terms, we have

Thus, there exists an optimal α*, which minimizes the bias. Note that at k = 0, acceleration always occurs. As k goes to infinity, the second and third terms go to 0, and thus the first term, i.e., the discretization and stochastic gradient error, becomes dominant.
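The trade-off behind the optimal α* can be visualized with hypothetical constants (m_0, c, d_1, d_2, h, and kh below are all illustrative, not values from the paper):

```python
import numpy as np

# Hypothetical constants to visualise the trade-off described above: the
# continuous-time term decays faster as alpha grows (larger Poincare
# constant, here modelled as m0 + c * alpha^2 for small alpha), while the
# discretisation / stochastic-gradient term grows like alpha^2.
m0, c, kh = 0.5, 2.0, 4.0
d1, d2, h = 1.0, 1.0, 0.01

def bias(alpha):
    converg = np.exp(-(m0 + c * alpha**2) * kh)   # continuous-dynamics term
    discret = (d1 + d2) * h * alpha**2            # extra error terms
    return converg + discret

alphas = np.linspace(0.0, 1.0, 1001)
alpha_star = alphas[np.argmin(bias(alphas))]
assert bias(alpha_star) < bias(0.0)   # some alpha > 0 strictly helps
assert 0.0 < alpha_star < 1.0         # the optimum is interior
```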

Appendix L. Hyperparameters of the Proposed Algorithm
Here we discuss how to set the hyperparameters in the algorithm. There are three hyperparameters: α_0, η, and c. We numerically found that setting c = 0.95 works well for real datasets, including the LDA experiment and the Bayesian neural network regression and classification tasks. For toy datasets, we set c = 0.9.
As for α_0 and η, we empirically found that the following scaling trick works well for real datasets, including the LDA experiment and the Bayesian neural network regression and classification tasks, together with η ≈ 0.1 α_0. The intuition is that the magnitude of the gradient can be very different in each dimension, so we introduce a scaling by the gradient. We also multiply by h so that the stochastic gradient and discretization error of the skew term do not dominate the usual gradient term. Finally, we multiply by a constant so that α_0 is not too small.
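The exact scaling formula is not reproduced above, so the following is only a hypothetical reading of the description: scale per dimension by the gradient magnitude, multiply by h, and rescale by a constant; the names c0 and eps are illustrative, not from the paper:

```python
import numpy as np

def initial_alpha(grad, h, c0=10.0, eps=1e-8):
    """Hypothetical per-dimension schedule for alpha_0.

    Consistent with the description: gradient-based scaling so dimensions
    with very different gradient magnitudes are treated comparably,
    multiplied by the step size h to keep the skew term subdominant, and
    rescaled by a constant c0 so alpha_0 is not vanishingly small.
    """
    return c0 * h / (np.abs(grad) + eps)

grad = np.array([1e-2, 1.0, 1e2])
alpha0 = initial_alpha(grad, h=0.01)
eta = 0.1 * alpha0                     # eta ≈ 0.1 * alpha_0, as stated above
assert alpha0.shape == grad.shape and np.all(alpha0 > 0)
```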

Appendix M. Proof of Theorem 8
In this section, we derive the upper bound of the bias of skew-SGLD based on [23]. This approach requires the logarithmic Sobolev inequality [19], which is stronger than the Poincaré inequality. First, we present its definition: we say that π on R^d with generator L satisfies the logarithmic Sobolev inequality with constant λ if for all functions f on R^d with ∫_{R^d} f² dπ = 1,

The logarithmic Sobolev inequality is stronger than the Poincaré inequality and induces convergence in KL divergence; see [19] for details. It was proved in [2,18] that our dynamics, i.e., LD, S-LD, PLD, S-PLD, and skew-SGLD, satisfy logarithmic Sobolev inequalities under our assumptions. We express the constant of the logarithmic Sobolev inequality for skew-SGLD as λ(α, N). This constant depends on the skew matrices and the Poincaré constant; we estimate it in Appendix M.1.
To upper-bound the bias, we control the KL divergence. We denote the law of skew-SGLD at iteration k with interaction strength α as µ_{kh}^{⊗N}(α). We upper-bound the bias by the 2-Wasserstein distance

Then, from the transportation inequality [19],

Thus, we upper-bound the KL divergence using the technique in [23]. However, the original proof uses the full gradient ∇U, so we replace it with the stochastic gradient; moreover, we introduce the skew interaction term.
First, Lemma 11 in [23] is modified to

Then, Lemma 12 in [23] is modified to

for any integrable µ.
Hereinafter, we drop ⊗N from X^{⊗N}, ∇U^{⊗N}, and ∇Û^{⊗N} for notational simplicity. Focusing on skew-SGLD at iteration k, we consider the following SDE for t ∈ (kh, (k + 1)h],

where ∇Ũ(X_k) is the stochastic gradient conditioned on X_k. The solution of this SDE is

We would like to derive the continuity equation corresponding to Equation (A154). Following [23], we write X_t as x_t and X_k as x_0 for simplicity. Let ρ_{0t}(x_0, x_t) denote the joint distribution of (x_0, x_t). Then the conditional and marginal relations are written as

The conditional density ρ_{t|0}(x_t|x_0) follows the Fokker–Planck equation

∂ρ_{t|0}(x_t|x_0)/∂t = ∇ · (ρ_{t|0}(x_t|x_0)(I + αJ)∇Ũ(x_0)) + β^{-1}Δρ_{t|0}(x_t|x_0).

Then, following [23], to derive the evolution of ρ_t, we take the expectation over ρ_0(x_0); we also take the expectation with respect to the stochastic gradient and absorb it into E_{ρ_{0|t}} for notational simplicity. Then, following the discussion of Lemma 3 in [23], we obtain

where t ∈ (kh, (k + 1)h] and X_t = X_k − t(I + αJ)∇Ũ(X_k) + √(2tβ^{-1}) ε with ε ∼ N(0, I).
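A one-step Euler–Maruyama discretization of the SDE above can be sketched as follows; this is a single-chain illustration with an assumed Gaussian target, where grad_fn stands for the (possibly stochastic) gradient ∇Ũ:

```python
import numpy as np

def skew_sgld_step(x, grad_fn, J, alpha, h, beta, rng):
    """One skew-SGLD step: x - h (I + alpha J) grad + sqrt(2h/beta) * noise."""
    g = grad_fn(x)
    drift = g + alpha * (J @ g)                        # (I + alpha J) grad
    noise = np.sqrt(2.0 * h / beta) * rng.standard_normal(x.shape)
    return x - h * drift + noise

# Sanity check on a standard Gaussian target U(x) = ||x||^2 / 2, whose
# stationary distribution is preserved by the skew perturbation.
rng = np.random.default_rng(3)
d, alpha, h, beta = 2, 0.5, 0.05, 1.0
B = rng.standard_normal((d, d))
J = B - B.T
x = np.zeros(d)
samples = []
for k in range(20000):
    x = skew_sgld_step(x, lambda y: y, J, alpha, h, beta, rng)
    if k > 2000:                                       # discard burn-in
        samples.append(x.copy())
S = np.array(samples)
# Empirical second moment should be close to the target's identity
# covariance, up to O(h) discretisation bias and Monte Carlo error.
assert np.allclose(S.T @ S / len(S), np.eye(d), atol=0.2)
```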
The technique of [49] estimates the constant of the logarithmic Sobolev inequality as follows. Assume that π on R^d with generator L satisfies the Poincaré inequality with constant m. Then, for any function u on R^d that satisfies

we find a constant b that satisfies

Then the logarithmic Sobolev constant is at least 2(b + 3/m)^{-1}. Thus, we only need to focus on the restricted function class to estimate the constant b. We slightly change Lemma 3.2 of [49], which estimates the constant b in Equation (A176), to apply it in our setting. In Lemma 3.2 of [49], it was proved that if u on R^d satisfies the conditions in Equation (A175), then for any t ∈ (0, 1), we have

LU(x) − πe 2 U(x))u 2 dπ, (A177)

where we assume that π ∝ e^{−βU(x)} satisfies the Poincaré inequality with constant m. If there exists a constant C such that

then by setting t = m/(m + |C|), we can show that

where we used the property of L; see [19] for details. As for the second term, it is lower-bounded by

Thus, by setting

we can estimate the logarithmic Sobolev constant as 2(m/(m + |C|) + 3/m)^{-1}. In our setting, this is modified to

Finally, if we increase m(α, N), then λ(α, N) increases. Thus, since m(α, N) ≥ m(α = 0, N), we obtain λ(α, N) ≥ λ(α = 0, N).
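Taking the final estimate at face value, the constant is a one-line computation; the sketch below only checks the stated monotonicity in m with illustrative values:

```python
# Sketch of the constant estimate described above, taken at face value:
# given a Poincare constant m and the bound |C|, the logarithmic Sobolev
# constant is estimated as 2 / (m/(m + |C|) + 3/m).
def log_sobolev_estimate(m, C):
    t = m / (m + abs(C))
    return 2.0 / (t + 3.0 / m)

# Increasing the Poincare constant m increases the estimate, matching the
# conclusion lambda(alpha, N) >= lambda(0, N) when m(alpha, N) >= m(0, N).
assert log_sobolev_estimate(2.0, 1.0) > log_sobolev_estimate(1.0, 1.0)
```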

Appendix M.2. Computational Complexity
To derive the computational complexity, for simplicity, we assume that δ ≤ h and

We also set α² ≤ h for simplicity. This means that the variance of the stochastic gradient is small enough and that we use a small α. Then the bias is

where

Then, defining

we have