Comparing Markov Chain Samplers for Molecular Simulation

Markov chain Monte Carlo sampling propagators, including numerical integrators for 1 stochastic dynamics, are central to the calculation of thermodynamic quantities and determination of 2 structure for molecular systems. Efficiency is paramount, and to a great extent, this is determined by the 3 integrated autocorrelation time (IAcT). This quantity varies depending on the observable that is being 4 estimated. It is suggested that it is the maximum of the IAcT over all observables that is the relevant 5 metric. Reviewed here is a method for estimating this quantity. For reversible propagators (which are 6 those that satisfy detailed balance), the maximum IAcT is determined by the spectral gap in the forward 7 transfer operator, but for irreversible propagators, the maximum IAcT can be far less than or greater 8 than what might be inferred from the spectral gap. This is consistent with recent theoretical results 9 (not to mention past practical experience) suggesting that irreversible propagators generally perform 10 better if not much better than reversible ones. Typical irreversible propagators involve a parameter 11 controlling the mix of ballistic and diffusive movement. To gain insight into the effect of the damping 12 parameter for Langevin dynamics, its optimal value is obtained here for a multidimensional quadratic 13 potential energy function. 14

observable is an expectation E[u(Q)] = u(q)ρ Q (q) dq for some "preobservable" u(q).This can be estimated by the mean ŪN = (1/N) ∑ N−1 n=0 u(Q n ), where the random values Q n are obtained from a . Note the use here of upper case to denote random variables.
In practice, sampling performance is improved if the configuration variables q are augmented with auxiliary variables p, e.g., momenta, yielding phase space variables z = (q, p).The probablity density is extended to ρ(q, p) so that ρ(q, p)dp = ρ Q (q), and an MCMC scheme is constructed to produce a chain Z 0 → Z 1 → • • • → Z N−1 where Z = (Q, P).
The samples from the chain tend to be highly correlated, and this greatly reduces the convergence rate as N → +∞.As explained in Sec. 2, the variance of an estimate for E[u(Z)] is where τ is the integrated autocorrelation time (IAcT) for u(z).The effective sample size is therefore N/τ, and the appropriate metric for evaluating a propagator is the effective sample size divided by the computing time.
In a great many practical simulations, the effective sample size is probably close to zero.One can disagree on the significance of such simulations [1].In any case, for the comparison of sampling algorithms, it is possible to choose molecular systems, restrained if necessary, for which it is feasible to attain a decent effective sample size.
Often the spectral gap is cited as the relevant quantity.To understand its role, it is helpful to express ideas in a direct way as in Ref. [2].As detailed in Sec. 2, introduce a forward transfer operator F to express the ratio ρ n /ρ in terms of ρ n−1 /ρ, where ρ n (z) is the probablity density for Z n .Let F 0 = F − E where E u denote the function with constant value E[u(Z)], Assume that the operator F 0 has its spectrum strictly inside the unit circle, as it does in practice.The error in (1/N) ∑ N−1 n=0 ρ n (z) can be shown [1] to be "proportional to" (1 − F 0 ) −1 and therefore to the reciprocal of the spectral gap |1 − λ 2 |, where λ 2 is a nonunit eigenvalue of F nearest 1.
Estimates of the IAcT are obtained by summing the terms of the autocorrelation function, which is constructed from autocovariances of increasing time lags normalized by the (lag zero) covariance.
Each term contributes a roughly equal statistical error but a signal that decays as the lag time increases.
Therefore, in practice, the terms are weighted by a function called a lag window.The lag window must be tailored to the autocorrelation function, and choosing a suitable lag window is very difficult, as mentioned in Sec.2.1.
Reliable estimates of the IAcT are impractical in general.Sec. 3 introduces the concept of quasi-reliability, which aims to enforce a necessary condition for reliable estimates.Informally, the goal is to ensure good coverage of those "parts" of phase space that has been explored, to reduce the possibility of missing an opening to an unexplored part of phase space.More precisely, for an arbitrary subset of phase space, we ask that the proportion of samples in that subset differ from its expectation by no more than some tolerance tol, with, say, 95% confidence.This is shown to be true if the IAcT τ for any preobservable u(z) satisfies τ ≤ tol 2 N. The maximum IAcT τ max is the greatest eigenvalue of where † denotes the adjoint with respect to the inner product v, u = v(z)u(z)ρ(z) dz.For a reversible propagator, where F † = F , τ max is 1 less than twice the reciprocal of the spectral gap.
However, for an irreversible propagator, it can be much smaller, as demonstrated by a simple example in Sec.4.1, or larger as in Sec. 5.As a practical algorithm, it is suggested to obtain the maximum IAcT by first discretizing the space of preobservables.Consideration of a quadratic energy potential suggests using a linear combination of phase space variables (and possibly quadratic terms in these variables).
The idea of seeking the preobservable that maximizes the IAcT is suggested already in Ref. [3], which considers a set of indicator functions as preobservables and uses the greatest IAcT of these to assess sampling thoroughness.In general, maximizing over a linear combination of preobservables can lead to a much larger result than taking the maximum of them individually, due to correlations that might exist between different preobservables.This does not necessarily apply, however, to those considered in Refs.[1,3].
Sec. 3.1 notes that that typical irreversible propagators, termed "quasi-reversible" here, have a forward transfer operator F = R F where each of F and R is reversible and R 2 = 1.For such propagators, the estimation of τ max simplifies somewhat.
Theoretical results [4] indicate that adding irreversibility reduces the autocorrelation times of observables.Sec. 4 gives a couple of very elementary examples illustrating the dramatic increase in τ max if an irreversible propagator F is replaced by its reversible "counterpart" 1 2 (F + F † ).
Discretized Langevin dynamics is a particularly effective general-purpose propagator.
Unfortunately, one must specify a value for the damping coefficient γ.Sec. 5 analyzes τ max for a quadratic potential and obtains an optimal value for the coefficient, namely, the value (3/8) 1/2 = 0.612 • • • times the critical damping coefficent for lowest frequency ω 1 .This value is such that the lowest frequency mode is moderately underdamped, with higher frequencies increasingly underdamped.

Preliminaries
Assuming that Z 0 has probability density ρ(z), the variance of the estimate U N is exactly where the autocovariances The limit N → +∞ gives Eq, (1) where the integrated autocorrelation time As an example of augmenting configuration space, consider ρ(q, p) ∝ exp(−β(V(q)) + 1 2 p T M −1 p)) where p = [p 1 , p 2 , . . ., p ν ] T .A good propagator for this is the BAOAB integrator [5] for Langevin dynamics, whose equations are where M is a matrix chosen to compress the range of vibrational frequencies, F(q) = −∇V(q), M 1/2 M T 1/2 = M, and W t is a vector of independent standard Wiener processes.Each step of the integrator consists of the following substeps: where R n is a vector of independent standard Gaussian random variables.The samples generated from this process are shown [6] to be those from a distribution that differs from the correct one by only O(∆t 2 ).
The special choice γ = 1/(2∆t) is the Euler-Leimkuhler-Matthews integrator [5] for Brownian dynamics with step size δt = ∆t 2 /2.Remarkably, the invariant density of this integrator differs from the correct one by only O(δt 2 ).This integrator can be expressed as a Markov chain configuration space, with propagator which is a discretization of Brownian dynamics The desired samples {Q n } are available as part of the process (and, as a theoretical observation, they can be recovered from the Markov chain {Q n } alone, by eliminating R n in the two equations above and For any MCMC propagator, the forward transfer operator is defined so that where u n (z) = ρ n (z)/ρ(z) and ρ n (z) is the probability density for Z n .In particular, where ρ(z|z ) is the transition probablity for the chain.The operator F has an eigenfunction ϕ 1 (z) ≡ 1 for eigenvalue λ 1 = 1.
A reversible propagator is one that satisfies detailed balance, which means that Detailed balance is equivalent to F † = F , where the adjoint F † is defined by the condition that F v, u = v, F † u for arbitrary u(z) and v(z).The BAOAB integrator is not reversible as a sampling propagator, except for the special case γ = 1/(2∆t); but a scheme consisting of a fixed number of BAOAB steps followed by a momenta flip is reversible.The unmodified BAOAB integrator is, however, in a class of "quasi-reversible" integrators introduced in Sec.3.1.

Estimating integrated autocorrelation time
Suggested [7] as a covariance estimate is the quantity with justification in Ref. [8, pp. 323-324].
The use of all possible terms C N (k)/C N (0) in Eq. ( 3) to estimate the IAcT does not converge in the limit N → +∞, so, in practice, a lag window w(k) is used to increasingly damp terms as the noise to signal ratio increases: An interesting algorithm called acor for estimating the IAcT is available on the web [9].Estimating the IAcT can be quite difficult, and acor can give unsatisfactory results.An attempt to improve it [1] is at best marginally successful.For reversible methods, there are properties of the autocorrelation function that may be useful for improving estimates of it [7].

Quasi-reliable Estimates of Sampling Thoroughness
The idea of quasi-reliability is to require that the sample size N be large enough that, with say 95% confidence, the estimated probability of any subset Ω of phase space differs by no more than tol from its correct value.More specifically, for any subset Ω of phase space, an estimate 1 Ω,N of where τ Ω is the IAcT for 1 Ω , it is enough that For molecular simulations, this requires that only those conformations or clusters of conformations having a probability greater than tol be sampled.
In practice, molecular systems have many symmetries, which dramatically reduces the amount of sampling needed.For example, water molecules are generally considered interchangeable as are many subsets of atoms on a given molecule, e.g., the 2 hydrogen atoms of any water molecule.More formally, there are permutations P of the variables z such that ρ(Pz) = ρ(z) and that P −1 F P = F where (P u)(z) = u(Pz).For such symmetries P, the quasi-reliablity requirement considers only those It is helpful to express the IAcT in terms of the forward transfer operator.It can be shown that and, in particular, using the fact that E F 0 = F 0 E = 0.The integrated autocorrelation time can be rewritten as which is a self-adjoint operator.
For a reversible propagator, for which F and hence F 0 is self adjoint, an arbitrary preobservable u is in many cases expressible as a linear combination of the eigenfunctions ϕ k (z) of F 0 , corresponding to (For a more rigorous treatment, see Ref. [7,Sec. 2].)The IAcT for u is then simply a weighted average, of the values is the largest of these values.Note that, for λ 2 negative, τ could be much less than 1.Having τ < 1 may appear paradoxical until it is recognized that Eq. ( 1) is simply an asymptotic approximation for N → +∞.
For the simple example with F(q) = −ω 2 q, the eigenfunction ϕ 2 (q) = √ 2ωq.The indicator function 1 Ω that is richest in ϕ 2 (q) is the one with Ω = [0, +∞], for which the first weight is 2/π.This means that the IAcT for 1 Ω is at least 2/π of the maximum IAcT.For a multimodal distribution, the eigenfunction ϕ 2 (q) corresponding to the subdominant eigenvalue λ 2 resembles an indicator function more closely [2] than does √ 2ωq.Therefore, little is lost and simplicity is gained, if we use the maximun IAcT over all observables satisfying the symmetries instead of just indicator functions: where is our set of preobservables.This can be simplified to To see this, note that the same supremum is obtained for both objective functions if the function u is restricted so that E u = 0 and that for such a function u the two objective functions are equal.
For the simple example of Eq. ( 8), the IAcT is maximized by u(q) = q.In the case of a multidimensional Gaussian distribution, the maximum occurs for linear combination a T q of q where a is the eigenvector of the Hessian of V(q) corresponding to its smallest eigenvalue ω 2 1 .For multimodal distributions in 1 dimension, it appears that the maximizing preobservable ϕ 2 (q) is qualitatively similar to q in the sense that it is monotonic [2].
As is customary when seeking an unknown function, one considers a finite linear combination u(q) = a T u(q) of basis functions u i ∈ W where the a i are chosen to maximimize the IAcT.From Eq. ( 10), the autocovariance for such u(q) is where Consequently, which is a solution of the generalized eigenvalue problem Without information about the actual distribution ρ(z), a general choice for basis functions might be linear polynomials, which are the "subdominant" eigenfunctions for a Gaussian distribution.Simple examples in Ref. [1] (consisting of a mixture of 2 Gaussians, a one-node neural "network", and logistic regression) demonstrate that the use of linear basis functions can yield an IAcT much greater than that of a preobservable of "interest".For molecular simulation, it is clear that instead of atomic coordinates, a better choice of a basis function the distance between two atoms, each of which is uniquely distinguishable.In particular, for a protein, one might choose α-carbons distributed along the backbone chain of a protein.For further suggestions consult Ref. [10], which considers the automatic construction of indicator functions for estimating τ max , based on the dynamics of the propagator.

Quasi-reversible propagators
As stated above, the BAOAB integrator would be reversible if the final B substep were modified to include a momenta flip, i.e., BR: If the flipped BAOAB integrator is followed by another momenta flip, the result is the original irreversible BAOAB integrator, which intuitively and empirically is a superior sampler.More generally, a sampler might be said to be "quasi-reversible" if its forward transfer operator F = R F where each of F and R is reversible and R 2 = 1.For a momenta flip, the operator R is defined by (Ru)(q, p) = u(q, −p).
Quasi-reversible propagators are special in that their covariance matrices satisfy a special property if basis functions u i are chosen so that and I and I are identity matrices of possibly different dimension.Such basis functions are easy to construct since any function u can be written as a sum of its "even part" 1 2 (u + Ru) and its "odd part" 1 2 (u − Ru).In particular, using the fact that RE = E R, The matrix C k thus partitions into 4 blocks.The symmetric part of C k consists of the two diagonal blocks, and the skew-symmetric part consists of the two off-diagonal blocks.Empirical estimates of C k lack these symmetry properties.The expected symmetries provides twice as many sample for estimating the sampling error in the off-diagonal elements.Additionally, since the IAcT requires only the symmetric part of C k , it is unnecessary to compute the off-diagonal blocks, and the eigenvalue problem decouples into two smaller problems.Also, it follows that the maximizing linear combination is either a linear combination of the even functions or of the odd functions.For molecular dynamics at least, it seems to be the case that the long time scales are present only in the position coordinates, so only the (1, 1) block might be computed.

Irreversible Samplers and Their Superiority
Uniform sampling-if it could be used-converges like O(1/N), which is superior to the O(1/ √ N) convergence of random sampling.It would be desirable to combine them by promoting near-uniform sampling within layers of near-constant energy and using diffusion to move among energy layers.

A very simple example
The following very simple example is an analog of Example 2.8 of Ref. [4] and similar to an example of Ref. [11].Assuming probabilities are represented as row vectors, let F be an n by n probability transition matrix given by where 0 < θ < 1.The stationary probability vector is (1/n)[1, 1, . . ., 1].The inner product for two row vectors u T , v T is v T , u T = v T u, and the adjoint of a transition matrix is its transpose.Since F is a circulant matrix, both it and its adjoint have as eigenvectors , where ζ denotes the nth root of unity exp(2πi/n).A straightforward, though lengthy, calculation shows that for k = 2, 3, . . ., n, and This a most elementary illustration of the assertion [4] that "the asymptotic variance for every observable can be made as small as desired under large irreversible drift".If θ 1, the IAcT is likewise much less than 1 despite the very strong correlation between successive values of the state j.The reason for this is that the IAcT is a measure of sample quality not independence.The measure τ simply says that N samples are as good as N/τ independent random samples.And the F operator encourages a uniform sampling, which is better than uncorrelated random samples.However, as shown in the example that follows, such a dramatic difference between (one less than twice) the reciprocal of the spectral gap and the IAcT does not hold when sampling is inhibited more by energy than by entropy barriers.A reversible propagator can be formed from this simple example by using instead the symmetric part of its propagator: and Clearly, the irreversible propagator is much better for n 1.

A very simple example with a barrier
Consider the 2n by 2n transition matrix given by If ε were zero, the eigenvalue 1 would have multiplicity 2, with orthogonal eigenvectors given by the vector of all ones and by f T = [1, 1, . . ., 1, −1, −1, . . ., −1].For small ε > 0, it is expected that the eigenvector corresponding to the subdominant eigenvalue would be a perturbation of f T .Indeed, it happens that f T F n = (1 − 2ε) f T .This also holds for (F n ) T , so the IAcT for F n is 1/ε − 1.This suggests a value of about n(1/ε − 1) for the IAcT of F , which is corroborated by

Optimal Langevin Damping for a Model Problem
To analyze the effect of the damping parameter γ on the IAcT for Langevin dynamics, given by Eq. ( 4), consider the standard model problem F(q) = −mω 2 q.Changing variables Q t = (βm) −1/2 Q t , P t = β −1/2 m 1/2 P t , and dropping the primes gives Assume exact integration with step size ∆t.
The analysis is rather lengthy, so for the benefit of the reader who wishes to omit it, the discussion and conclusions are given here: 1. Reaching precise conclusions is difficult for most of the analysis unless one assumes that ∆t is not too large, where "not too large" seems to be satisfied in practice.This assumption underlies the statements that follow.
2. The spectral gap is an increasing function of ω, so for a multidimensional quadratic potential energy, the value of γ that maximizes the spectral gap depends on the lowest frequency ω 1 .
3. The spectral gap is maximized for γ ≤ 2ω, corresponding to an underdamped system, for which the spectral gap is ω∆t + O(∆t 2 ).
4. The eigenfunctions of the operator G can be partitioned into eigenspaces P k , k = 0, 1, . .., where P k is a linear combination of k + 1 specific polynomials of degree k in ωq and p.The greatest max is the maximum IAcT over P k .
5. Figure 1 shows that, for fixed ∆t and γ, the value τ (k) max is an increasing function of ω, at least for k = 1, 2, 3, 4. Hence, as for the spectral gap, it is the lowest frequency ω 1 that dictates the maximum τ.

The eigenfunction for τ
(1) max is the linear polynomial ωq.For such a preobservable, it may not seem correct that the IAcT becomes arbitrary close to zero as γ goes to zero.This does not mean, however that the variance goes to 0, because Eq. ( 1) is an asymptotic result, and the IAcT is a prefactor for the leading order 1/N term in the expression for the variance.The next order 1/N 2 term would dominate if γ were very small.An order 1/N 2 error is characteristic of uniform sampling, which would be the consequence of near-Hamiltonian dynamics.Also, linear polynomials are special in that their expectation is independent of total energy, so it matters little that near-Hamiltonian dynamics poorly samples different values of total energy.

The eigenfunction for τ
(2) max is a specific linear combination of ω 2 q 2 − 1 and p 2 − 1.For quadratic polynomials, the total energy does affect its expected value, which is why it is necessary that γ be large enough to sample different energies on a reasonable time scale.

The forward transfer operator
The forward transfer operator is where and L K is the Fokker-Planck operator [12, Eq. (10.1)-(10.3),(10.9)] Using the relation L f = 1 ρ L K (ρ f ), it is easy to show that the adjoint

Gamma for maximum spectral gap
Ref. [ with eigenfunctions where with the radical sign denoting the principal square root, and operators A, B defined by The operator L has eigenfunctions ρ(q, p) −1 ψ k,l (q, p) and the same eigenvalues; F has these same eigenfunctions and has eigenvalues exp(∆t µ k,l ).
The eigenvalue nearest 1 depends on ∆t.For small enough ∆t, it is exp(−γ − ∆t), and the spectral To maximize the spectral gap, choose γ to be no greater than 2ω, corresponding to an underdamped system.

Gamma for maximum IAcT
To obtain τ max , one needs eigenelements for G, given by Eq. ( 2) with F = exp(∆tL).Note that the subspace of polynomials of degree ≤ k is closed under application of L and L † .Moreover, this holds separately for subspace P k of odd polynomials of degree ≤ k for k odd and for subspace P k of even polynomials of degree ≤ k for k even.And it applies also to operators E , F , F † , and G. G † = G and Gv ∈ P k−2 .Let u be a basis for P k chosen so there are even functions of p followed by odd functions of p.In particular, for k = 1, 2, 3, 4, respectively.The polynomials in ωq and in p are modified versions He k of Hermite polynomials.We have Lu = Au for some matrix of constants A, given by (Cf.Risken (1984), Eqs.(10.96),(10.97)for L K .)We have E u = 0 and F 0 u = F u = exp(∆tA)u.Therefore, from Eqs. ( 13) and ( 14), one has For each value of k, τ max is the maximum eigenvalue of As explained in Sec.3.1, if the basis functions are re-ordered so that those that are even functions of p precede those that are odd functions of p, the eigenvalue problem splits into 2 nearly equal parts.
The greater of τ ± is τ + , and it is a straightforward exercise to show that τ + is an decreasing function of ω as along as (ω∆t Peer-reviewed version available at Entropy 2017, 19, 561; doi:10.3390/e19100561

Discussion and Conclusions
It is asserted in the literature [1,3] that for comparing MCMC propagators and for ensuring reasonably reliable simulation results, it is of great value to estimate the maximum autocorrelation time over all possible observables.Certainly, this would seem desirable for comparing propagators, since it is best to assume as little as possible about the actual observables that are to be estimated from the Markov chain.Also, the longest IAcT would be most reliably estimated, because its accurate estimation would call for a greatest number of samples; moreover, the accuracy of estimates of other IAcTs might be influenced by the longest IAcT.Unfortunately, estimating IAcTs and assessing such estimates is a difficult task, which could benefit greatly from further research.
In this article, it is suggested that the maximum IAcT be estimated by considering an arbitrary linear combination of basis functions chosen to capture the lowest frequency motions of the dynamics of the MCMC chain.Also appealing is the approach based on a set of well chosen indicator functions [10].
Another quantity for measuring performance of an MCMC propagator is the spectral gap.It is related to equilibration time, but as a predictor of IAcTs, it can be arbritarily too pessimistic or optimistic.This may occur for irreversible propagators applied to energy landscapes where entropy barriers dominate energy barriers.
Shorter integrated autocorrelation times are thus a primary consideration in the design of MCMC propagators.To this end, it is advantageous not to insist on reversibility.In particular, the ballistic component of propagators like hybrid Monte Carlo and Langevin dynamics may offer dramatic speedups for overcoming entropy barrriers, though no advantage for energy barriers.
Irreversible propagators typically have a parameter that determines the extent of diffusive behavior.
For discrete Langevin dynamics applied to a multidimensional quadratic potential, the optimal value corresponds to slightly underdamped dynamics for the lowest frequency mode, with other modes experiencing even lighter dampling.Although this conclusion is not necessarily indicative of the optimal damping for the usual situation, in which there a multiplicity of energy minima, it can be helpful to begin with an understanding of simple situations.

Table 1 .
τ max is the IAcT for F and τ rv max is that for F rv = 1 2 (F + F T )

Table 1
(a).The excess IAcT due to making the propagator reversible is given by Table1(b).It appears to grow quadratically with n, consistent with the diffusive nature of the propagator.Summing pairs of entries from the two tables shows that the advantage of irreversibility depends on the relative importance of entropy barriers to energy barriers.

preprints.org) | NOT PEER-REVIEWED | Posted: 7 September 2017 doi:10.20944/preprints201709.0021.v1
Hence, the P k are eigenspaces ofr G.These eigenspaces can be further decomposed as follows.Let P k = P k for k < 2, andlet P k = P k ∩ P ⊥ k−2 for k ≥ 2.The claim is that P k is an eigenspace, of dimension k + 1.To confirm this, it is enough to show that Gu, Peer-reviewed version available at Entropy 2017, 19, 561; doi:10.3390/e19100561 v = 0 for any u ∈ P k , v ∈ P k−2 , which follows almost immediately, since Preprints (www.