Geometric ergodicity of the Random Walk Metropolis with position-dependent proposal covariance

We consider a Metropolis-Hastings method with proposal kernel $\mathcal{N}(x,hG^{-1}(x))$, where $x$ is the current state. After discussing specific cases from the literature, we analyse the ergodicity properties of the resulting Markov chains. In one dimension we find that suitable choice of $G^{-1}(x)$ can change the ergodicity properties compared to the Random Walk Metropolis case $\mathcal{N}(x,h\Sigma)$, either for the better or worse. In higher dimensions we use a specific example to show that judicious choice of $G^{-1}(x)$ can produce a chain which will converge at a geometric rate to its limiting distribution when probability concentrates on an ever narrower ridge as $|x|$ grows, something which is not true for the Random Walk Metropolis.


Introduction
Markov chain Monte Carlo (MCMC) methods are techniques for estimating expectations with respect to some distribution π(·), which need not be normalised. This is done by sampling a Markov chain which has limiting distribution π(·), and computing empirical averages. A popular form of MCMC is the Metropolis-Hastings algorithm [1,2], where at each time step a 'proposed' move is drawn from some candidate distribution, and then accepted with some probability, otherwise the chain stays at the current point. Interest lies in finding choices of candidate distribution that will produce sensible estimators for expectations with respect to π(·).
The quality of these estimators can be assessed in many different ways, but a common approach is to understand conditions on π(·) that will result in a chain which converges to its limiting distribution at a geometric rate. If such a rate can be established, then a Central Limit Theorem will exist for expectations of functionals with finite second absolute moment under π(·) if the chain is reversible. A simple yet often effective choice of candidate distribution is a symmetric one centred at the current point in the chain (with a fixed variance), resulting in the Random Walk Metropolis (RWM) (e.g. [3]). The convergence properties of a chain produced by the RWM are well-studied. In one dimension, convergence is geometric essentially whenever π(x) decays at an exponential or faster rate in the tails [4], while in higher dimensions an additional curvature condition is required [5]. Slower rates of convergence have also been established in the case of heavier tails [6].
Recently, some MCMC methods have been proposed which generalise the RWM, whereby proposals are still centred at the current point x and symmetric, but the variance changes with x [7,8,9,10,11]. An extension to infinite-dimensional Hilbert spaces is also suggested in [12]. The motivation is that the chain can become more 'local', perhaps making larger jumps when out in the tails, or mimicking the local dependence structure of π(·) to propose more intelligent moves. Designing MCMC methods of this nature is particularly relevant for modern Bayesian inference problems, where posterior distributions are often high dimensional and exhibit nonlinear correlations [13]. We term this approach the Position-Dependent Random Walk Metropolis (PDRWM), although technically this is a misnomer, since proposals are no longer random walks. Other choices of candidate distribution designed for distributions that exhibit nonlinear correlations were introduced in [13]. Although powerful, these require derivative information for log π(x), something which can be unavailable in modern inference problems (e.g. [14]). We note that no such information is required for the PDRWM, as evidenced by the particular cases suggested in [7,8,9,10,11]. However, there are relations between the approaches, to the extent that understanding how the properties of the PDRWM differ from the standard RWM should also aid understanding of the methods introduced in [13].
In this article we consider the convergence rate of a Markov chain generated by the PDRWM to its limiting distribution. Our main interest lies in whether this generalisation can change these ergodicity properties compared to the standard RWM with fixed covariance. We focus on the case where the candidate distribution is Gaussian, and in one dimension we establish necessary and sufficient growth conditions on the proposal variance and tail behaviour of π(x) for geometric ergodicity. Some of the results extend naturally to higher dimensions, but we also offer an illustrative example showing that the curvature condition can be alleviated when the proposal covariance is allowed to change with position. In Section 2 necessary concepts about Markov chains are briefly reviewed, before the PDRWM is introduced in Section 3. One-dimensional results are given in Section 4, before those for higher dimensions in Section 5 and a discussion in Section 6. Throughout π(·) denotes a probability distribution, and π(x) its density with respect to Lebesgue measure.

Markov Chains & Geometric Ergodicity
We will work on the measurable space $(\mathcal{X}, \mathcal{B})$, so that each $X_t \in \mathcal{X}$ for a discrete-time Markov chain $\{X_t\}_{t \geq 0}$ with time-homogeneous transition kernel $P : \mathcal{X} \times \mathcal{B} \to [0,1]$, where $P(x, A) = \mathbb{P}[X_{i+1} \in A \mid X_i = x]$ and $P^n(x, A)$ is defined similarly for $X_{i+n}$. All chains we consider will have invariant distribution π(·), and be both π-irreducible and aperiodic, meaning π(·) is the limiting distribution from π-almost any starting point [15]. We use | · | to denote the Euclidean norm.
In Markov chain Monte Carlo the objective is to construct estimators of $E_\pi[f]$, for some $f : \mathcal{X} \to \mathbb{R}$, by computing
$$\hat{f}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i).$$
If π(·) is the limiting distribution for the chain then P will be ergodic, meaning $\hat{f}_n \to E_\pi[f]$ almost surely from π-almost any starting point. For finite n the quality of $\hat{f}_n$ intuitively depends on how quickly $P^n(x, \cdot)$ approaches π(·). We call the chain geometrically ergodic if
$$\| P^n(x, \cdot) - \pi(\cdot) \|_{TV} \leq M(x) \rho^n \tag{1}$$
from π-almost any $x \in \mathcal{X}$, for some $M(x) > 0$ and $\rho < 1$, where $\| \mu(\cdot) - \nu(\cdot) \|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|$ is the total variation distance between distributions µ(·) and ν(·) [15]. Geometric ergodicity implies that if $E_\pi[|f|^{2+\delta}] < \infty$ for some δ > 0, then
$$\sqrt{n} \left( \hat{f}_n - E_\pi[f] \right) \xrightarrow{d} N(0, v(P, f)) \tag{2}$$
for some asymptotic variance v(P, f). Equation (2) enables the construction of asymptotic confidence intervals for $\hat{f}_n$ [15]. Several techniques now exist for constructing non-asymptotic confidence intervals (e.g. [16,17,18]), but at present it is not yet clear whether these can be applied in the same sort of generality as (2). In some cases, such approaches rely on either geometric ergodicity or the equivalent condition of a spectral gap existing for P [19], so (1) must also be established for many of these non-asymptotic results to hold (e.g. [17]). Geometric ergodicity is also often a requirement in establishing the stability of noisy Markov chains in which P is approximated due to either intractability or computational convenience [20,21] (in other instances slightly weaker but related conditions are needed [22]).
In practice, geometric ergodicity does not guarantee that $\hat{f}_n$ will be a sensible estimator, as M(x) can be arbitrarily large if the chain is initialised far from the typical set under π(·), and ρ may be very close to 1. However, chains which are not geometrically ergodic can often either get 'stuck' for a long time in low-probability regions or fail to explore the entire distribution adequately, sometimes in ways which are difficult to diagnose using standard MCMC diagnostics.

Establishing geometric ergodicity
It is shown in Chapter 15 of [23] that (1) is equivalent to the condition that there exists a Lyapunov function $V : \mathcal{X} \to [1, \infty)$ and some $\lambda < 1$, $b < \infty$ such that
$$PV(x) \leq \lambda V(x) + b \mathbb{1}_C(x), \tag{3}$$
where $PV(x) := \int V(y) P(x, dy)$. The set $C \subset \mathcal{X}$ must be small, meaning that for some $m \in \mathbb{N}$, ε > 0 and probability measure ν(·)
$$P^m(x, A) \geq \varepsilon \nu(A) \tag{4}$$
for any x ∈ C and $A \in \mathcal{B}$. Equations (3) and (4) are referred to as drift and minorisation conditions. Intuitively, C can be thought of as the centre of the space, and (3) ensures that some one dimensional projection of $\{X_t\}_{t \geq 0}$ drifts towards C at a geometric rate when outside. In fact, (3) is sufficient for the return time distribution to C to have geometric tails [23]. Once in C, (4) ensures that with some probability the chain forgets its past and hence regenerates. This regeneration allows the chain to 'couple' with another started at stationarity, giving a bound on the total variation distance through the coupling inequality [15]. More intuition is given in [24].
Transition kernels considered here will be of the Metropolis-Hastings type, given by
$$P(x, dy) = \alpha(x, y) Q(x, dy) + r(x) \delta_x(dy), \tag{5}$$
where $Q(x, dy) = q(y|x) dy$ is some candidate kernel, α is the 'acceptance rate' and $r(x) = 1 - \int \alpha(x, y) Q(x, dy)$. Here we choose
$$\alpha(x, y) = 1 \wedge \frac{\pi(y) q(x|y)}{\pi(x) q(y|x)}, \tag{6}$$
where a ∧ b denotes the minimum of a and b. This choice implies that P satisfies detailed balance for π(·) [25], and hence the chain is reversible (note that other choices for α can result in non-reversible chains, see [26] for details). In this case (2) applies to a slightly broader class of functionals, namely those with $E_\pi[|f|^2] < \infty$ [19]. Roberts & Tweedie [5], following on from [23], introduced the following regularity conditions.
Theorem 1. (Roberts & Tweedie). Suppose that π(x) is bounded away from 0 and ∞ on compact sets, and there exists $\delta_q > 0$ and $\varepsilon_q > 0$ such that, for every x,
$$|x - y| \leq \delta_q \Rightarrow q(y|x) \geq \varepsilon_q.$$
Then the chain with kernel (5) is $\mu^{Leb}$-irreducible and aperiodic, and every nonempty compact set is small.
For the choices of Q considered in this article these conditions hold, and we will restrict ourselves to forms of π(x) for which the same is true (apart from a specific case in Section 5). Under Theorem 1 then (1) only holds if a Lyapunov function $V : \mathcal{X} \to [1, \infty)$ with $\int V(y) \pi(dy) < \infty$ exists such that
$$\limsup_{|x| \to \infty} \frac{PV(x)}{V(x)} < 1. \tag{7}$$
When P is of the Metropolis-Hastings type, (7) can be written
$$\limsup_{|x| \to \infty} \int \left[ \frac{V(y)}{V(x)} - 1 \right] \alpha(x, y) Q(x, dy) < 0. \tag{8}$$
In this case a simple criterion for lack of geometric ergodicity is
$$\limsup_{|x| \to \infty} r(x) = 1. \tag{9}$$
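Intuitively this implies that the chain is likely to get 'stuck' in the tails of a distribution for large periods. Jarner & Tweedie [27] introduce a necessary condition for geometric ergodicity through a tightness condition.
Theorem 2. (Jarner & Tweedie). If for any ε > 0 there is a δ > 0 such that for all $x \in \mathcal{X}$
$$P(x, B_\delta(x)) > 1 - \varepsilon,$$
where $B_\delta(x) := \{y \in \mathcal{X} : |x - y| < \delta\}$, then P can only be geometrically ergodic if for some s > 0
$$\int e^{s|x|} \pi(dx) < \infty.$$
The result highlights that when π(·) is heavy-tailed the chain must be able to make very large moves and still be capable of returning to the centre quickly for (1) to hold. In the Metropolis-Hastings case it is straightforward to see that
$$P(x, B_\delta(x)) \geq Q(x, B_\delta(x)),$$
since a rejected proposal leaves the chain at x, which is a useful approach to establishing lack of (1) in the heavy-tailed case.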
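The drift integral in (8) can be estimated numerically at a fixed point x, which gives a quick sanity check before attempting a proof. The following is a minimal sketch for the one-dimensional RWM with proposal N(x, h). The Laplace-type target, the Lyapunov function V(x) = exp(s|x|) with s = 0.5, and the function name `drift_integral_rwm` are illustrative assumptions, not choices made in the references.

```python
import numpy as np

def drift_integral_rwm(x, log_pi, V, h=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of int [V(y)/V(x) - 1] alpha(x, y) Q(x, dy)
    for a one-dimensional Gaussian random walk proposal N(x, h)."""
    rng = np.random.default_rng(seed)
    y = x + np.sqrt(h) * rng.standard_normal(n)
    # Symmetric proposal, so alpha(x, y) = 1 ∧ pi(y)/pi(x); work on the log scale.
    log_alpha = np.minimum(0.0, log_pi(y) - log_pi(x))
    return np.mean((V(y) / V(x) - 1.0) * np.exp(log_alpha))

# Illustrative assumptions: target pi(x) ∝ exp(-|x|) and V(x) = exp(0.5 |x|).
log_pi = lambda x: -np.abs(x)
V = lambda x: np.exp(0.5 * np.abs(x))

for x in (5.0, 10.0, 20.0):
    print(x, drift_integral_rwm(x, log_pi, V))
```

A negative estimate far out in the tails is consistent with (8), although a finite-sample estimate at a handful of points is of course no substitute for the limiting argument.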

Position-dependent Random Walk Metropolis
In the RWM, Q(x, dy) = q(|y − x|)dy, meaning (6) reduces to α(x, y) = 1 ∧ π(y)/π(x). A common choice is Q(x, ·) = N (x, hΣ), with Σ chosen to mimic the global covariance structure of π(·) [3]. Various results exist concerning the optimal choice of h in a given setting (e.g. [28]). It is straightforward to see that Theorem 2 holds here, so that the tails of π(x) must be uniformly exponential or lighter for geometric ergodicity. In one dimension this is in fact a sufficient condition [4], while for higher dimensions additional conditions are required [5]. We return to this case in Section 5.
In the PDRWM $Q(x, \cdot) = N(x, hG^{-1}(x))$, so (6) becomes
$$\alpha(x, y) = 1 \wedge \frac{\pi(y) |G(y)|^{1/2}}{\pi(x) |G(x)|^{1/2}} \exp\left( -\frac{1}{2h} (x - y)^T [G(y) - G(x)] (x - y) \right).$$
The intuition here is that proposals are more able to reflect the local dependence structure of π(·). In some cases this dependence may vary greatly in different parts of the state-space, making a global choice of Σ ineffective [9]. Readers familiar with differential geometry will recognise the volume element $|G(x)|^{1/2} dx$ and the linear approximations to the distance between x and y taken at each point through G(x) and G(y) if $\mathcal{X}$ is viewed as a Riemannian manifold with metric G. We do not explore these observations further here, but the interested reader is referred to [29] for more discussion.
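A single PDRWM transition can be sketched directly from the acceptance rate for the proposal N(x, hG^{-1}(x)). In the sketch below the function names `pdrwm_log_alpha` and `pdrwm_step`, and the convention that the user supplies `G` as a function returning a positive definite matrix, are assumptions made for illustration only.

```python
import numpy as np

def pdrwm_log_alpha(x, y, log_pi, G, h):
    """log alpha(x, y) for the Gaussian proposal N(x, h G^{-1}(x))."""
    Gx, Gy = G(x), G(y)
    d = x - y
    # log |G(.)| via slogdet for numerical stability.
    _, logdet_x = np.linalg.slogdet(Gx)
    _, logdet_y = np.linalg.slogdet(Gy)
    log_ratio = (log_pi(y) - log_pi(x)
                 + 0.5 * (logdet_y - logdet_x)
                 - d @ (Gy - Gx) @ d / (2.0 * h))
    return min(0.0, log_ratio)

def pdrwm_step(x, log_pi, G, h, rng):
    """One Metropolis-Hastings transition: propose, then accept or stay at x."""
    cov = h * np.linalg.inv(G(x))
    y = rng.multivariate_normal(x, cov)
    if np.log(rng.uniform()) < pdrwm_log_alpha(x, y, log_pi, G, h):
        return y
    return x
```

When `G` returns the same matrix everywhere, the determinant and quadratic terms vanish and the rule reduces to the usual RWM acceptance rate 1 ∧ π(y)/π(x).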
The choice of G(x) is an obvious question. In fact, specific variants of this method have appeared on many occasions in the literature, some of which we now summarise.

1. Tempered Langevin diffusions [8]. $G^{-1}(x) = \pi^{-1}(x) I$. The authors highlight that the diffusion with dynamics $dX_t = \pi^{-1/2}(X_t) dW_t$ has invariant distribution π(·), motivating the choice. The method was shown to perform well for a bi-modal π(x), as larger jumps are proposed in the low density region between the two modes.
2. State-dependent Metropolis. $G^{-1}(x) = (1 + |x|)^b$. Here the intuition is simply that b > 0 means larger jumps will be made in the tails. In one dimension the authors compare the expected squared jumping distance $E[(X_{i+1} - X_i)^2]$ empirically for chains exploring a N(0, 1) target distribution, choosing b adaptively, and found b ≈ 1.6 to be optimal.
3. Regional adaptive Metropolis-Hastings [7,11]. $G^{-1}(x) = \sum_{i=1}^{m} \mathbb{1}(x \in \mathcal{X}_i) \Sigma_i$. In this case the state-space is partitioned into $\mathcal{X}_1 \cup ... \cup \mathcal{X}_m$, and a different proposal covariance $\Sigma_i$ is learned adaptively in each region 1 ≤ i ≤ m. An extension which allows for some errors in choosing an appropriate partition is discussed in [11].
4. Localised Random Walk Metropolis [10]. $G^{-1}(x) = \sum_{k=1}^{m} \check{q}_\theta(k|x) \Sigma_k$. Here $\check{q}_\theta(k|x)$ are weights based on approximating π(x) with some mixture of Normal/Student's t distributions, using the approach suggested in [30]. At each iteration of the algorithm a mixture component k is sampled from $\check{q}_\theta(\cdot|x)$, and the covariance $\Sigma_k$ is used for the proposal Q(x, dy).
5. Kernel adaptive Metropolis-Hastings [9]. $G^{-1}(x) = \gamma^2 I + \nu^2 M_x H M_x^T$, where $M_x = 2[\nabla_x k(x, z_1), ..., \nabla_x k(x, z_n)]$ for some kernel function k and n past samples $\{z_1, ..., z_n\}$, $H = I - (1/n) 1_{n \times n}$ is a centering matrix, and γ, ν are tuning parameters. The approach is based around performing nonlinear principal components analysis on past samples from the chain to learn a local covariance. Illustrative examples for the case of a Gaussian kernel show that $M_x H M_x^T$ acts as a weighted empirical covariance of samples z, with larger weights given to the $z_i$ which are closer to x [9].
The latter cases also motivate any choice in which $G^{-1}(x)$ is a weighted empirical covariance of some past samples $\{z_1, ..., z_n\}$, with weight function $w : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ satisfying $\sum_i w(x, z_i) = 1$ that decays as $|x - z_i|$ grows, which would also mimic the local curvature of π(·) (taking care to appropriately regularise and diminish adaptation so as to preserve ergodicity, as outlined in [10]). The logic of [13,31] could also be applied, by choosing G(x) as some regularised version of the negative Hessian of log π(x). However, if such derivative information were available it would seem more sensible to use a more sophisticated method than a martingale proposal (see e.g. [13]).
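As a concrete illustration of a locally weighted covariance, the sketch below builds $G^{-1}(x)$ from past samples using normalised weights that decay with distance from x. The Gaussian weight function, the `bandwidth` parameter, the ridge term `reg` and the function name `local_covariance` are all assumptions made for this example; none of them is prescribed by the methods summarised above, and the sketch ignores the question of diminishing adaptation.

```python
import numpy as np

def local_covariance(x, Z, bandwidth=1.0, reg=1e-3):
    """Weighted empirical covariance of past samples Z (shape n x d) around x.
    Gaussian weights and the ridge term `reg` are illustrative assumptions."""
    sq_dist = np.sum((Z - x) ** 2, axis=1)
    log_w = -0.5 * sq_dist / bandwidth**2
    w = np.exp(log_w - log_w.max())   # subtract the max to avoid underflow
    w /= w.sum()                      # weights sum to one
    mean = w @ Z                      # weighted mean of the samples
    centred = Z - mean
    cov = (centred * w[:, None]).T @ centred
    # Regularise so the result is strictly positive definite.
    return cov + reg * np.eye(Z.shape[1])

rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2)) * np.array([1.0, 3.0])
print(local_covariance(np.zeros(2), Z, bandwidth=2.0))
```

With a very large bandwidth the weights become uniform and the output approaches the (regularised) global empirical covariance, which recovers the fixed-Σ RWM choice.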

Results in One Dimension
Here the specific choice of G(x) is left open, and we instead consider two different general scenarios as |x| → ∞, i) G −1 (x) → Σ, and ii) G −1 (x) → ∞ at some rate. In theory there is also the possibility that G −1 (x) → 0, though intuitively this would not seem to be a particularly sensible choice as chains would be extremely likely to spend a long time in the tails of a distribution, so we do not consider it.
Three scenarios are considered for the tail behaviour of π(x). We refer to this density as log-concave in the tails if for some $x_0 > 0$ and a > 0
$$\frac{\pi(y)}{\pi(x)} \leq \exp\{-a(y - x)\} \quad \text{for all } y \geq x \geq x_0, \tag{10}$$
and a similar condition holds in the negative tail. If (10) is not satisfied but there is some β ∈ (0, 1) such that the above condition can be replaced with $\pi(y)/\pi(x) \leq \exp\{-a(y^\beta - x^\beta)\}$, then we call the density subexponential (note this is not the standard definition). Finally, we call π(x) 'polynomial-tailed' if $\pi(x) \propto |x|^{-p}$ for large |x| and some p ≥ 1. We also apply asymptotic growth conditions for $G^{-1}(x)$, and without loss of generality assume that these hold for any x larger than the same $x_0$ in absolute value. We use some asymptotic notation for positive real-valued functions f and g in this section, alongside the more familiar big-O and little-o notation. The main results of this section are summarised in Table 1 at the end of the section.
The first result emphasises a growing variance as a necessary requirement for geometric ergodicity in the heavy-tailed case.
Lemma 1. If $G^{-1}(x)$ is bounded above uniformly in x, then the PDRWM can only produce a geometrically ergodic Markov chain if π(x) is log-concave in the tails.
Proof: In this case for any choice of ε > 0 there is a δ > 0 such that Q(x, B δ (x)) > 1 − ε, so Theorem 2 can be applied.
Though the heavy-tailed case is a challenging scenario, the standard RWM with fixed covariance will produce a geometrically ergodic Markov chain if π(x) is log-concave. Next we extend this result to the case of sub-quadratic variance growth in the tails.
Lemma 2. If $G^{-1}(x)$ grows as $|x|^\gamma$ in the tails for some γ < 2, and π(x) is log-concave in the tails, then the PDRWM method produces a geometrically ergodic Markov chain from π-almost any starting point. If π(x) is subexponential for some β ∈ (0, 1), then choosing γ > 2(1 − β) is sufficient for the same conclusion.
The log-concave proof consists of partitioning $\mathcal{X}$ into five regions, and showing that as |x| → ∞, (8) evaluated over each of these regions will either become arbitrarily small or remain strictly negative. We use the Lyapunov function $V(x) = e^{s|x|}$ for some s > 0. This choice allows results about moment generating functions of truncated Gaussian distributions (see Appendix B) to be used, in conjunction with simple bounds on the cumulative distribution function from [32], to establish that (8) will become arbitrarily small for regions of $\mathcal{X}$ outside the 'typical set' $(x - cx^{\gamma/2}, x + cx^{\gamma/2})$. Theorem 3.2 from [4] shows that for the RWM with fixed covariance (8) evaluated over this region will be strictly negative. The essence of the argument is that for y > x in the tails, $\alpha_R(x, y) \leq e^{-a(y - x)}$ by log-concavity, so as long as s is chosen to be less than a this decay will dominate any growth in V(y) here. As $\alpha_R(x, y) = 1$ for any inwards proposals, it can be shown that (8) is strictly negative when evaluated over this region.
The crucial additional difficulty in the case of growing covariance is that the acceptance rate in this region (for suitably large x) is now
$$\alpha(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)} \left( \frac{x}{y} \right)^{\gamma/2} \exp\left( -\frac{(y - x)^2}{2h} \left[ y^{-\gamma} - x^{-\gamma} \right] \right).$$
The problematic term lies inside the square bracket: this will be negative for y > x, meaning a large positive component in α(x, y). To deal with this, we use a Taylor expansion of $y^{-\gamma}$ about x and some simplifications to show that provided γ < 2, for large enough x, locally (for y near x, where the choice of region plays a role) the acceptance rate for y > x will still satisfy
$$\alpha(x, y) \leq \exp\{-(a - \delta_x)(y - x)\},$$
where $\delta_x$ can be made arbitrarily small. This allows us to use a similar argument to that in [4] to prove the result. Outside of this region the Gaussian tails of Q(x, ·) take care of any less desirable behaviour of α(x, y). To extend this result to the subexponential case, we choose $V(x) = e^{s|x|^\beta}$, and Taylor expand $|y|^\beta$ in the typical set to get a suitable bound on α(x, y).
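The role of the condition γ < 2 can be seen numerically by evaluating the bracketed term, multiplied by $(y - x)^2$, at the edge of the typical set $y = x + cx^{\gamma/2}$. The short sketch below assumes $G(x) = x^{-\gamma}$ exactly for x > 0 and uses the hypothetical helper name `exponent_term`; the value γ = 2 is included only for comparison.

```python
def exponent_term(x, gamma, c=1.0):
    """(y - x)^2 [y^{-gamma} - x^{-gamma}] at y = x + c x^{gamma/2}, for x > 0.
    Assumes G(x) = x^{-gamma}; dividing by -2h gives the exponent in alpha(x, y)."""
    y = x + c * x ** (gamma / 2.0)
    return (y - x) ** 2 * (y ** (-gamma) - x ** (-gamma))

for gamma in (1.0, 1.5, 2.0):
    print(gamma, [exponent_term(x, gamma) for x in (1e2, 1e4, 1e6)])
```

For γ < 2 the term shrinks towards zero as x grows, in line with the Taylor argument, whereas for γ = 2 it does not depend on x at all.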
Note that this lemma includes as a special case any instance in which $G^{-1}(x) \uparrow \sigma^2$ as |x| → ∞. However, the case $G^{-1}(x) \to \sigma^2$ from any direction is actually more straightforward to show, by simply moving x far enough into the tails that $G^{-1}(y) \approx \sigma^2$ for all $y \in (x - cx^{\gamma/2}, x + cx^{\gamma/2})$. In this case the argument in [4] can be applied more straightforwardly.
Although we do not formally prove that the method will not produce a geometrically ergodic chain in the polynomial tailed case when $G^{-1}(x) = o(|x|^2)$, we show intuitively that this will be the case. Assuming that in the tails $\pi(x) \propto |x|^{-p}$ for some p > 1 then for large x and a proposal $y = x + cx^{\gamma/2}$
$$\alpha(x, x + cx^{\gamma/2}) = 1 \wedge \left( \frac{x}{x + cx^{\gamma/2}} \right)^{p + \gamma/2} \exp\left( -\frac{c^2 x^\gamma}{2h} \left[ (x + cx^{\gamma/2})^{-\gamma} - x^{-\gamma} \right] \right).$$
The first expression on the right hand side converges to 1 as x → ∞, which is akin to the case of fixed proposal covariance. The second term will be larger than one for c > 0 and less than one for c < 0. So the algorithm will exhibit the same 'random walk in the tails' behaviour which is often characteristic of the RWM in this scenario, and so the acceptance rate will fail to enforce a geometric drift back into the centre of the space.
In the case where γ = 2 this will not happen, as the terms in the above expression will be roughly constant with x. We examine this case next.
Here the intuition is that proposals in the tails will take the form $y = (1 + \xi\sqrt{h})x$, which if h is chosen to be small will be similar to $y = e^{\xi\sqrt{h}}x$. The latter scheme is sometimes called the multiplicative RWM, and is known to be geometrically ergodic in this scenario (e.g. [3]), as this equates to taking a log-transformation of x, which 'lightens' the tails of the target density to the point where it becomes log-concave.
In this case we take the Lyapunov function $V(x) = 1 \vee |x|^s$, with s > 0 chosen such that $\int V(y)\pi(dy) < \infty$. We again divide the integral (8) into regions, but in this case we show that each of these can be appropriately bounded simply as functions of the step-size h, i.e. independently of x. By examining each term, we show that for a small enough h the integral will be strictly negative.
The result is positive, but in this case is perhaps an example where the theory does not necessarily translate into an effective scheme in practice. If π(x) has particularly heavy tails, for example, then it is likely that an extremely small value of h would be needed to ensure (1), meaning the geometric rate of convergence ρ would be close to one. Nonetheless, it is an example of how appropriate choice of G −1 (x) can favourably change the ergodicity properties of a sampler.
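The claim that the relevant quantities depend only on h, and not on x, can be checked numerically. The sketch below evaluates the acceptance rate at $y = (1 + \xi\sqrt{h})x$ under the assumptions $G(x) = x^{-2}$ and $\pi(x) \propto |x|^{-p}$ exactly; the function name `log_alpha_quadratic` and the parameter values are illustrative.

```python
import numpy as np

def log_alpha_quadratic(x, xi, h, p):
    """log alpha(x, y) at y = (1 + xi sqrt(h)) x, assuming G(x) = x^{-2}
    and pi(x) ∝ |x|^{-p} (both taken to hold exactly, for illustration)."""
    y = (1.0 + xi * np.sqrt(h)) * x
    log_pi_ratio = -p * (np.log(np.abs(y)) - np.log(np.abs(x)))
    log_det_ratio = np.log(np.abs(x)) - np.log(np.abs(y))  # 0.5 * log(G(y)/G(x))
    quad = -(y - x) ** 2 * (y ** -2.0 - x ** -2.0) / (2.0 * h)
    return np.minimum(0.0, log_pi_ratio + log_det_ratio + quad)

xi = np.array([-0.5, 0.3, 1.2])
print(log_alpha_quadratic(10.0, xi, h=0.1, p=3.0))
print(log_alpha_quadratic(1000.0, xi, h=0.1, p=3.0))
```

The two printed rows agree up to floating point error, reflecting that the acceptance rate is a function of ξ and h alone in this regime.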
The final result of this section provides a note of warning, that lack of care in choosing $G^{-1}(x)$ can have severe consequences for the method. The intuition for this result is straightforward when explained. In the tails, the average proposals will be of size $|x|^{\gamma/2}$, which will be much larger than |x| if γ > 2, meaning most will send the chain even further into the tails in either direction (and hence will likely be rejected). To make this rigorous we show that (9) holds here, by considering the set of proposals $A_{x,\epsilon} := \{y \in \mathcal{X} : \alpha(x, y) \geq \epsilon\}$, and showing that $Q(x, A_{x,\epsilon}) \to 0$ as |x| → ∞, for any ε > 0. A specific example is illustrated in Figure 1. The main results of this section are summarised in Table 1.
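The effect of faster than quadratic variance growth can be illustrated by estimating the average acceptance probability 1 − r(x) at points far out in the tails. The sketch below uses a standard Gaussian target and the proposal variance $h(1 + |x|)^b$; this form is used here as a stand-in for $|x|^\gamma$ that avoids the singularity at zero, and the function name `mean_acceptance` and all parameter values are assumptions for illustration.

```python
import numpy as np

def mean_acceptance(x, b, h=1.0, n=50_000, seed=0):
    """Monte Carlo estimate of int alpha(x, y) Q(x, dy) = 1 - r(x) for a
    standard Gaussian target and proposal variance h (1 + |x|)^b."""
    rng = np.random.default_rng(seed)
    g_inv = lambda z: (1.0 + np.abs(z)) ** b
    y = x + np.sqrt(h * g_inv(x)) * rng.standard_normal(n)
    G_x, G_y = 1.0 / g_inv(x), 1.0 / g_inv(y)
    # log of the Metropolis-Hastings ratio for the proposal N(x, h G^{-1}(x)).
    log_ratio = (0.5 * (x**2 - y**2)
                 + 0.5 * (np.log(G_y) - np.log(G_x))
                 - (y - x) ** 2 * (G_y - G_x) / (2.0 * h))
    return np.mean(np.exp(np.minimum(0.0, log_ratio)))

for b in (1.0, 3.0):
    print(b, [mean_acceptance(x, b) for x in (10.0, 100.0, 1000.0)])
```

With b = 1 roughly half of the proposals (the inward ones) are accepted however large x is, whereas with b = 3 the estimated acceptance probability decays towards zero as x grows, which is the behaviour described by (9).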

Higher Dimensions
Some results from the previous section naturally carry over to higher dimensions. The most straightforward is outlined below.
Lemma 5. If each element of G −1 (x) is bounded above (uniformly in x), then the PDRWM can only produce a geometrically ergodic Markov chain if the tails of π(x) are uniformly exponential or lighter.
Proof: As with Lemma 1, a straightforward application of Theorem 2 gives the result.
It is also intuitive that an analogue to Lemma 4 will exist here. Specifically, if any diagonal component of the covariance G −1 (x) grows at a faster than quadratic rate with x, then the sampler is likely to run into the same difficulties in the tails. Similarly, when G −1 (x) → Σ, it is straightforward to see that the sampler will inherit the geometric ergodicity properties of the RWM with fixed covariance, by a similar argument to that discussed for the proof of Lemma 2 in this case.
As mentioned earlier, in the case $G^{-1}(x) = \Sigma$, additional conditions on π(x) are required for geometric ergodicity in more than one dimension, outlined in [5]. An example is also given in the paper of the simple two-dimensional density $\pi(x, y) \propto \exp(-x^2 - y^2 - x^2y^2)$, which fails to meet this criterion. The difficult models are those for which probability concentrates on a 'ridge' in the tails, which becomes ever narrower as |x| increases. In this instance, proposals from the RWM are less and less likely to be accepted as |x| grows. The problem is illustrated graphically in Figure 2. Such densities are often encountered as posterior distributions in hierarchical models, with another well-known example being the 'funnel', discussed in [33]. On the same figure there is some graphical evidence that if the proposal covariance is allowed to adjust then this problem can be alleviated somewhat.
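The narrowing ridge can also be seen numerically by estimating the RWM acceptance probability at points along the horizontal axis for the density above. The following sketch uses a spherical Gaussian proposal N(x, hI); the function name `rwm_acceptance_ridge`, the step size and the evaluation points are illustrative assumptions.

```python
import numpy as np

def rwm_acceptance_ridge(point, h=1.0, n=50_000, seed=0):
    """Estimate the RWM acceptance probability at `point` for
    pi(x, y) ∝ exp(-x^2 - y^2 - x^2 y^2) with proposal N(point, h I)."""
    rng = np.random.default_rng(seed)
    log_pi = lambda z: (-z[..., 0] ** 2 - z[..., 1] ** 2
                        - z[..., 0] ** 2 * z[..., 1] ** 2)
    point = np.asarray(point, dtype=float)
    proposals = point + np.sqrt(h) * rng.standard_normal((n, 2))
    log_ratio = log_pi(proposals) - log_pi(point)
    return np.mean(np.exp(np.minimum(0.0, log_ratio)))

for x1 in (1.0, 10.0, 50.0):
    print(x1, rwm_acceptance_ridge((x1, 0.0)))
```

The estimated acceptance probability falls as the chain moves out along the ridge, because only proposals that land within an increasingly thin band around the axis have non-negligible density.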
To explore this more concretely, we design an extremely simple two dimensional density which exhibits the same features, which we call the 'rectangle' density, defined in terms of int(z), the integer part of z ∈ R. This is simply a distribution defined over a sequence of rectangles on the upper-half plane of $\mathbb{R}^2$ (starting at $y_2 = 1$), each centred on the vertical axis, with height one and with each successive rectangle a third of the width and depth of the previous. Intuitively, the density is an ever narrowing staircase, as shown in Figure 3.

Figure 2: Contours of the density $\pi(x, y) \propto \exp(-x^2 - y^2 - x^2y^2)$. The left-hand plots show that a RWM with spherical covariance will find it increasingly difficult to propose values which will be accepted as the chain moves into the tails. The right-hand plots suggest that allowing the covariance to change with position might alleviate this issue.
For simplicity here we take the Random Walk Metropolis proposal as simply a uniform distribution on the disc of radius one about the current point, so $Q_R(x, A) = |A \cap S_x| / |S_x|$, where $S_x := \{y \in \mathbb{R}^2 ; |y - x| \leq 1\}$. To imitate the changing covariance in the PDRWM, we take as a proposal $Q_P$ a uniform distribution over an ellipse $E_x$ whose width changes with position. For these choices many of the calculations required in this section reduce to calculating areas of rectangles and ellipses.
The rectangle density does not satisfy the conditions of Theorem 1, as it is not bounded away from zero on compact sets; however, any small set here must still be compact for both methods specified. To see this, note that for any fixed m < ∞, $\text{supp}\{P_R^m(x, \cdot)\}$ is compact, so that for a minorisation condition of the form (4) to hold within some small set C, we must have that $\text{supp}\{\nu(\cdot)\} \subset \text{supp}\{P_R^m(x, \cdot)\} \cap \text{supp}\{P_R^m(y, \cdot)\}$ for every x, y ∈ C. As this intersection will only be non-empty for bounded |x − y|, C must be compact. The same argument holds for the elliptical case. Because of this, establishing (9) is still sufficient to characterise lack of geometric ergodicity.
Lemma 6. The Metropolis-Hastings algorithm with proposal $Q_R$ does not produce a geometrically ergodic Markov chain when π(x) is the rectangle density.
Proof: It is sufficient to construct a sequence of points $x_p \in \mathbb{R}^2$ such that $|x_p| \to \infty$ as p → ∞, and show that $r(x_p) \to 1$. Take $x_p = (0, p)$ for $p \in \mathbb{N}$. In this case $r(x_p)$ is bounded below by one minus the area of the rectangles that $x_p$ is on the boundary of divided by the area of the disc $|S_x| = \pi$. As the area of these rectangles shrinks to zero, $r(x_p) \to 1$ as p → ∞, as required.
The approach makes it clear that reducing the area of an ellipse at the same rate as the area of the rectangles will remove this issue. The next result confirms this intuition.
Lemma 7. The Metropolis-Hastings algorithm with proposal $Q_P$ produces a geometrically ergodic Markov chain when π(x) is the rectangle density, from π-almost any starting point.
Proof: We can take as a small set C = {y ∈ R 2 ; 1 ≤ y i ≤ 2}, i.e. the largest rectangle on the contour plot, and the Lyapunov function V (x) = |x 2 | + 1 ∨ |x 1 |. For x, y ∈ R, V (y) < V (x) iff y 2 < x 2 . Note also that α(x, y) = 1 for any x, y ∈ R ∩ {y ∈ X : y 2 < x 2 }. It suffices, with these choices, to show that the overlap on the contour plot between the lower hemisphere of each E x and R is larger than that between R and the upper hemisphere for any x ∈ R \ C, which is clearly true from inspecting the figures in Appendix C.

Discussion
In this paper we have analysed the ergodic behaviour of a Metropolis-Hastings method with proposal kernel Q(x, ·) = N (x, hG −1 (x)). In one dimension we have characterised the behaviour in terms of growth conditions on G −1 (x) and tail conditions on the target distribution, and some cases in higher dimensions have also been discussed. The fundamental question of interest was whether generalising an existing Metropolis-Hastings method by allowing the proposal covariance to change with position can alter the ergodicity properties of the sampler. We can confirm that this is indeed possible, either for the better or worse, depending on the choice of covariance. The take home points for practitioners are i) lack of sufficient care in the design of G −1 (x) can have severe consequences (as in Lemma 4), and ii) careful choice of G −1 (x) can have much more beneficial ones, particularly in higher dimensions, as evidenced by the 'rectangle' density example in Section 5.
We feel that such results can also offer insight into similar generalisations of different Metropolis-Hastings algorithms (e.g. [13,34]). For example, it seems intuitive that any method in which the variance grows at a faster than quadratic rate in the tails is unlikely to produce a geometrically ergodic chain. There are connections between the PDRWM and some extensions of the Metropolis-adjusted Langevin algorithm [34], the ergodicity properties of which are discussed in [35]. The key difference between the schemes is the inclusion of the drift term $G^{-1}(x)\nabla \log \pi(x)/2$ in the latter. It is this term which in the main governs the behaviour of the sampler, which is why the behaviour of the PDRWM is different to this scheme (note that gradients are required for all variants, unlike in the PDRWM).
We can apply the general results to the specific variants discussed in Section 3. Provided sensible choices of regions/weights and diminishing adaptation schemes are chosen, the Regional adaptive Metropolis-Hastings, Localised Random Walk Metropolis and Kernel adaptive Metropolis-Hastings samplers should all satisfy $G^{-1}(x) \to \Sigma$ as |x| → ∞, meaning they will inherit the ergodicity properties of the standard RWM (the behaviour in the centre of the space, however, will likely be different). In the State-dependent Metropolis method, provided b ≤ 2 (with suitable tuning in the equality case), the sampler should also behave reasonably. Whether or not a large enough value of b would be found by a particular adaptation rule in the subexponential case is not entirely clear, and this could be an interesting direction of further study. The Tempered Langevin diffusion scheme, however, will fail to produce a geometrically ergodic Markov chain whenever the tails of π(x) are lighter than those of a Cauchy distribution. In the case of Gaussian tails, for example, $G^{-1}(x) = e^{x^2/2} I$. To allow reasonable tail exploration, two pragmatic options would be to upper bound $G^{-1}(x)$ manually or use this scheme in conjunction with another, as there is evidence that the sampler can perform favourably when exploring the centre of a distribution [8]. None of the specific variants discussed here are able to mimic the local curvature of π(x) in the tails, so as to enjoy the favourable behaviour exemplified in Lemma 7. This is possible using Hessian information as in [13], and should also be possible using appropriate surrogates in at least some cases where this is not available.
It is reasonable to ask whether exploring the tails of a distribution adequately is always necessary. If the functions a practitioner is interested in estimating are such that $\int_C f(x)\tilde{\pi}(dx) \approx \int f(x)\pi(dx)$, where $\tilde{\pi}(\cdot)$ is the target restricted to the centre of the space C, then perhaps this is not so important. Some results in this direction are given in [36]. If this approach is taken, however, whether or not a sampler will perform appropriately becomes a considerably more problem-dependent question. Geometric ergodicity, whilst by no means guaranteeing sensible estimators in the non-asymptotic context, does give steps towards this in some generality, through (2). As mentioned earlier, it also appears to have other favourable consequences [16,21]. As such, we feel it is a property worth establishing.

Appendix A. Proofs
Appendix A.1. Proof of Lemma 2
For the log-concave case, take $V(x) = e^{s|x|}$ for some s > 0, and let $B_A$ denote the integral (8) over the set A. We first break up $\mathcal{X}$ into $(-\infty, 0] \cup (0, x - cx^{\gamma/2}] \cup (x - cx^{\gamma/2}, x + cx^{\gamma/2}] \cup (x + cx^{\gamma/2}, x + cx^\gamma] \cup (x + cx^\gamma, \infty)$, and show that the integral is strictly negative on at least one of these sets, and can be made arbitrarily small as x → ∞ on all others. The case x → −∞ is analogous from the tail conditions on π(x).
On (−∞, 0], the integral is proportional to the moment generating function of a truncated Gaussian distribution (see Appendix B). Combined with a simple bound on the error function, this gives an upper bound which tends to 0 as x → ∞, so we can make this arbitrarily small.
On $(0, x - cx^{\gamma/2}]$, note that $e^{s(|y| - |x|)} - 1$ is clearly negative throughout this region. So the integral is straightforwardly bounded as $B_{(0, x - cx^{\gamma/2}]} \leq 0$ for all $x \in \mathcal{X}$.
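On $(x - cx^{\gamma/2}, x + cx^{\gamma/2}]$, provided $x - cx^{\gamma/2}$ is large enough that we are in the tail regime, then for any y in this region
$$\alpha(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)} \left( \frac{x}{y} \right)^{\gamma/2} \exp\left( -\frac{(y - x)^2}{2h} \left[ y^{-\gamma} - x^{-\gamma} \right] \right).$$
A Taylor expansion of $y^{-\gamma}$ about x gives
$$y^{-\gamma} = x^{-\gamma} - \gamma x^{-(\gamma+1)}(y - x) + \frac{\gamma(\gamma+1)}{2} x^{-(\gamma+2)}(y - x)^2 - \cdots,$$
and multiplying by $(y - x)^2$ gives
$$(y - x)^2 \left[ y^{-\gamma} - x^{-\gamma} \right] = -\gamma \frac{(y - x)^3}{x^{\gamma+1}} + \frac{\gamma(\gamma+1)}{2} \frac{(y - x)^4}{x^{\gamma+2}} - \cdots.$$
If $|y - x| = cx^{\gamma/2}$ then, taking y > x, this is
$$-\gamma c^3 \frac{x^{3\gamma/2}}{x^{\gamma+1}} + \frac{\gamma(\gamma+1)}{2} c^4 \frac{x^{2\gamma}}{x^{\gamma+2}} - \cdots.$$
As γ < 2 then 3γ/2 < γ + 1, and similarly for successive terms, meaning each gets smaller as |x| → ∞. So we have for large x and $y \in (x - cx^{\gamma/2}, x + cx^{\gamma/2})$
$$(y - x)^2 \left[ y^{-\gamma} - x^{-\gamma} \right] \approx -\gamma \frac{(y - x)^3}{x^{\gamma+1}}.$$
So we can analyse how the acceptance rate behaves. First note that for fixed ε > 0, up to higher-order terms,
$$\alpha(x, x + \epsilon) \leq e^{-a\epsilon} \left( \frac{x}{x + \epsilon} \right)^{\gamma/2} \exp\left( \frac{\gamma \epsilon^3}{2h x^{\gamma+1}} \right) \to e^{-a\epsilon}$$
as x → ∞. Similarly we find that the $e^{-a\epsilon}$ term will dominate for any ε for which $\epsilon^3 / x^{\gamma+1} \to 0$, i.e. any $\epsilon = o(x^{(\gamma+1)/3})$. If γ < 2 then $\epsilon = cx^{\gamma/2}$ satisfies this condition. So for any y > x in this region we can choose an x large enough that
$$\alpha(x, y) \leq \exp\{-(a - \delta_x)(y - x)\},$$
where $\delta_x$ can be made arbitrarily small in this region by choosing a large enough x. For the case y < x here we have (for any fixed ε > 0)
$$\frac{\pi(x - \epsilon)}{\pi(x)} \left( \frac{x}{x - \epsilon} \right)^{\gamma/2} \exp\left( -\frac{\gamma \epsilon^3}{2h x^{\gamma+1}} \right) \geq \exp\left( a\epsilon - \frac{\gamma \epsilon^3}{2h x^{\gamma+1}} \right).$$
So by a similar argument the acceptance ratio exceeds one here for large x, as the exponential term will dominate, and hence α(x, y) = 1. Combining these results we can write
$$B_{(x - cx^{\gamma/2}, x + cx^{\gamma/2}]} \leq \int_0^{cx^{\gamma/2}} \left[ (e^{sz} - 1) e^{-(a - \delta_x)z} + (e^{-sz} - 1) \right] q_x(dz) = -\int_0^{cx^{\gamma/2}} (1 - e^{-sz}) \left( 1 - e^{-(a - s - \delta_x)z} \right) q_x(dz),$$
which will be strictly negative for large enough x provided s < a, where $q_x(\cdot)$ denotes a zero mean Gaussian distribution with the same variance as Q(x, ·).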
On $(x + cx^{\gamma/2}, x + cx^{\gamma}]$ we can upper bound the acceptance rate as If $y \ge x$ and $x > x_0$ then we have For $|y - x| = cx^{\eta}$ this becomes So provided $\gamma > \eta$ the $e^{-a\epsilon}$ term (with $\epsilon = |y - x|$) will dominate for large $x$. In the equality case we have so provided we choose $c$ such that $a > c^2/2h$ then the acceptance rate will also decay exponentially. Because of this we have so provided $a > c^2/2h + s$ then this term can be made arbitrarily small. On $(x + cx^{\gamma}, \infty)$ using the same properties of truncated Gaussians and error function bounds we have which can be made arbitrarily small provided $c > 2s$.
For the subexponential case, the proof is similar. Take $V(x) = e^{s|x|^{\beta}}$, and divide $\mathcal{X}$ up into the same regions. Outside of $(x - cx^{\gamma/2}, x + cx^{\gamma/2}]$ the same arguments show that the integral can be made arbitrarily small. On this set, note that in the tails. For $y - x = cx^{\eta}$, then for $\eta < 1 - \beta$ this becomes negligible, otherwise it will grow as $x$ does. So in this case we further divide the typical set into $(x, x + cx^{1-\beta}] \cup (x + cx^{1-\beta}, x + cx^{\gamma/2})$. On $(x - cx^{1-\beta}, x + cx^{1-\beta})$ the integral is bounded above by $e^{-c_1} Q(x, (x - cx^{1-\beta}, x + cx^{1-\beta})) \to 0$, for some suitably chosen $c_1 > 0$. On $(x - cx^{\gamma/2}, x - cx^{1-\beta}] \cup (x + cx^{1-\beta}, x + cx^{\gamma/2}]$ then for $y > x$ we have $\alpha(x, y) \le e^{-c_2(y^{\beta} - x^{\beta})}$, so we can use the same argument as in the log-concave case to show that the integral will be strictly negative in the limit.

Appendix A.2. Proof of Lemma 3
Here a typical proposal will be $y = x \pm \xi\sqrt{h}\,x$ for $x$ sufficiently large, meaning $|x - y| = \xi\sqrt{h}\,x$, with $\xi \sim \mathcal{N}(0, 1)$. For now we assume both $x$ and $y$ are in the tail regime, meaning $G(y) \propto y^{-2}$ and similarly for $G(x)$ (we make this concrete later). We can also take $\pi(y)/\pi(x) = x^p/y^p$ here.
For $y = (1 + \xi\sqrt{h})x$ then in the tails the acceptance rate becomes which is completely independent of $x$. Take $V(x) = 1 \vee |x|^s$, for some $s < 1$ which is suitably small that $\int V(y)\pi(dy) < \infty$, together with an extra restriction which we specify later. Then $V(y)/V(x)$ becomes independent of $x$ also. The integral of interest can now be re-written in terms of $\xi$, with $m(\cdot)$ a standard Gaussian measure, $\phi(\xi)$ its density, and $\alpha_h(\xi)$ the acceptance rate. So in most of the regions we consider we can choose $x$ large enough that the integral in question is We therefore need to show that this integral is strictly negative for $h$ small enough, and take care of the values of $y$ which may not fall into this region. Again denoting (8) It is clear that all of these integrals can be made arbitrarily close to zero by making $h$ small enough. The goal is to show that $B_{(-\infty,\infty)} < 0$ for all $h \in (0, h_0)$. We proceed by finding the order of $h$ of each $B_{H_i}$. On Use the change of variables $\gamma = 1 + \xi\sqrt{h}$ gives with $\eta \sim \mathcal{N}(-1, h)$, as $s < 1$. Using results for truncated Gaussians, we have The lower bound on $\Phi^c$ from [32] gives , the function $|1 + \xi\sqrt{h}|^s - 1$ is negative in $H_2$, so this integral is trivially bounded as $\le 0$ for any $h$. Note that this is the entire set of $y$'s for which (A.2) is not the correct integral.
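The scale invariance used at the start of this argument, namely that the acceptance rate for $y = (1 + \xi\sqrt{h})x$ does not depend on $x$, can be checked numerically. The sketch below assumes the exact tail forms $\pi(x) \propto |x|^{-p}$ and $G(x) = x^{-2}$, and computes the Metropolis-Hastings ratio directly from the Gaussian proposal $\mathcal{N}(x, hG^{-1}(x))$; the function names and the values of `h` and `p` are illustrative assumptions, not taken from the paper.

```python
import math

def log_target(x, p):
    # Assumed polynomial tail: pi(x) proportional to |x|^(-p)
    return -p * math.log(abs(x))

def metric(x):
    # Assumed tail form G(x) = x^(-2), i.e. proposal variance h * x^2
    return x ** -2

def log_proposal(y, x, h):
    # Log density at y of the Gaussian proposal N(x, h / G(x))
    g = metric(x)
    return 0.5 * math.log(g / (2.0 * math.pi * h)) - 0.5 * g * (y - x) ** 2 / h

def acceptance(x, xi, h, p):
    # Metropolis-Hastings acceptance probability for the move y = (1 + xi*sqrt(h)) * x
    y = (1.0 + xi * math.sqrt(h)) * x
    log_ratio = (log_target(y, p) - log_target(x, p)
                 + log_proposal(x, y, h) - log_proposal(y, x, h))
    # Clip at zero before exponentiating: this is the "1 ∧ ratio" step
    return math.exp(min(0.0, log_ratio))
```

Calling `acceptance(10.0, xi, h, p)` and `acceptance(1.0e4, xi, h, p)` with the same `xi` should agree up to floating-point error, which is the property that lets the drift integral be rewritten in terms of $\xi$ alone.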
On $H_3 = (-\delta h^{-1/4}, \delta h^{-1/4})$ recall that the acceptance probability is For any $\xi > 0$ we have We would like to write this as $(1 + \xi\sqrt{h})^{-a}$ for some $a > 0$. If $\delta h^{1/4} < 1$ we can use a Taylor expansion with remainder $\log(1 + x) = x - x^2/2 + x^3/(3(1 + r)^3)$ for some $r \in (0, x)$ to get the bound $x - x^2/2 \le \log(1 + x)$ for $0 \le x < 1$. For any $b < p + 1$ then So provided $\delta$ is chosen in this way then $\exists a > 0$ such that $\alpha_h(\xi) \le (1 + \xi\sqrt{h})^{-a}$ for $\xi \in (0, \delta h^{-1/4})$ and $\alpha_h(\xi) = 1$ for $\xi \in (-\delta h^{-1/4}, 0)$ (by simply reversing the signs in the above inequalities). Now the integral of interest can be written So we need to bound Upper and lower bounds for $g(\xi) = (1 + \xi\sqrt{h})^{-a}$ on $(0, \delta h^{-1/4})$ are The first is a straight line through $g(\delta h^{-1/4})$ and $g(0) = 1$, the second is the straight line through $g(0) = 1$ with gradient $g'(0)$ (as the function is convex). This gives upper and lower bounds for the first two integrals as Combining inequalities, we can get a very loose upper bound on the integral as The exponentials are the dominant terms in the first two expressions, as they shrink to zero much faster than any of the $C_{H_i}$ terms (which still depend on $h$). To see that this is the case for $C_{H_3}$, note that (1 + δh It is more straightforward to see that $C_{H_1}$ and $C_{H_4}$ are both $O(h^{1/2})$. Because of this, we can always choose an $h$ small enough that the last term is arbitrarily larger than all others in the expression, meaning that the integral is strictly negative, as required.
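The two straight-line bounds on $g(\xi) = (1 + \xi\sqrt{h})^{-a}$ described above follow from convexity: on $[0, \delta h^{-1/4}]$ the chord lies above $g$ and the tangent at zero lies below it. The sketch below checks this on a grid. It is an illustration only, with hypothetical function names and arbitrarily chosen values of `h`, `a` and `delta`, and it does not reproduce the integral bounds of the proof.

```python
import math

def g(xi, h, a):
    # g(xi) = (1 + xi*sqrt(h))^(-a), convex in xi on [0, inf) for a > 0
    return (1.0 + xi * math.sqrt(h)) ** (-a)

def tangent_at_zero(xi, h, a):
    # Line through g(0) = 1 with gradient g'(0) = -a*sqrt(h): a lower bound
    return 1.0 - a * math.sqrt(h) * xi

def chord(xi, h, a, right):
    # Line through (0, g(0)) and (right, g(right)): an upper bound on [0, right]
    return 1.0 + xi * (g(right, h, a) - 1.0) / right

def bounds_hold(h, a, delta, n=1000, tol=1e-12):
    # Check tangent <= g <= chord on an evenly spaced grid over [0, delta * h^(-1/4)]
    right = delta * h ** -0.25
    for i in range(n + 1):
        xi = right * i / n
        value = g(xi, h, a)
        if not (tangent_at_zero(xi, h, a) - tol <= value <= chord(xi, h, a, right) + tol):
            return False
    return True
```

The small tolerance `tol` only absorbs rounding at the endpoints, where both lines touch $g$.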

Appendix A.3. Proof of Lemma 4
The goal is to show $\limsup_{x \to \infty} \int \alpha(x, y) Q(x, dy) = 0$.
The general strategy will be to find some set $A_{x,\epsilon} := \{y \in \mathcal{X} : \alpha(x, y) \ge \epsilon\}$.
In words, this is the set of candidate moves which have a non-negligible probability of acceptance. We will then establish that $Q(x, A_{x,\epsilon}) \to 0$ as $x \to \infty$, for any $\epsilon > 0$.
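First recall that for the algorithm in general the acceptance probability for a proposal $y$ is $\alpha(x, y) = 1 \wedge \frac{\pi(y)|G(y)|^{1/2}}{\pi(x)|G(x)|^{1/2}} \exp\left(-\frac{1}{2h}(x - y)^T [G(y) - G(x)] (x - y)\right)$. If $G(x) = O(|x|^{-\gamma})$, then for large enough $x$ and $y$ the acceptance probability is $\alpha(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)} \left(\frac{|x|}{|y|}\right)^{\gamma/2} \exp\left(-\frac{1}{2h}\left(|y|^{-\gamma} - |x|^{-\gamma}\right)(x - y)^2\right)$. As each $Q(x, \cdot)$ is a Gaussian distribution, we consider a 'typical set' to be For any $x$, $Q(x, T_x) \approx 0.96$. If we can show that i) for large enough $x$, $A_{x,\epsilon} \subset T_x$, and ii) the ratio $Q(x, A_{x,\epsilon})/Q(x, T_x) \to 0$ then we will have established the result.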
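The quantity this proof targets, the average acceptance probability $\int \alpha(x, y) Q(x, dy)$, can be approximated by quadrature to see how it behaves as $x$ grows. The sketch below is purely illustrative: the target $\pi(x) \propto e^{-|x|}$, the exact form $G(x) = |x|^{-\gamma}$ with $\gamma = 3$, the step size $h = 1$ and all function names are assumptions made for the example, and the hypotheses of Lemma 4 are not restated or verified here.

```python
import math

def log_target(x):
    # Assumed target for illustration only: pi(x) proportional to exp(-|x|)
    return -abs(x)

def metric(x, gamma):
    # Assumed exact tail form G(x) = |x|^(-gamma)
    return abs(x) ** (-gamma)

def log_proposal(y, x, h, gamma):
    # Log density at y of the Gaussian proposal N(x, h / G(x))
    g = metric(x, gamma)
    return 0.5 * math.log(g / (2.0 * math.pi * h)) - 0.5 * g * (y - x) ** 2 / h

def mean_acceptance(x, h=1.0, gamma=3.0, n=100_000, z_max=8.0):
    # Midpoint-rule approximation of the integral of alpha(x, y) Q(x, dy),
    # parameterised by y = x + sd * z with z standard normal on [-z_max, z_max]
    sd = math.sqrt(h / metric(x, gamma))
    dz = 2.0 * z_max / n
    total = 0.0
    for i in range(n):
        z = -z_max + (i + 0.5) * dz
        y = x + sd * z
        if y == 0.0:
            continue  # G is undefined at the origin; a single point carries no mass
        log_ratio = (log_target(y) - log_target(x)
                     + log_proposal(x, y, h, gamma) - log_proposal(y, x, h, gamma))
        alpha = math.exp(min(0.0, log_ratio))  # the "1 ∧ ratio" step
        total += alpha * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * dz
    return total
```

Under these assumptions, evaluating `mean_acceptance` at `1.0`, `100.0` and `1.0e4` gives a decreasing sequence: once the proposal standard deviation $\sqrt{h}\,x^{\gamma/2}$ outgrows $x$, most proposals land far outside the region where the acceptance probability is non-negligible.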