Convergence Rates for Constrained Sampling via Langevin Monte Carlo

Sampling from constrained distributions, which are frequently encountered in statistical and machine-learning models, poses significant challenges for algorithmic design and non-asymptotic analysis. In this study, we propose three sampling algorithms based on Langevin Monte Carlo with Metropolis–Hastings steps to handle distributions constrained within a convex body. We present a rigorous analysis of the corresponding Markov chains and derive non-asymptotic upper bounds on the convergence rates of these algorithms in total variation distance. Our results demonstrate that sampling algorithms enhanced with Metropolis–Hastings steps offer an effective solution for tackling some constrained sampling problems. Numerical experiments comparing our methods with several competing algorithms without Metropolis–Hastings steps further support our theoretical findings.


Introduction
Sampling from distributions with constraints has extensive applications in statistics, machine learning, and operations research, among other areas. Some distributions have bounded support, such as the simple but versatile uniform distribution, which serves as the foundation for a series of Monte Carlo methods, as discussed in [1]. Furthermore, many statistical inference problems involve estimating parameters subject to constraints on the parameter space, which defines a posterior distribution with bounded support in a Bayesian setting. Examples include Latent Dirichlet Allocation [2], truncated data problems in failure and survival time studies [3], ordinal data models [4], constrained lasso and ridge regressions [5], and non-negative matrix factorization [6]. In Bayesian learning, sampling from posterior distributions is a fundamental primitive, used for exploring posterior distributions, identifying unknown parameters, obtaining credible intervals, and solving inverse problems [7,8]. Finally, constrained sampling has great potential for solving constrained optimization problems [9,10].
Many Markov Chain Monte Carlo (MCMC) algorithms have been studied extensively for sampling from probability distributions with convex support or, more generally, with constrained parameters, mainly in the fields of Bayesian statistics and theoretical computer science; early work includes, among others, [1,11-14]. Firstly, within the MCMC framework, a direct solution involves discarding samples that violate the constraints, thereby exclusively retaining samples that satisfy them; see, for example, [1,15,16]. However, these rejection-type approaches may suffer from an excessive number of rejections, or from an extremely large acceptance rate within some local subspace satisfying the constraints, which leads to poor mixing and computational inefficiency, especially for complicated constraints and high-dimensional distributions [17,18]. Secondly, some of the literature draws inspiration from penalty functions in optimization and constructs barriers along the boundaries of the constrained domain, effectively confining the sampling process to the constrained area. These approaches encounter a major challenge when the samples reach the boundaries of the constraints, necessitating a reflection-based mechanism to redirect them back into the constrained region. To address this issue, Ref. [19] extended the Hamiltonian Monte Carlo (HMC) method by setting the potential energy outside the constraint region to infinity, restricting the states to the desired domain. Ref. [20] extended the HMC method to sample from truncated multivariate Gaussian distributions, and Ref. [21] proposed an approach that maps the constrained domain onto a sphere in an augmented space. Thirdly, motivated by constrained optimization methods, the constrained sampling problem can be reformulated as an unconstrained sampling problem via suitable transformations. Following this idea, Ref. [22] proposed a family of novel algorithms based on HMC through the introduction of Lagrange multipliers, addressing a broader range of constrained sampling problems. More recently, Ref. [23] tackled the constrained sampling problem via the mirror-Langevin algorithm. In spite of the widespread adoption of these MCMC methods, most of them have primarily focused on algorithm design and lack a rigorous theoretical analysis of convergence rates.
Among all the MCMC algorithms, a class based on Langevin dynamics has garnered significant attention in both practical applications and theoretical analyses [24-27]. There has recently been a notable increase in non-asymptotic analyses of these algorithms, initiated by the seminal work of [28]. In the setting of unconstrained sampling, Ref. [29] extended the convergence analysis to decreasing step sizes, and Refs. [30,31] derived corresponding convergence results in alternative distances. These theoretical analyses focus on the Langevin algorithm without the Metropolis-Hastings step. More recently, Refs. [32,33] have shown that incorporating the Metropolis-Hastings step can significantly improve the convergence rate of the associated Langevin algorithm. In the setting of constrained sampling, Ref. [34] introduced a Euclidean projection step into the Langevin algorithm for the constrained case (PLMC) and derived the convergence rate of the associated Markov chain. Ref. [35] presented a detailed theoretical analysis of a proximal version of the Langevin algorithm that incorporates the Moreau-Yosida envelope of the indicator function (MYULA) to handle distributions restricted to a convex body. Ref. [36] constructed the mirrored Langevin algorithm (MLD) using a mirror map to handle the constrained domain, which achieves the same convergence rate as its unconstrained counterpart [28]. However, these constrained sampling algorithms are all built on the Langevin algorithm without Metropolis-Hastings steps, and thus do not leverage the fast-mixing advantages that such steps provide.
In this paper, we consider constrained Langevin Monte Carlo with the Metropolis-Hastings step for sampling from distributions restricted to convex supports. Firstly, for certain constraints, we re-examine simple and intuitive rejection-type methods for sampling from constrained distributions, and find, somewhat surprisingly, that the corresponding algorithm still retains the advantage of rapid convergence when the step size parameter is chosen carefully. Subsequently, for more general constrained domains, we build upon the framework proposed in [35], refine it by incorporating the Metropolis-Hastings step, and analyze the convergence rate of the corresponding Markov chain. We present a detailed non-asymptotic analysis of these constrained algorithms and obtain notably enhanced convergence rates in the total variation distance. Compared with the best rate in [36], our results show that adopting the Metropolis-Hastings step in some constrained MCMC algorithms can also lead to an exponentially improved dependence on the error tolerance.
The rest of the paper is organized as follows. In Section 2, we introduce the preliminaries and the problem set-up of our study. Then, we propose the constrained sampling algorithms tailored to different types of constraint regions in Section 3. Section 4 provides the non-asymptotic theoretical results of the proposed algorithms. The numerical experiments and comparisons are presented in Section 5. Some Markov chain basics are provided in Appendix A and all the technical proofs are deferred to Appendix B.
Notation: Let ⌈a⌉ represent the smallest integer not less than a ∈ R. For a vector x ∈ R^d, we use |x|_2 to denote its Euclidean norm. For a q × q symmetric matrix A, denote by λ_min(A) and λ_max(A) the smallest and largest eigenvalues of A, respectively, and let A^T be its transpose. For two square matrices A and B, we write A ⪯ B if (B − A) is a positive semi-definite matrix. Denote by I(·) the indicator function. For r > 0, let B(x, r) = {y ∈ R^d : |y − x|_2 ≤ r} denote the closed Euclidean ball with center x and radius r. For two real-valued sequences a_n and b_n, we say a_n = O(b_n) if there exists a universal constant c such that a_n ≤ c b_n, and a_n = Õ(b_n) if a_n ≤ c_n b_n, where the sequence c_n grows at most poly-logarithmically in n. For any two probability measures µ and ν, denote by ‖µ − ν‖_TV the total variation distance between µ and ν.

Preliminaries and Problem Set-Up
In this section, we introduce MCMC sampling methods and their mixing analysis, the traditional unconstrained Metropolis-Adjusted Langevin Algorithm (MALA), and the problem set-up of this paper.

Markov Chain Monte Carlo and Mixing
Consider a distribution Π equipped with a density π : R^d → R_+ such that
π(x) ∝ exp{−U(x)},    (1)
for some potential function U : R^d → R. In certain scenarios, it is necessary to sample from this distribution. For example, many statistical applications involve estimating the expectation of a function g(X) for X ∼ π, where analytical or direct numerical computation is infeasible. Monte Carlo approximation provides a solution by generating samples from Π and using the sample mean to estimate the population expectation. Hence, the key requirement is access to samples from Π. MCMC represents a class of popular sampling algorithms, which construct an appropriate Markov chain whose stationary distribution is Π or close to Π in certain metrics. The class of Metropolis-Hastings algorithms refers to MCMC methods that ensure the corresponding Markov chain converges to the target distribution by incorporating the Metropolis-Hastings step. Metropolis-Hastings algorithms usually take two steps to generate a Markov chain: a proposal step and an accept-reject step. At each iteration, a sample is generated from the proposal distribution in the proposal step, and it is adopted as the new state of the Markov chain with a probability determined by the Metropolis-Hastings correction in the accept-reject step.
Given an error tolerance ε ∈ (0, 1), in order to obtain an ε-accurate sample with respect to some metric, one simulates the Markov chain for a certain number of steps k, as determined by a mixing time analysis. Specifically, we are concerned with how many steps the chain needs to take such that the current distribution of the chain is ε-close to the target distribution Π. Based on this, we define the ε-mixing time with respect to the target distribution Π as
τ(ε; P_0, Π) = min{k ∈ N : ‖T^k(P_0) − Π‖_TV ≤ ε},    (2)
for the error tolerance ε ∈ (0, 1), where T is the transition operator of the Markov chain and T^k(P_0) is the distribution of the Markov chain at the k-th step from an initial distribution P_0.

Metropolis-Adjusted Langevin Algorithm
Consider the problem of sampling from the distribution with density defined as (1). MALA [26,27] adopts the Gaussian distribution N{x_k − h∇U(x_k), 2hI_d} as the proposal distribution at the k-th step, where x_k is the current state and h > 0 is a suitable step size, and then performs a Metropolis-Hastings accept-reject step. MALA is the standard Metropolis-Hastings algorithm applied to the Langevin dynamics, and the associated Langevin-type algorithms belong to a family of gradient-based MCMC sampling algorithms [37]. The Langevin-type algorithms can be understood as the Euler discretization of the Langevin dynamics
dX_t = −∇U(X_t) dt + √2 dW_t,
where W_t (t ≥ 0) is the standard Brownian motion on R^d. Algorithm 1 provides the unconstrained MALA for sampling from a distribution supported on R^d, where φ_h(· | x) denotes the probability density function of N{x − h∇U(x), 2hI_d}.

Algorithm 1 Metropolis-adjusted Langevin algorithm
Input: a sample x_0 ∈ R^d from an initial distribution P_0, the step size h
for k = 0, 1, 2, . . . , K − 1 do
    Proposal step: draw z_{k+1} ∼ N{x_k − h∇U(x_k), 2hI_d}
    Accept-reject step: set x_{k+1} = z_{k+1} with probability min{1, [π(z_{k+1}) φ_h(x_k | z_{k+1})] / [π(x_k) φ_h(z_{k+1} | x_k)]}; otherwise set x_{k+1} = x_k
end for
Output: the sequence x_1, . . . , x_K
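To make the proposal and accept-reject steps concrete, the following is a minimal NumPy sketch of the unconstrained MALA described above; the quadratic potential in the usage example is an illustrative assumption, not a target from this paper.

```python
# A minimal sketch of Algorithm 1 (unconstrained MALA) with a Gaussian proposal
# N(x - h*grad_U(x), 2h*I) and a Metropolis-Hastings correction.
import numpy as np

def mala(grad_U, log_pi, x0, h, n_steps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d, x = x0.shape[0], x0.copy()
    chain = np.empty((n_steps, d))
    for k in range(n_steps):
        mean_x = x - h * grad_U(x)
        z = mean_x + np.sqrt(2.0 * h) * rng.standard_normal(d)
        mean_z = z - h * grad_U(z)
        # log proposal densities phi_h(z | x) and phi_h(x | z) (constants cancel)
        log_q_zx = -np.sum((z - mean_x) ** 2) / (4.0 * h)
        log_q_xz = -np.sum((x - mean_z) ** 2) / (4.0 * h)
        log_alpha = log_pi(z) - log_pi(x) + log_q_xz - log_q_zx
        if np.log(rng.uniform()) < log_alpha:
            x = z                      # accept the proposal
        chain[k] = x                   # otherwise stay at the current state
    return chain

# Illustrative target (assumption): pi(x) proportional to exp(-|x|^2 / 2), d = 3.
grad_U = lambda x: x
log_pi = lambda x: -0.5 * np.sum(x ** 2)
samples = mala(grad_U, log_pi, x0=np.zeros(3), h=0.1, n_steps=5000)
```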

Problem Set-Up
In this part, we consider the problem of sampling from a target distribution or posterior Π* supported on a compact set X ⊂ R^d and equipped with a density π*. It can be written in the form
π*(x) ∝ exp{−U(x)} I(x ∈ X),    (3)
for some potential function U : R^d → R. Assume that the function U(·) and the set X satisfy the following assumptions.
Assumption 1. U(·) is a twice continuously differentiable, L-smooth, and m-strongly convex function on R^d. That is, there exist universal constants L ≥ m > 0 such that
(m/2)|x − y|_2^2 ≤ U(x) − U(y) − ∇U(y)^T(x − y) ≤ (L/2)|x − y|_2^2
for any x, y ∈ R^d.
Assumption 2. X ⊂ R^d is a compact and convex set satisfying B(x*, r) ⊆ X ⊆ B(x*, R) for some universal constants 0 < r ≤ R and x* ∈ X.
Hereafter, we assume that the above two assumptions hold; they are frequently used in the literature for the analysis of constrained sampling algorithms [34-36]. We will modify the MALA in Algorithm 1 to adapt it to sampling from the above constrained distribution, analyze its non-asymptotic theoretical properties, and derive mixing time bounds in terms of the problem dimension d and the error tolerance ε.

The Constrained Langevin Algorithms
In this section, we present three sampling algorithms based on MALA to handle distributions constrained within some convex body X. As discussed in [34], the inherent challenges in constrained sampling problems arise from the complex properties of the boundary of the constraint region and from the lack of curvature in the potential function. To tackle these challenges, Ref. [34] first studied constrained sampling from the uniform distribution on X and then extended the exploration to more general distributions. Similarly, we begin our investigation by examining some simple constraint regions and progressively extend the analysis to more complex constraint scenarios.

Constrained Langevin Algorithm via Rejection
We first discuss the case where the constraint region X is a Euclidean ball in R^d, whose boundary can be characterized by a simple equation. If X = B(x*, R) for some universal constant R > 0 and x* ∈ R^d, we consider a simple and intuitive rejection-type method built into the Metropolis-Hastings accept-reject step for sampling from the distribution with density defined as (3). The constrained MALA for X = B(x*, R) is outlined in Algorithm 2, where φ_h(· | x) denotes the probability density function of the Gaussian distribution N{x − h∇U(x), 2hI_d}.

Algorithm 2 The MALA for Euclidean ball constrained domain
Input: a sample x_0 ∈ X from an initial distribution P_0, the step size h
for k = 0, 1, 2, . . . , K − 1 do
    Proposal step: draw z_{k+1} ∼ N{x_k − h∇U(x_k), 2hI_d}
    Accept-reject step: if z_{k+1} ∉ X, set x_{k+1} = x_k; otherwise set x_{k+1} = z_{k+1} with probability min{1, [π*(z_{k+1}) φ_h(x_k | z_{k+1})] / [π*(x_k) φ_h(z_{k+1} | x_k)]}, and x_{k+1} = x_k with the remaining probability
end for
Output: the sequence x_1, . . . , x_K

Compared with Algorithm 1, this modified algorithm forces the Markov chain to stay at the current state when the proposal jumps out of the limited state space X = B(x*, R), which is a quite natural extension of the unconstrained MALA. This idea is not completely novel. Ref. [34] suggested a projection step in the unadjusted Langevin algorithm for sampling from a log-concave distribution with compact support, and Ref. [10] proposed a MALA for constrained optimization, where a similar step is used to constrain the Markov chain to a given state space. Due to the favorable properties of the boundary of the constrained domain X = B(x*, R), we can establish the theoretical results for Algorithm 2; see Lemma A1 in Appendix B for details.
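The following is a minimal NumPy sketch of the idea behind Algorithm 2: the Metropolis-Hastings ratio is computed against the constrained density π*, so any proposal falling outside B(x*, R) has acceptance probability zero and the chain stays at its current state. The truncated standard Gaussian in the usage line is an illustrative assumption.

```python
# A minimal sketch of Algorithm 2: MALA whose target includes the indicator of
# the ball B(x_star, R); proposals outside the ball are always rejected.
import numpy as np

def constrained_mala_ball(grad_U, neg_U, x_star, R, x0, h, n_steps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d, x = x0.shape[0], x0.copy()
    chain = np.empty((n_steps, d))
    # log pi*(v) up to a constant: -U(v) inside the ball, -inf outside
    log_pi = lambda v: neg_U(v) if np.linalg.norm(v - x_star) <= R else -np.inf
    for k in range(n_steps):
        mean_x = x - h * grad_U(x)
        z = mean_x + np.sqrt(2.0 * h) * rng.standard_normal(d)
        mean_z = z - h * grad_U(z)
        log_q_zx = -np.sum((z - mean_x) ** 2) / (4.0 * h)
        log_q_xz = -np.sum((x - mean_z) ** 2) / (4.0 * h)
        log_alpha = log_pi(z) - log_pi(x) + log_q_xz - log_q_zx
        if np.log(rng.uniform()) < log_alpha:   # log_alpha = -inf => reject
            x = z
        chain[k] = x
    return chain

# Illustrative usage (assumption): truncated standard Gaussian on B(0, 2), d = 2.
chain = constrained_mala_ball(grad_U=lambda v: v, neg_U=lambda v: -0.5 * v @ v,
                              x_star=np.zeros(2), R=2.0,
                              x0=np.zeros(2), h=0.05, n_steps=5000)
```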

Norm-Constrained Domain
Regularization is a technique commonly used in machine learning and statistical modeling. As discussed in [38], some models with regularization can be reformulated as distributions with a norm constraint on the parameters. Recall that the L_p-norm of a vector x = (x_1, x_2, . . . , x_d)^T ∈ R^d is |x|_p = (Σ_{i=1}^d |x_i|^p)^{1/p} for p ≥ 1. For the norm-constrained domain X = {x ∈ R^d : |x|_p ≤ C} with some universal constant C > 0, we can transform X into the Euclidean ball B(0, 1) via a vector-valued function f : X → B(0, 1). Specifically, for any x = (x_1, x_2, . . . , x_d)^T ∈ X, we set y = f(x) such that y ∈ B(0, 1). Due to the bijective nature of the function f : X → B(0, 1), its inverse function g := f^{-1} : B(0, 1) → X is defined accordingly; similarly, for any y = (y_1, y_2, . . . , y_d)^T ∈ B(0, 1), we set x = g(y) such that x ∈ X. By utilizing the vector-valued functions f(·) and g(·), we can employ the Euclidean ball constrained sampling algorithm described in Section 3.1 to handle the norm-constrained case, as summarized in Algorithm 3.

Algorithm 3 The MALA for norm-constrained domain
Input: a sample x_0 ∈ X from an initial distribution P_0, the step size h
for k = 0, 1, 2, . . . , K − 1 do
    Transformation step: map the current state into the Euclidean ball, y_k = f(x_k) ∈ B(0, 1)
    Proposal and accept-reject steps: update y_k to y_{k+1} within B(0, 1) as in Algorithm 2, applied to the transformed target
    Inverse transformation step: set x_{k+1} = g(y_{k+1}) ∈ X
end for
Output: the sequence x_1, . . . , x_K

Compared with Algorithm 2, Algorithm 3 achieves the X → B(0, 1) → X transformation by incorporating two transformation steps, thereby addressing norm-constrained sampling problems. The main purpose of this approach is to facilitate the theoretical analysis by leveraging the well-understood properties of the boundary of the Euclidean ball, compared to the boundary of the norm-constrained domain; see Appendix B.7 for details.
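Since the explicit form of f is not reproduced in this excerpt, the sketch below uses radial rescaling as one possible bijection between the L_p ball {|x|_p ≤ C} and the unit Euclidean ball; it is an illustrative assumption, not necessarily the map used in the paper.

```python
# One candidate bijection (assumption) between the L_p ball of radius C and
# B(0, 1): rescale each point radially by the ratio of its L_p and L_2 norms.
import numpy as np

def f(x, p, C):
    """Map the L_p ball of radius C into B(0, 1) by radial rescaling."""
    nrm2 = np.linalg.norm(x)
    if nrm2 == 0.0:
        return x.copy()
    return (np.linalg.norm(x, ord=p) / (C * nrm2)) * x

def g(y, p, C):
    """Inverse map from B(0, 1) back to the L_p ball of radius C."""
    nrmp = np.linalg.norm(y, ord=p)
    if nrmp == 0.0:
        return y.copy()
    return (C * np.linalg.norm(y) / nrmp) * y

# Quick sanity check of the bijection on a random point of the L_1 unit ball.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
x = 0.9 * x / np.linalg.norm(x, ord=1)          # |x|_1 = 0.9 <= C = 1
y = f(x, p=1, C=1.0)
assert np.linalg.norm(y) <= 1.0 + 1e-12
assert np.allclose(g(y, p=1, C=1.0), x)
```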

Constrained Langevin Algorithm via an Approximation of the Indicator Function
We proceed to discuss constrained sampling for more general constraint regions. Define the indicator function of X as
ι_X(x) = 0 if x ∈ X, and ι_X(x) = +∞ otherwise,    (4)
for any x ∈ R^d. Then, the target distribution Π* with density defined as (3) can be reformulated as
π*(x) ∝ exp{−V_X(x)},    (5)
with the potential
V_X(x) = U(x) + ι_X(x),    (6)
where ι_X(·) is defined in (4). Notice that ι_X(·) is a convex function on R^d. Under Assumption 1, the potential function V_X(·) is strongly convex on R^d. By this transformation, the constrained sampling problem is converted into an unconstrained counterpart. However, the non-differentiability of the function V_X(·) on the boundary of X poses a challenge when applying gradient-based unconstrained sampling algorithms. To address this issue, we can approximate the function ι_X(·) by a differentiable function such as the Moreau-Yosida (MY) envelope [35]. The MY envelope of ι_X(·) is defined as
ι^λ_X(x) = min_{y∈R^d} {ι_X(y) + (2λ)^{-1}|x − y|_2^2} = (2λ)^{-1}|x − Proj_X(x)|_2^2
for any x ∈ R^d, where λ > 0 is a regularization parameter and Proj_X(·) is the projection function onto X. By [35], the function ι^λ_X(·) is convex and continuously differentiable with gradient ∇ι^λ_X(x) = λ^{-1}{x − Proj_X(x)} for any x ∈ R^d, and its gradient is λ^{-1}-Lipschitz, that is, |∇ι^λ_X(x) − ∇ι^λ_X(y)|_2 ≤ λ^{-1}|x − y|_2 for any x, y ∈ R^d. Then, the approximation of V_X(·) defined as (6) can be given by
V^λ_X(x) = U(x) + ι^λ_X(x),
which is continuously differentiable, smooth, and strongly convex on R^d if U(·) satisfies Assumption 1. Define the distribution Π^{*,λ} with density π^{*,λ}(x) ∝ exp{−V^λ_X(x)}. Recall that the target distribution Π* has the reformulated density defined as (5). As discussed in [35], under some mild conditions including Assumptions 1 and 2, the approximation error between Π* and Π^{*,λ} in total variation distance can be made arbitrarily small by adjusting the regularization parameter λ. Therefore, we can utilize gradient-based unconstrained sampling algorithms, such as the MALA presented in Algorithm 1, to construct an appropriate Markov chain whose stationary distribution is close to Π*; see Algorithm 4 for details, where φ^λ_h(· | x) denotes the probability density function of the Gaussian proposal N{x − h∇V^λ_X(x), 2hI_d}.

Algorithm 4 The MALA for convex constrained domain
Input: a sample x_0 ∈ R^d from an initial distribution P_0, the step size h, the regularization parameter λ
for k = 0, 1, 2, . . . , K − 1 do
    Proposal step: draw z_{k+1} ∼ N{x_k − h∇V^λ_X(x_k), 2hI_d}
    Accept-reject step: set x_{k+1} = z_{k+1} with probability min{1, [π^{*,λ}(z_{k+1}) φ^λ_h(x_k | z_{k+1})] / [π^{*,λ}(x_k) φ^λ_h(z_{k+1} | x_k)]}; otherwise set x_{k+1} = x_k
end for
Output: the sequence x_1, . . . , x_K
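A minimal sketch of Algorithm 4 follows: MALA is run on the smoothed potential V^λ_X(x) = U(x) + (2λ)^{-1}|x − Proj_X(x)|_2^2. The ball projection and the Gaussian potential in the usage example are illustrative assumptions; any convex set with a computable projection can be substituted.

```python
# A minimal sketch of Algorithm 4: MALA on the Moreau-Yosida smoothed potential
# V_lambda(x) = U(x) + |x - Proj_X(x)|^2 / (2*lambda).
import numpy as np

def proj_ball(x, center, radius):
    v = x - center
    n = np.linalg.norm(v)
    return x if n <= radius else center + radius * v / n

def smoothed_mala(grad_U, U, proj, lam, x0, h, n_steps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d, x = x0.shape[0], x0.copy()
    V = lambda v: U(v) + np.sum((v - proj(v)) ** 2) / (2.0 * lam)   # smoothed potential
    gV = lambda v: grad_U(v) + (v - proj(v)) / lam                  # its gradient
    chain = np.empty((n_steps, d))
    for k in range(n_steps):
        mean_x = x - h * gV(x)
        z = mean_x + np.sqrt(2.0 * h) * rng.standard_normal(d)
        mean_z = z - h * gV(z)
        log_q_zx = -np.sum((z - mean_x) ** 2) / (4.0 * h)
        log_q_xz = -np.sum((x - mean_z) ** 2) / (4.0 * h)
        log_alpha = (V(x) - V(z)) + log_q_xz - log_q_zx
        if np.log(rng.uniform()) < log_alpha:
            x = z
        chain[k] = x
    return chain

# Illustrative usage (assumption): Gaussian potential softly restricted to B(0, 1).
chain = smoothed_mala(grad_U=lambda v: v, U=lambda v: 0.5 * v @ v,
                      proj=lambda v: proj_ball(v, np.zeros(2), 1.0),
                      lam=1e-2, x0=np.zeros(2), h=0.02, n_steps=5000)
```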

Theoretical Results
In this section, we first analyze the properties of the Markov chains determined by the three constrained sampling algorithms presented in Section 3, and then establish the mixing time bounds of these Markov chains.

Properties of the Markov Chains
The outcomes {x_1, . . . , x_K} from each algorithm presented in Section 3 form a Markov chain, whose properties are established in Propositions 1, 2, and 3, respectively, as below.
Proposition 1. For X = B(x*, R) with some universal constant R > 0 and x* ∈ R^d, the Markov chain determined by Algorithm 2 is Π*-irreducible, smooth, and reversible with respect to the stationary distribution Π* with density π* defined as (3). (The definitions of Π*-irreducible, reversible, and smooth Markov chains are deferred to Appendix A.)
Remark 1. Proposition 1 shows that the Markov chain determined by Algorithm 2 enjoys the same series of nice properties as the unconstrained MALA, which forms the basis for studying the mixing time bounds of this Markov chain.
Similar properties hold for the Markov chains determined by Algorithms 3 and 4 as well.
Proposition 2. For X = {x ∈ R d : |x| p ≤ C} with some universal constant C > 0, the Markov chain determined by Algorithm 3 is Π * -irreducible, smooth, and reversible with respect to the stationary distribution Π * with density π * defined as (3).

Mixing Time Bounds of the Markov Chains
For a distribution Π supported on X ⊂ R^d with density π, recall that the ε-mixing time with respect to Π is defined in (2). A β-warm initial distribution P_0 with density p_0 with respect to the distribution Π is commonly used for the mixing time analysis, which satisfies sup_{x∈X} {p_0(x)/π(x)} ≤ β for some finite constant β > 0. We say that the Markov chain is ς-lazy if at each iteration the chain is forced to stay at its previous state with probability at least ς. Laziness is a convenient assumption for the theoretical analysis of the convergence rate, but it is unlikely to be used in practice since lazy steps slow down the mixing of the Markov chain. Given the definitions above and the Markov chain basics in Appendix A, we can obtain the following result for well-behaved Markov chains defined on {X, B(X)}.
Lemma 1. Consider a reversible, Π-irreducible, ς-lazy, and smooth Markov chain defined on {X, B(X)} with stationary distribution Π supported on X. For any error tolerance ε ∈ (0, 1) and β-warm initial distribution P_0, the ε-mixing time with respect to Π satisfies an upper bound expressed through the extended conductance profile, where τ(ε; P_0, Π) and Ω̃(·) are defined, respectively, in (2) and (A4).

Remark 2.
Lemma 1 provides a control on the mixing time of a Markov chain on X in terms of Ω(·). This result can be seen as an extension of Lemma 3 in [33] to the case of a Markov chain defined on {X, B(X)}. We can then readily derive the mixing time bound once a lower bound for Ω(·) is known.
The following lemma gives a lower bound for Ω(·).

Lemma 2.
Assume that the distribution Π supported on X with density π satisfies the log-isoperimetric inequality defined in (A1) for some constant ĉ > 0. If the one-step transition distributions T_x, x ∈ X, of a reversible Markov chain vary sufficiently smoothly in x in total variation distance, then the conductance profile Ω(·) of this Markov chain, defined in (A3), admits an explicit lower bound.

Remark 3.
Lemma 2 states a lower bound for the conductance profile of a Markov chain on X. Similar results can be found in [33,39,40]. Lemma 2, together with Lemma 1, provides a general framework for obtaining mixing time bounds for well-behaved Markov chains on X.
Based on Lemmas 1 and 2, we can derive upper bounds on the ε-mixing times of the Markov chains determined by the three constrained sampling algorithms presented in Section 3.
Theorem 1. Given a β-warm initial distribution P_0 and an error tolerance ε ∈ (0, 1), the Markov chain determined by Algorithm 2 satisfies an ε-mixing time bound that is linear in the dimension d and logarithmic in 1/ε for any step size h within the admissible range. This result improves upon the previously known mixing time bounds for constrained sampling algorithms in [34-36]; see Table 1 for details.
Table 1. Mixing time bounds of the competing constrained sampling algorithms (PLMC in [34], MYULA in [35], and MLD in [36]) and of Algorithms 2-4 in this paper, under the curvature condition mI_d ⪯ ∇²U(x) ⪯ LI_d.

For sampling from the norm-constrained domain X = {x ∈ R^d : |x|_p ≤ C} with some universal constant C > 0, we transform the problem into sampling from the Euclidean ball B(0, 1) as shown in Algorithm 3; a similar result then holds for the Markov chain determined by Algorithm 3 as well. For the Markov chain determined by Algorithm 4, we can also derive a sharp mixing time bound by combining the mixing time analysis for unconstrained log-concave sampling in [33] with the approximation error between Π* and Π^{*,λ} established in [35].

Theorem 2. Let Assumptions 1 and 2 hold, and assume that there exists a universal constant C̄ > 0 such that exp{inf_{x∈X^c} U(x) − sup_{x∈X} U(x)} ≥ C̄. Given a feasible initial distribution P_0 and an error tolerance ε ∈ (0, 1), the Markov chain determined by Algorithm 4 attains an ε-mixing time with respect to Π* of order d³ε^{-2} log(d/ε), up to a universal constant c > 0 and factors depending on L, m, and r, where V^λ_X(·) is the smoothed potential defined in Section 3 with λ := 8π^{-1}ε²r²d^{-2}C̄², and Π* has density π* defined as (3).

Remark 5.
Theorem 2 presents a mixing time bound for Algorithm 4 with a feasible initial distribution of order O{d³ε^{-2} log(d/ε)}, up to the constants L, m, and r specified in Assumptions 1 and 2, provided the regularization parameter λ is chosen as in Theorem 2. This result improves upon the mixing time bound for the constrained sampling algorithm without the Metropolis-Hastings step in [35]; see Table 1 for details.

Numerical Experiments
In this section, we conduct numerical experiments to validate the theoretical properties derived in Section 4 and compare the constrained sampling algorithms presented in Section 3 with the three competing MCMC algorithms for sampling from constrained log-concave distributions listed in Table 1 under various simulation settings. The implementation of these algorithms involves the selection of a step size. For Algorithms 2 and 3, we follow Theorem 1 and Corollary 3, respectively, to select the step size. For Algorithm 4, we choose the step size as in [32] for the MALA for sampling from log-concave distributions without constraints. The step size choices of the other three MCMC algorithms follow the recommendations in the associated papers; see Table 2 for details.

Table 2. Step sizes for sampling from log-concave distributions with bounded support.

Sampling from the Euclidean Ball Constrained Domain
We consider the problem of sampling from a truncated multivariate Gaussian distribution on X, which admits the density
π*(x) ∝ exp{−2^{-1}(x − µ)^T Σ^{-1}(x − µ)} I(x ∈ X),
where the mean µ = 0 and the covariance matrix Σ ∈ R^{d×d} is a diagonal matrix with λ_max(Σ) = 10 and λ_min(Σ) = 1. For this target distribution, the potential function U(·) and its derivatives are given by U(x) = 2^{-1}x^T Σ^{-1} x, ∇U(x) = Σ^{-1}x, and ∇²U(x) = Σ^{-1}. Therefore, the function U(·) is smooth with parameter L = λ_min^{-1}(Σ) and strongly convex with parameter m = λ_max^{-1}(Σ) on R^d. We select X = B(0, R) with R = 5 and the initial distribution P_0 = N_X{0, (2L)^{-1}I_d}, and use the inverse transformation algorithm [14] to generate an initial point from P_0. We compare Algorithm 2 with the three sampling algorithms from the literature given in Table 2, and follow the recommendations in the associated papers to choose the initial points of those algorithms.
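A sketch of this experimental setup is given below, assuming (for illustration) evenly spaced diagonal entries of Σ between 1 and 10; the initial point is drawn by simple rejection from N{0, (2L)^{-1}I_d} restricted to B(0, 5), whereas the paper uses the inverse transformation method of [14].

```python
# Setup of the truncated Gaussian target described above. The even spacing of
# the diagonal of Sigma and the rejection-based initial draw are assumptions
# made only for illustration.
import numpy as np

d, R = 10, 5.0
rng = np.random.default_rng(1)
eigs = np.linspace(1.0, 10.0, d)               # diagonal of Sigma (assumed spacing)
Sigma_inv = np.diag(1.0 / eigs)
L, m = 1.0 / eigs.min(), 1.0 / eigs.max()      # smoothness and strong convexity

U = lambda x: 0.5 * x @ Sigma_inv @ x
grad_U = lambda x: Sigma_inv @ x

def draw_initial(max_tries=10_000):
    """Rejection sampling from N(0, (2L)^{-1} I_d) restricted to B(0, R)."""
    for _ in range(max_tries):
        x = rng.standard_normal(d) / np.sqrt(2.0 * L)
        if np.linalg.norm(x) <= R:
            return x
    raise RuntimeError("no feasible initial point found")

x0 = draw_initial()
```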

The Trace Graphs of Sampling Algorithms
To initiate a preliminary assessment of the convergence properties of these algorithms, we begin with simple sample trace plots. Write x = (x_1, . . . , x_d)^T ∈ R^d and µ = (µ_1, . . . , µ_d)^T ∈ R^d. Figure 1 presents the trace plots of the first coordinate x_1 for the four sampling algorithms. It is evident that Algorithm 2 achieves sample means closest to µ_1 = 0, along with the least variance. Conversely, the sample means obtained from the other three sampling algorithms exhibit a certain degree of deviation from µ_1 = 0, accompanied by heavier tails.

Dimension and Error Dependence of Algorithm 2
The goal of this simulation is to demonstrate that both the dimension dependence and the error tolerance dependence of the mixing time of Algorithm 2 conform to the theoretical results shown in Theorem 1.
Since the total variation distance between continuous measures is hard to estimate, we use the error in quantiles along some direction for convergence diagnostics in the experiments. In the spirit of [33], we measure the error in the 95% quantile between the sample distribution and the true distribution along the eigenvector of Σ corresponding to λ_min(Σ). The approximate mixing time k̂_mix(ε) is then defined as the smallest iteration k at which this error between the distribution of the Markov chain at iteration k and the target distribution falls below the error tolerance ε. We simulate 20 independent runs of the Markov chain with N = 20,000 samples per run to determine the approximate mixing time k̂_mix(ε); the final k̂_mix(ε) is the average over these 20 independent runs. Figure 3a shows the approximate mixing time k̂_mix(0.2) as a function of the dimension d for Algorithm 2. A linear regression of k̂_mix(0.2) on d indicates that the mixing time of Algorithm 2 is linear in d, with slope 4.137 and R-squared 0.991. Figure 3b presents the dependence of the approximate mixing time k̂_mix(ε) on the inverse error tolerance ε^{-1} for Algorithm 2 with d = 4. A linear regression of k̂_mix(ε) on log(ε^{-1}) suggests that the mixing time of Algorithm 2 is linear in log(ε^{-1}), with slope 15.854 and R-squared 0.994, which is consistent with the theoretical results given in Theorem 1.
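The sketch below illustrates one way to compute such an approximate mixing time from parallel chains; the generic transition kernel, the reference quantile, and the random-walk stand-in are assumptions made for illustration and do not reproduce the exact protocol of the experiments.

```python
# A hedged sketch of the quantile-based convergence diagnostic: run several
# independent chains and declare (approximate) mixing once the empirical 95%
# quantile of the projected states is within eps of a reference quantile.
import numpy as np

def approx_mixing_time(transition, x0, direction, ref_q95, eps,
                       n_chains=20, max_iter=20_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    states = np.tile(x0, (n_chains, 1))
    for k in range(1, max_iter + 1):
        for c in range(n_chains):
            states[c] = transition(states[c], rng)
        q95 = np.quantile(states @ direction, 0.95)
        if abs(q95 - ref_q95) < eps:
            return k
    return max_iter

# Stand-in transition kernel (assumption): one random-walk Metropolis step for
# a standard Gaussian truncated to B(0, 5), used purely for illustration.
def rw_step(x, rng, step=0.3, R=5.0):
    z = x + step * rng.standard_normal(x.shape[0])
    if np.linalg.norm(z) <= R and np.log(rng.uniform()) < 0.5 * (x @ x - z @ z):
        return z
    return x

k_hat = approx_mixing_time(rw_step, x0=np.zeros(4), direction=np.eye(4)[0],
                           ref_q95=1.645, eps=0.2)
```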

Comparison with Competitive Algorithms
Figure 4a shows the dependence of the approximate mixing time k̂_mix(0.2) on the problem dimension d for the four sampling algorithms. Compared with the other three algorithms, the approximate mixing time of Algorithm 2 is more robust to the dimension. When d is small, the approximate mixing times of the four algorithms are comparatively close. However, as the dimension d increases, the approximate mixing times of PLMC and MYULA grow rapidly, exhibiting a polynomial order in d. Moreover, the dimension dependence of MLD and Algorithm 2 both indicate a linear growth trend, with MLD requiring a few more steps than Algorithm 2 to reach the same error tolerance. Figure 4b presents the dependence of the approximate mixing time k̂_mix(ε) on the inverse error tolerance ε^{-1} for the four sampling algorithms with d = 4. The regression analysis shows that the approximate mixing time k̂_mix(ε) of PLMC and MYULA increases polynomially in ε^{-1}. When ε^{-1} is relatively small, MLD and Algorithm 2 have similar approximate mixing times; as ε^{-1} increases, the advantage of Algorithm 2 becomes more pronounced. For MLD, a linear regression of k̂_mix(ε) on ε^{-2} yields a slope of 1.934 and R-squared 0.984, suggesting an error tolerance dependence of order ε^{-2}.
It is noteworthy that the above analysis not only suggests significantly better dimension and error tolerance dependence of the constrained MALA, but also partly verifies the theoretical convergence rates of the three methods used for comparison.

Bayesian Regularized Regression
Regularized regression involves adding a penalty term to the objective function of the regression model, which helps to control the complexity of the model and prevents it from fitting the noise in the data. In this section, we validate the effectiveness of Algorithm 3 for constrained sampling in Bayesian regularized regression.
Given independent and identically distributed observations y = (y_1, y_2, . . . , y_n)^T ∈ R^n that follow a Gaussian distribution with mean Xβ and covariance matrix σ²I_n, we consider the regression model whose parameters are obtained by minimizing the squared Euclidean norm of the residual subject to a norm constraint on the regression parameter:
min_{β∈R^d} |y − Xβ|_2^2 subject to |β|_p ≤ C,
for some universal constant C > 0, where X ∈ R^{n×d} is the design matrix, β ∈ R^d is the regression parameter, and |β|_p is the L_p-norm of β. In the Bayesian setting, many regularization techniques correspond to imposing certain prior distributions on the model parameters. We then consider sampling from the distribution with density
π(β | y) ∝ exp{−(2σ²)^{-1}|y − Xβ|_2^2} I(β ∈ X)
and obtaining the parameter estimate β̂ via the maximum a posteriori probability (MAP) estimate, where X = {x ∈ R^d : |x|_p ≤ C}. We use the diabetes data studied in [41], and set the burn-in period to 10³ iterations and σ² = 1. Figure 5 presents the paths of the parameter estimates under different norm constraints, which demonstrates that Algorithm 3 can effectively handle norm-constrained sampling problems.
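As an illustration of the ingredients needed to run a norm-constrained sampler on this model, the sketch below forms the potential U(β) = |y − Xβ|_2²/(2σ²), its gradient, and the constraint check; the synthetic data merely stand in for the diabetes data of [41].

```python
# Ingredients for sampling the norm-constrained regression posterior: potential,
# gradient, and constraint indicator. The synthetic design matrix and responses
# below are assumptions used in place of the diabetes data.
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma2, C, p = 100, 10, 1.0, 1.0, 1       # p = 1 mimics a lasso-type constraint
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:3] = [0.4, -0.3, 0.2]                # |beta_true|_1 = 0.9 <= C
y = X @ beta_true + np.sqrt(sigma2) * rng.standard_normal(n)

U = lambda b: np.sum((y - X @ b) ** 2) / (2.0 * sigma2)
grad_U = lambda b: -X.T @ (y - X @ b) / sigma2
in_constraint = lambda b: np.linalg.norm(b, ord=p) <= C

# These callables are what a norm-constrained sampler such as Algorithm 3 would
# consume; a MAP estimate can then be taken as the highest-density sample.
```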

Truncated Multivariate Gaussian Distribution
The final comparison examines the sampling performance of MYULA in [35] and Algorithm 4 in the setting of a more general truncated multivariate Gaussian distribution. We consider the same setup as in [35]. Specifically, the density of the target distribution is defined as
π*(x) ∝ exp{−2^{-1}(x − µ)^T Σ^{-1}(x − µ)} I(x ∈ X),
where X is a convex set and the origin 0 lies on its boundary. Let µ = 0, let the covariance matrix Σ ∈ R^{d×d} have (i, j)-th element (Σ)_{i,j} = 1/(1 + |i − j|), and let X = [0, 5] × [0, 1]. We generate 10^6 samples for Algorithm 4 and set the burn-in period to the initial 10% of iterations. Table 3 presents the mean and covariance estimates of the target distribution based on the samples generated by MYULA and Algorithm 4. For comparison purposes, the results of MYULA align with those reported in [35]. With the same number of iterations, Algorithm 4 outperforms MYULA in terms of the estimation results, which indicates that incorporating the Metropolis-Hastings step in Algorithm 4 improves the mixing.
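For this box constraint, the projection required by Algorithm 4 is a coordinate-wise clip; a short sketch (with d = 2, as implied by X = [0, 5] × [0, 1]) is given below.

```python
# Projection onto the box [0, 5] x [0, 1] and the corresponding Moreau-Yosida
# gradient of the smoothed potential, for the Gaussian target with
# (Sigma)_{i,j} = 1/(1 + |i - j|) and mu = 0 described above (d = 2).
import numpy as np

lo, hi = np.array([0.0, 0.0]), np.array([5.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])       # entries 1/(1 + |i - j|)
Sigma_inv = np.linalg.inv(Sigma)

proj_box = lambda x: np.clip(x, lo, hi)
grad_U = lambda x: Sigma_inv @ x                 # U(x) = x^T Sigma^{-1} x / 2
grad_V = lambda x, lam: grad_U(x) + (x - proj_box(x)) / lam  # smoothed potential

# Example: gradient of the smoothed potential at a point outside the box.
print(grad_V(np.array([6.0, -0.5]), lam=1e-2))
```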

Discussion and Conclusions
In this article, we propose three sampling algorithms based on Langevin Monte Carlo with Metropolis-Hastings steps to handle distributions constrained within a convex body, and establish the mixing time bounds of these algorithms for sampling from strongly log-concave distributions. Under certain conditions, these bounds are sharper than those of existing algorithms in the literature. Furthermore, in comparison to existing algorithms, the proposed constrained sampling algorithms are simpler, more intuitive, and easier to implement in some cases.
Our results demonstrate that a sampling algorithm enhanced with the Metropolis-Hastings step offers an effective solution for tackling some constrained sampling problems, and the numerical experiments illustrate the advantages of the proposed algorithms. Although we focus on strongly log-concave distributions in the theoretical analysis, the proposed algorithms can be readily applied to weakly log-concave distributions or non-convex potential functions. At the same time, we recognize that various aspects of the sampling algorithms can be further improved; for instance, potential enhancements could involve multiple importance sampling methods or adaptive techniques. We leave the investigation of the theoretical properties under such scenarios for future work.

Data Availability Statement:
The data used to support the findings of this study are included within the article.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Some Markov Chain Basics
Consider time-homogeneous Markov chains (a Markov chain is time-homogeneous if the probability of any state transition is independent of time) defined on a measurable state space {X, B(X)} with a transition probability Ψ : X × B(X) → [0, 1]. The transition probability satisfies Ψ(x, dy) ≥ 0 for any x ∈ X, and ∫_{y∈X} Ψ(x, dy) = 1.
The k-th step transition probability is defined recursively as Ψ^k(x, A) = ∫_X Ψ^{k−1}(y, A) Ψ(x, dy) for any x ∈ X and A ∈ B(X), with Ψ^1 = Ψ. For a distribution Π on X, a Markov chain defined on {X, B(X)} is called Π-irreducible if for each A ∈ B(X) with Π(A) > 0 and each x ∈ X, there exists k ∈ N such that Ψ^k(x, A) > 0. A Markov chain defined on {X, B(X)} with transition probability Ψ : X × B(X) → [0, 1] and stationary distribution Π is called reversible if it satisfies the detailed balance condition Π(dx)Ψ(x, dy) = Π(dy)Ψ(y, dx) for any x, y ∈ X.
Smooth chain assumption. We say that the Markov chain satisfies the smooth chain condition if its transition probability Ψ : X × B(X ) → [0, 1] can be expressed in the form Ψ(x, dy) = ψ(x, y) dy + ι x δ x (dy) for any x, y ∈ X , where ψ(· , ·) is the transition kernel satisfying ψ(x, y) ≥ 0 for any x, y ∈ X , ι x denotes the one-step probability of the chain to stay at its current state x, and δ x (·) is the Dirac-delta function at x.
Log-isoperimetric inequality. A distribution Π supported on X with density π is said to satisfy the log-isoperimetric inequality (A1) with some constant ĉ > 0 if, for any partition (S_1, S_2, S_3) of X, the measure Π(S_3) of the separating set is bounded below in terms of d(S_1, S_2), min{Π(S_1), Π(S_2)}, and ĉ, where Π(S_i) = ∫_{S_i} π(x) dx and d(S_1, S_2) = inf_{x∈S_1, y∈S_2} |x − y|_2.
Conductance profile. Given a Markov chain with transition probability Ψ : X × B(X) → [0, 1] and stationary distribution Π with density π, its stationary flow ω(·) : B(X) → R is defined as ω(S) = ∫_S Ψ(x, S^c) π(x) dx for any S ∈ B(X), where S^c = X \ S. For any v ∈ (0, 1/2], the conductance profile is given by
Ω(v) = inf{ω(S)/Π(S) : S ∈ B(X), 0 < Π(S) ≤ v}.    (A3)
Furthermore, the extended conductance profile is defined as
Ω̃(v) = Ω(v) for v ∈ (0, 1/2], and Ω̃(v) = Ω(1/2) for v ∈ (1/2, ∞).    (A4)

Appendix B. Proofs
Appendix B.1. Proof of Proposition 1
Proof of Proposition 1. Denote by Ψ(x, ·) the transition probability at x ∈ X of the Markov chain determined by Algorithm 2. For any x ∈ X, let P_{x,h} = N{x − h∇U(x), 2hI_d} with step size h, and write the density of P_{x,h} as φ_h(· | x). For any x ∈ X, denote by α_x(y) = min{1, R_x(y)} the acceptance probability for any y ∈ R^d, where R_x(y) = {π*(y)φ_h(x | y)}/{π*(x)φ_h(y | x)}. Then, the transition probability of the associated Markov chain at x ∈ X can be written as
Ψ(x, dy) = ψ(x, y) dy + ι_x δ_x(dy),    (A5)
where ψ(x, y) = α_x(y)φ_h(y | x) I(y ∈ X) is the density of the absolutely continuous part, ι_x = 1 − ∫_X α_x(y)φ_h(y | x) dy is the probability of staying at x, and δ_x(·) is the Dirac-delta function at x. By the smooth chain condition given in Appendix A, we know the Markov chain with the transition probability Ψ(·, ·) is smooth.
Recall that Π* is the distribution on X with the density π* defined as (3). Since π*(x)φ_h(y | x)α_x(y) = min{π*(x)φ_h(y | x), π*(y)φ_h(x | y)} = π*(y)φ_h(x | y)α_y(x) for any x, y ∈ X, we have π*(x)ψ(x, y) = π*(y)ψ(y, x) for any x, y ∈ X. Together with (A5), it holds that ∫_A Ψ(x, B)π*(x) dx = ∫_B Ψ(y, A)π*(y) dy for any A, B ∈ B(X), and hence ∫_X Ψ(x, A)π*(x) dx = Π*(A) for any A ∈ B(X). Thus, Π* is the stationary distribution of the Markov chain with transition probability Ψ(·, ·), and the Markov chain is reversible. Furthermore, by (A5), we have Ψ(x, A) ≥ ∫_{A\{x}} ψ(x, y) dy for any x ∈ X and A ∈ B(X). For any A ∈ B(X) with Π*(A) > 0, since Π*(A) = ∫_A π*(x) dx, the Lebesgue measure of A is nonzero. Since 0 < α_x(y) ≤ 1 for any x, y ∈ X and X = B(x*, R) for some universal constant R > 0 and x* ∈ R^d, we know ψ(x, y) > 0 for any x, y ∈ X; as the Lebesgue measure of A \ {x} is also nonzero, this implies Ψ(x, A) ≥ ∫_{A\{x}} ψ(x, y) dy > 0. Thus, the Markov chain with the transition probability Ψ(·, ·) is Π*-irreducible. This completes the proof of Proposition 1.
Proof of Lemma 1. For any g ∈ L²(π), let E_π(g) = ∫_X g(x)π(x) dx and Var_π(g) = E_π(g²) − {E_π(g)}². For a measurable non-empty subset S ⊂ X, the spectral gap restricted to S is defined as the infimum, over c_0^+(S) = {g ∈ L²(π) : supp(g) ⊂ S, g ≥ 0, Var_π(g) > 0}, of the ratio of the Dirichlet form of g to Var_π(g), and the spectral profile Λ(v) is defined for any v ∈ (0, ∞) as the infimum of this restricted spectral gap over all S with 0 < Π(S) ≤ v. If the current state of a Markov chain admits the distribution P with density p, we write T(p) for the distribution of its next state. The proof of Lemma 1 includes two steps. The first step is to establish a bound on the convergence of the chain in terms of the spectral profile Λ(·). The second step is to show that the spectral profile and the conductance profile defined in (A3) are related by a Cheeger-type inequality. Notice that Π(X) = 1. Replacing the restricted conductance profile and restricted spectral gap in the proof of Lemma 1 in [33] by the conductance profile and spectral gap, respectively, and using arguments similar to those in the proof of Lemma 1 in [33], we can obtain the results of the two steps; Lemma 1 then follows immediately.
Appendix B.5. Proof of Lemma 2
Proof of Lemma 2. Denote by π the density function of the distribution Π. Consider any measurable non-empty subset A_1 ⊂ X such that 0 < Π(A_1) < 1. Given δ > 0, we define sets that partition X according to the one-step transition probability Ψ : X × B(X) → [0, 1] of the considered Markov chain.
Lemma A2. Let X = B(x*, R) for some universal constant R > 0 and x* ∈ R^d, and let Assumption 1 hold. The target distribution Π* with density π* defined as (3) satisfies the log-isoperimetric inequality given in (A1) with constant ĉ = m^{-1/2}, where m is specified in Assumption 1.