Article

Convergence Guarantees for Time-Inhomogeneous Uniform-Rate Discrete Diffusion Models

1
Luddy School of Informatics, Computing, and Engineering, Indiana University Indianapolis, Indianapolis, IN 46202, USA
2
Department of Electrical and Computer Engineering, UC Davis, Davis, CA 95616, USA
3
Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA
*
Author to whom correspondence should be addressed.
Entropy 2026, 28(6), 675; https://doi.org/10.3390/e28060675 (registering DOI)
Submission received: 16 May 2026 / Revised: 3 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

Abstract

Discrete diffusion models have become an important class of generative models for categorical data, yet their theoretical understanding remains largely limited to time-homogeneous noise schedules. In this work, we study uniform-rate discrete diffusion models with time-inhomogeneous continuous-time Markov chain forward processes. We establish convergence guarantees for practical reverse-time samplers by directly controlling the total variation distance, avoiding the indirect route of first bounding KL divergence and then applying Pinsker’s inequality. Our analysis decomposes the sampling error into initialization, score-estimation, discretization, and early-stopping errors, and explicitly characterizes how each term depends on the accumulated noise, the local noise rate, and the smoothness of the noise schedule. Under suitable regularity conditions on the noise schedule, we further derive step-complexity guarantees that match the order of existing results for homogeneous samplers.

1. Introduction

Generative modeling aims to produce samples that approximate the training data distribution. Diffusion models [1,2,3] have become a leading approach, with strong performance in image, video, audio, and text generation [4,5,6,7,8]. They generate data by learning to reverse a forward noising process; for discrete data, this process is naturally modeled as a continuous-time Markov chain (CTMC) on a finite state space.
A growing body of work suggests that for discrete data such as natural language and graphs, discrete diffusion models offer greater advantages and more flexibility than their continuous counterparts [9,10], and a parallel line of work studies the theoretical convergence of discrete diffusion samplers. For uniform-rate discrete diffusion models, existing guarantees have primarily been developed for time-homogeneous CTMCs. Early work analyzed τ-leaping under total variation distance [11], while subsequent studies established convergence guarantees through score-entropy control for uniformization, exact-step, and τ-leaping-type samplers [12,13,14,15,16]. More recently, sharper analyses have been obtained for practical samplers such as the Euler method and Tweedie τ-leaping [17,18]. However, these results largely focus on homogeneous noise schedules, leaving open the question of how non-constant noise schedules affect the sampling error.
This limitation is important because time-inhomogeneous schedules are widely used in practice and can substantially affect generative performance. For continuous diffusion models, ref. [19] shows empirically that noise scheduling is a crucial design choice: the optimal schedule can depend on the task, and higher-resolution generation may benefit from schedules that place more mass on noisier regimes. These observations suggest that the noise schedule is not merely a technical detail but rather a central component controlling the trade-off between data corruption, score estimation, and reverse-time discretization. Motivated by this, we develop a convergence analysis for non-homogeneous uniform-rate discrete diffusion models. Our results quantify how the accumulated noise, the local rate $\beta_t$, and the smoothness of the schedule jointly determine initialization, discretization, and early-stopping errors.

2. Related Work

Theoretical studies of uniform-rate discrete diffusion models have so far largely considered time-homogeneous CTMCs. Table 1 provides a summary. An early result by [11] analyzed τ-leaping in total variation distance but required strong assumptions on the estimated reverse rates and led to suboptimal parameter dependence. Subsequent works focused on score-entropy-based guarantees for random-step samplers: ref. [12] studied a uniformization sampler on the hypercube, and ref. [16] extended the analysis to general product spaces $[S]^d$.
Other recent works analyze deterministic-step samplers. Some assume an exact per-step solver [13] or a specially designed sampler [14,20], while others study existing τ-leaping-type samplers used in practice [15,16,17]. Beyond τ-leaping, ref. [17] developed a sharp analysis for the Euler method and Tweedie τ-leaping [8], avoiding a Girsanov change-of-measure argument. Besides regular samplers, refs. [21,22] investigated possible accelerations of these vanilla samplers. Most of these results are stated in KL divergence and then converted to total variation distance, which can introduce looseness. Moreover, these works focus on time-homogeneous noise schedules, which are almost never used in practice. In contrast, our work directly analyzes total variation error for time-inhomogeneous uniform-rate processes, thereby capturing the effect of non-constant schedules such as the geometric schedule used in empirical works (e.g., [8,23]).
A concurrent work [24] derives an $S$-independent upper bound on the convergence error, also for the time-inhomogeneous process. Unlike ours, that work assumes exact simulation of the continuous-time sampling process, so the step complexity under discretization remains unclear.
A parallel line of work studies masked, or absorbing-rate, discrete diffusion models. Convergence guarantees have been established in [15,18,20,25,26].

Our Contributions

Our main contributions are summarized as follows:
  • We establish convergence guarantees for discrete diffusion models with time-inhomogeneous uniform-rate generators. This extends prior analyses, which primarily focus on homogeneous noise schedules.
  • We identify regularity conditions on the noise schedule under which explicit convergence rates can be obtained. Under these conditions, the resulting rates match state-of-the-art guarantees for homogeneous discrete diffusion samplers.

3. Preliminaries of Discrete Diffusion Samplers

In this section, we provide the background of the continuous-time discrete-space diffusion sampler.

3.1. Continuous-Time Forward Dynamics on Discrete State Spaces

We consider discrete data represented as a length-$d$ sequence
$$x_0 = (x_0^1, \dots, x_0^d) \in [S]^d,$$
where each coordinate $x_0^i$ takes values in a finite alphabet $[S]$ of size $S$. Let $q_0^i$ denote the marginal probability mass function of the $i$-th token, and let $q_0$ denote the joint distribution of the complete data vector $x_0$.
The forward noising process is modeled as a time-inhomogeneous continuous-time Markov chain (CTMC) on the finite state space $[S]^d$. This process is specified by the initial law $q_0$ and a time-dependent transition-rate matrix $R_t \in \mathbb{R}^{S^d \times S^d}$. Under the convention that $R_t(x,y)$ denotes the instantaneous rate of jumping from state $x$ to state $y$, the infinitesimal transition probability satisfies, for sufficiently small $\Delta t$,
$$q_{t+\Delta t \mid t}(y \mid x) = \mathbb{1}\{y = x\} + R_t(x,y)\,\Delta t + o(\Delta t), \qquad \forall x, y \in [S]^d.$$
Equivalently, the marginal law $q_t$ evolves according to the Kolmogorov forward equation
$$\frac{d}{dt} q_t(y) = \sum_{x \in [S]^d} q_t(x)\, R_t(x,y), \qquad \forall y \in [S]^d.$$
For $R_t$ to define a valid CTMC generator, it must satisfy the standard conditions
$$R_t(x,y) \ge 0 \ \text{ for } x \ne y, \qquad R_t(x,x) \le 0, \qquad \sum_{y \in [S]^d} R_t(x,y) = 0.$$
Directly specifying a generator over the full space $[S]^d$ is generally infeasible when either $S$ or $d$ is large. A common simplification is therefore to impose coordinate-wise independent corruption so that only one token changes at any infinitesimal transition [8,11]. In this case, for $x \ne y$, the generator takes the form [11] (Prop. 3)
$$R_t(x,y) = \begin{cases} R_t^{\mathrm{tok}}(x^i, y^i), & \text{if } \mathrm{Ham}(x,y) = 1 \text{ and } x^i \ne y^i, \\ 0, & \text{if } \mathrm{Ham}(x,y) > 1, \end{cases} \tag{2}$$
where $R_t^{\mathrm{tok}} \in \mathbb{R}^{S \times S}$ is the token-level transition-rate matrix and $\mathrm{Ham}(x,y)$ denotes the Hamming distance between $x$ and $y$. The diagonal entries of $R_t$ are then determined by the zero-row-sum constraint.
Following [11], we parameterize the token-level generator as
$$R_t^{\mathrm{tok}} = \beta_t R_{\mathrm{base}},$$
where $\beta_t > 0$ is a noise schedule and $R_{\mathrm{base}}$ is a fixed base generator. Prior work often studies the homogeneous case $\beta_t \equiv 1$; see, for example, refs. [13,16]. We define the accumulated noise level by
$$\bar\beta_t := \int_0^t \beta_s \, ds < \infty.$$
Since $\beta_t > 0$, the map $t \mapsto \bar\beta_t$ is increasing. We denote the inverse of $\bar\beta_t$ as $\bar\beta^{-1}$, so that
$$\bar\beta^{-1}(\bar\beta_t) = t.$$
Throughout, we assume that $\beta_t$ is smooth on $[0, T]$.
The coordinate-wise construction yields a tractable closed-form transition kernel. In particular, the token-level transition matrix from time $0$ to time $t$ is
$$Q_{t \mid 0}^{\mathrm{tok}} = \exp\big(\bar\beta_t R_{\mathrm{base}}\big),$$
and the full transition kernel factorizes across coordinates as
$$q_{t \mid 0}(x_t \mid x_0) = \prod_{i=1}^{d} Q_{t \mid 0}^{\mathrm{tok}}(x_0^i, x_t^i).$$
This closed form is especially useful for training discrete diffusion models.
In this work, we focus on the commonly used uniform-rate discrete diffusion model, with
$$R_{\mathrm{base}} := \frac{1}{S} \mathbf{1}_S \mathbf{1}_S^{\top} - I_S,$$
which appears in several prior works [3,8,11]. For this choice, each token is driven toward the uniform distribution over $[S]$. Moreover, the diagonal entry of the full-state generator is
$$R_t(x,x) = -\sum_{y : y \ne x} R_t(x,y) = -\beta_t \frac{S-1}{S}\, d, \qquad \forall x \in [S]^d.$$
With such an $R_t$, as $T$ becomes large, the terminal distribution induced by this forward process approaches the uniform distribution on $[S]^d$.
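For the uniform-rate base generator, the matrix exponential in the token-level kernel can be written out explicitly. The sketch below is an illustration rather than code from the paper: it assumes the identity exp(b(P - I)) = e^{-b} I + (1 - e^{-b}) P for the averaging projection P = (1/S) 1 1^T, and cross-checks it against a truncated Taylor series. The function names and the values of `S` and `beta_bar` are hypothetical.

```python
import numpy as np

def uniform_base_generator(S):
    """R_base = (1/S) * 1 1^T - I for an alphabet of size S."""
    return np.ones((S, S)) / S - np.eye(S)

def token_kernel_closed_form(S, beta_bar):
    """exp(beta_bar * R_base), using that (1/S) 1 1^T is a projection (assumed derivation)."""
    decay = np.exp(-beta_bar)
    return decay * np.eye(S) + (1.0 - decay) * np.ones((S, S)) / S

def token_kernel_series(S, beta_bar, terms=60):
    """Truncated Taylor series of the matrix exponential, used only as a cross-check."""
    A = beta_bar * uniform_base_generator(S)
    out = np.eye(S)
    term = np.eye(S)
    for n in range(1, terms):
        term = term @ A / n   # A^n / n!
        out = out + term
    return out

S, beta_bar = 5, 1.3  # illustrative values
Q = token_kernel_closed_form(S, beta_bar)
Q_series = token_kernel_series(S, beta_bar)
```

Under this closed form, each token keeps its value with probability e^{-beta_bar} + (1 - e^{-beta_bar})/S and moves to any other symbol with probability (1 - e^{-beta_bar})/S, which makes the drift toward the uniform distribution explicit.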

3.2. Reverse Dynamics and Discrete-Time Sampling

The CTMC forward process induces a time-reversed CTMC whose marginals coincide with those of the forward chain run backward in time [27]. More precisely, by [11] (Prop. 1), the reverse process starts from
$$\overleftarrow{q}_0 := q_T$$
and evolves with transition-rate matrix $\overleftarrow{R}_t$ given, for $x \ne y$, by
$$\overleftarrow{R}_t(x,y) = R_{T-t}(y,x)\, \frac{q_{T-t}(y)}{q_{T-t}(x)}, \qquad \forall x, y \in [S]^d, \tag{3}$$
with diagonal entries determined by
$$\overleftarrow{R}_t(x,x) = -\sum_{y : y \ne x} \overleftarrow{R}_t(x,y).$$
With this setup, the marginal distribution of this reverse chain satisfies
$$\overleftarrow{q}_t = q_{T-t}.$$
Under the coordinate-wise forward generator in (2), the reverse generator inherits the same sparsity pattern: if $\mathrm{Ham}(x,y) \ge 2$, then
$$\overleftarrow{R}_t(x,y) = 0.$$
Thus, both the forward and reverse processes only allow infinitesimal transitions that modify a single token. In practice, the reverse process is stopped at time $T - \delta$ for a small $\delta > 0$. This early-stopping convention avoids the numerical instability that may arise as the forward time approaches $0^+$, where the corresponding score ratios can become irregular (as shown in [17] (Lemma 2)).
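The reverse-rate formula can be checked numerically in the simplest case of a single token ($d = 1$). The following sketch is illustrative only: the data law `q0` and the values of `beta_bar` and `beta_s` are made up, and the function names are hypothetical. It builds the exact reverse generator from the likelihood ratios and confirms that the reverse chain moves the marginal in exactly the opposite direction to the forward Kolmogorov equation.

```python
import numpy as np

def forward_marginal(q0, beta_bar):
    """Single-token marginal q_s = q0 @ exp(beta_bar * R_base) for the uniform-rate base."""
    S = len(q0)
    decay = np.exp(-beta_bar)
    return decay * q0 + (1.0 - decay) / S

def reverse_generator(q_s, beta_s):
    """Exact reverse rates R_rev(x, y) = R_s(y, x) * q_s(y) / q_s(x) for one token."""
    S = len(q_s)
    forward_offdiag = beta_s / S                      # R_s(y, x) for y != x
    R_rev = forward_offdiag * q_s[None, :] / q_s[:, None]
    np.fill_diagonal(R_rev, 0.0)
    np.fill_diagonal(R_rev, -R_rev.sum(axis=1))       # enforce zero row sums
    return R_rev

q0 = np.array([0.7, 0.2, 0.1])   # hypothetical data distribution
beta_bar, beta_s = 0.4, 1.5      # accumulated noise and local rate at forward time s
q_s = forward_marginal(q0, beta_bar)
R_rev = reverse_generator(q_s, beta_s)

# Forward Kolmogorov equation for one token: d/ds q_s(y) = beta_s * (1/S - q_s(y)).
dq_forward = beta_s * (1.0 / len(q0) - q_s)
# Instantaneous change of the marginal under the reverse generator.
dq_reverse = q_s @ R_rev
```

The ratio `q_s[None, :] / q_s[:, None]` is the quantity that a learned score has to approximate in practice; here it is available exactly only because the toy model is fully specified.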
The exact reverse CTMC is not directly implementable, and the sampling procedure therefore introduces several approximations. Let $p_t$ denote the marginal distribution of the practical sampler at reverse time $t \in [0, T - \delta]$. First, the exact initial law $q_T$ is typically unavailable. Since the forward process with the uniformizing base generator converges to the uniform distribution, the sampler is initialized as
$$p_0 := \mathrm{Uniform}([S]^d).$$
Second, the likelihood ratio appearing in (3) is unknown. For reverse time $t$, the required ratio $q_{T-t}(y) / q_{T-t}(x)$ is approximated by a learned concrete score function
$$s_{T-t}(y,x) \approx \frac{q_{T-t}(y)}{q_{T-t}(x)}.$$
Consequently, on a discrete sampling grid, the reverse generator is replaced by an estimated generator.
Beyond estimation error, we also account for the error introduced by approximate sampling procedures. Let
$$0 = t_0 < t_1 < \cdots < t_N = T - \delta$$
be the reverse-time discretization grid. At grid point $t_k$, the off-diagonal entries of the estimated reverse generator are
$$\hat{R}_{t_k}(x,y) := R_{T-t_k}(y,x)\, s_{T-t_k}(y,x), \qquad x \ne y,$$
with
$$\hat{R}_{t_k}(x,x) = -\sum_{y : y \ne x} \hat{R}_{t_k}(x,y).$$
This approximation replaces the exact reverse rate in (3) by a score-based estimate.
Following [8,23], one example is to use the Euler method, which freezes the estimated generator over each interval $[t_k, t_{k+1})$. Since the generator only permits single-token transitions, the update can be written in a token-wise form. Given the current state $x_{t_k}$, define, for each coordinate $i \in [d]$ and candidate token $a \in [S]$ with $a \ne x_{t_k}^i$,
$$\hat{R}_k^i(x_{t_k}^i, a) := \hat{R}_{t_k}\big(x_{t_k},\, x_{t_k}^{-i} \oplus_i a\big),$$
where $x_{t_k}^{-i} \oplus_i a$ denotes the vector obtained from $x_{t_k}$ by replacing its $i$-th token with $a$. The corresponding diagonal token rate is
$$\hat{R}_k^i(x_{t_k}^i, x_{t_k}^i) := -\sum_{a : a \ne x_{t_k}^i} \hat{R}_k^i(x_{t_k}^i, a).$$
The Euler transition for the $i$-th token is then
$$\mathbb{P}\big(x_{t_{k+1}}^i = a \mid x_{t_k}\big) = \begin{cases} (t_{k+1} - t_k)\, \hat{R}_k^i(x_{t_k}^i, a), & a \ne x_{t_k}^i, \\ 1 + (t_{k+1} - t_k)\, \hat{R}_k^i(x_{t_k}^i, x_{t_k}^i), & a = x_{t_k}^i. \end{cases} \tag{6}$$
Equivalently, each token either remains unchanged or jumps to another symbol according to the local estimated reverse rates. The step size is typically small, so that the probabilities in (6) remain nonnegative. The final output of the sampler is the state at reverse time $T - \delta$, whose law is intended to approximate the early-stopped data distribution $q_\delta$.
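The token-wise Euler update can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the samplers of [8,23]: `est_rates` stands in for a learned score network combined with the forward rate, and `toy_rates` is a made-up constant-rate placeholder used only to exercise the code.

```python
import numpy as np

def euler_token_probs(rate_row, current, dt):
    """Euler transition probabilities for one token.

    rate_row[a] is the estimated reverse rate from `current` to symbol a;
    the entry at index `current` is ignored and replaced by the stay probability.
    """
    probs = dt * np.asarray(rate_row, dtype=float)
    probs[current] = 0.0
    stay = 1.0 - probs.sum()
    if stay < 0.0:
        raise ValueError("step size too large for a valid Euler step")
    probs[current] = stay
    return probs

def euler_step(x, est_rates, dt, rng):
    """One Euler step: rates are frozen at the current state, tokens move independently."""
    rates = est_rates(x)                 # shape (d, S)
    x_next = x.copy()
    for i, token in enumerate(x):
        p = euler_token_probs(rates[i], token, dt)
        x_next[i] = rng.choice(len(p), p=p)
    return x_next

def toy_rates(x, S=4):
    """Hypothetical stand-in for the estimated reverse rates (constant 0.5 everywhere)."""
    return np.full((len(x), S), 0.5)

rng = np.random.default_rng(0)
x0 = np.array([0, 3, 1])
x1 = euler_step(x0, toy_rates, dt=0.1, rng=rng)
probs = euler_token_probs(toy_rates(x0)[0], x0[0], 0.1)
```

The explicit `ValueError` reflects the validity requirement on the step size: the stay probability is nonnegative only when the step times the total outgoing rate does not exceed one.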

3.3. Notations

Let $x^i$ ($1 \le i \le d$) denote the $i$-th element of a vector $x \in [S]^d$, and let $x^{-i} \in [S]^{d-1}$ denote the vector with the $i$-th element removed. Define $\mathrm{Ham}(x,y)$ as the Hamming distance between two vectors $x$ and $y$. For a positive integer $n$, $[n] := \{1, \dots, n\}$. Write $\mathbf{1}_S$ for the vector of length $S$ whose entries are all $1$, and $I_S$ for the identity matrix of size $S \times S$.

4. Convergence Under General Non-Homogeneous Noise Schedule

In this section, we establish improved upper error bounds and convergence rates for the uniform-rate matrix. Following the approach of [18], our results are stated directly in terms of total variation distance rather than first deriving a KL-divergence bound and then applying Pinsker’s inequality. Our analysis relies on the following estimation-error condition.
Assumption 1.
The score estimation error satisfies $\mathcal{L}_{\mathrm{TV}}(s) \le \varepsilon_{\mathrm{TV}}$, where
$$\mathcal{L}_{\mathrm{TV}}(s) := \sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathbb{E}_{x_k \sim q_{T-t_k}} \sum_{y : y \ne x_k} R_{T-t_k}(y, x_k) \left| s_{T-t_k}(y, x_k) - \frac{q_{T-t_k}(y)}{q_{T-t_k}(x_k)} \right|.$$
Assumption 1 requires the learned score to be accurate in $L^1$ distance, measured as the expected sum of absolute differences between the exact and estimated rate matrices. This condition differs from the commonly used score-entropy loss [16,17] but is particularly convenient for our direct total-variation-based analysis. For masked diffusion models, ref. [18] shows that $\mathcal{L}_{\mathrm{TV}}$ can be upper-bounded by the score-entropy loss. We provide a similar justification in Appendix A.
With this condition in place, the following theorem provides an upper bound on the sampling error in total variation distance.
Theorem 1.
Suppose that Assumption 1 holds. Provided that the step sizes $t_{k+1} - t_k$ vanish as $\varepsilon \to 0$ and that each Euler step is valid, we have
$$\mathrm{TV}(q_0, p_{T-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\bar\beta_T} + \varepsilon_{\mathrm{TV}} + dS \sum_{k=0}^{N-1} \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \big( |\beta'_{T-t_k}| + d\, \beta_{T-t_k}^2 \big) (t_{k+1} - t_k)^2 + d\, \bar\beta_\delta.$$
Theorem 1 decomposes the total sampling error into four distinct contributions. The first term, $\min\{d, \sqrt{dS}\}\, e^{-\bar\beta_T}$, corresponds to the initialization error, which quantifies the discrepancy between the true terminal distribution $q_T$ and the uniform distribution used to initialize the sampler. This error decays exponentially fast with the accumulated noise level $\bar\beta_T$. The second term, $\varepsilon_{\mathrm{TV}}$, captures the error due to imperfect score estimation under Assumption 1. The third term represents the discretization error incurred by replacing the continuous-time reverse CTMC with a discrete-time approximation on the grid $\{t_k\}_{k=0}^{N}$. Finally, the last term, $d\, \bar\beta_\delta$, is the early-stopping error, which arises because the reverse process is terminated at time $T - \delta$ rather than being run all the way to the data distribution.
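To see how the four terms trade off for a concrete schedule and grid, they can be evaluated numerically. The sketch below is an assumption-laden illustration: the bound holds only up to unspecified constants, which are set to 1 here, the initialization prefactor is read as min{d, sqrt(dS)}, and the function and argument names are hypothetical. The schedule is passed in as three callables for $\beta_t$, its derivative, and the accumulated noise.

```python
import numpy as np

def tv_bound_terms(d, S, T, delta, grid, beta, beta_prime, beta_bar, eps_tv):
    """Four terms of the Theorem 1 bound, with all hidden constants set to 1 (assumption)."""
    grid = np.asarray(grid, dtype=float)
    t_k, t_next = grid[:-1], grid[1:]
    init = min(d, np.sqrt(d * S)) * np.exp(-beta_bar(T))
    # Amplification of the score ratio near the data: max{1, beta_bar^{-2}} at forward time T - t_{k+1}.
    amplification = np.maximum(1.0, beta_bar(T - t_next) ** -2)
    # Schedule smoothness plus squared local rate at forward time T - t_k.
    local = np.abs(beta_prime(T - t_k)) + d * beta(T - t_k) ** 2
    disc = d * S * np.sum(amplification * local * (t_next - t_k) ** 2)
    early = d * beta_bar(delta)
    return {"init": init, "score": eps_tv, "disc": disc, "early": early}

# Homogeneous sanity check: beta_t = 1, beta'_t = 0, beta_bar_t = t, uniform grid.
d, S, T, delta, N = 8, 4, 3.0, 0.05, 200
grid = np.linspace(0.0, T - delta, N + 1)
terms = tv_bound_terms(
    d, S, T, delta, grid,
    beta=lambda t: np.ones_like(t),
    beta_prime=lambda t: np.zeros_like(t),
    beta_bar=lambda t: t,
    eps_tv=0.0,
)
```

In the homogeneous case the discretization term reduces to $d^2 S \sum_k \max\{1, (T - t_{k+1})^{-2}\}(t_{k+1} - t_k)^2$ and the early-stopping term to $d\delta$, which is the form obtained by setting $\beta_t \equiv 1$.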
Remark 1
(On the time steps $t_{k+1} - t_k$). In order to guarantee the validity of the Euler steps, the stay probability in (6) must be nonnegative, which requires
$$(t_{k+1} - t_k)\, \big| \hat{R}_k^i(x^i, x^i) \big| = (t_{k+1} - t_k) \sum_{a : a \ne x^i} \hat{R}_k^i(x^i, a) < 1.$$
Assuming that the score functions are accurately estimated, from Lemma A2 we have $|\hat{R}_k^i(x^i, x^i)| \lesssim S$, necessarily yielding $t_{k+1} - t_k = o(1/S)$. Note that this condition is satisfied for the step size in Theorem 2.
Remark 2
(On the early-stopping parameter $\delta$). For time-homogeneous models, to achieve $O(\varepsilon)$ error, $\delta$ can simply be set to $\varepsilon / d$ (see Corollary 1). For non-homogeneous schedules, however, the dependence on $d$ becomes implicit: one should set $\delta \asymp \bar\beta^{-1}(\varepsilon / d)$, so that the early-stopping term $d\, \bar\beta_\delta$ is of order $\varepsilon$. The resulting value depends on the actual schedule.
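When the inverse of the accumulated noise is not available in closed form, the early-stopping time can be found numerically, since the accumulated noise is increasing in $t$. The following bisection routine is a minimal sketch; the function name, the tolerance, and the example schedule $\bar\beta_t = t^2$ are illustrative assumptions.

```python
def invert_accumulated_noise(beta_bar, target, upper, tol=1e-12):
    """Find t in [0, upper] with beta_bar(t) = target, assuming beta_bar is increasing."""
    lo, hi = 0.0, upper
    if not (beta_bar(lo) <= target <= beta_bar(hi)):
        raise ValueError("target is not bracketed by beta_bar(0) and beta_bar(upper)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_bar(mid) < target:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)

# Example: polynomial accumulated noise t**p with p = 2, target eps / d.
d, eps, p = 100, 0.01, 2.0
delta = invert_accumulated_noise(lambda t: t ** p, eps / d, upper=1.0)
```

For this example the exact answer is $(\varepsilon/d)^{1/p} = 0.01$, which is much larger than the homogeneous choice $\varepsilon/d = 10^{-4}$ and illustrates how the schedule changes the early-stopping time.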
For comparison, we provide the following corollary for the special case of a time-homogeneous CTMC, which is immediate upon plugging in $\beta_t \equiv 1$.
Corollary 1.
Under the setting of Theorem 1, for a time-homogeneous CTMC, i.e., when $\beta_t \equiv 1$, we have
$$\mathrm{TV}(q_0, p_{T-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-T} + \varepsilon_{\mathrm{TV}} + d^2 S \sum_{k=0}^{N-1} \max\{1, (T - t_{k+1})^{-2}\} (t_{k+1} - t_k)^2 + d\, \delta.$$
Compared with the time-homogeneous CTMC setting, our error bound in Theorem 1 exhibits two key differences. First, the initialization and early-stopping errors are no longer determined solely by the diffusion times $T$ and $\delta$; they also depend on the prescribed noise schedule through the accumulated noise levels $\bar\beta_T$ and $\bar\beta_\delta$. Second, the discretization error is not controlled only by the selected time steps $(t_{k+1} - t_k)$. Instead, it additionally depends on the smoothness of the noise schedule through $|\beta'_{T-t_k}|$ and on the magnitude of the transition rates through $\beta_{T-t_k}^2$, which effectively measures the size of each discretization interval by the amount of noise injected over that interval. We provide a proof sketch in Section 5.
To further determine the number of steps needed to reach $O(\varepsilon)$ error, we consider schedules $\beta_t$ that satisfy the following regularity condition.
Assumption 2
(Slow-varying noise). Suppose that the noise schedule $\beta_t$ satisfies the following regularity conditions:
1. The schedule is uniformly bounded away from zero on the relevant low-noise interval, namely,
$$\beta_* := \min_{z \in [\delta,\, \bar\beta^{-1}(1)]} \beta_z \gtrsim 1.$$
2. The schedule is asymptotically constant (ignoring logarithms):
$$\beta_t,\ |\beta'_t| = \tilde{O}(1), \qquad \forall t \in [\delta, T].$$
3. The inverse accumulated-noise map grows at most polynomially:
$$\bar\beta^{-1}(x) = \mathrm{poly}(x).$$
At a high level, Assumption 2 requires the noise schedule to be slowly varying. It is used only to determine a suitable number of sampling steps. In the time-homogeneous CTMC setting, where $\beta_t \equiv 1$, we have $\beta_* = 1$ and $\beta'_t \equiv 0$, so the homogeneous case automatically satisfies Assumption 2. Another example is the polynomial schedule $\bar\beta_t = t^p$ with $p > 0$. With this choice, $\beta_t = p\, t^{p-1}$, and $\beta_* = p\, \delta^{p-1}$ if $p \ge 1$ and $\beta_* = p$ otherwise. Moreover, $\bar\beta^{-1}(x) = x^{1/p} = \mathrm{poly}(x)$. Further, for $t \in [\delta, T]$, $\beta_t = p\, t^{p-1} \le p\, T^{p-1}$ and $|\beta'_t| \le |p (p-1) T^{p-2}|$, where we note that $T = \mathrm{poly}(\log(d/\varepsilon))$ is constant up to logarithmic factors. In general, both the initial value $\beta_0$ and the growth of $\beta_t$ with respect to $t$ need to be mild to satisfy Assumption 2.
In contrast to the polynomial schedule, the exponential (also known as geometric) noise schedule does not satisfy this assumption. It is the main focus of Section 6.
The first condition in Assumption 2 imposes a constant lower bound on $\beta_t$ over the low-noise regime. Intuitively, this prevents the noise schedule from becoming too small near the data distribution, ensuring that the score function does not blow up over the entire reverse process. The second condition controls the cumulative magnitude and variation of the schedule along the discretization grid. This requirement ensures that the transition rates do not depend on the system parameters, which is a common practice, so that the discrete-time approximation remains stable and reliable. Finally, the third condition requires the inverse accumulated-noise map $\bar\beta^{-1}$ to grow at most polynomially. Equivalently, this guarantees that one can reach a sufficiently large terminal accumulated noise level $\bar\beta_T$ within a polynomial diffusion time, thereby ensuring that enough noise is injected by the terminal time.
With Assumption 2, the following theorem characterizes the number of sampling steps required to achieve O ( ε ) error in the time-inhomogeneous setting.
Theorem 2.
Under the setting of Theorem 1, suppose further that Assumption 2 holds. Let $t_{k+1} - t_k = \kappa \min\{1, \bar\beta_{T-t_k}\}$. Then, in order to achieve $O(\varepsilon)$ TV-error, one can choose $T = \bar\beta^{-1}(\log(d/\varepsilon))$, $\kappa \asymp \varepsilon / (d^2 S)$, and thus $N = \tilde{O}(d^2 S / \varepsilon)$.
The proof of Theorem 2 is in Appendix C. This theorem shows that under the stated regularity conditions on the noise schedule, the time-inhomogeneous sampler requires the same order of sampling steps as in the time-homogeneous setting; see, e.g., refs. [17,18]. The main difference is that in the inhomogeneous case, both the discretization step sizes and the terminal diffusion time must be chosen according to the prescribed noise schedule.
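The step-size rule of Theorem 2 can be turned into an explicit grid. The sketch below is a hedged illustration: the function name is hypothetical, the last step is clipped so that the grid ends exactly at $T - \delta$ (a convention assumed here, not taken from the paper), and the example uses the homogeneous accumulated noise $\bar\beta_t = t$.

```python
import math

def adaptive_grid(T, delta, kappa, beta_bar):
    """Reverse-time grid with t_{k+1} - t_k = kappa * min(1, beta_bar(T - t_k)).

    The final step is clipped so that the grid ends exactly at T - delta.
    """
    end = T - delta
    grid = [0.0]
    while grid[-1] < end:
        t = grid[-1]
        step = kappa * min(1.0, beta_bar(T - t))   # constant far from the data, shrinking near it
        grid.append(min(t + step, end))
    return grid

T, delta, kappa = 5.0, 1e-3, 0.05
grid = adaptive_grid(T, delta, kappa, beta_bar=lambda t: t)
num_steps = len(grid) - 1
# Homogeneous case: about T / kappa constant steps plus a geometric phase of length ~ log(1/delta) / kappa.
step_budget = T / kappa + math.log(1.0 / delta) / kappa + 2
```

In the homogeneous example the remaining time $T - t_k$ shrinks by a factor $(1 - \kappa)$ per step once it drops below 1, so the number of steps grows only logarithmically in $1/\delta$.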

5. Proof Sketch of Theorem 1

We now provide a proof sketch of Theorem 1. The full proof is in Appendix B.
Proof sketch of Theorem 1.
We analyze the Euler sampler through the truncated τ-leaping sampler, which is asymptotically equivalent to Euler by [17] (Lemma 8). The key property is that the estimated rate is piecewise constant on each interval $[t_k, t_{k+1})$:
$$\hat{R}_t(x_{t_k}, \cdot) = \hat{R}_{t_k}(x_{t_k}, \cdot), \qquad \forall t \in [t_k, t_{k+1}).$$
Let $\overleftarrow{q}_t = q_{T-t}$ denote the exact reverse marginal. By the TV perturbation bound of [18] (Theorem 1),
$$\mathrm{TV}(\overleftarrow{q}_{T-\delta}, p_{T-\delta}) \le \mathrm{TV}(\overleftarrow{q}_0, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \sum_{y : y \ne x_t} \big| \hat{R}_t(x_t, y) - \overleftarrow{R}_t(x_t, y) \big| \, dt.$$
Writing
$$h_t(x) := \sum_{y : y \ne x} \big| \hat{R}_t(x, y) - \overleftarrow{R}_t(x, y) \big|,$$
and adding and subtracting $h_{t_k}(x_{t_k})$, the error decomposes into initialization, score-estimation, and discretization terms. By Assumption 1, the score-estimation term is bounded by
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} h_{t_k}(x_{t_k}) \le \varepsilon_{\mathrm{TV}}.$$
The initialization error is controlled by the convergence of the forward chain to the uniform distribution. Since $\overleftarrow{q}_0 = q_T$ and $p_0 = \mathrm{Uniform}([S]^d)$, we have
$$\mathrm{TV}(\overleftarrow{q}_0, p_0) = \mathrm{TV}(q_T, p_0) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\bar\beta_T}.$$
It remains to bound the discretization error. The part caused by the change from $x_{t_k}$ to $x_t$ is lower-order for sufficiently small step size, since the total outgoing reverse rate is of order $d\, \beta_{T-t}$. The dominant contribution comes from the time variation of the reverse rate. Using the piecewise-constant property of $\hat{R}_t$ and the triangle inequality,
$$\mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_{t_k}) - h_{t_k}(x_{t_k}) \big] \le \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y : y \ne x_{t_k}} \big| \overleftarrow{R}_t(x_{t_k}, y) - \overleftarrow{R}_{t_k}(x_{t_k}, y) \big|.$$
By the reverse-rate formula in (3) and its coordinate-wise structure, only states satisfying $\mathrm{Ham}(x,y) = 1$ contribute. The difference is then split into the change in the likelihood ratio and the change in the rate matrix. The likelihood-ratio term is bounded using the non-homogeneous score-ratio estimate
$$\frac{q_s(y)}{q_s(x)} \lesssim S \max\{1, \bar\beta_s^{-1}\}, \qquad \forall (x,y) : \mathrm{Ham}(x,y) = 1,$$
together with the corresponding likelihood-ratio difference bound. The rate-matrix term follows from smoothness of $\beta_t$:
$$\big| R_{T-t_k}(y,x) - R_{T-t}(y,x) \big| \lesssim \frac{1}{S}\, |\beta'_{T-t_k}|\, (t - t_k).$$
Combining these bounds yields
$$D_{\mathrm{disc}} \lesssim dS \sum_{k=0}^{N-1} \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \big( |\beta'_{T-t_k}| + d\, \beta_{T-t_k}^2 \big) (t_{k+1} - t_k)^2,$$
where $D_{\mathrm{disc}}$ represents the discretization error.
Finally, because the reverse process is stopped at $T - \delta$, its target marginal is $q_\delta$ rather than $q_0$. The early-stopping error satisfies
$$\mathrm{TV}(q_0, q_\delta) \lesssim d\, \bar\beta_\delta.$$
Using the triangle inequality,
$$\mathrm{TV}(q_0, p_{T-\delta}) \le \mathrm{TV}(q_0, q_\delta) + \mathrm{TV}(\overleftarrow{q}_{T-\delta}, p_{T-\delta}),$$
and combining the four bounds gives the desired result. □
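The initialization and early-stopping terms can be sanity-checked by brute force on a small state space. The sketch below is an illustration under a strong simplifying assumption: the data distribution is a point mass at a single sequence `x0`, so that the forward law is a product of explicit token marginals. The names and parameter values are hypothetical, and only the $d\, e^{-\bar\beta_T}$ and $d\, \bar\beta_\delta$ forms are checked.

```python
import numpy as np
from functools import reduce

def token_marginal(S, start, beta_bar):
    """Law at accumulated noise beta_bar of one token that starts at symbol `start`."""
    decay = np.exp(-beta_bar)
    p = np.full(S, (1.0 - decay) / S)
    p[start] += decay
    return p

def joint_law(S, x0, beta_bar):
    """Flattened product law over [S]^d for a point-mass initial sequence x0."""
    return reduce(np.kron, [token_marginal(S, a, beta_bar) for a in x0])

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()

S, x0 = 3, [0, 2, 1, 1]          # toy alphabet and data point
d = len(x0)
beta_bar_T, beta_bar_delta = 4.0, 0.01

uniform = np.full(S ** d, 1.0 / S ** d)
data = joint_law(S, x0, 0.0)                              # point mass at x0
init_err = tv(joint_law(S, x0, beta_bar_T), uniform)      # TV(q_T, uniform)
early_err = tv(data, joint_law(S, x0, beta_bar_delta))    # TV(q_0, q_delta)
```

For this toy case the per-token TV distance from uniform is $(1 - 1/S)\, e^{-\bar\beta_T}$, and summing over coordinates gives the $d\, e^{-\bar\beta_T}$ scaling; the early-stopping error is the probability that at least one token has moved by accumulated noise $\bar\beta_\delta$.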

6. Geometric Noise Schedule

While the geometric schedule [8] does not directly satisfy Assumption 2, the preceding time-inhomogeneous framework still enables us to analyze it. In contrast to the setting above, where $T$ grows as $\varepsilon$ vanishes, practical implementations typically fix $T = 1$ regardless of $\varepsilon$. Under this convention, the geometric noise schedule is defined as follows:
$$\bar\beta_t := \beta_{\min}^{1-t}\, \beta_{\max}^{t}, \qquad \beta_t = \beta_{\min}^{1-t}\, \beta_{\max}^{t} \log(\beta_{\max} / \beta_{\min}) = \bar\beta_t \log(\beta_{\max} / \beta_{\min}), \qquad \forall t \in [0, 1], \tag{7}$$
where $0 < \beta_{\min} < 1 < \beta_{\max}$ are prescribed constants. The default choice in [8] is $\beta_{\min} = 10^{-4}$ and $\beta_{\max} = 20$. Here, both $\bar\beta_t$ and $\beta_t$ are increasing functions on $[0, 1]$. Moreover, differentiating $\beta_t$ gives
$$\beta'_t = \beta_{\min}^{1-t}\, \beta_{\max}^{t} \log^2(\beta_{\max} / \beta_{\min}) = \beta_t \log(\beta_{\max} / \beta_{\min}).$$
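The three quantities of the geometric schedule are simple to code, and the derivative relations can be confirmed by finite differences. The sketch below is illustrative; the function name is hypothetical, and the default parameters are the values the text attributes to [8].

```python
import math

def geometric_schedule(beta_min=1e-4, beta_max=20.0):
    """Return (beta_bar, beta, beta_prime) for the geometric schedule on [0, 1]."""
    log_ratio = math.log(beta_max / beta_min)

    def beta_bar(t):
        return beta_min ** (1.0 - t) * beta_max ** t

    def beta(t):
        return beta_bar(t) * log_ratio        # derivative of beta_bar

    def beta_prime(t):
        return beta(t) * log_ratio            # derivative of beta

    return beta_bar, beta, beta_prime

beta_bar, beta, beta_prime = geometric_schedule()
t, h = 0.37, 1e-6
fd_beta = (beta_bar(t + h) - beta_bar(t - h)) / (2 * h)     # central difference of beta_bar
fd_beta_prime = (beta(t + h) - beta(t - h)) / (2 * h)       # central difference of beta
```

With the default parameters the logarithmic factor is about 12.2, so both the local rate and its derivative are roughly one and two orders of magnitude larger than the accumulated noise, respectively.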
We first follow ref. [8] and use a constant step size $\kappa$, so that the total number of discretization steps is $N = \kappa^{-1}$. The following theorem specializes our general time-inhomogeneous result to the geometric noise schedule.
Theorem 3.
Suppose that Assumption 1 holds. With the geometric noise schedule defined in (7), choosing $t_{k+1} - t_k = \kappa$ where $t_0 = 0$ and $t_N = 1 - \delta$, we have
$$\lim_{\delta \to 0^+} \mathrm{TV}(q_0, p_{1-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\beta_{\max}} + \varepsilon_{\mathrm{TV}} + \kappa\, d^2 S \log(\beta_{\max} / \beta_{\min}) \Big[ (\beta_{\min}^{-1} - 1)^2 + (\beta_{\max} - 1)^2 \Big] + d\, \beta_{\min}.$$
Thus, in order to achieve $O(\varepsilon)$ TV error, we can choose
$$\beta_{\min} \asymp \frac{\varepsilon}{d}, \qquad \beta_{\max} \asymp \log \frac{\min\{d, \sqrt{dS}\}}{\varepsilon}, \qquad \text{and} \qquad \kappa \asymp \frac{\varepsilon^{3/2}}{d^4 S \log(d/\varepsilon)},$$
which gives $N = \tilde{O}(d^4 S / \varepsilon^{3/2})$.
Theorem 3 highlights an interesting regime in which a finite error upper bound can be obtained even as $\delta \to 0^+$. This contrasts with prior bounds whose dependence on $\log \delta^{-1}$ diverges in the vanishing early-stopping limit [15,16,17,18]. While the initialization, score-estimation, and early-stopping terms retain the same interpretation as in the general time-inhomogeneous result, the discretization error admits an explicit expression under the geometric noise schedule. In particular, up to logarithmic factors, this term scales quadratically with the prescribed schedule parameters $\beta_{\min}^{-1}$ and $\beta_{\max}$.
Our result illustrates the trade-off governed by the schedule parameters $\beta_{\min}$ and $\beta_{\max}$ when the step sizes are constant, as is typical in many empirical studies [8,23]. A smaller $\beta_{\min}$ reduces the early-stopping error but increases the accumulated discretization error. Similarly, a larger $\beta_{\max}$ decreases the initialization error while again amplifying the discretization error.
Note that the constant step size used here is not optimized and may therefore lead to a suboptimal overall step complexity. As in Theorem 2, we can instead employ a constant-then-decreasing step size toward the end of the reverse process to obtain an improved convergence result, given as follows.
Theorem 4.
With the same setting as in Theorem 3, but choosing $t_{k+1} - t_k = \kappa \min\{1, \beta_{\min}^{t_k}\, \beta_{\max}^{1 - t_k}\}$, we have
$$\lim_{\delta \to 0^+} \mathrm{TV}(q_0, p_{1-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\beta_{\max}} + \varepsilon_{\mathrm{TV}} + \kappa\, d^2 S \log(\beta_{\max} / \beta_{\min}) \Big[ (\beta_{\max} - 1)^2 + 1 \Big] + d\, \beta_{\min}.$$
Meanwhile, the number of steps satisfies
$$N \lesssim \frac{1}{\kappa} \left( 1 + \frac{\beta_{\min}^{-1} - 1}{\log(\beta_{\max} / \beta_{\min})} \right).$$
Thus, choosing
$$\beta_{\min} \asymp \frac{\varepsilon}{d}, \qquad \beta_{\max} \asymp \log \frac{\min\{d, \sqrt{dS}\}}{\varepsilon}, \qquad \text{and} \qquad \kappa \asymp \frac{\varepsilon}{d^2 S \log(d/\varepsilon)},$$
we have $N = \tilde{O}(d^3 S / \varepsilon)$.
Compared to Theorem 3, Theorem 4 improves the step complexity by a factor of order $O(d/\varepsilon)$. The key observation is that under the geometric noise schedule, the effective noise level decays exponentially along the reverse-time trajectory. The constant-then-decreasing step size adapts to this behavior: it uses relatively large constant steps in the high-noise regime, where the reverse dynamics are more stable, and automatically takes smaller steps near the low-noise regime, where the discretization error is most sensitive. This refinement removes the unfavorable dependence on $\beta_{\min}^{-2}$ that appears under a constant step size while increasing the number of steps only in the region where finer discretization is necessary. As a result, the adaptive scheme achieves a sharper overall sampling complexity.
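The step counts of the two grids under the geometric schedule can be compared by direct simulation. The sketch below is a hedged illustration: the parameter values are made up (a larger $\beta_{\min}$ than the default keeps the loop short), the last step is clipped at $1 - \delta$ by assumption, and the count is compared with $\frac{1}{\kappa}\big(1 + (\beta_{\min}^{-1} - 1)/\log(\beta_{\max}/\beta_{\min})\big)$ plus a small slack for the clipped step.

```python
import math

def count_steps(kappa, beta_min, beta_max, delta, adaptive):
    """Number of reverse-time steps needed to go from 0 to 1 - delta."""
    t, n = 0.0, 0
    end = 1.0 - delta
    while t < end:
        # Accumulated noise at forward time 1 - t under the geometric schedule.
        noise = beta_min ** t * beta_max ** (1.0 - t)
        step = kappa * min(1.0, noise) if adaptive else kappa
        t = min(t + step, end)      # clip the final step
        n += 1
    return n

kappa, beta_min, beta_max, delta = 0.01, 1e-2, 20.0, 1e-6   # illustrative values
n_const = count_steps(kappa, beta_min, beta_max, delta, adaptive=False)
n_adapt = count_steps(kappa, beta_min, beta_max, delta, adaptive=True)
step_bound = (1.0 + (1.0 / beta_min - 1.0) / math.log(beta_max / beta_min)) / kappa
```

At equal $\kappa$ the adaptive grid uses more steps than the constant one; the improvement in Theorem 4 comes from the fact that the adaptive rule tolerates a much larger $\kappa$ for the same target error.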
As a future direction, it would be interesting to investigate why the geometric noise schedule yields weaker guarantees than polynomial schedules.

7. Conclusions

In conclusion, this work provides a theoretical foundation for discrete diffusion models driven by time-inhomogeneous uniform-rate generators. By moving beyond the homogeneous noise schedules considered in much of the prior literature, our analysis broadens the class of discrete diffusion processes for which convergence can be rigorously understood. We further identify regularity conditions on the noise schedule that enable explicit convergence-rate estimates. Under these conditions, the resulting guarantees match state-of-the-art rates known for homogeneous discrete diffusion samplers, demonstrating that carefully controlled time-inhomogeneous schedules can retain the same theoretical efficiency while offering greater modeling flexibility.

Author Contributions

Conceptualization, Y.L. (Yuchen Liang); Methodology, Y.L. (Yuchen Liang); Formal analysis, Y.L. (Yuchen Liang); Investigation, Y.L. (Yuchen Liang); Writing—original draft, Y.L. (Yuchen Liang); Writing—review and editing, Y.L. (Yuchen Liang); Supervision, L.L., N.S. and Y.L. (Yingbin Liang); Funding acquisition, N.S. and Y.L. (Yingbin Liang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by U.S. National Science Foundation grant number NSF AI Institute (AI-EDGE) 2112471, ExpandAI-2324052, ECCS-2413528, CNS-2312836, CNS-2223452, CNS-2225561, CCF-2232907, ECCS-2448268; DEVCOM Army Research Laboratory grant number W911NF-23-2-0225.

Data Availability Statement

Data sharing is not applicable. No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Justification of Assumption 1

The following proposition links the TV score-estimation error in Assumption 1 with the commonly used score-entropy loss [8].
Proposition A1.
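Define the score entropy loss at time $t$ as
$$\mathcal{L}_{\mathrm{SE}}(s; t) = \mathbb{E}_{x_t \sim q_t} \sum_{y : y \ne x_t} R_t(y, x_t) \cdot \left( s_t(y, x_t) - \frac{q_t(y)}{q_t(x_t)} - \frac{q_t(y)}{q_t(x_t)} \log \frac{s_t(y, x_t)}{q_t(y) / q_t(x_t)} \right).$$
If $\hat{R}_t(x,y) - \overleftarrow{R}_t(x,y) = o(1)$ for all $t, x, y$, then we have
$$\mathcal{L}_{\mathrm{TV}}(s; T - t) \lesssim \sqrt{ d (S-1) \max\{1, \bar\beta_{T-t}^{-1}\} \cdot \mathcal{L}_{\mathrm{SE}}(s; T - t) }.$$
Proof of Proposition A1.
Recall that $\hat{R}_{t_k}(x_k, y) = R_{T-t_k}(y, x_k)\, s_{T-t_k}(y, x_k)$. From the definition, the TV loss at time $T - t_k$ satisfies
$$\begin{aligned}
\mathcal{L}_{\mathrm{TV}}^2(s; T - t_k) &= \Big( \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \sum_{y : y \ne x_k} \big| \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big| \Big)^2 \\
&\le \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \Big( \sum_{y : y \ne x_k} \big| \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big| \Big)^2 \\
&= d^2 (S-1)^2\, \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \Big( \frac{1}{d(S-1)} \sum_{y : y \ne x_k} \big| \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big| \Big)^2 \\
&\le \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}}\, d^2 (S-1)^2\, \frac{1}{d(S-1)} \sum_{y : y \ne x_k} \big( \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big)^2 \\
&= d (S-1)\, \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \sum_{y : y \ne x_k} \big( \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big)^2,
\end{aligned}$$
where both inequalities are due to Jensen’s inequality.
On the other hand, the score-entropy loss at time $T - t_k$ satisfies
$$\begin{aligned}
\mathcal{L}_{\mathrm{SE}}(s; T - t_k) &= \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \sum_{y : y \ne x_k} \Big( \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \log \frac{\hat{R}_{t_k}(x_k, y)}{\overleftarrow{R}_{t_k}(x_k, y)} \Big) \\
&\overset{(i)}{=} \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \sum_{y : y \ne x_k} \frac{ \big( \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big)^2 }{ 2\, \overleftarrow{R}_{t_k}(x_k, y) } + o(1) \\
&\overset{(ii)}{\gtrsim} \min\{1, \bar\beta_{T-t_k}\} \cdot \mathbb{E}_{x_k \sim \overleftarrow{q}_{t_k}} \sum_{y : y \ne x_k} \big( \hat{R}_{t_k}(x_k, y) - \overleftarrow{R}_{t_k}(x_k, y) \big)^2,
\end{aligned}$$
where $(i)$ follows by assuming that $\hat{R}_t(x,y) - \overleftarrow{R}_t(x,y) = o(1)$ for all $t, x, y$, and $(ii)$ follows because $q_t(y) / q_t(x_k) \lesssim S \max\{1, \bar\beta_t^{-1}\}$ (see Lemma A2) and thus $\overleftarrow{R}_{t_k}(x_k, y) \lesssim \max\{1, \bar\beta_{T-t_k}^{-1}\}$, whose reciprocal is $\min\{1, \bar\beta_{T-t_k}\}$.
Therefore, we have that, for all $t_k$,
$$\mathcal{L}_{\mathrm{TV}}(s; T - t_k) \lesssim \sqrt{ d (S-1) \max\{1, \bar\beta_{T-t_k}^{-1}\} \cdot \mathcal{L}_{\mathrm{SE}}(s; T - t_k) }.$$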
Remark A1.
Here we have shown a point-wise upper bound over $t$. A time-averaged upper bound can also be achieved with the step size in Theorem 2, $t_{k+1} - t_k = \kappa \min\{1, \bar\beta_{T-t_k}\}$, which yields
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathcal{L}_{\mathrm{TV}}(s;T-t_k) \lesssim T \sqrt{d S \sup_{k \in [N]} \mathcal{L}_{\mathrm{SE}}(s;T-t_k)}.$$
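The second-order expansion used in step (i) of the proof above can be sanity-checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the value of the reference rate are illustrative assumptions. It compares the per-pair score-entropy integrand with its quadratic approximation as the perturbation shrinks.

```python
import math

def se_integrand(r_hat: float, r: float) -> float:
    """Per-pair score-entropy integrand: r_hat - r - r * log(r_hat / r)."""
    return r_hat - r - r * math.log(r_hat / r)

def quadratic_approx(r_hat: float, r: float) -> float:
    """Second-order Taylor term (r_hat - r)^2 / (2 r)."""
    return (r_hat - r) ** 2 / (2.0 * r)

r = 0.7  # illustrative true reverse rate (assumed value, not from the paper)
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    exact = se_integrand(r + eps, r)
    approx = quadratic_approx(r + eps, r)
    ratios.append(exact / approx)
    print(f"eps={eps:.0e}  exact={exact:.3e}  quadratic={approx:.3e}  ratio={ratios[-1]:.4f}")
```

The ratio approaches one as the perturbation decreases, which is the regime covered by the $o(1)$ assumption in Proposition A1.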

Appendix B. Proof of Theorem 1

We follow the idea of [17,18] to analyze the Euler method by constructing a truncated version of the vanilla $\tau$-leaping sampler. In [17] (Lemma 8), it is shown that the truncated $\tau$-leaping sampler is asymptotically equivalent to the Euler method. Also, from Lemma 7 of [17], one important property of this sampler is that its sampling rate $\hat{R}_t$ is piecewise constant and, given $x_{t_k} \in [S]^d$, we have
$$\hat{R}_t(x_{t_k}, \cdot) = \hat{R}_{t_k}(x_{t_k}, \cdot). \tag{A1}$$
  • Step 1: Decompose total error.
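To begin, by [18] (Theorem 1), we have
$$\mathrm{TV}(\overleftarrow{q}_{T-\delta}, p_{T-\delta}) \le \mathrm{TV}(\overleftarrow{q}_0, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \sum_{y:\, y \neq x_t} \big| \hat{R}_t(x_t,y) - \overleftarrow{R}_t(x_t,y) \big| \, dt.$$
Thus, if we write $h_t(x_t) := \sum_{y:\, y \neq x_t} | \hat{R}_t(x_t,y) - \overleftarrow{R}_t(x_t,y) |$, we have
$$\begin{aligned} \mathrm{TV}(\overleftarrow{q}_{T-\delta}, p_{T-\delta}) &\le \mathrm{TV}(\overleftarrow{q}_0, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_t} h_t(x_t) \, dt \\ &= \mathrm{TV}(\overleftarrow{q}_0, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} h_{t_k}(x_{t_k}) \, dt + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_t,\, x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_t) - h_{t_k}(x_{t_k}) \big] \, dt \\ &= \underbrace{\mathrm{TV}(\overleftarrow{q}_0, p_0)}_{\text{initialization error}} + \underbrace{\sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} h_{t_k}(x_{t_k})}_{\text{estimation error}} \\ &\quad + \underbrace{\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \Big( \mathbb{E}_{x_t \sim \overleftarrow{q}_t,\, x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_t) - h_t(x_{t_k}) \big] + \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_{t_k}) - h_{t_k}(x_{t_k}) \big] \Big) dt}_{\text{discretization error}}. \end{aligned} \tag{A2}$$
By Assumption 1, the estimation error satisfies
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} h_{t_k}(x_{t_k}) \le \varepsilon_{\mathrm{TV}}.$$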
  • Step 2: Bound initialization error.
The following lemma bounds the initialization error for a non-homogeneous noise schedule.
Lemma A1.
Given the forward process in (1) with the rate given in (2), we have
$$\mathrm{TV}(q_T, p_0) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\bar\beta_T}.$$
Proof. 
See Appendix F.1. □
  • Step 3: Bound discretization error.
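It now remains to upper-bound the discretization error. As shown in (A2), the discretization error can be decomposed into two terms: one for the space-difference in the argument of $h_t$ (in expected value) and the other for the time-difference in $h_t$ itself. Using [18] (Lemma 4), the space-difference term can be upper-bounded as
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_t,\, x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_t) - h_t(x_{t_k}) \big] \lesssim (t - t_k)\, \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \big[ (-R_{T-t}(x_t,x_t))\, h_t(x_t) \big] + o(t - t_k) \lesssim (t - t_k)\, d\, \beta_{T-t}\, \mathbb{E}_{x_t \sim \overleftarrow{q}_t} h_t(x_t). \tag{A3}$$
Thus, since $t_{k+1} - t_k \le \kappa$, we have
$$\begin{aligned} &\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_t,\, x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_t) - h_t(x_{t_k}) \big] \, dt \\ &\lesssim \kappa d \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \beta_{T-t}\, \mathbb{E}_{x_t \sim \overleftarrow{q}_t} h_t(x_t) \, dt \\ &\overset{(i)}{\lesssim} \kappa d \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \frac{\beta_{T-t}}{1 - (t - t_k)\, d\, \beta_{T-t}}\, \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} h_t(x_{t_k}) \, dt \\ &= \kappa d \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \frac{\beta_{T-t}}{1 - (t - t_k)\, d\, \beta_{T-t}} \Big( \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} h_{t_k}(x_{t_k}) + \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_{t_k}) - h_{t_k}(x_{t_k}) \big] \Big) dt \\ &\overset{(ii)}{\lesssim} \kappa d\, \sup_t \frac{\beta_{T-t}}{1 - (t - t_k)\, d\, \beta_{T-t}} \Big( \varepsilon_{\mathrm{TV}} + \int_0^{T-\delta} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_{t_k}) - h_{t_k}(x_{t_k}) \big] \, dt \Big), \end{aligned}$$
where $(i)$ follows again by (A3), and $(ii)$ follows by Assumption 1. Thus, as long as $\kappa = o(1)$, the time-difference term in the discretization error dominates.
We now turn to the time-difference term of the discretization error, which satisfies
$$\begin{aligned} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \big[ h_t(x_{t_k}) - h_{t_k}(x_{t_k}) \big] &= \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y:\, y \neq x_{t_k}} \Big( \big| \hat{R}_t(x_{t_k},y) - \overleftarrow{R}_t(x_{t_k},y) \big| - \big| \hat{R}_{t_k}(x_{t_k},y) - \overleftarrow{R}_{t_k}(x_{t_k},y) \big| \Big) \\ &\le \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y:\, y \neq x_{t_k}} \big| \hat{R}_t(x_{t_k},y) - \overleftarrow{R}_t(x_{t_k},y) - \hat{R}_{t_k}(x_{t_k},y) + \overleftarrow{R}_{t_k}(x_{t_k},y) \big| \\ &\overset{(i)}{=} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y:\, y \neq x_{t_k}} \big| \hat{R}_{t_k}(x_{t_k},y) - \overleftarrow{R}_t(x_{t_k},y) - \hat{R}_{t_k}(x_{t_k},y) + \overleftarrow{R}_{t_k}(x_{t_k},y) \big| \\ &= \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y:\, y \neq x_{t_k}} \big| \overleftarrow{R}_t(x_{t_k},y) - \overleftarrow{R}_{t_k}(x_{t_k},y) \big|, \end{aligned} \tag{A5}$$
where $(i)$ follows from the property of the sampler given in (A1).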
  • Step 4: Bound Expected Absolute Difference in the Rate Function.
Continuing (A5), the rate-determining term is the expected absolute difference in the rate function. To this end, we need a characterization of the score function under a non-homogeneous noise schedule, as follows.
Lemma A2
(Score bound under non-homogeneous $\beta_t$). Fix $t > 0$ and $x \neq y$ such that $\mathrm{Ham}(x,y) = p$. Given the forward process in (1) with a rate given in (2), we have
$$\frac{q_t(y)}{q_t(x)} \lesssim \big( S \cdot \max\{1, (\bar\beta_t)^{-1}\} \big)^{p}.$$
Proof. 
See Appendix F.2. □
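Now, for any $x_{t_k} \in [S]^d$, the sum difference in the reverse rate can be further calculated using (3) as
$$\begin{aligned} &\mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y \neq x_{t_k}} \big| \overleftarrow{R}_{t_k}(x_{t_k},y) - \overleftarrow{R}_t(x_{t_k},y) \big| \\ &= \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y \neq x_{t_k}} \Big| \frac{q_{T-t_k}(y)}{q_{T-t_k}(x_{t_k})} R_{T-t_k}(y,x_{t_k}) - \frac{q_{T-t}(y)}{q_{T-t}(x_{t_k})} R_{T-t}(y,x_{t_k}) \Big| \\ &\le \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y \neq x_{t_k}} \Big| \frac{q_{T-t_k}(y)}{q_{T-t_k}(x_{t_k})} - \frac{q_{T-t}(y)}{q_{T-t}(x_{t_k})} \Big| R_{T-t_k}(y,x_{t_k}) + \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y \neq x_{t_k}} \frac{q_{T-t}(y)}{q_{T-t}(x_{t_k})} \big| R_{T-t_k}(y,x_{t_k}) - R_{T-t}(y,x_{t_k}) \big| \\ &= \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{\substack{y \neq x_{t_k} \\ \mathrm{Ham}(y,x_{t_k}) = 1}} \Big| \frac{q_{T-t_k}(y)}{q_{T-t_k}(x_{t_k})} - \frac{q_{T-t}(y)}{q_{T-t}(x_{t_k})} \Big| R_{T-t_k}(y,x_{t_k}) + \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{\substack{y \neq x_{t_k} \\ \mathrm{Ham}(y,x_{t_k}) = 1}} \frac{q_{T-t}(y)}{q_{T-t}(x_{t_k})} \big| R_{T-t_k}(y,x_{t_k}) - R_{T-t}(y,x_{t_k}) \big|, \end{aligned} \tag{A6}$$
where the last line follows because $R_t(y,x) = 0$ whenever $\mathrm{Ham}(y,x) \ge 2$. Here the second term captures the effect of non-homogeneity of $\beta_t$, which becomes zero in the case of constant $\beta_t$.
We first deal with the second term in (A6). Since $\beta_t$ is smooth, we have
$$\big| R_{T-t_k}(y,x_{t_k}) - R_{T-t}(y,x_{t_k}) \big| = \frac{1}{S} \big| \beta_{T-t_k} - \beta_{T-t} \big| \lesssim \frac{1}{S} \big| \beta'_{T-t_k} \big| (t - t_k).$$
Thus, combining the result from Lemma A2, the second term in (A6) satisfies
$$\mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{\substack{y \neq x_{t_k} \\ \mathrm{Ham}(y,x_{t_k}) = 1}} \frac{q_{T-t}(y)}{q_{T-t}(x_{t_k})} \big| R_{T-t_k}(y,x_{t_k}) - R_{T-t}(y,x_{t_k}) \big| \lesssim S d\, \big| \beta'_{T-t_k} \big| \max\{1, \bar\beta_{T-t_{k+1}}^{-1}\} (t - t_k).$$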
In order to provide an upper bound for the first term in (A6), it is essential to deal with the time difference of the likelihood ratios. The following lemma is a direct extension of [17] (Lemma 5).
Lemma A3.
Fix $s < t$ such that $t - s$ is small. Given the forward process in (1) with a rate given in (2), we have
$$\mathbb{E}_{x_t \sim q_t} \sum_{\substack{y \neq x_t \\ \mathrm{Ham}(y,x_t) = 1}} \Big| \frac{q_t(y)}{q_t(x_t)} - \frac{q_s(y)}{q_s(x_t)} \Big| R_t(y,x_t) \lesssim d S \beta_t^2 \max\{1, \bar\beta_s^{-2}\} (t - s) + d^2 S \beta_t^2 \max\{1, \bar\beta_s^{-1}\} (t - s).$$
Proof. 
See Appendix F.3. □
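Thus, considering the two terms in (A6), we have the following bound:
$$\mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{y \neq x_{t_k}} \big| \overleftarrow{R}_{t_k}(x_{t_k},y) - \overleftarrow{R}_t(x_{t_k},y) \big| \lesssim d S \max\{1, \bar\beta_{T-t_{k+1}}^{-1}\} \Big( \big| \beta'_{T-t_k} \big| + \beta_t^2 \max\{1, \bar\beta_{T-t_{k+1}}^{-1}\} + d \beta_t^2 \Big) (t - t_k).$$
Therefore, combining all the ingredients above and integrating over each interval $[t_k, t_{k+1}]$, we have
$$\mathrm{TV}(\overleftarrow{q}_{T-\delta}, p_{T-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\bar\beta_T} + \varepsilon_{\mathrm{TV}} + d S \sum_{k=0}^{N-1} \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \Big( \big| \beta'_{T-t_k} \big| + d \beta_{T-t_k}^2 \Big) (t_{k+1} - t_k)^2.$$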
  • Step 5: Bound initial perturbation.
As we are finishing up the proof, it remains to investigate how much perturbation would be introduced due to early-stopping when β t is non-homogeneous. This is characterized in the following lemma, whose proof is similar to the last part of [12] (Theorem 6).
Lemma A4.
Under the forward process in (1) with the rate given in (2), we have
$$\mathrm{TV}(q_0, q_\delta) \lesssim d \bar\beta_\delta, \quad \text{as } \delta \to 0.$$
Proof. 
See Appendix F.4. □
The proof of Theorem 1 is now complete.

Appendix C. Proof of Theorem 2

From the result of Theorem 1, it remains to determine the number of steps using the special step sizes. The goal is to directly characterize the summation term, which we perform as follows.
Let $\mathrm{fn}(\beta_{T-t_k}) \in \{ \beta_{T-t_k}^2, | \beta'_{T-t_k} | \}$. Recall that $\bar\beta_t$ is increasing. Define
$$k^* := \sup\{ k : \bar\beta_{T-t_k} > 1 \}.$$
We thus have the following cases.
1.
Case 1: $k < k^*$. This implies that $\bar\beta_{T-t_k} > \bar\beta_{T-t_{k+1}} > 1$, and $t_{k+1} - t_k \le \kappa$. Thus, we have
$$\sum_{k=0}^{k^*-1} (t_{k+1} - t_k)^2\, \mathrm{fn}(\beta_{T-t_k})\, \min\{1, \bar\beta_{T-t_{k+1}}\}^{-2} \le \kappa^2 \sum_{k=0}^{k^*-1} \mathrm{fn}(\beta_{T-t_k}).$$
2.
Case 2: $k > k^*$. This implies that $\bar\beta_{T-t_{k+1}} < \bar\beta_{T-t_k} \le 1$, and
$$\sum_{k=k^*+1}^{N-1} (t_{k+1} - t_k)^2\, \mathrm{fn}(\beta_{T-t_k})\, \min\{1, \bar\beta_{T-t_{k+1}}\}^{-2} \le \kappa^2 \sum_{k=k^*+1}^{N-1} \mathrm{fn}(\beta_{T-t_k})\, \frac{\bar\beta_{T-t_k}^2}{\bar\beta_{T-t_{k+1}}^2} \lesssim \kappa^2 \sum_{k=k^*+1}^{N-1} \mathrm{fn}(\beta_{T-t_k}),$$
where the last step follows from the Taylor expansion of $\bar\beta$ (noting that $\bar\beta$ is continuous since, by definition, it is the integral of $\beta_t$):
$$\bar\beta_{T-t_k} = \bar\beta_{T-t_{k+1}} + O(t_{k+1} - t_k) = \bar\beta_{T-t_{k+1}} + O(\kappa), \quad \forall k = 0, \dots, N-1. \tag{A8}$$
Since $\bar\beta_{T-t_{k+1}} \ge \bar\beta_\delta > 0$, we have $\bar\beta_{T-t_k}^2 / \bar\beta_{T-t_{k+1}}^2 = 1 + O(\kappa) = 1 + o(1)$ for all $k = 0, \dots, N-1$.
3.
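Case 3: $k = k^*$. This implies that $\bar\beta_{T-t_{k+1}} \le 1$, $\bar\beta_{T-t_k} > 1$, and $t_{k+1} - t_k \le \kappa$. Also, from (A8), $\bar\beta_{T-t_k} \le 1 + O(\kappa)$. Then,
$$(t_{k+1} - t_k)^2\, \mathrm{fn}(\beta_{T-t_k})\, \min\{1, \bar\beta_{T-t_{k+1}}\}^{-2} \le \kappa^2\, \mathrm{fn}(\beta_{T-t_k})\, \frac{\bar\beta_{T-t_k}^2}{\bar\beta_{T-t_{k+1}}^2} \lesssim \kappa^2\, \mathrm{fn}(\beta_{T-t_{k^*}}).$$
Summing up all three cases, we have
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)^2\, \mathrm{fn}(\beta_{T-t_k})\, \min\{1, \bar\beta_{T-t_{k+1}}\}^{-2} \lesssim \kappa^2 \sum_{k=0}^{N-1} \mathrm{fn}(\beta_{T-t_k}).$$
Now, under Assumption 2, since $\beta_t$ and $|\beta'_t|$ have upper bounds independent of the system parameters, we have
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)^2\, \mathrm{fn}(\beta_{T-t_k})\, \min\{1, \bar\beta_{T-t_{k+1}}\}^{-2} \lesssim \kappa^2 N.$$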
Next, we choose $t_{k+1} - t_k \asymp \kappa \min\{1, \bar\beta_{T-t_k}\}$ and determine an upper bound on the number of steps. Define $t^* = t_{k^*}$. When $k \le k^*$, the number of steps is upper-bounded as
$$k^* =: N_1 = \frac{T - t^*}{\kappa} \le \frac{T}{\kappa}.$$
When $k > k^*$, note that $t_{k+1}$ is chosen according to the following fixed-point iteration: $T - t_{k+1} = h(T - t_k)$, where $h(z) := z - \kappa \bar\beta_z$. Clearly, $z = 0$ is a fixed point. By Banach’s fixed-point theorem,
$$T - t_{N_2} \le (T - t^*) \cdot \Big( \max_{z \in [\delta,\, T - t^*]} | h'(z) | \Big)^{N_2}.$$
Here note that $h'(z) = 1 - \kappa \beta_z \le 1 - \kappa \beta_*$, where we recall that $\beta_* = \min_{z \in [\delta,\, \bar\beta^{-1}(1)]} \beta_z$. Thus, in order to reach $T - t_{N_2} \le \delta$, it suffices to take
$$N_2 \ge \log_{1 - \kappa \beta_*} \frac{\delta}{T - t^*} \approx \frac{\log(T - t^*) + \log \delta^{-1}}{\kappa \beta_*}.$$
Therefore, the total number of steps satisfies
$$N = N_1 + N_2 \lesssim \frac{T + (\log T + \log \delta^{-1}) / \beta_*}{\kappa}.$$
Finally, since under Assumption 2 we have $\beta_* \gtrsim 1$, this yields
$$N \lesssim \frac{T + \log \delta^{-1}}{\kappa}.$$
Then, we have
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)^2\, \mathrm{fn}(\beta_{T-t_k})\, \min\{1, \bar\beta_{T-t_{k+1}}\}^{-2} \lesssim \kappa\, (T + \log \delta^{-1}).$$
Note that $T \asymp \mathrm{poly}(\log(d/\varepsilon))$. Choosing the corresponding $T$ and $\kappa$ yields the desired result.
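The step count derived above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the authors' implementation: the function `count_steps` and the constant-rate schedule ($\beta_t = 1$, so $\bar\beta_t = t$) are illustrative choices, and the final step is clipped at the early-stopping time.

```python
import math

def count_steps(T, delta, kappa, beta_bar):
    """Count reverse-time steps with t_{k+1} - t_k = kappa * min(1, beta_bar(T - t_k))."""
    z = T  # z = T - t_k, the remaining forward time
    n = 0
    while z > delta:
        step = kappa * min(1.0, beta_bar(z))
        z = max(z - step, delta)  # clip the final step at the early-stopping time
        n += 1
    return n

# Assumed illustrative schedule: constant rate beta_t = 1, hence beta_bar(t) = t.
T, delta, kappa = 5.0, 1e-3, 0.01
n_steps = count_steps(T, delta, kappa, beta_bar=lambda t: t)
reference = (T + math.log(1.0 / delta)) / kappa
print(n_steps, round(reference))
```

In this toy setting the simulated count stays below $(T + \log \delta^{-1})/\kappa$ up to an additive constant, matching the order of the bound on $N$.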

Appendix D. Proof of Theorem 3

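We begin from Theorem 1, which also applies to the constant step size:
$$\mathrm{TV}(q_0, p_{T-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\bar\beta_T} + \varepsilon_{\mathrm{TV}} + d S \sum_{k=0}^{N-1} \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \Big( \big| \beta'_{T-t_k} \big| + d \beta_{T-t_k}^2 \Big) (t_{k+1} - t_k)^2 + d \bar\beta_\delta.$$
We recall from (7) that
$$\bar\beta_t := \beta_{\min}^{1-t} \beta_{\max}^{t}, \qquad \beta_t = \beta_{\min}^{1-t} \beta_{\max}^{t} \log(\beta_{\max}/\beta_{\min}), \qquad \beta'_t = \beta_{\min}^{1-t} \beta_{\max}^{t} \log^2(\beta_{\max}/\beta_{\min}).$$
Thus, we have $\bar\beta_T = \beta_{\max}$. Also, note that
$$\bar\beta_\delta = \beta_{\min}^{1-\delta} \beta_{\max}^{\delta} \xrightarrow{\delta \to 0} \beta_{\min} > 0,$$
which implies that there will be an asymptotic mismatch for vanishing $\delta$. What remains is to work out the summation term, and we proceed as follows.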
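Note that
$$\begin{aligned} &\sum_{k=0}^{N-1} \kappa \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \Big( \big| \beta'_{T-t_k} \big| + d \beta_{T-t_k}^2 \Big) \\ &= \log^2(\beta_{\max}/\beta_{\min}) \sum_{k=0}^{N-1} \kappa \max\{1, \beta_{\min}^{-2 t_{k+1}} \beta_{\max}^{-2(1 - t_{k+1})}\} \Big( \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} + d \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} \Big) \\ &\lesssim \log^2(\beta_{\max}/\beta_{\min}) \int_\delta^1 \max\{1, \beta_{\min}^{-2(1-t)} \beta_{\max}^{-2t}\} \Big( \beta_{\min}^{1-t} \beta_{\max}^{t} + d \beta_{\min}^{2(1-t)} \beta_{\max}^{2t} \Big) dt. \end{aligned}$$
Since $\beta_{\min} < 1 < \beta_{\max}$ and $\bar\beta_t$ is smooth, there exists $t^*$ such that $\bar\beta_{t^*} = 1$. Then, the integral becomes
$$\begin{aligned} &\log(\beta_{\max}/\beta_{\min}) \int_\delta^1 \max\{1, \beta_{\min}^{-2(1-t)} \beta_{\max}^{-2t}\} \Big( \beta_{\min}^{1-t} \beta_{\max}^{t} + d \beta_{\min}^{2(1-t)} \beta_{\max}^{2t} \Big) dt \\ &\le \log(\beta_{\max}/\beta_{\min}) \Big[ \int_\delta^{t^*} \beta_{\min}^{-2(1-t)} \beta_{\max}^{-2t} \Big( \beta_{\min}^{1-t} \beta_{\max}^{t} + d \Big) dt + \int_{t^*}^1 \Big( \beta_{\min}^{1-t} \beta_{\max}^{t} + d \beta_{\min}^{2(1-t)} \beta_{\max}^{2t} \Big) dt \Big] \\ &= -\frac{1}{\bar\beta_t} \Big|_\delta^{t^*} - \frac{d}{2 \bar\beta_t^2} \Big|_\delta^{t^*} + \bar\beta_t \Big|_{t^*}^1 + \frac{d}{2} \bar\beta_t^2 \Big|_{t^*}^1 \\ &\lesssim \beta_{\min}^{-1} - 1 + \frac{d}{2} (\beta_{\min}^{-1} - 1)^2 + (\beta_{\max} - 1) + \frac{d}{2} (\beta_{\max} - 1)^2 \\ &\lesssim d \Big[ (\beta_{\min}^{-1} - 1)^2 + (\beta_{\max} - 1)^2 \Big]. \end{aligned}$$
Combining all of the above, we finally have, for each $\delta > 0$,
$$\mathrm{TV}(q_0, p_{T-\delta}) \lesssim \min\{d, \sqrt{dS}\}\, e^{-\beta_{\max}} + \varepsilon_{\mathrm{TV}} + \kappa d^2 S \log(\beta_{\max}/\beta_{\min}) \Big[ (\beta_{\min}^{-1} - 1)^2 + (\beta_{\max} - 1)^2 \Big] + d \beta_{\min}.$$
Since the right-hand side does not depend on $\delta$, taking $\delta \to 0^+$ yields the desired result.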
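The relations among $\bar\beta_t$, $\beta_t$, and $\beta'_t$ for the geometric schedule recalled from (7) can be checked by finite differences. The sketch below is illustrative only: the values of `beta_min` and `beta_max`, the helper names, and the step size of the difference quotient are assumptions, not quantities from the paper.

```python
import math

beta_min, beta_max = 0.1, 10.0  # assumed illustrative values with beta_min < 1 < beta_max
lam = math.log(beta_max / beta_min)

def beta_bar(t):
    """Geometric interpolation beta_min^(1-t) * beta_max^t."""
    return beta_min ** (1.0 - t) * beta_max ** t

def beta(t):
    """Stated rate: beta_bar(t) * log(beta_max / beta_min)."""
    return beta_bar(t) * lam

def beta_prime(t):
    """Stated derivative of the rate: beta_bar(t) * log^2(beta_max / beta_min)."""
    return beta_bar(t) * lam ** 2

def central_diff(f, t, h=1e-5):
    """Symmetric finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

# Crossing time t_star with beta_bar(t_star) = 1, which exists since beta_min < 1 < beta_max.
t_star = math.log(1.0 / beta_min) / lam
print(central_diff(beta_bar, 0.3), beta(0.3))
print(central_diff(beta, 0.3), beta_prime(0.3))
print(t_star, beta_bar(t_star))
```

The check confirms that $\beta_t$ is the time derivative of $\bar\beta_t$ for this schedule and that all three quantities share the same exponential form, which is the property exploited in the proof of Theorem 4.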

Appendix E. Proof of Theorem 4

Continuing the proof of Theorem 3, we again have
$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)^2 \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \Big( \big| \beta'_{T-t_k} \big| + d \beta_{T-t_k}^2 \Big) = \log^2(\beta_{\max}/\beta_{\min}) \sum_{k=0}^{N-1} (t_{k+1} - t_k)^2 \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \Big( \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} + d \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} \Big).$$
Now, suppose we choose step sizes as in Theorem 2:
$$t_{k+1} - t_k = \kappa \min\{1, \bar\beta_{T-t_k}\} = \kappa \min\{1, \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k}\}. \tag{A9}$$
Again define
$$k^* := \sup\{ k : \bar\beta_{T-t_k} = \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} > 1 \},$$
with $t^*$ such that $\beta_{\min}^{t^*} \beta_{\max}^{1 - t^*} = 1$. Then, from the proof of Theorem 2, using this step size, we have
$$\log^2(\beta_{\max}/\beta_{\min}) \sum_{k=0}^{N-1} (t_{k+1} - t_k)^2 \max\{1, \bar\beta_{T-t_{k+1}}^{-2}\} \Big( \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} + d \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} \Big) \lesssim \log^2(\beta_{\max}/\beta_{\min})\, \kappa^2 \sum_{k=0}^{N-1} \Big( \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} + d \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} \Big).$$
For $k < k^*$, $t_{k+1} - t_k = \kappa$. This is the regime similar to Theorem 3, where we have
$$\sum_{k=0}^{k^*-1} \Big( \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} + d \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} \Big) \lesssim \frac{1}{\kappa} \int_{t^*}^1 \Big( \beta_{\min}^{1-t} \beta_{\max}^{t} + d \beta_{\min}^{2(1-t)} \beta_{\max}^{2t} \Big) dt = \frac{1}{\kappa \log(\beta_{\max}/\beta_{\min})} \Big[ (\beta_{\max} - 1) + \frac{d}{2} (\beta_{\max} - 1)^2 \Big].$$
Meanwhile, the number of steps in this regime, $N_1$, satisfies
$$N_1 \le \frac{1 - t^*}{\kappa} \le \frac{1}{\kappa}.$$
Next, the goal is to upper-bound the two summation terms as well as $N_2$ in the regime where $k \ge k^*$. The key idea is to use the property that $\bar\beta_t$, $\beta_t$, and $\beta'_t$ have the same exponential form in $t$.
  • Part 1: On the number of steps N 2.
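Let us define a sequence of auxiliary variables $y_k$ as follows. Let
$$y_k := \bar\beta_{1 - t_k}^{-1} = \beta_{\min}^{-t_k} \beta_{\max}^{-(1 - t_k)}.$$
Here, by definition of $k^*$, we have $y_{k^*} \approx 1$. Also notice that $y_k$ is increasing in $k$, with
$$\frac{y_{k+1}}{y_k} = \beta_{\min}^{-(t_{k+1} - t_k)} \beta_{\max}^{t_{k+1} - t_k} = e^{\lambda (t_{k+1} - t_k)},$$
where $\lambda := \log(\beta_{\max}/\beta_{\min})$. With the step size in (A9) and since $k \ge k^*$, we have
$$\frac{y_{k+1}}{y_k} = e^{\lambda \kappa \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k}} = e^{\lambda \kappa / y_k} \ge 1 + \frac{\lambda \kappa}{y_k},$$
which further implies that
$$y_{k+1} \ge y_k + \lambda \kappa.$$
For the terminal condition, we have $y_N \le (\beta_{\min}^{1-\delta} \beta_{\max}^{\delta})^{-1}$. This implies that
$$N_2 = N - k^* \le \frac{(\beta_{\min}^{1-\delta} \beta_{\max}^{\delta})^{-1} - 1}{\lambda \kappa} \xrightarrow{\delta \to 0} \frac{\beta_{\min}^{-1} - 1}{\lambda \kappa}.$$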
  • Part 2: On the first-order sum.
With (A9) and when $k \ge k^*$, the first-order sum is a telescoping sum:
$$\sum_{k=k^*}^{N-1} \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} = \sum_{k=k^*}^{N-1} \frac{t_{k+1} - t_k}{\kappa} = \frac{1}{\kappa} (t_N - t_{k^*}) \le \frac{1 - \delta}{\kappa}.$$
  • Part 3: On the second-order sum.
The second-order sum needs some extra work. The idea is to upper-bound it with the first-order sum using trajectory smoothness. Write
$$f_k := y_k^{-1} = \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k}. \tag{A10}$$
Note that $f_k \le 1$ for $k > k^*$ (for $k = k^*$, we get $f_k = 1 + o(1)$), and we have a recursion similar to the one above:
$$f_{k+1} = f_k\, e^{-\lambda \kappa f_k}.$$
The following lemma characterizes an important property of $f_k$, which will be useful for the analysis.
Lemma A5.
With $f_k$ defined in (A10) and when $k > k^*$, we have
$$f_k^2 \le \frac{f_k - f_{k+1}}{1 - e^{-\lambda \kappa}},$$
where $\lambda := \log(\beta_{\max}/\beta_{\min})$.
Proof. 
See Appendix G. □
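With Lemma A5, the second-order sum becomes
$$d \sum_{k=k^*}^{N-1} \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} = d \sum_{k=k^*}^{N-1} f_k^2 \le d \sum_{k=k^*}^{N-1} \frac{f_k - f_{k+1}}{1 - e^{-\lambda \kappa}} = \frac{d}{1 - e^{-\lambda \kappa}} (t_N - t_{k^*}) \lesssim \frac{d}{\lambda \kappa} (1 - \delta).$$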
  • Part 4: Combine all previous parts.
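The overall discretization error becomes
$$\begin{aligned} &\log^2(\beta_{\max}/\beta_{\min})\, \kappa^2 \sum_{k=0}^{N-1} \Big( \beta_{\min}^{t_k} \beta_{\max}^{1 - t_k} + d \beta_{\min}^{2 t_k} \beta_{\max}^{2(1 - t_k)} \Big) \\ &\lesssim \log^2(\beta_{\max}/\beta_{\min})\, \kappa^2 \cdot \frac{1}{\kappa \log(\beta_{\max}/\beta_{\min})} \Big[ (\beta_{\max} - 1) + \frac{d}{2} (\beta_{\max} - 1)^2 \Big] + \log^2(\beta_{\max}/\beta_{\min})\, \kappa^2 \Big( \frac{1 - \delta}{\kappa} + \frac{d}{\log(\beta_{\max}/\beta_{\min})\, \kappa} (1 - \delta) \Big) \\ &\lesssim d \log(\beta_{\max}/\beta_{\min})\, \kappa \Big[ (\beta_{\max} - 1)^2 + 1 - \delta \Big] \\ &\xrightarrow{\delta \to 0} d \log(\beta_{\max}/\beta_{\min})\, \kappa \Big[ (\beta_{\max} - 1)^2 + 1 \Big]. \end{aligned}$$
Also note that the total number of steps satisfies
$$N = N_1 + N_2 \le \frac{1}{\kappa} \Big( 1 + \frac{\beta_{\min}^{-1} - 1}{\log(\beta_{\max}/\beta_{\min})} \Big).$$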
The proof is now complete.
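The step count in this proof can be compared against a direct simulation of the adaptive step sizes in (A9). The following is a hedged sketch: `count_adaptive_steps` and the numerical values of `beta_min`, `beta_max`, `kappa`, and `delta` are illustrative assumptions, and the last step is clipped at $1 - \delta$.

```python
import math

def count_adaptive_steps(beta_min, beta_max, kappa, delta):
    """Steps of size kappa * min(1, beta_min^t * beta_max^(1-t)) from t = 0 to t = 1 - delta."""
    t, n = 0.0, 0
    t_end = 1.0 - delta
    while t < t_end:
        rate = beta_min ** t * beta_max ** (1.0 - t)  # beta_bar at forward time 1 - t
        t = min(t + kappa * min(1.0, rate), t_end)    # clip the final step
        n += 1
    return n

beta_min, beta_max, kappa, delta = 0.1, 10.0, 0.01, 1e-3  # assumed illustrative values
lam = math.log(beta_max / beta_min)
n_steps = count_adaptive_steps(beta_min, beta_max, kappa, delta)
bound = (1.0 + (1.0 / beta_min - 1.0) / lam) / kappa
print(n_steps, round(bound))
```

For these values the simulated number of steps falls below the stated bound on $N = N_1 + N_2$; other parameter choices would need to be checked separately.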

Appendix F. Auxiliary Proofs

In this section, we provide proofs of all the auxiliary lemmas in this paper.

Appendix F.1. Proof of Lemma A1

Write $\epsilon_T := e^{-\bar\beta_T}$. Let $u_S$ denote the uniform distribution on $[S]$, and let $u_d = u_S^{\otimes d}$. By assumption, $p_0 = u_d$.
First, to obtain the analytical solution for the conditional probability, we can solve the Kolmogorov forward equation for the $i$-th dimension ($i \in [d]$):
$$\frac{d}{dt} q^i_{t|0}(z|a) = \sum_{\tilde{z} \in [S]} q^i_{t|0}(\tilde{z}|a)\, R_t^{\mathrm{tok}}(\tilde{z}, z),$$
whose solution is
$$q^i_{t|0}(z|a) = \Big[ \exp\Big( \int_0^t R_s^{\mathrm{tok}}\, ds \Big) \Big](a,z) = \big[ \exp\big( \bar\beta_t R^{\mathrm{base}} \big) \big](a,z) = \big[ P \exp\big( \bar\beta_t \Lambda \big) P^{-1} \big](a,z) = \begin{cases} S^{-1} (1 - e^{-\bar\beta_t}) & \text{if } z \neq a, \\ S^{-1} (1 + (S-1) e^{-\bar\beta_t}) & \text{if } z = a, \end{cases} \tag{A11}$$
where we recall that $R_t^{\mathrm{tok}} = \beta_t R^{\mathrm{base}}$ and we denote the eigendecomposition of $R^{\mathrm{base}} = S^{-1} \mathbf{1}_S \mathbf{1}_S^\top - I_S$ as $R^{\mathrm{base}} = P \Lambda P^{-1}$. Equivalently, we can write
$$q^i_{T|0}(z|a) = u_S(z) + \epsilon_T \big( \mathbb{1}\{z = a\} - u_S(z) \big).$$
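The closed-form transition kernel can be cross-checked against a direct matrix exponential of $\bar\beta_t R^{\mathrm{base}}$. This is a minimal sketch with assumed names and values (`mat_exp`, `closed_form`, `S = 4`, accumulated noise `b = 0.8`); it uses a truncated Taylor series instead of the eigendecomposition mentioned in the text.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(m, terms=60):
    """Truncated Taylor series for exp(m); adequate for the small matrices used here."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def closed_form(S, b):
    """Stated per-coordinate kernel with accumulated noise b = beta_bar_t."""
    off = (1.0 - math.exp(-b)) / S
    diag = (1.0 + (S - 1) * math.exp(-b)) / S
    return [[diag if i == j else off for j in range(S)] for i in range(S)]

S, b = 4, 0.8  # assumed illustrative vocabulary size and accumulated noise
generator = [[b * (1.0 / S - float(i == j)) for j in range(S)] for i in range(S)]
max_gap = max(
    abs(x - y)
    for row_a, row_b in zip(mat_exp(generator), closed_form(S, b))
    for x, y in zip(row_a, row_b)
)
print(max_gap)
```

The agreement reflects the identity $(R^{\mathrm{base}})^2 = -R^{\mathrm{base}}$, which collapses the exponential series to $I + (1 - e^{-\bar\beta_t}) R^{\mathrm{base}}$.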
Thus, for fixed $x_0, x_T \in [S]^d$,
$$q_{T|0}(x_T | x_0) = \prod_{i=1}^d q^i_{T|0}(x_T^i | x_0^i).$$
To upper-bound the initialization error, we first reduce to the case of a fixed initialization. Since
$$q_T(x_T) = \sum_{x_0 \in [S]^d} q_0(x_0)\, q_{T|0}(x_T | x_0),$$
by Jensen’s inequality we have
$$\mathrm{TV}(q_T, p_0) \le \sum_{x_0 \in [S]^d} q_0(x_0)\, \mathrm{TV}\big( q_{T|0}(\cdot | x_0), p_0 \big).$$
Therefore it suffices to upper-bound $\mathrm{TV}(q_{T|0}(\cdot | x_0), u_d)$ uniformly over $x_0$.
For one coordinate, we have
$$\mathrm{TV}\big( q^i_{T|0}(\cdot | a), u_S \big) = \frac{1}{2} \sum_{z \in [S]} \big| q^i_{T|0}(z|a) - S^{-1} \big| = \frac{1}{2} \cdot \frac{S-1}{S} \epsilon_T + \frac{1}{2} \cdot (S-1) \frac{1}{S} \epsilon_T = \frac{S-1}{S} \epsilon_T.$$
Using the standard tensorization bound for total variation of product measures,
$$\mathrm{TV}\Big( \bigotimes_{i=1}^d \mu_i, \bigotimes_{i=1}^d \nu_i \Big) \le 1 - \prod_{i=1}^d \big( 1 - \mathrm{TV}(\mu_i, \nu_i) \big),$$
we obtain
$$\mathrm{TV}\big( q_{T|0}(\cdot | x_0), u_d \big) \le 1 - \Big( 1 - \frac{S-1}{S} \epsilon_T \Big)^d.$$
Consequently,
$$\mathrm{TV}(q_T, p_0) \le 1 - \Big( 1 - \frac{S-1}{S} \epsilon_T \Big)^d \le d\, \frac{S-1}{S} \epsilon_T \le d\, e^{-\bar\beta_T}.$$
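We can also derive a complementary bound using the likelihood ratio. For fixed $x_0$, define
$$L_{x_0}(x_T) := \frac{q_{T|0}(x_T | x_0)}{u_d(x_T)}.$$
For one coordinate,
$$\frac{q^i_{T|0}(z|a)}{u_S(z)} = 1 + \epsilon_T \big( S\, \mathbb{1}\{z = a\} - 1 \big).$$
Hence,
$$L_{x_0}(x_T) = \prod_{i=1}^d \Big( 1 + \epsilon_T \big( S\, \mathbb{1}\{x_T^i = x_0^i\} - 1 \big) \Big).$$
By Cauchy–Schwarz,
$$\mathrm{TV}\big( q_{T|0}(\cdot | x_0), u_d \big) = \frac{1}{2} \mathbb{E}_{x_T \sim u_d} \big| L_{x_0}(x_T) - 1 \big| \le \frac{1}{2} \sqrt{ \mathbb{E}_{x_T \sim u_d} \big( L_{x_0}(x_T) - 1 \big)^2 }.$$
Since $x_T^1, \dots, x_T^d$ are independent under $u_d$,
$$\mathbb{E}_{x_T \sim u_d} L_{x_0}(x_T)^2 = \prod_{i=1}^d \mathbb{E}_{z \sim u_S} \Big( 1 + \epsilon_T \big( S\, \mathbb{1}\{z = x_0^i\} - 1 \big) \Big)^2 = \Big[ \frac{1}{S} \big( 1 + (S-1) \epsilon_T \big)^2 + \frac{S-1}{S} (1 - \epsilon_T)^2 \Big]^d = \big( 1 + (S-1) \epsilon_T^2 \big)^d.$$
Also, $\mathbb{E}_{u_d}[L_{x_0}] = 1$. Therefore,
$$\mathbb{E}_{u_d} (L_{x_0} - 1)^2 = \big( 1 + (S-1) \epsilon_T^2 \big)^d - 1,$$
and thus
$$\mathrm{TV}\big( q_{T|0}(\cdot | x_0), u_d \big) \le \frac{1}{2} \sqrt{ \big( 1 + (S-1) \epsilon_T^2 \big)^d - 1 }.$$
Again using convexity over $x_0 \sim q_0$, this gives
$$\mathrm{TV}(q_T, p_0) \le \frac{1}{2} \sqrt{ \big( 1 + (S-1) \epsilon_T^2 \big)^d - 1 }.$$
Combining the two bounds yields
$$\mathrm{TV}(q_T, p_0) \le \min\Big\{ 1 - \Big( 1 - \frac{S-1}{S} \epsilon_T \Big)^d,\ \frac{1}{2} \sqrt{ \big( 1 + (S-1) \epsilon_T^2 \big)^d - 1 } \Big\}.$$
Finally, since $1 - (1 - r)^d \le d r$ and $(1 + a)^d \le e^{a d}$, we obtain
$$\mathrm{TV}(q_T, p_0) \lesssim \min\big\{ d \epsilon_T, \sqrt{dS}\, \epsilon_T \big\} = \min\big\{ d, \sqrt{dS} \big\}\, \epsilon_T.$$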
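For small $S$ and $d$, the two intermediate bounds can be compared against the exact total variation distance from a fixed initialization. The sketch below is an illustration under assumptions: the function names and the values `S = 5`, `d = 8`, and $\bar\beta_T = 3$ are chosen for the example only. The exact distance is computed by grouping states according to how many coordinates agree with $x_0$.

```python
import math

def exact_tv(S, d, eps):
    """Exact TV between q_{T|0}(.|x0) and the uniform law on [S]^d, grouping states by
    the number m of coordinates that agree with x0."""
    a = (1.0 + (S - 1) * eps) / S  # per-coordinate probability of matching x0
    b = (1.0 - eps) / S            # per-coordinate probability of any other given symbol
    total = 0.0
    for m in range(d + 1):
        count = math.comb(d, m) * (S - 1) ** (d - m)
        total += count * abs(a ** m * b ** (d - m) - float(S) ** (-d))
    return 0.5 * total

def tensorization_bound(S, d, eps):
    """1 - (1 - (S-1)/S * eps)^d, from the product-measure TV bound."""
    return 1.0 - (1.0 - (S - 1) / S * eps) ** d

def chi_square_bound(S, d, eps):
    """0.5 * sqrt((1 + (S-1) eps^2)^d - 1), from Cauchy-Schwarz."""
    return 0.5 * math.sqrt((1.0 + (S - 1) * eps ** 2) ** d - 1.0)

S, d = 5, 8                # assumed illustrative sizes
eps = math.exp(-3.0)       # eps_T = exp(-beta_bar_T) with beta_bar_T = 3 (assumed)
tv = exact_tv(S, d, eps)
print(tv, tensorization_bound(S, d, eps), chi_square_bound(S, d, eps))
```

Which of the two bounds is tighter depends on the relative size of $d$ and $S$, which is why the lemma keeps the minimum of both.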

Appendix F.2. Proof of Lemma A2

Write $P := \{ i \in [d] : y^i \neq x^i \}$. First, we note that
$$\begin{aligned} \frac{q_t(y)}{q_t(x)} &= \frac{1}{q_t(x)} \sum_{x_0 \in [S]^d} q_0(x_0)\, q_{t|0}(y | x_0) \\ &\overset{(i)}{=} \frac{1}{q_t(x)} \sum_{x_0 \in [S]^d} q_0(x_0) \prod_{i \in [d]} q^i_{t|0}(y^i | x_0^i) \\ &= \frac{1}{q_t(x)} \sum_{x_0 \in [S]^d} q_0(x_0) \prod_{i \in [d]} q^i_{t|0}(x^i | x_0^i) \prod_{i \in P} \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x^i | x_0^i)} \\ &\overset{(ii)}{=} \sum_{x_0 \in [S]^d} \frac{q_0(x_0)\, q_{t|0}(x | x_0)}{q_t(x)} \prod_{i \in P} \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x^i | x_0^i)} \\ &= \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x)} \prod_{i \in P} \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x^i | x_0^i)}, \end{aligned} \tag{A12}$$
where both $(i)$ and $(ii)$ follow because, with the chosen $R_t$ in (2), each dimension propagates independently in the forward process (cf. [11] (Prop. 3)). Recall the forward conditional probability under non-homogeneity in (A11), which yields
$$\frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x^i | x_0^i)} = \begin{cases} 1 & \text{if } x^i \neq x_0^i \text{ and } y^i \neq x_0^i, \\ \dfrac{1 - e^{-\bar\beta_t}}{1 + (S-1) e^{-\bar\beta_t}} & \text{if } x^i = x_0^i \text{ but } y^i \neq x_0^i, \\ \dfrac{1 + (S-1) e^{-\bar\beta_t}}{1 - e^{-\bar\beta_t}} & \text{if } x^i \neq x_0^i \text{ but } y^i = x_0^i. \end{cases} \tag{A13}$$
Now, since $e^{-\bar\beta_t} \ge 0$, the second case satisfies $\frac{1 - e^{-\bar\beta_t}}{1 + (S-1) e^{-\bar\beta_t}} \le 1$. For the third case, we have
$$\frac{1 + (S-1) e^{-\bar\beta_t}}{1 - e^{-\bar\beta_t}} = 1 + S \cdot \frac{1}{e^{\bar\beta_t} - 1} \lesssim \begin{cases} S + 1 & \text{if } e^{\bar\beta_t} \ge 2, \\ S / \bar\beta_t & \text{otherwise} \end{cases} \lesssim S \cdot \max\{1, (\bar\beta_t)^{-1}\}.$$
Since these bounds do not depend on i, the proof is now complete.
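The bound on the third case of the likelihood ratio can be verified on a grid of values. This is a hedged sketch with assumed names and grid points; unlike the displayed bound, the small-noise branch below keeps the additive one explicitly, using $e^{b} - 1 \ge b$.

```python
import math

def ratio_third_case(S, b):
    """(1 + (S-1) e^{-b}) / (1 - e^{-b}) with accumulated noise b = beta_bar_t > 0."""
    return (1.0 + (S - 1) * math.exp(-b)) / (1.0 - math.exp(-b))

def stated_bound(S, b):
    """S + 1 when e^b >= 2, and 1 + S / b otherwise (using e^b - 1 >= b)."""
    return S + 1.0 if math.exp(b) >= 2.0 else 1.0 + S / b

# Assumed illustrative grid of vocabulary sizes and accumulated noise levels.
grid = [(S, b) for S in (2, 10, 1000) for b in (1e-3, 0.1, 0.5, math.log(2.0), 1.0, 5.0)]
worst = max(ratio_third_case(S, b) / (S * max(1.0, 1.0 / b)) for S, b in grid)
print(worst)
```

On this grid the ratio stays within a small constant multiple of $S \max\{1, \bar\beta_t^{-1}\}$, consistent with the order-level statement of Lemma A2.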

Appendix F.3. Proof of Lemma A3

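First, fix $x_t$ and $y$ and let $i$ be the index such that $x_t^i \neq y^i$. From (A12), we have
$$\begin{aligned} \Big| \frac{q_t(y)}{q_t(x_t)} - \frac{q_s(y)}{q_s(x_t)} \Big| &= \Big| \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x_t)} \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x_t^i | x_0^i)} - \mathbb{E}_{\tilde{x}_0 \sim q_{0|s}(\cdot | x_t)} \frac{q^i_{s|0}(y^i | \tilde{x}_0^i)}{q^i_{s|0}(x_t^i | \tilde{x}_0^i)} \Big| \\ &\le \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x_t)} \Big| \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x_t^i | x_0^i)} - \frac{q^i_{s|0}(y^i | x_0^i)}{q^i_{s|0}(x_t^i | x_0^i)} \Big| + \Big| \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x_t),\, \tilde{x}_0 \sim q_{0|s}(\cdot | x_t)} \Big[ \frac{q^i_{s|0}(y^i | x_0^i)}{q^i_{s|0}(x_t^i | x_0^i)} - \frac{q^i_{s|0}(y^i | \tilde{x}_0^i)}{q^i_{s|0}(x_t^i | \tilde{x}_0^i)} \Big] \Big|. \end{aligned} \tag{A14}$$
For the first term in (A14), we note the expression of the likelihood ratio in (A13), and thus for any fixed $x_0$, $x_t$, and $y$,
$$\Big| \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x_t^i | x_0^i)} - \frac{q^i_{s|0}(y^i | x_0^i)}{q^i_{s|0}(x_t^i | x_0^i)} \Big| = \begin{cases} 0 & \text{if } x_t^i \neq x_0^i \text{ and } y^i \neq x_0^i, \\ \Big| \dfrac{1 - e^{-\bar\beta_t}}{1 + (S-1) e^{-\bar\beta_t}} - \dfrac{1 - e^{-\bar\beta_s}}{1 + (S-1) e^{-\bar\beta_s}} \Big| & \text{if } x_t^i = x_0^i \text{ but } y^i \neq x_0^i, \\ \Big| \dfrac{1 + (S-1) e^{-\bar\beta_t}}{1 - e^{-\bar\beta_t}} - \dfrac{1 + (S-1) e^{-\bar\beta_s}}{1 - e^{-\bar\beta_s}} \Big| & \text{if } x_t^i \neq x_0^i \text{ but } y^i = x_0^i. \end{cases}$$
Now, since
$$\Big| \frac{\partial}{\partial t} \frac{1 - e^{-\bar\beta_t}}{1 + (S-1) e^{-\bar\beta_t}} \Big| = \frac{S e^{\bar\beta_t} \beta_t}{(S + e^{\bar\beta_t} - 1)^2} \le \beta_t, \qquad \Big| \frac{\partial}{\partial t} \frac{1 + (S-1) e^{-\bar\beta_t}}{1 - e^{-\bar\beta_t}} \Big| = \frac{S e^{\bar\beta_t} \beta_t}{(e^{\bar\beta_t} - 1)^2} \lesssim S \beta_t \min\{1, \bar\beta_t\}^{-2},$$
we have
$$\Big| \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x_t^i | x_0^i)} - \frac{q^i_{s|0}(y^i | x_0^i)}{q^i_{s|0}(x_t^i | x_0^i)} \Big| \lesssim S \beta_t \min\{1, \bar\beta_t\}^{-2} (t - s).$$
Note that this term does not depend on $d$. Thus,
$$\mathbb{E}_{x_t \sim q_t} \sum_{\substack{y \neq x_t \\ \mathrm{Ham}(y,x_t) = 1}} \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x_t)} \Big| \frac{q^i_{t|0}(y^i | x_0^i)}{q^i_{t|0}(x_t^i | x_0^i)} - \frac{q^i_{s|0}(y^i | x_0^i)}{q^i_{s|0}(x_t^i | x_0^i)} \Big| R_t(y,x_t) \lesssim d S \beta_t^2 \max\{1, \bar\beta_t^{-2}\} (t - s).$$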
Now we turn to the second term in (A14). Write $f(z) := \frac{q^i_{s|0}(y^i | z)}{q^i_{s|0}(x_t^i | z)}$ for brevity (recall that $x_t$ and $y$ are fixed and thus omitted in this expression). Note that from (A13), an upper bound on $f(z)$ is
$$\sup_{y^i, x_t^i, z \in [S]} f(z) = \sup_{y^i, x_t^i, z \in [S]} \frac{q^i_{s|0}(y^i | z)}{q^i_{s|0}(x_t^i | z)} \lesssim S \cdot \max\{1, \bar\beta_s^{-1}\}.$$
Thus, the second term in (A14) can be upper-bounded (for each $y^i$) as
$$\Big| \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x_t),\, \tilde{x}_0 \sim q_{0|s}(\cdot | x_t)} \big[ f(x_0^i) - f(\tilde{x}_0^i) \big] \Big| = \Big| \sum_{x_0 \in [S]^d} f(x_0^i) \big( q_{0|t}(x_0 | x_t) - q_{0|s}(x_0 | x_t) \big) \Big| \lesssim S \max\{1, \bar\beta_s^{-1}\} \sum_{x_0 \in [S]^d} \big| q_{0|t}(x_0 | x_t) - q_{0|s}(x_0 | x_t) \big|.$$
Using Bayes’s rule, we have
$$\begin{aligned} \sum_{x_0 \in [S]^d} \big| q_{0|t}(x_0 | x_t) - q_{0|s}(x_0 | x_t) \big| &= \sum_{x_0 \in [S]^d} q_0(x_0) \Big| \frac{q_{t|0}(x_t | x_0)}{q_t(x_t)} - \frac{q_{s|0}(x_t | x_0)}{q_s(x_t)} \Big| \\ &\le \frac{1}{q_t(x_t) \cdot q_s(x_t)} \sum_{x_0, y_0 \in [S]^d} q_0(x_0)\, q_0(y_0) \cdot \big| q_{t|0}(x_t | x_0)\, q_{s|0}(x_t | y_0) - q_{s|0}(x_t | x_0)\, q_{t|0}(x_t | y_0) \big| \\ &\le \frac{1}{q_t(x_t) \cdot q_s(x_t)} \mathbb{E}_{x_0, y_0 \sim q_0} \Big[ \big| q_{t|0}(x_t | x_0) - q_{s|0}(x_t | x_0) \big|\, q_{s|0}(x_t | y_0) + \big| q_{s|0}(x_t | y_0) - q_{t|0}(x_t | y_0) \big|\, q_{s|0}(x_t | x_0) \Big] \\ &= \frac{1}{q_t(x_t)} \mathbb{E}_{x_0 \sim q_0} \big| q_{t|0}(x_t | x_0) - q_{s|0}(x_t | x_0) \big| + \frac{1}{q_t(x_t)} \mathbb{E}_{y_0 \sim q_0} \big| q_{s|0}(x_t | y_0) - q_{t|0}(x_t | y_0) \big| \\ &= \frac{2}{q_t(x_t)} \mathbb{E}_{x_0 \sim q_0} \big| q_{t|0}(x_t | x_0) - q_{s|0}(x_t | x_0) \big|. \end{aligned}$$
Now, this term (without the constant factor 2) can be upper-bounded as
$$\begin{aligned} \frac{1}{q_t(x_t)} \mathbb{E}_{x_0 \sim q_0} \big| q_{t|0}(x_t | x_0) - q_{s|0}(x_t | x_0) \big| &\lesssim \frac{1}{q_t(x_t)} (t - s)\, \mathbb{E}_{x_0 \sim q_0} \Big| \frac{\partial}{\partial t} q_{t|0}(x_t | x_0) \Big| \\ &\overset{(i)}{=} \frac{1}{q_t(x_t)} (t - s)\, \mathbb{E}_{x_0 \sim q_0} \Big| \sum_{\tilde{x}_t \in [S]^d} q_{t|0}(\tilde{x}_t | x_0)\, R_t(\tilde{x}_t, x_t) \Big| \\ &\le \frac{1}{q_t(x_t)} (t - s)\, \mathbb{E}_{x_0 \sim q_0} \sum_{\tilde{x}_t \in [S]^d} q_{t|0}(\tilde{x}_t | x_0)\, \big| R_t(\tilde{x}_t, x_t) \big| \\ &= (t - s) \sum_{\tilde{x}_t \in [S]^d} \frac{q_t(\tilde{x}_t)}{q_t(x_t)} \big| R_t(\tilde{x}_t, x_t) \big|, \end{aligned} \tag{A17}$$
where $(i)$ follows from the Kolmogorov forward equation. Thus, combining these intermediate results, we have
$$\begin{aligned} &\mathbb{E}_{x_t \sim q_t} \sum_{\substack{y \neq x_t \\ \mathrm{Ham}(y,x_t) = 1}} \Big| \mathbb{E}_{x_0 \sim q_{0|t}(\cdot | x_t),\, \tilde{x}_0 \sim q_{0|s}(\cdot | x_t)} \big[ f(x_0^i) - f(\tilde{x}_0^i) \big] \Big| R_t(y,x_t) \\ &\overset{(ii)}{\lesssim} d S \beta_t \max\{1, \bar\beta_s^{-1}\} \cdot \mathbb{E}_{x_t \sim q_t} \frac{1}{q_t(x_t)} \mathbb{E}_{x_0 \sim q_0} \big| q_{t|0}(x_t | x_0) - q_{s|0}(x_t | x_0) \big| \\ &\overset{(iii)}{\lesssim} (t - s)\, d S \beta_t \max\{1, \bar\beta_s^{-1}\}\, \mathbb{E}_{x_t \sim q_t} \sum_{\tilde{x}_t \in [S]^d} \frac{q_t(\tilde{x}_t)}{q_t(x_t)} \big| R_t(\tilde{x}_t, x_t) \big| \\ &= (t - s)\, d S \beta_t \max\{1, \bar\beta_s^{-1}\}\, \mathbb{E}_{\tilde{x}_t \sim q_t} \sum_{x_t \in [S]^d} \big| R_t(\tilde{x}_t, x_t) \big| \\ &\overset{(iv)}{\lesssim} (t - s)\, d^2 S \beta_t^2 \max\{1, \bar\beta_s^{-1}\}, \end{aligned}$$
where $(ii)$ follows from (A15) and (A16), $(iii)$ follows from (A17), and $(iv)$ follows because $-R_t(x,x) = \sum_{y \neq x} R_t(x,y) = \beta_t \frac{S-1}{S} d$ for all $x, y \in [S]^d$. The proof is now complete.

Appendix F.4. Proof of Lemma A4

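Let $\Pi(q_0, q_\delta)$ denote the set of all joint probability measures with marginal distributions $q_0$ and $q_\delta$. Then,
$$\mathrm{TV}(q_0, q_\delta) = \min_{\pi \in \Pi(q_0, q_\delta)} \mathbb{E}_{(x_0, x_\delta) \sim \pi} \mathbb{1}\{x_\delta \neq x_0\} \le \mathbb{E}_{x_0 \sim q_0} Q_{\delta|0}\big( x_\delta \neq x_0 \big) \overset{(i)}{\le} \sum_{i \in [d]} \sum_{a \neq x_0^i} \mathbb{E}_{x_0 \sim q_0} Q^i_{\delta|0}\big( a \mid x_0^i \big) \overset{(ii)}{=} d\, \frac{S-1}{S} \big( 1 - e^{-\bar\beta_\delta} \big) \le d \bar\beta_\delta,$$
where $(i)$ follows from the union bound, and $(ii)$ follows from (A11). The proof is now complete.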
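The chain of inequalities in this proof can be checked in the special case where $q_0$ is a point mass, for which the total variation distance has a closed form. The sketch below rests on that simplifying assumption (the lemma itself covers general $q_0$); the function names and the grid of parameter values are illustrative.

```python
import math

def exact_tv_point_mass(S, d, b):
    """TV(q_0, q_delta) when q_0 is a point mass at x0: 1 - P(no coordinate changed)."""
    stay = (1.0 + (S - 1) * math.exp(-b)) / S  # per-coordinate probability of staying put
    return 1.0 - stay ** d

def union_bound(S, d, b):
    """d * (S-1)/S * (1 - e^{-b}), the union bound over coordinates."""
    return d * (S - 1) / S * (1.0 - math.exp(-b))

# Assumed illustrative grid; b plays the role of beta_bar_delta.
cases = [(S, d, b) for S in (2, 50) for d in (1, 16, 256) for b in (1e-4, 1e-2, 0.5)]
for S, d, b in cases:
    print(S, d, b, exact_tv_point_mass(S, d, b), union_bound(S, d, b), d * b)
```

For small $\bar\beta_\delta$ the three quantities are close, which indicates that the linear dependence on $d \bar\beta_\delta$ is the right order in this special case.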

Appendix G. Proof of Lemma A5

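Recall the recursion of $f_k$:
$$f_{k+1} = f_k\, e^{-\lambda \kappa f_k}.$$
Equivalently, we have
$$\frac{f_k - f_{k+1}}{1 - e^{-\lambda \kappa}} = \frac{f_k \big( 1 - e^{-\lambda \kappa f_k} \big)}{1 - e^{-\lambda \kappa}}.$$
With this, since $f_k > 0$, it is equivalent to prove that
$$f_k \le \frac{1 - e^{-\lambda \kappa f_k}}{1 - e^{-\lambda \kappa}} \iff 1 - e^{-\lambda \kappa} \le \frac{1 - e^{-\lambda \kappa f_k}}{f_k}.$$
Now, since the function $g(z) = \frac{1 - e^{-a z}}{z}$ is decreasing in $z > 0$ for $a > 0$, we get the latter inequality by noting that $f_k \le 1$ for $k > k^*$.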
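The inequality of Lemma A5 can be checked along the recursion itself. The following is a minimal sketch with assumed parameter values (`lam`, `kappa`, the starting point `f0 = 1`, and the number of iterations are illustrative choices, not values from the paper).

```python
import math

def check_lemma_a5(lam, kappa, f0, n_steps):
    """Iterate f_{k+1} = f_k * exp(-lam * kappa * f_k) from f0 <= 1 and return the largest
    value of f_k^2 - (f_k - f_{k+1}) / (1 - exp(-lam * kappa)) seen along the way."""
    worst = -math.inf
    f = f0
    denom = 1.0 - math.exp(-lam * kappa)
    for _ in range(n_steps):
        f_next = f * math.exp(-lam * kappa * f)
        worst = max(worst, f * f - (f - f_next) / denom)
        f = f_next
    return worst

worst_gap = check_lemma_a5(lam=math.log(100.0), kappa=0.01, f0=1.0, n_steps=200)
print(worst_gap)
```

A non-positive return value (up to floating-point rounding) means the inequality held at every iterate; equality is attained at $f_k = 1$.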

References

  1. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 2256–2265.
  2. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Proc. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
  3. Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; van den Berg, R. Structured Denoising Diffusion Models in Discrete State-Spaces. In Proceedings of the Advances in Neural Information Processing Systems; Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021.
  4. Dhariwal, P.; Nichol, A.Q. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021.
  5. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
  6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 10684–10695.
  7. Huang, R.; Huang, J.; Yang, D.; Ren, Y.; Liu, L.; Li, M.; Ye, Z.; Liu, J.; Yin, X.; Zhao, Z. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023.
  8. Lou, A.; Meng, C.; Ermon, S. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Volume 235, pp. 32819–32848.
  9. Liu, C.; Fan, W.; Liu, Y.; Li, J.; Li, H.; Liu, H.; Tang, J.; Li, Q. Generative diffusion models on graphs: Methods and applications. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023.
  10. Alakhdar, A.; Poczos, B.; Washburn, N. Diffusion Models in De Novo Drug Design. J. Chem. Inf. Model. 2024, 64, 7238–7256.
  11. Campbell, A.; Benton, J.; Bortoli, V.D.; Rainforth, T.; Deligiannidis, G.; Doucet, A. A Continuous Time Framework for Discrete Denoising Models. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022.
  12. Chen, H.; Ying, L. Convergence Analysis of Discrete Diffusion Model: Exact Implementation through Uniformization. arXiv 2024, arXiv:2402.08095.
  13. Zhang, Z.; Chen, Z.; Gu, Q. Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  14. Pham, L.T.N.; Shariatian, D.; Ocello, A.; Conforti, G.; Durmus, A.O. Discrete Markov Probabilistic Models: An Improved Discrete Score-Based Framework with sharp convergence bounds under minimal assumptions. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
  15. Dmitriev, D.; Huang, Z.; Wei, Y. Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees. arXiv 2026, arXiv:2602.15008.
  16. Ren, Y.; Chen, H.; Rotskoff, G.M.; Ying, L. How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  17. Liang, Y.; Liang, Y.; Lai, L.; Shroff, N. Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025.
  18. Liang, Y.; Tan, Z.; Shroff, N.; Liang, Y. Sharp Convergence Rates for Masked Diffusion Models. arXiv 2026, arXiv:2602.22505.
  19. Chen, T. On the Importance of Noise Scheduling for Diffusion Models. arXiv 2023, arXiv:2301.10972.
  20. Conforti, G.; Durmus, A.; Pham, L.T.N.; Raoul, G. Non-Asymptotic Convergence of Discrete Diffusion Models: Masked and Random Walk dynamics. arXiv 2025, arXiv:2512.00580.
  21. Ren, Y.; Chen, H.; Zhu, Y.; Guo, W.; Chen, Y.; Rotskoff, G.M.; Tao, M.; Ying, L. Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms. In Proceedings of the Advances in Neural Information Processing Systems 38 (NeurIPS 2025); Curran Associates, Inc.: Red Hook, NY, USA, 2025.
  22. Liang, Y.; Shroff, N.; Liang, Y. From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models. arXiv 2026, arXiv:2605.27352.
  23. Nisonoff, H.; Xiong, J.; Allenspach, S.; Listgarten, J. Unlocking Guidance for Discrete State-Space Diffusion and Flow Models. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  24. Kan, K.; Li, X.; Zhang, B.J.; Sahai, T.; Osher, S.; Katsoulakis, M.A. Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space. arXiv 2026, arXiv:2605.17232.
  25. Liang, Y.; Huang, R.; Lai, L.; Shroff, N.; Liang, Y. Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2025.
  26. Huang, X.; Lin, Y.; Jain, N.; Wang, K.; Zou, D.; Ma, Y.; Zhang, T. On the Complexity Theory of Masked Discrete Diffusion: From poly(1/ϵ) to Nearly ϵ-Free. arXiv 2025, arXiv:2509.21835.
  27. Kelly, F.P. Reversibility and Stochastic Networks; Cambridge University Press: Cambridge, UK, 2011.
Table 1. Summary of results for regular uniform-rate discrete diffusion samplers in terms of the number of steps needed to achieve $\varepsilon$-accuracy in $\mathrm{TV}(q_\delta, p_{T-\delta})$. Only the best result for each sampler is shown. All log-dependencies are omitted. In the table, $d$ is the dimension, $S$ is the vocabulary size, and $M$ is an upper bound of the score estimates. In this work, we provide the first analysis under a non-homogeneous noise schedule and $L^1$ estimation error (i.e., in total-variation metric), with a convergence rate comparable to the state-of-the-art results.
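| Sampler | Assumption | Time-Homogeneous? | Number of Steps | Reference |
|---|---|---|---|---|
| Kolmogorov | Score entropy, bounded score | Yes | $O(dMS/\varepsilon)$ | [13] |
| DMPM | Score entropy | Yes | $O(dS^2/\varepsilon^2)$ | [14,20] |
| $\tau$-leaping | L error | Yes | $O(d^4S^4/\varepsilon)$ | [11] |
| $\tau$-leaping | Score entropy, bounded score | Yes | $O(d^2S^2/\varepsilon)$ | [16,17] |
| $\tau$-leaping | Score entropy | Yes | $O(d/\varepsilon)$ | [15] |
| Euler method, Tweedie $\tau$-leaping | Score entropy, bounded score | Yes | $O(d^2S/\varepsilon)$ | [17] |
| Euler method, Tweedie $\tau$-leaping | $L^1$ error, slow-varying noise | No | $O(d^2S/\varepsilon)$ | Theorem 2 |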

Share and Cite

MDPI and ACS Style

Liang, Y.; Lai, L.; Shroff, N.; Liang, Y. Convergence Guarantees for Time-Inhomogeneous Uniform-Rate Discrete Diffusion Models. Entropy 2026, 28, 675. https://doi.org/10.3390/e28060675
