Convergence Guarantees for Time-Inhomogeneous Uniform-Rate Discrete Diffusion Models

Liang, Yuchen; Lai, Lifeng; Shroff, Ness; Liang, Yingbin

doi:10.3390/e28060675

by Yuchen Liang ¹,*, Lifeng Lai ², Ness Shroff ³ and Yingbin Liang ³

¹ Luddy School of Informatics, Computing, and Engineering, Indiana University Indianapolis, Indianapolis, IN 46202, USA

² Department of Electrical and Computer Engineering, UC Davis, Davis, CA 95616, USA

³ Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA

* Author to whom correspondence should be addressed.

Entropy 2026, 28(6), 675; https://doi.org/10.3390/e28060675 (registering DOI)

Submission received: 16 May 2026 / Revised: 3 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Foundations and Frontiers of Information Theory—Dedicated to Professor H. Vincent Poor on the Occasion of His 75th Birthday)


Abstract

Discrete diffusion models have become an important class of generative models for categorical data, yet their theoretical understanding remains largely limited to time-homogeneous noise schedules. In this work, we study uniform-rate discrete diffusion models with time-inhomogeneous continuous-time Markov chain forward processes. We establish convergence guarantees for practical reverse-time samplers by directly controlling the total variation distance, avoiding the indirect route of first bounding KL divergence and then applying Pinsker’s inequality. Our analysis decomposes the sampling error into initialization, score-estimation, discretization, and early-stopping errors, and explicitly characterizes how each term depends on the accumulated noise, the local noise rate, and the smoothness of the noise schedule. Under suitable regularity conditions on the noise schedule, we further derive step-complexity guarantees that match the order of existing results for homogeneous samplers.

Keywords:

discrete diffusion models; continuous-time Markov chains; time-inhomogeneity; convergence analysis

1. Introduction

Generative modeling aims to produce samples that approximate the training data distribution. Diffusion models [1,2,3] have become a leading approach, with strong performance in image, video, audio, and text generation [4,5,6,7,8]. They generate data by learning to reverse a forward noising process; for discrete data, this process is naturally modeled as a continuous-time Markov chain (CTMC) on a finite state space.

A growing body of work suggests that, for discrete data such as natural language and graphs, discrete diffusion models offer greater advantages and more flexibility than their continuous counterparts [9,10]. In parallel, a growing line of work studies the theoretical convergence of discrete diffusion samplers. For uniform-rate discrete diffusion models, existing guarantees have primarily been developed for time-homogeneous CTMCs. Early work analyzed τ-leaping under total variation distance [11], while subsequent studies established convergence guarantees through score-entropy control for uniformization, exact-step, and τ-leaping-type samplers [12,13,14,15,16]. More recently, sharper analyses have been obtained for practical samplers such as the Euler method and Tweedie τ-leaping [17,18]. However, these results largely focus on homogeneous noise schedules, leaving open the question of how non-constant noise schedules affect the sampling error.

This limitation is important because time-inhomogeneous schedules are widely used in practice and can substantially affect generative performance. For continuous diffusion models, it is shown empirically in [19] that noise scheduling is a crucial design choice: the optimal schedule can depend on the task, and higher-resolution generation may benefit from schedules that place more mass on noisier regimes. These observations suggest that the noise schedule is not merely a technical detail but rather a central component controlling the trade-off between data corruption, score estimation, and reverse-time discretization. Motivated by this, we develop a convergence analysis for non-homogeneous uniform-rate discrete diffusion models. Our results quantify how the accumulated noise, the local rate $β_{t}$, and the smoothness of the schedule jointly determine initialization, discretization, and early-stopping errors.

2. Related Work

Theoretical studies of uniform-rate discrete diffusion models have so far largely considered time-homogeneous CTMCs; a summary is provided in Table 1. An early result [11] analyzed τ-leaping in total variation distance but required strong assumptions on the estimated reverse rates and led to suboptimal parameter dependence. Subsequent works focused on score-entropy-based guarantees for random-step samplers: ref. [12] studied a uniformization sampler on the hypercube, and ref. [16] extended the analysis to general product spaces ${[S]}^{d}$.

Other recent works analyze deterministic-step samplers. Some assume an exact per-step solver [13] or a specially designed sampler [14,20], while others study existing τ-leaping-type samplers used in practice [15,16,17]. Beyond τ-leaping, ref. [17] developed a sharp analysis for the Euler method and Tweedie τ-leaping [8], avoiding a Girsanov change-of-measure argument. Besides these standard samplers, refs. [21,22] investigated possible accelerations. Most of these results are stated in KL divergence and then converted to total variation distance, which can introduce looseness. Moreover, these works focus on time-homogeneous noise schedules, which are almost never used in practice. In contrast, our work directly analyzes total variation error for time-inhomogeneous uniform-rate processes, thereby capturing the effect of non-constant schedules such as the geometric schedule used in empirical works (e.g., [8,23]).

A concurrent work [24] also considers the time-inhomogeneous process and derives an S-independent upper bound on the convergence error. Unlike ours, however, their analysis assumes exact simulation of the continuous-time sampling process, so the step complexity under discretization remains unclear.

A parallel line of work studies masked, or absorbing-rate, discrete diffusion models. Convergence guarantees have been established in [15,18,20,25,26].

Our Contributions

Our main contributions are summarized as follows:

We establish convergence guarantees for discrete diffusion models with time-inhomogeneous uniform-rate generators. This extends prior analyses, which primarily focus on homogeneous noise schedules.
We identify regularity conditions on the noise schedule under which explicit convergence rates can be obtained. Under these conditions, the resulting rates match state-of-the-art guarantees for homogeneous discrete diffusion samplers.

3. Preliminaries of Discrete Diffusion Samplers

In this section, we provide the background of the continuous-time discrete-space diffusion sampler.

3.1. Continuous-Time Forward Dynamics on Discrete State Spaces

We consider discrete data represented as a length-d sequence

x_{0} = (x_{0}^{1}, \dots, x_{0}^{d}) \in {[S]}^{d},

where each coordinate $x_{0}^{i}$ takes values in a finite alphabet $[S]$ of size S. Let $q_{0}^{i}$ denote the marginal probability mass function of the i-th token, and let $q_{0}$ denote the joint distribution of the complete data vector $x_{0}$.

The forward noising process is modeled as a time-inhomogeneous continuous-time Markov chain (CTMC) on the finite state space ${[S]}^{d}$. This process is specified by the initial law $q_{0}$ and a time-dependent transition-rate matrix $R_{t} \in \mathbb{R}^{S^{d} \times S^{d}}$. Under the convention that $R_{t} (x, y)$ denotes the instantaneous rate of jumping from state x to state y, the infinitesimal transition probability satisfies, for sufficiently small $Δ t$,

q_{t + Δ t ∣ t} (y ∣ x) = 1 \{y = x\} + R_{t} (x, y) Δ t + o (Δ t), x, y \in {[S]}^{d} .

(1)

Equivalently, the marginal law

q_{t}

evolves according to the Kolmogorov forward equation

\frac{d}{d t} q_{t} (y) = \sum_{x \in {[S]}^{d}} q_{t} (x) R_{t} (x, y), y \in {[S]}^{d} .

For

R_{t}

to define a valid CTMC generator, it must satisfy the standard conditions

R_{t} (x, y) \geq 0 for x \neq y, R_{t} (x, x) \leq 0, \sum_{y \in {[S]}^{d}} R_{t} (x, y) = 0 .

Directly specifying a generator over the full space ${[S]}^{d}$ is generally infeasible when either S or d is large. A common simplification is therefore to impose coordinate-wise independent corruption so that only one token changes at any infinitesimal transition [8,11]. In this case, for $x \neq y$, the generator takes the form [11] (Prop. 3)

R_{t} (x, y) = \begin{cases} R_{t}^{tok} (x^{i}, y^{i}), & \text{if } Ham (x, y) = 1 \text{ and } x^{i} \neq y^{i}, \\ 0, & \text{if } Ham (x, y) > 1, \end{cases}

(2)

where $R_{t}^{tok} \in \mathbb{R}^{S \times S}$ is the token-level transition-rate matrix and $Ham (x, y)$ denotes the Hamming distance between x and y. The diagonal entries of $R_{t}$ are then determined by the zero-row-sum constraint.

Following [11], we parameterize the token-level generator as

R_{t}^{tok} = β_{t} R_{base},

where $β_{t} > 0$ is a noise schedule and $R_{base}$ is a fixed base generator. Prior work often studies the homogeneous case $β_{t} \equiv 1$; see, for example, refs. [13,16]. We define the accumulated noise level by

{\bar{β}}_{t} : = \int_{0}^{t} β_{s} d s < \infty .

Since $β_{t} > 0$, the map $t \mapsto {\bar{β}}_{t}$ is strictly increasing. We denote its inverse by ${\bar{β}}^{- 1}$, so that

{\bar{β}}^{- 1} ({\bar{β}}_{t}) = t .

Throughout, we assume that $β_{t}$ is smooth on $[0, T]$.

The coordinate-wise construction yields a tractable closed-form transition kernel. In particular, the token-level transition matrix from time 0 to time t is

Q_{t ∣ 0}^{tok} = exp ({\bar{β}}_{t} R_{base}),

and the full transition kernel factorizes across coordinates as

q_{t ∣ 0} (x_{t} ∣ x_{0}) = \prod_{i = 1}^{d} Q_{t ∣ 0}^{tok} (x_{0}^{i}, x_{t}^{i}) .

This closed form is especially useful for training discrete diffusion models.

In this work, we focus on the commonly used uniform-rate discrete diffusion model, with

R_{base} : = \frac{1}{S} 1_{S} 1_{S}^{⊺} - I_{S},

which appears in several prior works [3,8,11]. For this choice, each token is driven toward the uniform distribution over $[S]$. Moreover, the diagonal entry of the full-state generator is

R_{t} (x, x) = - \sum_{y : y \neq x} R_{t} (x, y) = - β_{t} \frac{S - 1}{S} d, x \in {[S]}^{d} .

With such an $R_{t}$, as T becomes large, the terminal distribution induced by this forward process approaches the uniform distribution on ${[S]}^{d}$.
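As a minimal numerical sketch of the closed-form kernel in this subsection (not taken from the paper's code), the following NumPy snippet builds the uniform-rate base generator, evaluates $exp ({\bar{β}}_{t} R_{base})$ in closed form, and corrupts a token sequence. It assumes that tokens are encoded as integers 0, ..., S-1 rather than as elements of $[S]$; the names `uniform_base_generator`, `token_kernel`, `expm_series`, and `corrupt` are hypothetical helpers introduced only for illustration. The closed form holds because $\frac{1}{S} 1_{S} 1_{S}^{⊺}$ is a projection, so the exponential reduces to $e^{- {\bar{β}}_{t}} I_{S} + (1 - e^{- {\bar{β}}_{t}}) \frac{1}{S} 1_{S} 1_{S}^{⊺}$.

```python
import numpy as np


def uniform_base_generator(S):
    """R_base = (1/S) 1 1^T - I for an alphabet of size S."""
    return np.ones((S, S)) / S - np.eye(S)


def token_kernel(beta_bar, S):
    """Closed form of exp(beta_bar * R_base) for the uniform-rate base."""
    decay = np.exp(-beta_bar)
    return decay * np.eye(S) + (1.0 - decay) * np.ones((S, S)) / S


def expm_series(A, terms=60):
    """Truncated Taylor series of the matrix exponential (used only as a check)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        result = result + term
    return result


def corrupt(x0, beta_bar, S, rng):
    """Sample x_t given x_0 token by token, using the factorized kernel.

    Each token is kept with probability exp(-beta_bar) and is otherwise
    replaced by a fresh uniform draw over all S symbols.
    """
    resample = rng.random(x0.shape) < 1.0 - np.exp(-beta_bar)
    fresh = rng.integers(0, S, size=x0.shape)
    return np.where(resample, fresh, x0)
```

Only the accumulated noise level enters `token_kernel`, so the same function covers any schedule once ${\bar{β}}_{t}$ has been computed.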

3.2. Reverse Dynamics and Discrete-Time Sampling

The CTMC forward process induces a time-reversed CTMC whose marginals coincide with those of the forward chain run backward in time [27]. More precisely, by [11] (Prop. 1), the reverse process starts from

{\overset{\leftarrow}{q}}_{0} : = q_{T}

and evolves with transition-rate matrix

{\overset{\leftarrow}{R}}_{t}

given, for

x \neq y

, by

{\overset{\leftarrow}{R}}_{t} (x, y) = R_{T - t} (y, x) \frac{q_{T - t} (y)}{q_{T - t} (x)}, x, y \in {[S]}^{d},

(3)

with diagonal entries determined by

{\overset{\leftarrow}{R}}_{t} (x, x) = - \sum_{y : y \neq x} {\overset{\leftarrow}{R}}_{t} (x, y) .

With this setup, the marginal distribution of this reverse chain satisfies

{\overset{\leftarrow}{q}}_{t} = q_{T - t} .
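This identity can be checked numerically in a toy setting. The sketch below assumes a single token (d = 1), the uniform-rate base generator of Section 3.1, and tokens indexed 0, ..., S-1; `forward_marginal` and `reverse_generator` are hypothetical helper names. It builds the reverse rates from (3) and verifies that they reproduce the time derivative of the reversed marginal, i.e., that $\sum_{x} q_{s} (x) {\overset{\leftarrow}{R}} (x, y) = - \frac{d}{d s} q_{s} (y)$ at forward time $s = T - t$.

```python
import numpy as np


def forward_marginal(q0, beta_bar):
    """Single-token marginal q_s under the uniform-rate forward chain."""
    S = q0.size
    decay = np.exp(-beta_bar)
    return decay * q0 + (1.0 - decay) / S


def reverse_generator(q_s, beta_s):
    """Exact reverse rates from (3) for one token at forward time s.

    Entry (x, y) equals R_s(y, x) * q_s(y) / q_s(x) for x != y, where
    R_s(y, x) = beta_s / S under the uniform-rate base generator.
    """
    S = q_s.size
    R_rev = (beta_s / S) * q_s[None, :] / q_s[:, None]
    np.fill_diagonal(R_rev, 0.0)
    # Diagonal entries enforce zero row sums.
    np.fill_diagonal(R_rev, -R_rev.sum(axis=1))
    return R_rev


q0 = np.array([0.7, 0.2, 0.1])
beta_s, beta_bar_s = 1.0, 0.7          # homogeneous schedule, forward time s = 0.7
q_s = forward_marginal(q0, beta_bar_s)
R_rev = reverse_generator(q_s, beta_s)

# Forward dynamics give d/ds q_s = beta_s * (uniform - q_s); the reverse chain
# must produce the negative of this drift.
drift_reverse = q_s @ R_rev
drift_forward = beta_s * (1.0 / q0.size - q_s)
```

Here `drift_reverse` and `-drift_forward` agree up to floating-point error, which is the d = 1 instance of the marginal-matching property.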

Under the coordinate-wise forward generator in (2), the reverse generator inherits the same sparsity pattern: if

Ham (x, y) \geq 2

, then

{\overset{\leftarrow}{R}}_{t} (x, y) = 0 .

Thus, both the forward and reverse processes only allow infinitesimal transitions that modify a single token. In practice, the reverse process is stopped at time

T - δ

for a small

δ > 0

. This early-stopping convention avoids the numerical instability that may arise as the forward time approaches

0^{+}

, where the corresponding score ratios can become irregular (as shown in [17] (Lemma 2)).

The exact reverse CTMC is not directly implementable, and the sampling procedure therefore introduces several approximations. Let

p_{t}

denote the marginal distribution of the practical sampler at reverse time

t \in [0, T - δ]

. First, the exact initial law

q_{T}

is typically unavailable. Since the forward process with the uniformizing base generator converges to the uniform distribution, the sampler is initialized as

p_{0} : = Uniform ({[S]}^{d}) .

Second, the likelihood ratio appearing in (3) is unknown. For reverse time t, the required ratio

\frac{q_{T - t} (y)}{q_{T - t} (x)}

is approximated by a learned concrete score function

s_{T - t} (y, x) \approx \frac{q_{T - t} (y)}{q_{T - t} (x)} .

Consequently, on a discrete sampling grid, the reverse generator is replaced by an estimated generator.

Beyond estimation error, we also account for the error introduced by approximate sampling procedures. Let

0 = t_{0} < t_{1} < \dots < t_{N} = T - δ

be the reverse-time discretization grid. At grid point

t_{k}

, the off-diagonal entries of the estimated reverse generator are

{\hat{R}}_{t_{k}} (x, y) : = R_{T - t_{k}} (y, x) s_{T - t_{k}} (y, x), x \neq y,

(4)

with

{\hat{R}}_{t_{k}} (x, x) = - \sum_{y : y \neq x} {\hat{R}}_{t_{k}} (x, y) .

This approximation replaces the exact reverse rate in (3) by a score-based estimate.

Following [8,23], one example is to use the Euler method that freezes the estimated generator over each interval

[t_{k}, t_{k + 1})

. Since the generator only permits single-token transitions, the update can be written in a token-wise form. Given the current state

x_{t_{k}}

, define, for each coordinate

i \in [d]

and candidate token

a \in [S]

with

a \neq x_{t_{k}}^{i}

,

{\hat{R}}_{k}^{i} (x_{t_{k}}^{i}, a) : = {\hat{R}}_{t_{k}} (x_{t_{k}}, x_{t_{k}}^{- i} \oplus_{i} a),

(5)

where

x_{t_{k}}^{- i} \oplus_{i} a

denotes the vector obtained from

x_{t_{k}}

by replacing its i-th token with a. The corresponding diagonal token rate is

{\hat{R}}_{k}^{i} (x_{t_{k}}^{i}, x_{t_{k}}^{i}) : = - \sum_{a : a \neq x_{t_{k}}^{i}} {\hat{R}}_{k}^{i} (x_{t_{k}}^{i}, a) .

The Euler transition for the i-th token is then

P (x_{t_{k + 1}}^{i} = a ∣ x_{t_{k}}) = \begin{cases} (t_{k + 1} - t_{k}) {\hat{R}}_{k}^{i} (x_{t_{k}}^{i}, a), & a \neq x_{t_{k}}^{i}, \\ 1 + (t_{k + 1} - t_{k}) {\hat{R}}_{k}^{i} (x_{t_{k}}^{i}, x_{t_{k}}^{i}), & a = x_{t_{k}}^{i} . \end{cases}

(6)

Equivalently, each token either remains unchanged or jumps to another symbol according to the local estimated reverse rates. The step size is typically chosen small enough that the probabilities in (6) remain nonnegative. The final output of the sampler is the state at reverse time $T - δ$, whose law is intended to approximate the early-stopped data distribution $q_{δ}$.
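The token-wise Euler update in (4)-(6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in [8,23]: tokens are encoded as 0, ..., S-1, the base generator is the uniform-rate one (so $R_{T - t_{k}} (y, x) = β_{T - t_{k}} / S$ for Hamming-one neighbors), and the array `ratios` stands in for the output of a learned score network, with `ratios[i, a]` approximating $q_{T - t_{k}} (x^{- i} \oplus_{i} a) / q_{T - t_{k}} (x)$. The function names are hypothetical.

```python
import numpy as np


def euler_token_probs(x, ratios, beta, dt):
    """Transition probabilities (6) for every token, as an array of shape (d, S).

    x      : integer array of shape (d,), current state x_{t_k}
    ratios : float array of shape (d, S), estimated likelihood ratios
    beta   : noise rate beta_{T - t_k}
    dt     : step size t_{k+1} - t_k
    """
    d, S = ratios.shape
    rows = np.arange(d)
    rates = (beta / S) * ratios      # estimated reverse rates, cf. (4)-(5)
    rates[rows, x] = 0.0             # the current token carries no jump rate
    probs = dt * rates
    probs[rows, x] = 1.0 - probs.sum(axis=1)   # probability of staying put
    if probs.min() < 0.0:
        raise ValueError("step size too large: Euler probabilities are negative")
    return probs


def euler_step(x, ratios, beta, dt, rng):
    """Draw x_{t_{k+1}} by sampling every token independently from (6)."""
    probs = euler_token_probs(x, ratios, beta, dt)
    cdf = probs.cumsum(axis=1)
    u = rng.random(size=(x.size, 1))
    # Inverse-CDF sampling; the clip guards against round-off in the last bin.
    return (u > cdf).sum(axis=1).clip(max=probs.shape[1] - 1)
```

The explicit `ValueError` makes the validity requirement on the step size visible: if `dt` times the total outgoing rate of any token exceeds one, the update is no longer a probability distribution.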

3.3. Notations

Let $x^{i}$ $(1 \leq i \leq d)$ denote the i-th element of a vector $x \in {[S]}^{d}$, and let $x^{- i} \in {[S]}^{d - 1}$ denote x with its i-th element removed. Define $Ham (x, y)$ as the Hamming distance between two vectors x and y. For a positive integer n, $[n] : = \{1, \dots, n\}$. Write $1_{S}$ for the all-ones vector of length S, and $I_{S}$ for the identity matrix of size $S \times S$.

4. Convergence Under General Non-Homogeneous Noise Schedule

In this section, we establish improved upper error bounds and convergence rates for the uniform-rate matrix. Following the approach of [18], our results are stated directly in terms of total variation distance rather than first deriving a KL-divergence bound and then applying Pinsker’s inequality. Our analysis relies on the following estimation-error condition.

Assumption 1.

The score estimation error satisfies $L_{T V} (s) \leq \sqrt{ε_{TV}}$, where

L_{T V} (s) : = \sum_{k = 0}^{N - 1} (t_{k + 1} - t_{k}) E_{x_{k} \sim q_{T - t_{k}}} [\sum_{y : y \neq x_{k}} R_{T - t_{k}} (y, x_{k}) | s_{T - t_{k}} (y, x_{k}) - \frac{q_{T - t_{k}} (y)}{q_{T - t_{k}} (x_{k})} |] .

Assumption 1 requires the learned score to be accurate in $L_{1}$ distance, measured as the expected sum of absolute differences between the exact and estimated rate matrices. This condition differs from the commonly used score-entropy loss [16,17] but is particularly convenient for our direct total-variation-based analysis. For masked diffusion models, ref. [18] shows that $L_{T V}$ can be upper-bounded by the score-entropy loss. We provide a similar justification in Appendix A.

With this condition in place, the following theorem provides an upper bound on the sampling error in total variation distance.

Theorem 1.

Suppose that Assumption 1 holds. As long as the step sizes $t_{k + 1} - t_{k}$ vanish as ε becomes small and each Euler step is valid, we have

\begin{aligned} TV (q_{0}, p_{T - δ}) ≲ {} & min \{d, \sqrt{d S}\} e^{- {\bar{β}}_{T}} + \sqrt{ε_{TV}} \\ & + d S \sum_{k = 0}^{N - 1} max \{1, {\bar{β}}_{T - t_{k + 1}}^{- 2}\} (| β_{T - t_{k}}^{'} | + d β_{T - t_{k}}^{2}) {(t_{k + 1} - t_{k})}^{2} + d {\bar{β}}_{δ} . \end{aligned}

Theorem 1 decomposes the total sampling error into four distinct contributions. The first term, $min \{d, \sqrt{d S}\} e^{- {\bar{β}}_{T}}$, corresponds to the initialization error, which quantifies the discrepancy between the true terminal distribution $q_{T}$ and the uniform distribution used to initialize the sampler. This error decays exponentially fast with the accumulated noise level ${\bar{β}}_{T}$. The second term, $\sqrt{ε_{TV}}$, captures the error due to imperfect score estimation under Assumption 1. The third term represents the discretization error incurred by replacing the continuous-time reverse CTMC with a discrete-time approximation on the grid $\{t_{k}\}_{k = 0}^{N}$. Finally, the last term, $d {\bar{β}}_{δ}$, is the early-stopping error, which arises because the reverse process is terminated at time $T - δ$ rather than being run all the way to the data distribution.
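To get a feel for how the four terms trade off, the right-hand side of Theorem 1 can be evaluated numerically for a given schedule and grid. The sketch below is illustrative only: the constants hidden by ≲ are set to one (an assumption, so the numbers indicate scaling rather than actual error levels), and `tv_bound_terms` together with its callable arguments `beta`, `beta_prime`, and `beta_bar` are hypothetical names for $β_{t}$, $β_{t}^{'}$, and ${\bar{β}}_{t}$.

```python
import numpy as np


def tv_bound_terms(d, S, grid, T, delta, beta, beta_prime, beta_bar, eps_tv):
    """Evaluate the four terms of the Theorem 1 bound with all constants set to 1.

    grid is the reverse-time grid 0 = t_0 < ... < t_N = T - delta.
    """
    grid = np.asarray(grid, dtype=float)
    t_k, t_next = grid[:-1], grid[1:]
    steps = t_next - t_k

    init = min(d, np.sqrt(d * S)) * np.exp(-beta_bar(T))
    score = np.sqrt(eps_tv)
    weight = np.maximum(1.0, beta_bar(T - t_next) ** -2.0)
    local = np.abs(beta_prime(T - t_k)) + d * beta(T - t_k) ** 2
    disc = d * S * np.sum(weight * local * steps ** 2)
    early = d * beta_bar(delta)

    return {
        "initialization": init,
        "score": score,
        "discretization": disc,
        "early_stopping": early,
    }


# Example: homogeneous schedule beta_t = 1 on a uniform grid.
T, delta, d, S = 3.0, 0.1, 2, 3
grid = np.linspace(0.0, T - delta, 11)
terms = tv_bound_terms(
    d, S, grid, T, delta,
    beta=lambda s: np.ones_like(s),
    beta_prime=lambda s: np.zeros_like(s),
    beta_bar=lambda s: s,
    eps_tv=1e-4,
)
```

With the homogeneous schedule the discretization entry reduces to the $d^{2} S$ sum that appears in the time-homogeneous special case.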

Remark 1

(On the time steps $t_{k + 1} - t_{k}$). By (6), an Euler step is valid only if the probability of remaining at the current token is nonnegative, that is,

(t_{k + 1} - t_{k}) | {\hat{R}}_{k}^{i} (x^{i}, x^{i}) | = (t_{k + 1} - t_{k}) \sum_{a : a \neq x^{i}} {\hat{R}}_{k}^{i} (x^{i}, a) \leq 1 .

Assuming that the score functions are accurately estimated, from Lemma A2 we have $| {\hat{R}}_{k}^{i} (x^{i}, x^{i}) | ≲ S$, which leads to the requirement $t_{k + 1} - t_{k} = o (1 / S)$. Note that this condition is satisfied for the step size in Theorem 2.

Remark 2

(On the early-stopping parameter δ). For time-homogeneous models, to achieve $O (\sqrt{ε})$ error, δ can simply be set to $\sqrt{ε} / d$ (see Corollary 1). For non-homogeneous schedules, however, the dependence on d becomes implicit: one should set $δ = {\bar{β}}^{- 1} (\sqrt{ε} / d)$, so that the early-stopping term $d {\bar{β}}_{δ}$ equals $\sqrt{ε}$. The resulting δ depends on the choice of the actual schedule.

For comparison, we provide the following corollary for the special case of a time-homogeneous CTMC, which is immediate upon plugging in

β_{t} \equiv 1

.

Corollary 1.

Under the setting of Theorem 1, for a time-homogeneous CTMC, i.e., when $β_{t} \equiv 1$, we have

\begin{aligned} TV (q_{0}, p_{T - δ}) ≲ {} & min \{d, \sqrt{d S}\} e^{- T} + \sqrt{ε_{TV}} \\ & + d^{2} S \sum_{k = 0}^{N - 1} max \{1, {(T - t_{k + 1})}^{- 2}\} {(t_{k + 1} - t_{k})}^{2} + d δ . \end{aligned}

Compared with the time-homogeneous CTMC setting, our error bound in Theorem 1 exhibits two key differences. First, the initialization and early-stopping errors are no longer determined solely by the diffusion times T and δ; they also depend on the prescribed noise schedule through the accumulated noise levels ${\bar{β}}_{T}$ and ${\bar{β}}_{δ}$. Second, the discretization error is not controlled by the selected time steps $(t_{k + 1} - t_{k})$ alone. Instead, it additionally depends on the smoothness of the noise schedule through $| β_{T - t_{k}}^{'} |$ and on the magnitude of the transition rates through $β_{T - t_{k}}^{2}$. The product $β_{T - t_{k}}^{2} {(t_{k + 1} - t_{k})}^{2}$ is, in effect, the squared size of each discretization interval as measured by the amount of noise injected over that interval. A proof sketch is provided in Section 5.

To further determine the number of steps to reach

O (\sqrt{ε})

error, we consider those

β_{t}

’s that satisfy the following regularity condition.

Assumption 2

(Slow-varying noise). Suppose that the noise schedule $β_{t}$ satisfies the following regularity conditions:

1. The schedule is uniformly bounded away from zero on the relevant low-noise interval, namely,

$β^{*} : = min_{z \in [δ, {\bar{β}}^{- 1} (1)]} β_{z} ≳ 1 .$
2. The schedule is asymptotically constant (ignoring logarithms):

$β_{t}, | β_{t}^{'} | = \tilde{O} (1), \forall t \in [δ, T] .$
3. The inverse accumulated-noise map grows at most polynomially:

${\bar{β}}^{- 1} (x) = poly (x) .$

At a high level, Assumption 2 requires the noise schedule to be slow-varying. This assumption is used only to determine a suitable choice of the number of sampling steps. In the time-homogeneous CTMC setting, where $β_{t} \equiv 1$, we have $β^{*} = 1$ and $β_{t}^{'} \equiv 0$. Therefore, the homogeneous case automatically satisfies Assumption 2. Another example of a noise schedule that satisfies Assumption 2 is the polynomial schedule ${\bar{β}}_{t} = t^{p}$ with $p > 0$. With this choice, $β_{t} = p t^{p - 1}$, and $β^{*} = p δ^{p - 1}$ if $p \geq 1$ and $β^{*} = p$ otherwise. Moreover, ${\bar{β}}^{- 1} (x) = x^{1 / p} = poly (x)$. Further, for $t \in [δ, T]$, $β_{t} = p t^{p - 1} \leq p T^{p - 1}$ and $| β_{t}^{'} | \leq | p (p - 1) T^{p - 2} |$, where we note that $T = poly (log (d / \sqrt{ε}))$ is constant up to logarithmic factors. In general, both the initial $β_{0}$ and the growth of $β_{t}$ w.r.t. t need to be mild to satisfy Assumption 2.
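For the polynomial schedule, the schedule-dependent parameter choices can be made explicit. The short calculation below is a sketch under the assumption ${\bar{β}}_{t} = t^{p}$; it only inverts the accumulated-noise map and plugs in the targets ${\bar{β}}_{T} = log (d / \sqrt{ε})$ from Theorem 2 and ${\bar{β}}_{δ} = \sqrt{ε} / d$ from Remark 2.

```latex
\bar{β}_{t} = t^{p}
\;\Longrightarrow\;
\bar{β}^{-1}(x) = x^{1/p},
\qquad
T = \bar{β}^{-1}\!\left(\log\frac{d}{\sqrt{ε}}\right)
  = \left(\log\frac{d}{\sqrt{ε}}\right)^{1/p},
\qquad
δ = \bar{β}^{-1}\!\left(\frac{\sqrt{ε}}{d}\right)
  = \left(\frac{\sqrt{ε}}{d}\right)^{1/p}.
```

For p = 1 this recovers the homogeneous choices $T = log (d / \sqrt{ε})$ and $δ = \sqrt{ε} / d$.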

In contrast to polynomial schedules, the exponential (also known as geometric) noise schedule does not satisfy this assumption. It is the main focus of Section 6.

The first condition in Assumption 2 imposes a constant lower bound on

β_{t}

over the low-noise regime. Intuitively, this prevents the noise schedule from becoming too small near the data distribution, ensuring that the score function does not blow up over the entire reverse process. The second condition controls the cumulative magnitude and variation of the schedule along the discretization grid. This requirement ensures that the transition rates do not depend on the system parameters, which is a common practice, so that the discrete-time approximation remains stable and reliable. Finally, the third condition requires the inverse accumulated-noise map

{\bar{β}}^{- 1}

to grow at most polynomially. Equivalently, this guarantees that one can reach a sufficiently large terminal accumulated noise level

{\bar{β}}_{T}

within a polynomial diffusion time, thereby ensuring that enough noise is injected by the terminal time.

With Assumption 2, the following theorem characterizes the number of sampling steps required to achieve

O (\sqrt{ε})

error in the time-inhomogeneous setting.

Theorem 2.

Under the setting of Theorem 1, suppose further that Assumption 2 holds. Let $t_{k + 1} - t_{k} = κ min \{1, {\bar{β}}_{T - t_{k}}\}$. Then, in order to achieve $O (\sqrt{ε})$ TV error, one can choose $T = {\bar{β}}^{- 1} (log (d / \sqrt{ε}))$ and $κ ≍ \sqrt{ε} / (d^{2} S)$, which gives $N = \tilde{O} (d^{2} S / \sqrt{ε})$.

The proof of Theorem 2 is in Appendix C. This theorem shows that under the stated regularity conditions on the noise schedule, the time-inhomogeneous sampler requires the same order of sampling steps as in the time-homogeneous setting; see, e.g., refs. [17,18]. The main difference is that in the inhomogeneous case, both the discretization step sizes and the terminal diffusion time must be chosen according to the prescribed noise schedule.
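To see where the logarithmic factor in N comes from, the homogeneous case serves as a sanity check. The sketch below assumes $β_{t} \equiv 1$ (so ${\bar{β}}_{t} = t$), $T > 1$, and $δ = \sqrt{ε} / d$ as in Remark 2. It counts steps heuristically and is not the proof given in Appendix C.

```latex
% Phase 1 (T - t_k \geq 1): constant steps of size κ.
N_{1} \approx \frac{T - 1}{κ}

% Phase 2 (T - t_k < 1): t_{k+1} - t_k = κ (T - t_k), so the remaining time
% shrinks geometrically until it reaches δ.
T - t_{k+1} = (1 - κ)(T - t_{k})
\;\Longrightarrow\;
N_{2} \approx \frac{\log(1/δ)}{-\log(1 - κ)} \approx \frac{\log(1/δ)}{κ}

% Total, with T = \log(d/\sqrt{ε}), δ = \sqrt{ε}/d, κ ≍ \sqrt{ε}/(d^{2} S):
N = N_{1} + N_{2}
  \approx \frac{T - 1 + \log(1/δ)}{κ}
  ≍ \frac{d^{2} S}{\sqrt{ε}} \left( 2 \log\frac{d}{\sqrt{ε}} - 1 \right)
  = \tilde{O}\!\left( \frac{d^{2} S}{\sqrt{ε}} \right)
```

The geometric shrinkage in the second phase is what keeps the number of steps near the data distribution logarithmic in 1/δ rather than polynomial.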

5. Proof Sketch of Theorem 1

We now provide a proof sketch of Theorem 1. The full proof is in Appendix B.

Proof sketch of Theorem 1.

We analyze the Euler sampler through the truncated

τ

-leaping sampler, which is asymptotically equivalent to Euler by [17] (Lemma 8). The key property is that the estimated rate is piecewise constant on each interval

[t_{k}, t_{k + 1})

:

{\hat{R}}_{t} (x_{t_{k}}, \cdot) = {\hat{R}}_{t_{k}} (x_{t_{k}}, \cdot), t \in [t_{k}, t_{k + 1}) .

Let

{\overset{\leftarrow}{q}}_{t} = q_{T - t}

denote the exact reverse marginal. By the TV perturbation bound of [18] (Theorem 1),

TV ({\overset{\leftarrow}{q}}_{T - δ}, p_{T - δ}) \leq TV ({\overset{\leftarrow}{q}}_{0}, p_{0}) + \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{x_{t} \sim {\overset{\leftarrow}{q}}_{t}} \sum_{y : y \neq x_{t}} | {\hat{R}}_{t} (x_{t}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t}, y) | d t .

Writing

h_{t} (x) : = \sum_{y : y \neq x} | {\hat{R}}_{t} (x, y) - {\overset{\leftarrow}{R}}_{t} (x, y) |,

and adding and subtracting

h_{t_{k}} (x_{t_{k}})

, the error decomposes into initialization, score-estimation, and discretization terms. By Assumption 1, the score-estimation term is bounded by

\sum_{k = 0}^{N - 1} (t_{k + 1} - t_{k}) E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} h_{t_{k}} (x_{t_{k}}) \leq \sqrt{ε_{TV}} .

The initialization error is controlled by the convergence of the forward chain to the uniform distribution. Since

{\overset{\leftarrow}{q}}_{0} = q_{T}

and

p_{0} = Uniform ({[S]}^{d})

, we have

TV ({\overset{\leftarrow}{q}}_{0}, p_{0}) = TV (q_{T}, p_{0}) ≲ min \{d, \sqrt{d S}\} e^{- {\bar{β}}_{T}} .

It remains to bound the discretization error. The part caused by the change from

x_{t_{k}}

to

x_{t}

is lower-order for sufficiently small step size, since the total outgoing reverse rate is of order

d β_{T - t}

. The dominant contribution comes from the time variation of the reverse rate. Using the piecewise-constant property of

{\hat{R}}_{t}

,

E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t} (x_{t_{k}}) - h_{t_{k}} (x_{t_{k}})] ≲ E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{t_{k}}} | {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) | .

By the reverse-rate formula in (3) and its coordinate-wise structure, only states satisfying

Ham (x, y) = 1

contribute. The difference is then split into the change in the likelihood ratio and the change in the rate matrix. The likelihood-ratio term is bounded using the non-homogeneous score-ratio estimate

\frac{q_{s} (y)}{q_{s} (x)} ≲ S max \{1, {\bar{β}}_{s}^{- 1}\}, \forall (x, y) : Ham (x, y) = 1,

together with the corresponding likelihood-ratio difference bound. The rate-matrix term follows from smoothness of

β_{t}

:

| R_{T - t_{k}} (y, x) - R_{T - t} (y, x) | ≲ \frac{1}{S} | β_{T - t_{k}}^{'} | (t - t_{k}) .

Combining these bounds yields

D_{disc} ≲ d S \sum_{k = 0}^{N - 1} max \{1, {\bar{β}}_{T - t_{k + 1}}^{- 2}\} (| β_{T - t_{k}}^{'} | + d β_{T - t_{k}}^{2}) {(t_{k + 1} - t_{k})}^{2},

where

D_{disc}

represents the discretization error.

Finally, because the reverse process is stopped at

T - δ

, its target marginal is

q_{δ}

rather than

q_{0}

. The early-stopping error satisfies

TV (q_{0}, q_{δ}) ≲ d {\bar{β}}_{δ} .
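One standard way to see why the early-stopping error scales as $d {\bar{β}}_{δ}$ is a coupling argument. The short derivation below is a sketch and may differ from the argument used in the full proof in Appendix B; it uses only the closed-form token kernel $Q_{δ ∣ 0}^{tok} = exp ({\bar{β}}_{δ} R_{base})$ from Section 3.1.

```latex
% Probability that a single token differs from its initial value at time δ:
P(x_{δ}^{i} \neq x_{0}^{i})
  = \left(1 - e^{-\bar{β}_{δ}}\right) \frac{S - 1}{S}
  \leq \bar{β}_{δ} .

% Union bound over the d coordinates, then the coupling characterization of TV:
TV(q_{0}, q_{δ})
  \leq P(x_{δ} \neq x_{0})
  \leq \sum_{i = 1}^{d} P(x_{δ}^{i} \neq x_{0}^{i})
  \leq d \, \bar{β}_{δ} .
```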

Using the triangle inequality,

TV (q_{0}, p_{T - δ}) \leq TV (q_{0}, q_{δ}) + TV ({\overset{\leftarrow}{q}}_{T - δ}, p_{T - δ}),

and combining the four bounds gives the desired result. □

6. Geometric Noise Schedule

While the geometric schedule [8] does not directly satisfy Assumption 2, the preceding time-inhomogeneous framework still enables us to analyze it. Unlike the common theoretical setting in which T diverges as ε vanishes, in practice one typically fixes $T = 1$ regardless of ε. In this setting, the geometric noise schedule is defined as follows.

{\bar{β}}_{t} : = β_{\min}^{1 - t} β_{\max}^{t}, β_{t} = β_{\min}^{1 - t} β_{\max}^{t} log (β_{\max} / β_{\min}) = {\bar{β}}_{t} log (β_{\max} / β_{\min}), t \in [0, 1],

(7)

where

0 < β_{\min} < 1 < β_{\max}

are prescribed constants. The default choice in [8] is

β_{\min} = 10^{- 4}

and

β_{\max} = 20

. Here, both

{\bar{β}}_{t}

and

β_{t}

are increasing functions on

[0, 1]

. Moreover, differentiating

β_{t}

gives

β_{t}^{'} = β_{\min}^{1 - t} β_{\max}^{t} {log}^{2} (β_{\max} / β_{\min}) = β_{t} log (β_{\max} / β_{\min}) .
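The relations between ${\bar{β}}_{t}$, $β_{t}$, and $β_{t}^{'}$ in (7) can be checked numerically. The sketch below uses the default values $β_{\min} = 10^{- 4}$ and $β_{\max} = 20$ quoted above; the function names `beta_bar`, `beta`, and `beta_prime` are hypothetical, and the finite-difference comparison is only a sanity check, not part of the analysis.

```python
import numpy as np


def beta_bar(t, beta_min=1e-4, beta_max=20.0):
    """Accumulated noise level of the geometric schedule (7)."""
    return beta_min ** (1.0 - t) * beta_max ** t


def beta(t, beta_min=1e-4, beta_max=20.0):
    """Local noise rate: the time derivative of beta_bar."""
    return beta_bar(t, beta_min, beta_max) * np.log(beta_max / beta_min)


def beta_prime(t, beta_min=1e-4, beta_max=20.0):
    """Derivative of the local noise rate."""
    return beta(t, beta_min, beta_max) * np.log(beta_max / beta_min)


# Central finite differences as a sanity check of the closed-form derivatives.
t = np.linspace(0.1, 0.9, 5)
h = 1e-5
fd_beta = (beta_bar(t + h) - beta_bar(t - h)) / (2 * h)
fd_beta_prime = (beta(t + h) - beta(t - h)) / (2 * h)
```

Both the rate and its derivative are the accumulated noise multiplied by powers of $log (β_{\max} / β_{\min})$, so all three quantities grow at the same exponential rate in t.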

We first follow ref. [8] and use a constant step size

κ

, so that the total number of discretization steps is

N = κ^{- 1}

. The following theorem specializes our general time-inhomogeneous result to the geometric noise schedule.

Theorem 3.

Suppose that Assumption 1 holds. With the geometric noise schedule defined in (7), choosing

t_{k + 1} - t_{k} = κ

where

t_{0} = 0

and

t_{N} = 1 - δ

, we have

\begin{aligned} lim_{δ \to 0^{+}} TV (q_{0}, p_{1 - δ}) ≲ {} & min \{d, \sqrt{d S}\} e^{- β_{\max}} + \sqrt{ε_{TV}} \\ & + κ d^{2} S log (β_{\max} / β_{\min}) ({(β_{\min}^{- 1} - 1)}^{2} + {(β_{\max} - 1)}^{2}) + d β_{\min} . \end{aligned}

Thus, in order to achieve $O (\sqrt{ε})$ TV error, we can choose

β_{\min} ≍ \frac{\sqrt{ε}}{d}, β_{\max} ≍ log \frac{min \{d, \sqrt{d S}\}}{\sqrt{ε}}, and κ ≍ \frac{ε^{3 / 2}}{d^{4} S log (d / ε)},

which gives $N = \tilde{O} (d^{4} S / ε^{3 / 2})$.

Theorem 3 highlights an interesting regime in which a finite error upper bound can be obtained even as

δ \to 0^{+}

. This contrasts with prior bounds whose dependence on

log δ^{- 1}

diverges in the vanishing early-stopping limit [15,16,17,18]. While the initialization, score-estimation, and early-stopping terms retain the same interpretation as in the general time-inhomogeneous result, the discretization error admits an explicit expression under the geometric noise schedule. In particular, up to logarithmic factors, this term scales quadratically with the prescribed schedule parameters

β_{\min}^{- 1}

and

β_{\max}

.

Our result illustrates the trade-off governed by the schedule parameters

β_{\min}

and

β_{\max}

, when the step sizes are constant, as is typical in many empirical studies [8,23]. A smaller

β_{\min}

reduces the early-stopping error but increases the accumulated discretization error. Similarly, a larger

β_{\max}

decreases the initialization error while again amplifying the discretization error.

Note that the constant step size used above is not optimized and may therefore lead to a suboptimal overall step complexity. As in Theorem 2, we can instead employ a constant-then-decreasing step size, which shrinks towards the end of the reverse process, to obtain an improved convergence result, given as follows.

Theorem 4.

With the same setting as in Theorem 3, but choosing

t_{k + 1} - t_{k} = κ min \{1, β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}}\}

, we have

\begin{aligned} lim_{δ \to 0^{+}} TV (q_{0}, p_{1 - δ}) ≲ {} & min \{d, \sqrt{d S}\} e^{- β_{\max}} + \sqrt{ε_{TV}} \\ & + κ d^{2} S log (β_{\max} / β_{\min}) ({(β_{\max} - 1)}^{2} + 1) + d β_{\min} . \end{aligned}

Meanwhile, the number of steps satisfies

N ≲ \frac{1}{κ} (1 + \frac{β_{\min}^{- 1} - 1}{log (β_{\max} / β_{\min})}) .

Thus, choosing

β_{\min} ≍ \frac{\sqrt{ε}}{d}, β_{\max} ≍ log \frac{min \{d, \sqrt{d S}\}}{\sqrt{ε}}, and κ ≍ \frac{\sqrt{ε}}{d^{2} S log (d / ε)},

we have $N = \tilde{O} (d^{3} S / ε)$.

Compared to Theorem 3, Theorem 4 improves the step complexity by a factor of order $O (d / \sqrt{ε})$. The key observation is that under the geometric noise schedule, the effective noise level decays exponentially along the reverse-time trajectory. The constant-then-decreasing step size adapts to this behavior: it uses relatively large constant steps in the high-noise regime, where the reverse dynamics are more stable, and automatically takes exponentially smaller steps in the low-noise regime, where the discretization error is most sensitive. This refinement removes the unfavorable dependence on $β_{\min}^{- 2}$ that appears under a constant step size while increasing the number of steps only in the region where finer discretization is necessary. As a result, the adaptive scheme achieves a sharper overall sampling complexity.
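The adaptive grid of Theorem 4 and its step-count bound can be illustrated numerically. The sketch below is an assumption-laden toy: `adaptive_grid` and `step_count_bound` are hypothetical names, the parameter values are chosen for speed rather than taken from [8], and a small positive δ stands in for the limit $δ \to 0^{+}$. The last step is clipped so that the grid ends exactly at $1 - δ$.

```python
import math

import numpy as np


def adaptive_grid(kappa, beta_min, beta_max, delta):
    """Reverse-time grid with steps kappa * min{1, beta_min^t * beta_max^(1 - t)}."""
    t_end = 1.0 - delta
    grid = [0.0]
    while grid[-1] < t_end:
        t = grid[-1]
        # Accumulated noise at forward time 1 - t under the geometric schedule.
        noise = beta_min ** t * beta_max ** (1.0 - t)
        step = kappa * min(1.0, noise)
        grid.append(min(t + step, t_end))
    return np.array(grid)


def step_count_bound(kappa, beta_min, beta_max):
    """Right-hand side of the step-count bound in Theorem 4 (constant set to 1)."""
    log_ratio = math.log(beta_max / beta_min)
    return (1.0 + (1.0 / beta_min - 1.0) / log_ratio) / kappa


kappa, beta_min, beta_max = 0.01, 1e-2, 20.0
grid = adaptive_grid(kappa, beta_min, beta_max, delta=1e-6)
num_steps = len(grid) - 1
bound = step_count_bound(kappa, beta_min, beta_max)
```

Because the step size is non-increasing in reverse time, the number of steps never exceeds the bound by more than the single clipped step at the end, while it is well above the $1 / κ$ steps a constant step size would use.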

As a future direction, it would be interesting to investigate why such a geometric noise schedule yields worse results than polynomial ones.

7. Conclusions

In conclusion, this work provides a theoretical foundation for discrete diffusion models driven by time-inhomogeneous uniform-rate generators. By moving beyond the homogeneous noise schedules considered in much of the prior literature, our analysis broadens the class of discrete diffusion processes for which convergence can be rigorously understood. We further identify regularity conditions on the noise schedule that enable explicit convergence-rate estimates. Under these conditions, the resulting guarantees match state-of-the-art rates known for homogeneous discrete diffusion samplers, demonstrating that carefully controlled time-inhomogeneous schedules can retain the same theoretical efficiency while offering greater modeling flexibility.

Author Contributions

Conceptualization, Y.L. (Yuchen Liang); Methodology, Y.L. (Yuchen Liang); Formal analysis, Y.L. (Yuchen Liang); Investigation, Y.L. (Yuchen Liang); Writing—original draft, Y.L. (Yuchen Liang); Writing—review and editing, Y.L. (Yuchen Liang); Supervision, L.L., N.S. and Y.L. (Yingbin Liang); Funding acquisition, N.S. and Y.L. (Yingbin Liang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by U.S. National Science Foundation grant number NSF AI Institute (AI-EDGE) 2112471, ExpandAI-2324052, ECCS-2413528, CNS-2312836, CNS-2223452, CNS-2225561, CCF-2232907, ECCS-2448268; DEVCOM Army Research Laboratory grant number W911NF-23-2-0225.

Data Availability Statement

Data sharing is not applicable. No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Justification of Assumption 1

The following proposition links the TV score-estimation error in Assumption 1 with the commonly used score-entropy loss [8].

Proposition A1.

Define the score entropy loss at time t as

L_{S E} (s; t) = E_{x_{t} \sim q_{t}} \sum_{y : y \neq x_{t}} R_{t} (y, x_{t}) \cdot (s_{t} (y, x_{t}) - \frac{q_{t} (y)}{q_{t} (x_{t})} - \frac{q_{t} (y)}{q_{t} (x_{t})} log \frac{s_{t} (y, x_{t})}{q_{t} (y) / q_{t} (x_{t})}) .

If

{\hat{R}}_{t} (x, y) - {\overset{\leftarrow}{R}}_{t} (x, y) = o (1), \forall t, x, y

, then we have

L_{T V} (s; T - t) ≲ \sqrt{d (S - 1)} max \{1, {\bar{β}}_{T - t}^{- 1}\} \cdot \sqrt{L_{S E} (s; T - t)} .

Proof of Proposition A1.

Recall that

{\hat{R}}_{t_{k}} (x_{k}, y) = R_{T - t_{k}} (y, x_{k}) s_{T - t_{k}} (y, x_{k})

. From the definition, the TV loss at time t satisfies that

\begin{matrix} L_{T V}^{2} (s; T - t_{k}) & = {(E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [\sum_{y : y \neq x_{k}} | {\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y) |])}^{2} \\ \leq E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} {(\sum_{y : y \neq x_{k}} | {\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y) |)}^{2} \\ = d^{2} {(S - 1)}^{2} E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} {(\frac{1}{d (S - 1)} \sum_{y : y \neq x_{k}} | {\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y) |)}^{2} \\ \leq E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} d^{2} {(S - 1)}^{2} \frac{1}{d (S - 1)} \sum_{y : y \neq x_{k}} {({\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y))}^{2} \\ \leq d (S - 1) E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{k}} {({\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y))}^{2} \end{matrix}

where the first two inequalities are due to Jensen’s inequality and the last step in fact holds with equality.

On the other hand, the score-entropy loss at time t satisfies that

\begin{matrix} L_{S E} (s; T - t_{k}) & = E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{k}} {\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y) log \frac{{\hat{R}}_{t_{k}} (x_{k}, y)}{{\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y)} \\ \overset{(i)}{=} E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{k}} \frac{{({\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y))}^{2}}{2 {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y)} + o (1) \\ \overset{(i i)}{≳} min \{1, {\bar{β}}_{T - t_{k}}\} \cdot E_{x_{k} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{k}} {({\hat{R}}_{t_{k}} (x_{k}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y))}^{2} \end{matrix}

where

(i)

follows by assuming that

{\hat{R}}_{t} (x, y) - {\overset{\leftarrow}{R}}_{t} (x, y) = o (1), \forall t, x, y

, and

(i i)

follows because

\frac{q_{t} (y)}{q_{t} (x_{k})} \leq S max \{1, {\bar{β}}_{t}^{- 1}\}

(see Lemma A2) and thus

{\overset{\leftarrow}{R}}_{t_{k}} (x_{k}, y) ≲ max \{1, {\bar{β}}_{T - t_{k}}^{- 1}\}

.

Therefore, we have that, for all

t_{k}

’s,

L_{T V} (s; T - t_{k}) ≲ \sqrt{d (S - 1)} max \{1, {\bar{β}}_{T - t_{k}}^{- 1}\} \cdot \sqrt{L_{S E} (s; T - t_{k})} . □
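Step (i) of the proof rests on the second-order expansion of the score-entropy integrand around the true rate. The sketch below checks this numerically for a single pair of rates; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def score_entropy_term(r_hat, r_true):
    """Per-pair integrand r_hat - r_true - r_true * log(r_hat / r_true)."""
    return r_hat - r_true - r_true * math.log(r_hat / r_true)


def quadratic_term(r_hat, r_true):
    """Leading-order approximation (r_hat - r_true)^2 / (2 * r_true)."""
    return (r_hat - r_true) ** 2 / (2.0 * r_true)


r_true = 0.7  # hypothetical value of the true reverse rate
ratios = []
for gap in (1e-1, 1e-2, 1e-3):
    exact = score_entropy_term(r_true + gap, r_true)
    ratios.append(exact / quadratic_term(r_true + gap, r_true))

# The ratio of the exact term to its quadratic approximation approaches 1
# as the estimation gap shrinks.
print(ratios)
```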

Remark A1.

Here we have shown a point-wise upper bound over t. Indeed, a time-averaged upper bound can also be achieved with the step size in Theorem 2:

t_{k + 1} - t_{k} = κ min \{1, {\bar{β}}_{T - t_{k}}\}

, which yields

\sum_{k = 0}^{N - 1} (t_{k + 1} - t_{k}) L_{T V} (s; T - t_{k}) ≲ T \sqrt{d S} sup_{k \in [N]} \sqrt{L_{S E} (s; T - t_{k})} .

Appendix B. Proof of Theorem 1

We follow the idea of [17,18] to analyze the Euler method by constructing a truncated version of the vanilla

τ

-leaping sampler. In [17] (Lemma 8), it is shown that the truncated

τ

-leaping sampler is asymptotically equivalent to the Euler method. Also, from Lemma 7 of [17], one important property of this sampler is that its sampling rate

{\hat{R}}_{t}

is piecewise constant and, given

x_{t_{k}} \in {[S]}^{d}

, we have

{\hat{R}}_{t} (x_{t_{k}}, \cdot) = {\hat{R}}_{t_{k}} (x_{t_{k}}, \cdot) .

(A1)

Step 1: Decompose total error.

To begin, by [18] (Theorem 1), we have

TV ({\overset{\leftarrow}{q}}_{T - δ}, p_{T - δ}) \leq TV ({\overset{\leftarrow}{q}}_{0}, p_{0}) + \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{x_{t} \sim {\overset{\leftarrow}{q}}_{t}} \sum_{y : y \neq x_{t}} | {\hat{R}}_{t} (x_{t}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t}, y) | d t .

Thus, if we write

h_{t} (x_{t}) : = \sum_{y : y \neq x_{t}} | {\hat{R}}_{t} (x_{t}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t}, y) |

, we have

\begin{matrix} TV ({\overset{\leftarrow}{q}}_{T - δ}, p_{T - δ}) \\ \leq TV ({\overset{\leftarrow}{q}}_{0}, p_{0}) + \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{x_{t} \sim {\overset{\leftarrow}{q}}_{t}} [h_{t} (x_{t})] d t \\ = TV ({\overset{\leftarrow}{q}}_{0}, p_{0}) + \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t_{k}} (x_{t_{k}})] d t \\ + \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{\begin{matrix} x_{t} \sim {\overset{\leftarrow}{q}}_{t} \\ x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}} \end{matrix}} [h_{t} (x_{t}) - h_{t_{k}} (x_{t_{k}})] \\ = \underset{initialization error}{\underset{⏟}{TV ({\overset{\leftarrow}{q}}_{0}, p_{0})}} + \underset{estimation error}{\underset{⏟}{\sum_{k = 0}^{N - 1} (t_{k + 1} - t_{k}) E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t_{k}} (x_{t_{k}})]}} \\ + \underset{discretization error}{\underset{⏟}{\sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{\begin{matrix} x_{t} \sim {\overset{\leftarrow}{q}}_{t} \\ x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}} \end{matrix}} [h_{t} (x_{t}) - h_{t} (x_{t_{k}})] + E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t} (x_{t_{k}}) - h_{t_{k}} (x_{t_{k}})] d t}} . \end{matrix}

(A2)

By Assumption 1, the estimation error satisfies that

\sum_{k = 0}^{N - 1} (t_{k + 1} - t_{k}) E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t_{k}} (x_{t_{k}})] \leq \sqrt{ε_{TV}} .

Step 2: Bound initialization error.

The following lemma bounds the initialization error for a non-homogeneous noise schedule.

Lemma A1.

Given the forward process in (1) with the rate given in (2), we have

TV (q_{T}, p_{0}) ≲ min \{d, \sqrt{d S}\} e^{- {\bar{β}}_{T}} .

Proof.

See Appendix F.1. □

Step 3: Bound discretization error.

It now remains to upper-bound the discretization error. As shown in (A2), the discretization error can be decomposed into two terms, one for the space-difference in the argument of

h_{t}

(in expected value) and the other for the time-difference in

h_{t}

itself. Using [18] (Lemma 4), the space-difference term can be upper-bounded as

\begin{matrix} E_{\begin{matrix} x_{t} \sim {\overset{\leftarrow}{q}}_{t} \\ x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}} \end{matrix}} [h_{t} (x_{t}) - h_{t} (x_{t_{k}})] & \leq (t - t_{k}) E_{x_{t} \sim {\overset{\leftarrow}{q}}_{t}} [(- R_{T - t} (x_{t}, x_{t})) h_{t} (x_{t})] + o (t - t_{k}) \\ ≲ (t - t_{k}) d \cdot β_{T - t} E_{x_{t} \sim {\overset{\leftarrow}{q}}_{t}} [h_{t} (x_{t})] . \end{matrix}

(A3)

Thus, since

t_{k + 1} - t_{k} \leq κ

, we have

\begin{matrix} \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} E_{\begin{matrix} x_{t} \sim {\overset{\leftarrow}{q}}_{t} \\ x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}} \end{matrix}} [h_{t} (x_{t}) - h_{t} (x_{t_{k}})] d t \\ ≲ κ \cdot d \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} β_{T - t} E_{x_{t} \sim {\overset{\leftarrow}{q}}_{t}} [h_{t} (x_{t})] d t \\ \overset{(i)}{≲} κ \cdot d \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} \frac{β_{T - t}}{1 - (t - t_{k}) d β_{T - t}} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t} (x_{t_{k}})] d t \\ = κ \cdot d \sum_{k = 0}^{N - 1} \int_{t_{k}}^{t_{k + 1}} \frac{β_{T - t}}{1 - (t - t_{k}) d β_{T - t}} \\ \cdot (E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t_{k}} (x_{t_{k}})] + E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t} (x_{t_{k}}) - h_{t_{k}} (x_{t_{k}})]) d t \\ \overset{(i i)}{≲} κ \cdot d sup_{t} \frac{β_{T - t}}{1 - (t - t_{k}) d β_{T - t}} (\sqrt{ε_{TV}} + \int_{0}^{T - δ} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t} (x_{t_{k}}) - h_{t_{k}} (x_{t_{k}})] d t) \end{matrix}

(A4)

where

(i)

follows again by (A3), and

(i i)

follows by Assumption 1. Thus, as long as

κ = o (1)

, the time-difference in the discretization error dominates.

We now turn to the time-difference term of the discretization error, which is bounded as

\begin{matrix} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} [h_{t} (x_{t_{k}}) - h_{t_{k}} (x_{t_{k}})] \\ = E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{t_{k}}} (| {\hat{R}}_{t} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) | - | {\hat{R}}_{t_{k}} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) |) \\ \leq E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{t_{k}}} | {\hat{R}}_{t} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) - {\hat{R}}_{t_{k}} (x_{t_{k}}, y) + {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) | \\ \overset{(i)}{=} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{t_{k}}} | {\hat{R}}_{t_{k}} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) - {\hat{R}}_{t_{k}} (x_{t_{k}}, y) + {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) | \\ = E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y : y \neq x_{t_{k}}} | {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) |, \end{matrix}

(A5)

where

(i)

follows from the property of the sampler given in (A1).

Step 4: Bound expected absolute difference in the rate function.

Continuing (A5), the rate-determining term is the expected absolute difference in the rate function. To this end, we need a characterization of the score function under a non-homogeneous noise schedule, as follows.

Lemma A2

(Score bound under non-homogeneous

β_{t}

). Fix

t > 0

and

x \neq y

such that

Ham (x, y) = p

. Given the forward process in (1) with a rate given in (2), we have

\frac{q_{t} (y)}{q_{t} (x)} ≲ {(S \cdot max \{1, {({\bar{β}}_{t})}^{- 1}\})}^{p} .

Proof.

See Appendix F.2. □

Now, for any

x_{t_{k}} \in {[S]}^{d}

, the sum difference in the reverse rate can be further calculated using (3) as

\begin{matrix} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y \neq x_{t_{k}}} | {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) | \\ = E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y \neq x_{t_{k}}} | \frac{q_{T - t_{k}} (y)}{q_{T - t_{k}} (x_{t_{k}})} R_{T - t_{k}} (y, x_{t_{k}}) - \frac{q_{T - t} (y)}{q_{T - t} (x_{t_{k}})} R_{T - t} (y, x_{t_{k}}) | \\ \leq E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y \neq x_{t_{k}}} | \frac{q_{T - t_{k}} (y)}{q_{T - t_{k}} (x_{t_{k}})} - \frac{q_{T - t} (y)}{q_{T - t} (x_{t_{k}})} | R_{T - t_{k}} (y, x_{t_{k}}) \\ + E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y \neq x_{t_{k}}} \frac{q_{T - t} (y)}{q_{T - t} (x_{t_{k}})} | R_{T - t_{k}} (y, x_{t_{k}}) - R_{T - t} (y, x_{t_{k}}) | \\ = E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{\begin{matrix} y \neq x_{t_{k}} \\ Ham (y, x_{t_{k}}) = 1 \end{matrix}} | \frac{q_{T - t_{k}} (y)}{q_{T - t_{k}} (x_{t_{k}})} - \frac{q_{T - t} (y)}{q_{T - t} (x_{t_{k}})} | R_{T - t_{k}} (y, x_{t_{k}}) \\ + E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{\begin{matrix} y \neq x_{t_{k}} \\ Ham (y, x_{t_{k}}) = 1 \end{matrix}} \frac{q_{T - t} (y)}{q_{T - t} (x_{t_{k}})} | R_{T - t_{k}} (y, x_{t_{k}}) - R_{T - t} (y, x_{t_{k}}) |, \end{matrix}

(A6)

where the last line follows because

R_{t} (y, x) = 0

whenever

Ham (y, x) \geq 2

. Here the second term captures the effect of non-homogeneity of

β_{t}

, which becomes zero in the case of constant

β_{t}

.

We first deal with the second term in (A6). Since

β_{t}

is smooth, we have

| R_{T - t_{k}} (y, x_{t_{k}}) - R_{T - t} (y, x_{t_{k}}) | = \frac{1}{S} | β_{T - t_{k}} - β_{T - t} | ≲ \frac{1}{S} | β_{T - t_{k}}^{'} | (t - t_{k}) .

Thus, combining the result from Lemma A2, the second term in (A6) satisfies that

\begin{matrix} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{\begin{matrix} y \neq x_{t_{k}} \\ Ham (y, x_{t_{k}}) = 1 \end{matrix}} \frac{q_{T - t} (y)}{q_{T - t} (x_{t_{k}})} | R_{T - t_{k}} (y, x_{t_{k}}) - R_{T - t} (y, x_{t_{k}}) | \\ ≲ S d | β_{T - t_{k}}^{'} | max {1, {\bar{β}}_{T - t_{k + 1}}^{- 1}} (t - t_{k}) . \end{matrix}

In order to provide an upper bound for the first term in (A6), it is essential to deal with the time difference of the likelihood ratios. The following lemma is a direct extension of [17] (Lemma 5).

Lemma A3.

Fix

s < t

such that

t - s

is small. Given the forward process in (1) with a rate given in (2), we have

\begin{matrix} E_{x_{t} \sim q_{t}} \sum_{\begin{matrix} y \neq x_{t} \\ Ham (y, x_{t}) = 1 \end{matrix}} | \frac{q_{t} (y)}{q_{t} (x_{t})} - & \frac{q_{s} (y)}{q_{s} (x_{t})} | R_{t} (y, x_{t}) \\ ≲ d S β_{t}^{2} max {1, {\bar{β}}_{s}^{- 2}} (t - s) + d^{2} S β_{t}^{2} max {1, {\bar{β}}_{s}^{- 1}} (t - s) . \end{matrix}

Proof.

See Appendix F.3. □

Thus, considering the two terms in (A6), we have the following bound.

\begin{matrix} E_{x_{t_{k}} \sim {\overset{\leftarrow}{q}}_{t_{k}}} \sum_{y \neq x_{t_{k}}} | {\overset{\leftarrow}{R}}_{t_{k}} (x_{t_{k}}, y) - {\overset{\leftarrow}{R}}_{t} (x_{t_{k}}, y) | \\ ≲ d S max {1, {\bar{β}}_{T - t_{k + 1}}^{- 1}} (| β_{T - t_{k}}^{'} | + β_{t}^{2} max {1, {\bar{β}}_{T - t_{k + 1}}^{- 1}} + d β_{t}^{2}) (t - t_{k}) . \end{matrix}

(A7)

Therefore, combining all the ingredients above, we have

\begin{matrix} TV ({\overset{\leftarrow}{q}}_{T - δ}, p_{T - δ}) \\ ≲ min \{d, \sqrt{d S}\} e^{- {\bar{β}}_{T}} + \sqrt{ε_{TV}} + d S \sum_{k = 0}^{N - 1} max \{1, {\bar{β}}_{T - t_{k + 1}}^{- 2}\} (| β_{T - t_{k}}^{'} | + d β_{T - t_{k}}^{2}) {(t_{k + 1} - t_{k})}^{2} . \end{matrix}
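The summation term in this bound can be evaluated directly for any grid and noise schedule. The following sketch does so, reading the squared factor as the step length t_{k+1} − t_k, as in the restatement at the start of Appendix D. The helper names, the homogeneous schedule (β_t = 1, so that the accumulated noise equals t), and all numerical values are assumptions made for illustration.

```python
def discretization_term(ts, beta, beta_prime, beta_bar, T, d, S):
    """Evaluate d*S*sum_k max(1, beta_bar(T - t_{k+1})^-2)
    * (|beta'(T - t_k)| + d * beta(T - t_k)^2) * (t_{k+1} - t_k)^2."""
    total = 0.0
    for t_k, t_next in zip(ts[:-1], ts[1:]):
        amplification = max(1.0, beta_bar(T - t_next) ** -2)
        local_rate = abs(beta_prime(T - t_k)) + d * beta(T - t_k) ** 2
        total += amplification * local_rate * (t_next - t_k) ** 2
    return d * S * total


def uniform_grid(T, delta, n):
    """Uniform reverse-time grid on [0, T - delta] with n steps."""
    return [(T - delta) * k / n for k in range(n + 1)]


# Hypothetical homogeneous schedule: beta_t = 1, beta'_t = 0, beta_bar_t = t.
T, DELTA, D, S_SIZE = 5.0, 0.1, 4, 3
schedule = dict(beta=lambda t: 1.0, beta_prime=lambda t: 0.0, beta_bar=lambda t: t)

coarse = discretization_term(uniform_grid(T, DELTA, 100), T=T, d=D, S=S_SIZE, **schedule)
fine = discretization_term(uniform_grid(T, DELTA, 200), T=T, d=D, S=S_SIZE, **schedule)
print(coarse, fine)
```

Halving a constant step size roughly halves this term, reflecting its first-order dependence on κ.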

Step 5: Bound initial perturbation.

To finish the proof, it remains to quantify the perturbation introduced by early stopping when

β_{t}

is non-homogeneous. This is characterized in the following lemma, whose proof is similar to the last part of [12] (Theorem 6).

Lemma A4.

Under the forward process in (1) with the rate given in (2), we have

TV (q_{0}, q_{δ}) ≲ d {\bar{β}}_{δ}, as δ \to 0 .

Proof.

See Appendix F.4. □
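One way to see why the bound scales as d times the accumulated noise is a simple coupling argument based on the closed-form transition kernel in (A11). The derivation below is a sketch under that reading; the proof in Appendix F.4 may proceed differently.

```latex
\begin{aligned}
TV(q_{0}, q_{\delta})
  &\le \Pr(x_{\delta} \neq x_{0})
   \le \sum_{i=1}^{d} \Pr(x_{\delta}^{i} \neq x_{0}^{i}) \\
  &= d \cdot \frac{S-1}{S} \bigl(1 - e^{-\bar{\beta}_{\delta}}\bigr)
   \le d \, \bar{\beta}_{\delta} .
\end{aligned}
```

The first inequality couples the two marginals through the forward process itself, the second is a union bound over coordinates, and the equality uses that each coordinate leaves its initial value with probability (S − 1)/S times one minus the exponential of the negative accumulated noise.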

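Combining Lemma A4 with the bound above via the triangle inequality for the total variation distance (the reverse marginal at time T − δ coincides with the forward marginal at time δ) completes the proof of Theorem 1.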

Appendix C. Proof of Theorem 2

From the result of Theorem 1, it remains to determine the number of steps using the special step sizes. The goal is to directly characterize the summation term, which we perform as follows.

Let

fn (β_{T - t_{k}}) \in \{β_{T - t_{k}}^{2}, | β_{T - t_{k}}^{'} |\}

. Recall that

{\bar{β}}_{t}

is increasing. Define

k^{*} : = sup \{k : {\bar{β}}_{T - t_{k}} > 1\} .

We thus have the following cases.

1. Case 1: $k < k^{*}$. This implies that ${\bar{β}}_{T - t_{k}} > {\bar{β}}_{T - t_{k + 1}} > 1$, and $t_{k + 1} - t_{k} \leq κ$. Thus, we have

$\sum_{k = 0}^{k^{*} - 1} {(t_{k + 1} - t_{k})}^{2} \frac{fn (β_{T - t_{k}})}{min \{1, {\bar{β}}_{T - t_{k + 1}}^{2}\}} \leq κ^{2} \sum_{k = 0}^{k^{*} - 1} fn (β_{T - t_{k}}) .$
2. Case 2: $k > k^{*}$. This implies that ${\bar{β}}_{T - t_{k + 1}} < {\bar{β}}_{T - t_{k}} \leq 1$, and

$\begin{matrix} \sum_{k = k^{*} + 1}^{N - 1} {(t_{k + 1} - t_{k})}^{2} \frac{fn (β_{T - t_{k}})}{min \{1, {\bar{β}}_{T - t_{k + 1}}^{2}\}} & \leq κ^{2} \sum_{k = k^{*} + 1}^{N - 1} fn (β_{T - t_{k}}) \frac{{\bar{β}}_{T - t_{k}}^{2}}{{\bar{β}}_{T - t_{k + 1}}^{2}} \\ ≍ κ^{2} \sum_{k = k^{*} + 1}^{N - 1} fn (β_{T - t_{k}}) \end{matrix}$

where the last line follows because, by the Taylor expansion of $\bar{β}$ (noting that $\bar{β}$ is continuous since, by definition, it is the integral of $β_{t}$),

${\bar{β}}_{T - t_{k}} = {\bar{β}}_{T - t_{k + 1}} + O (t_{k + 1} - t_{k}) = {\bar{β}}_{T - t_{k + 1}} + O (κ), \forall k = 0, \dots, N - 1 .$

(A8)

Since ${\bar{β}}_{T - t_{k + 1}} \geq {\bar{β}}_{δ} > 0$ , we have ${\bar{β}}_{T - t_{k}}^{2} / {\bar{β}}_{T - t_{k + 1}}^{2} = 1 + O (κ) = 1 + o (1)$ for all $k = 0, \dots, N - 1$ .
3. Case 3: $k = k^{*}$. This implies that ${\bar{β}}_{T - t_{k + 1}} \leq 1$, ${\bar{β}}_{T - t_{k}} > 1$, and $t_{k + 1} - t_{k} \leq κ$. Also, from (A8), ${\bar{β}}_{T - t_{k}} \leq 1 + O (κ)$. Then,

${(t_{k + 1} - t_{k})}^{2} \frac{fn (β_{T - t_{k}})}{min \{1, {\bar{β}}_{T - t_{k + 1}}^{2}\}} ≍ κ^{2} fn (β_{T - t_{k}}) \frac{{\bar{β}}_{T - t_{k}}^{2}}{{\bar{β}}_{T - t_{k + 1}}^{2}} ≍ κ^{2} fn (β_{T - t_{k^{*}}}) .$

Summing up all three cases, we have

\sum_{k = 0}^{N - 1} {(t_{k + 1} - t_{k})}^{2} \frac{fn (β_{T - t_{k}})}{min \{1, {\bar{β}}_{T - t_{k + 1}}^{2}\}} ≲ κ^{2} \sum_{k = 0}^{N - 1} fn (β_{T - t_{k}}) .

Now, under Assumption 2, since

β_{t}

and

| β_{t}^{'} |

have upper bounds independent of the system parameters, we have

\sum_{k = 0}^{N - 1} {(t_{k + 1} - t_{k})}^{2} \frac{fn (β_{T - t_{k}})}{min \{1, {\bar{β}}_{T - t_{k + 1}}^{2}\}} ≲ κ^{2} N .

Next, we choose

t_{k + 1} - t_{k} \leq κ min \{1, {\bar{β}}_{T - t_{k}}\}

and we need to determine an upper bound for the number of steps. Define

t^{*} = t_{k^{*}}

. When

k \leq k^{*}

, the number of steps is upper-bounded as

k^{*} = : N_{1} = ⌊\frac{T - t^{*}}{κ}⌋ \leq \frac{T}{κ} .

When

k > k^{*}

, note that

t_{k + 1}

is chosen according to the following fixed-point iteration:

T - t_{k + 1} = h (T - t_{k})

, where

h (z) : = z - κ {\bar{β}}_{z}

. Obviously

z = 0

is a fixed point. By Banach’s fixed-point theorem,

T - t_{N_{2}} \leq (T - t^{*}) \cdot {(max_{z \in [δ, T - t^{*}]} | h^{'} (z) |)}^{N_{2}} .

Here note that

h^{'} (z) = 1 - κ β_{z} \leq 1 - κ β^{*}

, where we recall that

β^{*} = {min}_{z \in [δ, {\bar{β}}^{- 1} (1)]} β_{z}

. Thus, in order to reach that

T - t_{N_{2}} ≍ δ

,

N_{2} ≲ {log}_{1 - κ β^{*}} \frac{δ}{T - t^{*}} ≍ \frac{log (T - t^{*}) + log δ^{- 1}}{κ β^{*}} .
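The contraction argument can be illustrated by running the fixed-point iteration directly. The sketch below assumes, purely for illustration, a homogeneous schedule in which the accumulated noise at z equals z (so that β* = 1); the function name and numerical values are hypothetical.

```python
import math


def steps_to_reach(z0, delta, kappa, beta_bar):
    """Iterate z <- z - kappa * beta_bar(z) until z <= delta; return the step count."""
    z, n = z0, 0
    while z > delta:
        z -= kappa * beta_bar(z)
        n += 1
    return n


# Hypothetical setting: beta_t = 1, hence beta_bar(z) = z and beta* = 1.
Z0, DELTA, KAPPA = 1.0, 1e-3, 0.01
n2 = steps_to_reach(Z0, DELTA, KAPPA, beta_bar=lambda z: z)

# Logarithmic step count predicted by the contraction factor 1 - kappa * beta*.
predicted = math.log(Z0 / DELTA) / KAPPA
print(n2, predicted)
```

In this setting each step shrinks the remaining time by the factor 1 − κ, so the count grows only logarithmically in the inverse of δ.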

Therefore, the total number of steps to take satisfies that

N = N_{1} + N_{2} ≲ \frac{T + (log T + log δ^{- 1}) / β^{*}}{κ} .

Finally, since under Assumption 2 we have

β^{*} ≳ 1

, this yields

N ≲ \frac{T + log δ^{- 1}}{κ} .

Then, we have

\sum_{k = 0}^{N - 1} {(t_{k + 1} - t_{k})}^{2} \frac{fn (β_{T - t_{k}})}{min \{1, {\bar{β}}_{T - t_{k + 1}}^{2}\}} ≲ κ (T + log δ^{- 1}) .

Note that

T ≲ poly (log (d / \sqrt{ε}))

. Choosing the corresponding T and

κ

yields the desired result.

Appendix D. Proof of Theorem 3

We begin from Theorem 1, which also applies to the constant step size:

\begin{matrix} TV (q_{0}, p_{T - δ}) ≲ min {d & , \sqrt{d S}} e^{- {\bar{β}}_{T}} + \sqrt{ε_{TV}} \\ + d S \sum_{k = 0}^{N - 1} max {1, {\bar{β}}_{T - t_{k + 1}}^{- 2}} (| β_{T - t_{k}}^{'} | + d β_{T - t_{k}}^{2}) {(t_{k + 1} - t_{k})}^{2} + d {\bar{β}}_{δ} . \end{matrix}

We recall from (7) that

{\bar{β}}_{t} : = β_{\min}^{1 - t} β_{\max}^{t}, β_{t} = β_{\min}^{1 - t} β_{\max}^{t} log (β_{\max} / β_{\min}), β_{t}^{'} = β_{\min}^{1 - t} β_{\max}^{t} {log}^{2} (β_{\max} / β_{\min}) .

Thus, we have

{\bar{β}}_{T} = β_{\max}

. Also, note that

{\bar{β}}_{δ} = β_{\min}^{1 - δ} β_{\max}^{δ} \overset{δ \to 0}{\to} β_{\min} > 0,

which implies that there will be an asymptotic mismatch for vanishing

δ

. What remains is to work out the summation term, and we proceed as follows.

Note that

\begin{matrix} \sum_{k = 0}^{N - 1} κ max {1, {\bar{β}}_{T - t_{k + 1}}^{- 2}} (| β_{T - t_{k}}^{'} | + d β_{T - t_{k}}^{2}) \\ = {log}^{2} (β_{\max} / β_{\min}) \sum_{k = 0}^{N - 1} κ max {1, β_{\min}^{- 2 t_{k + 1}} β_{\max}^{- 2 (1 - t_{k + 1})}} (β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} + d β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})}) \\ ≍ {log}^{2} (β_{\max} / β_{\min}) \int_{δ}^{1} max {1, β_{\min}^{- 2 (1 - t)} β_{\max}^{- 2 t}} (β_{\min}^{1 - t} β_{\max}^{t} + d β_{\min}^{2 (1 - t)} β_{\max}^{2 t}) d t . \end{matrix}

Since

β_{\min} < 1 < β_{\max}

, and

{\bar{β}}_{t}

is smooth, there exists

t^{*}

such that

{\bar{β}}_{t^{*}} = 1

. Then, the integral becomes

\begin{matrix} log (β_{\max} / β_{\min}) \int_{δ}^{1} max {1, β_{\min}^{- 2 (1 - t)} β_{\max}^{- 2 t}} (β_{\min}^{1 - t} β_{\max}^{t} + d β_{\min}^{2 (1 - t)} β_{\max}^{2 t}) d t \\ = log (β_{\max} / β_{\min}) \int_{δ}^{t^{*}} β_{\min}^{- 2 (1 - t)} β_{\max}^{- 2 t} (β_{\min}^{1 - t} β_{\max}^{t} + d) d t + \int_{t^{*}}^{1} β_{\min}^{1 - t} β_{\max}^{t} + d β_{\min}^{2 (1 - t)} β_{\max}^{2 t} d t \\ = - {\frac{1}{{\bar{β}}_{t}}|}_{δ}^{t^{*}} - {\frac{d}{2 {\bar{β}}_{t}^{2}}|}_{δ}^{t^{*}} + {{\bar{β}}_{t}|}_{t^{*}}^{1} + \frac{d}{2} {{\bar{β}}_{t}^{2}|}_{t^{*}}^{1} \\ ≍ β_{\min}^{- 1} - 1 + \frac{d}{2} {(β_{\min}^{- 1} - 1)}^{2} + (β_{\max} - 1) + \frac{d}{2} {(β_{\max} - 1)}^{2} \\ ≲ d ({(β_{\min}^{- 1} - 1)}^{2} + {(β_{\max} - 1)}^{2}) . \end{matrix}
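The two pieces of this integral have explicit antiderivatives, which can be checked against a numerical quadrature. The sketch below evaluates the antiderivatives exactly at the endpoints, so the result agrees with the displayed expression up to the constants hidden in the order notation. The midpoint rule, the helper names, and the parameter values are assumptions for illustration.

```python
import math


def midpoint_integral(f, a, b, n=20000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Hypothetical parameter values.
BETA_MIN, BETA_MAX, D, DELTA = 0.05, 5.0, 4, 1e-3
lam = math.log(BETA_MAX / BETA_MIN)


def beta_bar(t):
    """Geometric accumulated noise beta_min^(1 - t) * beta_max^t."""
    return BETA_MIN ** (1.0 - t) * BETA_MAX**t


# Time at which the accumulated noise crosses one.
t_star = math.log(1.0 / BETA_MIN) / lam

low = midpoint_integral(lambda t: lam * beta_bar(t) ** -2 * (beta_bar(t) + D), DELTA, t_star)
high = midpoint_integral(lambda t: lam * (beta_bar(t) + D * beta_bar(t) ** 2), t_star, 1.0)
numeric = low + high

b_delta = beta_bar(DELTA)
closed_form = (
    (1.0 / b_delta - 1.0)
    + 0.5 * D * (b_delta**-2 - 1.0)
    + (BETA_MAX - 1.0)
    + 0.5 * D * (BETA_MAX**2 - 1.0)
)
print(numeric, closed_form)
```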

Combining all of the above, we finally have, for each

δ > 0

,

\begin{matrix} TV (q_{0}, p_{T - δ}) ≲ min {d, & \sqrt{d S}} e^{- β_{\max}} + \sqrt{ε_{TV}} \\ + κ d^{2} S log (β_{\max} / β_{\min}) ({(β_{\min}^{- 1} - 1)}^{2} + {(β_{\max} - 1)}^{2}) + d β_{\min} . \end{matrix}

Since the right-hand side does not depend on

δ

, taking the limit as it tends to

0^{+}

yields the desired result.

Appendix E. Proof of Theorem 4

Continuing the proof of Theorem 3, we again have

\begin{matrix} \sum_{k = 0}^{N - 1} {(t_{k + 1} - t_{k})}^{2} max {1, {\bar{β}}_{T - t_{k + 1}}^{- 2}} (| β_{T - t_{k}}^{'} | + d β_{T - t_{k}}^{2}) \\ = {log}^{2} (β_{\max} / β_{\min}) \sum_{k = 0}^{N - 1} {(t_{k + 1} - t_{k})}^{2} max {1, {\bar{β}}_{T - t_{k + 1}}^{- 2}} (β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} + d β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})}) \end{matrix}

Now, suppose we choose step sizes as in Theorem 2:

t_{k + 1} - t_{k} = κ min \{1, {\bar{β}}_{T - t_{k}}\} = κ min \{1, β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}}\} .

(A9)

Again define

k^{*} : = sup \{k : {\bar{β}}_{T - t_{k}} = β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} > 1\},

with

t^{*}

such that

β_{\min}^{t^{*}} β_{\max}^{1 - t^{*}} = 1

. Then, from the proof of Theorem 2, using this step size, we have

\begin{matrix} {log}^{2} (β_{\max} / β_{\min}) \sum_{k = 0}^{N - 1} {(t_{k + 1} - t_{k})}^{2} max {1, {\bar{β}}_{T - t_{k + 1}}^{- 2}} (β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} + d β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})}) \\ ≲ {log}^{2} (β_{\max} / β_{\min}) κ^{2} \sum_{k = 0}^{N - 1} (β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} + d β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})}) \end{matrix}

For

k < k^{*}

,

t_{k + 1} - t_{k} = κ

. This is the regime similar to Theorem 3, where we have

\begin{matrix} \sum_{k = 0}^{k^{*} - 1} (β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} + d β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})}) \\ ≍ \frac{1}{κ} \int_{t^{*}}^{1} β_{\min}^{1 - t} β_{\max}^{t} + d β_{\min}^{2 (1 - t)} β_{\max}^{2 t} d t \\ = \frac{1}{κ log (β_{\max} / β_{\min})} ((β_{\max} - 1) + \frac{d}{2} {(β_{\max} - 1)}^{2}) . \end{matrix}

Meanwhile, the number of steps in this regime,

N_{1}

, satisfies that

N_{1} ≍ \frac{1 - t^{*}}{κ} \leq \frac{1}{κ} .

In what follows, the goal is to provide upper bounds on the two summation terms as well as

N_{2}

in the regime where

k \geq k^{*}

. The key idea is to use the property that

{\bar{β}}_{t}

,

β_{t}

, and

β_{t}^{'}

have the same exponential form in t.

Part 1: On the number of steps N_{2}.

Let us define a sequence of auxiliary variables

y_{k}

as follows. Let

y_{k} : = {\bar{β}}_{1 - t_{k}}^{- 1} = {(β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}})}^{- 1} .

Here, by definition of

k^{*}

, we have

y_{k^{*}} ≍ 1

. Also notice that

y_{k}

is increasing in k, with

\frac{y_{k + 1}}{y_{k}} = β_{\min}^{- (t_{k + 1} - t_{k})} β_{\max}^{t_{k + 1} - t_{k}} = e^{λ (t_{k + 1} - t_{k})},

where

λ : = log (β_{\max} / β_{\min})

. With the step size in (A9) and since

k \geq k^{*}

, we have

\frac{y_{k + 1}}{y_{k}} = e^{λ κ β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}}} = e^{\frac{λ κ}{y_{k}}} \geq 1 + \frac{λ κ}{y_{k}},

which further implies that

y_{k + 1} \geq y_{k} + λ κ .

For the terminal condition, we have

y_{N} ≍ {(β_{\min}^{1 - δ} β_{\max}^{δ})}^{- 1}

. This implies that

N_{2} = N - k^{*} ≲ \frac{{(β_{\min}^{1 - δ} β_{\max}^{δ})}^{- 1} - 1}{λ κ} \overset{δ \to 0}{\to} \frac{β_{\min}^{- 1} - 1}{λ κ} .

Part 2: On the first-order sum.

With (A9) and when

k \geq k^{*}

, the first-order sum is indeed a telescoping sum:

\sum_{k = k^{*}}^{N - 1} β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} = \sum_{k = k^{*}}^{N - 1} \frac{t_{k + 1} - t_{k}}{κ} = \frac{1}{κ} (t_{N} - t_{k^{*}}) \leq \frac{1 - δ}{κ} .

Part 3: On the second-order sum.

The second-order sum needs some extra work. The idea is to upper-bound it with the first-order sum using trajectory smoothness. Write

f_{k} : = y_{k}^{- 1} = β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} .

(A10)

Note that

f_{k} \leq 1

for

k > k^{*}

(for

k = k^{*}

, we get

f_{k} = 1 + o (1)

), and we have a recursion similar to the one above:

f_{k + 1} = f_{k} e^{- λ κ f_{k}} .

The following lemma characterizes an important property of

f_{k}

, which will be useful for the analysis.

Lemma A5.

With

f_{k}

defined in (A10) and when

k > k^{*}

, we have

f_{k}^{2} \leq \frac{f_{k} - f_{k + 1}}{1 - e^{- λ κ}},

where

λ : = log (β_{\max} / β_{\min})

.

Proof.

See Appendix G. □
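The inequality in Lemma A5 can be checked numerically along the recursion for f_k. The sketch below starts from f = 1 and uses a hypothetical value for the product λκ; the names are illustrative, and a small tolerance absorbs floating-point rounding at the starting point, where the inequality holds with equality.

```python
import math


def f_sequence(f0, lam_kappa, n):
    """Iterate f_{k+1} = f_k * exp(-lam_kappa * f_k) for n steps."""
    fs = [f0]
    for _ in range(n):
        fs.append(fs[-1] * math.exp(-lam_kappa * fs[-1]))
    return fs


LAM_KAPPA = 0.05  # hypothetical value of lambda * kappa
fs = f_sequence(1.0, LAM_KAPPA, 500)

# 1 - exp(-lambda * kappa), computed stably via expm1.
denom = -math.expm1(-LAM_KAPPA)

# Slack in f_k^2 <= (f_k - f_{k+1}) / (1 - exp(-lambda * kappa)) for each k.
gaps = [(fk - fnext) / denom - fk**2 for fk, fnext in zip(fs[:-1], fs[1:])]

# Summing the lemma over k telescopes the right-hand side.
second_order_sum = sum(fk**2 for fk in fs[:-1])
telescoped = (fs[0] - fs[-1]) / denom
print(min(gaps), second_order_sum, telescoped)
```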

With Lemma A5, the second-order sum becomes

\begin{matrix} d \sum_{k = k^{*}}^{N - 1} β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})} & = d \sum_{k = k^{*}}^{N - 1} f_{k}^{2} \\ ≲ d \sum_{k = k^{*}}^{N - 1} \frac{f_{k} - f_{k + 1}}{1 - e^{- λ κ}} \\ = \frac{d}{1 - e^{- λ κ}} (f_{k^{*}} - f_{N}) \\ ≲ \frac{d}{λ κ} (1 - δ) . \end{matrix}

Part 4: Combine all previous parts.

The overall discretization error becomes

\begin{matrix} {log}^{2} (β_{\max} / β_{\min}) κ^{2} \sum_{k = 0}^{N - 1} (β_{\min}^{t_{k}} β_{\max}^{1 - t_{k}} + d β_{\min}^{2 t_{k}} β_{\max}^{2 (1 - t_{k})}) \\ ≲ {log}^{2} (β_{\max} / β_{\min}) κ^{2} \frac{1}{κ log (β_{\max} / β_{\min})} ((β_{\max} - 1) + \frac{d}{2} {(β_{\max} - 1)}^{2}) \\ + {log}^{2} (β_{\max} / β_{\min}) κ^{2} (\frac{1 - δ}{κ} + \frac{d}{log (β_{\max} / β_{\min}) κ} (1 - δ)) \\ ≲ d log (β_{\max} / β_{\min}) κ ({(β_{\max} - 1)}^{2} + 1 - δ) \\ \overset{δ \to 0}{\to} d log (β_{\max} / β_{\min}) κ ({(β_{\max} - 1)}^{2} + 1) . \end{matrix}

Also note that the total number of steps satisfies

N = N_{1} + N_{2} ≲ \frac{1}{κ} (1 + \frac{β_{\min}^{- 1} - 1}{log (β_{\max} / β_{\min})}) .

The proof is now complete.

Appendix F. Auxiliary Proofs

In this section, we provide proofs of all the auxiliary lemmas in this paper.

Appendix F.1. Proof of Lemma A1

Write

ϵ_{T} : = e^{- {\bar{β}}_{T}}

. Let

u_{S}

denote the uniform distribution on

[S]

, and let

u_{d} = u_{S}^{\otimes d}

. By assumption,

p_{0} = u_{d}

.

First, to obtain the analytical solution for the conditional probability, we can solve the Kolmogorov forward equation for the i-th dimension (

\forall i \in [d]

):

\frac{d}{d t} q_{t | 0}^{i} (z | a) = \sum_{\tilde{z} \in [S]} q_{t | 0}^{i} (\tilde{z} | a) R_{t}^{tok} (\tilde{z}, z),

whose solution is

\begin{matrix} q_{t | 0}^{i} (z | a) & = [exp (\int_{0}^{t} R_{s}^{tok} d s)] (a, z) \\ = [exp ({\bar{β}}_{t} R_{base})] (a, z) \\ = [P exp ({\bar{β}}_{t} Λ) P^{- 1}] (a, z) \\ = \{\begin{matrix} S^{- 1} (1 - e^{- {\bar{β}}_{t}}) & if z \neq a \\ S^{- 1} (1 + (S - 1) e^{- {\bar{β}}_{t}}) & if z = a \end{matrix} \end{matrix}

(A11)

where we recall that

R_{t}^{tok} = β_{t} R_{base}

and we denote the eigendecomposition of

R_{base} = S^{- 1} 1_{S} 1_{S}^{⊺} - I_{S}

as

R_{base} = P Λ P^{- 1}

. Equivalently, we can write

q_{T | 0}^{i} (z | a) = u_{S} (z) + ϵ_{T} (1 \{z = a\} - u_{S} (z)) .

Thus, for fixed

x_{0}, x_{T} \in {[S]}^{d}

,

q_{T | 0} (x_{T} | x_{0}) = ⨂_{i = 1}^{d} q_{T | 0}^{i} (x_{T}^{i} | x_{0}^{i}) .
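The closed form in (A11) can also be recovered without the eigendecomposition, using the fact that the base rate matrix satisfies a simple quadratic identity. The following worked equation is a sketch of this alternative route, stated in the notation of (A11).

```latex
\begin{aligned}
R_{base}^{2}
  &= \bigl(S^{-1} 1_{S} 1_{S}^{\top} - I_{S}\bigr)^{2}
   = S^{-1} 1_{S} 1_{S}^{\top} - 2 S^{-1} 1_{S} 1_{S}^{\top} + I_{S}
   = - R_{base}, \\
\exp(\bar{\beta}_{t} R_{base})
  &= I_{S} + \sum_{n \geq 1} \frac{\bar{\beta}_{t}^{n}}{n!} (-1)^{n-1} R_{base}
   = I_{S} + \bigl(1 - e^{-\bar{\beta}_{t}}\bigr) R_{base} .
\end{aligned}
```

Reading off the entries gives S^{-1}(1 − e^{−β̄_t}) off the diagonal and S^{-1}(1 + (S − 1) e^{−β̄_t}) on the diagonal, in agreement with (A11).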

To upper-bound the initialization error, we first reduce to the case of a fixed initialization. Since

q_{T} (x_{T}) = \sum_{x_{0} \in {[S]}^{d}} q_{0} (x_{0}) q_{T | 0} (x_{T} | x_{0}),

by Jensen’s inequality we have

TV (q_{T}, p_{0}) \leq \sum_{x_{0} \in {[S]}^{d}} q_{0} (x_{0}) TV (q_{T | 0} (\cdot | x_{0}), p_{0}) .

Therefore it suffices to upper-bound

TV (q_{T | 0} (\cdot | x_{0}), u_{d})

uniformly over

x_{0}

.

For one coordinate, we have

\begin{matrix} TV (q_{T | 0}^{i} (\cdot | a), u_{S}) & = \frac{1}{2} \sum_{z \in [S]} |q_{T | 0}^{i} (z | a) - S^{- 1}| \\ = \frac{1}{2} \cdot \frac{S - 1}{S} ϵ_{T} + \frac{1}{2} \cdot (S - 1) \frac{1}{S} ϵ_{T} \\ = \frac{S - 1}{S} ϵ_{T} . \end{matrix}

Using the standard tensorization bound for total variation of product measures,

TV (⨂_{i = 1}^{d} μ_{i}, ⨂_{i = 1}^{d} ν_{i}) \leq 1 - \prod_{i = 1}^{d} (1 - TV (μ_{i}, ν_{i})),

we obtain

TV (q_{T | 0} (\cdot | x_{0}), u_{d}) \leq 1 - {(1 - \frac{S - 1}{S} ϵ_{T})}^{d} .

Consequently,

TV (q_{T}, p_{0}) \leq 1 - {(1 - \frac{S - 1}{S} ϵ_{T})}^{d} \leq d \frac{S - 1}{S} ϵ_{T} \leq d e^{- {\bar{β}}_{T}} .

We can also derive a complementary bound using the likelihood ratio. For fixed

x_{0}

, define

L_{x_{0}} (x_{T}) : = \frac{q_{T | 0} (x_{T} | x_{0})}{u_{d} (x_{T})} .

For one coordinate,

\frac{q_{T | 0}^{i} (z | a)}{u_{S} (z)} = 1 + ϵ_{T} (S 1 \{z = a\} - 1) .

Hence,

L_{x_{0}} (x_{T}) = \prod_{i = 1}^{d} [1 + ϵ_{T} (S 1 \{x_{T}^{i} = x_{0}^{i}\} - 1)] .

By Cauchy–Schwarz,

TV (q_{T | 0} (\cdot | x_{0}), u_{d}) = \frac{1}{2} E_{x_{T} \sim u_{d}} [| L_{x_{0}} (x_{T}) - 1 |] \leq \frac{1}{2} \sqrt{E_{x_{T} \sim u_{d}} [{(L_{x_{0}} (x_{T}) - 1)}^{2}]} .

Since

x_{T}^{1}, \dots, x_{T}^{d}

are independent under

u_{d}

,

\begin{matrix} E_{x_{T} \sim u_{d}} [L_{x_{0}} {(x_{T})}^{2}] & = \prod_{i = 1}^{d} E_{z \sim u_{S}} [{(1 + ϵ_{T} (S 1 {z = x_{0}^{i}} - 1))}^{2}] \\ = {[\frac{1}{S} {(1 + (S - 1) ϵ_{T})}^{2} + \frac{S - 1}{S} {(1 - ϵ_{T})}^{2}]}^{d} \\ = {(1 + (S - 1) ϵ_{T}^{2})}^{d} . \end{matrix}

Also,

E_{u_{d}} [L_{x_{0}}] = 1

. Therefore,

E_{u_{d}} [{(L_{x_{0}} - 1)}^{2}] = {(1 + (S - 1) ϵ_{T}^{2})}^{d} - 1,

and thus

TV (q_{T | 0} (\cdot | x_{0}), u_{d}) \leq \frac{1}{2} \sqrt{{(1 + (S - 1) ϵ_{T}^{2})}^{d} - 1} .

Again using convexity over

x_{0} \sim q_{0}

, this gives

TV (q_{T}, p_{0}) \leq \frac{1}{2} \sqrt{{(1 + (S - 1) ϵ_{T}^{2})}^{d} - 1} .

Combining the two bounds yields

TV (q_{T}, p_{0}) \leq [1 - {(1 - \frac{S - 1}{S} ϵ_{T})}^{d}] \land \frac{1}{2} \sqrt{{(1 + (S - 1) ϵ_{T}^{2})}^{d} - 1} .

Finally, since

1 - {(1 - r)}^{d} \leq d r

and

{(1 + a)}^{d} \leq e^{a d}

, we obtain

TV (q_{T}, p_{0}) ≲ min \{d ϵ_{T}, \sqrt{d S} ϵ_{T}\} = min \{d, \sqrt{d S}\} ϵ_{T} .
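
The second-moment computation behind the likelihood-ratio bound can be checked in the same way. The sketch below is a hedged illustration under assumed toy values of S, d, ϵ_T and x_0 (none of which come from the paper): it evaluates $E_{u_{d}}[L_{x_{0}}]$, $E_{u_{d}}[L_{x_{0}}^{2}]$ and the exact total variation distance by enumeration and compares them with the closed forms derived above.

```python
import itertools
import math


def likelihood_ratio(xT, x0, S, eps):
    """L_{x0}(xT) = prod_i [1 + eps * (S * 1{xT^i = x0^i} - 1)]."""
    return math.prod(
        1.0 + eps * (S * (1.0 if z == a else 0.0) - 1.0) for z, a in zip(xT, x0)
    )


def moments_under_uniform(x0, S, eps):
    """Return (E[L], E[L^2], 0.5 * E|L - 1|) under the uniform law u_d."""
    d = len(x0)
    u = 1.0 / S ** d
    first = second = abs_dev = 0.0
    for xT in itertools.product(range(S), repeat=d):
        L = likelihood_ratio(xT, x0, S, eps)
        first += u * L
        second += u * L * L
        abs_dev += u * abs(L - 1.0)
    return first, second, 0.5 * abs_dev


# Illustrative values only (assumptions, not taken from the paper).
S, d, eps = 4, 3, 0.1
x0 = (2, 0, 3)

first, second, tv = moments_under_uniform(x0, S, eps)
closed_form = (1.0 + (S - 1) * eps ** 2) ** d          # claimed value of E[L^2]
chi_bound = 0.5 * math.sqrt(closed_form - 1.0)         # Cauchy-Schwarz bound
tensor_bound = 1.0 - (1.0 - (S - 1) / S * eps) ** d    # tensorization bound

print(f"E[L] = {first:.6f}, E[L^2] = {second:.6f} (closed form {closed_form:.6f})")
print(f"TV = {tv:.6f}, min of the two bounds = {min(tensor_bound, chi_bound):.6f}")
```

Which of the two bounds is smaller depends on how d compares with S, which is why the final statement takes their minimum.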

Appendix F.2. Proof of Lemma A2

Write

P : = \{i \in [d] : y^{i} \neq x^{i}\}

. First, we note that

\begin{matrix} \frac{q_{t} (y)}{q_{t} (x)} & = \frac{1}{q_{t} (x)} \sum_{x_{0} \in {[S]}^{d}} q_{0} (x_{0}) q_{t | 0} (y | x_{0}) \\ & \overset{(i)}{=} \frac{1}{q_{t} (x)} \sum_{x_{0} \in {[S]}^{d}} q_{0} (x_{0}) \prod_{i \in [d]} q_{t | 0}^{i} (y^{i} | x_{0}^{i}) \\ & = \frac{1}{q_{t} (x)} \sum_{x_{0} \in {[S]}^{d}} q_{0} (x_{0}) (\prod_{i \in [d]} q_{t | 0}^{i} (x^{i} | x_{0}^{i})) (\prod_{i \in P} \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x^{i} | x_{0}^{i})}) \\ & \overset{(i i)}{=} \sum_{x_{0} \in {[S]}^{d}} \frac{q_{0} (x_{0}) q_{t | 0} (x | x_{0})}{q_{t} (x)} (\prod_{i \in P} \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x^{i} | x_{0}^{i})}) \\ & = E_{x_{0} \sim q_{0 | t} (\cdot | x)} \prod_{i \in P} \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x^{i} | x_{0}^{i})} \end{matrix}

(A12)

where both

(i)

and

(i i)

follow because with the chosen

R_{t}

in (2), each dimension propagates independently in the forward process (cf. [11] (Prop. 3)). Recall the forward conditional probability under non-homogeneity in (A11), which yields

\frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x^{i} | x_{0}^{i})} = \{\begin{matrix} 1 & if x^{i} \neq x_{0}^{i} and y^{i} \neq x_{0}^{i} \\ \frac{1 - e^{- {\bar{β}}_{t}}}{1 + (S - 1) e^{- {\bar{β}}_{t}}} & if x^{i} = x_{0}^{i} but y^{i} \neq x_{0}^{i} \\ \frac{1 + (S - 1) e^{- {\bar{β}}_{t}}}{1 - e^{- {\bar{β}}_{t}}} & if x^{i} \neq x_{0}^{i} but y^{i} = x_{0}^{i} \end{matrix} .

(A13)

Now, since

e^{- {\bar{β}}_{t}} \geq 0

, the second case satisfies that

\frac{1 - e^{- {\bar{β}}_{t}}}{1 + (S - 1) e^{- {\bar{β}}_{t}}} \leq 1

. For the third case, we have

\begin{matrix} \frac{1 + (S - 1) e^{- {\bar{β}}_{t}}}{1 - e^{- {\bar{β}}_{t}}} & = 1 + S \cdot \frac{1}{e^{{\bar{β}}_{t}} - 1} \leq \{\begin{matrix} S + 1 & if e^{{\bar{β}}_{t}} \geq 2 \\ 1 + \frac{S}{{\bar{β}}_{t}} & otherwise \end{matrix} \\ & ≲ S \cdot max \{1, {({\bar{β}}_{t})}^{- 1}\} . \end{matrix}

Since these bounds do not depend on i, the proof is now complete.
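
The identity (A12) and the bound on the third case of (A13) can be illustrated numerically. The sketch below is not part of the original argument: it assumes the coordinate kernel $q_{t|0}^{i}(z|a) = S^{-1} + e^{-{\bar{β}}_{t}}(1\{z = a\} - S^{-1})$, which is the form consistent with (A13), and uses a randomly generated toy initial distribution q_0. The constant 2 in the last check is our own explicit choice for the hidden constant in ≲.

```python
import itertools
import math
import random


def kernel(z, a, S, eps):
    """Coordinate kernel q^i_{t|0}(z | a), with eps = exp(-beta_bar_t)."""
    return 1.0 / S + eps * ((1.0 if z == a else 0.0) - 1.0 / S)


def joint_kernel(x, x0, S, eps):
    """Product kernel q_{t|0}(x | x0) over independent coordinates."""
    return math.prod(kernel(z, a, S, eps) for z, a in zip(x, x0))


def ratio_two_ways(q0, x, y, S, eps):
    """Return q_t(y)/q_t(x) computed directly and via the posterior average in (A12)."""
    qt_x = sum(w * joint_kernel(x, x0, S, eps) for x0, w in q0.items())
    qt_y = sum(w * joint_kernel(y, x0, S, eps) for x0, w in q0.items())
    P = [i for i in range(len(x)) if x[i] != y[i]]  # coordinates where x and y differ
    posterior_avg = 0.0
    for x0, w in q0.items():
        posterior = w * joint_kernel(x, x0, S, eps) / qt_x  # q_{0|t}(x0 | x)
        ratios = math.prod(
            kernel(y[i], x0[i], S, eps) / kernel(x[i], x0[i], S, eps) for i in P
        )
        posterior_avg += posterior * ratios
    return qt_y / qt_x, posterior_avg


def third_case(S, beta_bar):
    """Ratio in the third case of (A13)."""
    e = math.exp(-beta_bar)
    return (1.0 + (S - 1) * e) / (1.0 - e)


# Toy setup (assumptions, not from the paper): random q0 on [S]^d.
S, d, beta_bar = 3, 3, 0.4
eps = math.exp(-beta_bar)
rng = random.Random(0)
states = list(itertools.product(range(S), repeat=d))
weights = [rng.random() for _ in states]
q0 = {x0: w / sum(weights) for x0, w in zip(states, weights)}

direct, via_posterior = ratio_two_ways(q0, (0, 1, 2), (1, 1, 0), S, eps)
print(f"direct ratio: {direct:.12f}, posterior average: {via_posterior:.12f}")

for b in (0.01, 0.1, 0.5, 1.0, 5.0):
    print(b, third_case(S, b), 2 * S * max(1.0, 1.0 / b))
```

Here x and y differ in two coordinates, so the product over P has two factors; the same function also covers the Hamming-distance-one case used later.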

Appendix F.3. Proof of Lemma A3

First, fix

x_{t}

and y and let i be the index such that

x_{t}^{i} \neq y^{i}

. From (A12), we have,

\begin{matrix} | \frac{q_{t} (y)}{q_{t} (x_{t})} - \frac{q_{s} (y)}{q_{s} (x_{t})} | & = | E_{x_{0} \sim q_{0 | t} (\cdot | x_{t})} [\frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x_{t}^{i} | x_{0}^{i})}] - E_{{\tilde{x}}_{0} \sim q_{0 | s} (\cdot | x_{t})} [\frac{q_{s | 0}^{i} (y^{i} | {\tilde{x}}_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | {\tilde{x}}_{0}^{i})}] | \\ \leq E_{x_{0} \sim q_{0 | t} (\cdot | x_{t})} | \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x_{t}^{i} | x_{0}^{i})} - \frac{q_{s | 0}^{i} (y^{i} | x_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | x_{0}^{i})} | \\ + | E_{\begin{matrix} x_{0} \sim q_{0 | t} (\cdot | x_{t}) \\ {\tilde{x}}_{0} \sim q_{0 | s} (\cdot | x_{t}) \end{matrix}} [\frac{q_{s | 0}^{i} (y^{i} | x_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | x_{0}^{i})} - \frac{q_{s | 0}^{i} (y^{i} | {\tilde{x}}_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | {\tilde{x}}_{0}^{i})}] | . \end{matrix}

(A14)

For the first term in (A14), we recall the expression of the likelihood ratio in (A13); thus, for any fixed

x_{0}

,

x_{t}

, and y,

| \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x_{t}^{i} | x_{0}^{i})} - \frac{q_{s | 0}^{i} (y^{i} | x_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | x_{0}^{i})} | = \{\begin{matrix} 0 & if x_{t}^{i} \neq x_{0}^{i} and y^{i} \neq x_{0}^{i} \\ | \frac{1 - e^{- {\bar{β}}_{t}}}{1 + (S - 1) e^{- {\bar{β}}_{t}}} - \frac{1 - e^{- {\bar{β}}_{s}}}{1 + (S - 1) e^{- {\bar{β}}_{s}}} | & if x_{t}^{i} = x_{0}^{i} but y^{i} \neq x_{0}^{i} \\ | \frac{1 + (S - 1) e^{- {\bar{β}}_{t}}}{1 - e^{- {\bar{β}}_{t}}} - \frac{1 + (S - 1) e^{- {\bar{β}}_{s}}}{1 - e^{- {\bar{β}}_{s}}} | & if x_{t}^{i} \neq x_{0}^{i} but y^{i} = x_{0}^{i} \end{matrix} .

Now, since

\begin{matrix} | \frac{\partial}{\partial t} (\frac{1 - e^{- {\bar{β}}_{t}}}{1 + (S - 1) e^{- {\bar{β}}_{t}}}) | & = \frac{S e^{{\bar{β}}_{t}} β_{t}}{{(S + e^{{\bar{β}}_{t}} - 1)}^{2}} ≲ β_{t} \\ | \frac{\partial}{\partial t} (\frac{1 + (S - 1) e^{- {\bar{β}}_{t}}}{1 - e^{- {\bar{β}}_{t}}}) | & = \frac{S e^{{\bar{β}}_{t}} β_{t}}{{(e^{{\bar{β}}_{t}} - 1)}^{2}} ≲ \frac{S β_{t}}{min {\{1, {\bar{β}}_{t}\}}^{2}}, \end{matrix}

we have

| \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x_{t}^{i} | x_{0}^{i})} - \frac{q_{s | 0}^{i} (y^{i} | x_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | x_{0}^{i})} | ≲ \frac{S β_{t}}{min {\{1, {\bar{β}}_{t}\}}^{2}} (t - s) .

Note that this term does not depend on d. Thus,

\begin{matrix} & E_{x_{t} \sim q_{t}} \sum_{\begin{matrix} y \neq x_{t} \\ Ham (y, x_{t}) = 1 \end{matrix}} E_{x_{0} \sim q_{0 | t} (\cdot | x_{t})} | \frac{q_{t | 0}^{i} (y^{i} | x_{0}^{i})}{q_{t | 0}^{i} (x_{t}^{i} | x_{0}^{i})} - \frac{q_{s | 0}^{i} (y^{i} | x_{0}^{i})}{q_{s | 0}^{i} (x_{t}^{i} | x_{0}^{i})} | R_{t} (y, x_{t}) \\ & ≲ d S β_{t}^{2} max \{1, {\bar{β}}_{t}^{- 2}\} (t - s) . \end{matrix}
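
The two time derivatives used above can be checked against central finite differences. The sketch below is only an illustration: the schedule $β_{t} = 0.5 + t$ (so that ${\bar{β}}_{t} = 0.5 t + t^{2}/2$) and the value S = 5 are hypothetical choices of ours, not schedules studied in the paper.

```python
import math

S = 5  # illustrative vocabulary size (assumption)


def beta(t):
    """Hypothetical noise rate beta_t = 0.5 + t (illustration only)."""
    return 0.5 + t


def beta_bar(t):
    """Accumulated noise: integral of beta over [0, t]."""
    return 0.5 * t + 0.5 * t * t


def ratio_leave(t):
    """Second case of (A13): (1 - e^{-bb}) / (1 + (S - 1) e^{-bb})."""
    e = math.exp(-beta_bar(t))
    return (1.0 - e) / (1.0 + (S - 1) * e)


def ratio_return(t):
    """Third case of (A13): (1 + (S - 1) e^{-bb}) / (1 - e^{-bb})."""
    e = math.exp(-beta_bar(t))
    return (1.0 + (S - 1) * e) / (1.0 - e)


def abs_derivative_leave(t):
    """Closed form S e^{bb} beta_t / (S + e^{bb} - 1)^2."""
    E = math.exp(beta_bar(t))
    return S * E * beta(t) / (S + E - 1.0) ** 2


def abs_derivative_return(t):
    """Closed form S e^{bb} beta_t / (e^{bb} - 1)^2."""
    E = math.exp(beta_bar(t))
    return S * E * beta(t) / (E - 1.0) ** 2


def central_diff(f, t, h=1e-6):
    """Second-order finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


for t in (0.3, 1.0, 2.0):
    fd_leave = abs(central_diff(ratio_leave, t))
    fd_return = abs(central_diff(ratio_return, t))
    print(t, fd_leave, abs_derivative_leave(t), fd_return, abs_derivative_return(t))
```

For this schedule both closed forms agree with the finite differences to several digits, and the stated upper bounds hold with constant 1.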

Now we turn to the second term in (A14). Write

f (z) : = \frac{q_{s | 0}^{i} (y^{i} | z)}{q_{s | 0}^{i} (x_{t}^{i} | z)}

for brevity (recall that

x_{t}

and y are fixed and thus omitted in this expression). Note that from (A13), an upper bound on

f (z)

is

sup_{y^{i}, x_{t}^{i}, z \in [S]} f (z) = sup_{y^{i}, x_{t}^{i}, z \in [S]} \frac{q_{s | 0}^{i} (y^{i} | z)}{q_{s | 0}^{i} (x_{t}^{i} | z)} ≲ S \cdot max \{1, {\bar{β}}_{s}^{- 1}\} .

Thus, the second term in (A14) can be upper-bounded (for each

y^{i}

) as

\begin{matrix} | E_{\begin{matrix} x_{0} \sim q_{0 | t} (\cdot | x_{t}) \\ {\tilde{x}}_{0} \sim q_{0 | s} (\cdot | x_{t}) \end{matrix}} [f (x_{0}^{i}) - f ({\tilde{x}}_{0}^{i})] | & = | \sum_{x_{0} \in {[S]}^{d}} f (x_{0}^{i}) (q_{0 | t} (x_{0} | x_{t}) - q_{0 | s} (x_{0} | x_{t})) | \\ & ≲ S max \{1, {\bar{β}}_{s}^{- 1}\} \sum_{x_{0} \in {[S]}^{d}} | q_{0 | t} (x_{0} | x_{t}) - q_{0 | s} (x_{0} | x_{t}) | . \end{matrix}

(A15)

Using Bayes’s rule, we have

\begin{matrix} \sum_{x_{0} \in {[S]}^{d}} | q_{0 | t} (x_{0} | x_{t}) - q_{0 | s} (x_{0} | x_{t}) | \\ = \sum_{x_{0} \in {[S]}^{d}} q_{0} (x_{0}) | \frac{q_{t | 0} (x_{t} | x_{0})}{q_{t} (x_{t})} - \frac{q_{s | 0} (x_{t} | x_{0})}{q_{s} (x_{t})} | \\ \leq \frac{1}{q_{t} (x_{t}) \cdot q_{s} (x_{t})} \sum_{x_{0}, y_{0} \in {[S]}^{d}} q_{0} (x_{0}) q_{0} (y_{0}) \\ \cdot | q_{t | 0} (x_{t} | x_{0}) q_{s | 0} (x_{t} | y_{0}) - q_{s | 0} (x_{t} | x_{0}) q_{t | 0} (x_{t} | y_{0}) | \\ \leq \frac{1}{q_{t} (x_{t}) \cdot q_{s} (x_{t})} E_{x_{0}, y_{0} \sim q_{0}} [| q_{t | 0} (x_{t} | x_{0}) - q_{s | 0} (x_{t} | x_{0}) | q_{s | 0} (x_{t} | y_{0}) \\ + | q_{s | 0} (x_{t} | y_{0}) - q_{t | 0} (x_{t} | y_{0}) | q_{s | 0} (x_{t} | x_{0})] \\ = \frac{1}{q_{t} (x_{t})} E_{x_{0} \sim q_{0}} | q_{t | 0} (x_{t} | x_{0}) - q_{s | 0} (x_{t} | x_{0}) | \\ + \frac{1}{q_{t} (x_{t})} E_{y_{0} \sim q_{0}} | q_{s | 0} (x_{t} | y_{0}) - q_{t | 0} (x_{t} | y_{0}) | \\ = \frac{2}{q_{t} (x_{t})} E_{x_{0} \sim q_{0}} | q_{t | 0} (x_{t} | x_{0}) - q_{s | 0} (x_{t} | x_{0}) | \end{matrix}

(A16)

Now, this term (without the constant factor 2) can be upper-bounded as

\begin{matrix} \frac{1}{q_{t} (x_{t})} E_{x_{0} \sim q_{0}} | q_{t | 0} (x_{t} | x_{0}) - q_{s | 0} (x_{t} | x_{0}) | \\ ≲ \frac{1}{q_{t} (x_{t})} (t - s) E_{x_{0} \sim q_{0}} | \frac{\partial}{\partial t} q_{t | 0} (x_{t} | x_{0}) | \\ \overset{(i)}{=} \frac{1}{q_{t} (x_{t})} (t - s) E_{x_{0} \sim q_{0}} | \sum_{{\tilde{x}}_{t} \in {[S]}^{d}} q_{t | 0} ({\tilde{x}}_{t} | x_{0}) R_{t} ({\tilde{x}}_{t}, x_{t}) | \\ \leq \frac{1}{q_{t} (x_{t})} (t - s) E_{x_{0} \sim q_{0}} \sum_{{\tilde{x}}_{t} \in {[S]}^{d}} q_{t | 0} ({\tilde{x}}_{t} | x_{0}) | R_{t} ({\tilde{x}}_{t}, x_{t}) | \\ = (t - s) \sum_{{\tilde{x}}_{t} \in {[S]}^{d}} \frac{q_{t} ({\tilde{x}}_{t})}{q_{t} (x_{t})} | R_{t} ({\tilde{x}}_{t}, x_{t}) |, \end{matrix}

(A17)

where

(i)

follows from the Kolmogorov forward equation. Thus, combining these intermediate results, we have

\begin{matrix} & E_{x_{t} \sim q_{t}} \sum_{\begin{matrix} y \neq x_{t} \\ Ham (y, x_{t}) = 1 \end{matrix}} | E_{\begin{matrix} x_{0} \sim q_{0 | t} (\cdot | x_{t}) \\ {\tilde{x}}_{0} \sim q_{0 | s} (\cdot | x_{t}) \end{matrix}} [f (x_{0}^{i}) - f ({\tilde{x}}_{0}^{i})] | R_{t} (y, x_{t}) \\ & \overset{(i i)}{≲} d S β_{t} max \{1, {\bar{β}}_{s}^{- 1}\} \cdot E_{x_{t} \sim q_{t}} [\frac{1}{q_{t} (x_{t})} E_{x_{0} \sim q_{0}} | q_{t | 0} (x_{t} | x_{0}) - q_{s | 0} (x_{t} | x_{0}) |] \\ & \overset{(i i i)}{≲} (t - s) d S β_{t} max \{1, {\bar{β}}_{s}^{- 1}\} E_{x_{t} \sim q_{t}} [\sum_{{\tilde{x}}_{t} \in {[S]}^{d}} \frac{q_{t} ({\tilde{x}}_{t})}{q_{t} (x_{t})} | R_{t} ({\tilde{x}}_{t}, x_{t}) |] \\ & = (t - s) d S β_{t} max \{1, {\bar{β}}_{s}^{- 1}\} E_{{\tilde{x}}_{t} \sim q_{t}} \sum_{x_{t} \in {[S]}^{d}} | R_{t} ({\tilde{x}}_{t}, x_{t}) | \\ & \overset{(i v)}{≍} (t - s) d^{2} S β_{t}^{2} max \{1, {\bar{β}}_{s}^{- 1}\}, \end{matrix}

where

(i i)

follows from (A15) and (A16),

(i i i)

follows from (A17), and

(i v)

follows because

- R_{t} (x, x) = \sum_{y \neq x} R_{t} (x, y) = β_{t} \frac{S - 1}{S} d

for all

x \in {[S]}^{d}

. The proof is now complete.
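
Step (iv) relies only on the row sums of the rate matrix. As a hedged illustration, the sketch below builds a uniform-rate matrix on a toy state space, assuming that $R_{t}(x, y) = β_{t}/S$ whenever x and y differ in exactly one coordinate and zero for larger Hamming distance. This form is our assumption, chosen to be consistent with the row-sum identity stated above; the exact definition is the one in (2). The values of S, d and β_t are illustrative.

```python
import itertools


def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def rate(x, y, S, beta_t):
    """Assumed uniform-rate entry R_t(x, y); the diagonal makes each row sum to zero."""
    d = len(x)
    h = hamming(x, y)
    if h == 0:
        return -beta_t * d * (S - 1) / S
    if h == 1:
        return beta_t / S
    return 0.0


# Illustrative values only (assumptions, not taken from the paper).
S, d, beta_t = 3, 3, 0.8
states = list(itertools.product(range(S), repeat=d))
expected_exit_rate = beta_t * d * (S - 1) / S

exit_rates = [sum(rate(x, y, S, beta_t) for y in states if y != x) for x in states]
row_sums = [sum(rate(x, y, S, beta_t) for y in states) for x in states]
abs_row_sums = [sum(abs(rate(x, y, S, beta_t)) for y in states) for x in states]

print(f"exit rate (same for every state): {exit_rates[0]:.6f}, expected {expected_exit_rate:.6f}")
print(f"sum of |R_t| over a row: {abs_row_sums[0]:.6f}")
```

Under this assumed form, the absolute row sum equals twice the exit rate, which is the order $d β_{t}$ quantity that enters step (iv).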

Appendix F.4. Proof of Lemma A4

Let

Π (q_{0}, q_{δ})

be the set of all joint probability measures with marginal distributions

q_{0}

and

q_{δ}

. Then,

\begin{matrix} TV (q_{0}, q_{δ}) & = min_{π \in Π (q_{0}, q_{δ})} E_{(x_{0}, x_{δ}) \sim π} 1 \{x_{δ} \neq x_{0}\} \\ & \leq E_{x_{0} \sim q_{0}} [Q_{δ | 0} \{x_{δ} \neq x_{0}\}] \\ & \overset{(i)}{\leq} \sum_{i \in [d]} E_{x_{0} \sim q_{0}} [Q_{δ | 0}^{i} \{x_{δ}^{i} \neq x_{0}^{i}\}] \\ & \overset{(i i)}{=} d \frac{S - 1}{S} (1 - e^{- {\bar{β}}_{δ}}) \\ & ≍ d {\bar{β}}_{δ}, \end{matrix}

where

(i)

follows from the union bound, and

(i i)

follows from (A11). The proof is now complete.
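
The coupling bound can be compared with the exact early-stopping error on a toy example. The sketch below assumes the coordinate kernel $q_{δ|0}^{i}(z|a) = S^{-1} + e^{-{\bar{β}}_{δ}}(1\{z = a\} - S^{-1})$, consistent with the probability of leaving the initial symbol used in step (ii), and a randomly generated q_0; the values of S, d and ${\bar{β}}_{δ}$ are illustrative assumptions.

```python
import itertools
import math
import random


def kernel(z, a, S, eps):
    """Coordinate kernel q^i_{delta|0}(z | a), with eps = exp(-beta_bar_delta)."""
    return 1.0 / S + eps * ((1.0 if z == a else 0.0) - 1.0 / S)


def joint_kernel(x, x0, S, eps):
    """Product kernel over independent coordinates."""
    return math.prod(kernel(z, a, S, eps) for z, a in zip(x, x0))


def tv_early_stopping(q0, S, beta_bar_delta):
    """Exact TV(q_0, q_delta) by enumeration; q0 maps every state in [S]^d to its mass."""
    eps = math.exp(-beta_bar_delta)
    total = 0.0
    for x in q0:
        q_delta = sum(w * joint_kernel(x, x0, S, eps) for x0, w in q0.items())
        total += abs(q0[x] - q_delta)
    return 0.5 * total


# Toy setup (assumptions, not from the paper).
S, d, beta_bar_delta = 3, 3, 0.05
rng = random.Random(1)
states = list(itertools.product(range(S), repeat=d))
weights = [rng.random() for _ in states]
q0 = {x0: w / sum(weights) for x0, w in zip(states, weights)}

exact_tv = tv_early_stopping(q0, S, beta_bar_delta)
coupling_bound = d * (S - 1) / S * (1.0 - math.exp(-beta_bar_delta))
linear_bound = d * beta_bar_delta

print(f"exact TV: {exact_tv:.6f}, coupling bound: {coupling_bound:.6f}, d*beta_bar: {linear_bound:.6f}")
```

The exact distance depends on q_0 and can be much smaller than the bound (for a uniform q_0 it is zero), whereas the bound holds for every q_0.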

Appendix G. Proof of Lemma A5

Recall that

f_{k}

satisfies the recursion

f_{k + 1} = f_{k} e^{- λ κ f_{k}} .

Equivalently, we have

\frac{f_{k} - f_{k + 1}}{1 - e^{- λ κ}} = \frac{f_{k} (1 - e^{- λ κ f_{k}})}{1 - e^{- λ κ}} .

With this, since

f_{k} > 0

, it is equivalent to prove that

f_{k} \leq \frac{1 - e^{- λ κ f_{k}}}{1 - e^{- λ κ}} \Leftrightarrow 1 - e^{- λ κ} \leq \frac{1 - e^{- λ κ f_{k}}}{f_{k}} .

Now, since the function

g (z) = \frac{1 - e^{- a z}}{z}

is decreasing in

z > 0

for

a > 0

, we get the latter inequality by noting that

f_{k} \leq 1

for

k > k^{*}

.
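
The reduction above can be illustrated by iterating the recursion numerically. The sketch below checks the inequality that the proof reduces to, namely $f_{k} - f_{k+1} \geq (1 - e^{-λκ}) f_{k}^{2}$ whenever $0 < f_{k} \leq 1$, together with the monotonicity of g. The values of λ and κ and the starting point $f = 1$ are illustrative assumptions of ours; the precise statement of Lemma A5 and the definition of $k^{*}$ are given in the main text.

```python
import math


def step(f, lam, kappa):
    """One step of the recursion f_{k+1} = f_k * exp(-lam * kappa * f_k)."""
    return f * math.exp(-lam * kappa * f)


def g(z, a):
    """g(z) = (1 - exp(-a z)) / z, claimed to be decreasing in z > 0 for a > 0."""
    return (1.0 - math.exp(-a * z)) / z


# Illustrative values only (assumptions, not taken from the paper).
lam, kappa = 0.7, 0.5
a = lam * kappa

f = 1.0          # start inside the regime f_k <= 1
slacks = []      # (f_k - f_{k+1}) - (1 - e^{-a}) * f_k^2, expected to be nonnegative
trajectory = [f]
for _ in range(50):
    nxt = step(f, lam, kappa)
    slacks.append((f - nxt) - (1.0 - math.exp(-a)) * f * f)
    f = nxt
    trajectory.append(f)

print(f"smallest slack: {min(slacks):.3e}")
print(f"f after 50 steps: {trajectory[-1]:.6f}")
print(f"g at 0.2, 0.5, 1.0: {g(0.2, a):.6f}, {g(0.5, a):.6f}, {g(1.0, a):.6f}")
```

The slack vanishes at $f_{k} = 1$, where the inequality holds with equality, and is positive afterwards because the sequence is decreasing.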

References

1. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 2256–2265.
2. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Proc. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
3. Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; van den Berg, R. Structured Denoising Diffusion Models in Discrete State-Spaces. In Proceedings of the Advances in Neural Information Processing Systems; Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021.
4. Dhariwal, P.; Nichol, A.Q. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021.
5. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 10684–10695.
7. Huang, R.; Huang, J.; Yang, D.; Ren, Y.; Liu, L.; Li, M.; Ye, Z.; Liu, J.; Yin, X.; Zhao, Z. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023.
8. Lou, A.; Meng, C.; Ermon, S. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Volume 235, pp. 32819–32848.
9. Liu, C.; Fan, W.; Liu, Y.; Li, J.; Li, H.; Liu, H.; Tang, J.; Li, Q. Generative diffusion models on graphs: Methods and applications. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023.
10. Alakhdar, A.; Poczos, B.; Washburn, N. Diffusion Models in De Novo Drug Design. J. Chem. Inf. Model. 2024, 64, 7238–7256.
11. Campbell, A.; Benton, J.; Bortoli, V.D.; Rainforth, T.; Deligiannidis, G.; Doucet, A. A Continuous Time Framework for Discrete Denoising Models. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022.
12. Chen, H.; Ying, L. Convergence Analysis of Discrete Diffusion Model: Exact Implementation through Uniformization. arXiv 2024, arXiv:2402.08095.
13. Zhang, Z.; Chen, Z.; Gu, Q. Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
14. Pham, L.T.N.; Shariatian, D.; Ocello, A.; Conforti, G.; Durmus, A.O. Discrete Markov Probabilistic Models: An Improved Discrete Score-Based Framework with sharp convergence bounds under minimal assumptions. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
15. Dmitriev, D.; Huang, Z.; Wei, Y. Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees. arXiv 2026, arXiv:2602.15008.
16. Ren, Y.; Chen, H.; Rotskoff, G.M.; Ying, L. How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
17. Liang, Y.; Liang, Y.; Lai, L.; Shroff, N. Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025.
18. Liang, Y.; Tan, Z.; Shroff, N.; Liang, Y. Sharp Convergence Rates for Masked Diffusion Models. arXiv 2026, arXiv:2602.22505.
19. Chen, T. On the Importance of Noise Scheduling for Diffusion Models. arXiv 2023, arXiv:2301.10972.
20. Conforti, G.; Durmus, A.; Pham, L.T.N.; Raoul, G. Non-Asymptotic Convergence of Discrete Diffusion Models: Masked and Random Walk dynamics. arXiv 2025, arXiv:2512.00580.
21. Ren, Y.; Chen, H.; Zhu, Y.; Guo, W.; Chen, Y.; Rotskoff, G.M.; Tao, M.; Ying, L. Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms. In Proceedings of the Advances in Neural Information Processing Systems 38 (NeurIPS 2025); Curran Associates, Inc.: Red Hook, NY, USA, 2025.
22. Liang, Y.; Shroff, N.; Liang, Y. From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models. arXiv 2026, arXiv:2605.27352.
23. Nisonoff, H.; Xiong, J.; Allenspach, S.; Listgarten, J. Unlocking Guidance for Discrete State-Space Diffusion and Flow Models. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
24. Kan, K.; Li, X.; Zhang, B.J.; Sahai, T.; Osher, S.; Katsoulakis, M.A. Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space. arXiv 2026, arXiv:2605.17232.
25. Liang, Y.; Huang, R.; Lai, L.; Shroff, N.; Liang, Y. Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2025.
26. Huang, X.; Lin, Y.; Jain, N.; Wang, K.; Zou, D.; Ma, Y.; Zhang, T. On the Complexity Theory of Masked Discrete Diffusion: From poly(1/ϵ) to Nearly ϵ-Free. arXiv 2025, arXiv:2509.21835.
27. Kelly, F.P. Reversibility and Stochastic Networks; Cambridge University Press: Cambridge, UK, 2011.

Table 1. Summary of results for regular uniform-rate discrete diffusion samplers in terms of the number of steps needed to achieve

\sqrt{ε}

-accuracy in

TV (q_{δ}, p_{T - δ})

. Only the best result for each sampler is shown. All logarithmic dependencies are omitted. In the table, d is the dimension, S is the vocabulary size, and M is an upper bound on the score estimates. This work provides the first analysis under a non-homogeneous noise schedule and an

L_{1}

estimation error (i.e., in the total-variation metric), with a convergence rate comparable to state-of-the-art results.

Sampler	Assumption	Time-Homogeneous?	Number of Steps	Reference
Kolmogorov	Score entropy, bounded score	Yes	$O (\frac{\sqrt{d M} S}{\sqrt{ε}})$	[13]
DMPM	Score entropy	Yes	$O (\frac{d S^{2}}{ε^{2}})$	[14,20]
$τ$-leaping	$L_{\infty}$ error	Yes	$O (\frac{d^{4} S^{4}}{\sqrt{ε}})$	[11]
$τ$-leaping	Score entropy, bounded score	Yes	$O (\frac{d^{2} S^{2}}{ε})$	[16,17]
$τ$-leaping	Score entropy	Yes	$O (\frac{d}{ε})$	[15]
Euler method, Tweedie $τ$-leaping	Score entropy, bounded score	Yes	$O (\frac{d^{2} S}{ε})$	[17]
Euler method, Tweedie $τ$-leaping	$L_{1}$ error, slow-varying noise	No	$O (\frac{d^{2} S}{\sqrt{ε}})$	Theorem 2

Share and Cite

MDPI and ACS Style

Liang, Y.; Lai, L.; Shroff, N.; Liang, Y. Convergence Guarantees for Time-Inhomogeneous Uniform-Rate Discrete Diffusion Models. Entropy 2026, 28, 675. https://doi.org/10.3390/e28060675
