How Much Is Enough? A Study on Diffusion Times in Score-Based Generative Models

Score-based diffusion models are a class of generative models whose dynamics are described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, a detailed understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T, to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; on image data, our method is competitive with the state of the art, according to standard sample quality metrics and log-likelihood.

Diffusion models learn to generate samples from an unknown density p_data by reversing a diffusion process which transforms the distribution of interest into noise. The forward dynamics injects noise into the data following a diffusion process that can be described by a Stochastic Differential Equation (SDE) of the form
$$\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t, \quad (1)$$
where x_t is a random variable at time t, f(·, t) is the drift term, g(·) is the diffusion term and w_t is a Wiener process (or Brownian motion). We also consider a special class of linear SDEs, for which the drift term is decomposed as f(x_t, t) = α(t) x_t, where the function α(t) ≤ 0 for all t, and the diffusion term is independent of x_t. This class of parameterizations of SDEs is known as affine, and it admits analytic solutions. We denote the time-varying probability density by p(x, t), where, by definition, p(x, 0) = p_data(x), and the conditional on the initial condition x_0 by p(x, t | x_0). The forward SDE is usually considered for a "sufficiently long" diffusion time T, leading to the density p(x, T). In principle, when T → ∞, p(x, T) converges to Gaussian noise, regardless of initial conditions.
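To make the affine property concrete, the following minimal sketch assumes a VP-type SDE with a constant β(t) = β (a simplification of the time-varying schedules used in practice), for which the conditional p(x_t | x_0) is a Gaussian with closed-form mean and variance:

```python
import numpy as np

# Hedged sketch: VP-SDE with constant beta, i.e. f(x,t) = -0.5*beta*x and
# g(t) = sqrt(beta). The affine drift gives an analytic Gaussian conditional.
def vp_conditional(x0, t, beta=1.0):
    mean = x0 * np.exp(-0.5 * beta * t)   # signal shrinks toward 0 as t grows
    var = 1.0 - np.exp(-beta * t)         # noise variance grows toward 1
    return mean, var

def sample_forward(x0, t, beta=1.0, rng=None):
    # Draw x_t ~ p(x, t | x_0) directly, without simulating the SDE path.
    rng = rng or np.random.default_rng(0)
    mean, var = vp_conditional(x0, t, beta)
    return mean + np.sqrt(var) * rng.standard_normal(np.shape(x0))
```

For large t the conditional approaches N(0, 1) regardless of x_0, which is the sense in which p(x, T) converges to Gaussian noise.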
For generative modeling purposes, we are interested in the inverse dynamics of such a process, i.e., transforming samples of the noisy distribution p(x, T) into p_data(x). Such dynamics can be obtained by considering the solutions of the inverse diffusion process [17]
$$\mathrm{d}x_{t'} = \left[-f(x_{t'}, T - t') + g(T - t')^2 \nabla \log p(x_{t'}, T - t')\right]\mathrm{d}t' + g(T - t')\,\mathrm{d}\bar{w}_{t'}, \quad (2)$$
where t' ≝ T − t and the inverse dynamics involve a new Wiener process \bar{w}_{t'}. Given p(x, T) as the initial condition, the solution of Equation (2) after a reverse diffusion time T will be distributed as p_data(x). We refer to the density associated with the backward process as q(x, t'). The simulation of the backward process is referred to as sampling and, differently from the forward process, this process is not affine and a closed-form solution is out of reach.
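Since no closed-form solution exists, the backward process is simulated numerically. Below is a minimal Euler–Maruyama sketch of reverse-time sampling for the constant-β VP-SDE; `score` stands in for ∇ log p(x, t), and in the test we plug in the exact score of a standard Gaussian so the example is self-contained (an illustrative assumption, not a trained model):

```python
import numpy as np

# Euler-Maruyama discretization of the reverse-time SDE for a VP forward
# process with constant beta (assumed). The update follows
#   dx = [-f(x,t) + g(t)^2 * score(x,t)] dt' + g(t) dw,  with t = T - t'.
def reverse_sde_sample(score, T=1.0, n_steps=500, beta=1.0, n=2000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n)                     # x_T ~ p_noise = N(0, 1)
    for i in range(n_steps):
        t = T - i * dt                             # current forward time
        drift = 0.5 * beta * x + beta * score(x, t)
        x = x + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n)
    return x
```

With the exact standard-normal score, score(x, t) = −x, the standard normal is invariant under these dynamics, so the sampler should return (approximately) N(0, 1) samples.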
Practical considerations on diffusion times. In practice, diffusion models are challenging to work with [3]. Indeed, direct access to the true score function ∇ log p(x_t, t), required in the dynamics of the reverse diffusion, is unavailable. This can be solved by approximating it with a parametric function s_θ(x_t, t), e.g., a neural network, which is trained using the following loss function:
$$\mathcal{L}(\theta) = \frac{1}{2}\,\mathbb{E}_{\sim(1)} \int_0^T \lambda(t)\, \left\| s_\theta(x_t, t) - \nabla \log p(x_t, t \mid x_0) \right\|^2 \mathrm{d}t, \quad (3)$$
where λ(t) is a positive weighting factor and the notation E_{∼(1)} means that the expectation is taken with respect to the random process x_t in Equation (1): for a generic function h, E_{∼(1)}[h(x_t, t)] = E_{p_data(x_0)} E_{p(x_t, t | x_0)}[h(x_t, t)]. The loss in Equation (3), usually referred to as the score matching loss, is the cost function considered in [18] (Equation (4)). The condition λ(t) = g(t)², which we use in this work, is referred to as likelihood reweighting. Due to the affine property of the drift, the term p(x_t, t | x_0) is analytically known and normally distributed for all t (expression available in Table 1, and in Särkkä and Solin [19]). Intuitively, the estimation of the score is akin to a denoising objective, which operates in a challenging regime. Later, we will quantify the difficulty of learning the score as a function of T.
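A minimal numerical sketch of this denoising-style objective, assuming the constant-β VP-SDE and a deliberately tiny linear score model s_θ(x, t) = θx (a stand-in for the neural network, chosen so the optimum is known):

```python
import numpy as np

# Monte Carlo estimate of the score-matching loss with likelihood
# reweighting lambda(t) = g(t)^2 = beta (constant-beta VP-SDE assumed).
# For x_t = m*x0 + sigma*eps, grad log p(x_t, t | x_0) = -eps / sigma.
def dsm_loss(theta, x0, T=1.0, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)       # fixed seed: common random numbers
    t = rng.uniform(0.05, T, size=x0.shape) # random diffusion times
    m = np.exp(-0.5 * beta * t)             # conditional mean coefficient
    std = np.sqrt(1.0 - np.exp(-beta * t))  # conditional std deviation
    eps = rng.standard_normal(x0.shape)
    xt = m * x0 + std * eps
    target = -eps / std                     # conditional score target
    pred = theta * xt                       # tiny linear score model (assumed)
    return np.mean(beta * (pred - target) ** 2)
```

For standard-normal data, p(x, t) = N(0, 1) at all t, so the true score is −x and the loss is minimized near θ = −1.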
While the forward and reverse diffusion processes are valid for all T, the noise distribution p(x, T) is analytically known only in the limit T → ∞. The common solution is then to replace p(x, T) with a simple (i.e., easy to sample) distribution p_noise(x), which, for the classes of SDEs that we consider in this work, is a Gaussian distribution.
In the literature, the discrepancy between p(x, T) and p_noise(x) has been neglected, under the informal assumption of a sufficiently large diffusion time. Unfortunately, while this approximation seems a valid approach to simulate and generate samples, the reverse diffusion process starts from an initial condition q(x, 0) which is different from p(x, T) and, as a consequence, it converges to a solution q(x, T) that is different from the true p_data(x). Later, we will expand on the error introduced by this approximation, but for illustration purposes, Figure 1 shows this behavior quantitatively for a simple 1D toy example where we set the data distribution equal to a mixture of normal (N) distributions as p_data(x) = πN(1, 0.1²) + (1 − π)N(3, 0.5²), with π = 0.3. When T is small, the distribution p_noise(x) is very different from p(x, T), and samples from q(x, T) have very low likelihood of being generated from p_data(x).
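The toy setup can be reproduced in a few lines (assuming, as before, a constant-β VP forward process): diffusing samples of the mixture and comparing the first two moments of p(x, T) against p_noise = N(0, 1) already shows the mismatch at small T.

```python
import numpy as np

# 1D toy data: p_data = 0.3*N(1, 0.1^2) + 0.7*N(3, 0.5^2), as in the text.
def sample_p_data(n, rng):
    comp = rng.random(n) < 0.3
    return np.where(comp, rng.normal(1.0, 0.1, n), rng.normal(3.0, 0.5, n))

# Push samples through the closed-form VP conditional (constant beta assumed).
def diffuse(x0, T, beta=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    m = np.exp(-0.5 * beta * T)
    var = 1.0 - np.exp(-beta * T)
    return m * x0 + np.sqrt(var) * rng.standard_normal(x0.shape)
```

At small T the diffused samples retain most of the data mean (≈2.4), far from the zero-mean p_noise; at large T the moments match N(0, 1).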
Crucially, Figure 1 (zoomed region) illustrates a previously overlooked behavior of diffusion models, which we unveil in our analysis. The right balance between efficient score estimation and sampling quality can be achieved by diffusion times that are smaller than common best practices suggest. Moreover, even excessively large diffusion times can be detrimental. This is a key observation that we explore in our work.
Contributions. An appropriate choice of the diffusion time T is a key factor that impacts training convergence, sampling time and quality. On the one hand, the approximation error introduced by drawing the initial conditions of the reverse diffusion process from a simple distribution p_noise(x) in place of p(x, T) increases when T is small. This is why the current best practice is to choose a sufficiently long diffusion time. On the other hand, training convergence of the score model s_θ(x_t, t) becomes more challenging to achieve with a large T, which also imposes extremely high computational costs both for training and for sampling. This would suggest choosing a smaller diffusion time. Given the importance of this problem, in this work we set off to study the existence of suitable operating regimes that strike the right balance between computational efficiency and model quality. The main contributions of this work are the following:
Contribution 1: We use an evidence lower bound (ELBO) decomposition which allows us to study the impact of the diffusion time T. This ELBO decomposition emphasizes the roles of (i) the discrepancy between the "ending" distribution of the forward diffusion and the "starting" distribution of the reverse diffusion process, and (ii) the score matching objective. Crucially, our analysis does not rely on assumptions on the quality of the score models. We explicitly study the existence of a trade-off and explore experimentally, for the first time, current approaches for selecting the diffusion time T.
Contribution 2: In Section 3, we propose a novel method to improve both the training and sampling efficiency of diffusion-based models, while maintaining high sample quality. Our method introduces an auxiliary distribution, allowing us to transform the simple "starting" distribution of the reverse process used in the literature so as to minimize the discrepancy to the "ending" distribution of the forward process. Then, a standard reverse diffusion can be used to closely match the data distribution. Intuitively, our method allows us to build "bridges" across multiple distributions, and to set T toward the advantageous regime of small diffusion times.
In addition to our methodological contributions, in Section 4, we provide experimental evidence of the benefits of our method, in terms of sample quality and log likelihood. Finally, we conclude this work in Section 5.
Related Work. A concurrent work by Zheng et al. [20] presents an empirical study of a truncated diffusion process but lacks a rigorous analysis and a clear justification for the proposed approach. Recent attempts to optimize p_noise by Lee et al. [9], or the proposal to do so in [21], have been studied in different contexts. Related work focuses primarily on improving sampling efficiency (but not training efficiency), using a wide array of techniques. Sample generation times can be drastically reduced by considering adaptive step-size integrators [22]. Such methods are complementary to our approach, and can be used in combination with the techniques we propose in this work. Other popular choices are based on merging multiple steps of a pretrained model through distillation techniques [23] or on taking larger sampling steps with GANs [24]. Approaches closer to ours modify the SDE, or the discrete-time processes, to obtain inference efficiency gains. In particular, Song et al. [7] consider implicit non-Markovian diffusion processes, Watson et al. [25] change the diffusion processes by optimal scheduling selection, and Dockhorn et al. [26] consider critically-damped Langevin SDEs. Finally, hybrid techniques combining VAEs and diffusion models [4] or simple autoencoders and diffusion models [27] have positive effects on training and sampling times.
Moreover, we remark that a simple modification of the noise schedule to steer the diffusion process toward a small diffusion time [5,28] is not a viable solution. As we discuss in Section 2.4, the optimal value of the ELBO, in the case of affine SDEs, is invariant to the choice of the noise schedule. Naively selecting a faster noise schedule does not provide any practical benefit in terms of computational complexity, as it requires smaller step sizes to keep the same accuracy as the original noise schedule simulation. However, the optimization of the noise schedule can have important practical effects on the stability of training and the variance of estimations [5]. Finally, a few other works in the literature attempt to study the convergence properties of diffusion models. For instance, De Bortoli et al. [29] obtain a total variation bound between the generated and data distributions under maximum-error assumptions between the true and approximated score. De Bortoli [30] relaxes this requirement, obtaining a bound in terms of the Wasserstein distance. Lee et al. [31] show how the total variation bound can be expressed as a function of the maximum score error and find that the bound is optimized for a diffusion time that depends on this error. Our work, on the other hand, makes no such assumptions and aims at selecting the smallest possible diffusion time to maximize training and sampling efficiency.

A Tradeoff on Diffusion Time
The dynamics of a diffusion model can be studied through the lens of variational inference, which allows us to bound the (log-)likelihood using an evidence lower bound (ELBO) [32]. The interpretation we consider in this work (see also [18], Theorem 1) emphasizes the two main factors affecting the quality of sample generation: an imperfect score and a mismatch, measured by KL[p(x, T) ‖ p_noise(x)], the Kullback–Leibler (KL) divergence between the noise distribution p(x, T) of the forward process and the distribution p_noise used to initialize the backward process.

Preliminaries: The ELBO Decomposition
Our goal is to study the quality of the generated data distribution as a function of the diffusion time T. Instead of focusing on the log-likelihood bounds for single datapoints log q(x, T), we consider the average over the data distribution, i.e., the cross-entropy E_{p_data(x)} log q(x, T). By rewriting the L_ELBO derived in Huang et al. [32] (Equation (25)) (details of the steps in Appendix B), we have that
$$\mathbb{E}_{p_\text{data}(x)} \log q(x, T) \geq \mathcal{L}_\text{ELBO}(s_\theta, T) = \mathbb{E}_{\sim(1)} \log p_\text{noise}(x_T) - R(T) - I(s_\theta, T), \quad (4)$$
where the explicit expressions of R(T) and I(s_θ, T) are given in Appendix B. Note that R(T) depends neither on s_θ nor on p_noise, while I(s_θ, T), or an equivalent reparameterization [18,32] (Equation (1)), is used to learn the approximated score by optimization of the parameters θ. It is then possible to show that I(s_θ, T) is minimized by the true score; the corresponding term K(T) = I(∇ log p, T) does not depend on θ. Consequently, we can define G(s_θ, T) = I(s_θ, T) − K(T) (see Appendix C for details), where G(s_θ, T) is a positive term that we call the gap term, accounting for the practical case of an imperfect score, i.e., s_θ(x_t, t) ≠ ∇ log p(x_t, t). It also holds that
$$\mathbb{E}_{\sim(1)} \log p_\text{noise}(x_T) = \mathbb{E}_{p(x,T)} \log p(x, T) - \text{KL}[p(x, T) \,\|\, p_\text{noise}(x)], \quad (6)$$
which we can substitute for the cross-entropy term E_{∼(1)} log p_noise(x_T) of the ELBO in Equation (4) to obtain
$$\mathcal{L}_\text{ELBO}(s_\theta, T) = \mathbb{E}_{p(x,T)} \log p(x, T) - \text{KL}[p(x, T) \,\|\, p_\text{noise}(x)] - R(T) - I(s_\theta, T). \quad (7)$$
Before concluding our derivation, we show how to combine different terms of Equation (7) into the negative entropy term E_{p_data(x)} log p_data(x). Given the stochastic dynamics defined in Equation (1), it holds that (see derivation and details in Appendix D)
$$\mathbb{E}_{p_\text{data}(x)} \log p_\text{data}(x) = \mathbb{E}_{p(x,T)} \log p(x, T) - R(T) - K(T). \quad (8)$$
Finally, we can now bound the value of E_{p_data(x)} log q(x, T) as
$$\mathbb{E}_{p_\text{data}(x)} \log q(x, T) \geq \mathcal{L}_\text{ELBO}(s_\theta, T) = \mathbb{E}_{p_\text{data}(x)} \log p_\text{data}(x) - G(s_\theta, T) - \text{KL}[p(x, T) \,\|\, p_\text{noise}(x)]. \quad (9)$$
Equation (9) clearly emphasizes the roles of an approximate score function, through the gap term G(·), and of the discrepancy between the noise distribution of the forward process and the initial distribution of the reverse process, through the KL term. The (negative) entropy term E_{p_data(x)} log p_data(x), which is constant with respect to T and θ, is the best value achievable by the ELBO. Indeed, by rearranging Equation (9), KL[p_data(x) ‖ q(x, T)] ≤ G(s_θ, T) + KL[p(x, T) ‖ p_noise(x)].
Optimality is achieved when (i) we have perfect score matching and (ii) the initial conditions for the reverse process are ideal, i.e., q(x, 0) = p(x, T). Next, we show the existence of a tradeoff: the KL decreases with T, while the gap increases with T.

The Tradeoff on Diffusion Time
We begin by showing that the KL term in Equation (9) decreases with the diffusion time T, which suggests selecting a large T to maximize the ELBO. We consider the two main classes of SDEs for the forward diffusion process defined in Equation (1): SDEs whose steady state distribution is the standard multivariate Gaussian, referred to as Variance Preserving (VP), and SDEs without a stationary distribution, referred to as Variance Exploding (VE), which we summarize in Table 1. The standard approach to generate new samples relies on the backward process defined in Equation (2), and consists in setting p_noise in agreement with the form of the forward process SDE. The following result bounds the discrepancy between the noise distribution p(x, T) and p_noise.
Lemma 1. For the classes of SDEs considered (Table 1), the discrepancy between p(x, T) and p_noise(x) can be bounded as follows.
For both Variance Preserving and Variance Exploding SDEs, the resulting bounds on KL[p(x, T) ‖ p_noise(x)] decay to zero as T grows (the explicit expressions are given in Appendix E). Our proof uses results from Villani [33], the logarithmic Sobolev inequality and Grönwall's inequality (see Appendix E for details). The consequence of Lemma 1 is that, to maximize the ELBO, the diffusion time T should be as large as possible (ideally, T → ∞), such that the KL term vanishes. This result is in line with current practices for training score-based diffusion processes, which argue for sufficiently long diffusion times [29]. Our analysis, on the other hand, highlights how this term is only one of the two contributions to the ELBO. Now, we focus our attention on studying the behavior of the second component, G(·). Before that, we define a few quantities that allow us to write the next important result.
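Lemma 1 can be checked numerically in a simple closed-form case. Assuming a constant-β VP-SDE and Gaussian data N(μ₀, s₀²) (illustrative choices), p(x, T) stays Gaussian, so the KL divergence to p_noise = N(0, 1) is available in closed form and can be seen to decay with T:

```python
import numpy as np

# Closed-form KL[p(x,T) || N(0,1)] for Gaussian data under a constant-beta
# VP-SDE (assumed). p(x,T) = N(m*mu0, (m*s0)^2 + 1 - m^2), m = exp(-beta*T/2).
def kl_to_std_normal(T, mu0=2.4, s0=0.5, beta=1.0):
    m = np.exp(-0.5 * beta * T)
    mu, var = m * mu0, (m * s0) ** 2 + 1.0 - m ** 2
    # KL between N(mu, var) and N(0, 1)
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))
```

Evaluating at increasing T shows the monotone decay toward zero that the lemma formalizes.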
Definition 1. We define the optimal score ŝ_θ for any diffusion time T as the score obtained using parameters that minimize I(s_θ, T). Similarly, we define the optimal score gap G(ŝ_θ, T) for any diffusion time T as the gap attained when using the optimal score.
The optimal score gap term G(ŝ_θ, T) is a non-decreasing function of T. That is, given T_2 > T_1, and θ_1 = arg min_θ I(s_θ, T_1), θ_2 = arg min_θ I(s_θ, T_2), then G(s_θ_2, T_2) ≥ G(s_θ_1, T_1). The proof (see Appendix F) is a direct consequence of the definition of G and the optimality of the score.
Note that the result in Section 2.2 does not imply that G(s_θ_a, T_2) ≥ G(s_θ_b, T_1) holds for generic parameters θ_a, θ_b.

Is There an Optimal Diffusion Time?
While diffusion processes are generally studied for T → ∞, diffusion times in score-based models have been arbitrarily set to be "sufficiently large" in the literature. Here we formally argue about the existence of an optimal diffusion time, which strikes the right balance between the gap G(·) and the KL terms of the ELBO in Equation (9).
Before proceeding any further, we clarify that our final objective in this work is not to find and use an optimal diffusion time. Instead, our result on the existence of optimal diffusion times (which can be smaller than the ones set by popular heuristics) serves the purpose of motivating the choice of small diffusion times, which, however, calls for a method to overcome approximation errors. For completeness, in Appendix H, we show that optimizing the ELBO to obtain an optimal diffusion time T* is technically feasible, without resorting to exhaustive grid search.
Consider the ELBO decomposition in Equation (9). We study it as a function of the time T, and seek its optimal argument T* = arg max_T L_ELBO(ŝ_θ, T). Then, the optimal diffusion time satisfies T* ∈ R⁺, and thus not necessarily T* = ∞. Additional assumptions on the gap term G(·) can be used to guarantee strict finiteness of T*.
It is trivial to verify that, since the optimal gap term G(ŝ_θ, T) is a non-decreasing function of T (Section 2.2), we have ∂G/∂T ≥ 0. Then, we study the sign of the KL derivative, which is always negative, as shown in Appendix G. Moreover, we know that lim_{T→∞} ∂KL/∂T = 0. Consequently, the function ∂L_ELBO/∂T = −∂G/∂T − ∂KL/∂T has at least one zero in its domain R⁺. To guarantee a stricter bounding of T*, we could study asymptotically the growth rates of the G and KL terms for large T. The investigation is technically involved and outside the scope of this paper. Nevertheless, as discussed hereafter, the numerical investigation carried out in this work suggests the finiteness of T*.
Empirically, we use Figure 2 to illustrate the tradeoff and the optimality arguments through the lens of the same toy example of Section 1. The first and third columns show the ELBO decomposition: we can verify that G(s_θ, T) is an increasing function of T, whereas the KL term is a decreasing function of T. Even in this simple toy example, the tension between small and large values of T is clear. The second and fourth columns show the values of the ELBO and of the likelihood as a function of T. We can then verify the validity of our claims: the ELBO is neither maximized by an infinite diffusion time, nor by a "sufficiently large" value. Instead, there exists an optimal diffusion time which, for this example, is smaller than the T = 1.0 typically used in practice. In Section 3, we present a new method that admits much smaller diffusion times, and we show that the ELBO of our approach is at least as good as that of a standard diffusion model configured to use its optimal diffusion time T*.

Relation with Diffusion Process Noise Schedule
We remark that a simple modification of the noise schedule to steer the diffusion process toward a small diffusion time [5,28] is not a viable solution. In Appendix J, we discuss how the optimal value of the ELBO, in the case of affine SDEs, is invariant to the choice of the noise schedule. Indeed, its value depends uniquely on the relative level of corruption of the initial data at the considered final diffusion time T, that is, the Signal-to-Noise Ratio. Naively, we could think that, by selecting a twice-as-fast noise schedule, we would be able to obtain the same ELBO as the original schedule by diffusing only for half the time. While true, this does not provide any practical benefit in terms of computational complexity. If the noise schedule is faster, the drift terms involved in the reverse process change more rapidly. Consequently, to simulate the reverse SDE with a numerical integration scheme, smaller step sizes are required to keep the same accuracy as the original noise schedule simulation. The effect is that, while the diffusion time for the continuous-time dynamics is smaller, the number of integration steps is larger, inducing no computational gains. The optimization of the noise schedule can, however, have important practical effects in terms of training stability and estimation variance, which we do not tackle in this work [5].
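The invariance argument can be verified numerically: for the VP-SDE, the Signal-to-Noise Ratio at time T depends only on the integral of β over [0, T], so a schedule run twice as fast, β₂(t) = 2β₁(2t), reaches at time T/2 exactly the SNR that β₁ reaches at time T (the linear β₁ below is an illustrative choice):

```python
import numpy as np

# SNR of the VP-SDE marginal at time T: m^2 / sigma^2 with
# m^2 = exp(-B) and sigma^2 = 1 - exp(-B), where B = int_0^T beta(t) dt.
def snr(T, beta_fn, n=20001):
    ts = np.linspace(0.0, T, n)
    b = beta_fn(ts)
    B = np.sum(0.5 * (b[1:] + b[:-1]) * np.diff(ts))  # trapezoidal integral
    return np.exp(-B) / (1.0 - np.exp(-B))

beta1 = lambda t: 0.1 + 1.9 * t            # illustrative linear schedule
beta2 = lambda t: 2.0 * beta1(2.0 * t)     # the same schedule, twice as fast
```

Matching SNRs at T and T/2 is exactly why shrinking T via a faster schedule leaves the optimal ELBO unchanged.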

Relation with Literature on Bounds and Goodness of Score Assumptions
A few other works in the literature attempt to study the convergence properties of diffusion models. In the work of De Bortoli et al. [29] (Theorem 1), a total variation (TV) bound between the generated and data distributions is obtained in the form C_1 exp(a_1 T) + C_2 exp(−a_2 T), where the constant C_1 depends on the maximum error over [0, T] between the true and approximated score, i.e., max_{t∈[0,T]} ‖s_θ(x, t) − ∇ log p(x, t)‖. In the work of De Bortoli [30], this requirement is relaxed, and the 1-Wasserstein distance between generated and true data is bounded as C_1 + C_2 exp(−a_2 T) + C_3 (Theorem 1). Other works consider the more realistic average square norm of the score error instead of the infinity norm, which is consistent with the standard training of diffusion models. Moreover, Lee et al. [31] show how the TV bound can be expressed as a function of the average squared score error (Theorems 2.2, 3.1 and 3.2). Related to our work, Lee et al. [31] find that the TV bound is optimized for a diffusion time that depends, among others, on the maximum score error. Finally, the work by Chen et al. [34] shows that, if the average squared score error is bounded, then the TV distance between true and generated data can be bounded as C_1 exp(−a_1 T) plus a term growing as √T with the score error, plus a discretization error. All prior approaches require assumptions on the maximum score error, which implicitly depends on: (i) the maximum diffusion time T and (ii) the class of parametric score networks considered. Hence, such methods allow for the study of convergence properties, but with the following limitations. It is not clear how the score error behaves as the fitting domain [0, T] is increased, for a generic class of parametric functions and a generic p_data. Moreover, it is difficult to link the error assumptions with the actual training loss of diffusion models. In this work, instead, we follow a more agnostic path, as we make no assumptions about the error behavior. We notice that the optimal gap term is always a non-decreasing function of T.
First, we question whether the current best practice for setting diffusion times is adequate: we find that, in realistic implementations, diffusion times are larger than necessary. Second, we introduce a new approach with provably the same performance as standard diffusion models but lower computational complexity, as highlighted in Section 3.

A New, Practical Method for Decreasing Diffusion Times
The ELBO decomposition in Equation (9) and the bounds in Lemma 1 and Section 2.2 highlight a dilemma. We thus propose a simple method that allows us to achieve both a small gap G(s θ , T) and a small discrepancy KL[p(x, T) p noise (x)]. Before that, let us use Figure 3 to summarize all densities involved and the effects of the various approximations, which will be useful to visualize our proposal.
The data distribution p_data(x) is transformed into the noise distribution p(x, T) through the forward diffusion process. Ideally, starting from p(x, T), we can recover the data distribution by simulating the backward process with the exact score ∇ log p. Using the approximated score s_θ and the same initial conditions, the backward process ends up in q^(1)(x, T), whose discrepancy from p_data(x) is G(s_θ, T). However, the distribution p(x, T) is unknown and replaced with an easy distribution p_noise(x), introducing an error measured as KL[p(x, T) ‖ p_noise(x)]. With both the score and the initial distribution approximated, the backward process ends up in q^(3)(x, T), where the discrepancy from p_data is the sum of the terms G(s_θ, T) + KL[p(x, T) ‖ p_noise].
Multiple bridges across densities. In a nutshell, our method allows us to reduce the gap term by selecting smaller diffusion times and by using a learned auxiliary model to transform the initial density p_noise(x) into a density ν_φ(x), which is as close as possible to p(x, T), thus avoiding the penalty of a large KL term. To implement this, we first transform the simple distribution p_noise into the distribution ν_φ(x), whose discrepancy KL[p(x, T) ‖ ν_φ(x)] is smaller than KL[p(x, T) ‖ p_noise(x)]. Then, starting from the auxiliary model ν_φ(x), we use the approximate score s_θ to simulate the backward process, reaching q^(2)(x, T). This solution has a discrepancy from the data distribution of G(s_θ, T) + KL[p(x, T) ‖ ν_φ(x)], which we quantify later in the section. Intuitively, we introduce two bridges. The first bridge connects the noise distribution p_noise to an auxiliary distribution ν_φ(x) that is as close as possible to the one obtained by the forward diffusion process. The second bridge, a standard reverse diffusion process, connects the smooth distribution ν_φ(x) to the data distribution. Notably, our approach has important guarantees, which we discuss next.
Figure 3. Intuitive illustration of the forward and backward diffusion processes. Discrepancies between distributions are illustrated as distances. Color coding is discussed in the text.
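The two-bridge sampler can be sketched in a few lines. All components below are stand-ins under the constant-β VP assumption: `sample_aux` plays the role of the learned ν_φ and `score` that of the trained s_θ; neither is the paper's actual model.

```python
import numpy as np

# Two-bridge sampling sketch: (1) draw from the auxiliary model nu_phi,
# which approximates p(x, T) at a *small* T; (2) run a standard reverse
# diffusion (Euler-Maruyama, constant-beta VP-SDE assumed) back to the data.
def bridge_sample(sample_aux, score, T=0.2, n_steps=200, beta=1.0, n=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = sample_aux(n, rng)                 # bridge 1: p_noise -> nu_phi ~ p(x, T)
    dt = T / n_steps
    for i in range(n_steps):               # bridge 2: standard reverse diffusion
        t = T - i * dt
        drift = 0.5 * beta * x + beta * score(x, t)
        x = x + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n)
    return x
```

In the test, the data distribution is N(2, 1), for which the marginals, the exact score, and the ideal auxiliary distribution are all known in closed form; the sampler recovers the data moments even at the small T = 0.2.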

Auxiliary Model Fitting and Guarantees
We begin by stating the requirements we consider for the density ν_φ(x). First, as is the case for p_noise, it should be easy to generate samples from ν_φ(x) in order to initialize the reverse diffusion process. Second, the auxiliary model should allow us to compute the likelihood of the samples generated through the overall generative process, which begins in p_noise, passes through ν_φ(x), and arrives in q(x, T).
The fitting procedure of the auxiliary model is straightforward. First, we recognize that minimizing KL[p(x, T) ‖ ν_φ(x)] with respect to φ is equivalent to maximizing E_{p(x,T)} log ν_φ(x), which we can use as the loss function. To obtain the set of optimal parameters φ*, we require samples from p(x, T), which can be easily obtained even if the density p(x, T) is not available. Indeed, by sampling from p_data and then from p(x, T | x_0), we obtain an unbiased Monte Carlo estimate of E_{p(x,T)} log ν_φ(x), and optimization of the loss can be performed. Note that, due to the affine nature of the drift, the conditional distribution p(x, T | x_0) is easy to sample from, as shown in Table 1. From a practical point of view, it is important to notice that the fitting of ν_φ is independent of the training of the score-matching objective, i.e., the value of I(s_θ, T) does not depend on the shape of the auxiliary distribution ν_φ. This implies that the two training procedures can be run in parallel, thus enabling considerable time savings.
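The fitting procedure above can be sketched concretely. Here ν_φ is a single Gaussian (an assumed, deliberately simple family for illustration; the paper's experiments use a DPGMM or Glow) fitted by maximum likelihood on samples pushed through the closed-form conditional of a constant-β VP-SDE:

```python
import numpy as np

# Fit nu_phi by maximum likelihood on samples of p(x, T), obtained by
# drawing x_0 ~ p_data and then x_T ~ p(x, T | x_0) (constant-beta VP-SDE).
def fit_aux_gaussian(x0, T, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m = np.exp(-0.5 * beta * T)
    var = 1.0 - np.exp(-beta * T)
    xT = m * x0 + np.sqrt(var) * rng.standard_normal(x0.shape)  # x_T ~ p(x, T)
    # Gaussian MLE = empirical mean and variance; this maximizes the
    # Monte Carlo estimate of E_{p(x,T)} log nu_phi(x).
    return xT.mean(), xT.var()
```

For Gaussian data the fitted parameters can be checked against the closed-form moments of p(x, T), which is what the test below does.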
Next, we show that the first bridge in our model reduces the KL term, even for small diffusion times.
Proposition 1. Let us assume that p_noise(x) is in the family spanned by ν_φ, i.e., there exists φ such that ν_φ = p_noise. Then, for the optimal parameters φ*, we have that
$$\text{KL}[p(x, T) \,\|\, \nu_{\varphi^*}(x)] \leq \text{KL}[p(x, T) \,\|\, p_\text{noise}(x)].$$
Since we introduce the auxiliary distribution ν, we shall define a new ELBO for our method:
$$\mathcal{L}^{\varphi}_\text{ELBO}(s_\theta, T) = \mathbb{E}_{p_\text{data}(x)} \log p_\text{data}(x) - G(s_\theta, T) - \text{KL}[p(x, T) \,\|\, \nu_\varphi(x)].$$
Recalling that ŝ_θ is the optimal score for a generic time T, Proposition 1 allows us to claim that L^φ*_ELBO(ŝ_θ, T) ≥ L_ELBO(ŝ_θ, T). Then, we can state the following important result:
Proposition 2. Given the existence of T*, defined as the diffusion time such that the ELBO is maximized (Section 2.3), there exists at least one diffusion time τ ≤ T* such that L^φ_ELBO(ŝ_θ, τ) ≥ L_ELBO(ŝ_θ, T*).
Proposition 2, which we prove in Appendix I, has two interpretations. On the one hand, given two score models optimally trained for their respective diffusion times, our approach guarantees an ELBO that is at least as good as that of a standard diffusion model configured with its optimal time T*. Our method achieves this with a smaller diffusion time τ, which improves sampling efficiency and generation quality. On the other hand, if we settle for an ELBO equivalent to that of the standard diffusion model, with our method we can afford a sub-optimal score model, which requires a smaller computational budget for training, while guaranteeing shorter sampling times. We elaborate on this interpretation in Section 4, where our approach obtains substantial savings in terms of training iterations.
A final note is in order. The choice of the auxiliary model depends on the selected diffusion time. The larger the T, the "simpler" the auxiliary model can be. Indeed, as T grows, the noise distribution p(x, T) approaches p_noise, so that a simple auxiliary model is sufficient to transform p_noise into a suitable distribution ν_φ. Instead, for a small T, the distribution p(x, T) is closer to the data distribution. Then, the auxiliary model requires high flexibility and capacity. In Section 4, we substantiate this discussion empirically on synthetic and real data.

Comparison with Schrödinger Bridges
In this section, we briefly compare our method with the Schrödinger bridges approach [29,35,36], which allows one to move from an arbitrary p_noise to p_data in any finite amount of time T. This is achieved by simulating the SDE
$$\mathrm{d}x_{t'} = \left[-f(x_{t'}, T - t') + g(T - t')^2 \nabla \log \hat{\psi}(x_{t'}, T - t')\right]\mathrm{d}t' + g(T - t')\,\mathrm{d}\bar{w}_{t'}, \quad (12)$$
where ψ, ψ̂ solve the Partial Differential Equation (PDE) system
$$\begin{cases} \dfrac{\partial \psi}{\partial t} = -\nabla \psi^\top f - \tfrac{1}{2} g^2 \Delta \psi, \\[4pt] \dfrac{\partial \hat{\psi}}{\partial t} = -\nabla \cdot (\hat{\psi} f) + \tfrac{1}{2} g^2 \Delta \hat{\psi}, \end{cases} \quad (13)$$
with boundary conditions ψ(x, 0) ψ̂(x, 0) = p_data(x), ψ(x, T) ψ̂(x, T) = p_noise(x). In the above, ∇·(ψ̂ f) = Σ_{i=1}^N ∂(ψ̂ f_i)/∂x_i, with N the dimension of the vectors x and f, and f_i, x_i denoting their i-th components. This approach presents drawbacks compared to classical diffusion models. First, the functions ψ, ψ̂ are not known, and their parametric approximation is costly and complex. Second, it is much harder to obtain quantitative bounds between true and generated data as a function of the quality of such approximations.
The ψ, ψ̂ estimation procedure simplifies considerably in the particular case where p_noise(x) = p(x, T), for arbitrary T. The solution of Equation (13) is indeed ψ(x, t) = 1, ψ̂(x, t) = p(x, t). The first PDE of the system is satisfied when ψ is a constant. The second PDE is the Fokker–Planck equation, satisfied by ψ̂(x, t) = p(x, t). The boundary conditions are also satisfied. In this scenario, a sensible objective is score matching, as making ∇ log ψ̂ equal to the true score ∇ log p allows perfect generation.
Unfortunately, it is difficult to generate samples from p(x, T), the starting condition of Equation (12). A trivial solution is to select T → ∞ in order to have p_noise as the simple and analytically known steady state distribution of Equation (1). This corresponds to the classical diffusion models approach, which we discussed in the previous sections. An alternative solution is to keep T finite and cover the first part of the bridge, from p_noise to p(x, T), with an auxiliary model. This provides a different interpretation of our method, which allows for smaller diffusion times while keeping good generative quality.

An Extension for Density Estimation
Diffusion models can also be used for density estimation, by transforming the diffusion SDE into an equivalent Ordinary Differential Equation (ODE) whose marginal distribution p(x, t) at each time instant coincides with that of the corresponding SDE [3]. The exact equivalent ODE requires the score ∇ log p(x_t, t), which in practice is replaced by the score model s_θ, leading to the following ODE:
$$\frac{\mathrm{d}x_t}{\mathrm{d}t} = f(x_t, t) - \frac{1}{2} g(t)^2 s_\theta(x_t, t), \quad (14)$$
whose time-varying probability density is indicated with p̃(x, t). Note that the density p̃(x, t) is in general not equal to the density p(x, t) associated with Equation (1), with the exception of perfect score matching [18]. The reverse-time process is modeled as a Continuous Normalizing Flow (CNF) [37,38] initialized with distribution p_noise(x); then, the likelihood of a given value x_0 is
$$\log \tilde{p}(x_0, 0) = \log p_\text{noise}(x_T) + \int_0^T \nabla \cdot \left( f(x_t, t) - \frac{1}{2} g(t)^2 s_\theta(x_t, t) \right) \mathrm{d}t.$$
To use our proposed model for density estimation, we also need to take into account the ODE dynamics. We focus again on the term log p_noise(x_T) to improve the expected log-likelihood. For consistency, our auxiliary density ν_φ should now maximize E_{∼(14)} log ν_φ(x_T) instead of E_{∼(1)} log ν_φ(x_T). However, the simulation of Equation (14) requires access to s_θ which, in the endeavor of density estimation, is available only once the score model has been trained. Consequently, optimization with respect to φ can only be performed sequentially, whereas, for generative purposes, it could be done concurrently. While the sequential version is expected to perform better, experimental evidence indicates that the improvements are marginal, justifying the adoption of the more efficient concurrent version.
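The CNF-style likelihood computation can be sketched in 1D. The example below assumes standard-normal data under the constant-β VP-SDE, so the score and its derivative are analytically known and the probability flow is exact (a toy setting chosen only to keep everything self-contained):

```python
import numpy as np

# 1D likelihood via the probability flow ODE and the instantaneous
# change-of-variables formula: log p(x_0) = log p_noise(x_T) + int div dt.
# Standard-normal data under a constant-beta VP-SDE (assumed): score = -x.
def ode_log_likelihood(x0, T=1.0, n_steps=1000, beta=1.0):
    score = lambda x, t: -x              # exact score when p_data = N(0, 1)
    dscore = lambda x, t: -1.0           # its x-derivative
    x, logdet = float(x0), 0.0
    dt = T / n_steps
    for i in range(n_steps):             # integrate forward from 0 to T
        t = i * dt
        v = -0.5 * beta * x - 0.5 * beta * score(x, t)   # probability flow drift
        div = -0.5 * beta - 0.5 * beta * dscore(x, t)    # its divergence (1D)
        x += v * dt
        logdet += div * dt
    log_p_noise = -0.5 * (x ** 2 + np.log(2.0 * np.pi))  # N(0,1) log-density
    return log_p_noise + logdet
```

In this particular case the flow drift vanishes identically, so the returned value matches the exact log-density of N(0, 1) at x_0, which serves as a correctness check.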

Experiments
We now present numerical results on the MNIST and CIFAR10 datasets, to support our claims in Sections 2 and 3. We follow a standard experimental setup [5,7,18,32]: we use a standard U-Net architecture with time embeddings [6] and we report the log-likelihood in terms of bits per dimension (BPD) and the Fréchet Inception Distance (FID) scores (the latter only for CIFAR10). Although the FID score is a standard metric for ranking generative models, caution should be used against over-interpreting FID improvements [39]. Similarly, while the theoretical properties of the models we consider are obtained through the lens of ELBO maximization, the log-likelihood measured in terms of BPD should be considered with care [40]. Finally, we also report the number of neural function evaluations (NFE) for computing the relevant metrics. We compare our method to the standard score-based model [3]. The full description of the experimental setup is presented in Appendix K.
On the existence of T. We look for further empirical evidence of the existence of a finite optimal T, as stated in Section 2.3. For the moment, we focus on the baseline model [3], where no auxiliary models are introduced. Results are reported in Table 2. For MNIST, we observe that times T = 0.6 and T = 1.0 have comparable performance in terms of BPD, implying that any T ≥ 1.0 is at best unnecessary and generally detrimental. Similarly, for CIFAR10, the best value of BPD is achieved for T = 0.6, outperforming all other values.

Our auxiliary models. In Section 3, we introduced an auxiliary model to minimize the mismatch between the initial distributions of the backward process. We now specify the family of parametric distributions we have considered. Clearly, the choice of an auxiliary model also depends on the data distribution, in addition to the choice of diffusion time T.
For our experiments, we consider two auxiliary models: (i) a Dirichlet process Gaussian mixture model (DPGMM) [41,42] for MNIST and (ii) Glow [43], a flexible normalizing flow for CIFAR10. Both of them satisfy our requirements: they allow exact likelihood computation and they are equipped with a simple sampling procedure. As discussed in Section 3, auxiliary model complexity should be adjusted as a function of T.
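To show what these requirements amount to in code, the snippet below fits a small Gaussian mixture by EM to a toy 1-D stand-in for the diffused data (our own simplified illustration: a fixed 2-component mixture rather than the DPGMM of the paper, which also infers the number of components) and exercises the two required operations: exact log-likelihood evaluation and simple ancestral sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-in for data diffused up to a small time T: two modes survive.
x = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

K = 2
pi, mu, var = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)
for _ in range(100):
    # E-step: posterior responsibility of each component for each point
    lp = (-0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
          + np.log(pi))
    r = np.exp(lp - lp.max(1, keepdims=True))
    r /= r.sum(1, keepdims=True)
    # M-step: re-estimate weights, means and variances
    nk = r.sum(0)
    pi, mu = nk / len(x), (r * x[:, None]).sum(0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(0) / nk

def log_likelihood(y):
    # exact mixture log-density via log-sum-exp
    lp = (-0.5 * ((y[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
          + np.log(pi))
    m = lp.max(1, keepdims=True)
    return (m + np.log(np.exp(lp - m).sum(1, keepdims=True))).ravel()

def sample(n):
    k = rng.choice(K, size=n, p=pi)   # pick a component, then draw from it
    return rng.normal(mu[k], np.sqrt(var[k]))

print(sorted(mu), sample(5))          # means recover the two modes near -2, 2
```

A normalizing flow such as Glow satisfies the same two requirements (exact likelihood and cheap sampling) for the higher-dimensional image case.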
This is confirmed experimentally in Figure 4, where we use the number of mixture components of the DPGMM as a proxy for the complexity of the auxiliary model.

Reducing T with auxiliary models. We now show how it is possible to obtain comparable (or better) performance than the baseline model for a wide range of diffusion times T. For MNIST, setting T = 0.4 produces good performance both in terms of BPD (Table 3) and visual sample quality (Figure 5). We also consider the sequential extension (S) to compute the likelihood, but observe only marginal improvements compared to the concurrent implementation. Similarly, for the CIFAR10 dataset, in Table 4 we observe that our method achieves better BPD than the baseline diffusion for T = 1. Moreover, our approach outperforms the baselines for the corresponding diffusion time in terms of FID score (Figure 6; additional non-curated samples in Appendix K). In Figure A3 we provide a non-curated subset of qualitative results, showing that for a diffusion time equal to 0.4 our method still produces appealing images, while the vanilla approach fails. We finally notice that the proposed method has comparable performance with respect to several other competitors, while stressing that many solutions orthogonal to ours (such as diffusion in latent space [4] or the selection of higher-order schemes [22]) can be combined with our methodology.

Training and sampling efficiency. In Figure 7, the horizontal line corresponds to the best performance of a fully trained baseline model for T = 1.0 [3]. To achieve the same performance as the baseline, variants of our method require fewer iterations, which translates into training efficiency. For the sake of fairness, the total training cost of our method should account for the auxiliary model training, which, however, can be done concurrently with the training of the diffusion model. As an illustration, for CIFAR10, using four GPUs, the baseline model requires ∼6.4 days of training. With our method we trained the auxiliary and diffusion models for ∼2.3 and 2 days, respectively; since they train concurrently, this leads to a total training time of max{2.3, 2} = 2.3 days. Similar training curves can be obtained for the MNIST dataset, where the training time of the DPGMMs is negligible.
Sampling speed benefits are evident from Tables 3 and 4. When considering the SDE version of the methods, the number of sampling steps can decrease linearly with T, in accordance with theory [45], while retaining good BPD and FID scores. Similarly, although not linearly, the number of steps of the ODE samplers can be reduced by using a smaller diffusion time T.
Finally, we test the proposed methodology on the more challenging CELEBA 64x64 dataset. In this case, we use a variance exploding diffusion and again consider Glow as the auxiliary model. The results, presented in Table 5, report the log-likelihood performance of different methods (qualitative results are reported in Appendix K). At the two extremes of complexity we have the original diffusion (VE, T = 1.0), with the best BPD and the highest complexity, and Glow, which provides a much simpler scheme with worse performance. In the table we report the BPD and NFE metrics for smaller diffusion times, in three different configurations: naively neglecting the mismatch (ScoreSDE) or using the auxiliary model (Ours). Interestingly, we found that the best results are obtained by using a combination of diffusion models pretrained for T = 1.0. The summary of this table is the following: by accepting a small degradation in terms of BPD, we can reduce the computational cost by almost one order of magnitude. We believe it would be interesting to study more powerful auxiliary models to further improve the performance of our method on challenging datasets.

Conclusions
Diffusion-based generative models emerged as an extremely competitive approach for a wide range of application domains. In practice, however, these models are resource-hungry, both during training and when sampling new data points. In this work, we introduced the key idea of considering the diffusion time T as a free variable which should be chosen appropriately. We have shown that the choice of T introduces a trade-off, for which a "sweet spot" exists. In standard diffusion-based models, smaller values of T are preferable for efficiency reasons, but a sufficiently large T is required to reduce approximation errors of the forward dynamics. Thus, we devised a novel method that allows for an arbitrary selection of the diffusion time, including small values. Our method closes the gap between practical and ideal diffusion dynamics using an auxiliary model. Our empirical validation indicated that the performance of our approach is comparable to, and often better than, that of standard diffusion models, while being efficient both in training and in sampling.

Appendix A. Generic Definitions and Assumptions
Our work builds upon [18], which should be considered as a basis for the developments hereafter. In this supplementary material, we use the shortened notation N ω (x) = N (x; 0, ωI) for a generic ω > 0. It is useful to notice that ∇ log N ω (x) = −(1/ω) x. For an arbitrary probability density p(x), we define its convolution (∗ operator) with N ω using the notation p ω (x) = (p ∗ N ω )(x). Equivalently, p ω (x) = exp((ω/2) ∆) p(x), and consequently dp ω (x)/dω = (1/2) ∆p ω (x), where ∆ = ∇⊤∇ is the Laplacian operator. Notice that, by considering the Dirac delta function δ(x), we have the equality δ ω (x) = N ω (x).
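The identity ∇ log N ω (x) = −(1/ω) x can be verified numerically with a central finite difference (a quick sanity check of our own, not part of the original material):

```python
import numpy as np

w = 0.3          # a generic omega > 0
# log-density of N_omega in one dimension
log_N = lambda x: -0.5 * (x**2 / w + np.log(2 * np.pi * w))

x0, h = 1.234, 1e-5
fd_score = (log_N(x0 + h) - log_N(x0 - h)) / (2 * h)   # central difference
print(fd_score, -x0 / w)   # the two values agree: grad log N_w(x) = -x / w
```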
In the following derivations, we make use of the Stam–Gross logarithmic Sobolev inequality [33].

Appendix B. Deriving Equation (4) from [32]
We start with Equation (25) of [32], which, in our notation, reads as follows. The first step is to take the expected value with respect to x 0 ∼ p data on both sides of the above inequality. We then focus on rewriting the relevant term. Consequently, we can rewrite the r.h.s. of Equation (A4) into a form which is exactly Equation (4).

Appendix C. Proof of Equation (5)
We prove the following result. Proof. We prove that, for a generic positive λ(·) and T 2 > T 1 , the following holds. First, we compute the functional derivative (with respect to s) of the objective. Consequently, we can obtain the optimal s by setting this functional derivative to zero. Substitution of this result into Equation (A5) directly proves the desired inequality.

Appendix D. Proof of Equation (8)
Proof. We consider the pair of equations below, where t′ = T − t, q is the density of the backward process and p is the density of the forward process. These equations can be interpreted as a particular case of the following pair of SDEs, corresponding to Equations (4) and (17) of [32] (notice that our notation for the roles of p and q is swapped with respect to [32]).
Here, Equation (A7) is recovered by considering a(x, t) = σ(t′)∇ log q(x, t′) = g(t)∇ log q(x, t′). Equation (A8) is associated with an ELBO ([32], Theorem 3) that is attained with equality if and only if a(x, t) = σ(t′)∇ log q(x, t′). Consequently, we can write the following equality associated with the backward process of Equation (A7), where the expected value is taken with respect to the dynamics of the associated forward process. By careful inspection of the pair of equations, we notice that, in the process x t , the drift includes the ∇ log q(x t , t) term, while in our main Equation (1) we have ∇ log p(x t , t′). In general, the two vector fields do not agree. However, if we select p(x, T) as the starting distribution of the generating process, i.e., q(x, 0) = p(x, T), then ∀t, q(x, t) = p(x, t′).
Given the initial conditions, the time evolution of the density p is fully described by the Fokker–Planck equation; similarly for the density q, with q(x, 0) = p(x, T). By Taylor expansion we obtain an equality that holds for arbitrarily small δt. By induction, with similar reasoning, we claim that q(x, t) = p(x, t′). This last result allows us to rewrite Equation (A7) as the pair of SDEs below. Moreover, since q(x, T) = p(x, 0) = p data (x), together with the result of Equation (A9), we have the following equality. Consequently, remembering the definitions, we finally conclude the proof.
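For reference, in the affine setting used throughout (drift f(x, t) = α(t)x and state-independent diffusion g(t)), the Fokker–Planck equation invoked above takes the standard form (stated here for completeness, with the notation of the main text):

```latex
\frac{\partial p(x,t)}{\partial t}
  = -\nabla \cdot \big( f(x,t)\, p(x,t) \big)
  + \frac{1}{2}\, g^{2}(t)\, \Delta p(x,t),
\qquad p(x,0) = p_{\mathrm{data}}(x),
```

and the analogous equation for q, with initial condition q(x, 0) = p(x, T), governs the backward density.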

Appendix E. Proof of Lemma 1
In this section, we prove the validity of Lemma 1 for the case of Variance Preserving (VP) and Variance Exploding (VE) SDEs. Remember that, as reported also in Table 1, the above-mentioned classes correspond to particular choices of the drift and diffusion terms, with the corresponding evolution of dp(x, t)/dt. Simple calculations show that lim T→∞ p(x, T) = N 1 (x).
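To make the limit concrete, the snippet below (our own check, assuming the default linear schedule β 0 = 0.1, β 1 = 20 of [3]) evaluates the VP perturbation-kernel mean coefficient and variance at several diffusion times: at T = 1.0 the marginal is essentially N 1 (x), while at T = 0.4 a visible dependence on the initial condition remains, which is exactly the mismatch the auxiliary model must absorb.

```python
import numpy as np

beta0, beta1 = 0.1, 20.0        # default VP schedule (assumption, as in [3])

def vp_kernel(t):
    # integral of beta(s) = beta0 + s * (beta1 - beta0) over [0, t]
    B = beta0 * t + 0.5 * (beta1 - beta0) * t**2
    mean_coef = np.exp(-0.5 * B)    # x_t | x_0 ~ N(mean_coef * x_0, var)
    var = 1.0 - np.exp(-B)
    return mean_coef, var

for t in (0.1, 0.4, 0.6, 1.0):
    m, v = vp_kernel(t)
    print(f"t={t:4.1f}  mean coef={m:.4f}  var={v:.4f}")
```

At t = 1.0 the mean coefficient is below 0.01 and the variance is within 10^-4 of 1, while at t = 0.4 the mean coefficient is still about 0.44: truncating the diffusion leaves a non-negligible gap to N 1 (x).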
Theorem A2. Suppose that, for any φ of the auxiliary model ν φ (x), there exists a φ′ such that ν φ′ (x) = k −D ν φ (x/k), for any k > 0. Notice that this condition is trivially satisfied if the considered parametric model has the expressiveness to multiply its output by the scalar k. Then, the minimum Kullback–Leibler divergence between p(x, T), associated with a generic diffusion Equation (A27), and the density of an auxiliary model ν φ (x) depends only on σ̄(T) and not on σ(T) alone.
Proof. We start with the equality below. The minimum then depends only on σ̄(T), as it is always possible to achieve the same value, independently of the SDE, by rescaling the auxiliary model output.

Appendix K. Experimental Details
Here we give some additional details concerning the experimental settings of Section 4.

Appendix K.2. Section 4 Details
We considered the Variance Preserving SDE with the default β 0 , β 1 parameter settings. When experimenting on CIFAR10, we considered the NCSN++ architecture as implemented in [3]. Training of the score-matching network has been carried out with the default set of optimizers and schedulers of [3], independently of the selected T.
For the MNIST dataset, we reduced the architecture by considering 64 features, ch_mult = (1, 2) and attention resolutions equal to 8. The optimizer is the same as in the CIFAR10 experiment, but the warmup has been reduced to 1000 iterations and the total number of iterations to 65,000.

Appendix K.3. Varying T
We clarify the T truncation procedure adopted during both training and testing. The SDE parameters are kept unchanged irrespective of T. During training, as evident from Equation (3), it is sufficient to sample the diffusion time randomly from the distribution U (0, T), where T can take any positive value. For testing (sampling), we simply modified the algorithmic routines to begin the reverse diffusion process from a generic T instead of the default 1.0.
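The training-side change is small enough to show in a few lines. The sketch below (our own illustration of the denoising score-matching objective with a truncated time range; the function names and the zero-score stand-in model are hypothetical) differs from the standard objective only in the line that draws t from U(ε, T):

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 0.1, 20.0        # default VP schedule (assumption, as in [3])

def vp_kernel(t):
    B = beta0 * t + 0.5 * (beta1 - beta0) * t**2
    return np.exp(-0.5 * B), 1.0 - np.exp(-B)

def dsm_loss(score_model, x0, T, eps=1e-5):
    # the only change needed for truncation: sample t from U(eps, T)
    t = rng.uniform(eps, T, size=len(x0))
    mean_coef, var = vp_kernel(t)
    noise = rng.standard_normal(len(x0))
    xt = mean_coef * x0 + np.sqrt(var) * noise
    target = -noise / np.sqrt(var)       # score of p(x_t | x_0)
    return np.mean(var * (score_model(xt, t) - target) ** 2)

# trivial stand-in "model" so the objective is runnable end-to-end
zero_model = lambda x, t: np.zeros_like(x)
x0 = rng.standard_normal(256)
print(dsm_loss(zero_model, x0, T=0.4))
```

At sampling time the symmetric change applies: the reverse integration loop simply starts from the chosen T rather than from 1.0.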