Next Article in Journal
Intrinsic Motivation as Constrained Entropy Maximization
Previous Article in Journal
Successive Refinement for Lossy Compression of Individual Sequences
Previous Article in Special Issue
U-Turn Diffusion
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Latent Abstractions in Generative Diffusion Models

by
Giulio Franzese
1,*,
Mattia Martini
2,
Giulio Corallo
1,3,
Paolo Papotti
1 and
Pietro Michiardi
1
1
Department of Data Science, EURECOM, 06410 Biot, France
2
Laboratoire J. A. Dieudonné, CNRS, Université Côte d’Azur, 06108 Nice, France
3
SAP Labs, 805 Avenue du Dr Donat-Font de l’Orme, 06259 Mougins, France
*
Author to whom correspondence should be addressed.
Entropy 2025, 27(4), 371; https://doi.org/10.3390/e27040371
Submission received: 21 February 2025 / Revised: 14 March 2025 / Accepted: 24 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue The Statistical Physics of Generative Diffusion Models)

Abstract

:
In this work, we study how diffusion-based generative models produce high-dimensional data, such as images, by relying on latent abstractions that guide the generative process. We introduce a novel theoretical framework extending Nonlinear Filtering (NLF), offering a new perspective on SDE-based generative models. Our theory is based on a new formulation of joint (state and measurement) dynamics and an information-theoretic measure of state influence on the measurement process. We show that diffusion models can be interpreted as a system of SDE, describing a non-linear filter where unobservable latent abstractions steer the dynamics of an observable measurement process. Additionally, we present an empirical study validating our theory and supporting previous findings on the emergence of latent abstractions at different generative stages.

1. Introduction

Generative models have become a cornerstone of modern machine learning, offering powerful methods for synthesizing high-quality data across various domains such as image and video synthesis [1,2,3], natural language processing [4,5,6,7], audio generation [8,9], and molecular structures and general 3D shapes [10,11,12,13], to name a few. These models transform an initial distribution, which is simple to sample from, into one that approximates the data distribution. Among these, diffusion-based models designed through the lenses of Stochastic Differential Equations (SDEs) [14,15,16] have gained popularity due to their ability to generate realistic and diverse data samples through a series of stochastic transformations.
In such models, the data generation process, as described by a substantial body of empirical research [17,18,19], appears to develop according to distinct stages: high-level semantics emerge first, followed by the incorporation of low-level details, culminating in a refinement (denoising) phase. Despite ample evidence, a comprehensive theoretical framework for modeling these dynamics remains underexplored.
Indeed, despite recent work on SDE-based generative models, refs. [20,21,22,23] shedding new light on such models, they fall short of explicitly investigating the emergence of abstract representations in the generative process. We address this gap by establishing a new framework for elucidating how generative models construct and leverage latent abstractions, approached through the paradigm of NLF [24,25,26].
NLF is used across diverse engineering domains [24], as it provides robust methodologies for the estimation and prediction of a system’s state amidst uncertainty and noise. NLF enables the inference of dynamic latent variables that define the system state based on observed data, offering a Bayesian interpretation of state evolution and the ability to incorporate stochastic system dynamics. The problem we consider is the following: an unobservable random variable X is measured through a noisy continuous-time process  Y t , wherein the influence of X on the noisy process is described by an observation function H, with the noise component modeled as a Brownian motion term. The goal is to estimate the a posteriori measure  π t of the variable X given the entire historical trajectory of the measurement process  Y t .
In this work, we establish a connection between SDE-based generative models and NLF by observing that they can be interpreted as simulations of NLF dynamics. In our framework, the latent abstraction, which corresponds to certain real-world properties within the scope of classical nonlinear filtering and remains unaffected in a causal manner by the posterior process  π t , is implicitly simulated and iteratively refined. We explore the connection between latent abstractions and the a posteriori process, through the concept of filtrations—broadly defined as collections of progressively increasing information sets—and offer a rigorous theory to study the emergence and influence of latent abstractions throughout the data generation process. To ground the reader’s intuition in a concrete example, our experimental validation considers a scenario where latent abstractions correspond to scene descriptions—such as color, shape, and object size—that are subsequently rendered using a computer program.
Our theoretical contributions unfold as follows. In Section 2 we show how to reformulate classical NLF results such that the measurement process is the only available information, and derive the corresponding dynamics of both the latent abstraction and the measurement process. These results are summarized in Theorems 2 and 3.
Given the new dynamics, in Theorem 4, we show how to estimate the a posteriori measure of the NLF model and present a novel derivation to compute the mutual information between the measurement process and random variables derived from a transformation of the latent abstractions in Theorem 5. Finally, we show in Theorem 6 that the a posteriori measure is a sufficient statistic for any random variable derived from the latent abstractions when only having access to the measurement process.
Building on these general results, in Section 3 we present a novel perspective on continuous-time score-based diffusion models, which is summarised in Equation (10). We propose to view such generative models as NLF simulators that progress in two stages: first, our model updates the a posteriori measure representing sufficient statistics of the latent abstractions; second, it uses a projection of the a posteriori measure to update the measurement process. Such intuitive understanding is the result of several fundamental steps. In Theorems 7 and 8, we show that the common view of score-based diffusion models by which they evolve according to forward (noising) and backward (generative) dynamics is compatible with the NLF formulation, in which there is no need to distinguish between such phases. In other words, the NLF perspective of Equation (10) is a valid generative model. In Appendix H, we provide additional results (see Lemma A1), focusing on the specific case of linear diffusion models, which are the most popular instance of score-based generative models in use today. In Section 4, we summarize the main intuitions behind our NLF framework.
Our results explain, by means of a theoretically sound framework, the emergence of latent abstractions that has been observed by a large body of empirical work [17,18,19,27,28,29,30,31,32,33]. The closest research to our findings are discussed in [34,35,36], albeit from a different mathematical perspective. To root our theoretical results in additional empirical evidence, we conclude our work in Section 5 with a series of experiments on score-based generative models [14], where we (1) validate existing probing techniques to measure the emergence of latent abstractions, (2) compute the mutual information as derived in our framework and show that it is a suitable approach to measure the relation between the generative process and latent abstractions, and (3) introduce a new measurement protocol to further confirm the connections between our theory and how practical diffusion-based generative models operate.

2. Nonlinear Filtering

Consider two random variables  Y t  and X, corresponding to a stochastic measurement process ( Y t ) of some underlying latent abstraction (X). We construct our universe sample space  Ω  as the combination of the space of continuous functions in the interval  [ 0 , T ]  (i.e.,  C ( [ 0 , T ] , R N )  with  T R + ), and of a complete separable metric space  S , i.e.,  Ω = C ( [ 0 , T ] , R N ) × S . On this space, we consider the joint canonical process  Z t ( ω ) = [ Y t , X ] = [ ω t y , ω x ]  for all  ω Ω , with  ω = [ ω y , ω x ] . In this work, we indicate with  σ ( · )  sigma-algebras. Consider the growing filtration naturally induced by the canonical process  F t Y , X = σ ( Y 0 s t , X )  (a short-hand for  σ ( σ ( Y 0 s t ) σ ( X ) ) ), and define  F = F T Y , X . We build the probability triplet  ( Ω , F , P ) , where the probability measure  P  is selected such that the process  { Z 0 t T , F 0 t T Y , X }  has the following SDE representation
Y t = Y 0 + 0 t H ( Y s , X , s ) d s + W t ,
where  { W 0 t T , F 0 t T Y , X }  is a Brownian motion with initial value 0 and  H : Ω × [ 0 , T ] R N  is an observation process. All standard technical assumptions are available in Appendix A.
Next, we provide the necessary background on NLF, to pave the way for understanding its connection with the generative models of interest. The most important building block of the NLF literature is represented by the conditional probability measure  P [ X A | F t Y ]  (notice the reduced filtration  F t Y F t Y , X ), which summarizes, a posteriori, the distribution of X given observations of the measurement process until time t, that is,  Y 0 s t .
Theorem 1 
(Thm 2.1 [24]). Consider the probability triplet  ( Ω , F , P ) , the metric space  S  and its Borel sigma-algebra  B ( S ) . There exists a (probability measure valued  P ( S ) ) process  { π 0 t T , F 0 t T Y } , with a progressively measurable modification, such that for all  A B ( S ) , the conditional probability measure  P [ X A | F t Y ]  is well defined and is equal to  π t ( A ) .
The conditional probability measure is extremely important, as the fundamental goal of nonlinear filtering is the solution to the following problem. Here, we introduce the quantity  ϕ , which is a random variable derived from the latent abstractions X.
Problem 1. 
For any fixed  ϕ : S R  bounded and measurable, given knowledge of the measurement process  Y 0 s t , compute  E P [ ϕ ( X ) | F t Y ] . This amounts to computing
π t , ϕ = S ϕ ( x ) d π t ( x ) .
In simple terms, Problem 1 involves studying the existence of the a posteriori measure and the implementation of efficient algorithms for its update, using the flowing stream of incoming information  Y t . We first focus our attention on the existence of an analytic expression for the value of the a posteriori expected measure  π t . Then, we quantify the interaction dynamics between observable measurements and  ϕ , through the lenses of mutual information  I ( Y 0 s t ; ϕ ) , which is an extension of the problems considered in [37,38,39,40].

2.1. Technical Preliminaries

We set the stage of our work by revisiting the measurement process  Y t , and express it in a way that does not require access to unobservable information. Indeed, while  Y t  is naturally adapted with reference to its own filtration  F t Y , and consequently to any other growing filtration  R t  such  F t Y , X R t F t Y , the representation in Equation (1) is in general not adapted, letting aside degenerate cases.
Let us consider the family of growing filtrations  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) ) , where  σ ( Y 0 ) R 0 σ ( X , Y 0 ) . Intuitively,  R 0  allows to modulate between the two extreme cases of knowing only the initial conditions of the SDE, that is  Y 0 , to the case of complete knowledge of the whole latent abstraction X, and anything in between. As shown hereafter, the original process  Y t  associated with the space  ( Ω , F , P )  which solves Equation (1), also solves Equation (4), which is adapted on the reduced filtration  R t . This allows us to reason about the partial observation of the latent abstraction ( R 0  vs.  σ ( X , Y 0 ) ), without incurring in the problem of the measurement process  Y t  being statistically dependent of the whole latent abstraction X.
Armed with such representation, we study under which change of measure the process  Y t Y 0  behaves as a Brownian motion (Theorem 3). This serves the purpose of simplifying the calculation of the expected value of  ϕ  given  Y t , as described in Problem 1. Indeed, if  Y t Y 0  is a Brownian motion independent of  ϕ , its knowledge does not influence our best guess for  ϕ , i.e., the conditional expected value. Moreover, our alternative representation is instrumental for the efficient and simple computation of the mutual information  I ( Y 0 s t ; ϕ ) , where the different measures involved in the Radon–Nikodym derivatives will be compared against the same reference Brownian measures.
The first step to define our representation is provided by the following
Theorem 2. 
[Appendix B] Consider the the probability triplet  ( Ω , F , P ) , the process in Equation (1) defined on it, and the growing filtration  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) ) . Define a new stochastic process
W t R = def Y t Y 0 0 t E P ( H ( Y s , X , s ) | R s ) d s .
Then,  { W 0 t T R , R 0 t T }  is a Brownian motion. Notice that if  R t = F t Y , X , then  W t R = W t .
Following Theorem 2, the process  { Y 0 t T , R 0 t T }  has SDE representation
Y t = Y 0 + 0 t E P ( H ( Y s , X , s ) | R s ) d s + W t R .
Next, we derive the change of measure necessary for the process  W ˜ t = def Y t Y 0  to be a Brownian motion with reference to to the filtration  R t . To carry this out, we apply the Girsanov theorem [41] to  W ˜ t , which, in general, admits a  R -adapted representation  0 t E P ( H ( Y s , X , s ) | R s ) d s + W t R .
Theorem 3. 
[Appendix C] Define the new probability space  ( Ω , R T , Q R )  via the measure  Q R ( A ) = E P 1 ( A ) ( ψ T R ) 1 , for  A R T , where
ψ t R = def exp ( 0 t E P [ H ( Y s , X , s ) | R s ] d Y s 1 2 0 t | | E P [ H ( Y s , X , s ) | R s ] | | 2 d s ) ,
and
Q R | R t = E P 1 ( A ) E P [ ( ψ T R ) 1 | R t ] = E P 1 ( A ) ( ψ t R ) 1 .
Then, the stochastic process  { W ˜ 0 t T , R 0 t T }  is a Brownian motion on the space  ( Ω , R T , Q R ) .
A direct consequence of Theorem 3 is that the process  W ˜ t  is independent of any  R 0  measurable random variable under the measure  Q R . Moreover, it holds that for all  R t R t Q R | R t = Q R | R t .

2.2. A Posteriori Measure and Mutual Information

As in Section 2 for the process  π t , here, we introduce a new process  π t R , which represents the conditional law of X given the filtration  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) ) . More precisely, for all  A B ( S ) , the conditional probability measure  P [ X A | R t ]  is well defined and is equal to  π t R ( A ) . Moreover, for any  ϕ : S R  bounded and measurable,  E P [ ϕ ( X ) | R t ] = π t R , ϕ . Notice that if  R = F Y  then  π R  reduces to  π .
Armed with Theorem 3, we are ready to derive the expression for the a posteriori measure  π t R  and the mutual information between observable measurements and the unavailable information about the latent abstractions, that materialize in the random variable  ϕ .
Theorem 4. 
[Appendix D] The measure-valued process  π t R  solves in weak sense (see Appendix D for a precise definition) the following SDE
π t R = π 0 R + 0 t π s R H ( Y s , · , s ) π s R , H ( Y s , · , s ) d Y s π s R , H ( Y s , · , s ) d s ,
where the initial condition  π 0  satisfies  π 0 R ( A ) = P [ X A | R 0 ]  for all  A B ( S ) .
When  R = F Y , Equation (6) is the well-known Kushner–Stratonovitch (or Fujisaki—Kallianpur–Kunita) equation (see, e.g., [24]). A proof for uniqueness of the solution of Equation (6) can be approached by considering the strategies in [42], but is outside the scope of this work. The (recursive) expression in Equation (6) is particularly useful for engineering purposes since, in general, it is usually not known in which variables  ϕ ( X ) , representing latent abstractions, we could be interested in. Keeping track of the whole distribution  π t R  at time t is the most cost-effective solution, as we will show later.
Our next goal is to quantify the interaction dynamics between observable measurements and latent abstractions that materialize through the variable  ϕ ( X )  (from now on, we write only  ϕ  for the sake of brevity); in Theorem 5, we derive the mutual information  I ( Y 0 s t ; ϕ ) .
Theorem 5. 
[Appendix E] The mutual information between observable measurements  Y 0 s t  and ϕ is defined as:
I ( Y 0 s t ; ϕ ) = def log d P # Y 0 s t , ϕ d P # Y 0 s t d P # ϕ d P # Y 0 s t , ϕ .
It holds that such quantity is equal to  E P log d P | R t d P | F t Y d P | σ ( ϕ ) , with  R t = σ ( Y 0 s t , ϕ ) , which can be simplified as follows:
I ( Y 0 ; ϕ ) + 1 2 E P 0 t | | E P [ H ( X , Y s , s ) | F s Y ] E P [ H ( X , Y s , s ) | R s ] | | 2 d s .
The mutual information computed by Equation (8) is composed of two elements: first, the mutual information between the initial measurements  Y 0  and  ϕ , which is typically zero by construction. The second term quantifies how much the best prediction of the observation function H is influenced by the extra knowledge of  ϕ , in addition to the measurement history  Y 0 s t . By adhering to the premise that the conditional expectation of a stochastic variable constitutes the optimal estimator given the conditioning information, the integral on the right-hand side quantifies the expected squared difference between predictions, having access to measurements only ( E P [ · | F t Y ] ) and those incorporating additional information ( E P [ · | R t ] ).
Even though a precise characterization for general observation functions and and variables  ϕ  is typically out of reach, a qualitative analysis is possible. First, the mutual information between  ϕ  and the measurements depends on (i) how much the amplitude of H is impacted by knowledge of  ϕ  and (ii) the number of elements of H that are impacted (informally, how much localized vs. global is the impact of  ϕ ). Second, it is possible to define a hierarchical interpretation about the emergence of the various latent factors: a variable with a local impact can “appear”, in an information theoretic sense, only if the impact of other global variables is resolved; otherwise, the remaining uncertainty of the global variables makes knowledge of the local variable irrelevant. In classical diffusion models, this is empirically known [17,18,19], and corresponds to the phenomenon where semantics emerges before details (global vs. local details in our language). For instance, as shown in Section 5, during the generative dynamics, latent abstractions which correspond to high level properties such as color and geometric aspect ratio emerge in very early stages of the process.
Now, consider any  F t Y  measurable random variable  Y ˜ t , defined as a mapping to a generic measurable space  ( Ψ , B ( Ψ ) ) , which means it can also be seen as a process. The data processing inequality states that the mutual information between such  Y ˜  and  ϕ  will be smaller than the mutual information between the original measurement process and  ϕ . However, it can be shown that all the relevant information about the random variable  ϕ  contained in  F t Y  is equivalently contained in the filtering process at time instant t, that is  π t . This is not trivial, since  π t  is a  F t Y -measurable quantity, i.e.,  σ ( π t ) F t Y . In other words, we show that  π t  is a sufficient statistic for any  σ ( X )  measurable random variable when starting from the measurement process.
Theorem 6. 
[Appendix F] For any  F t Y  measurable random variable  Y ˜ t : Ω Ψ , the following inequality holds:
I ( Y ˜ ; ϕ ) I ( Y 0 s t ; ϕ ) .
For a given  t 0 , the measurement process  Y 0 s t  and X are conditionally-independent given  π t . This implies that  P ( A | σ ( π t ) ) = P ( A | F t Y ) , A σ ( X ) . Then,  I ( Y 0 s t ; ϕ ) = I ( π t ; ϕ )  (i.e., Equation (9) is attained with equality).
While  π t  contains all the relevant information about  ϕ , the same cannot be said about the conditional expectation, i.e., the particular case  Y ˜ = π t , ϕ . Indeed, from Equation (2),  π t , ϕ  is obtained as a transformation of  π t  and thus can be interpreted as a  F t Y  measurable quantity subject to the constraint of Equation (9). As a particular case, the quantity  π t , H , of central importance in the construction of generative models Section 3, carries, in general, less information about  ϕ  than the un-projected  π t .

3. Generative Modeling

We are interested in generative models for a given  σ ( X ) -measurable random variable V. An intuitive illustration of how data generation works according to our framework is as follows. Consider, for example, the image domain, and the availability of a rendering engine that takes as an input a computer program describing a scene (coordinates of objects, textures, light sources, auxiliary labels, etc...) and that produces an output image of the scene. In a similar vein, a generative model learns how to use latent variables (which are not explicitly provided in input, but rather implicitly learned through training) to generate an image. For such a model to work, one valid strategy is to consider an SDE in the form of Equation (1) where the following holds (from a strictly technical point of view, Assumption 1 might be incompatible with other assumptions in Appendix A, or proving compatibility could require particular effort. Such details are discussed in Appendix G).
Assumption 1. 
The stochastic process  Y t  satisfies  Y T = V , P a . s .
Then, we could numerically simulate the dynamics of Equation (1) until time T. Indeed, starting from initial conditions  Y 0 , we could obtain  Y T  that, under Assumption 1, is precisely V. Unfortunately, such a simple idea requires explicit access to X, as it is evident from Equation (1). In mathematical terms, Equation (1) is adapted to the filtration  F t Y , X . However, we have shown how to reduce the available information to account only for historical values of  Y t . Then, we can combine the result in Theorem 4 with Theorem 2 and re-interpret Equation (4), which is a valid generative model, as
π t = π 0 + 0 t π s H π s , H d Y s π s , H d s , Y t = Y 0 + 0 t π s , H d s + W t F Y ,
where H denotes  H ( Y s , · , s ) . Explicit simulation of Equation (10) only requires knowledge of the whole history of the measurement process: provided Assumption 1 holds, it allows generation of a sample of the random variable V.
Although the discussion in this work includes a large class of observation functions, we focus on the particular case of generative diffusion models [14]. Typically, such models are presented through the lenses of a forward noising process and backward (in time) SDEs, following the intuition of Anderson [43]. Next, according to the framework we introduce in this work, we reinterpret such models from the perspective of enlargement of filtrations.
Consider the reversed process  Y ^ t = def Y T t  defined on  ( Ω , F , P )  and the corresponding filtration  F t Y ^ = def σ ( Y ^ 0 s t ) . The measure  P  is selected such that the process  Y ^ t  has a  F t Y ^ -adapted expression
Y ^ t = V + 0 t F ( Y ^ s , s ) d s + W ^ t ,
where  { W ^ t , F t Y ^ }  is a Brownian motion. Then, Assumption 1 is valid since  Y T = Y ^ 0 = V . Note that Equation (11), albeit with a different notation, is reminiscent of the forward SDE that is typically used as the starting point to illustrate score-based generative models [14]. In particular,  F ( · )  corresponds to the drift term of such a diffusion SDE.
Equation (11) is equivalent to  Y t = V + t T F ( Y s , T s ) d s + W ^ T t , which is an expression for the process  Y t , which is adapted to  F Y ^ . This constitutes the first step to derive an equivalent backward (generative) process according to the traditional framework of score-based diffusion models. Note that such an equivalent representation is not useful for simulation purposes: the goal of the next step is to transform it such that it is adapted to  F Y . Indeed, using simple algebra, it holds that
Y t = Y 0 0 t F ( Y s , T s ) d s + Y 0 + V + 0 T F ( Y s , T s ) d s + W ^ T t ,
where the last term in the parentheses is equal to  W ^ T + W ^ T t .
Note that  F t Y = σ ( Y ^ T t s T ) . Since  σ ( Y ^ T t s T ) = σ ( W ^ T t s T ) σ ( Y ^ T t ) , we can apply the result in [44] (Thm 2.2) to claim the following:  W ^ T + W ^ T t 0 t log p ^ ( Y s , T s ) d s  is a Brownian motion adapted to  F t Y , where this time  P ( Y ^ t d y ) = p ^ ( y , t ) d y . Then, [44].
Theorem 7. 
Consider the stochastic process  Y t  which solves Equation (11). The same stochastic process also admits a  F t Y -adapted representation
Y t = Y 0 + 0 t F ( Y s , T s ) + log p ^ ( Y s , T s ) In Theorem 8 , we call this F ( Y s , s ) d s + W t .
Equation (12) corresponds to the backward diffusion process from [14] and, because it is adapted to the filtration  F Y , it represents a valid, and easy-to-simulate, measurement process.
By now, it is clear how to go from an  F Y , X -adapted filtration to a  F Y -adapted one. We also showed that a  F Y -adapted filtration can be linked to the reverse,  F Y ^ -adapted process induced by a forward diffusion SDE. What remains to be discussed is the connection that exists between the  F Y -adapted filtration, and its enlarged version  F Y , X . In other words, we have shown that a forward, diffusion SDE admits a backward process which is compatible with our generative model that simulates a NLF process having access only to measurements, but we need to make sure that such process admits a formulation that is compatible with the standard NLF framework in which latent abstractions are available.
To carry this out, we can leverage existing results about Markovian bridges [22,45] (and further work [46,47,48,49] on filtration enlargement). This requires assumptions about the existence and well-behaved nature of densities  p ( y , t )  of the SDE process, defined by the logarithm of the Radon–Nikodym derivative of the instantaneous measure  P ( Y t d y )  with reference to the Lebesgue measure in  R N P ( Y t d y ) = p ( y , t ) d y  (the analysis of the existence of the process adapted to  F t Y  is considered in the time interval  [ 0 , T )  [50]; see also Appendix G).
Theorem 8. 
Suppose that on  ( Ω , F , P ) , the Markov stochastic process  Y t  satisfies
Y t = Y 0 + 0 t F ( Y s , s ) d s + W t ,
where  { W 0 t T , F 0 t T Y }  is a Brownian motion and F satisfies the requirements for existence and well definition of the stochastic integral [51]. Moreover, let Assumption 1 hold. Then, the same process admits  R t = σ ( Y 0 s t , Y T ) -adapted representation
Y t = Y 0 + 0 t F ( Y s , s ) + Y s log p ( Y T | Y s ) d s + β t ,
where  p ( Y T | Y s )  is the density with reference to the Lebesgue measure of the probability  P ( Y T | σ ( Y s ) ) , and  { β 0 t T , R 0 t T }  is a Brownian motion.
The connection between time reversal of diffusion processes and enlarged filtrations is finalized with the result of Al-Hussaini and Elliott [52], Thm. 3.3, where it is proved how the  β t  term of Equation (13) is a Brownian motion, using the techniques of time reversals of SDEs.
Since  p ^ ( y , T t ) = p ( y , t ) , the enlarged filtration version of Equation (12) reads
Y t = Y 0 + 0 t F ( Y s , T s ) + Y s log p ( Y s | Y T ) d s Equivalent to H ( Y t , X , t ) = F ( Y s , T s ) + Y s log p ( Y s | V + W t .
Note that the dependence of  Y t  on the latent abstractions X is implicitly defined by conditioning the score term  Y s log p ( Y s | Y T )  by  Y T , which is the “rendering” of X into the observable data domain.
Clearly, Equation (14) can be reverted to the starting generative Equation (12) by mimicking the results which allowed us to go from Equation (1) to Equation (4), by noticing that  E P [ Y s log p ( Y T | Y s ) | F t Y ] = 0  (informally, this is obtained since  y s log p ( y t | y s ) p ( y t | y s ) d y t = y s p ( y t | y s ) d y t = 0 ).
It is also important to notice that we can derive the expression for the mutual information between the measurement process and a sample from the data distribution, as follows
I ( Y 0 s t ; V ) = I ( Y 0 ; V ) + 1 2 E P 0 t | | Y s log p ( Y s ) Y s log p ( Y s | Y T ) | | 2 d s .
Mutual information is tightly related to the classical loss function of generative diffusion models.
Furthermore, by casting the result of Equation (8) according to the forms of Equations (12) and (14), we obtain the simple and elegant expression
I ( Y 0 s t ; V ) = I ( Y 0 ; V ) + 1 2 E P 0 t | | Y s log p ( Y T | Y s ) | | 2 d s .
In Appendix H, we present a specialization of our framework for the particular case of linear diffusion models, recovering the expressions for the variance-preserving and variance-exploding SDEs that are the foundations of score-based generative models [14].

4. An Informal Summary of the Results

We shall now take a step back from the rigor of this work, and provide an intuitive summary of our results, using Figure 1 as a reference.
We begin with an illustration of NLF, shown on the left of the figure. We consider an observable latent abstraction X and the measurement process  Y t , which, for ease of illustration, we consider evolving in discrete time, i.e.,  Y 0 , Y 1 , , and whose joint evolution is described by Equation (1). Such an interaction is shown in blue:  Y 3  depends on its immediate past  Y 2  and the latent abstraction X.
The a posteriori measure process  π t  is updated in an iterative fashion by integrating the flux of information. We show this in green:  π 1  is obtained by updating  π 0  with  Y 1 Y 0  (the equivalent of  d Y t ). This evolution is described by Kushner’s equation, which has been derived informally from the result of Equation (6). The a posteriori process is a sufficient statistic for the latent abstraction X: for example,  π 3  contains the same information about  ϕ  as the whole  Y 0 , , Y 3  (red boxes). Instead, in general, a projected statistic  π t , ϕ  contains less information than the whole measurement process (this is shown in orange, for time instant 2). The mutual information between all these variables is proven in Theorem 6, whereas the actual value of  I ( Y 0 s t ; ϕ )  is shown in Theorem 5.
Next, we focus on generative modeling. As per our definition, any stochastic process satisfying Assumption 1 ( Y 3 = V , in the figure) can be used for generative purposes. Since the latent abstraction is by definition not available, it is not possible to simulate directly the dynamics using Equation (1) (dashed lines from X to  Y t ). Instead, we derive a version of the process adapted to the history of  Y t  alone, together with the update of the projection  π t , H , which amounts to simulating Equation (10). In [36], diffusion models are shown to solve a “self-consistency” equation akin to a mean-field fixed point. Our framework aligns with this view by revealing how SDE-based generative processes implicitly enforce self-consistency between latent abstractions and measurements.
The update of the upper part of Equation (10), which is a particular case of Equation (6), can be interpreted as the composition of two steps: (1) (green) the update of the a posteriori measure given new available measurements, and, (2) (orange) the projection of the whole  π t  into the statistic of interest. The update of the measurement process, i.e., the lower part of Equation (10), is color-coded in blue. This is in stark contrast to the NLF case, as the update of, e.g.,  Y 3 = V  does not depend directly on X. The system in Equation (10) and its simulation describes the emergence of latent world representations in SDE-based generative models:
Entropy 27 00371 i001
The theory developed in this work guarantees that the mutual information between measurements and any statistics  ϕ  grows as described by Theorem 5. Our framework offers a new perspective, according to which, the dynamics of SDE-based generative models [14] implicitly mimic the two steps procedure described in the box above. We claim that this is the reason why it is possible to dissect the parametric drift of such generative models and find a representation of the abstract state distribution  π t , encoded into their activations. Next, we set to root our theoretical findings in experimental evidence.

Generality of Our Framework

While previous works studied latent abstraction emergence specifically within diffusion-based generative models, our current theoretical framework deliberately transcends this scope. Indeed, the results presented in this paper can be interpreted as a generalization of the results contained in [53] (see also Equation (A10)) and apply broadly to any generative model satisfying Assumption 1. Such generality includes a wide variety of generative modeling techniques, such as Neural Stochastic Differential Equations (neural SDEs) [54], Schrödinger Bridges [55], and Stochastic Normalizing Flows [56]. In all these cases, our results on latent abstractions still hold; thus, the insights provided by our framework pave the way for deeper theoretical understanding and wider applicability across generative modeling paradigms.

5. Empirical Evidence

We complement existing empirical studies [17,18,19,30,31,32,33,34] that first measured the interactions between the generative process of diffusion models and latent abstractions, by focusing on a particular dataset that allows for a fine-grained assessment of the influence of latent factors.
Dataset. We use the Shapes3D [57] dataset, which is a collection of  64 × 64  ray-tracing generated images, depicting simple 3D scenes, with an object (a sphere, cube,...) placed in a space, described by several attributes (color, size, orientation). Attributes have been derived from the computer program that the ray-tracing software executed to generate the scene: these are transformed into labels associated with each image. In our experiments, such labels are the materialization of the latent abstractions X we consider in this work (see Appendix J.1 for details).
Measurement Protocols. For our experiments, we use the base NCSPP model described by [14]: specifically, our denoising score network corresponds to a U-NET [58]. We train the unconditional version of this model from scratch using a score-matching objective. Detailed hyper-parameters and training settings are provided in Appendix J.2. Next, we summarize three techniques to measure the emergence of latent abstractions through the lenses of the labels associated with each image in our dataset. For all such techniques, we use a specific “measurement” subset of our dataset, which we partition in 246 training, 150 validation, and 371 test examples. We use a multi-label stratification algorithm [59,60] to guarantee a balanced distribution of labels across all dataset splits.
Linear probing. Each image in the measurement subset is perturbed with noise, using a variance-exploding schedule [14], with noise levels decreasing from  τ = 0  to  τ = 1.0  in steps of 0.1, as shown in Figure 2. Intuitively, each time value  τ  can be linked to a different signal-to-noise ratio ( S N R ), ranging from  S N R ( τ = 1 ) =  to  S N R ( τ = 0 ) 0 . We extract several feature maps from all the linear and convolutional layers of the denoising score network, for each perturbed image, resulting in a total of 162 feature map sets for each noise level. This process yields 11 different datasets per layer, which we use to train a linear classifier (our probe) for each of these datasets, using the training subset. In these experiments, we use a batch size of 64 and adjust the learning rate based on the noise level (see Appendix J.3). Classifier performance is optimized by selecting models based on their log-probability accuracy observed on the validation subset. The final evaluation of each classifier is conducted on the test subset. Classification accuracy, measured by the model log likelihood, is a proxy of latent abstraction emergence [17].
Mutual information estimation. We estimate mutual information between the labels and the outputs of the diffusion model across varying diffusion times, using Equation (A10) (which is a specialized version of our theory for linear diffusion models; see Appendix H) and adopt the same methodology discussed by Franzese et al. [53] to learn conditional and unconditional score functions and to approximate the mutual information. The training process uses a randomized conditioning scheme: 33% of training instances are conditioned on all labels, 33% on a single label, and the remaining 33% are trained unconditionally. See Appendix J.4 for additional details.
Forking. We propose a new technique to measure at which stage of the generative process, image features described by our labels emerge. Given an initial noise sample, we proceed with numerical integration of the backward SDE [14] up to time  τ . At this point, we fork k replicas of the backward process and continue the k generative pathways independently until numerical integration concludes. We use a simple classifier (a pre-trained ResNet50 [61] with an additional linear layer trained from scratch) to verify that labels are coherent across the k forks. Coherency is measured using the entropy of the label distribution output by our simple classifier on each latent factor for all the k branches of the fork. Intuitively, if we fork the process at time  τ = 0.6 , and the k forks all end up displaying a cube in the image (entropy equals 0), this implies that the object shape is a latent abstraction that has already emerged by time  τ . Conversely, lack of coherence implies that such a latent factor has not yet influenced the generative process. Details of the classifier training and sampling procedure are provided in Appendix J.5.
Results. We present our results in Figure 3. We note that some attributes like floor hue, wall hue and shape emerge earlier than others, which corroborates the hierarchical nature of latent abstractions, a phenomenon that is related to the spatial extent of each attribute in pixel space. This is evident from the results of linear probing, where we evaluate the performance of linear probes trained on features maps extracted from the denoiser network, and from the mutual information measurement strategy and the measured entropy of the predicted labels across forked generative pathways. Entropy decreases with  τ , which marks the moment in which the generative process proceeds along k forks. When generative pathways converge to a unique scene with identical predicted labels (entropy reaches zero), this means that the model has committed to a specific set of latent factors (breaking some of the symmetries in the language of [36]). This coincides with the same noise level corresponding to high accuracy for the linear probe, and high-values of mutual information. Further ablation experiments are presented in Appendix J.6.

6. Potential Applications and Practical Implications

Our theoretical analysis and experimental results provide a novel information-theoretic perspective on diffusion-based generative models. Besides the primary theoretical contribution, our framework naturally suggests several promising practical applications and implications. First, the explicit characterization of latent abstractions through mutual information measures may facilitate novel strategies for conditional generation, where generative processes could be guided or steered in a principled manner by explicitly controlling the latent abstraction process. Second, the insights derived from our nonlinear filtering viewpoint offer opportunities for enhanced interpretability and robustness in downstream applications that utilize learned latent representations, potentially leading to more reliable model behaviors. These practical implications represent compelling directions for future research, extending the reach and usability of diffusion-based generative modeling frameworks.

7. Conclusions

Despite their tremendous success in many practical applications, a deep understanding of how SDE-based generative models operate remained elusive. A particularly intriguing aspect of several empirical investigations was to uncover the capacity of generative models to create entirely new data by combining latent factors learned from examples. To the best of our knowledge, there exists no theoretical framework that attempts to describe such a phenomenon.
In this work, we closed this gap, and presented a novel theory—which builds on the framework of NLF—to describe the implicit dynamics allowing SDE-based generative models to tap into latent abstractions and guide the generative process. Our theory, which required advancing the standard NLF formulation, culminates in a new system of joint SDEs that fully describes the iterative process of data generation. Furthermore, we derived an information-theoretic measure to study the influence of latent abstractions, which provides a concrete understanding of the joint dynamics.
To root our theory into concrete examples, we collected experimental evidence by means of novel (and established) measurement strategies that corroborate our understanding of diffusion models. Latent abstractions emerge according to an implicitly learned hierarchy and can appear early on in the data generation process, much earlier than what is visible in the data domain. Our theory is especially useful as it allows analyses and measurements of generative pathways, opening up opportunities for a variety of applications, including image editing, and improved conditional generation.

Author Contributions

Conceptualization, G.F., M.M. and P.M.; Methodology, G.F., M.M. and P.M.; Software, G.F., G.C., P.P. and P.M.; Validation, G.F., G.C., P.P. and P.M.; Formal analysis, G.F., M.M. and P.M.; Investigation, G.F., G.C., P.P. and P.M.; Resources, G.F.; Data curation, G.F., G.C., P.P. and P.M.; Writing—original draft, G.F., M.M. and P.M.; Writing—review & editing, G.F., M.M., G.C., P.P. and P.M.; Supervision, P.P.; Funding acquisition, G.F. and P.M. All authors have read and agreed to the published version of the manuscript.

Funding

G.F. and P.M were partially funded by project MUSECOM2—AI-enabled MUltimodal SEmantic COMmunications and COMputing, in the Machine Learning-based Communication Systems, towards Wireless AI (WAI), Call 2022, ChistERA. M.M. acknowledges the financial support of the European Research Council (ERC) under the European Union’s Horizon Europe research and innovation programme (AdG ELISA project, Grant Agreement No. 101054746). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. All considered datasets are publicly available.

Conflicts of Interest

G.C. was employed by the company SAP Labs Mougins. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Assumptions

Assumption A1. 
Whenever we mention a filtration, we assume as usual that it is augmented with the  P  null sets, i.e., if the set N is such that  P ( N ) = 0 , then all  A N  should be in the filtration.
Assumption A1 is standard construction in measure theoretic formulations and ensures that “impossible events” are measurable.
We now list Assumptions A2 and A3 which correspond to Equations (2.3) and (2.4) of [24]. These assumptions are necessary to ensure that (i) the stochastic integral in Equation (3) is well defined and (ii) that Theorem 1 (which corresponds to Thm 2.1 in [24]) holds
Assumption A2. 
E P [ 0 t | | H ( Y s , X , s ) | | d s ] < .
Assumption A3. 
P ( 0 t | | E P [ H ( Y s , X , s ) | F s Y ] | | 2 d s < ) = 1 .
Assumptions A2 and A3, while sufficient, are often difficult to check. In practice it is often easier to check Assumption A4, which implies the two.
Assumption A4. 
E P [ 0 t | | H ( Y s , X , s ) | | 2 d s ] < .
Finally, it is necessary to ensure that the stochastic integrals used in the Girsanov transformations of Theorem 3 are martingales. For this reason, we consider the adaptation of Equation (3.19) in [24] and consider:
Assumption A5. 
E P [ exp { 1 2 0 t | | H ( Y s , X , s ) | | 2 d s } ] < ,
and
E P [ exp { 1 2 0 t | | E P [ H ( Y s , X , s ) | R s ] | | 2 d s } ] < ,
Note: Assumption A5 and Assumption A4 are trivially verified when H is bounded, which is consequently a condition which allow to claim validity of all the results discussed in this work.

Appendix B. Proof of Theorem 2

We start by combining Equation (3) and Equation (1)
W t R = Y 0 + 0 t H ( Y s , X , s ) d s + W t Y 0 0 t E P ( H ( Y s , X , s ) | R s ) d s   = 0 t H ( Y s , X , s ) d s + W t 0 t E P ( H ( Y s , X , s ) | R s ) d s .
We begin by showing that it is a martingale. For any  0 τ t , it holds
E P [ W t R | R τ ] = E P [ 0 t H ( Y s , X , s ) d s | R τ ] + E P [ W t | R τ ]   E P [ 0 t E P ( H ( s , Y s , X ) | R s ) d s | R τ ]   = 0 t E P [ H ( Y s , X , s ) | R τ ] d s + E P [ E P [ W t | F τ Y , X ] | R τ ]   0 τ E P [ H ( Y s , X , s ) | R s ] d s τ t E P [ H ( Y s , X , s ) | R τ ] d s   = 0 τ E P [ H ( Y s , X , s ) | R τ ] d s + E P [ W τ | R τ ] + W τ R + Y 0 Y τ   = E P [ 0 τ H ( Y s , X , s ) d s + W τ + Y 0 Y τ | R τ ] + W τ R = W τ R .
Moreover, it is easy to check that the cross-variation of  W t R  is the same as the one of  W t . Then, we can conclude the proof by Levy’s characterization of Brownian motion ( W 0 R = 0 ). ☐

Appendix C. Proof of Theorem 3

First, by combining the definition of  ψ t R  and the fact that  d Y t = E P [ H ( Y t , X , t ) | R t ] + d W t R  we obtain
( ψ t R ) 1 = exp ( 0 t E P [ H ( Y s , X , s ) | R s ] d W s R 1 2 0 t | | E P [ H ( Y s , X , s ) | R s ] | | 2 d s ) .
Notice that by Assumption A5 (which is actually the usual Novikov’s condition), the local martingale  ( ψ t R ) 1  is a real-valued martingale starting from  ( ψ 0 R ) 1 = 1 . Then, we can apply Girsanov theorem and conclude that  d Q R = ψ T R d P  is a probability measure under which the process  { W ˜ 0 t T , R 0 t T } , with
W t ˜ = W t R + 0 t E P [ H ( Y t , X , s ) | R t ] d s ,
is a Brownian motion on the space  ( Ω , R T , Q R ) . ☐

Appendix D. Proof of Theorem 4

First, let us give a precise meaning to being a weak solution of Equation (6). We say that  π t R  solves (6) in a weak sense in, for any for any  ϕ : S R  bounded and measurable, it holds
π t R , ϕ = π 0 R , ϕ + 0 t π s R , H ( Y s , · , s ) ϕ π s R , ϕ π s R , H ( Y s , · , s ) d Y s π s R , H ( Y s , · , s ) d s .
Let us recall that, on  ( Ω , F , P ) , the process  Y t  has the SDE representation (1), where  { W 0 t T , F 0 t T Y , X }  is a Brownian motion. Moreover, by Theorem 3 with  R t = F t Y , X , it holds that  { ( Y Y 0 ) 0 t T , F 0 t T Y , X }  is a Brownian motion on the space  ( Ω , F , Q F Y , X ) , where  d Q F Y , X = ( ψ T F Y , X ) 1 d P  and
ψ t F Y , X = exp ( 0 t H ( Y s , X , s ) d Y s 1 2 0 t | | H ( Y s , X , s ) | | 2 d s ) .
For notation simplicity, in this subsection  ψ t F Y , X  and  Q F Y , X  are simply indicated as  π t ψ t  and  Q , respectively.
Since we aim at showing that (A1) holds, let us fix  ϕ  and let us start from  E P [ ϕ ( X ) | R t ] = π t R , ϕ . Bayes Theorem provides us with the following
π t R , ϕ = E P [ ϕ ( X ) | R t ] = E Q [ d P d Q ϕ ( X ) | R t ] E Q [ d P d Q | R t ] = E Q [ ψ T ϕ ( X ) | R t ] E Q [ ψ T | R t ] = def π ^ t R , ϕ π ^ t R , 1 .
Starting from the numerator  π ^ t R , ϕ , we involve the tower property of conditional expectation and the fact that  ψ t  is  F t Y , X  measurable to write
π ^ t R , ϕ = E Q [ ψ T ϕ ( X ) | R t ] = E Q E Q ψ T ϕ ( X ) | F t Y , X | R t   = E Q E Q ψ T | F t Y , X ϕ ( X ) | R t = E Q ψ t ϕ ( X ) | R t .
Recalling the definition of  ψ t  (see Equation (A2)), we have
d ψ t = ψ t H ( Y t , X , t ) d Y t ,
from which it follows
ψ t = 1 + 0 t ψ s H ( Y s , X , s ) d Y s .
We continue processing Equation (A4), using Equation (A5), as
E Q ψ t ϕ ( X ) | R s = E Q 1 + 0 t ψ s H ( Y s , X , s ) d Y s ϕ ( X ) | R t   = E Q ϕ ( X ) | R t + E Q 0 t ψ s H ( Y s , X , s ) ϕ ( X ) d Y s | R t   = E Q ϕ ( X ) | R t + 0 t E Q ψ s H ( Y s , X , s ) ϕ ( X ) | R s d Y s ,
where to obtain the last equality we used Lemma 5.4 in [62]. We also recall that, under  Q , the process  ( Y t Y 0 )  is independent of X. Thus, since  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) )  and  d P d Q | F 0 Y , X = 1 , we obtain  E Q ϕ ( X ) | R t = E P [ ϕ ( X ) | R 0 ] . Concluding and rearranging:
π ^ t R , ϕ = π ^ 0 R , ϕ + 0 t π ^ s R , ϕ H ( Y s , · , s ) d Y s .
Obviously by the same arguments  π ^ t R , 1 = E Q [ d P d Q | R t ] = E Q ψ t | R t , and
π ^ t R , 1 = 1 + 0 t π ^ s R , H ( Y s , · , s ) d Y s .
From now on, for simplicity we assume that all the processes involved in our computations are 1-dimensional. The extension to the multidimensional case is trivial. First, let us notice that, by (A6) and Itô’s lemma, it holds
d ( π ^ t R , 1 1 ) = π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d Y s + π ^ t R , H ( Y t , · , t ) 2 π ^ t R , 1 3 d t .
Then, by the stochastic product rule,
d π t R , ψ = d π ^ t R , ϕ π ^ t R , 1 1 = π ^ t R , ϕ d ( π ^ t R , 1 1 ) + π ^ t R , 1 1 d π ^ t R , ϕ π ^ t R , ϕ H ( Y t , · , t ) π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d t = π ^ t R , ϕ π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d Y t + π ^ t R , ϕ π ^ t R , H ( Y t , · , t ) 2 π ^ t R , 1 3 d t + π ^ t R , ϕ H ( Y t , · , t ) π ^ t R , 1 d Y t π ^ t R , ϕ H ( Y t , · , t ) π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d t .
Recalling (A3) and rearranging the terms lead us to
d π t R , ψ = π t R , ϕ π t R , H ( Y t , · , t ) d Y t + π t R , ϕ π t R , H ( Y t , · , t ) 2 d t + π t R , ϕ H ( Y t , · , t ) d Y t π t R , ϕ H ( Y t , · , t ) π t R , H ( Y t , · , t ) d t = π t R , ϕ H ( Y t , · , t ) π t R , ϕ π t R , H ( Y t , · , t ) d Y t π t R , H ( Y t , · , t ) d t .

Appendix E. Proof of Theorem 5

The proof of this Theorem involves two separate parts. First, we should show the second equality in Equation (7), i.e.,  log d P # Y 0 s t , ϕ d P # Y 0 s t d P # ϕ d P # Y 0 s t , ϕ = E P log d P | R t d P | F t Y d P | σ ( ϕ ) . Then, we should prove that the right-hand side of Equation (7) is equal to Equation (8).

Appendix E.1. Part 1

We overload in this Section the notation adopted in the rest of the paper for sake of simplicity in exposition. A random variable X on a probability space  ( Ω , F , P )  is defined as a measurable mapping  X : Ω Ψ , where the measure space  ( Ψ , G )  satisfies the usual assumptions. To be precise, X is measurable with reference to  F  if  E G , X 1 ( E ) F , where  X 1 ( E ) = { ω Ω : X ( ω ) E } . Equivalently,  E G , S F : X ( S ) = E . Of all the possible sigma-algebras which allow measurability, the sigma algebra induced by the random variable,  σ ( X ) , is the smallest one. It can be shown that  σ ( X ) = X 1 ( G ) = { A = X 1 ( B ) | B G } . We also denote by  P # X : G [ 0 , 1 ]  the push-forward measure associated with X (i.e., the law), which is defined by the relation  P # X ( E ) = P ( X 1 ( E ) )  for any  E G . Moreover, for any  G -measurable  ϕ , the following integration rule holds
Ψ φ ( x ) d P # X ( x ) = Ω φ ( X ( ω ) ) d P ( ω ) .
Let us focus on  ( Ω , σ ( X ) , P )  and let us consider a new measure  Q  absolutely continuous with reference to  P . Radon–Nikodym theorem guarantees existence of a  σ ( X ) -measurable function  Z : Ω [ 0 , + )  (the “derivative”  d Q d P = Z ) such that  Q ( A ) = A Z d P , for all  A σ ( X ) . Moreover, by Doob’s measurability criterion (see, e.g., Lemma 1.14 in [63]), there exists a  G -measurable map  f : Ψ [ 0 , + )  such that  Z = f ( X ) . Then, for any  E G ,
Q # X ( E ) = Q ( X 1 ( E ) ) = X 1 ( E ) f ( X ) d P ( ω ) = Ω 1 X 1 ( E ) ( ω ) f ( X ( ω ) ) d P ( ω )   = Ω 1 E ( X ( ω ) ) f ( X ( ω ) ) d P ( ω ) = Ψ 1 E ( x ) f ( x ) d P # X ( x ) = E f ( x ) d P # X ( x ) .
In summary, we have that  d Q # X d P # X = f , with  f : Ψ [ 0 , + ) .
Finally, then,
Ψ log ( d P # X d Q # X ) d P # X = Ψ log ( f ) d P # X = Ω log ( f ( X ) ) d P = Ω log d P d Q d P = E P [ log d P d Q ] .
What discussed so far, allows to prove that
log d P # Y 0 s t , ϕ d P # Y 0 s t d P # ϕ d P # Y 0 s t , ϕ = E P log d P | R t d P | F t Y d P | σ ( ϕ ) .
Indeed,
  • Consider on the space  ( Ω , R t , P | R t )  the random variable  T = ( Y 0 s t , ϕ ) . By construction,  σ ( T ) = R t .
  • Suppose that  P | R t  is absolutely continuous with reference to  P | F t Y × P | σ ( ϕ ) .
  • Then the desired equality follows from Equation (A7).

Appendix E.2. Part 2

Before proceeding, remember that the following holds: for all  R t R t Q R | R t = Q R | R t .
We restart from the right-hand side of Equation (7). Thanks to the chain rule for Radon–Nykodim derivatives
log d P | R t d P | F t Y d P | σ ( ϕ ) = log d P | R t d Q R | R t d Q R | R t d P | F t Y d P | σ ( ϕ )   = log d P | R t d Q R | R t d Q R | F t Y d P | F t Y d Q R | R t d Q R | F t Y d P | σ ( ϕ )   = log d P | R t d Q R | R t d Q F Y | F t Y d P | F t Y d Q R | R t d Q R | F t Y d P | σ ( ϕ )   = log ψ t R ( ψ t F Y ) 1 d Q R | F t Y d Q R | F t Y d P | σ ( ϕ )   = log ψ t R log ψ t F Y + log d Q R | R t d Q R | F t Y d Q R | σ ( ϕ ) ,
where we used Theorem 3 to make  ψ t R  and  ψ t F Y  appear, and the fact that  d Q R | σ ( ϕ ) = d P | σ ( ϕ ) .
Consequently
E P log d P | R t d P | F t Y d P | σ ( ϕ ) = E P log ψ t R log ψ t F Y + I ( Y 0 ; ϕ )   = E P 0 t E P [ h ( Y s , X , s ) | R s ] d W s R + 1 2 0 t | | E P [ h ( Y s , X , s ) | R s ] | | 2 d s   E P 0 t E P [ h ( Y s , X , s ) | F s Y ] d W s F Y + 1 2 0 t | | E P [ h ( Y s , X , s ) | F s Y ] | | 2 d s + I ( Y 0 ; ϕ )   = 1 2 E P 0 t | | E P [ h ( Y s , X , s ) | R s ] | | 2 | | E P [ h ( Y s , X , s ) | F s Y ] | | 2 d s + I ( Y 0 ; ϕ ) .
Actually, the result in the main is in a slightly different form. To show equivalence, it is necessary to prove that
E P | | E P [ h ( Y s , X , s ) | F s Y ] | | 2 2 E P E P [ h ( Y s , X , s ) | F s Y ] E P [ h ( Y s , X , s ) | R s ]   = E P | | E P [ h ( Y s , X , s ) | F s Y ] | | 2
which is trivially true since  E P · | F t Y = E P E P · | R s | F t Y . ☐

Appendix F. Proof of Theorem 6

Appendix F.1. Proof of Equation (9)

The inequality is proven considering that (i)
I ( Y 0 s t ; ϕ ) = E P | F t Y × P | σ ( ϕ ) η d P | R t d P | F t Y d P | σ ( ϕ )
and
I ( Y ˜ t ; ϕ ) = E P | σ ( Y ˜ t ) × P | σ ( ϕ ) η d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) = E P | F t Y × P | σ ( ϕ ) η d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) ,
with  η ( x ) = x log x , (ii) that  d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) = E P | F t Y × P | σ ( ϕ ) d P | R t d P | F t Y d P | σ ( ϕ ) | σ ( Y ˜ t , ϕ )  and (iii) that Jensen’s inequality holds ( η  is convex on its domain)
E P | F t Y × P | σ ( ϕ ) η d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) = E P | F t Y × P | σ ( ϕ ) η E P | F t Y × P | σ ( ϕ ) d P | R t d P | F t Y d P | σ ( ϕ ) | σ ( Y ˜ t , ϕ ) E P | F t Y × P | σ ( ϕ ) E P | F t Y × P | σ ( ϕ ) η d P | R t d P | F t Y d P | σ ( ϕ ) | σ ( Y ˜ t , ϕ ) = E P | F t Y × P | σ ( ϕ ) η d P | R t d P | F t Y d P | σ ( ϕ ) .

Appendix F.2. Proof of Conditional Independence and Mutual Information Equality

Formally the condition of conditional independence given  π  is satisfied if for any  a 1 , a 2  positive random variables which are, respectively,  σ ( X )  and  F t Y  measurable, the following holds:  E P [ a 1 a 2 | σ ( π t ) ] = E P [ a 1 | σ ( π t ) ] E P [ a 2 | σ ( π t ) ]  (see for instance [64]).
The sigma-algebra  σ ( π t )  is by definition the smallest one that makes  π t  measurable. Since  π t  is  F t Y  measurable, clearly  σ ( π t ) F t Y . By the very definition of conditional expectation,  E P [ a 1 | F t Y ] = π t , a 1 , which is an  σ ( π t )  measurable quantity. Then,  E P [ a 1 a 2 | σ ( π t ) ] = E P [ E P [ a 1 a 2 | F t Y ] | σ ( π t ) ] = E P [ E P [ a 1 | F t Y ] a 2 | σ ( π t ) ] = E P [ E P [ π t , a 1 a 2 | σ ( π t ) ] = π t , a 1 E P [ a 2 | σ ( π t ) ] . Since  π t , a 1 = E P [ π t , a 1 | σ ( π t ) ] = E P [ E P [ a 1 | F t Y ] | σ ( π t ) ] = E P [ a 1 | σ ( π t ) ] , the proof of conditional independence is concluded.
In summary, $\sigma(X)$ and $\mathcal{F}^Y_t$ are conditionally independent given $\sigma(\pi_t)\subseteq\mathcal{F}^Y_t$. This implies that $\mathbb{P}(A\,|\,\sigma(\pi_t)) = \mathbb{P}(A\,|\,\mathcal{F}^Y_t)$ for all $A\in\sigma(X)$, or equivalently $\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\mathcal{F}^Y_t]$. To prove this, it is sufficient to show that, for any $B\in\mathcal{F}^Y_t$,
$$
\mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{1}(B)\big] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)].
$$
By standard properties of conditional expectation
$$
\mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{1}(B)\big] = \mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{E}_{\mathbb{P}}[\mathbb{1}(B)\,|\,\sigma(\pi_t)]\big].
$$
Due to conditional independence, $\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{E}_{\mathbb{P}}[\mathbb{1}(B)\,|\,\sigma(\pi_t)] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)\,|\,\sigma(\pi_t)]$. Then, $\mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{E}_{\mathbb{P}}[\mathbb{1}(B)\,|\,\sigma(\pi_t)]\big] = \mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)\,|\,\sigma(\pi_t)]\big] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)]$.
The mutual information equality is then proved considering that $\frac{d\,\mathbb{P}|_{\mathcal{R}_t}}{d\,\mathbb{P}|_{\mathcal{F}^Y_t}\,d\,\mathbb{P}|_{\sigma(\phi)}} = \frac{d\,\mathbb{P}(\omega_x\,|\,\mathcal{F}^Y_t)}{d\,\mathbb{P}(\omega_x)}$, since the conditional probabilities exist, and that $\mathbb{P}(\omega_x\,|\,\mathcal{F}^Y_t) = \mathbb{P}(\omega_x\,|\,\sigma(\pi_t))$. ☐

Appendix G. A Technical Note

As anticipated in the main text, Assumption 1 might be incompatible with the other technical assumptions in Appendix A. The problem can arise from singularities in the drift term at time $t = T$, which are usually present in the construction of dynamics satisfying Assumption 1, such as stochastic bridges. This mathematical subtlety can be interpreted more clearly by noticing that, when Assumption 1 is satisfied, the posterior process $\pi_t$ at time $T$ can concentrate on a portion of the space of lower dimensionality than at any earlier time $T-\epsilon$, $\epsilon>0$. Alternatively, notice that if Assumption 1 is satisfied, then $I(Y_{0\le s\le T}; V) = I(V;V)$, which can be infinite depending on the actual structure of $\mathcal{S}$ and the mapping $V$. In many cases, a simple technical solution is to restrict the analysis to the dynamics of the process on the time interval $[0,T)$. (This is akin to the discussion of arbitrage strategies in finance when the initial filtration is augmented with knowledge of the future value at certain time instants: the new process, adapted to the enlarged filtration, is a martingale with respect to a suitable new measure for all $t\in[0,T)$, but fails to be so at $t = T$, thus giving an arbitrage opportunity.) In the reduced time interval $[0,T)$, the technical assumptions are generally shown to be satisfied. For the practical purposes explored in this work, this restriction makes no difference, and we consequently neglect it for the rest of our discussion.

Appendix H. Linear Diffusion Models

Consider the particular case of linear generative diffusion models [14], which are widely adopted in the literature and by practitioners. This corresponds to the particular case of Equation (11) in which the function F is linear:
$$
\hat{Y}_t = \hat{Y}_0 - \alpha\int_0^t \hat{Y}_s\, ds + \hat{W}_t, \tag{A8}
$$
for a given $\alpha\ge 0$. We again assume that Assumption 1 holds, which implies that we should select $\hat{Y}_0 = Y_T = V$. The parameter $\alpha$ dictates the behavior of the SDE, which can be cast into the so-called variance-preserving and variance-exploding schedules of diffusion models [14]. In diffusion-model jargon, Equation (A8) is typically referred to as a noising process: indeed, $\hat{Y}_t$ evolves into a noisier and noisier version of $V$ as $t$ grows. In particular, it holds that
$$
\hat{Y}_t = \exp(-\alpha t)\, V + \exp(-\alpha t)\int_0^t \exp(\alpha s)\, d\hat{W}_s.
$$
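To make the noising dynamics concrete, the short numerical sketch below (our own illustrative example, not part of the original experiments; the two-point "data" distribution and all parameter values are arbitrary choices) simulates Equation (A8) with Euler–Maruyama and compares the resulting terminal moments with the closed-form expression above.

```python
import numpy as np

# Illustrative check of the closed-form solution of the linear noising SDE
# dY = -alpha * Y dt + dW started at Y_0 = V (toy 1-D setting, arbitrary parameters).
rng = np.random.default_rng(0)
alpha, T, n_steps, n_paths = 1.0, 1.0, 1000, 10_000
dt = T / n_steps
V = rng.choice([-2.0, 2.0], size=n_paths)            # toy "data" distribution

Y = V.copy()
for _ in range(n_steps):                              # Euler-Maruyama discretisation of (A8)
    Y += -alpha * Y * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

# Closed form: Y_T = exp(-alpha T) V + sigma_T * eps, with
# sigma_T^2 = (1 - exp(-2 alpha T)) / (2 alpha) and eps ~ N(0, 1).
sigma_T = np.sqrt((1 - np.exp(-2 * alpha * T)) / (2 * alpha))
Y_exact = np.exp(-alpha * T) * V + sigma_T * rng.standard_normal(n_paths)

print(Y.mean(), Y.var())          # the two pairs of moments should approximately agree
print(Y_exact.mean(), Y_exact.var())
```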
The next result is a particular case of Theorem 7.
Lemma A1. 
Consider the stochastic process $\hat{Y}_t$ which solves Equation (A8). The time-reversed process $Y_t = \hat{Y}_{T-t}$ admits an $\mathcal{F}^Y_t$-adapted representation
$$
Y_t = Y_0 + \int_0^t \left(\alpha Y_s + 2\alpha\,\frac{\exp(-\alpha(T-s))\,\mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s)] - Y_s}{1-\exp(-2\alpha(T-s))}\right) ds + W_t, \tag{A9}
$$
where $Y_0 = \exp(-\alpha T)\,V + \sqrt{\frac{1-\exp(-2\alpha T)}{2\alpha}}\;\epsilon$, with $\epsilon$ a standard Gaussian random variable independent of $V$ and $W_t$.
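As an illustration of Lemma A1, the following minimal sketch (our own simplifying assumption: $V$ is one-dimensional Gaussian, so that $\mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s)]$ is available in closed form; this is not the paper's image setting) integrates Equation (A9) with Euler–Maruyama and checks that the terminal marginal of $Y$ matches the law of $V$:

```python
import numpy as np

# Toy sampler for the F^Y-adapted representation of Lemma A1, with Gaussian V
# so that the conditional expectation E[V | Y_s] is analytic (illustration only).
rng = np.random.default_rng(1)
alpha, T, sigma_V = 1.0, 1.0, 1.5
n_steps, n_paths = 2000, 20_000
dt = T / n_steps

def post_mean_V(y, s):
    """E[V | Y_s = y], since Y_s = a*V + sqrt(v)*noise with Gaussian V."""
    a = np.exp(-alpha * (T - s))
    v = (1 - np.exp(-2 * alpha * (T - s))) / (2 * alpha)
    return a * sigma_V**2 / (a**2 * sigma_V**2 + v) * y

# Initial condition of Lemma A1
V = sigma_V * rng.standard_normal(n_paths)
Y = np.exp(-alpha * T) * V \
    + np.sqrt((1 - np.exp(-2 * alpha * T)) / (2 * alpha)) * rng.standard_normal(n_paths)

for k in range(n_steps - 1):                  # stop one step before t = T (drift singularity)
    s = k * dt
    score_like = (np.exp(-alpha * (T - s)) * post_mean_V(Y, s) - Y) / (1 - np.exp(-2 * alpha * (T - s)))
    Y += (alpha * Y + 2 * alpha * score_like) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

print(Y.std())   # ~ sigma_V: the terminal marginal approximately recovers the law of V
```

The drift blows up as $t\to T$ (cf. Appendix G), which is why the integration stops one step early.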
As discussed in the main paper, we can now show that the same generative dynamics can be obtained under the NLF framework we present in this work, without the need to explicitly define a backward and a forward process. In particular, we can directly select an observation function that corresponds to an Ornstein–Uhlenbeck bridge [65,66], consequently satisfying Assumption 1, and obtain the generative dynamics of classical diffusion models. In particular, we consider the following about H (notice that with H selected as in Assumption A6, the validity of the theory is restricted to the time interval $[0, T)$; see also Appendix G):
Assumption A6. 
The function H in Equation (1) is selected to be of the linear form
$$
H(Y_t, X, t) = m_t\, V - \frac{d\log m_t}{dt}\, Y_t,
$$
with $m_t = \frac{\alpha}{\sinh(\alpha(T-t))}$, where $\alpha\ge 0$. When $\alpha = 0$, $m_t = \frac{d\log m_t}{dt} = \frac{1}{T-t}$. Furthermore, $Y_0$ is selected as in Theorem 7. Under this assumption, $Y_T = V$, $\mathbb{P}$-a.s., i.e., Assumption 1 is satisfied (see Appendix I).
In summary, the particular case of Equation (1) (which is $\mathcal{F}^{Y,X}$-adapted) under Assumption A6 can be transformed into a generative model by leveraging Theorem 2, since Assumption 1 holds. When doing so, we obtain that the process $Y_t$ has the $\mathcal{F}^Y$-adapted representation
$$
Y_t = Y_0 + \int_0^t m_s\,\mathbb{E}_{\mathbb{P}}[V\,|\,\mathcal{F}^Y_s]\, ds - \int_0^t \frac{d\log m_s}{ds}\, Y_s\, ds + W^{F^Y}_t,
$$
which is nothing but Equation (A9) after some simple algebraic manipulation (spelled out below). The only detail worth deeper exposition is the actual computation of the conditional expectation of interest. If $\mathbb{P}$ is selected such that $\hat{Y}_t$ solves Equation (A8), we have that
$$
\mathbb{E}_{\mathbb{P}}[V\,|\,\mathcal{F}^Y_t] = \mathbb{E}_{\mathbb{P}}\big[Y_T\,\big|\,\sigma(Y_{0\le s\le t})\big] = \mathbb{E}_{\mathbb{P}}\big[\hat{Y}_0\,\big|\,\sigma(\hat{Y}_{T-t\le s\le T})\big] = \mathbb{E}_{\mathbb{P}}\big[\hat{Y}_0\,\big|\,\sigma(\hat{Y}_{T-t})\big] = \mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_t)],
$$
where the second to last equality is due to the Markov nature of  Y ^ t .
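For completeness, the simple algebraic manipulation mentioned above reduces to two elementary identities, which we spell out here for the reader's convenience (using only the definition $m_s = \alpha/\sinh(\alpha(T-s))$):
$$
\frac{2\alpha\, e^{-\alpha(T-s)}}{1-e^{-2\alpha(T-s)}} = \frac{2\alpha}{e^{\alpha(T-s)}-e^{-\alpha(T-s)}} = \frac{\alpha}{\sinh(\alpha(T-s))} = m_s,
\qquad
\alpha - \frac{2\alpha}{1-e^{-2\alpha(T-s)}} = -\alpha\coth(\alpha(T-s)) = -\frac{d\log m_s}{ds},
$$
so that the drift of Equation (A9), $\alpha Y_s + 2\alpha\,\frac{\exp(-\alpha(T-s))\,\mathbb{E}_{\mathbb{P}}[V|\sigma(Y_s)] - Y_s}{1-\exp(-2\alpha(T-s))}$, coincides with $m_s\,\mathbb{E}_{\mathbb{P}}[V\,|\,\mathcal{F}^Y_s] - \frac{d\log m_s}{ds}\,Y_s$ once the conditional expectation is rewritten as above.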
Moreover, in this particular case the mutual information $I(Y_{0\le s\le t};\phi) = I(Y_t;\phi)$ (where we removed the past of $Y$ since the Markov chain $\phi \to \hat{Y}_0 \to \hat{Y}_{t>0}$ holds) can be expressed in the simpler form
$$
I(Y_t;\phi) = I(Y_0;\phi) + \frac12\,\mathbb{E}_{\mathbb{P}}\!\left[\int_0^t m_s^2\,\big\|\mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s)] - \mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s,\phi)]\big\|^2\, ds\right], \tag{A10}
$$
matching the result described in [53], obtained with the formalism of time reversal of SDEs.

Appendix I. Discussion About Assumption A6

This is easily checked thanks to the following equality
$$
Y_t = Y_0\,\frac{m_0}{m_t} + V\,\frac{m_0}{m_{T-t}} + \int_0^t \frac{m_s}{m_t}\, dW_s. \tag{A11}
$$
To avoid cluttering the notation, we define $f_t = \frac{d\log m_t}{dt}$. To show that Equation (A11) is true, it is sufficient to observe (i) that the initial conditions are met and (ii) that the time differential of the process is the correct one. We proceed to show that the second condition holds (the first one is trivially true).
$$
\begin{aligned}
dY_t &= -\alpha Y_0\,\frac{\cosh(\alpha(T-t))}{\sinh(\alpha T)}\,dt + \alpha r(X)\,\frac{\cosh(\alpha t)}{\sinh(\alpha T)}\,dt - \alpha\cosh(\alpha(T-t))\left(\int_0^t \frac{1}{\sinh(\alpha(T-s))}\,dW_s\right) dt + dW_t\\
&= -\frac{\alpha\cosh(\alpha(T-t))}{\sinh(\alpha(T-t))}\left(Y_0\,\frac{\sinh(\alpha(T-t))}{\sinh(\alpha T)} + \int_0^t \frac{\sinh(\alpha(T-t))}{\sinh(\alpha(T-s))}\,dW_s\right) dt + \alpha r(X)\,\frac{\cosh(\alpha t)}{\sinh(\alpha T)}\,dt + dW_t\\
&= -\alpha\coth(\alpha(T-t))\left(Y_t - r(X)\,\frac{\sinh(\alpha t)}{\sinh(\alpha T)}\right) dt + \alpha r(X)\,\frac{\cosh(\alpha t)}{\sinh(\alpha T)}\,dt + dW_t\\
&= -f_t\, Y_t\, dt + \alpha r(X)\,\frac{\coth(\alpha(T-t))\sinh(\alpha t) + \cosh(\alpha t)}{\sinh(\alpha T)}\,dt + dW_t\\
&= -f_t\, Y_t\, dt + m_t\, r(X)\, dt + dW_t,
\end{aligned}
$$
where the result is obtained considering that
$$
\begin{aligned}
\frac{\coth(\alpha(T-t))\sinh(\alpha t) + \cosh(\alpha t)}{\sinh(\alpha T)}
&= \frac{\dfrac{e^{\alpha(T-t)}+e^{-\alpha(T-t)}}{e^{\alpha(T-t)}-e^{-\alpha(T-t)}}\left(e^{\alpha t}-e^{-\alpha t}\right) + e^{\alpha t}+e^{-\alpha t}}{e^{\alpha T}-e^{-\alpha T}}\\[4pt]
&= \frac{\dfrac{e^{\alpha T}-e^{\alpha(T-2t)}+e^{-\alpha(T-2t)}-e^{-\alpha T}}{e^{\alpha(T-t)}-e^{-\alpha(T-t)}} + e^{\alpha t}+e^{-\alpha t}}{e^{\alpha T}-e^{-\alpha T}}\\[4pt]
&= \frac{e^{\alpha T}-e^{\alpha(T-2t)}+e^{-\alpha(T-2t)}-e^{-\alpha T} + e^{\alpha T}+e^{\alpha(T-2t)}-e^{-\alpha(T-2t)}-e^{-\alpha T}}{\left(e^{\alpha(T-t)}-e^{-\alpha(T-t)}\right)\left(e^{\alpha T}-e^{-\alpha T}\right)}\\[4pt]
&= \frac{2}{e^{\alpha(T-t)}-e^{-\alpha(T-t)}} = \frac{1}{\sinh(\alpha(T-t))} = \frac{m_t}{\alpha}.
\end{aligned}
$$
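As a quick sanity check of the identity above (our own verification script, not part of the paper), the hyperbolic manipulation can also be confirmed symbolically:

```python
import sympy as sp

# Symbolic verification of: coth(a(T-t))*sinh(a t) + cosh(a t) = sinh(a T)/sinh(a(T-t))
a, T, t = sp.symbols('alpha T t', positive=True)
lhs = sp.coth(a * (T - t)) * sp.sinh(a * t) + sp.cosh(a * t)
rhs = sp.sinh(a * T) / sp.sinh(a * (T - t))
print(sp.simplify((lhs - rhs).rewrite(sp.exp)))   # expected output: 0
```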

Appendix J. Experimental Details

Appendix J.1. Dataset Details

The Shapes3D dataset [57] includes the following attributes and the number of classes for each, as shown in Table A1.
Table A1. Attributes and class counts in the Shapes3D dataset.

Attribute       Number of Classes
Floor hue       10
Object hue      10
Orientation     15
Scale           8
Shape           4
Wall hue        10

Appendix J.2. Unconditional Diffusion Model Training

We train the unconditional denoising score network using the NCSN++ architecture [14], which corresponds to a U-Net [58]. The model is trained from scratch using the score-matching objective. The training hyperparameters are summarized in Table A2.
Table A2. Hyperparameters for unconditional diffusion model training.

Parameter                    Value
Epochs                       100
Batch size                   256
Learning rate                1 × 10⁻⁴
Optimizer                    AdamW [67]
  β₁                         0.95
  β₂                         0.999
  Weight decay               1 × 10⁻⁶
  Epsilon                    1 × 10⁻⁸
Learning rate scheduler      Cosine annealing with warmup
  Warmup steps               500
Gradient clipping            1.0
EMA decay                    0.9999
Mixed precision              FP16
Scheduler                    Variance Exploding [14]
  σ_min                      0.01
  σ_max                      90
Loss function                Denoising score matching [14]

Appendix J.3. Linear Probing Experiment Details

In the linear probing experiments, we train a linear classifier on the feature maps extracted from the denoising score network at various noise levels  τ . The training details are provided in Table A3.
Table A3. Hyperparameters for linear probing experiments.

Parameter           Value
Batch size          64
Loss function       Cross-Entropy Loss
Optimizer           Adam [68]
Learning rate       1 × 10⁻⁶ for τ = 0.9 or τ = 0.99; 1 × 10⁻⁴ for all other τ values
Number of epochs    30
Inputs              Feature maps (used as-is in the linear layer); noisy images (scaled to [−1, +1])
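Schematically, the probing protocol of Table A3 amounts to fitting a single linear layer on frozen features. The self-contained sketch below uses random tensors as stand-ins for the denoiser feature maps (the real pipeline extracts them from the frozen NCSN++ at noise level τ), so only the training loop itself should be read as indicative:

```python
import torch
import torch.nn as nn

# Linear-probe training loop (toy stand-ins: random "feature maps" and labels).
torch.manual_seed(0)
num_classes, feature_dim, n_train = 10, 256, 4096
feats = torch.randn(n_train, feature_dim)        # stand-in for frozen U-Net features at noise level tau
labels = torch.randint(0, num_classes, (n_train,))

probe = nn.Linear(feature_dim, num_classes)      # the only trainable module
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(30):                          # 30 epochs, batch size 64, as in Table A3
    for i in range(0, n_train, 64):
        x, y = feats[i:i + 64], labels[i:i + 64]
        loss = loss_fn(probe(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```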

Appendix J.4. Mutual Information Estimation Experiment Details

For mutual information estimation, we train a conditional diffusion model using the same NCSN++ architecture as before. The conditioning is incorporated through a distinct class embedding for each label present in the input image, which is added to the input embedding together with the timestep embedding. The hyperparameters are the same as those used for the unconditional diffusion model (see Table A2).
To calculate the mutual information, we use Equation (A10), estimating the integral using the midpoint rule with 999 points uniformly spaced in  [ 0 , T ] .
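To illustrate the midpoint-rule computation, consider the following toy one-dimensional example of our own, where $V$ is a $\pm 1$ label, $\phi = V$, and both conditional expectations in the integrand are analytic; in the actual experiments they are replaced by the outputs of the unconditional and conditional denoising networks:

```python
import numpy as np

# Midpoint-rule estimate of the MI integral in a toy setting where
# E[V | Y_s] = tanh(a*y/v) and E[V | Y_s, phi] = phi = V are known in closed form.
rng = np.random.default_rng(0)
alpha, T, n_points, n_mc = 1.0, 5.0, 999, 4000

def mi_rate(s):
    """Monte Carlo estimate of (1/2) m_s^2 E|| E[V|Y_s] - E[V|Y_s, phi] ||^2."""
    a = np.exp(-alpha * (T - s))                       # Y_s = a*V + sqrt(v)*noise
    v = (1 - np.exp(-2 * alpha * (T - s))) / (2 * alpha)
    m_sq = (alpha / np.sinh(alpha * (T - s))) ** 2     # m_s^2
    V = rng.choice([-1.0, 1.0], size=n_mc)
    Y = a * V + np.sqrt(v) * rng.standard_normal(n_mc)
    return 0.5 * m_sq * np.mean((np.tanh(a * Y / v) - V) ** 2)

midpoints = (np.arange(n_points) + 0.5) * T / n_points
mi = sum(mi_rate(s) for s in midpoints) * T / n_points  # midpoint rule on [0, T]
print(mi)   # should be close to log(2) ~ 0.693 nats, since I(Y_0; phi) is negligible here
```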
Figure A1. Visualization of the forking experiment with num_forks = 4 and one initial seed. The image at time τ = 0.4 is quite noisy. In the final generations after forking, the images exhibit coherence in the labels: shape, wall hue, floor hue, and object hue. However, there is variation in orientation and scale.

Appendix J.5. Forking Experiment Details

In the forking experiments, we use a ResNet50 [61] model with an additional linear layer, trained from scratch, to classify the generated images and assess label coherence across forks. The training details for the classifier are summarized in Table A4.
Table A4. Hyperparameters for the classifier in forking experiments.

Parameter           Value
Image size          224 (resized with bilinear interpolation)
Image scaling       [−1, +1]
Dataset split       Training set: 72%; validation set: 8%; test set: 20%
Early stopping      Stop when validation accuracy exceeds 99%; evaluated every 1000 steps
Number of epochs    1
Optimizer           Adam [68]
Learning rate       1 × 10⁻⁴
During the sampling process of the forking experiment, we use the settings summarized in Table A5.
Table A5. Sampling settings for the forking experiments.

Parameter                      Value
Stochastic predictor           Euler–Maruyama method with 1000 steps
Corrector                      Langevin dynamics with 1 step
Signal-to-noise ratio (SNR)    0.06
Number of forks (k)            100
Number of seeds                10 (independent initial noise samples)
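As a schematic illustration of the forking procedure summarized above (again a one-dimensional toy of our own, where the binary "label" is simply the sign of the generated sample and the analytic posterior mean plays the role of the learned denoiser): integrate the generative SDE up to an intermediate time τ, duplicate the state into num_forks copies, continue each copy with independent noise, and check whether the label agrees across forks.

```python
import numpy as np

# Toy forking experiment: branch the generative SDE at time tau and inspect
# label coherence (here the label is sign(Y_T)) across the forks.
rng = np.random.default_rng(2)
alpha, T, n_steps, num_forks = 1.0, 1.0, 1000, 4
tau, sigma_V = 0.4 * T, 1.0
dt = T / n_steps

def post_mean_V(y, s):                      # E[V | Y_s = y] for Gaussian V (stand-in for the denoiser)
    a = np.exp(-alpha * (T - s))
    v = (1 - np.exp(-2 * alpha * (T - s))) / (2 * alpha)
    return a * sigma_V**2 / (a**2 * sigma_V**2 + v) * y

def drift(y, s):
    return alpha * y + 2 * alpha * (np.exp(-alpha * (T - s)) * post_mean_V(y, s) - y) \
        / (1 - np.exp(-2 * alpha * (T - s)))

# one seed: integrate a single path up to time tau ...
y = np.sqrt(np.exp(-2 * alpha * T) * sigma_V**2
            + (1 - np.exp(-2 * alpha * T)) / (2 * alpha)) * rng.standard_normal()
s = 0.0
while s < tau:
    y += drift(y, s) * dt + np.sqrt(dt) * rng.standard_normal()
    s += dt

# ... then fork and continue each copy with independent Brownian increments
forks = np.full(num_forks, y)
while s < T - dt:                           # stop one step before T (drift singularity)
    forks += drift(forks, s) * dt + np.sqrt(dt) * rng.standard_normal(num_forks)
    s += dt

print(np.sign(forks))                       # agreement across forks <=> the "label" was already decided at tau
```

Repeating this for a grid of τ values and many seeds, and replacing sign(·) with the ResNet50 classifier, yields the label-coherence (entropy across forks) curves reported in the main text.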

Appendix J.6. Linear Probing on Raw Data

In Figure A2, we evaluate the performance of linear probes trained on feature maps extracted from the denoiser network, and compare their log-probability accuracy with that of a linear probe trained on the raw, noisy input and of a random-guessing baseline. Throughout the generative process, linear probes obtain higher accuracy than the baselines: for large noise levels, a linear probe on raw input data fails, whereas the inner layers of the denoising network extract features that are sufficient to discern latent labels.
Figure A2. Log-probability accuracy of linear classifiers at τ. 'Feature map' classifiers are trained on network features; 'Noisy Image' classifiers are trained on noisy images; 'Random Guess' is the random-guessing baseline.

Appendix J.7. Additional Experiments on CelebA Dataset

We present results obtained on the CelebA dataset [69], which consists of over 200,000 celebrity images annotated with 40 binary attributes. We focus our analysis on the attributes "Male" and "Eyeglasses", as these are (i) among the most reliably and objectively labeled features in the CelebA dataset (this is supported by previous work, which highlights significant labeling issues for many other attributes, making them less suitable for consistent analysis [70]) and (ii) representative examples of attributes that map to more global and more local features, respectively. The unconditional and conditional diffusion models were trained using the same architecture, optimization, and training hyperparameters as Song et al. [14]. Both models employ a variance-exploding diffusion process with a U-Net backbone for the denoising score network. Training details, including the learning rate, batch size, and noise schedule, are the same as in Song et al. [14]. We present a comprehensive analysis of the results derived from probing experiments, mutual information (MI) estimation, and the rate of increase in MI across the generative process.
Figure A3. Probing accuracy and mutual information (MI) as a function of the noise intensity parameter  τ .
Probing vs. MI. Our results, shown in Figure A3, illustrate the consistent joint growth of classifier accuracy (probing performance) and mutual information as a function of the noise intensity parameter τ. For both attributes, probing accuracy increases steadily, mirroring the growth of MI.
Figure A4. Mutual information (MI) growth for “Male” and “Eyeglasses” attributes across the generative process.
Mutual Information Across Labels. Figure A4 compares MI growth across the "Male" and "Eyeglasses" attributes. A key observation is that the MI for "Male" rises earlier than for "Eyeglasses", beginning at τ = 0.2 compared to τ = 0.3. This aligns with the intuition that some latent abstractions emerge earlier in the generative process than others, given that the average number of pixels affected by global features is larger than that affected by local ones.
Figure A5. Rate of change of mutual information (MI) for “Male” and “Eyeglasses” attributes as a function of  τ .
Rate of Increase in MI. To further investigate the dynamics, we plot Δ(MI)/Δτ, the rate of change of MI, for the two attributes (Figure A5). "Male" exhibits a significantly faster initial growth rate than "Eyeglasses", peaking around τ = 0.4. This confirms the earlier emergence of "Male" as a latent abstraction, with a sharp rise in MI during the early stages. In contrast, the MI for "Eyeglasses" grows more gradually, reflecting a slower but steady emergence of this attribute.

References

  1. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 8780–8794. [Google Scholar]
  2. Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D.P.; Poole, B.; Norouzi, M.; Fleet, D.J.; et al. Imagen Video: High Definition Video Generation with Diffusion Models. 2022. Available online: https://imagen.research.google/video/paper.pdf (accessed on 21 February 2025).
  3. He, Y.; Yang, T.; Zhang, Y.; Shan, Y.; Chen, Q. Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths. arXiv 2022, arXiv:2211.13221. [Google Scholar]
  4. Li, X.L.; Thickstun, J.; Gulrajani, I.; Liang, P.; Hashimoto, T. Diffusion-LM Improves Controllable Text Generation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Oh, A.H., Agarwal, A., Belgrave, D., Cho, K., Eds. [Google Scholar]
  5. He, Z.; Sun, T.; Tang, Q.; Wang, K.; Huang, X.; Qiu, X. DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1. Long Papers. [Google Scholar]
  6. Gulrajani, I.; Hashimoto, T. Likelihood-Based Diffusion Language Models. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  7. Lou, A.; Meng, C.; Ermon, S. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  8. Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  9. Liu, J.; Li, C.; Ren, Y.; Chen, F.; Zhao, Z. DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11020–11028. [Google Scholar]
  10. Trippe, B.L.; Yim, J.; Tischer, D.; Baker, D.; Broderick, T.; Barzilay, R.; Jaakkola, T. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv 2022, arXiv:2206.04119. [Google Scholar]
  11. Hoogeboom, E.; Satorras, V.G.; Vignac, C.; Welling, M. Equivariant Diffusion for Molecule Generation in 3D. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Proceedings of Machine Learning Research. Volume 162, pp. 8867–8887. [Google Scholar]
  12. Luo, S.; Hu, W. Diffusion Probabilistic Models for 3D Point Cloud Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2837–2845. [Google Scholar]
  13. Zeng, X.; Vahdat, A.; Williams, F.; Gojcic, Z.; Litany, O.; Fidler, S.; Kreis, K. LION: Latent Point Diffusion Models for 3D Shape Generation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  14. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  15. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  16. Albergo, M.S.; Boffi, N.M.; Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv 2023, arXiv:2303.08797. [Google Scholar]
  17. Chen, Y.; Viégas, F.; Wattenberg, M. Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model. arXiv 2023, arXiv:2306.05720. [Google Scholar]
  18. Linhardt, L.; Morik, M.; Bender, S.; Borras, N.E. An Analysis of Human Alignment of Latent Diffusion Models. arXiv 2024, arXiv:2403.08469. [Google Scholar]
  19. Tang, L.; Jia, M.; Wang, Q.; Phoo, C.P.; Hariharan, B. Emergent correspondence from image diffusion. Adv. Neural Inf. Process. Syst. 2023, 36, 1363–1389. [Google Scholar]
  20. Berner, J.; Richter, L.; Ullrich, K. An optimal control perspective on diffusion-based generative modeling. arXiv 2022, arXiv:2211.01364. [Google Scholar]
  21. Richter, L.; Berner, J. Improved sampling via learned diffusions. arXiv 2023, arXiv:2307.01198. [Google Scholar]
  22. Ye, M.; Wu, L.; Liu, Q. First hitting diffusion models for generating manifold, graph and categorical data. Adv. Neural Inf. Process. Syst. 2022, 35, 27280–27292. [Google Scholar]
  23. Raginsky, M. A variational approach to sampling in diffusion processes. arXiv 2024, arXiv:2405.00126. [Google Scholar]
  24. Bain, A.; Crisan, D. Fundamentals of Stochastic Filtering; Springer: Berlin/Heidelberg, Germany, 2009; Volume 3. [Google Scholar]
  25. Van Handel, R. Filtering, Stability, and Robustness. Ph.D. Thesis, California Institute of Technology, Pasadena, CA, USA, 2007. [Google Scholar]
  26. Kutschireiter, A.; Surace, S.C.; Pfister, J.P. The Hitchhiker’s guide to nonlinear filtering. J. Math. Psychol. 2020, 94, 102307. [Google Scholar]
  27. Bisk, Y.; Holtzman, A.; Thomason, J.; Andreas, J.; Bengio, Y.; Chai, J.; Lapata, M.; Lazaridou, A.; May, J.; Nisnevich, A.; et al. Experience grounds language. arXiv 2020, arXiv:2004.10151. [Google Scholar]
  28. Bender, E.M.; Koller, A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5185–5198. [Google Scholar]
  29. Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv 2022, arXiv:2210.13382. [Google Scholar]
  30. Park, Y.H.; Kwon, M.; Choi, J.; Jo, J.; Uh, Y. Understanding the latent space of diffusion models through the lens of Riemannian geometry. Adv. Neural Inf. Process. Syst. 2023, 36, 24129–24142. [Google Scholar]
  31. Kwon, M.; Jeong, J.; Uh, Y. Diffusion Models already have a Semantic Latent Space. arXiv 2023, arXiv:2210.10960. [Google Scholar]
  32. Xiang, W.; Yang, H.; Huang, D.; Wang, Y. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15802–15812. [Google Scholar]
  33. Haas, R.; Huberman-Spiegelglas, I.; Mulayoff, R.; Graßhof, S.; Brandt, S.S.; Michaeli, T. Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models. arXiv 2024, arXiv:2303.11073. [Google Scholar]
  34. Sclocchi, A.; Favero, A.; Wyart, M. A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data. arXiv 2024, arXiv:2402.16991. [Google Scholar]
  35. Raya, G.; Ambrogioni, L. Spontaneous symmetry breaking in generative diffusion models. J. Stat. Mech. Theory Exp. 2024, 2024, 104025. [Google Scholar]
  36. Ambrogioni, L. The statistical thermodynamics of generative diffusion models. arXiv 2023, arXiv:2310.17467. [Google Scholar]
  37. Newton, N.J. Interactive statistical mechanics and nonlinear filtering. J. Stat. Phys. 2008, 133, 711–737. [Google Scholar]
  38. Duncan, T.E. On the calculation of mutual information. SIAM J. Appl. Math. 1970, 19, 215–220. [Google Scholar]
  39. Duncan, T.E. Mutual information for stochastic differential equations. Inf. Control 1971, 19, 265–271. [Google Scholar] [CrossRef]
  40. Mitter, S.K.; Newton, N.J. A variational approach to nonlinear estimation. SIAM J. Control Optim. 2003, 42, 1813–1833. [Google Scholar]
  41. Øksendal, B. Stochastic Differential Equations; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  42. Fotsa-Mbogne, D.J.; Pardoux, E. Nonlinear filtering with degenerate noise. Electron. Commun. Probab. 2017, 22, 1–14. [Google Scholar] [CrossRef]
  43. Anderson, B.D.O. Reverse-time diffusion equation models. Stoch. Process. Their Appl. 1982, 12, 313–326. [Google Scholar]
  44. Pardoux, E. Grossissement d’une filtration et retournement du temps d’une diffusion. In Séminaire de Probabilités XX 1984/85: Proceedings; Springer: Berlin/Heidelberg, Germany, 2006; pp. 48–55. [Google Scholar]
  45. Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes, and Martingales: Itô Calculus; Cambridge University Press: Cambridge, UK, 2000; Volume 2. [Google Scholar]
  46. Aksamit, A.; Jeanblanc, M. Enlargement of Filtration with Finance in View; Springer: Cham, Switzerland, 2017. [Google Scholar]
  47. Ouwehand, P. Enlargement of Filtrations—A Primer. arXiv 2022, arXiv:2210.07045. [Google Scholar]
  48. Grigorian, K.; Jarrow, R.A. Enlargement of Filtrations: An Exposition of Core Ideas with Financial Examples. arXiv 2023, arXiv:2303.03573. [Google Scholar]
  49. Çetin, U.; Danilova, A. Markov bridges: SDE representation. Stoch. Process. Their Appl. 2016, 126, 651–679. [Google Scholar] [CrossRef]
  50. Haussmann, U.G.; Pardoux, E. Time Reversal of Diffusions. Ann. Probab. 1986, 14, 1188–1205. [Google Scholar] [CrossRef]
  51. Shreve, S.E. Stochastic Calculus for Finance II: Continuous-Time Models; Springer: Berlin/Heidelberg, Germany, 2004; Volume 11. [Google Scholar]
  52. Al-Hussaini, A.N.; Elliott, R.J. Enlarged filtrations for diffusions. Stoch. Process. Their Appl. 1987, 24, 99–107. [Google Scholar] [CrossRef]
  53. Franzese, G.; Bounoua, M.; Michiardi, P. MINDE: Mutual Information Neural Diffusion Estimation. In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna Austria, 7–11 May 2024. [Google Scholar]
  54. Kidger, P.; Foster, J.; Li, X.; Lyons, T.J. Neural SDEs as infinite-dimensional GANs. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 5453–5463. [Google Scholar]
  55. De Bortoli, V.; Thornton, J.; Heng, J.; Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. Adv. Neural Inf. Process. Syst. 2021, 34, 17695–17709. [Google Scholar]
  56. Hodgkinson, L.; van der Heide, C.; Roosta, F.; Mahoney, M.W. Stochastic continuous normalizing flows: Training SDEs as ODEs. In Proceedings of the Uncertainty in Artificial Intelligence, Online, 27–30 July 2021; pp. 1130–1140. [Google Scholar]
  57. Kim, H.; Mnih, A. Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2649–2658. [Google Scholar]
  58. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  59. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2011; pp. 145–158. [Google Scholar]
  60. Szymański, P.; Kajdanowicz, T. A Network Perspective on Stratification of Multi-Label Data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, Skopje, Macedonia, 22 September 2017; Torgo, L., Krawczyk, B., Branco, P., Moniz, N., Eds.; Proceedings of Machine Learning Research; Volume 74, pp. 22–35.
  61. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  62. Xiong, J. An Introduction to Stochastic Filtering Theory; Oxford University Press: Oxford, UK, 2008. [Google Scholar]
  63. Kallenberg, O. Foundations of Modern Probability, 2nd ed.; Probability and its Applications (New York); Springer: New York, NY, USA, 2002; p. 18. [Google Scholar] [CrossRef]
  64. Van Putten, C.; van Schuppen, J.H. Invariance properties of the conditional independence relation. Ann. Probab. 1985, 13, 934–945. [Google Scholar] [CrossRef]
  65. Mazzolo, A. Constraint Ornstein–Uhlenbeck bridges. J. Math. Phys. 2017, 58, 093302. [Google Scholar] [CrossRef]
  66. Corlay, S. Properties of the Ornstein-Uhlenbeck bridge. arXiv 2013, arXiv:1310.5617. [Google Scholar]
  67. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  68. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  69. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  70. Lingenfelter, B.; Davis, S.R.; Hand, E.M. A quantitative analysis of labeling issues in the CelebA dataset. In Proceedings of the Advances in Visual Computing: 17th International Symposium, ISVC 2022, San Diego, CA, USA, 3–5 October 2022; Proceedings, Part I. pp. 129–141. [Google Scholar] [CrossRef]
Figure 1. Graphical intuition for our results: nonlinear filtering (left) and generative modeling (right).
Figure 2. Versions of an image corrupted by different values of noise for different times  τ .
Figure 3. Mutual information, entropy across forked generative pathways, and probing results as functions of  τ .
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
