Next Article in Journal
Intrinsic Motivation as Constrained Entropy Maximization
Previous Article in Journal
Successive Refinement for Lossy Compression of Individual Sequences
Previous Article in Special Issue
U-Turn Diffusion
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Latent Abstractions in Generative Diffusion Models

by
Giulio Franzese
1,*,
Mattia Martini
2,
Giulio Corallo
1,3,
Paolo Papotti
1 and
Pietro Michiardi
1
1
Department of Data Science, EURECOM, 06410 Biot, France
2
Laboratoire J. A. Dieudonné, CNRS, Université Côte d’Azur, 06108 Nice, France
3
SAP Labs, 805 Avenue du Dr Donat-Font de l’Orme, 06259 Mougins, France
*
Author to whom correspondence should be addressed.
Entropy 2025, 27(4), 371; https://doi.org/10.3390/e27040371
Submission received: 21 February 2025 / Revised: 14 March 2025 / Accepted: 24 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue The Statistical Physics of Generative Diffusion Models)

Abstract

:
In this work, we study how diffusion-based generative models produce high-dimensional data, such as images, by relying on latent abstractions that guide the generative process. We introduce a novel theoretical framework extending Nonlinear Filtering (NLF), offering a new perspective on SDE-based generative models. Our theory is based on a new formulation of joint (state and measurement) dynamics and an information-theoretic measure of state influence on the measurement process. We show that diffusion models can be interpreted as a system of SDE, describing a non-linear filter where unobservable latent abstractions steer the dynamics of an observable measurement process. Additionally, we present an empirical study validating our theory and supporting previous findings on the emergence of latent abstractions at different generative stages.

1. Introduction

Generative models have become a cornerstone of modern machine learning, offering powerful methods for synthesizing high-quality data across various domains such as image and video synthesis [1,2,3], natural language processing [4,5,6,7], audio generation [8,9], and molecular structures and general 3D shapes [10,11,12,13], to name a few. These models transform an initial distribution, which is simple to sample from, into one that approximates the data distribution. Among these, diffusion-based models designed through the lenses of Stochastic Differential Equations (SDEs) [14,15,16] have gained popularity due to their ability to generate realistic and diverse data samples through a series of stochastic transformations.
In such models, the data generation process, as described by a substantial body of empirical research [17,18,19], appears to develop according to distinct stages: high-level semantics emerge first, followed by the incorporation of low-level details, culminating in a refinement (denoising) phase. Despite ample evidence, a comprehensive theoretical framework for modeling these dynamics remains underexplored.
Indeed, despite recent work on SDE-based generative models, refs. [20,21,22,23] shedding new light on such models, they fall short of explicitly investigating the emergence of abstract representations in the generative process. We address this gap by establishing a new framework for elucidating how generative models construct and leverage latent abstractions, approached through the paradigm of NLF [24,25,26].
NLF is used across diverse engineering domains [24], as it provides robust methodologies for the estimation and prediction of a system’s state amidst uncertainty and noise. NLF enables the inference of dynamic latent variables that define the system state based on observed data, offering a Bayesian interpretation of state evolution and the ability to incorporate stochastic system dynamics. The problem we consider is the following: an unobservable random variable X is measured through a noisy continuous-time process  Y t , wherein the influence of X on the noisy process is described by an observation function H, with the noise component modeled as a Brownian motion term. The goal is to estimate the a posteriori measure  π t of the variable X given the entire historical trajectory of the measurement process  Y t .
In this work, we establish a connection between SDE-based generative models and NLF by observing that they can be interpreted as simulations of NLF dynamics. In our framework, the latent abstraction, which corresponds to certain real-world properties within the scope of classical nonlinear filtering and remains unaffected in a causal manner by the posterior process  π t , is implicitly simulated and iteratively refined. We explore the connection between latent abstractions and the a posteriori process, through the concept of filtrations—broadly defined as collections of progressively increasing information sets—and offer a rigorous theory to study the emergence and influence of latent abstractions throughout the data generation process. To ground the reader’s intuition in a concrete example, our experimental validation considers a scenario where latent abstractions correspond to scene descriptions—such as color, shape, and object size—that are subsequently rendered using a computer program.
Our theoretical contributions unfold as follows. In Section 2 we show how to reformulate classical NLF results such that the measurement process is the only available information, and derive the corresponding dynamics of both the latent abstraction and the measurement process. These results are summarized in Theorems 2 and 3.
Given the new dynamics, in Theorem 4, we show how to estimate the a posteriori measure of the NLF model and present a novel derivation to compute the mutual information between the measurement process and random variables derived from a transformation of the latent abstractions in Theorem 5. Finally, we show in Theorem 6 that the a posteriori measure is a sufficient statistic for any random variable derived from the latent abstractions when only having access to the measurement process.
Building on these general results, in Section 3 we present a novel perspective on continuous-time score-based diffusion models, which is summarised in Equation (10). We propose to view such generative models as NLF simulators that progress in two stages: first, our model updates the a posteriori measure representing sufficient statistics of the latent abstractions; second, it uses a projection of the a posteriori measure to update the measurement process. Such intuitive understanding is the result of several fundamental steps. In Theorems 7 and 8, we show that the common view of score-based diffusion models by which they evolve according to forward (noising) and backward (generative) dynamics is compatible with the NLF formulation, in which there is no need to distinguish between such phases. In other words, the NLF perspective of Equation (10) is a valid generative model. In Appendix H, we provide additional results (see Lemma A1), focusing on the specific case of linear diffusion models, which are the most popular instance of score-based generative models in use today. In Section 4, we summarize the main intuitions behind our NLF framework.
Our results explain, by means of a theoretically sound framework, the emergence of latent abstractions that has been observed by a large body of empirical work [17,18,19,27,28,29,30,31,32,33]. The closest research to our findings are discussed in [34,35,36], albeit from a different mathematical perspective. To root our theoretical results in additional empirical evidence, we conclude our work in Section 5 with a series of experiments on score-based generative models [14], where we (1) validate existing probing techniques to measure the emergence of latent abstractions, (2) compute the mutual information as derived in our framework and show that it is a suitable approach to measure the relation between the generative process and latent abstractions, and (3) introduce a new measurement protocol to further confirm the connections between our theory and how practical diffusion-based generative models operate.

2. Nonlinear Filtering

Consider two random variables  Y t  and X, corresponding to a stochastic measurement process ( Y t ) of some underlying latent abstraction (X). We construct our universe sample space  Ω  as the combination of the space of continuous functions in the interval  [ 0 , T ]  (i.e.,  C ( [ 0 , T ] , R N )  with  T R + ), and of a complete separable metric space  S , i.e.,  Ω = C ( [ 0 , T ] , R N ) × S . On this space, we consider the joint canonical process  Z t ( ω ) = [ Y t , X ] = [ ω t y , ω x ]  for all  ω Ω , with  ω = [ ω y , ω x ] . In this work, we indicate with  σ ( · )  sigma-algebras. Consider the growing filtration naturally induced by the canonical process  F t Y , X = σ ( Y 0 s t , X )  (a short-hand for  σ ( σ ( Y 0 s t ) σ ( X ) ) ), and define  F = F T Y , X . We build the probability triplet  ( Ω , F , P ) , where the probability measure  P  is selected such that the process  { Z 0 t T , F 0 t T Y , X }  has the following SDE representation
Y t = Y 0 + 0 t H ( Y s , X , s ) d s + W t ,
where  { W 0 t T , F 0 t T Y , X }  is a Brownian motion with initial value 0 and  H : Ω × [ 0 , T ] R N  is an observation process. All standard technical assumptions are available in Appendix A.
Next, we provide the necessary background on NLF, to pave the way for understanding its connection with the generative models of interest. The most important building block of the NLF literature is represented by the conditional probability measure  P [ X A | F t Y ]  (notice the reduced filtration  F t Y F t Y , X ), which summarizes, a posteriori, the distribution of X given observations of the measurement process until time t, that is,  Y 0 s t .
Theorem 1 
(Thm 2.1 [24]). Consider the probability triplet  ( Ω , F , P ) , the metric space  S  and its Borel sigma-algebra  B ( S ) . There exists a (probability measure valued  P ( S ) ) process  { π 0 t T , F 0 t T Y } , with a progressively measurable modification, such that for all  A B ( S ) , the conditional probability measure  P [ X A | F t Y ]  is well defined and is equal to  π t ( A ) .
The conditional probability measure is extremely important, as the fundamental goal of nonlinear filtering is the solution to the following problem. Here, we introduce the quantity  ϕ , which is a random variable derived from the latent abstractions X.
Problem 1. 
For any fixed  ϕ : S R  bounded and measurable, given knowledge of the measurement process  Y 0 s t , compute  E P [ ϕ ( X ) | F t Y ] . This amounts to computing
π t , ϕ = S ϕ ( x ) d π t ( x ) .
In simple terms, Problem 1 involves studying the existence of the a posteriori measure and the implementation of efficient algorithms for its update, using the flowing stream of incoming information  Y t . We first focus our attention on the existence of an analytic expression for the value of the a posteriori expected measure  π t . Then, we quantify the interaction dynamics between observable measurements and  ϕ , through the lenses of mutual information  I ( Y 0 s t ; ϕ ) , which is an extension of the problems considered in [37,38,39,40].

2.1. Technical Preliminaries

We set the stage of our work by revisiting the measurement process  Y t , and express it in a way that does not require access to unobservable information. Indeed, while  Y t  is naturally adapted with reference to its own filtration  F t Y , and consequently to any other growing filtration  R t  such  F t Y , X R t F t Y , the representation in Equation (1) is in general not adapted, letting aside degenerate cases.
Let us consider the family of growing filtrations  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) ) , where  σ ( Y 0 ) R 0 σ ( X , Y 0 ) . Intuitively,  R 0  allows to modulate between the two extreme cases of knowing only the initial conditions of the SDE, that is  Y 0 , to the case of complete knowledge of the whole latent abstraction X, and anything in between. As shown hereafter, the original process  Y t  associated with the space  ( Ω , F , P )  which solves Equation (1), also solves Equation (4), which is adapted on the reduced filtration  R t . This allows us to reason about the partial observation of the latent abstraction ( R 0  vs.  σ ( X , Y 0 ) ), without incurring in the problem of the measurement process  Y t  being statistically dependent of the whole latent abstraction X.
Armed with such representation, we study under which change of measure the process  Y t Y 0  behaves as a Brownian motion (Theorem 3). This serves the purpose of simplifying the calculation of the expected value of  ϕ  given  Y t , as described in Problem 1. Indeed, if  Y t Y 0  is a Brownian motion independent of  ϕ , its knowledge does not influence our best guess for  ϕ , i.e., the conditional expected value. Moreover, our alternative representation is instrumental for the efficient and simple computation of the mutual information  I ( Y 0 s t ; ϕ ) , where the different measures involved in the Radon–Nikodym derivatives will be compared against the same reference Brownian measures.
The first step to define our representation is provided by the following
Theorem 2. 
[Appendix B] Consider the the probability triplet  ( Ω , F , P ) , the process in Equation (1) defined on it, and the growing filtration  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) ) . Define a new stochastic process
W t R = def Y t Y 0 0 t E P ( H ( Y s , X , s ) | R s ) d s .
Then,  { W 0 t T R , R 0 t T }  is a Brownian motion. Notice that if  R t = F t Y , X , then  W t R = W t .
Following Theorem 2, the process  { Y 0 t T , R 0 t T }  has SDE representation
Y t = Y 0 + 0 t E P ( H ( Y s , X , s ) | R s ) d s + W t R .
Next, we derive the change of measure necessary for the process  W ˜ t = def Y t Y 0  to be a Brownian motion with reference to to the filtration  R t . To carry this out, we apply the Girsanov theorem [41] to  W ˜ t , which, in general, admits a  R -adapted representation  0 t E P ( H ( Y s , X , s ) | R s ) d s + W t R .
Theorem 3. 
[Appendix C] Define the new probability space  ( Ω , R T , Q R )  via the measure  Q R ( A ) = E P 1 ( A ) ( ψ T R ) 1 , for  A R T , where
ψ t R = def exp ( 0 t E P [ H ( Y s , X , s ) | R s ] d Y s 1 2 0 t | | E P [ H ( Y s , X , s ) | R s ] | | 2 d s ) ,
and
Q R | R t = E P 1 ( A ) E P [ ( ψ T R ) 1 | R t ] = E P 1 ( A ) ( ψ t R ) 1 .
Then, the stochastic process  { W ˜ 0 t T , R 0 t T }  is a Brownian motion on the space  ( Ω , R T , Q R ) .
A direct consequence of Theorem 3 is that the process  W ˜ t  is independent of any  R 0  measurable random variable under the measure  Q R . Moreover, it holds that for all  R t R t Q R | R t = Q R | R t .

2.2. A Posteriori Measure and Mutual Information

As in Section 2 for the process  π t , here, we introduce a new process  π t R , which represents the conditional law of X given the filtration  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) ) . More precisely, for all  A B ( S ) , the conditional probability measure  P [ X A | R t ]  is well defined and is equal to  π t R ( A ) . Moreover, for any  ϕ : S R  bounded and measurable,  E P [ ϕ ( X ) | R t ] = π t R , ϕ . Notice that if  R = F Y  then  π R  reduces to  π .
Armed with Theorem 3, we are ready to derive the expression for the a posteriori measure  π t R  and the mutual information between observable measurements and the unavailable information about the latent abstractions, that materialize in the random variable  ϕ .
Theorem 4. 
[Appendix D] The measure-valued process  π t R  solves in weak sense (see Appendix D for a precise definition) the following SDE
π t R = π 0 R + 0 t π s R H ( Y s , · , s ) π s R , H ( Y s , · , s ) d Y s π s R , H ( Y s , · , s ) d s ,
where the initial condition  π 0  satisfies  π 0 R ( A ) = P [ X A | R 0 ]  for all  A B ( S ) .
When  R = F Y , Equation (6) is the well-known Kushner–Stratonovitch (or Fujisaki—Kallianpur–Kunita) equation (see, e.g., [24]). A proof for uniqueness of the solution of Equation (6) can be approached by considering the strategies in [42], but is outside the scope of this work. The (recursive) expression in Equation (6) is particularly useful for engineering purposes since, in general, it is usually not known in which variables  ϕ ( X ) , representing latent abstractions, we could be interested in. Keeping track of the whole distribution  π t R  at time t is the most cost-effective solution, as we will show later.
Our next goal is to quantify the interaction dynamics between observable measurements and latent abstractions that materialize through the variable  ϕ ( X )  (from now on, we write only  ϕ  for the sake of brevity); in Theorem 5, we derive the mutual information  I ( Y 0 s t ; ϕ ) .
Theorem 5. 
[Appendix E] The mutual information between observable measurements  Y 0 s t  and ϕ is defined as:
I ( Y 0 s t ; ϕ ) = def log d P # Y 0 s t , ϕ d P # Y 0 s t d P # ϕ d P # Y 0 s t , ϕ .
It holds that such quantity is equal to  E P log d P | R t d P | F t Y d P | σ ( ϕ ) , with  R t = σ ( Y 0 s t , ϕ ) , which can be simplified as follows:
I ( Y 0 ; ϕ ) + 1 2 E P 0 t | | E P [ H ( X , Y s , s ) | F s Y ] E P [ H ( X , Y s , s ) | R s ] | | 2 d s .
The mutual information computed by Equation (8) is composed of two elements: first, the mutual information between the initial measurements  Y 0  and  ϕ , which is typically zero by construction. The second term quantifies how much the best prediction of the observation function H is influenced by the extra knowledge of  ϕ , in addition to the measurement history  Y 0 s t . By adhering to the premise that the conditional expectation of a stochastic variable constitutes the optimal estimator given the conditioning information, the integral on the right-hand side quantifies the expected squared difference between predictions, having access to measurements only ( E P [ · | F t Y ] ) and those incorporating additional information ( E P [ · | R t ] ).
Even though a precise characterization for general observation functions and and variables  ϕ  is typically out of reach, a qualitative analysis is possible. First, the mutual information between  ϕ  and the measurements depends on (i) how much the amplitude of H is impacted by knowledge of  ϕ  and (ii) the number of elements of H that are impacted (informally, how much localized vs. global is the impact of  ϕ ). Second, it is possible to define a hierarchical interpretation about the emergence of the various latent factors: a variable with a local impact can “appear”, in an information theoretic sense, only if the impact of other global variables is resolved; otherwise, the remaining uncertainty of the global variables makes knowledge of the local variable irrelevant. In classical diffusion models, this is empirically known [17,18,19], and corresponds to the phenomenon where semantics emerges before details (global vs. local details in our language). For instance, as shown in Section 5, during the generative dynamics, latent abstractions which correspond to high level properties such as color and geometric aspect ratio emerge in very early stages of the process.
Now, consider any  F t Y  measurable random variable  Y ˜ t , defined as a mapping to a generic measurable space  ( Ψ , B ( Ψ ) ) , which means it can also be seen as a process. The data processing inequality states that the mutual information between such  Y ˜  and  ϕ  will be smaller than the mutual information between the original measurement process and  ϕ . However, it can be shown that all the relevant information about the random variable  ϕ  contained in  F t Y  is equivalently contained in the filtering process at time instant t, that is  π t . This is not trivial, since  π t  is a  F t Y -measurable quantity, i.e.,  σ ( π t ) F t Y . In other words, we show that  π t  is a sufficient statistic for any  σ ( X )  measurable random variable when starting from the measurement process.
Theorem 6. 
[Appendix F] For any  F t Y  measurable random variable  Y ˜ t : Ω Ψ , the following inequality holds:
I ( Y ˜ ; ϕ ) I ( Y 0 s t ; ϕ ) .
For a given  t 0 , the measurement process  Y 0 s t  and X are conditionally-independent given  π t . This implies that  P ( A | σ ( π t ) ) = P ( A | F t Y ) , A σ ( X ) . Then,  I ( Y 0 s t ; ϕ ) = I ( π t ; ϕ )  (i.e., Equation (9) is attained with equality).
While  π t  contains all the relevant information about  ϕ , the same cannot be said about the conditional expectation, i.e., the particular case  Y ˜ = π t , ϕ . Indeed, from Equation (2),  π t , ϕ  is obtained as a transformation of  π t  and thus can be interpreted as a  F t Y  measurable quantity subject to the constraint of Equation (9). As a particular case, the quantity  π t , H , of central importance in the construction of generative models Section 3, carries, in general, less information about  ϕ  than the un-projected  π t .

3. Generative Modeling

We are interested in generative models for a given  σ ( X ) -measurable random variable V. An intuitive illustration of how data generation works according to our framework is as follows. Consider, for example, the image domain, and the availability of a rendering engine that takes as an input a computer program describing a scene (coordinates of objects, textures, light sources, auxiliary labels, etc...) and that produces an output image of the scene. In a similar vein, a generative model learns how to use latent variables (which are not explicitly provided in input, but rather implicitly learned through training) to generate an image. For such a model to work, one valid strategy is to consider an SDE in the form of Equation (1) where the following holds (from a strictly technical point of view, Assumption 1 might be incompatible with other assumptions in Appendix A, or proving compatibility could require particular effort. Such details are discussed in Appendix G).
Assumption 1. 
The stochastic process  Y t  satisfies  Y T = V , P a . s .
Then, we could numerically simulate the dynamics of Equation (1) until time T. Indeed, starting from initial conditions  Y 0 , we could obtain  Y T  that, under Assumption 1, is precisely V. Unfortunately, such a simple idea requires explicit access to X, as it is evident from Equation (1). In mathematical terms, Equation (1) is adapted to the filtration  F t Y , X . However, we have shown how to reduce the available information to account only for historical values of  Y t . Then, we can combine the result in Theorem 4 with Theorem 2 and re-interpret Equation (4), which is a valid generative model, as
π t = π 0 + 0 t π s H π s , H d Y s π s , H d s , Y t = Y 0 + 0 t π s , H d s + W t F Y ,
where H denotes  H ( Y s , · , s ) . Explicit simulation of Equation (10) only requires knowledge of the whole history of the measurement process: provided Assumption 1 holds, it allows generation of a sample of the random variable V.
Although the discussion in this work includes a large class of observation functions, we focus on the particular case of generative diffusion models [14]. Typically, such models are presented through the lenses of a forward noising process and backward (in time) SDEs, following the intuition of Anderson [43]. Next, according to the framework we introduce in this work, we reinterpret such models from the perspective of enlargement of filtrations.
Consider the reversed process  Y ^ t = def Y T t  defined on  ( Ω , F , P )  and the corresponding filtration  F t Y ^ = def σ ( Y ^ 0 s t ) . The measure  P  is selected such that the process  Y ^ t  has a  F t Y ^ -adapted expression
Y ^ t = V + 0 t F ( Y ^ s , s ) d s + W ^ t ,
where  { W ^ t , F t Y ^ }  is a Brownian motion. Then, Assumption 1 is valid since  Y T = Y ^ 0 = V . Note that Equation (11), albeit with a different notation, is reminiscent of the forward SDE that is typically used as the starting point to illustrate score-based generative models [14]. In particular,  F ( · )  corresponds to the drift term of such a diffusion SDE.
Equation (11) is equivalent to  Y t = V + t T F ( Y s , T s ) d s + W ^ T t , which is an expression for the process  Y t , which is adapted to  F Y ^ . This constitutes the first step to derive an equivalent backward (generative) process according to the traditional framework of score-based diffusion models. Note that such an equivalent representation is not useful for simulation purposes: the goal of the next step is to transform it such that it is adapted to  F Y . Indeed, using simple algebra, it holds that
Y t = Y 0 0 t F ( Y s , T s ) d s + Y 0 + V + 0 T F ( Y s , T s ) d s + W ^ T t ,
where the last term in the parentheses is equal to  W ^ T + W ^ T t .
Note that  F t Y = σ ( Y ^ T t s T ) . Since  σ ( Y ^ T t s T ) = σ ( W ^ T t s T ) σ ( Y ^ T t ) , we can apply the result in [44] (Thm 2.2) to claim the following:  W ^ T + W ^ T t 0 t log p ^ ( Y s , T s ) d s  is a Brownian motion adapted to  F t Y , where this time  P ( Y ^ t d y ) = p ^ ( y , t ) d y . Then, [44].
Theorem 7. 
Consider the stochastic process  Y t  which solves Equation (11). The same stochastic process also admits a  F t Y -adapted representation
Y t = Y 0 + 0 t F ( Y s , T s ) + log p ^ ( Y s , T s ) In Theorem 8 , we call this F ( Y s , s ) d s + W t .
Equation (12) corresponds to the backward diffusion process from [14] and, because it is adapted to the filtration  F Y , it represents a valid, and easy-to-simulate, measurement process.
By now, it is clear how to go from an  F Y , X -adapted filtration to a  F Y -adapted one. We also showed that a  F Y -adapted filtration can be linked to the reverse,  F Y ^ -adapted process induced by a forward diffusion SDE. What remains to be discussed is the connection that exists between the  F Y -adapted filtration, and its enlarged version  F Y , X . In other words, we have shown that a forward, diffusion SDE admits a backward process which is compatible with our generative model that simulates a NLF process having access only to measurements, but we need to make sure that such process admits a formulation that is compatible with the standard NLF framework in which latent abstractions are available.
To carry this out, we can leverage existing results about Markovian bridges [22,45] (and further work [46,47,48,49] on filtration enlargement). This requires assumptions about the existence and well-behaved nature of densities  p ( y , t )  of the SDE process, defined by the logarithm of the Radon–Nikodym derivative of the instantaneous measure  P ( Y t d y )  with reference to the Lebesgue measure in  R N P ( Y t d y ) = p ( y , t ) d y  (the analysis of the existence of the process adapted to  F t Y  is considered in the time interval  [ 0 , T )  [50]; see also Appendix G).
Theorem 8. 
Suppose that on  ( Ω , F , P ) , the Markov stochastic process  Y t  satisfies
Y t = Y 0 + 0 t F ( Y s , s ) d s + W t ,
where  { W 0 t T , F 0 t T Y }  is a Brownian motion and F satisfies the requirements for existence and well definition of the stochastic integral [51]. Moreover, let Assumption 1 hold. Then, the same process admits  R t = σ ( Y 0 s t , Y T ) -adapted representation
Y t = Y 0 + 0 t F ( Y s , s ) + Y s log p ( Y T | Y s ) d s + β t ,
where  p ( Y T | Y s )  is the density with reference to the Lebesgue measure of the probability  P ( Y T | σ ( Y s ) ) , and  { β 0 t T , R 0 t T }  is a Brownian motion.
The connection between time reversal of diffusion processes and enlarged filtrations is finalized with the result of Al-Hussaini and Elliott [52], Thm. 3.3, where it is proved how the  β t  term of Equation (13) is a Brownian motion, using the techniques of time reversals of SDEs.
Since  p ^ ( y , T t ) = p ( y , t ) , the enlarged filtration version of Equation (12) reads
Y t = Y 0 + 0 t F ( Y s , T s ) + Y s log p ( Y s | Y T ) d s Equivalent to H ( Y t , X , t ) = F ( Y s , T s ) + Y s log p ( Y s | V + W t .
Note that the dependence of  Y t  on the latent abstractions X is implicitly defined by conditioning the score term  Y s log p ( Y s | Y T )  by  Y T , which is the “rendering” of X into the observable data domain.
Clearly, Equation (14) can be reverted to the starting generative Equation (12) by mimicking the results which allowed us to go from Equation (1) to Equation (4), by noticing that  E P [ Y s log p ( Y T | Y s ) | F t Y ] = 0  (informally, this is obtained since  y s log p ( y t | y s ) p ( y t | y s ) d y t = y s p ( y t | y s ) d y t = 0 ).
It is also important to notice that we can derive the expression for the mutual information between the measurement process and a sample from the data distribution, as follows
I ( Y 0 s t ; V ) = I ( Y 0 ; V ) + 1 2 E P 0 t | | Y s log p ( Y s ) Y s log p ( Y s | Y T ) | | 2 d s .
Mutual information is tightly related to the classical loss function of generative diffusion models.
Furthermore, by casting the result of Equation (8) according to the forms of Equations (12) and (14), we obtain the simple and elegant expression
I ( Y 0 s t ; V ) = I ( Y 0 ; V ) + 1 2 E P 0 t | | Y s log p ( Y T | Y s ) | | 2 d s .
In Appendix H, we present a specialization of our framework for the particular case of linear diffusion models, recovering the expressions for the variance-preserving and variance-exploding SDEs that are the foundations of score-based generative models [14].

4. An Informal Summary of the Results

We shall now take a step back from the rigor of this work, and provide an intuitive summary of our results, using Figure 1 as a reference.
We begin with an illustration of NLF, shown on the left of the figure. We consider an observable latent abstraction X and the measurement process  Y t , which, for ease of illustration, we consider evolving in discrete time, i.e.,  Y 0 , Y 1 , , and whose joint evolution is described by Equation (1). Such an interaction is shown in blue:  Y 3  depends on its immediate past  Y 2  and the latent abstraction X.
The a posteriori measure process  π t  is updated in an iterative fashion by integrating the flux of information. We show this in green:  π 1  is obtained by updating  π 0  with  Y 1 Y 0  (the equivalent of  d Y t ). This evolution is described by Kushner’s equation, which has been derived informally from the result of Equation (6). The a posteriori process is a sufficient statistic for the latent abstraction X: for example,  π 3  contains the same information about  ϕ  as the whole  Y 0 , , Y 3  (red boxes). Instead, in general, a projected statistic  π t , ϕ  contains less information than the whole measurement process (this is shown in orange, for time instant 2). The mutual information between all these variables is proven in Theorem 6, whereas the actual value of  I ( Y 0 s t ; ϕ )  is shown in Theorem 5.
Next, we focus on generative modeling. As per our definition, any stochastic process satisfying Assumption 1 ( Y 3 = V , in the figure) can be used for generative purposes. Since the latent abstraction is by definition not available, it is not possible to simulate directly the dynamics using Equation (1) (dashed lines from X to  Y t ). Instead, we derive a version of the process adapted to the history of  Y t  alone, together with the update of the projection  π t , H , which amounts to simulating Equation (10). In [36], diffusion models are shown to solve a “self-consistency” equation akin to a mean-field fixed point. Our framework aligns with this view by revealing how SDE-based generative processes implicitly enforce self-consistency between latent abstractions and measurements.
The update of the upper part of Equation (10), which is a particular case of Equation (6), can be interpreted as the composition of two steps: (1) (green) the update of the a posteriori measure given new available measurements, and, (2) (orange) the projection of the whole  π t  into the statistic of interest. The update of the measurement process, i.e., the lower part of Equation (10), is color-coded in blue. This is in stark contrast to the NLF case, as the update of, e.g.,  Y 3 = V  does not depend directly on X. The system in Equation (10) and its simulation describes the emergence of latent world representations in SDE-based generative models:
Entropy 27 00371 i001
The theory developed in this work guarantees that the mutual information between measurements and any statistics  ϕ  grows as described by Theorem 5. Our framework offers a new perspective, according to which, the dynamics of SDE-based generative models [14] implicitly mimic the two steps procedure described in the box above. We claim that this is the reason why it is possible to dissect the parametric drift of such generative models and find a representation of the abstract state distribution  π t , encoded into their activations. Next, we set to root our theoretical findings in experimental evidence.

Generality of Our Framework

While previous works studied latent abstraction emergence specifically within diffusion-based generative models, our current theoretical framework deliberately transcends this scope. Indeed, the results presented in this paper can be interpreted as a generalization of the results contained in [53] (see also Equation (A10)) and apply broadly to any generative model satisfying Assumption 1. Such generality includes a wide variety of generative modeling techniques, such as Neural Stochastic Differential Equations (neural SDEs) [54], Schrödinger Bridges [55], and Stochastic Normalizing Flows [56]. In all these cases, our results on latent abstractions still hold; thus, the insights provided by our framework pave the way for deeper theoretical understanding and wider applicability across generative modeling paradigms.

5. Empirical Evidence

We complement existing empirical studies [17,18,19,30,31,32,33,34] that first measured the interactions between the generative process of diffusion models and latent abstractions, by focusing on a particular dataset that allows for a fine-grained assessment of the influence of latent factors.
Dataset. We use the Shapes3D [57] dataset, which is a collection of  64 × 64  ray-tracing generated images, depicting simple 3D scenes, with an object (a sphere, cube,...) placed in a space, described by several attributes (color, size, orientation). Attributes have been derived from the computer program that the ray-tracing software executed to generate the scene: these are transformed into labels associated with each image. In our experiments, such labels are the materialization of the latent abstractions X we consider in this work (see Appendix J.1 for details).
Measurement Protocols. For our experiments, we use the base NCSPP model described by [14]: specifically, our denoising score network corresponds to a U-NET [58]. We train the unconditional version of this model from scratch using a score-matching objective. Detailed hyper-parameters and training settings are provided in Appendix J.2. Next, we summarize three techniques to measure the emergence of latent abstractions through the lenses of the labels associated with each image in our dataset. For all such techniques, we use a specific “measurement” subset of our dataset, which we partition in 246 training, 150 validation, and 371 test examples. We use a multi-label stratification algorithm [59,60] to guarantee a balanced distribution of labels across all dataset splits.
Linear probing. Each image in the measurement subset is perturbed with noise, using a variance-exploding schedule [14], with noise levels decreasing from  τ = 0  to  τ = 1.0  in steps of 0.1, as shown in Figure 2. Intuitively, each time value  τ  can be linked to a different signal-to-noise ratio ( S N R ), ranging from  S N R ( τ = 1 ) =  to  S N R ( τ = 0 ) 0 . We extract several feature maps from all the linear and convolutional layers of the denoising score network, for each perturbed image, resulting in a total of 162 feature map sets for each noise level. This process yields 11 different datasets per layer, which we use to train a linear classifier (our probe) for each of these datasets, using the training subset. In these experiments, we use a batch size of 64 and adjust the learning rate based on the noise level (see Appendix J.3). Classifier performance is optimized by selecting models based on their log-probability accuracy observed on the validation subset. The final evaluation of each classifier is conducted on the test subset. Classification accuracy, measured by the model log likelihood, is a proxy of latent abstraction emergence [17].
Mutual information estimation. We estimate mutual information between the labels and the outputs of the diffusion model across varying diffusion times, using Equation (A10) (which is a specialized version of our theory for linear diffusion models; see Appendix H) and adopt the same methodology discussed by Franzese et al. [53] to learn conditional and unconditional score functions and to approximate the mutual information. The training process uses a randomized conditioning scheme: 33% of training instances are conditioned on all labels, 33% on a single label, and the remaining 33% are trained unconditionally. See Appendix J.4 for additional details.
Forking. We propose a new technique to measure at which stage of the generative process, image features described by our labels emerge. Given an initial noise sample, we proceed with numerical integration of the backward SDE [14] up to time  τ . At this point, we fork k replicas of the backward process and continue the k generative pathways independently until numerical integration concludes. We use a simple classifier (a pre-trained ResNet50 [61] with an additional linear layer trained from scratch) to verify that labels are coherent across the k forks. Coherency is measured using the entropy of the label distribution output by our simple classifier on each latent factor for all the k branches of the fork. Intuitively, if we fork the process at time  τ = 0.6 , and the k forks all end up displaying a cube in the image (entropy equals 0), this implies that the object shape is a latent abstraction that has already emerged by time  τ . Conversely, lack of coherence implies that such a latent factor has not yet influenced the generative process. Details of the classifier training and sampling procedure are provided in Appendix J.5.
Results. We present our results in Figure 3. We note that some attributes like floor hue, wall hue and shape emerge earlier than others, which corroborates the hierarchical nature of latent abstractions, a phenomenon that is related to the spatial extent of each attribute in pixel space. This is evident from the results of linear probing, where we evaluate the performance of linear probes trained on features maps extracted from the denoiser network, and from the mutual information measurement strategy and the measured entropy of the predicted labels across forked generative pathways. Entropy decreases with  τ , which marks the moment in which the generative process proceeds along k forks. When generative pathways converge to a unique scene with identical predicted labels (entropy reaches zero), this means that the model has committed to a specific set of latent factors (breaking some of the symmetries in the language of [36]). This coincides with the same noise level corresponding to high accuracy for the linear probe, and high-values of mutual information. Further ablation experiments are presented in Appendix J.6.

6. Potential Applications and Practical Implications

Our theoretical analysis and experimental results provide a novel information-theoretic perspective on diffusion-based generative models. Besides the primary theoretical contribution, our framework naturally suggests several promising practical applications and implications. First, the explicit characterization of latent abstractions through mutual information measures may facilitate novel strategies for conditional generation, where generative processes could be guided or steered in a principled manner by explicitly controlling the latent abstraction process. Second, the insights derived from our nonlinear filtering viewpoint offer opportunities for enhanced interpretability and robustness in downstream applications that utilize learned latent representations, potentially leading to more reliable model behaviors. These practical implications represent compelling directions for future research, extending the reach and usability of diffusion-based generative modeling frameworks.

7. Conclusions

Despite their tremendous success in many practical applications, a deep understanding of how SDE-based generative models operate remained elusive. A particularly intriguing aspect of several empirical investigations was to uncover the capacity of generative models to create entirely new data by combining latent factors learned from examples. To the best of our knowledge, there exists no theoretical framework that attempts to describe such a phenomenon.
In this work, we closed this gap, and presented a novel theory—which builds on the framework of NLF—to describe the implicit dynamics allowing SDE-based generative models to tap into latent abstractions and guide the generative process. Our theory, which required advancing the standard NLF formulation, culminates in a new system of joint SDEs that fully describes the iterative process of data generation. Furthermore, we derived an information-theoretic measure to study the influence of latent abstractions, which provides a concrete understanding of the joint dynamics.
To root our theory into concrete examples, we collected experimental evidence by means of novel (and established) measurement strategies that corroborate our understanding of diffusion models. Latent abstractions emerge according to an implicitly learned hierarchy and can appear early on in the data generation process, much earlier than what is visible in the data domain. Our theory is especially useful as it allows analyses and measurements of generative pathways, opening up opportunities for a variety of applications, including image editing, and improved conditional generation.

Author Contributions

Conceptualization, G.F., M.M. and P.M.; Methodology, G.F., M.M. and P.M.; Software, G.F., G.C., P.P. and P.M.; Validation, G.F., G.C., P.P. and P.M.; Formal analysis, G.F., M.M. and P.M.; Investigation, G.F., G.C., P.P. and P.M.; Resources, G.F.; Data curation, G.F., G.C., P.P. and P.M.; Writing—original draft, G.F., M.M. and P.M.; Writing—review & editing, G.F., M.M., G.C., P.P. and P.M.; Supervision, P.P.; Funding acquisition, G.F. and P.M. All authors have read and agreed to the published version of the manuscript.

Funding

G.F. and P.M were partially funded by project MUSECOM2—AI-enabled MUltimodal SEmantic COMmunications and COMputing, in the Machine Learning-based Communication Systems, towards Wireless AI (WAI), Call 2022, ChistERA. M.M. acknowledges the financial support of the European Research Council (ERC) under the European Union’s Horizon Europe research and innovation programme (AdG ELISA project, Grant Agreement No. 101054746). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. All considered datasets are publicly available.

Conflicts of Interest

G.C. was employed by the company SAP Labs Mougins. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Assumptions

Assumption A1. 
Whenever we mention a filtration, we assume as usual that it is augmented with the  P  null sets, i.e., if the set N is such that  P ( N ) = 0 , then all  A N  should be in the filtration.
Assumption A1 is standard construction in measure theoretic formulations and ensures that “impossible events” are measurable.
We now list Assumptions A2 and A3 which correspond to Equations (2.3) and (2.4) of [24]. These assumptions are necessary to ensure that (i) the stochastic integral in Equation (3) is well defined and (ii) that Theorem 1 (which corresponds to Thm 2.1 in [24]) holds
Assumption A2. 
E P [ 0 t | | H ( Y s , X , s ) | | d s ] < .
Assumption A3. 
P ( 0 t | | E P [ H ( Y s , X , s ) | F s Y ] | | 2 d s < ) = 1 .
Assumptions A2 and A3, while sufficient, are often difficult to check. In practice it is often easier to check Assumption A4, which implies the two.
Assumption A4. 
E P [ 0 t | | H ( Y s , X , s ) | | 2 d s ] < .
Finally, it is necessary to ensure that the stochastic integrals used in the Girsanov transformations of Theorem 3 are martingales. For this reason, we consider the adaptation of Equation (3.19) in [24] and consider:
Assumption A5. 
E P [ exp { 1 2 0 t | | H ( Y s , X , s ) | | 2 d s } ] < ,
and
E P [ exp { 1 2 0 t | | E P [ H ( Y s , X , s ) | R s ] | | 2 d s } ] < ,
Note: Assumption A5 and Assumption A4 are trivially verified when H is bounded, which is consequently a condition which allow to claim validity of all the results discussed in this work.

Appendix B. Proof of Theorem 2

We start by combining Equation (3) and Equation (1)
W t R = Y 0 + 0 t H ( Y s , X , s ) d s + W t Y 0 0 t E P ( H ( Y s , X , s ) | R s ) d s   = 0 t H ( Y s , X , s ) d s + W t 0 t E P ( H ( Y s , X , s ) | R s ) d s .
We begin by showing that it is a martingale. For any  0 τ t , it holds
E P [ W t R | R τ ] = E P [ 0 t H ( Y s , X , s ) d s | R τ ] + E P [ W t | R τ ]   E P [ 0 t E P ( H ( s , Y s , X ) | R s ) d s | R τ ]   = 0 t E P [ H ( Y s , X , s ) | R τ ] d s + E P [ E P [ W t | F τ Y , X ] | R τ ]   0 τ E P [ H ( Y s , X , s ) | R s ] d s τ t E P [ H ( Y s , X , s ) | R τ ] d s   = 0 τ E P [ H ( Y s , X , s ) | R τ ] d s + E P [ W τ | R τ ] + W τ R + Y 0 Y τ   = E P [ 0 τ H ( Y s , X , s ) d s + W τ + Y 0 Y τ | R τ ] + W τ R = W τ R .
Moreover, it is easy to check that the cross-variation of  W t R  is the same as the one of  W t . Then, we can conclude the proof by Levy’s characterization of Brownian motion ( W 0 R = 0 ). ☐

Appendix C. Proof of Theorem 3

First, by combining the definition of  ψ t R  and the fact that  d Y t = E P [ H ( Y t , X , t ) | R t ] + d W t R  we obtain
( ψ t R ) 1 = exp ( 0 t E P [ H ( Y s , X , s ) | R s ] d W s R 1 2 0 t | | E P [ H ( Y s , X , s ) | R s ] | | 2 d s ) .
Notice that by Assumption A5 (which is actually the usual Novikov’s condition), the local martingale  ( ψ t R ) 1  is a real-valued martingale starting from  ( ψ 0 R ) 1 = 1 . Then, we can apply Girsanov theorem and conclude that  d Q R = ψ T R d P  is a probability measure under which the process  { W ˜ 0 t T , R 0 t T } , with
W t ˜ = W t R + 0 t E P [ H ( Y t , X , s ) | R t ] d s ,
is a Brownian motion on the space  ( Ω , R T , Q R ) . ☐

Appendix D. Proof of Theorem 4

First, let us give a precise meaning to being a weak solution of Equation (6). We say that  π t R  solves (6) in a weak sense in, for any for any  ϕ : S R  bounded and measurable, it holds
π t R , ϕ = π 0 R , ϕ + 0 t π s R , H ( Y s , · , s ) ϕ π s R , ϕ π s R , H ( Y s , · , s ) d Y s π s R , H ( Y s , · , s ) d s .
Let us recall that, on  ( Ω , F , P ) , the process  Y t  has the SDE representation (1), where  { W 0 t T , F 0 t T Y , X }  is a Brownian motion. Moreover, by Theorem 3 with  R t = F t Y , X , it holds that  { ( Y Y 0 ) 0 t T , F 0 t T Y , X }  is a Brownian motion on the space  ( Ω , F , Q F Y , X ) , where  d Q F Y , X = ( ψ T F Y , X ) 1 d P  and
ψ t F Y , X = exp ( 0 t H ( Y s , X , s ) d Y s 1 2 0 t | | H ( Y s , X , s ) | | 2 d s ) .
For notation simplicity, in this subsection  ψ t F Y , X  and  Q F Y , X  are simply indicated as  π t ψ t  and  Q , respectively.
Since we aim at showing that (A1) holds, let us fix  ϕ  and let us start from  E P [ ϕ ( X ) | R t ] = π t R , ϕ . Bayes Theorem provides us with the following
π t R , ϕ = E P [ ϕ ( X ) | R t ] = E Q [ d P d Q ϕ ( X ) | R t ] E Q [ d P d Q | R t ] = E Q [ ψ T ϕ ( X ) | R t ] E Q [ ψ T | R t ] = def π ^ t R , ϕ π ^ t R , 1 .
Starting from the numerator  π ^ t R , ϕ , we involve the tower property of conditional expectation and the fact that  ψ t  is  F t Y , X  measurable to write
π ^ t R , ϕ = E Q [ ψ T ϕ ( X ) | R t ] = E Q E Q ψ T ϕ ( X ) | F t Y , X | R t   = E Q E Q ψ T | F t Y , X ϕ ( X ) | R t = E Q ψ t ϕ ( X ) | R t .
Recalling the definition of  ψ t  (see Equation (A2)), we have
d ψ t = ψ t H ( Y t , X , t ) d Y t ,
from which it follows
ψ t = 1 + 0 t ψ s H ( Y s , X , s ) d Y s .
We continue processing Equation (A4), using Equation (A5), as
E Q ψ t ϕ ( X ) | R s = E Q 1 + 0 t ψ s H ( Y s , X , s ) d Y s ϕ ( X ) | R t   = E Q ϕ ( X ) | R t + E Q 0 t ψ s H ( Y s , X , s ) ϕ ( X ) d Y s | R t   = E Q ϕ ( X ) | R t + 0 t E Q ψ s H ( Y s , X , s ) ϕ ( X ) | R s d Y s ,
where to obtain the last equality we used Lemma 5.4 in [62]. We also recall that, under  Q , the process  ( Y t Y 0 )  is independent of X. Thus, since  R t = σ ( R 0 σ ( Y 0 s t Y 0 ) )  and  d P d Q | F 0 Y , X = 1 , we obtain  E Q ϕ ( X ) | R t = E P [ ϕ ( X ) | R 0 ] . Concluding and rearranging:
π ^ t R , ϕ = π ^ 0 R , ϕ + 0 t π ^ s R , ϕ H ( Y s , · , s ) d Y s .
Obviously by the same arguments  π ^ t R , 1 = E Q [ d P d Q | R t ] = E Q ψ t | R t , and
π ^ t R , 1 = 1 + 0 t π ^ s R , H ( Y s , · , s ) d Y s .
From now on, for simplicity we assume that all the processes involved in our computations are 1-dimensional. The extension to the multidimensional case is trivial. First, let us notice that, by (A6) and Itô’s lemma, it holds
d ( π ^ t R , 1 1 ) = π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d Y s + π ^ t R , H ( Y t , · , t ) 2 π ^ t R , 1 3 d t .
Then, by the stochastic product rule,
d π t R , ψ = d π ^ t R , ϕ π ^ t R , 1 1 = π ^ t R , ϕ d ( π ^ t R , 1 1 ) + π ^ t R , 1 1 d π ^ t R , ϕ π ^ t R , ϕ H ( Y t , · , t ) π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d t = π ^ t R , ϕ π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d Y t + π ^ t R , ϕ π ^ t R , H ( Y t , · , t ) 2 π ^ t R , 1 3 d t + π ^ t R , ϕ H ( Y t , · , t ) π ^ t R , 1 d Y t π ^ t R , ϕ H ( Y t , · , t ) π ^ t R , H ( Y t , · , t ) π ^ t R , 1 2 d t .
Recalling (A3) and rearranging the terms lead us to
d π t R , ψ = π t R , ϕ π t R , H ( Y t , · , t ) d Y t + π t R , ϕ π t R , H ( Y t , · , t ) 2 d t + π t R , ϕ H ( Y t , · , t ) d Y t π t R , ϕ H ( Y t , · , t ) π t R , H ( Y t , · , t ) d t = π t R , ϕ H ( Y t , · , t ) π t R , ϕ π t R , H ( Y t , · , t ) d Y t π t R , H ( Y t , · , t ) d t .

Appendix E. Proof of Theorem 5

The proof of this Theorem involves two separate parts. First, we should show the second equality in Equation (7), i.e.,  log d P # Y 0 s t , ϕ d P # Y 0 s t d P # ϕ d P # Y 0 s t , ϕ = E P log d P | R t d P | F t Y d P | σ ( ϕ ) . Then, we should prove that the right-hand side of Equation (7) is equal to Equation (8).

Appendix E.1. Part 1

We overload in this Section the notation adopted in the rest of the paper for sake of simplicity in exposition. A random variable X on a probability space  ( Ω , F , P )  is defined as a measurable mapping  X : Ω Ψ , where the measure space  ( Ψ , G )  satisfies the usual assumptions. To be precise, X is measurable with reference to  F  if  E G , X 1 ( E ) F , where  X 1 ( E ) = { ω Ω : X ( ω ) E } . Equivalently,  E G , S F : X ( S ) = E . Of all the possible sigma-algebras which allow measurability, the sigma algebra induced by the random variable,  σ ( X ) , is the smallest one. It can be shown that  σ ( X ) = X 1 ( G ) = { A = X 1 ( B ) | B G } . We also denote by  P # X : G [ 0 , 1 ]  the push-forward measure associated with X (i.e., the law), which is defined by the relation  P # X ( E ) = P ( X 1 ( E ) )  for any  E G . Moreover, for any  G -measurable  ϕ , the following integration rule holds
Ψ φ ( x ) d P # X ( x ) = Ω φ ( X ( ω ) ) d P ( ω ) .
Let us focus on  ( Ω , σ ( X ) , P )  and let us consider a new measure  Q  absolutely continuous with reference to  P . Radon–Nikodym theorem guarantees existence of a  σ ( X ) -measurable function  Z : Ω [ 0 , + )  (the “derivative”  d Q d P = Z ) such that  Q ( A ) = A Z d P , for all  A σ ( X ) . Moreover, by Doob’s measurability criterion (see, e.g., Lemma 1.14 in [63]), there exists a  G -measurable map  f : Ψ [ 0 , + )  such that  Z = f ( X ) . Then, for any  E G ,
Q # X ( E ) = Q ( X 1 ( E ) ) = X 1 ( E ) f ( X ) d P ( ω ) = Ω 1 X 1 ( E ) ( ω ) f ( X ( ω ) ) d P ( ω )   = Ω 1 E ( X ( ω ) ) f ( X ( ω ) ) d P ( ω ) = Ψ 1 E ( x ) f ( x ) d P # X ( x ) = E f ( x ) d P # X ( x ) .
In summary, we have that  d Q # X d P # X = f , with  f : Ψ [ 0 , + ) .
Finally, then,
Ψ log ( d P # X d Q # X ) d P # X = Ψ log ( f ) d P # X = Ω log ( f ( X ) ) d P = Ω log d P d Q d P = E P [ log d P d Q ] .
What discussed so far, allows to prove that
log d P # Y 0 s t , ϕ d P # Y 0 s t d P # ϕ d P # Y 0 s t , ϕ = E P log d P | R t d P | F t Y d P | σ ( ϕ ) .
Indeed,
  • Consider on the space  ( Ω , R t , P | R t )  the random variable  T = ( Y 0 s t , ϕ ) . By construction,  σ ( T ) = R t .
  • Suppose that  P | R t  is absolutely continuous with reference to  P | F t Y × P | σ ( ϕ ) .
  • Then the desired equality follows from Equation (A7).

Appendix E.2. Part 2

Before proceeding, remember that the following holds: for all  R t R t Q R | R t = Q R | R t .
We restart from the right-hand side of Equation (7). Thanks to the chain rule for Radon–Nykodim derivatives
log d P | R t d P | F t Y d P | σ ( ϕ ) = log d P | R t d Q R | R t d Q R | R t d P | F t Y d P | σ ( ϕ )   = log d P | R t d Q R | R t d Q R | F t Y d P | F t Y d Q R | R t d Q R | F t Y d P | σ ( ϕ )   = log d P | R t d Q R | R t d Q F Y | F t Y d P | F t Y d Q R | R t d Q R | F t Y d P | σ ( ϕ )   = log ψ t R ( ψ t F Y ) 1 d Q R | F t Y d Q R | F t Y d P | σ ( ϕ )   = log ψ t R log ψ t F Y + log d Q R | R t d Q R | F t Y d Q R | σ ( ϕ ) ,
where we used Theorem 3 to make  ψ t R  and  ψ t F Y  appear, and the fact that  d Q R | σ ( ϕ ) = d P | σ ( ϕ ) .
Consequently
E P log d P | R t d P | F t Y d P | σ ( ϕ ) = E P log ψ t R log ψ t F Y + I ( Y 0 ; ϕ )   = E P 0 t E P [ h ( Y s , X , s ) | R s ] d W s R + 1 2 0 t | | E P [ h ( Y s , X , s ) | R s ] | | 2 d s   E P 0 t E P [ h ( Y s , X , s ) | F s Y ] d W s F Y + 1 2 0 t | | E P [ h ( Y s , X , s ) | F s Y ] | | 2 d s + I ( Y 0 ; ϕ )   = 1 2 E P 0 t | | E P [ h ( Y s , X , s ) | R s ] | | 2 | | E P [ h ( Y s , X , s ) | F s Y ] | | 2 d s + I ( Y 0 ; ϕ ) .
Actually, the result in the main is in a slightly different form. To show equivalence, it is necessary to prove that
E P | | E P [ h ( Y s , X , s ) | F s Y ] | | 2 2 E P E P [ h ( Y s , X , s ) | F s Y ] E P [ h ( Y s , X , s ) | R s ]   = E P | | E P [ h ( Y s , X , s ) | F s Y ] | | 2
which is trivially true since  E P · | F t Y = E P E P · | R s | F t Y . ☐

Appendix F. Proof of Theorem 6

Appendix F.1. Proof of Equation (9)

The inequality is proven considering that (i)
I ( Y 0 s t ; ϕ ) = E P | F t Y × P | σ ( ϕ ) η d P | R t d P | F t Y d P | σ ( ϕ )
and
I ( Y ˜ t ; ϕ ) = E P | σ ( Y ˜ t ) × P | σ ( ϕ ) η d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) = E P | F t Y × P | σ ( ϕ ) η d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) ,
with  η ( x ) = x log x , (ii) that  d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) = E P | F t Y × P | σ ( ϕ ) d P | R t d P | F t Y d P | σ ( ϕ ) | σ ( Y ˜ t , ϕ )  and (iii) that Jensen’s inequality holds ( η  is convex on its domain)
E P | F t Y × P | σ ( ϕ ) η d P | σ ( Y ˜ t , ϕ ) d P | σ ( Y ˜ t ) d P | σ ( ϕ ) = E P | F t Y × P | σ ( ϕ ) η E P | F t Y × P | σ ( ϕ ) d P | R t d P | F t Y d P | σ ( ϕ ) | σ ( Y ˜ t , ϕ ) E P | F t Y × P | σ ( ϕ ) E P | F t Y × P | σ ( ϕ ) η d P | R t d P | F t Y d P | σ ( ϕ ) | σ ( Y ˜ t , ϕ ) = E P | F t Y × P | σ ( ϕ ) η d P | R t d P | F t Y d P | σ ( ϕ ) .

Appendix F.2. Proof of Conditional Independence and Mutual Information Equality

Formally the condition of conditional independence given  π  is satisfied if for any  a 1 , a 2  positive random variables which are, respectively,  σ ( X )  and  F t Y  measurable, the following holds:  E P [ a 1 a 2 | σ ( π t ) ] = E P [ a 1 | σ ( π t ) ] E P [ a 2 | σ ( π t ) ]  (see for instance [64]).
The sigma-algebra  σ ( π t )  is by definition the smallest one that makes  π t  measurable. Since  π t  is  F t Y  measurable, clearly  σ ( π t ) F t Y . By the very definition of conditional expectation,  E P [ a 1 | F t Y ] = π t , a 1 , which is an  σ ( π t )  measurable quantity. Then,  E P [ a 1 a 2 | σ ( π t ) ] = E P [ E P [ a 1 a 2 | F t Y ] | σ ( π t ) ] = E P [ E P [ a 1 | F t Y ] a 2 | σ ( π t ) ] = E P [ E P [ π t , a 1 a 2 | σ ( π t ) ] = π t , a 1 E P [ a 2 | σ ( π t ) ] . Since  π t , a 1 = E P [ π t , a 1 | σ ( π t ) ] = E P [ E P [ a 1 | F t Y ] | σ ( π t ) ] = E P [ a 1 | σ ( π t ) ] , the proof of conditional independence is concluded.
In summary, $\sigma(X)$ and $\mathcal{F}^Y_t$ are conditionally independent given $\sigma(\pi_t)\subseteq\mathcal{F}^Y_t$. This implies that $\mathbb{P}(A\,|\,\sigma(\pi_t)) = \mathbb{P}(A\,|\,\mathcal{F}^Y_t)$ for all $A\in\sigma(X)$, or equivalently $\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\mathcal{F}^Y_t]$. To prove this, it is sufficient to show that, for any $B\in\mathcal{F}^Y_t$,
$$
\mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{1}(B)\big] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)].
$$
By standard properties of conditional expectation
$$
\mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{1}(B)\big] = \mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{E}_{\mathbb{P}}[\mathbb{1}(B)\,|\,\sigma(\pi_t)]\big].
$$
Due to conditional independence, $\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{E}_{\mathbb{P}}[\mathbb{1}(B)\,|\,\sigma(\pi_t)] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)\,|\,\sigma(\pi_t)]$. Then, $\mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,|\,\sigma(\pi_t)]\,\mathbb{E}_{\mathbb{P}}[\mathbb{1}(B)\,|\,\sigma(\pi_t)]\big] = \mathbb{E}_{\mathbb{P}}\big[\mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)\,|\,\sigma(\pi_t)]\big] = \mathbb{E}_{\mathbb{P}}[\mathbb{1}(A)\,\mathbb{1}(B)]$.
The mutual information equality is then proved considering that $\frac{d\,\mathbb{P}|_{\mathcal{R}_t}}{d\,\mathbb{P}|_{\mathcal{F}^Y_t}\,d\,\mathbb{P}|_{\sigma(\phi)}} = \frac{d\,\mathbb{P}(\omega_x\,|\,\mathcal{F}^Y_t)}{d\,\mathbb{P}(\omega_x)}$, since the conditional probabilities exist, and that $\mathbb{P}(\omega_x\,|\,\mathcal{F}^Y_t) = \mathbb{P}(\omega_x\,|\,\sigma(\pi_t))$. ☐

Appendix G. A Technical Note

As anticipated in the main text, Assumption 1 might be incompatible with the other technical assumptions in Appendix A. The problem can arise from singularities in the drift term at time $t = T$, which are usually present in the construction of dynamics satisfying Assumption 1, such as stochastic bridges. This mathematical subtlety can be interpreted more clearly by noticing that, when Assumption 1 is satisfied, the posterior process $\pi_t$ at time $T$ can concentrate on a portion of the space of lower dimensionality than at any earlier time $T-\epsilon$, $\epsilon>0$. Alternatively, notice that if Assumption 1 is satisfied, then $I(Y_{0\le s\le T}; V) = I(V;V)$, which can be infinite depending on the actual structure of $\mathcal{S}$ and the mapping $V$. In many cases, a simple technical solution is to restrict the analysis to the dynamics of the process on the time interval $[0,T)$. (This is akin to the discussion of arbitrage strategies in finance when the initial filtration is augmented with knowledge of the future value at certain time instants: the new process, adapted to the enlarged filtration, is a martingale with respect to a suitable new measure for all $t\in[0,T)$, but fails to be so at $t = T$, thus giving an arbitrage opportunity.) In the reduced time interval $[0,T)$, the technical assumptions are generally shown to be satisfied. For the practical purposes explored in this work, this restriction makes no difference, and we consequently neglect it for the rest of our discussion.

Appendix H. Linear Diffusion Models

Consider the particular case of linear generative diffusion models [14], which are widely adopted in the literature and by practitioners. This corresponds to the particular case of Equation (11) in which the function F is linear:
$$
\hat{Y}_t = \hat{Y}_0 - \alpha\int_0^t \hat{Y}_s\, ds + \hat{W}_t, \tag{A8}
$$
for a given $\alpha\ge 0$. We again assume that Assumption 1 holds, which implies that we should select $\hat{Y}_0 = Y_T = V$. The parameter $\alpha$ dictates the behavior of the SDE, which can be cast into the so-called variance-preserving and variance-exploding schedules of diffusion models [14]. In diffusion-model jargon, Equation (A8) is typically referred to as a noising process: indeed, $\hat{Y}_t$ evolves into a noisier and noisier version of $V$ as $t$ grows. In particular, it holds that
$$
\hat{Y}_t = \exp(-\alpha t)\, V + \exp(-\alpha t)\int_0^t \exp(\alpha s)\, d\hat{W}_s.
$$
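To make the noising dynamics concrete, the short numerical sketch below (our own illustrative example, not part of the original experiments; the two-point "data" distribution and all parameter values are arbitrary choices) simulates Equation (A8) with Euler–Maruyama and compares the resulting terminal moments with the closed-form expression above.

```python
import numpy as np

# Illustrative check of the closed-form solution of the linear noising SDE
# dY = -alpha * Y dt + dW started at Y_0 = V (toy 1-D setting, arbitrary parameters).
rng = np.random.default_rng(0)
alpha, T, n_steps, n_paths = 1.0, 1.0, 1000, 10_000
dt = T / n_steps
V = rng.choice([-2.0, 2.0], size=n_paths)            # toy "data" distribution

Y = V.copy()
for _ in range(n_steps):                              # Euler-Maruyama discretisation of (A8)
    Y += -alpha * Y * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

# Closed form: Y_T = exp(-alpha T) V + sigma_T * eps, with
# sigma_T^2 = (1 - exp(-2 alpha T)) / (2 alpha) and eps ~ N(0, 1).
sigma_T = np.sqrt((1 - np.exp(-2 * alpha * T)) / (2 * alpha))
Y_exact = np.exp(-alpha * T) * V + sigma_T * rng.standard_normal(n_paths)

print(Y.mean(), Y.var())          # the two pairs of moments should approximately agree
print(Y_exact.mean(), Y_exact.var())
```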
The next result is a particular case of Theorem 7.
Lemma A1. 
Consider the stochastic process $\hat{Y}_t$ which solves Equation (A8). The time-reversed process $Y_t = \hat{Y}_{T-t}$ admits an $\mathcal{F}^Y_t$-adapted representation
$$
Y_t = Y_0 + \int_0^t \left(\alpha Y_s + 2\alpha\,\frac{\exp(-\alpha(T-s))\,\mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s)] - Y_s}{1-\exp(-2\alpha(T-s))}\right) ds + W_t, \tag{A9}
$$
where $Y_0 = \exp(-\alpha T)\,V + \sqrt{\frac{1-\exp(-2\alpha T)}{2\alpha}}\;\epsilon$, with $\epsilon$ a standard Gaussian random variable independent of $V$ and $W_t$.
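As an illustration of Lemma A1, the following minimal sketch (our own simplifying assumption: $V$ is one-dimensional Gaussian, so that $\mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s)]$ is available in closed form; this is not the paper's image setting) integrates Equation (A9) with Euler–Maruyama and checks that the terminal marginal of $Y$ matches the law of $V$:

```python
import numpy as np

# Toy sampler for the F^Y-adapted representation of Lemma A1, with Gaussian V
# so that the conditional expectation E[V | Y_s] is analytic (illustration only).
rng = np.random.default_rng(1)
alpha, T, sigma_V = 1.0, 1.0, 1.5
n_steps, n_paths = 2000, 20_000
dt = T / n_steps

def post_mean_V(y, s):
    """E[V | Y_s = y], since Y_s = a*V + sqrt(v)*noise with Gaussian V."""
    a = np.exp(-alpha * (T - s))
    v = (1 - np.exp(-2 * alpha * (T - s))) / (2 * alpha)
    return a * sigma_V**2 / (a**2 * sigma_V**2 + v) * y

# Initial condition of Lemma A1
V = sigma_V * rng.standard_normal(n_paths)
Y = np.exp(-alpha * T) * V \
    + np.sqrt((1 - np.exp(-2 * alpha * T)) / (2 * alpha)) * rng.standard_normal(n_paths)

for k in range(n_steps - 1):                  # stop one step before t = T (drift singularity)
    s = k * dt
    score_like = (np.exp(-alpha * (T - s)) * post_mean_V(Y, s) - Y) / (1 - np.exp(-2 * alpha * (T - s)))
    Y += (alpha * Y + 2 * alpha * score_like) * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

print(Y.std())   # ~ sigma_V: the terminal marginal approximately recovers the law of V
```

The drift blows up as $t\to T$ (cf. Appendix G), which is why the integration stops one step early.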
As discussed in the main paper, we can now show that the same generative dynamics can be obtained under the NLF framework we present in this work, without the need to explicitly define a backward and a forward process. In particular, we can directly select an observation function that corresponds to an Ornstein–Uhlenbeck bridge [65,66], consequently satisfying Assumption 1, and obtain the generative dynamics of classical diffusion models. In particular, we consider the following about H (notice that with H selected as in Assumption A6, the validity of the theory is restricted to the time interval $[0, T)$; see also Appendix G):
Assumption A6. 
The function H in Equation (1) is selected to be of the linear form
$$
H(Y_t, X, t) = m_t\, V - \frac{d\log m_t}{dt}\, Y_t,
$$
with $m_t = \frac{\alpha}{\sinh(\alpha(T-t))}$, where $\alpha\ge 0$. When $\alpha = 0$, $m_t = \frac{d\log m_t}{dt} = \frac{1}{T-t}$. Furthermore, $Y_0$ is selected as in Theorem 7. Under this assumption, $Y_T = V$, $\mathbb{P}$-a.s., i.e., Assumption 1 is satisfied (see Appendix I).
In summary, the particular case of Equation (1) (which is $\mathcal{F}^{Y,X}$-adapted) under Assumption A6 can be transformed into a generative model by leveraging Theorem 2, since Assumption 1 holds. When doing so, we obtain that the process $Y_t$ has the $\mathcal{F}^Y$-adapted representation
$$
Y_t = Y_0 + \int_0^t m_s\,\mathbb{E}_{\mathbb{P}}[V\,|\,\mathcal{F}^Y_s]\, ds - \int_0^t \frac{d\log m_s}{ds}\, Y_s\, ds + W^{F^Y}_t,
$$
which is nothing but Equation (A9) after some simple algebraic manipulation (spelled out below). The only detail worth deeper exposition is the actual computation of the conditional expectation of interest. If $\mathbb{P}$ is selected such that $\hat{Y}_t$ solves Equation (A8), we have that
$$
\mathbb{E}_{\mathbb{P}}[V\,|\,\mathcal{F}^Y_t] = \mathbb{E}_{\mathbb{P}}\big[Y_T\,\big|\,\sigma(Y_{0\le s\le t})\big] = \mathbb{E}_{\mathbb{P}}\big[\hat{Y}_0\,\big|\,\sigma(\hat{Y}_{T-t\le s\le T})\big] = \mathbb{E}_{\mathbb{P}}\big[\hat{Y}_0\,\big|\,\sigma(\hat{Y}_{T-t})\big] = \mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_t)],
$$
where the second to last equality is due to the Markov nature of  Y ^ t .
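For completeness, the simple algebraic manipulation mentioned above reduces to two elementary identities, which we spell out here for the reader's convenience (using only the definition $m_s = \alpha/\sinh(\alpha(T-s))$):
$$
\frac{2\alpha\, e^{-\alpha(T-s)}}{1-e^{-2\alpha(T-s)}} = \frac{2\alpha}{e^{\alpha(T-s)}-e^{-\alpha(T-s)}} = \frac{\alpha}{\sinh(\alpha(T-s))} = m_s,
\qquad
\alpha - \frac{2\alpha}{1-e^{-2\alpha(T-s)}} = -\alpha\coth(\alpha(T-s)) = -\frac{d\log m_s}{ds},
$$
so that the drift of Equation (A9), $\alpha Y_s + 2\alpha\,\frac{\exp(-\alpha(T-s))\,\mathbb{E}_{\mathbb{P}}[V|\sigma(Y_s)] - Y_s}{1-\exp(-2\alpha(T-s))}$, coincides with $m_s\,\mathbb{E}_{\mathbb{P}}[V\,|\,\mathcal{F}^Y_s] - \frac{d\log m_s}{ds}\,Y_s$ once the conditional expectation is rewritten as above.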
Moreover, in this particular case the mutual information $I(Y_{0\le s\le t};\phi) = I(Y_t;\phi)$ (where we removed the past of $Y$ since the Markov chain $\phi \to \hat{Y}_0 \to \hat{Y}_{t>0}$ holds) can be expressed in the simpler form
$$
I(Y_t;\phi) = I(Y_0;\phi) + \frac12\,\mathbb{E}_{\mathbb{P}}\!\left[\int_0^t m_s^2\,\big\|\mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s)] - \mathbb{E}_{\mathbb{P}}[V\,|\,\sigma(Y_s,\phi)]\big\|^2\, ds\right], \tag{A10}
$$
matching the result described in [53], obtained with the formalism of time reversal of SDEs.

Appendix I. Discussion About Assumption A6

This is easily checked thanks to the following equality
$$
Y_t = Y_0\,\frac{m_0}{m_t} + V\,\frac{m_0}{m_{T-t}} + \int_0^t \frac{m_s}{m_t}\, dW_s. \tag{A11}
$$
To avoid cluttering the notation, we define $f_t = \frac{d\log m_t}{dt}$. To show that Equation (A11) is true, it is sufficient to observe (i) that the initial conditions are met and (ii) that the time differential of the process is the correct one. We proceed to show that the second condition holds (the first one is trivially true).
$$
\begin{aligned}
dY_t &= -\alpha Y_0\,\frac{\cosh(\alpha(T-t))}{\sinh(\alpha T)}\,dt + \alpha r(X)\,\frac{\cosh(\alpha t)}{\sinh(\alpha T)}\,dt - \alpha\cosh(\alpha(T-t))\left(\int_0^t \frac{1}{\sinh(\alpha(T-s))}\,dW_s\right) dt + dW_t\\
&= -\frac{\alpha\cosh(\alpha(T-t))}{\sinh(\alpha(T-t))}\left(Y_0\,\frac{\sinh(\alpha(T-t))}{\sinh(\alpha T)} + \int_0^t \frac{\sinh(\alpha(T-t))}{\sinh(\alpha(T-s))}\,dW_s\right) dt + \alpha r(X)\,\frac{\cosh(\alpha t)}{\sinh(\alpha T)}\,dt + dW_t\\
&= -\alpha\coth(\alpha(T-t))\left(Y_t - r(X)\,\frac{\sinh(\alpha t)}{\sinh(\alpha T)}\right) dt + \alpha r(X)\,\frac{\cosh(\alpha t)}{\sinh(\alpha T)}\,dt + dW_t\\
&= -f_t\, Y_t\, dt + \alpha r(X)\,\frac{\coth(\alpha(T-t))\sinh(\alpha t) + \cosh(\alpha t)}{\sinh(\alpha T)}\,dt + dW_t\\
&= -f_t\, Y_t\, dt + m_t\, r(X)\, dt + dW_t,
\end{aligned}
$$
where the result is obtained considering that
$$
\begin{aligned}
\frac{\coth(\alpha(T-t))\sinh(\alpha t) + \cosh(\alpha t)}{\sinh(\alpha T)}
&= \frac{\dfrac{e^{\alpha(T-t)}+e^{-\alpha(T-t)}}{e^{\alpha(T-t)}-e^{-\alpha(T-t)}}\left(e^{\alpha t}-e^{-\alpha t}\right) + e^{\alpha t}+e^{-\alpha t}}{e^{\alpha T}-e^{-\alpha T}}\\[4pt]
&= \frac{\dfrac{e^{\alpha T}-e^{\alpha(T-2t)}+e^{-\alpha(T-2t)}-e^{-\alpha T}}{e^{\alpha(T-t)}-e^{-\alpha(T-t)}} + e^{\alpha t}+e^{-\alpha t}}{e^{\alpha T}-e^{-\alpha T}}\\[4pt]
&= \frac{e^{\alpha T}-e^{\alpha(T-2t)}+e^{-\alpha(T-2t)}-e^{-\alpha T} + e^{\alpha T}+e^{\alpha(T-2t)}-e^{-\alpha(T-2t)}-e^{-\alpha T}}{\left(e^{\alpha(T-t)}-e^{-\alpha(T-t)}\right)\left(e^{\alpha T}-e^{-\alpha T}\right)}\\[4pt]
&= \frac{2}{e^{\alpha(T-t)}-e^{-\alpha(T-t)}} = \frac{1}{\sinh(\alpha(T-t))} = \frac{m_t}{\alpha}.
\end{aligned}
$$
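As a quick sanity check of the identity above (our own verification script, not part of the paper), the hyperbolic manipulation can also be confirmed symbolically:

```python
import sympy as sp

# Symbolic verification of: coth(a(T-t))*sinh(a t) + cosh(a t) = sinh(a T)/sinh(a(T-t))
a, T, t = sp.symbols('alpha T t', positive=True)
lhs = sp.coth(a * (T - t)) * sp.sinh(a * t) + sp.cosh(a * t)
rhs = sp.sinh(a * T) / sp.sinh(a * (T - t))
print(sp.simplify((lhs - rhs).rewrite(sp.exp)))   # expected output: 0
```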

Appendix J. Experimental Details

Appendix J.1. Dataset Details

The Shapes3D dataset [57] includes the following attributes and the number of classes for each, as shown in Table A1.
Table A1. Attributes and class counts in the Shapes3D dataset.

Attribute       Number of Classes
Floor hue       10
Object hue      10
Orientation     15
Scale           8
Shape           4
Wall hue        10

Appendix J.2. Unconditional Diffusion Model Training

We train the unconditional denoising score network using the NCSN++ architecture [14], which corresponds to a U-Net [58]. The model is trained from scratch using the score-matching objective. The training hyperparameters are summarized in Table A2.
Table A2. Hyperparameters for unconditional diffusion model training.

Parameter                    Value
Epochs                       100
Batch size                   256
Learning rate                1 × 10⁻⁴
Optimizer                    AdamW [67]
  β₁                         0.95
  β₂                         0.999
  Weight decay               1 × 10⁻⁶
  Epsilon                    1 × 10⁻⁸
Learning rate scheduler      Cosine annealing with warmup
  Warmup steps               500
Gradient clipping            1.0
EMA decay                    0.9999
Mixed precision              FP16
Scheduler                    Variance Exploding [14]
  σ_min                      0.01
  σ_max                      90
Loss function                Denoising score matching [14]

Appendix J.3. Linear Probing Experiment Details

In the linear probing experiments, we train a linear classifier on the feature maps extracted from the denoising score network at various noise levels  τ . The training details are provided in Table A3.
Table A3. Hyperparameters for linear probing experiments.

Parameter           Value
Batch size          64
Loss function       Cross-Entropy Loss
Optimizer           Adam [68]
Learning rate       1 × 10⁻⁶ for τ = 0.9 or τ = 0.99; 1 × 10⁻⁴ for all other τ values
Number of epochs    30
Inputs              Feature maps (used as-is in the linear layer); noisy images (scaled to [−1, +1])
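Schematically, the probing protocol of Table A3 amounts to fitting a single linear layer on frozen features. The self-contained sketch below uses random tensors as stand-ins for the denoiser feature maps (the real pipeline extracts them from the frozen NCSN++ at noise level τ), so only the training loop itself should be read as indicative:

```python
import torch
import torch.nn as nn

# Linear-probe training loop (toy stand-ins: random "feature maps" and labels).
torch.manual_seed(0)
num_classes, feature_dim, n_train = 10, 256, 4096
feats = torch.randn(n_train, feature_dim)        # stand-in for frozen U-Net features at noise level tau
labels = torch.randint(0, num_classes, (n_train,))

probe = nn.Linear(feature_dim, num_classes)      # the only trainable module
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(30):                          # 30 epochs, batch size 64, as in Table A3
    for i in range(0, n_train, 64):
        x, y = feats[i:i + 64], labels[i:i + 64]
        loss = loss_fn(probe(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```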

Appendix J.4. Mutual Information Estimation Experiment Details

For mutual information estimation, we train a conditional diffusion model using the same NCSN++ architecture as before. The conditioning is incorporated through a distinct class embedding for each label present in the input image, which is added to the input embedding together with the timestep embedding. The hyperparameters are the same as those used for the unconditional diffusion model (see Table A2).
To calculate the mutual information, we use Equation (A10), estimating the integral using the midpoint rule with 999 points uniformly spaced in  [ 0 , T ] .
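To illustrate the midpoint-rule computation, consider the following toy one-dimensional example of our own, where $V$ is a $\pm 1$ label, $\phi = V$, and both conditional expectations in the integrand are analytic; in the actual experiments they are replaced by the outputs of the unconditional and conditional denoising networks:

```python
import numpy as np

# Midpoint-rule estimate of the MI integral in a toy setting where
# E[V | Y_s] = tanh(a*y/v) and E[V | Y_s, phi] = phi = V are known in closed form.
rng = np.random.default_rng(0)
alpha, T, n_points, n_mc = 1.0, 5.0, 999, 4000

def mi_rate(s):
    """Monte Carlo estimate of (1/2) m_s^2 E|| E[V|Y_s] - E[V|Y_s, phi] ||^2."""
    a = np.exp(-alpha * (T - s))                       # Y_s = a*V + sqrt(v)*noise
    v = (1 - np.exp(-2 * alpha * (T - s))) / (2 * alpha)
    m_sq = (alpha / np.sinh(alpha * (T - s))) ** 2     # m_s^2
    V = rng.choice([-1.0, 1.0], size=n_mc)
    Y = a * V + np.sqrt(v) * rng.standard_normal(n_mc)
    return 0.5 * m_sq * np.mean((np.tanh(a * Y / v) - V) ** 2)

midpoints = (np.arange(n_points) + 0.5) * T / n_points
mi = sum(mi_rate(s) for s in midpoints) * T / n_points  # midpoint rule on [0, T]
print(mi)   # should be close to log(2) ~ 0.693 nats, since I(Y_0; phi) is negligible here
```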
Figure A1. Visualization of the forking experiment with num_forks = 4 and one initial seed. The image at time τ = 0.4 is quite noisy. In the final generations after forking, the images exhibit coherence in the labels: shape, wall hue, floor hue, and object hue. However, there is variation in orientation and scale.

Appendix J.5. Forking Experiment Details

In the forking experiments, we use a ResNet50 [61] model with an additional linear layer, trained from scratch, to classify the generated images and assess label coherence across forks. The training details for the classifier are summarized in Table A4.
Table A4. Hyperparameters for the classifier in forking experiments.

Parameter           Value
Image size          224 (resized with bilinear interpolation)
Image scaling       [−1, +1]
Dataset split       Training set: 72%; validation set: 8%; test set: 20%
Early stopping      Stop when validation accuracy exceeds 99%; evaluated every 1000 steps
Number of epochs    1
Optimizer           Adam [68]
Learning rate       1 × 10⁻⁴
During the sampling process of the forking experiment, we use the settings summarized in Table A5.
Table A5. Sampling settings for the forking experiments.

Parameter                      Value
Stochastic predictor           Euler–Maruyama method with 1000 steps
Corrector                      Langevin dynamics with 1 step
Signal-to-noise ratio (SNR)    0.06
Number of forks (k)            100
Number of seeds                10 (independent initial noise samples)
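As a schematic illustration of the forking procedure summarized above (again a one-dimensional toy of our own, where the binary "label" is simply the sign of the generated sample and the analytic posterior mean plays the role of the learned denoiser): integrate the generative SDE up to an intermediate time τ, duplicate the state into num_forks copies, continue each copy with independent noise, and check whether the label agrees across forks.

```python
import numpy as np

# Toy forking experiment: branch the generative SDE at time tau and inspect
# label coherence (here the label is sign(Y_T)) across the forks.
rng = np.random.default_rng(2)
alpha, T, n_steps, num_forks = 1.0, 1.0, 1000, 4
tau, sigma_V = 0.4 * T, 1.0
dt = T / n_steps

def post_mean_V(y, s):                      # E[V | Y_s = y] for Gaussian V (stand-in for the denoiser)
    a = np.exp(-alpha * (T - s))
    v = (1 - np.exp(-2 * alpha * (T - s))) / (2 * alpha)
    return a * sigma_V**2 / (a**2 * sigma_V**2 + v) * y

def drift(y, s):
    return alpha * y + 2 * alpha * (np.exp(-alpha * (T - s)) * post_mean_V(y, s) - y) \
        / (1 - np.exp(-2 * alpha * (T - s)))

# one seed: integrate a single path up to time tau ...
y = np.sqrt(np.exp(-2 * alpha * T) * sigma_V**2
            + (1 - np.exp(-2 * alpha * T)) / (2 * alpha)) * rng.standard_normal()
s = 0.0
while s < tau:
    y += drift(y, s) * dt + np.sqrt(dt) * rng.standard_normal()
    s += dt

# ... then fork and continue each copy with independent Brownian increments
forks = np.full(num_forks, y)
while s < T - dt:                           # stop one step before T (drift singularity)
    forks += drift(forks, s) * dt + np.sqrt(dt) * rng.standard_normal(num_forks)
    s += dt

print(np.sign(forks))                       # agreement across forks <=> the "label" was already decided at tau
```

Repeating this for a grid of τ values and many seeds, and replacing sign(·) with the ResNet50 classifier, yields the label-coherence (entropy across forks) curves reported in the main text.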

Appendix J.6. Linear Probing on Raw Data

In Figure A2, we evaluate the performance of linear probes trained on feature maps extracted from the denoiser network, and compare their log-probability accuracy with that of a linear probe trained on the raw, noisy input and of a random-guessing baseline. Throughout the generative process, linear probes obtain higher accuracy than the baselines: for large noise levels, a linear probe on raw input data fails, whereas the inner layers of the denoising network extract features that are sufficient to discern latent labels.
Figure A2. Log-probability accuracy of linear classifiers at τ. 'Feature map' classifiers are trained on network features; 'Noisy Image' classifiers are trained on noisy images; 'Random Guess' is the random-guessing baseline.

Appendix J.7. Additional Experiments on CelebA Dataset

We present results obtained on the CelebA dataset [69], which consists of over 200,000 celebrity images annotated with 40 binary attributes. We focus our analysis on the attributes "Male" and "Eyeglasses", as these are (i) among the most reliably and objectively labeled features in the CelebA dataset (this is supported by previous work, which highlights significant labeling issues for many other attributes, making them less suitable for consistent analysis [70]) and (ii) representative examples of attributes that map to more global and more local features, respectively. The unconditional and conditional diffusion models were trained using the same architecture, optimization, and training hyperparameters as Song et al. [14]. Both models employ a variance-exploding diffusion process with a U-Net backbone for the denoising score network. Training details, including the learning rate, batch size, and noise schedule, are the same as in Song et al. [14]. We present a comprehensive analysis of the results derived from probing experiments, mutual information (MI) estimation, and the rate of increase in MI across the generative process.
Figure A3. Probing accuracy and mutual information (MI) as a function of the noise intensity parameter  τ .
Probing vs. MI. Our results, shown in Figure A3, illustrate the consistent joint growth of classifier accuracy (probing performance) and mutual information as a function of the noise intensity parameter τ. For both attributes, probing accuracy increases steadily, mirroring the growth of MI.
Figure A4. Mutual information (MI) growth for “Male” and “Eyeglasses” attributes across the generative process.
Mutual Information Across Labels. Figure A4 compares MI growth across the "Male" and "Eyeglasses" attributes. A key observation is that the MI for "Male" rises earlier than for "Eyeglasses", beginning at τ = 0.2 compared to τ = 0.3. This aligns with the intuition that some latent abstractions emerge earlier in the generative process than others, given that the average number of pixels affected by global features is larger than that affected by local ones.
Figure A5. Rate of change of mutual information (MI) for “Male” and “Eyeglasses” attributes as a function of  τ .
Rate of Increase in MI. To further investigate the dynamics, we plot Δ(MI)/Δτ, the rate of change of MI, for the two attributes (Figure A5). "Male" exhibits a significantly faster initial growth rate than "Eyeglasses", peaking around τ = 0.4. This confirms the earlier emergence of "Male" as a latent abstraction, with a sharp rise in MI during the early stages. In contrast, the MI for "Eyeglasses" grows more gradually, reflecting a slower but steady emergence of this attribute.

References

  1. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 8780–8794. [Google Scholar]
  2. Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D.P.; Poole, B.; Norouzi, M.; Fleet, D.J.; et al. Imagen Video: High Definition Video Generation with Diffusion Models. 2022. Available online: https://imagen.research.google/video/paper.pdf (accessed on 21 February 2025).
  3. He, Y.; Yang, T.; Zhang, Y.; Shan, Y.; Chen, Q. Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths. arXiv 2022, arXiv:2211.13221. [Google Scholar]
  4. Li, X.L.; Thickstun, J.; Gulrajani, I.; Liang, P.; Hashimoto, T. Diffusion-LM Improves Controllable Text Generation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Oh, A.H., Agarwal, A., Belgrave, D., Cho, K., Eds. [Google Scholar]
  5. He, Z.; Sun, T.; Tang, Q.; Wang, K.; Huang, X.; Qiu, X. DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1. Long Papers. [Google Scholar]
  6. Gulrajani, I.; Hashimoto, T. Likelihood-Based Diffusion Language Models. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  7. Lou, A.; Meng, C.; Ermon, S. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  8. Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  9. Liu, J.; Li, C.; Ren, Y.; Chen, F.; Zhao, Z. DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11020–11028. [Google Scholar]
  10. Trippe, B.L.; Yim, J.; Tischer, D.; Baker, D.; Broderick, T.; Barzilay, R.; Jaakkola, T. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv 2022, arXiv:2206.04119. [Google Scholar]
  11. Hoogeboom, E.; Satorras, V.G.; Vignac, C.; Welling, M. Equivariant Diffusion for Molecule Generation in 3D. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Proceedings of Machine Learning Research. Volume 162, pp. 8867–8887. [Google Scholar]
  12. Luo, S.; Hu, W. Diffusion Probabilistic Models for 3D Point Cloud Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2837–2845. [Google Scholar]
  13. Zeng, X.; Vahdat, A.; Williams, F.; Gojcic, Z.; Litany, O.; Fidler, S.; Kreis, K. LION: Latent Point Diffusion Models for 3D Shape Generation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  14. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  15. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  16. Albergo, M.S.; Boffi, N.M.; Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv 2023, arXiv:2303.08797. [Google Scholar]
  17. Chen, Y.; Viégas, F.; Wattenberg, M. Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model. arXiv 2023, arXiv:2306.05720. [Google Scholar]
  18. Linhardt, L.; Morik, M.; Bender, S.; Borras, N.E. An Analysis of Human Alignment of Latent Diffusion Models. arXiv 2024, arXiv:2403.08469. [Google Scholar]
  19. Tang, L.; Jia, M.; Wang, Q.; Phoo, C.P.; Hariharan, B. Emergent correspondence from image diffusion. Adv. Neural Inf. Process. Syst. 2023, 36, 1363–1389. [Google Scholar]
  20. Berner, J.; Richter, L.; Ullrich, K. An optimal control perspective on diffusion-based generative modeling. arXiv 2022, arXiv:2211.01364. [Google Scholar]
  21. Richter, L.; Berner, J. Improved sampling via learned diffusions. arXiv 2023, arXiv:2307.01198. [Google Scholar]
  22. Ye, M.; Wu, L.; Liu, Q. First hitting diffusion models for generating manifold, graph and categorical data. Adv. Neural Inf. Process. Syst. 2022, 35, 27280–27292. [Google Scholar]
  23. Raginsky, M. A variational approach to sampling in diffusion processes. arXiv 2024, arXiv:2405.00126. [Google Scholar]
  24. Bain, A.; Crisan, D. Fundamentals of Stochastic Filtering; Springer: Berlin/Heidelberg, Germany, 2009; Volume 3. [Google Scholar]
  25. Van Handel, R. Filtering, Stability, and Robustness. Ph.D. Thesis, California Institute of Technology, Pasadena, CA, USA, 2007. [Google Scholar]
  26. Kutschireiter, A.; Surace, S.C.; Pfister, J.P. The Hitchhiker’s guide to nonlinear filtering. J. Math. Psychol. 2020, 94, 102307. [Google Scholar]
  27. Bisk, Y.; Holtzman, A.; Thomason, J.; Andreas, J.; Bengio, Y.; Chai, J.; Lapata, M.; Lazaridou, A.; May, J.; Nisnevich, A.; et al. Experience grounds language. arXiv 2020, arXiv:2004.10151. [Google Scholar]
  28. Bender, E.M.; Koller, A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5185–5198. [Google Scholar]
  29. Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv 2022, arXiv:2210.13382. [Google Scholar]
  30. Park, Y.H.; Kwon, M.; Choi, J.; Jo, J.; Uh, Y. Understanding the latent space of diffusion models through the lens of Riemannian geometry. Adv. Neural Inf. Process. Syst. 2023, 36, 24129–24142. [Google Scholar]
  31. Kwon, M.; Jeong, J.; Uh, Y. Diffusion Models already have a Semantic Latent Space. arXiv 2023, arXiv:2210.10960. [Google Scholar]
  32. Xiang, W.; Yang, H.; Huang, D.; Wang, Y. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15802–15812. [Google Scholar]
  33. Haas, R.; Huberman-Spiegelglas, I.; Mulayoff, R.; Graßhof, S.; Brandt, S.S.; Michaeli, T. Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models. arXiv 2024, arXiv:2303.11073. [Google Scholar]
  34. Sclocchi, A.; Favero, A.; Wyart, M. A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data. arXiv 2024, arXiv:2402.16991. [Google Scholar]
  35. Raya, G.; Ambrogioni, L. Spontaneous symmetry breaking in generative diffusion models. J. Stat. Mech. Theory Exp. 2024, 2024, 104025. [Google Scholar]
  36. Ambrogioni, L. The statistical thermodynamics of generative diffusion models. arXiv 2023, arXiv:2310.17467. [Google Scholar]
  37. Newton, N.J. Interactive statistical mechanics and nonlinear filtering. J. Stat. Phys. 2008, 133, 711–737. [Google Scholar]
  38. Duncan, T.E. On the calculation of mutual information. SIAM J. Appl. Math. 1970, 19, 215–220. [Google Scholar]
  39. Duncan, T.E. Mutual information for stochastic differential equations. Inf. Control 1971, 19, 265–271. [Google Scholar] [CrossRef]
  40. Mitter, S.K.; Newton, N.J. A variational approach to nonlinear estimation. SIAM J. Control Optim. 2003, 42, 1813–1833. [Google Scholar]
  41. Øksendal, B. Stochastic Differential Equations; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  42. Fotsa-Mbogne, D.J.; Pardoux, E. Nonlinear filtering with degenerate noise. Electron. Commun. Probab. 2017, 22, 1–14. [Google Scholar] [CrossRef]
  43. Anderson, B.D.O. Reverse-time diffusion equation models. Stoch. Process. Their Appl. 1982, 12, 313–326. [Google Scholar]
  44. Pardoux, E. Grossissement d’une filtration et retournement du temps d’une diffusion. In Séminaire de Probabilités XX 1984/85: Proceedings; Springer: Berlin/Heidelberg, Germany, 2006; pp. 48–55. [Google Scholar]
  45. Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes, and Martingales: Itô Calculus; Cambridge University Press: Cambridge, UK, 2000; Volume 2. [Google Scholar]
  46. Aksamit, A.; Jeanblanc, M. Enlargement of Filtration with Finance in View; Springer: Cham, Switzerland, 2017. [Google Scholar]
  47. Ouwehand, P. Enlargement of Filtrations—A Primer. arXiv 2022, arXiv:2210.07045. [Google Scholar]
  48. Grigorian, K.; Jarrow, R.A. Enlargement of Filtrations: An Exposition of Core Ideas with Financial Examples. arXiv 2023, arXiv:2303.03573. [Google Scholar]
  49. Çetin, U.; Danilova, A. Markov bridges: SDE representation. Stoch. Process. Their Appl. 2016, 126, 651–679. [Google Scholar] [CrossRef]
  50. Haussmann, U.G.; Pardoux, E. Time Reversal of Diffusions. Ann. Probab. 1986, 14, 1188–1205. [Google Scholar] [CrossRef]
  51. Shreve, S.E. Stochastic Calculus for Finance II: Continuous-Time Models; Springer: Berlin/Heidelberg, Germany, 2004; Volume 11. [Google Scholar]
  52. Al-Hussaini, A.N.; Elliott, R.J. Enlarged filtrations for diffusions. Stoch. Process. Their Appl. 1987, 24, 99–107. [Google Scholar] [CrossRef]
  53. Franzese, G.; Bounoua, M.; Michiardi, P. MINDE: Mutual Information Neural Diffusion Estimation. In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna Austria, 7–11 May 2024. [Google Scholar]
  54. Kidger, P.; Foster, J.; Li, X.; Lyons, T.J. Neural SDEs as infinite-dimensional GANs. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 5453–5463. [Google Scholar]
  55. De Bortoli, V.; Thornton, J.; Heng, J.; Doucet, A. Diffusion Schrödinger bridge with applications to score-based generative modeling. Adv. Neural Inf. Process. Syst. 2021, 34, 17695–17709. [Google Scholar]
  56. Hodgkinson, L.; van der Heide, C.; Roosta, F.; Mahoney, M.W. Stochastic continuous normalizing flows: Training SDEs as ODEs. In Proceedings of the Uncertainty in Artificial Intelligence, Online, 27–30 July 2021; pp. 1130–1140. [Google Scholar]
  57. Kim, H.; Mnih, A. Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2649–2658. [Google Scholar]
  58. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  59. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2011; pp. 145–158. [Google Scholar]
  60. Szymański, P.; Kajdanowicz, T. A Network Perspective on Stratification of Multi-Label Data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, Skopje, Macedonia, 22 September 2017; Torgo, L., Krawczyk, B., Branco, P., Moniz, N., Eds.; Proceedings of Machine Learning Research; Volume 74, pp. 22–35.
  61. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  62. Xiong, J. An Introduction to Stochastic Filtering Theory; Oxford University Press: Oxford, UK, 2008. [Google Scholar]
  63. Kallenberg, O. Foundations of Modern Probability, 2nd ed.; Probability and its Applications (New York); Springer: New York, NY, USA, 2002; p. 18. [Google Scholar] [CrossRef]
  64. Van Putten, C.; van Schuppen, J.H. Invariance properties of the conditional independence relation. Ann. Probab. 1985, 13, 934–945. [Google Scholar] [CrossRef]
  65. Mazzolo, A. Constraint Ornstein–Uhlenbeck bridges. J. Math. Phys. 2017, 58, 093302. [Google Scholar] [CrossRef]
  66. Corlay, S. Properties of the Ornstein-Uhlenbeck bridge. arXiv 2013, arXiv:1310.5617. [Google Scholar]
  67. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  68. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  69. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  70. Lingenfelter, B.; Davis, S.R.; Hand, E.M. A quantitative analysis of labeling issues in the CelebA dataset. In Proceedings of the Advances in Visual Computing: 17th International Symposium, ISVC 2022, San Diego, CA, USA, 3–5 October 2022; Proceedings, Part I. pp. 129–141. [Google Scholar] [CrossRef]
Figure 1. Graphical intuition for our results: nonlinear filtering (left) and generative modeling (right).
Figure 2. Versions of an image corrupted by different values of noise for different times  τ .
Figure 3. Mutual information, entropy across forked generative pathways, and probing results as functions of  τ .
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
