A Representation of the Relative Entropy with Respect to a Diffusion Process in Terms of Its Infinitesimal Generator

In this paper we derive an integral (with respect to time) representation of the relative entropy (or Kullback–Leibler divergence) R(µ||P), where µ and P are measures on C([0, T]; R^d). The underlying measure P is a weak solution to a martingale problem with continuous coefficients. Our representation is in the form of an integral with respect to its infinitesimal generator. This representation is of use in statistical inference (particularly involving medical imaging). Since R(µ||P) governs the exponential rate of convergence of the empirical measure (according to Sanov's theorem), this representation is also of use in the numerical and analytical investigation of finite-size effects in systems of interacting diffusions.


Introduction
In this paper we derive an integral representation of the relative entropy R(µ||P), where µ is a measure on C([0, T]; R^d) and P governs the solution to a stochastic differential equation (SDE). The relative entropy is used to quantify the distance between two measures. It has considerable applications in statistics, imaging, information theory and communications. It has been used in the long-time analysis of Fokker-Planck equations [1,2], the analysis of dynamical systems [3] and the analysis of spectral density functions [4]. It has been used in financial mathematics to quantify the difference between martingale measures [5,6]. It has also been shown in [7] that the existence problem for the minimal relative entropy martingale measure of birth and death processes can be reduced to the problem of solving the Hamilton-Jacobi-Bellman equation; furthermore, the minimal entropy martingale measures (MEMMs) for geometric Lévy processes are investigated in [8]. The finiteness of R(µ||P) has been shown to be equivalent to the invertibility of certain shifts on Wiener space, when P is the Wiener measure [9,10]. However, one of the most frequent uses of the relative entropy is in statistical inference (particularly in medical imaging) [11,12]. For example, in data fitting, it is a standard technique to select the parameters that minimise the relative entropy of two conditional probability distributions [13].
Modelling in medical imaging increasingly involves diffusion processes with state space C([0, T]; R^d), for which the expression R(µ||P) = E^µ[log(dµ/dP)] or the variational definition in Definition 1 may not always be tractable. Furthermore, it is not always clear that one may simply approximate the relative entropy by successively calculating it for the marginals over increasingly fine time-discretisations, since these expressions may asymptotically approach infinity (see (4) below).
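The divergence of the discretised entropies can be seen in a toy case. The following is a minimal sketch under assumptions that are not taken from the paper: d = 1, P is the law of a standard Wiener process W, and µ is the law of sqrt(c)·W for a hypothetical constant c ≠ 1. The increments over a uniform partition are then independent centred Gaussians under both laws, so the relative entropy of the marginals is the sum of the entropies of the increments.

```python
import math

def kl_centred_gaussians(var_mu, var_p):
    """Relative entropy R(N(0, var_mu) || N(0, var_p)) in one dimension."""
    ratio = var_mu / var_p
    return 0.5 * (ratio - 1.0 - math.log(ratio))

def marginal_relative_entropy(m, c=2.0, horizon=1.0):
    """Relative entropy of the marginals over a uniform partition with m steps.

    Assumed toy setting: mu is the law of sqrt(c) * W and P the law of W.
    The map from increments to (X_{t_1}, ..., X_{t_m}) is a linear bijection,
    so the entropy of the marginals equals that of the increments, and the
    increments are independent, so their entropies add.
    """
    dt = horizon / m
    return m * kl_centred_gaussians(c * dt, dt)

# Each increment contributes the same amount, whatever the step size,
# so the total grows linearly in the number of partition points.
entropies = [marginal_relative_entropy(m) for m in (1, 10, 100, 1000)]
per_step = 0.5 * (2.0 - 1.0 - math.log(2.0))
```

In this toy case the two path laws are mutually singular, which is consistent with the discretised entropies growing without bound as the partition is refined.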
Another very important application of the relative entropy is in the field of Large Deviations. Sanov's theorem dictates that the empirical measure induced by independent samples governed by the same probability law P converges towards its limit exponentially fast, and the constant governing the rate of convergence is the relative entropy [14]. Large Deviations have been applied, for example, to spin glasses [15], neural networks [16][17][18] and mean-field models of interacting particles [19,20]. In the mean-field theory of neuroscience in particular, there has been a recent interest in the modelling of "finite size effects" [18,21], that is, the deviations from the limiting behaviour for a population of a particular size. Large Deviations provides a mathematically rigorous tool to do this. In this setting, the limiting system is typically the law P of a stochastic process, and therefore the exponential rate at which the probability of the empirical measure being "near" some measure µ decays is given by the relative entropy R(µ||P). However, the numerical calculation of R(µ||P) is not straightforward: the results of this paper provide an alternative characterization of R(µ||P), which assists in this calculation.
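The exponential rate in Sanov's theorem can be illustrated in the simplest possible setting. The sketch below is an assumption-laden toy, not the process-level setting of this paper: the samples are independent Bernoulli(p) variables, and we compute the exact probability that the empirical frequency of ones equals a target q. The function names and the values of p and q are hypothetical.

```python
import math

def kl_bernoulli(q, p):
    """Relative entropy R(Bernoulli(q) || Bernoulli(p))."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def log_binomial_pmf(n, k, p):
    """Logarithm of P(S_n = k) for S_n ~ Binomial(n, p), via log-gamma."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)

def empirical_rate(n, q, p):
    """-(1/n) log P(empirical frequency of ones equals round(q*n)/n)."""
    k = round(q * n)
    return -log_binomial_pmf(n, k, p) / n

p, q = 0.5, 0.7
target = kl_bernoulli(q, p)
# The gap to the relative entropy shrinks as the sample size grows.
errors = [abs(empirical_rate(n, q, p) - target) for n in (200, 2000, 20000)]
```

The remaining gap is a sub-exponential correction of order log(n)/n, which is why the relative entropy alone determines the exponential rate.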
For example, the rate function for the Large Deviation Principle of the interacting particle model of [20] is directly in terms of the relative entropy between two measures on the space of continuous functions (see in particular Theorem 5.2 of that paper). Similarly, the rate function in [18] (Theorem 10) may be expressed as a function of the relative entropy. In more detail, the rate function J in [18] (Theorem 10) is of the form J(µ) = lim_{n→∞} |V_n|^{−1} R(µ^{V_n}||Ξ^{V_n}). Here Ξ is the law of the process in [18] (Equation (31)), i.e., the law of a Z^d-indexed stochastic process, and µ^{V_n} and Ξ^{V_n} are the marginals over the finite hypercube V_n of side length (2n + 1). The results of this paper give a means of evaluating R(µ^{V_n}||Ξ^{V_n}) and therefore J(µ).
In this paper we derive a specific integral (with respect to time) representation of the relative entropy R(µ||P) when P is the law of a diffusion process. The representation is in terms of the infinitesimal generator of P. This P is the same as in [22] (Section 4). The representation makes use of regular conditional probabilities. We expect that in some circumstances it ought to be more tractable than the standard definition in Definition 1, and thus it might be of practical use in the applications listed above.

Outline of Main Result
Let T be the Banach space C([0, T]; R^d) equipped with the norm ‖X‖ = sup_{s∈[0,T]} |X_s|, where |•| is the standard Euclidean norm over R^d. We let (F_t) be the canonical filtration over (T, B(T)).
For some topological space X, we let B(X) be the Borelian σ-algebra and M(X) the space of all probability measures on (X, B(X)). Unless otherwise indicated, we endow M(X) with the topology of weak convergence. Let σ = {t_1, t_2, ..., t_m} be a finite set of elements such that t_1 ≥ 0, t_m ≤ T and t_j < t_{j+1}. We term σ a partition, and denote the set of all such partitions by J. The set of all partitions of the above form such that t_1 = 0 and t_m = T is denoted J_*. We define |σ| = sup_{1≤j≤m−1} {t_{j+1} − t_j}. For some t ∈ [0, T] and σ ∈ J_*, we define σ(t) = sup{s ∈ σ | s ≤ t}. The following definition of relative entropy is standard.
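Definition 1. Let (Ω, H) be a measurable space, and µ, ν probability measures on it. The relative entropy is R_H(µ||ν) = sup_{f∈E} { E^µ[f] − log E^ν[exp(f)] } ∈ R ∪ {∞}, where E is the set of all bounded H-measurable functions. If the σ-algebra is clear from the context, we omit the H and write R(µ||ν). If Ω is Polish and H = B(Ω), then we only need to take the supremum over the set of all continuous bounded functions.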
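The variational characterisation of the relative entropy as a supremum over bounded functions can be checked by hand on a finite space. The sketch below assumes the standard Donsker–Varadhan form sup_f { E^µ[f] − log E^ν[exp(f)] }; the two probability vectors are hypothetical. On a finite space where ν charges every point, the supremum is attained at f = log(dµ/dν), and every other bounded f gives a value no larger than the relative entropy.

```python
import math
import random

def relative_entropy(mu, nu):
    """R(mu || nu) = sum_i mu_i log(mu_i / nu_i) on a finite space."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def variational_objective(f, mu, nu):
    """E^mu[f] - log E^nu[exp(f)] for a function f given as a list of values."""
    mean_f = sum(m * value for m, value in zip(mu, f))
    log_mgf = math.log(sum(n * math.exp(value) for n, value in zip(nu, f)))
    return mean_f - log_mgf

# Hypothetical probability vectors on a three-point space.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
entropy = relative_entropy(mu, nu)

# The maximiser is the log-density of mu with respect to nu.
f_star = [math.log(m / n) for m, n in zip(mu, nu)]
value_at_maximiser = variational_objective(f_star, mu, nu)

# Randomly drawn bounded test functions never exceed the relative entropy.
rng = random.Random(0)
random_values = [
    variational_objective([rng.uniform(-3.0, 3.0) for _ in mu], mu, nu)
    for _ in range(1000)
]
```

On path space the log-density is rarely available in closed form, which is part of the motivation for the alternative representation derived in this paper.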
Let P ∈ M(T) be the following law governing a Markov-Feller diffusion process on T. Stipulate P to be a weak solution (with respect to the canonical filtration) of the local martingale problem with infinitesimal generator , the space of twice continuously differentiable functions. The initial condition , and the matrix a(t, x) is strictly positive definite for all t and x. Here P is assumed to be the unique weak solution. We note that the above infinitesimal generator is the same as in [22] (p. 269) (note particularly its Remark 4.4). We note that P is the law of the solution Y = (Y^j) to the following stochastic differential equation: for j ∈ [1, d], Here (W^k) are independent Wiener processes. Our major result is the following. Let µ ∈ M(T) govern a random variable X ∈ T. For some x ∈ T, we denote by µ_{|[0,s],x} the regular conditional probability (rcp) given X_r = x_r for all r ∈ [0, s]. The marginal of µ_{|[0,s],x} at some time t ≥ s is denoted µ_{t|[0,s],x}.
Our paper has the following format. In Section 3 we make some preliminary definitions, defining the process P against which the relative entropy is taken in this paper. In Section 4 we employ the projective limits approach of [22] to obtain the chief result of this paper: Theorem 1. This gives an explicit integral representation of the relative entropy. In Section 5 we apply the result in Theorem 1 to various corollaries, including the particular case when µ is the solution of a martingale problem. We finish by comparing our results to those of [19] and [20].

Preliminaries
We outline some necessary definitions. For σ ∈ J of the form σ = {t_1, t_2, ..., t_m}, let σ_{;j} = {t_1, ..., t_j}. We denote the number of elements in a partition σ by m(σ). We let J_s be the set of all partitions lying in [0, s]. For 0 < s < t ≤ T, we let J_{s;t} be the set of all partitions of the form σ ∪ t, where σ ∈ J_s.
Let π_σ : T → T_σ := R^{d×m(σ)} be the natural projection, i.e., such that π_σ(x) = (x_{t_1}, ..., x_{t_{m(σ)}}). We similarly define the natural projection π_{αγ} : T_γ → T_α (for α ⊆ γ ∈ J), and we define π The expectation of some measurable function f with respect to a measure µ is written as . We define F_{s;t} to be the σ-algebra generated by F_s and F_γ (where γ = [t]). For µ ∈ M(T), we denote its image laws by T, the rcp given that X_u = x_u for all 0 ≤ u ≤ s is written as µ_{|[0,s],x}. The rcp given that X_u = x_u for all u ≤ s, and X_t = z, is written as µ_{|s,x;t,z}. For σ ∈ J_s and z ∈ (R^d)^{m(σ)}, the rcp given that X_u = z_u for all u ∈ σ is written as µ_{|σ,z}. All of these measures are considered to be in M(C([s, T]; R^d)) (unless indicated otherwise in particular circumstances). The probability laws governing X_t (for t ≥ s), for each of these, are respectively µ_{t|s,z}, µ_{t|[0,s],x} and µ_{t|σ,z}. We clearly have µ_{s|s,z} = δ_z, for µ_s a.e. z, and similarly for the others.
REMARK. See [23] (Definition 5.3.16) for a definition of a rcp. Technically, if we let µ*_{|s,z} be the rcp given . By [23] (Theorem 3.18), µ_{|s,z} is well-defined for µ_s a.e. z. Similar comments apply to the other rcp's defined above.

The Relative Entropy R(•||P ) Using Projective Limits
In this section we derive an integral representation of the relative entropy R(µ||P), for arbitrary µ ∈ M(T). We start with the standard result in Theorem 2, before adapting the projective limits approach of [22] to obtain the central result (Theorem 1).
We begin with a standard decomposition result for the relative entropy [24].
Lemma 1. Let X be a Polish space with sub σ-algebras G ⊆ F ⊆ B(X). Let µ and ν be probability measures on (X, F), and their regular conditional probabilities over G be (respectively)
The following Theorem is a straightforward consequence of [25] (Theorem 6.6): we provide an alternative proof using the theory of Large Deviations in Section 6.
It suffices for the suprema in (4) to take σ ⊂ Q_{s,t}, where Q_{s,t} is any countable dense subset of [s, t].
Thus we may assume that there exists a sequence σ We now provide a technical lemma.
Proof. The first statement is immediate from Definition 1 and the Markovian nature of P. For the second statement, it suffices to prove this in the case that α = σ ∪ u, for some u < s. We note that, using a property of regular conditional probabilities, for µ_σ a.e. x, where v(x, ω) ∈ T_α, v(x, ω)_u = ω, v(x, ω)_r = x_r for all r ∈ σ.
We consider A to be the set of all finite disjoint partitions a ⊂ B(R^d) of R^d. The expression for the entropy in [26] (Lemma 1.4.3) yields R_t(µ_{t|σ,x}||P_{t|s,x_s}) = sup_{a∈A} Σ_{A∈a} µ_{t|σ,x}(A) log( µ_{t|σ,x}(A) / P_{t|s,x_s}(A) ). Here the summand is considered to be zero if µ_{t|σ,x}(A) = 0, and infinite if µ_{t|σ,x}(A) > 0 and P_{t|s,x_s}(A) = 0. Making use of (7), we find that We note that, for µ_α a.e. z, if µ_{t|σ,π_{σα}z}(A) = 0 in this last expression, then µ_{t|α,z}(A) = 0 and we consider the summand to be zero. To complete the proof of the lemma, it is thus sufficient to prove that for µ_α a.e. z
sup_{a∈A} Σ_{A∈a} µ_{t|α,z}(A) log( µ_{t|α,z}(A) / P_{t|s,z_s}(A) ) ≥ sup_{a∈A} Σ_{A∈a} µ_{t|α,z}(A) log( µ_{t|σ,π_{σα}z}(A) / P_{t|s,z_s}(A) ).
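However, in turn, the above inequality will be true if we can prove that, for each partition a such that P_{t|s,z_s}(A) > 0 and µ_{t|σ,π_{σα}z}(A) > 0 for all A ∈ a,
Σ_{A∈a} µ_{t|α,z}(A) log( µ_{t|α,z}(A) / P_{t|s,z_s}(A) ) − Σ_{A∈a} µ_{t|α,z}(A) log( µ_{t|σ,π_{σα}z}(A) / P_{t|s,z_s}(A) ) ≥ 0.
The left hand side is equal to Σ_{A∈a} µ_{t|α,z}(A) log( µ_{t|α,z}(A) / µ_{t|σ,π_{σα}z}(A) ). An application of Jensen's inequality demonstrates that this is greater than or equal to zero.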
REMARK. If, contrary to the definition, we briefly consider µ_{|[0,t],x} to be a probability measure on T, such that µ(A) = 1 where A is the set of all points y such that y_s = x_s for all s ≤ t, then it may be seen from the definition of R that We have also made use of the Markov property of P. This is why our convention, to which we now return, is to consider µ_{|[0,t],x} to be a probability measure on (C([t, T]; R^d), F_{t,T}).
This leads us to the following expressions for R(µ||P ).
where in this last expression 0 ≤ s < t ≤ T .
Proof. Consider the sub σ-algebra F_{0,t_{m(σ)−1}}. We then find, through an application of Lemma 1 and (8), that We may continue inductively to obtain the first identity. We use Theorem 2 to prove the second identity. It suffices to take the supremum over J_*, because However, this also follows from repeated application of Lemma 1. To prove the third identity, we firstly note that The proof of this is entirely analogous to that of the second identity, except that it makes use of (5) instead of (4). However, after another application of Lemma 1, we also have that On equating these two different expressions for R_{F_{s;t}}(µ||P), we obtain . Such a sequence exists by (4). Similarly, let (γ^{(k)}) ⊆ J_s be a sequence such that E^{µ_{γ^{(k)}}(x)}[R_t(µ_{t|γ^{(k)},x}||P_{t|s,x_s})] is non-decreasing and, as k → ∞, asymptotically approaches sup_{σ∈J_s} E^{µ_σ(x)}[R_t(µ_{t|σ,x}||P_{t|s,x_s})]. Lemma 2 dictates that asymptotically approaches the same limit as well. Clearly because of the identity at the start of Theorem 2. This yields the third identity.
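The decomposition used repeatedly in the proof above (Lemma 1) reduces, on a finite space, to the familiar chain rule for the relative entropy: the entropy of a joint law equals the entropy of the first marginal plus the expected entropy of the conditional laws. The following sketch checks this identity numerically for two hypothetical joint laws on a two-step, two-state space; it is an illustration under these assumptions, not the path-space argument itself.

```python
import math

def relative_entropy(mu, nu):
    """R(mu || nu) on a finite space, with the convention 0 log 0 = 0."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def first_marginal(joint):
    """Law of the first coordinate of a joint law given as joint[x][y]."""
    return [sum(row) for row in joint]

def conditional(joint, x):
    """Law of the second coordinate given that the first equals x."""
    total = sum(joint[x])
    return [value / total for value in joint[x]]

def flatten(joint):
    return [value for row in joint for value in row]

# Hypothetical joint laws of (X_1, X_2) on {0, 1} x {0, 1}.
mu = [[0.30, 0.20], [0.10, 0.40]]
nu = [[0.25, 0.25], [0.25, 0.25]]

joint_entropy = relative_entropy(flatten(mu), flatten(nu))
marginal_entropy = relative_entropy(first_marginal(mu), first_marginal(nu))
# Expectation, under the first marginal of mu, of the conditional entropies.
expected_conditional_entropy = sum(
    first_marginal(mu)[x] * relative_entropy(conditional(mu, x), conditional(nu, x))
    for x in range(2)
)
```

Iterating the same identity over the successive points of a partition gives a sum of expected conditional entropies, one term per partition interval.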

Proof of Theorem 1
In this section we work towards the proof of Theorem 1, making use of some results in [22].However, we first require some more definitions.
If K ⊂ R^d is compact, let D_K be the set of all f ∈ D whose support is contained in K. The corresponding space of real distributions is D′, and we denote the action of θ ∈ D′ by ⟨θ, f⟩. If ) denote the set of all continuous functions, possessing continuous spatial derivatives of first and second order, a continuous time derivative of first order, and of compact support. For f ∈ D and t ∈ [0, T], we define the random variable j=1 a^{ij}(t, y) ∂f/∂y^j (for x ∈ T, we may also understand ∇_t f(x) := ∇_t f(x_t)). Let a_{ij} be the components of the matrix inverse of a^{ij}. For random variables X, Y : T → R^d, we define the ). Let M be the space of all continuous maps [0, T] → M(R^d), equipped with the topology of uniform convergence. For s ∈ [0, T], ϑ ∈ M and ν ∈ M(R^d) we define n(s, ϑ, ν) ≥ 0 and such that This definition is taken from [22] (Equation (4.7)); we note that n is convex in ϑ. For γ ∈ M(T), we may naturally write n(s, γ, ν) := n(s, ω, ν), where ω is the projection of γ onto M, i.e., ω(s) = γ_s. It is shown in [22] that this projection is continuous. The following two definitions, lemma and two propositions are all taken (with some small modifications) from [22].
Definition 2. Let I be an interval of the real line. A measure µ ∈ M(T) is called absolutely continuous if for each compact set K ⊂ R^d there exists a neighbourhood U_K of 0 in D_K and an absolutely continuous function We denote by L^2_{s,t,∇}(ν) the closure in L^2_{s,t}(ν) of the linear subset generated by maps of the form (x, u) → ∇_u f, where f ∈ C^{2,1}_0([s, t], R^d). We note that functions in L^2_{s,t,∇}(ν) only need to be defined du ⊗ ν_u(dx) almost everywhere.
Proof. Fix a partition σ = {t_1, ..., t_m}. We may conclude from (9) and (17) that The integrand on the right hand side is measurable with respect to E^{µ_{[0,t_j]}(x)} due to the equivalent expression (14). We may infer from (18) that This last step follows by noting that if ν ∈ M(R^d), and f ∈ C_b(R^d), and the expectation of f with respect to ν is finite, then there exists a series In turn, for each n there exist (f This allows us to conclude that the two suprema are the same. The last expression in (20) is merely By (11), this is greater than or equal to E^{µ_{σ;j}(y)}[R_{t_{j+1}}(µ_{t_{j+1}|σ;j,y}||P_{t_{j+1}|t_j,y_{t_j}})].
We thus obtain the theorem using (10).

Some Corollaries
We state some corollaries of Theorem 1. In the course of this section we make progressively stronger assumptions on the nature of µ, culminating in the elegant expression for R(µ||P) when µ is a solution of a martingale problem. We finish by comparing our work with that of [19,20].
Corollary 1. Suppose that µ ∈ M(T) and R(µ||P) < ∞. Then for all s and µ a.e. x, µ_{|[0,s],x} is absolutely continuous over [s, T]. For each s ∈ [0, T] and µ a.e. x ∈ T, for Lebesgue a.e. t ≥ s where for some Furthermore, For any dense countable subset Q_{0,T} of [0, T], there exists a series of partitions σ
REMARK. It is not immediately clear that we may simplify (23) further (barring further assumptions). The reason for this is that we only know that is measurable (as a function of w), but it has not been proven that h^µ_{σ(t),w}(t, z) is measurable (as a function of w).
Proof. Let σ = {0 = t_1, ..., t_m = T} be an arbitrary partition. For all j < m, we find from Lemma We thus find that, for all such x, µ_{|[0,t_j],x} is absolutely continuous on [t_j, t_{j+1}] from Proposition 2. We are then able to obtain (21) and (22) from Propositions 1 and 2. From (2), (16) and (21) we find that The above integral must be finite (since we are assuming R(µ||P) is finite). Furthermore is (t, x) measurable as a consequence of the equivalent form (14). This allows us to apply Fubini's theorem to obtain (23). The last statement on the sequence of maximising partitions follows from Theorem 2.
Corollary 2. Suppose that R(µ||P) < ∞. Suppose that for all s ∈ Q_{0,T} (any countable, dense subset of [0, T]), for µ a.e. x and Lebesgue a.e. t, h^µ_{s,x}(t, x_t) = E^{µ_{|[0,s],x;t,x_t}(w)}[h^µ(t, w)] for some progressively measurable random variable
Proof. Let G_{s,x;t,y} be the sub σ-algebra consisting of all B ∈ B(T) such that for all w ∈ B, w_r = x_r for all r ≤ s and w_t = y. Thus h^µ_{s,x}(t, y) = E^{µ_{|[0,s],x;t,y}(w)}[h^µ(t, w)] = E^µ[h^µ(t, •)|G_{s,x;t,y}]. By [27] (Corollary 2.4), since ∩_{s<t} G_{s,x;t,x_t} = G_{t,x;t,x_t} (restricting to s ∈ Q_{0,T}), for µ a.e. x, where s ∈ Q_{0,T}. By the properties of the regular conditional probability, we find from (24) that By assumption, the above limit is finite. Thus by Fatou's lemma, and using the properties of the regular conditional probability, dt.
Through use of (26), Conversely, through an application of Jensen's inequality to (27) A property of the regular conditional probability yields
REMARK. The condition in the above corollary is satisfied when µ is a solution to a martingale problem (see Lemma 5).
We may further simplify the expression in Theorem 1 when µ is a solution to the following martingale problem. Let {c_{jk}, e_j} be progressively measurable functions [0, T] × T → R. We suppose that c_{jk} = c_{kj}. For all 1 ≤ j, k ≤ d, c_{jk}(t, x) and e_j(t, x) are assumed to be bounded for x ∈ L (where L is compact) and all t ∈ [0, T]. For f ∈ C^2_0(R^d) and x ∈ T, let We assume that for all such f, the following is a continuous martingale (relative to the canonical filtration) under µ f The law governing X_0 is stipulated to be ν ∈ M(R^d).
From now on we switch from our earlier convention and we consider µ_{|[0,s],x} to be a measure on T such that, for µ a.e. x ∈ T, µ_{|[0,s],x}(A_{s,x}) = 1, where A_{s,x} is the set of all X ∈ T satisfying X_t = x_t for all 0 ≤ t ≤ s. This is a property of a regular conditional probability (see Theorem 3.18 in [23]). Similarly, µ_{|s,x;t,y} is considered to be a measure on T such that for µ a.e. x ∈ T, µ_{|s,x;t,y}(B_{s,x;t,y}) = 1, where B_{s,x;t,y} is the set of all X ∈ A_{s,x} such that X_t = y. We may apply Fubini's Theorem (since f is compactly supported and bounded) to (28) to find that This ensures that µ_{|[0,s]} is absolutely continuous over [s, T], and that
Proof. It follows from R(µ||P) < ∞, (21) and (22) that for all s and µ a.e. x, for Lebesgue a.e. t ≥ s Let us take a countable dense subset Q_{0,T} of [0, T]. There thus exists a null set N ⊆ [0, T] such that for every s ∈ Q_{0,T}, µ a.e. x and every t ∉ N the above equation holds. We may therefore conclude (30) using [27] (Corollary 2.4) and taking s → t^−. From (29), we observe that for all s ∈ [0, T] and µ a.e.
We have already noted in the introduction that one may infer a variational representation of the relative entropy from [19,20] by assuming that the coefficients of the underlying stochastic process are independent of the empirical measure in these papers. The assumptions in [20] on the underlying process P are in some respects more general and in others more restrictive than ours. They are more general insofar as the coefficients of the SDE may depend on the past history of the process and the diffusion coefficient is allowed to be degenerate. However, our assumptions are more general insofar as we only require P to be the unique (in the sense of probability law) weak solution of the SDE, whereas [20] requires P to be the unique strong solution of the SDE. Of course, when both sets of assumptions are satisfied, one may infer that the expressions for the relative entropy are identical.

Proof of Theorem 2
The following is an alternative proof to that of [25] (Theorem 6.6) employing the theory of Large Deviations. The fact that, if α ⊆ σ, then R_α(µ||P) ≤ R_σ(µ||P), follows from Lemma 1. We prove the first expression (4) in the case s = 0, t = T (the proof of the second identity (5) is analogous).
and for all closed sets F, lim sup_{N→∞} N^{−1} log Γ^N(F) ≤ −inf_{x∈F} I(x). If furthermore the set {x : I(x) ≤ α} is compact for all α ≥ 0, we say that I is a good rate function.
We define the following empirical measures.
We now define the projective limit M(T). If α, γ ∈ J, α ⊂ γ, then we may define the projection π We may continuously embed M(T) into the projective limit M(T) of its marginals, letting ι denote this embedding. That is, for any σ ∈ J, (ι(µ))(σ) = µ_σ. We note that ι is continuous because ι^{−1}(A_{γ,O}) is open in M(T), for all A_{γ,O} of the form in (33). We equip M(T) with the Borelian σ-algebra generated by this topology. The embedding ι is measurable with respect to this σ-algebra because the topology of M(T) has a countable base. The embedding induces the image laws (Π^N • ι^{−1}) on M(M(T)). For σ ∈ J, it may be seen that , where π^M_σ(⊗_α µ(α)) = µ(σ). It follows from [22] (Thm 3.3) that Π^N • ι^{−1} satisfies a Large Deviation Principle with rate function sup_{σ∈J} R_σ(µ||P). However, we note that ι is one-to-one, because any two measures µ, ν ∈ M(T) such that µ_σ = ν_σ for all σ ∈ J must be equal. Furthermore, ι is continuous. Because of Sanov's theorem, (Π^N) is exponentially tight (see Defn 1.2.17, Exercise 1.2.19 in [14] for a definition of exponential tightness and proof of this statement). These facts mean that we may apply the inverse contraction principle [14] (Thm 4.2.4) to infer that Π^N satisfies a Large Deviation Principle with the rate function sup_{σ∈J} R_σ(µ||P). Since rate functions are unique [14] (Lemma 4.1.4), we obtain the first identity in conjunction with Sanov's theorem. The second identity (5) follows similarly. We may repeat the argument above, while restricting to σ ⊂ Q_{s,t}. We obtain the same conclusion because the σ-algebra generated by (F_σ)_{σ⊂Q_{s,t}} is the same as F_{s,t}. The last identity follows from the fact that, if α ⊆ σ, then R_α(µ||P) ≤ R_σ(µ||P).

Definition 4. A series of probability laws Γ^N on some topological space Ω equipped with its Borelian σ-algebra is said to satisfy a strong Large Deviation Principle with rate function I : Ω → R if for all open sets O, lim inf_{N→∞} N^{−1} log Γ^N(O) ≥ −inf_{x∈O} I(x)
π^M_{αγ} : M(T_γ) → M(T_α) as π^M_{αγ}(ξ) := ξ • π^{−1}_{αγ}. An element of M(T) is then a member ⊗_σ ζ(σ) of the Cartesian product ⊗_{σ∈J} M(T_σ) satisfying the consistency condition π^M_{αγ}(ζ(γ)) = ζ(α) for all α ⊂ γ. The topology on M(T) is the minimal topology necessary for the natural projection M(T) → M(T_α) to be continuous for all α ∈ J. That is, it is generated by open sets of the form
A_{γ,O} = {⊗_σ ζ(σ) ∈ M(T) : ζ(γ) ∈ O}, (33)
for some γ ∈ J and open O (with respect to the weak topology of M(T_γ)).
Lemma 4. [22] (Lemma 4.2) If µ is absolutely continuous over an interval I, then its derivative exists (in the distributional sense) for Lebesgue a.e. t ∈ I. That is, for Lebesgue a.e. t ∈ I, there exists µ̇_t ∈ D′ such that for all f ∈ D
for all u, v ∈ I and f ∈ U_K.