Entropy 2014, 16(12), 6705-6721; doi:10.3390/e16126705

Article
A Representation of the Relative Entropy with Respect to a Diffusion Process in Terms of Its Infinitesimal Generator
Olivier Faugeras * and James MacLaurin
INRIA Sophia Antipolis Mediterannee, 2004 Route Des Lucioles, Sophia Antipolis, France
External Editor: Kevin H. Knuth
* Author to whom correspondence should be addressed.
Received: 23 October 2014; in revised form: 17 December 2014 / Accepted: 18 December 2014 / Published: 22 December 2014

Abstract

In this paper we derive an integral (with respect to time) representation of the relative entropy (or Kullback–Leibler divergence) R(μ||P), where μ and P are measures on $C([0,T];\mathbb{R}^d)$. The underlying measure P is a weak solution to a martingale problem with continuous coefficients. Our representation is in the form of an integral with respect to its infinitesimal generator. This representation is of use in statistical inference (particularly involving medical imaging). Since R(μ||P) governs the exponential rate of convergence of the empirical measure (according to Sanov’s theorem), this representation is also of use in the numerical and analytical investigation of finite-size effects in systems of interacting diffusions.
Keywords: relative entropy; Kullback–Leibler; diffusion; martingale formulation

1. Introduction

In this paper we derive an integral representation of the relative entropy R(μ||P), where μ is a measure on $C([0,T];\mathbb{R}^d)$ and P governs the solution to a stochastic differential equation (SDE). The relative entropy is used to quantify the distance between two measures. It has considerable applications in statistics, imaging, information theory and communications. It has been used in the long-time analysis of Fokker–Planck equations [1,2], the analysis of dynamical systems [3] and the analysis of spectral density functions [4]. It has been used in financial mathematics to quantify the difference between martingale measures [5,6]. It has also been shown in [7] that the existence problem for the minimal relative entropy martingale measure of birth and death processes can be reduced to the problem of solving a Hamilton–Jacobi–Bellman equation; furthermore, the minimal entropy martingale measures (MEMMs) for geometric Lévy processes are investigated in [8]. The finiteness of R(μ||P) has been shown to be equivalent to the invertibility of certain shifts on Wiener space, when P is the Wiener measure [9,10]. However, one of the most frequent uses of the relative entropy is in statistical inference (particularly in medical imaging) [11,12]. For example, in data fitting, it is a standard technique to select the parameters that minimise the relative entropy of two conditional probability distributions [13]. Modelling in medical imaging increasingly involves diffusion processes with state space $C([0,T];\mathbb{R}^d)$, for which the expression $R(\mu\|P)=E^{\mu}\left[\log\frac{d\mu}{dP}\right]$ or the variational definition in Definition 1 may not always be tractable. Furthermore, it is not always clear that one may simply approximate the relative entropy by successively calculating it for the marginals over increasingly fine time-discretisations, since these expressions may asymptotically approach infinity (see (4) below).
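The remark about time-discretisations can be illustrated with a small Python sketch. It is not taken from the paper: it assumes, purely for illustration, that both measures are laws of scaled Brownian motions started at 0 with different diffusion coefficients (`sigma_mu` and `sigma_p` are hypothetical names). The increments over a uniform grid are then independent centred Gaussians, so the relative entropy of the marginals is a sum of identical Gaussian terms and grows linearly with the number of grid points.

```python
import math

def gaussian_kl(var_mu, var_p):
    """R(N(0, var_mu) || N(0, var_p)) for centred one-dimensional Gaussians."""
    ratio = var_mu / var_p
    return 0.5 * (ratio - 1.0 - math.log(ratio))

def marginal_entropy(m, T, sigma_mu, sigma_p):
    """Relative entropy of the marginals over m equal steps on [0, T] for two
    scaled Brownian motions started at 0 (illustrative assumption).
    The increments are independent, so the entropy is a sum over the steps."""
    dt = T / m
    return m * gaussian_kl(sigma_mu ** 2 * dt, sigma_p ** 2 * dt)

# The per-step term does not depend on dt, so refining the grid makes the
# marginal entropy diverge whenever the diffusion coefficients differ.
for m in (10, 100, 1000):
    print(m, marginal_entropy(m, 1.0, 1.0, 2.0))
```

In this toy case the divergence reflects the mutual singularity of the two path measures; when the diffusion coefficients agree the function returns zero for every grid.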

Another very important application of the relative entropy is in the field of Large Deviations. Sanov’s theorem dictates that the empirical measure induced by independent samples governed by the same probability law P converges towards its limit exponentially fast, and the constant governing the rate of convergence is the relative entropy [14]. Large Deviations have been applied, for example, to spin glasses [15], neural networks [16–18] and mean-field models of interacting particles [19,20]. In the mean-field theory of neuroscience in particular, there has been a recent interest in the modelling of “finite size effects” [18,21], that is, the deviations from the limiting behaviour for a population of a particular size. Large Deviations provides a mathematically rigorous tool to do this. In this setting, the limiting system is typically the law P of a stochastic process, and therefore the likelihood of the empirical measure of the system being “near” some measure μ is governed, to exponential order, by the relative entropy R(μ||P). However, the numerical calculation of R(μ||P) is not straightforward: the results of this paper provide an alternative characterization of R(μ||P), which assists in this calculation.

For example, the rate function for the Large Deviation Principle of the interacting particle model of [20] is directly in terms of the relative entropy between two measures on the space of continuous functions (see in particular Theorem 5.2 of that paper). Similarly, the rate function in [18] (Theorem 10) may be expressed as a function of the relative entropy. In more detail, the rate function J in [18] (Theorem 10) is of the form $J(\mu)=\lim_{n\to\infty}\frac{1}{|V_n|}R(\mu^{V_n}\|\Xi^{V_n})$. Here Ξ is the law of the process in [18] (Equation (31)), i.e., the law of a $\mathbb{Z}^d$-indexed stochastic process, and $\mu^{V_n}$ and $\Xi^{V_n}$ are the marginals over the finite hypercube $V_n$ of side length (2n + 1). The results of this paper give a means of evaluating $R(\mu^{V_n}\|\Xi^{V_n})$ and therefore $J(\mu)$.

In this paper we derive a specific integral (with respect to time) representation of the relative entropy R(μ||P) when P is the law of a diffusion process. The representation is in terms of the infinitesimal generator of P. This P is the same as in [22] (Section 4). The representation makes use of regular conditional probabilities. We expect that in some circumstances, it ought to be more tractable than the standard definition in Definition 1, and thus it might be of practical use in the applications listed above.

2. Outline of Main Result

Let $\mathcal{T}$ be the Banach space $C([0,T];\mathbb{R}^d)$ equipped with the norm

$$\|X\|=\sup_{s\in[0,T]}|X_s|,$$
where |⋅| is the standard Euclidean norm over $\mathbb{R}^d$. We let $(\mathcal{F}_t)$ be the canonical filtration over $(\mathcal{T},\mathcal{B}(\mathcal{T}))$. For a topological space $\mathcal{X}$, we let $\mathcal{B}(\mathcal{X})$ be the Borel σ-algebra and $\mathcal{M}(\mathcal{X})$ the space of all probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$. Unless otherwise indicated, we endow $\mathcal{M}(\mathcal{X})$ with the topology of weak convergence. Let $\sigma=\{t_1,t_2,\dots,t_m\}$ be a finite set of elements such that $t_1\ge 0$, $t_m\le T$ and $t_j<t_{j+1}$. We term σ a partition, and denote the set of all such partitions by $\mathbf{J}$. The set of all partitions of the above form such that $t_1=0$ and $t_m=T$ is denoted $\mathbf{J}_*$. We define $|\sigma|=\sup_{1\le j\le m-1}(t_{j+1}-t_j)$. For some $t\in[0,T]$ and $\sigma\in\mathbf{J}_*$, we define $\bar\sigma(t)=\sup\{s\in\sigma \mid s\le t\}$. The following definition of relative entropy is standard.

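Definition 1. Let $(\Omega,\mathcal{H})$ be a measurable space, and μ, ν probability measures on it. The relative entropy is

$$R_{\mathcal{H}}(\mu\|\nu)=\sup_{f\in\mathcal{E}}\left\{E^{\mu}[f]-\log E^{\nu}[\exp(f)]\right\}\in\mathbb{R}\cup\{\infty\},$$
where $\mathcal{E}$ is the set of all bounded $\mathcal{H}$-measurable functions. If the σ-algebra is clear from the context, we omit the $\mathcal{H}$ and write R(μ||ν). If Ω is Polish and $\mathcal{H}=\mathcal{B}(\Omega)$, then we only need to take the supremum over the set of all continuous bounded functions.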
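On a finite space the variational definition can be checked directly. The following Python sketch is an illustration only (the two three-point distributions are made up): it compares the functional $E^{\mu}[f]-\log E^{\nu}[\exp(f)]$ with the familiar sum $\sum_i \mu_i\log(\mu_i/\nu_i)$, using the fact that on a finite space with strictly positive weights the supremum is attained at $f=\log(\mu/\nu)$.

```python
import math

def kl_divergence(mu, nu):
    """Relative entropy sum_i mu_i * log(mu_i / nu_i) on a finite space."""
    total = 0.0
    for m, n in zip(mu, nu):
        if m == 0.0:
            continue            # convention: 0 * log 0 = 0
        if n == 0.0:
            return math.inf     # mu is not absolutely continuous w.r.t. nu
        total += m * math.log(m / n)
    return total

def dv_objective(f, mu, nu):
    """Value E^mu[f] - log E^nu[exp(f)] of the functional in the variational definition."""
    mean_f = sum(m * fi for m, fi in zip(mu, f))
    log_mgf = math.log(sum(n * math.exp(fi) for n, fi in zip(nu, f)))
    return mean_f - log_mgf

# Hypothetical example distributions on a three-point space.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]
f_star = [math.log(m / n) for m, n in zip(mu, nu)]   # maximiser on a finite space

print(kl_divergence(mu, nu), dv_objective(f_star, mu, nu))
```

Any other bounded test function, for instance `[0.1, -0.2, 0.3]`, gives a value no larger than the relative entropy.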

Let $P\in\mathcal{M}(\mathcal{T})$ be the following law governing a Markov–Feller diffusion process on $\mathcal{T}$. Stipulate P to be a weak solution (with respect to the canonical filtration) of the local martingale problem with infinitesimal generator

$$L_t(f)=\frac12\sum_{1\le j,k\le d}a^{jk}(t,x)\frac{\partial^2 f}{\partial x^j\,\partial x^k}+\sum_{1\le j\le d}b^j(t,x)\frac{\partial f}{\partial x^j},$$
for f(x) in $C^2(\mathbb{R}^d)$, i.e., the space of twice continuously differentiable functions. The initial condition (governing $P_0$) is $\mu_I\in\mathcal{M}(\mathbb{R}^d)$. The coefficients $a^{jk},b^j:[0,T]\times\mathbb{R}^d\to\mathbb{R}$ are assumed to be continuous (over $[0,T]\times\mathbb{R}^d$), and the matrix a(t, x) is strictly positive definite for all t and x. Here P is assumed to be the unique weak solution. We note that the above infinitesimal generator is the same as in [22] (p. 269) (note particularly its Remark 4.4). We note that P is the law of the solution $Y=(Y^j)$ to the following stochastic differential equation: for $j\in\{1,\dots,d\}$,
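$$dY^j_t=b^j(t,Y_t)\,dt+\sum_{k=1}^{d}\varsigma^{jk}(t,Y_t)\,dW^k_t.$$

Here $(W^k)$ are independent Wiener processes, and $\varsigma(t,x)$ denotes a square root of the matrix $a(t,x)$ (i.e., $\varsigma\varsigma^{\top}=a$), so that the generator of Y is $L_t$.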
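To make the relation between the generator and the SDE concrete, here is a minimal Euler–Maruyama sketch in Python for the one-dimensional case. It is an illustration under stated assumptions, not a scheme used in the paper: the function name `euler_maruyama`, the uniform grid and the example coefficients are all hypothetical, and the noise coefficient is taken to be the square root of the diffusion coefficient `a` appearing in the generator.

```python
import math
import random

def euler_maruyama(b, a, y0, T, n_steps, rng):
    """Simulate one path of a 1-d diffusion with generator
    L_t f = 0.5 * a(t, y) f'' + b(t, y) f' on a uniform time grid."""
    dt = T / n_steps
    path = [y0]
    y = y0
    for k in range(n_steps):
        t = k * dt
        dw = rng.gauss(0.0, math.sqrt(dt))               # Wiener increment
        y = y + b(t, y) * dt + math.sqrt(a(t, y)) * dw   # noise scaled by sqrt(a)
        path.append(y)
    return path

# Example: an Ornstein-Uhlenbeck-type path with b(t, y) = -y and a(t, y) = 1.
rng = random.Random(0)
path = euler_maruyama(lambda t, y: -y, lambda t, y: 1.0,
                      y0=1.0, T=1.0, n_steps=1000, rng=rng)
```

The scheme only approximates the law P in the limit of a fine grid; it is meant to show which coefficient multiplies the Wiener increment.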

Our major result is the following. Let $\mu\in\mathcal{M}(\mathcal{T})$ govern a random variable $X\in\mathcal{T}$. For some $x\in\mathcal{T}$, we denote by $\mu_{|[0,s],x}$ the regular conditional probability (rcp) given $X_r=x_r$ for all $r\in[0,s]$. The marginal of $\mu_{|[0,s],x}$ at some time $t\ge s$ is denoted $\mu_{t|[0,s],x}$.

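Theorem 1. Let $(\sigma^{(m)})_{m\in\mathbb{Z}^+}$ be any sequence of partitions such that $\sigma^{(m)}\subseteq\sigma^{(m+1)}$ and $|\sigma^{(m)}|\to 0$ as m → ∞. For $\mu\in\mathcal{M}(\mathcal{T})$,

$$R(\mu\|P)=R_0(\mu\|P)+\sup_{\sigma\in\mathbf{J}_*}\Gamma(\sigma)=R_0(\mu\|P)+\lim_{m\to\infty}\Gamma(\sigma^{(m)}), \tag{2}$$
where
$$\Gamma(\sigma)=E^{\mu(x)}\left[\int_0^T\sup_{f\in\mathcal{D}}\left\{\frac{\partial}{\partial t}E^{\mu_{t|[0,\bar\sigma(t)],x}}[f]-E^{\mu_{t|[0,\bar\sigma(t)],x}(y)}\left[L_tf(y)+\frac12\sum_{j,k=1}^{d}a^{jk}(t,y)\frac{\partial f}{\partial y^j}\frac{\partial f}{\partial y^k}\right]\right\}dt\right].$$

Here $\mathcal{D}$ is the Schwartz space of compactly supported functions $\mathbb{R}^d\to\mathbb{R}$, possessing continuous derivatives of all orders. If $\frac{\partial}{\partial t}E^{\mu_{t|[0,\bar\sigma(t)],x}}[f]$ does not exist, then we consider it to be ∞.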

Our paper has the following format. In Section 3 we make some preliminary definitions, defining the process P against which the relative entropy is taken in this paper. In Section 4 we employ the projective limits approach of [22] to obtain the chief result of this paper: Theorem 1. This gives an explicit integral representation of the relative entropy. In Section 5 we apply the result in Theorem 1 to various corollaries, including the particular case when μ is the solution of a martingale problem. We finish by comparing our results to those of [19] and [20].

3. Preliminaries

We outline some necessary definitions. For $\sigma\in\mathbf{J}$ of the form $\sigma=\{t_1,t_2,\dots,t_m\}$, let $\sigma_{;j}=\{t_1,\dots,t_j\}$. We denote the number of elements in a partition σ by m(σ). We let $\mathbf{J}_s$ be the set of all partitions lying in [0, s]. For $0<s<t\le T$, we let $\mathbf{J}_{s;t}$ be the set of all partitions of the form $\sigma\cup\{t\}$, where $\sigma\in\mathbf{J}_s$.

Let $\pi_\sigma:\mathcal{T}\to\mathcal{T}_\sigma:=\mathbb{R}^{d\times m(\sigma)}$ be the natural projection, i.e., such that $\pi_\sigma(x)=(x_{t_1},\dots,x_{t_{m(\sigma)}})$. We similarly define the natural projection $\pi_{\alpha\gamma}:\mathcal{T}_\gamma\to\mathcal{T}_\alpha$ (for $\alpha\subseteq\gamma\in\mathbf{J}$), and we define $\pi_{[s,t]}:\mathcal{T}\to C([s,t];\mathbb{R}^d)$ to be the natural restriction of $x\in\mathcal{T}$ to [s, t]. The expectation of some measurable function f with respect to a measure μ is written as $E^{\mu(x)}[f(x)]$, or simply $E^{\mu}[f]$ when the context is clear.

For s < t, we write $\mathcal{F}_{s,t}=\pi_{[s,t]}^{-1}\mathcal{B}(C([s,t];\mathbb{R}^d))$ and $\mathcal{F}_\sigma=\pi_\sigma^{-1}\mathcal{B}(\mathcal{T}_\sigma)$. We define $\mathcal{F}_{s;t}$ to be the σ-algebra generated by $\mathcal{F}_s$ and $\mathcal{F}_\gamma$ (where $\gamma=\{t\}$). For $\mu\in\mathcal{M}(\mathcal{T})$, we denote its image laws by $\mu_\sigma:=\mu\circ\pi_\sigma^{-1}\in\mathcal{M}(\mathcal{T}_\sigma)$ and $\mu_{[s,t]}:=\mu\circ\pi_{[s,t]}^{-1}\in\mathcal{M}(C([s,t];\mathbb{R}^d))$. Let $\mu\in\mathcal{M}(\mathcal{T})$ govern a random variable $X=(X_s)\in\mathcal{T}$. For $z\in\mathbb{R}^d$, the rcp given $X_s=z$ is written as $\mu_{|s,z}$. For $x\in C([0,s];\mathbb{R}^d)$ or $\mathcal{T}$, the rcp given that $X_u=x_u$ for all $0\le u\le s$ is written as $\mu_{|[0,s],x}$. The rcp given that $X_u=x_u$ for all $u\le s$, and $X_t=z$, is written as $\mu_{|s,x;t,z}$. For $\sigma\in\mathbf{J}_s$ and $z\in(\mathbb{R}^d)^{m(\sigma)}$, the rcp given that $X_u=z_u$ for all $u\in\sigma$ is written as $\mu_{|\sigma,z}$. All of these measures are considered to be in $\mathcal{M}(C([s,T];\mathbb{R}^d))$ (unless indicated otherwise in particular circumstances). The probability laws governing $X_t$ (for t ≥ s), for each of these, are respectively $\mu_{t|s,z}$, $\mu_{t|[0,s],x}$ and $\mu_{t|\sigma,z}$. We clearly have $\mu_{s|s,z}=\delta_z$, for $\mu_s$ a.e. z, and similarly for the others.

Remark. See [23] (Definition 5.3.16) for a definition of a rcp. Technically, if we let $\mu^*_{|s,z}$ be the rcp given $X_s=z$ according to this definition, then $\mu_{|s,z}=\mu^*_{|s,z}\circ\pi_{[s,T]}^{-1}$ and $\mu_{t|s,z}=\mu^*_{|s,z}\circ\pi_{\{t\}}^{-1}$. By [23] (Theorem 3.18), $\mu_{|s,z}$ is well-defined for $\mu_s$ a.e. z. Similar comments apply to the other rcp’s defined above.

In the definition of the relative entropy, we abbreviate $R_{\mathcal{F}_\sigma}(\mu\|P)$ by $R_\sigma(\mu\|P)$. If σ = {t}, we write $R_t(\mu\|P)$.

4. The Relative Entropy R(⋅||P ) Using Projective Limits

In this section we derive an integral representation of the relative entropy R(μ||P), for arbitrary $\mu\in\mathcal{M}(\mathcal{T})$. We start with the standard result in Theorem 2, before adapting the projective limits approach of [22] to obtain the central result (Theorem 1).

We begin with a standard decomposition result for the relative entropy [24].

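Lemma 1. Let $\mathcal{X}$ be a Polish space with sub σ-algebras $\mathcal{G}\subseteq\mathcal{F}\subseteq\mathcal{B}(\mathcal{X})$. Let μ and ν be probability measures on $(\mathcal{X},\mathcal{F})$, and let their regular conditional probabilities over $\mathcal{G}$ be (respectively) $\mu_\omega$ and $\nu_\omega$. Then

$$R_{\mathcal{F}}(\mu\|\nu)=R_{\mathcal{G}}(\mu\|\nu)+E^{\mu(\omega)}\left[R_{\mathcal{F}}(\mu_\omega\|\nu_\omega)\right].$$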
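The decomposition can be verified numerically in the simplest possible setting. The Python sketch below is a finite-space illustration only: the joint laws of a pair (U, V) are invented, the coarse σ-algebra plays the role of the one generated by U, and the fine one is generated by (U, V).

```python
import math

def kl(p, q):
    """Relative entropy between two finite distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical joint laws of a pair (U, V) on {0,1} x {0,1,2}; rows index U.
mu = [[0.10, 0.25, 0.15], [0.20, 0.05, 0.25]]
nu = [[0.15, 0.15, 0.10], [0.10, 0.25, 0.25]]

def chain_rule_terms(mu, nu):
    """Return (R_F, R_G, expected conditional entropy) where G is generated
    by U and F by the pair (U, V)."""
    flat_mu = [x for row in mu for x in row]
    flat_nu = [x for row in nu for x in row]
    mu_u = [sum(row) for row in mu]              # marginal of U under mu
    nu_u = [sum(row) for row in nu]              # marginal of U under nu
    cond = 0.0
    for u in range(len(mu)):
        mu_cond = [x / mu_u[u] for x in mu[u]]   # law of V given U = u under mu
        nu_cond = [x / nu_u[u] for x in nu[u]]   # law of V given U = u under nu
        cond += mu_u[u] * kl(mu_cond, nu_cond)
    return kl(flat_mu, flat_nu), kl(mu_u, nu_u), cond

r_f, r_g, r_cond = chain_rule_terms(mu, nu)
print(r_f, r_g + r_cond)   # the two numbers agree up to rounding
```

Since the conditional term is non-negative, the same computation also shows that the entropy over the coarser σ-algebra never exceeds the entropy over the finer one.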

The following Theorem is a straightforward consequence of [25] (Theorem 6.6): we provide an alternative proof using the theory of Large Deviations in Section 6.

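Theorem 2. If $\alpha,\sigma\in\mathbf{J}$ and $\alpha\subseteq\sigma$, then $R_\alpha(\mu\|P)\le R_\sigma(\mu\|P)$. Furthermore,

$$R_{\mathcal{F}_{s,t}}(\mu\|P)=\sup_{\sigma\in\mathbf{J}\cap[s,t]}R_\sigma(\mu\|P), \tag{4}$$
$$R_{\mathcal{F}_{s;t}}(\mu\|P)=\sup_{\sigma\in\mathbf{J}_{s;t}}R_\sigma(\mu\|P). \tag{5}$$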

For the supremum in (4) it suffices to take $\sigma\subset Q_{s,t}$, where $Q_{s,t}$ is any countable dense subset of [s, t]. Thus we may assume that there exists a sequence $\sigma^{(n)}\subset Q_{s,t}$ of partitions such that $\sigma^{(n)}\subseteq\sigma^{(n+1)}$, $|\sigma^{(n)}|\to 0$ as n → ∞ and

$$R_{\mathcal{F}_{s,t}}(\mu\|P)=\lim_{n\to\infty}R_{\sigma^{(n)}}(\mu\|P).$$

We now provide a technical lemma.

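Lemma 2. Let t > s, $\alpha,\sigma\in\mathbf{J}_s$, $\sigma\subset\alpha$ and $s\in\sigma$. Then for $\mu_\sigma$ a.e. x, $R_t(\mu_{|\sigma,x}\|P_{|s,x_s})=R(\mu_{t|\sigma,x}\|P_{t|s,x_s})$. Secondly,

$$E^{\mu_\sigma(x)}\left[R_t(\mu_{|\sigma,x}\|P_{|s,x_s})\right]\le E^{\mu_\alpha(z)}\left[R_t(\mu_{|\alpha,z}\|P_{|s,z_s})\right].$$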

Proof. The first statement is immediate from Definition 1 and the Markovian nature of P. For the second statement, it suffices to prove this in the case that $\alpha=\sigma\cup\{u\}$, for some u < s. We note that, using a property of regular conditional probabilities, for $\mu_\sigma$ a.e. x,

$$\mu_{t|\sigma,x}=E^{\mu_{u|\sigma,x}(\omega)}\left[\mu_{t|\alpha,v(x,\omega)}\right], \tag{7}$$
where $v(x,\omega)\in\mathcal{T}_\alpha$, $v(x,\omega)_u=\omega$, and $v(x,\omega)_r=x_r$ for all $r\in\sigma$.

We consider $\mathcal{A}$ to be the set of all finite disjoint partitions $a\subset\mathcal{B}(\mathbb{R}^d)$ of $\mathbb{R}^d$. The expression for the entropy in [26] (Lemma 1.4.3) yields

$$E^{\mu_\sigma(x)}\left[R(\mu_{t|\sigma,x}\|P_{t|s,x_s})\right]=E^{\mu_\sigma(x)}\left[\sup_{a\in\mathcal{A}}\sum_{A\in a}\mu_{t|\sigma,x}(A)\log\frac{\mu_{t|\sigma,x}(A)}{P_{t|s,x_s}(A)}\right].$$

Here the summand is considered to be zero if $\mu_{t|\sigma,x}(A)=0$, and infinite if $\mu_{t|\sigma,x}(A)>0$ and $P_{t|s,x_s}(A)=0$. Making use of (7), we find that

$$\begin{aligned}
E^{\mu_\sigma(x)}\left[R(\mu_{t|\sigma,x}\|P_{t|s,x_s})\right]
&=E^{\mu_\sigma(x)}\left[\sup_{a\in\mathcal{A}}\sum_{A\in a}E^{\mu_{u|\sigma,x}(\omega)}\left[\mu_{t|\alpha,v(x,\omega)}(A)\right]\log\frac{\mu_{t|\sigma,x}(A)}{P_{t|s,x_s}(A)}\right]\\
&\le E^{\mu_\sigma(x)}E^{\mu_{u|\sigma,x}(\omega)}\left[\sup_{a\in\mathcal{A}}\sum_{A\in a}\mu_{t|\alpha,v(x,\omega)}(A)\log\frac{\mu_{t|\sigma,x}(A)}{P_{t|s,x_s}(A)}\right]\\
&=E^{\mu_\alpha(z)}\left[\sup_{a\in\mathcal{A}}\sum_{A\in a}\mu_{t|\alpha,z}(A)\log\frac{\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)}{P_{t|s,z_s}(A)}\right].
\end{aligned}$$

We note that, for $\mu_\alpha$ a.e. z, if $\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)=0$ in this last expression, then $\mu_{t|\alpha,z}(A)=0$ and we consider the summand to be zero. To complete the proof of the lemma, it is thus sufficient to prove that for $\mu_\alpha$ a.e. z

$$\sup_{a\in\mathcal{A}}\sum_{A\in a}\mu_{t|\alpha,z}(A)\log\frac{\mu_{t|\alpha,z}(A)}{P_{t|s,z_s}(A)}\ge\sup_{a\in\mathcal{A}}\sum_{A\in a}\mu_{t|\alpha,z}(A)\log\frac{\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)}{P_{t|s,z_s}(A)}.$$

However, in turn, the above inequality will be true if we can prove that for each partition a such that $P_{t|s,z_s}(A)>0$ and $\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)>0$ for all A ∈ a,

$$\sum_{A\in a}\mu_{t|\alpha,z}(A)\log\frac{\mu_{t|\alpha,z}(A)}{P_{t|s,z_s}(A)}-\sum_{A\in a}\mu_{t|\alpha,z}(A)\log\frac{\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)}{P_{t|s,z_s}(A)}\ge 0.$$

The left hand side is equal to $\sum_{A\in a}\mu_{t|\alpha,z}(A)\log\frac{\mu_{t|\alpha,z}(A)}{\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)}$. An application of Jensen’s inequality demonstrates that this is greater than or equal to zero. □
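The Jensen step at the end of the proof can be written out as follows. In this sketch $p_A$ and $q_A$ are shorthand introduced only for illustration; they stand for $\mu_{t|\alpha,z}(A)$ and $\mu_{t|\sigma,\pi_{\sigma\alpha}z}(A)$ and do not appear in the original argument.

```latex
% Shorthand (introduced for this illustration only):
%   p_A = \mu_{t|\alpha,z}(A),  q_A = \mu_{t|\sigma,\pi_{\sigma\alpha}z}(A),  A \in a.
% Since a is a partition of R^d, \sum_A p_A = \sum_A q_A = 1.
\begin{align*}
\sum_{A\in a} p_A \log\frac{p_A}{q_A}
  &= -\sum_{A\in a:\,p_A>0} p_A \log\frac{q_A}{p_A} \\
  &\ge -\log\Big(\sum_{A\in a:\,p_A>0} q_A\Big)
     && \text{(Jensen, concavity of $\log$)} \\
  &\ge -\log 1 = 0 .
\end{align*}
```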

Remark. If, contrary to the definition, we briefly consider $\mu_{|[0,t],x}$ to be a probability measure on $\mathcal{T}$, such that $\mu_{|[0,t],x}(A)=1$ where A is the set of all points y such that $y_s=x_s$ for all s ≤ t, then it may be seen from the definition of R that

$$R_{\mathcal{F}_T}(\mu_{|[0,t],x}\|P_{|[0,t],x})=R_{\mathcal{F}_{t,T}}(\mu_{|[0,t],x}\|P_{|[0,t],x})=R_{\mathcal{F}_{t,T}}(\mu_{|[0,t],x}\|P_{|t,x_t}). \tag{8}$$

We have also made use of the Markov property of P. This is why our convention, to which we now return, is to consider $\mu_{|[0,t],x}$ to be a probability measure on $(C([t,T];\mathbb{R}^d),\mathcal{F}_{t,T})$.

This leads us to the following expressions for R(μ||P).

Lemma 3. Each σ in the expressions below is of the form $\{t_1<t_2<\dots<t_{m(\sigma)-1}<t_{m(\sigma)}\}$ for some integer m(σ).

$$R(\mu\|P)=R_0(\mu\|P)+\sum_{j=1}^{m(\sigma)-1}E^{\mu_{[0,t_j]}(x)}\left[R_{\mathcal{F}_{t_j,t_{j+1}}}(\mu_{|[0,t_j],x}\|P_{|t_j,x_{t_j}})\right], \tag{9}$$
$$R(\mu\|P)=R_0(\mu\|P)+\sup_{\sigma\in\mathbf{J}_*}\sum_{j=1}^{m(\sigma)-1}E^{\mu_{\sigma_{;j}}(x)}\left[R_{t_{j+1}}(\mu_{t_{j+1}|\sigma_{;j},x}\|P_{t_{j+1}|t_j,x_{t_j}})\right], \tag{10}$$
$$E^{\mu_{[0,s]}(x)}\left[R_t(\mu_{t|[0,s],x}\|P_{t|s,x_s})\right]=\sup_{\sigma\in\mathbf{J}_s}E^{\mu_\sigma(y)}\left[R_t(\mu_{t|\sigma,y}\|P_{t|s,y_s})\right], \tag{11}$$
where in this last expression $0\le s<t\le T$.

Proof. Consider the sub σ-algebra $\mathcal{F}_{0,t_{m(\sigma)-1}}$. We then find, through an application of Lemma 1 and (8), that

$$R(\mu\|P)=R_{\mathcal{F}_{0,t_{m(\sigma)-1}}}(\mu\|P)+E^{\mu_{[0,t_{m(\sigma)-1}]}(x)}\left[R_{\mathcal{F}_{t_{m(\sigma)-1},t_{m(\sigma)}}}(\mu_{|[0,t_{m(\sigma)-1}],x}\|P_{|t_{m(\sigma)-1},x_{t_{m(\sigma)-1}}})\right].$$

We may continue inductively to obtain the first identity.

We use Theorem 2 to prove the second identity. It suffices to take the supremum over $\mathbf{J}_*$, because $R_\sigma(\mu\|P)\ge R_\gamma(\mu\|P)$ if $\gamma\subseteq\sigma$. It thus suffices to prove that

$$R_\sigma(\mu\|P)=R_0(\mu\|P)+\sum_{j=1}^{m(\sigma)-1}E^{\mu_{\sigma_{;j}}(x)}\left[R_{t_{j+1}}(\mu_{t_{j+1}|\sigma_{;j},x}\|P_{t_{j+1}|t_j,x_{t_j}})\right].$$

However, this also follows from repeated application of Lemma 1. To prove the third identity, we firstly note that

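$$\begin{aligned}
R_{\mathcal{F}_{s;t}}(\mu\|P)&=R_0(\mu\|P)+\sup_{\sigma\in\mathbf{J}_{s;t}}\sum_{j=1}^{m(\sigma)-1}E^{\mu_{\sigma_{;j}}(x)}\left[R_{t_{j+1}}(\mu_{t_{j+1}|\sigma_{;j},x}\|P_{t_{j+1}|t_j,x_{t_j}})\right]\\
&=\sup_{\sigma\in\mathbf{J}_s}\left\{R_\sigma(\mu\|P)+E^{\mu_\sigma(x)}\left[R_t(\mu_{t|\sigma,x}\|P_{t|s,x_s})\right]\right\}.
\end{aligned}$$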

The proof of this is entirely analogous to that of the second identity, except that it makes use of (5) instead of (4). However, after another application of Lemma 1, we also have that

$$R_{\mathcal{F}_{s;t}}(\mu\|P)=R_{\mathcal{F}_{0,s}}(\mu\|P)+E^{\mu_{[0,s]}(x)}\left[R_t(\mu_{t|[0,s],x}\|P_{t|s,x_s})\right].$$

On equating these two different expressions for $R_{\mathcal{F}_{s;t}}(\mu\|P)$, we obtain

$$E^{\mu_{[0,s]}(x)}\left[R_t(\mu_{t|[0,s],x}\|P_{t|s,x_s})\right]=\sup_{\sigma\in\mathbf{J}_s}\left\{\left(R_\sigma(\mu\|P)-R_{\mathcal{F}_{0,s}}(\mu\|P)\right)+E^{\mu_\sigma(x)}\left[R_t(\mu_{t|\sigma,x}\|P_{t|s,x_s})\right]\right\}.$$

Let $(\sigma^{(k)})\subset\mathbf{J}_s$, $\sigma^{(k-1)}\subseteq\sigma^{(k)}$, be such that $\lim_{k\to\infty}R_{\sigma^{(k)}}(\mu\|P)=R_{\mathcal{F}_{0,s}}(\mu\|P)$. Such a sequence exists by (4). Similarly, let $(\gamma^{(k)})\subseteq\mathbf{J}_s$ be a sequence such that $E^{\mu_{\gamma^{(k)}}(x)}\left[R_t(\mu_{t|\gamma^{(k)},x}\|P_{t|s,x_s})\right]$ is non-decreasing and, as k → ∞, asymptotically approaches $\sup_{\sigma\in\mathbf{J}_s}E^{\mu_\sigma(x)}\left[R_t(\mu_{t|\sigma,x}\|P_{t|s,x_s})\right]$. Lemma 2 dictates that

$$E^{\mu_{\sigma^{(k)}\cup\gamma^{(k)}}(x)}\left[R_t(\mu_{t|\sigma^{(k)}\cup\gamma^{(k)},x}\|P_{t|s,x_s})\right]$$
asymptotically approaches the same limit as well. Clearly $\lim_{k\to\infty}R_{\sigma^{(k)}\cup\gamma^{(k)}}(\mu\|P)=R_{\mathcal{F}_{0,s}}(\mu\|P)$ because of the inequality at the start of Theorem 2. This yields the third identity.

4.1. Proof of Theorem 1

In this section we work towards the proof of Theorem 1, making use of some results in [22]. However, we first require some more definitions.

If $K\subset\mathbb{R}^d$ is compact, let $\mathcal{D}_K$ be the set of all $f\in\mathcal{D}$ whose support is contained in K. The corresponding space of real distributions is $\mathcal{D}'$, and we denote the action of $\theta\in\mathcal{D}'$ by 〈θ, f〉. If $\theta\in\mathcal{M}(\mathbb{R}^d)$, then clearly $\langle\theta,f\rangle=E^{\theta}[f]$. We let $C_0^{2,1}(\mathbb{R}^d)$ denote the set of all continuous functions, possessing continuous spatial derivatives of first and second order, a continuous time derivative of first order, and of compact support. For $f\in\mathcal{D}$ and t ∈ [0, T], we define the random variable $\nabla_tf:\mathbb{R}^d\to\mathbb{R}^d$ such that $(\nabla_tf(y))^i=\sum_{j=1}^{d}a^{ij}(t,y)\frac{\partial f}{\partial y^j}$ (for $x\in\mathcal{T}$ we may also understand $\nabla_tf(x):=\nabla_tf(x_t)$). Let $a_{ij}$ be the components of the matrix inverse of $a^{ij}$. For random variables $X,Y:\mathcal{T}\to\mathbb{R}^d$, we define the inner product $(X,Y)_{t,x}=\sum_{i,j=1}^{d}X^i(x)Y^j(x)a_{ij}(t,x_t)$ with associated norm $|X|^2_{t,x}=(X(x),X(x))_{t,x}$. We note that $|\nabla_tf|^2_{t,x}=\sum_{i,j=1}^{d}a^{ij}(t,x_t)\frac{\partial f}{\partial z^i}(x_t)\frac{\partial f}{\partial z^j}(x_t)$.
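The last identity, which says that the norm of $\nabla_tf$ taken with the inverse matrix reduces to a quadratic form in the original matrix, can be checked on a small example. The Python sketch below uses an invented symmetric positive definite 2×2 matrix and an invented gradient vector; the helper names are hypothetical.

```python
def mat_vec(m, v):
    """Matrix-vector product for small dense matrices stored as nested lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix with non-zero determinant."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a_upper = [[2.0, 0.5], [0.5, 1.0]]   # a^{ij}: symmetric, positive definite (made up)
a_lower = inverse_2x2(a_upper)       # a_{ij}: components of the inverse matrix
grad_f = [0.3, -1.2]                 # partial derivatives of f at one point (made up)

nabla_t_f = mat_vec(a_upper, grad_f)                              # (nabla_t f)^i
norm_sq_via_inner = dot(nabla_t_f, mat_vec(a_lower, nabla_t_f))   # (nabla_t f, nabla_t f)_{t,x}
norm_sq_direct = dot(grad_f, mat_vec(a_upper, grad_f))            # sum a^{ij} d_i f d_j f

print(norm_sq_via_inner, norm_sq_direct)
```

The agreement of the two numbers relies on the symmetry of the matrix, which is what makes the inverse a valid inner product.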

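Let $\mathfrak{M}$ be the space of all continuous maps $[0,T]\to\mathcal{M}(\mathbb{R}^d)$, equipped with the topology of uniform convergence. For s ∈ [0, T], $\vartheta\in\mathfrak{M}$ and $\nu\in\mathcal{M}(\mathbb{R}^d)$ we define $n(s,\vartheta,\nu)\ge 0$ such that

$$n(s,\vartheta,\nu)^2=\sup_{f\in\mathcal{D}}\left\{\langle\vartheta,f\rangle-\frac12E^{\nu(y)}\left[|\nabla_sf|^2_{s,y}\right]\right\}. \tag{13}$$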

This definition is taken from [22] (Equation (4.7)); we note that n is convex in ϑ. For $\gamma\in\mathcal{M}(\mathcal{T})$, we may naturally write n(s, γ, ν) := n(s, ω, ν), where ω is the projection of γ onto $\mathfrak{M}$, i.e., $\omega(s)=\gamma_s$. It is shown in [22] that this projection is continuous. The following two definitions, lemma and two propositions are all taken (with some small modifications) from [22].

Definition 2. Let I be an interval of the real line. A measure $\mu\in\mathcal{M}(\mathcal{T})$ is called absolutely continuous if for each compact set $K\subset\mathbb{R}^d$ there exists a neighbourhood $U_K$ of 0 in $\mathcal{D}_K$ and an absolutely continuous function $H_K:I\to\mathbb{R}$ such that

$$\left|E^{\mu_u}[f]-E^{\mu_v}[f]\right|\le\left|H_K(u)-H_K(v)\right|,$$
for all $u,v\in I$ and $f\in U_K$.

Lemma 4. [22] (Lemma 4.2) If μ is absolutely continuous over an interval I, then its derivative exists (in the distributional sense) for Lebesgue a.e. $t\in I$. That is, for Lebesgue a.e. $t\in I$, there exists $\dot\mu_t\in\mathcal{D}'$ such that for all $f\in\mathcal{D}$

$$\lim_{h\to 0}\frac1h\left(\langle\mu_{t+h},f\rangle-\langle\mu_t,f\rangle\right)=\langle\dot\mu_t,f\rangle.$$

Definition 3. For $\nu\in\mathcal{M}(C([s,t];\mathbb{R}^d))$, and $0\le s<t\le T$, let $L^2_{s,t}(\nu)$ be the Hilbert space of all measurable maps $h:[s,t]\times\mathbb{R}^d\to\mathbb{R}^d$ with inner product

$$[h_1,h_2]=\int_s^tE^{\nu_u(x)}\left[(h_1(u,x),h_2(u,x))_{u,x}\right]du.$$

We denote by $L^2_{s,t,\nabla}(\nu)$ the closure in $L^2_{s,t}(\nu)$ of the linear subset generated by maps of the form $(x,u)\to\nabla_uf$, where $f\in C_0^{2,1}([s,t],\mathbb{R}^d)$. We note that functions in $L^2_{s,t,\nabla}(\nu)$ only need to be defined $du\otimes\nu_u(dx)$ almost everywhere.

Recall that n is defined in (13), and note that $\langle{}^*L_t\mu_t,f\rangle:=\langle\mu_t,L_tf\rangle$.

Proposition 1. Assume that $\mu\in\mathcal{M}(C([r,s];\mathbb{R}^d))$, such that $\mu_r=\delta_y$ for some $y\in\mathbb{R}^d$ and $0\le r<s\le T$. We have that [22] (Equation 4.9 and Lemma 4.8)

$$\int_r^sn(t,\dot\mu_t-{}^*L_t\mu_t,\mu_t)^2\,dt=\sup_{f\in C_0^{2,1}(\mathbb{R}^d)}\left\{E^{\mu_s(x)}[f(s,x)]-f(r,y)-\int_r^sE^{\mu_t(x)}\left[\left(\frac{\partial}{\partial t}+L_t\right)f(t,x)+\frac12|\nabla_tf(t,x)|^2_{t,x}\right]dt\right\}. \tag{14}$$

It clearly suffices to take the supremum over a countable dense subset. Assume now that $\int_r^sn(t,\dot\mu_t-{}^*L_t\mu_t,\mu_t)^2\,dt<\infty$. Then for Lebesgue a.e. t, $\dot\mu_t={}^*K_t\mu_t$, where [22] (Lemma 4.8(3))

$$K_tf(\cdot)=L_tf(\cdot)+\sum_{1\le j\le d}\left(h^\mu(t,\cdot)\right)^j\frac{\partial f}{\partial x^j}(\cdot),$$
for some $h^\mu\in L^2_{r,s,\nabla}(\mu)$ that satisfies [22] (Lemma 4.8(4))
$$\int_r^sn(t,\dot\mu_t-{}^*L_t\mu_t,\mu_t)^2\,dt=\frac12\int_r^sE^{\mu_t(x)}\left[|h^\mu(t,x)|^2_{t,x}\right]dt<\infty. \tag{16}$$
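The last identity of Proposition 1 can be motivated by a formal completing-the-square computation. The sketch below is heuristic and is not the argument of [22]: it assumes that the distribution $\vartheta_t=\dot\mu_t-{}^*L_t\mu_t$ acts as $\langle\vartheta_t,f\rangle=E^{\mu_t(x)}[(h^\mu(t,x),\nabla_tf(x))_{t,x}]$, which follows from the form of $K_t$ above together with $(h,\nabla_tf)_{t,x}=\sum_jh^j\,\partial f/\partial x^j$, and it ignores all questions of measurability and density.

```latex
\begin{align*}
\langle\vartheta_t,f\rangle-\tfrac12E^{\mu_t(x)}\big[|\nabla_tf|_{t,x}^2\big]
 &= E^{\mu_t(x)}\big[(h^\mu,\nabla_tf)_{t,x}-\tfrac12|\nabla_tf|_{t,x}^2\big] \\
 &= \tfrac12E^{\mu_t(x)}\big[|h^\mu|_{t,x}^2\big]
   -\tfrac12E^{\mu_t(x)}\big[|h^\mu-\nabla_tf|_{t,x}^2\big] \\
 &\le \tfrac12E^{\mu_t(x)}\big[|h^\mu(t,x)|_{t,x}^2\big].
\end{align*}
% The bound is approached when \nabla_t f approximates h^\mu in L^2, which is why
% h^\mu is taken in the closure L^2_{r,s,\nabla}(\mu) of gradient fields.
```

Taking the supremum over f and integrating in time gives the stated value of one half of the squared norm of $h^\mu$.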

Remark. We reach (17) from the proof of Lemma 9 in [22] (Eq 4.10). One should note also that in equation (4.10) of [22] the relative entropy R is written as $L^{(1)}_\nu$. To reach (18), we also use the equivalence between (4.7) and (4.8) in [22].

Proposition 2. Assume that $\mu\in\mathcal{M}(\mathcal{T})$, such that $\mu_r=\delta_y$ for some $y\in\mathbb{R}^d$ and $0\le r<s\le T$. If $R_{\mathcal{F}_{r,s}}(\mu\|P_{|r,y})<\infty$, then μ is absolutely continuous on [r, s], and [22] (Lemma 4.9)

$$R_{\mathcal{F}_{r,s}}(\mu\|P_{|r,y})\ge\int_r^sn(t,\dot\mu_t-{}^*L_t\mu_t,\mu_t)^2\,dt. \tag{17}$$

Here the derivative $\dot\mu_t$ is defined in Lemma 4. For all $f\in\mathcal{D}$, [22] (Eq. (4.35))

$$E^{\mu_s}[f]-\log E^{P_{s|r,y}}[\exp(f)]\le\int_r^sn(t,\dot\mu_t-{}^*L_t\mu_t,\mu_t)^2\,dt. \tag{18}$$

We are now ready to prove Theorem 1 (the central result).

Proof. Fix a partition $\sigma=\{t_1,\dots,t_m\}$. We may conclude from (9) and (17) that

$$R(\mu\|P)\ge R_0(\mu\|P)+\sum_{j=1}^{m-1}E^{\mu_{[0,t_j]}(x)}\int_{t_j}^{t_{j+1}}n\left(t,\dot\mu_{t|[0,t_j],x}-{}^*L_t\mu_{t|[0,t_j],x},\mu_{t|[0,t_j],x}\right)^2dt.$$

The integrand on the right hand side is measurable as a function of x (so that the expectation $E^{\mu_{[0,t_j]}(x)}$ is well defined) due to the equivalent expression (14). We may infer from (18) that

$$\begin{aligned}
E^{\mu_{[0,t_j]}(x)}\int_{t_j}^{t_{j+1}}n\left(t,\dot\mu_{t|[0,t_j],x}-{}^*L_t\mu_{t|[0,t_j],x},\mu_{t|[0,t_j],x}\right)^2dt
&\ge E^{\mu_{[0,t_j]}(x)}\left[\sup_{f\in\mathcal{D}}\left\{E^{\mu_{t_{j+1}|[0,t_j],x}}[f]-\log E^{P_{t_{j+1}|t_j,x_{t_j}}}[\exp(f)]\right\}\right]\\
&=E^{\mu_{[0,t_j]}(x)}\left[\sup_{f\in C_b(\mathbb{R}^d)}\left\{E^{\mu_{t_{j+1}|[0,t_j],x}}[f]-\log E^{P_{t_{j+1}|t_j,x_{t_j}}}[\exp(f)]\right\}\right].
\end{aligned} \tag{20}$$

This last step follows by noting that if $\nu\in\mathcal{M}(\mathbb{R}^d)$, $f\in C_b(\mathbb{R}^d)$, and the expectation of f with respect to ν is finite, then there exists a sequence $(K_n)\subset\mathbb{R}^d$ of compact sets such that

$$\int_{\mathbb{R}^d}f(x)\,d\nu(x)=\lim_{n\to\infty}\int_{K_n}f(x)\,d\nu(x).$$

In turn, for each n there exist $(f_n^{(m)})\subset\mathcal{D}_{K_n}$ such that we may write

$$\int_{K_n}f(x)\,d\nu(x)=\lim_{m\to\infty}\int_{K_n}f_n^{(m)}(x)\,d\nu(x).$$

This allows us to conclude that the two supremums are the same. The last expression in (20) is merely

$$E^{\mu_{[0,t_j]}(x)}\left[R_{t_{j+1}}(\mu_{t_{j+1}|[0,t_j],x}\|P_{t_{j+1}|t_j,x_{t_j}})\right].$$

By (11), this is greater than or equal to

$$E^{\mu_{\sigma_{;j}}(y)}\left[R_{t_{j+1}}(\mu_{t_{j+1}|\sigma_{;j},y}\|P_{t_{j+1}|t_j,y_{t_j}})\right].$$

We thus obtain the theorem using (10).

5. Some Corollaries

We state some corollaries of Theorem 1. In the course of this section we make progressively stronger assumptions on the nature of μ, culminating in the elegant expression for R(μ||P) when μ is a solution of a martingale problem. We finish by comparing our work with that of [19,20].

Corollary 1. Suppose that $\mu\in\mathcal{M}(\mathcal{T})$ and R(μ||P) < ∞. Then for all s and μ a.e. x, $\mu_{|[0,s],x}$ is absolutely continuous over [s, T]. For each s ∈ [0, T] and μ a.e. $x\in\mathcal{T}$, for Lebesgue a.e. $t\ge s$

$$\dot\mu_{t|[0,s],x}={}^*K^{\mu}_{t|s,x}\,\mu_{t|[0,s],x}, \tag{21}$$
where for some $h^\mu_{s,x}\in L^2_{s,T,\nabla}(\mu_{|[0,s],x})$
$$K^{\mu}_{t|s,x}f(y)=L_tf(y)+\sum_{j=1}^{d}h^{\mu,j}_{s,x}(t,y)\frac{\partial f}{\partial y^j}(y). \tag{22}$$

Furthermore,

$$R(\mu\|P)=R_0(\mu\|P)+\frac12\sup_{\sigma\in\mathbf{J}_*}\int_0^TE^{\mu(w)}E^{\mu_{t|[0,\bar\sigma(t)],w}(z)}\left[\left|h^\mu_{\bar\sigma(t),w}(t,z)\right|^2_{t,z}\right]dt. \tag{23}$$

For any dense countable subset $Q_{0,T}$ of [0, T], there exists a sequence of partitions $\sigma^{(n)}\subset\sigma^{(n+1)}\subset Q_{0,T}$, such that as n → ∞, $|\sigma^{(n)}|\to 0$, and

$$R(\mu\|P)=R_0(\mu\|P)+\frac12\lim_{n\to\infty}\int_0^TE^{\mu(w)}E^{\mu_{t|[0,\bar\sigma^{(n)}(t)],w}(z)}\left[\left|h^\mu_{\bar\sigma^{(n)}(t),w}(t,z)\right|^2_{t,z}\right]dt. \tag{24}$$

Remark. It is not immediately clear that we may simplify (23) further (barring further assumptions). The reason for this is that we only know that $E^{\mu_{t|[0,\bar\sigma(t)],w}(z)}\left[|h^\mu_{\bar\sigma(t),w}(t,z)|^2_{t,z}\right]$ is measurable (as a function of w), but it has not been proven that $h^\mu_{\bar\sigma(t),w}(t,z)$ is measurable (as a function of w).

Proof. Let $\sigma=\{0=t_1,\dots,t_m=T\}$ be an arbitrary partition. For all j < m, we find from Lemma 3 that $R_{\mathcal{F}_{t_j,t_{j+1}}}(\mu_{|[0,t_j],x}\|P_{|t_j,x_{t_j}})<\infty$ for $\mu_{[0,t_j]}$ a.e. $x\in C([0,t_j];\mathbb{R}^d)$. We thus find that, for all such x, $\mu_{|[0,t_j],x}$ is absolutely continuous on $[t_j,t_{j+1}]$ from Proposition 2. We are then able to obtain (21) and (22) from Propositions 1 and 2. From (2), (16) and (21) we find that

$$R(\mu\|P)=R_0(\mu\|P)+\frac12\sup_{\sigma\in\mathbf{J}_*}E^{\mu(x)}\int_0^TE^{\mu_{t|[0,\bar\sigma(t)],x}(z)}\left[\left|h^\mu_{\bar\sigma(t),x}(t,z)\right|^2_{t,z}\right]dt.$$

The above integral must be finite (since we are assuming R(μ||P) is finite). Furthermore $E^{\mu_{t|[0,\bar\sigma(t)],x}(z)}\left[|h^\mu_{\bar\sigma(t),x}(t,z)|^2_{t,z}\right]$ is (t, x) measurable as a consequence of the equivalent form (14). This allows us to apply Fubini’s theorem to obtain (23). The last statement on the sequence of maximising partitions follows from Theorem 2.

Corollary 2. Suppose that R(μ||P) < ∞. Suppose that for all $s\in Q_{0,T}$ (any countable, dense subset of [0, T]), for μ a.e. x and Lebesgue a.e. t, $h^\mu_{s,x}(t,x_t)=E^{\mu_{|[0,s],x;t,x_t}(w)}\left[h^\mu(t,w)\right]$ for some progressively measurable random variable $h^\mu:[0,T]\times\mathcal{T}\to\mathbb{R}^d$. Then

$$R(\mu\|P)=R_0(\mu\|P)+\frac12\int_0^TE^{\mu(w)}\left[|h^\mu(t,w)|^2_{t,w_t}\right]dt.$$

Proof. Let $\mathcal{G}_{s,x;t,y}$ be the sub σ-algebra consisting of all $B\in\mathcal{B}(\mathcal{T})$ such that for all $w\in B$, $w_r=x_r$ for all $r\le s$ and $w_t=y$. Thus $h^\mu_{s,x}(t,y)=E^{\mu_{|[0,s],x;t,y}(w)}\left[h^\mu(t,w)\right]=E^{\mu}\left[h^\mu(t,\cdot)\mid\mathcal{G}_{s,x;t,y}\right]$. By [27] (Corollary 2.4), since $\bigcap_{s<t}\mathcal{G}_{s,x;t,x_t}=\mathcal{G}_{t,x;t,x_t}$ (restricting to $s\in Q_{0,T}$), for μ a.e. x,

$$\lim_{s\to t}E^{\mu_{|[0,s],x;t,x_t}(w)}\left[h^\mu(t,w)\right]=h^\mu(t,x), \tag{26}$$
where $s\in Q_{0,T}$. By the properties of the regular conditional probability, we find from (24) that
$$R(\mu\|P)=R_0(\mu\|P)+\frac12\lim_{n\to\infty}\int_0^TE^{\mu(w)}\left[\left|E^{\mu_{|[0,\bar\sigma^{(n)}(t)],w;t,w_t}(\upsilon)}\left[h^\mu(t,\upsilon)\right]\right|^2_{t,w_t}\right]dt. \tag{27}$$

By assumption, the above limit is finite. Thus by Fatou’s lemma, and using the properties of the regular conditional probability,

$$R(\mu\|P)\ge R_0(\mu\|P)+\frac12\int_0^TE^{\mu(w)}\left[\liminf_{n\to\infty}\left|E^{\mu_{|[0,\bar\sigma^{(n)}(t)],w;t,w_t}(\upsilon)}\left[h^\mu(t,\upsilon)\right]\right|^2_{t,w_t}\right]dt.$$

Through use of (26),

$$R(\mu\|P)\ge R_0(\mu\|P)+\frac12\int_0^TE^{\mu(w)}\left[|h^\mu(t,w)|^2_{t,w_t}\right]dt.$$

Conversely, through an application of Jensen’s inequality to (27)

$$R(\mu\|P)\le R_0(\mu\|P)+\frac12\lim_{n\to\infty}\int_0^TE^{\mu(w)}\left[E^{\mu_{|[0,\bar\sigma^{(n)}(t)],w;t,w_t}(\upsilon)}\left[|h^\mu(t,\upsilon)|^2_{t,w_t}\right]\right]dt.$$

A property of the regular conditional probability yields

$$R(\mu\|P)\le R_0(\mu\|P)+\frac12\int_0^TE^{\mu(w)}\left[|h^\mu(t,w)|^2_{t,w_t}\right]dt.$$

Remark. The condition in the above corollary is satisfied when μ is a solution to a martingale problem—see Lemma 5.

We may further simplify the expression in Theorem 1 when μ is a solution to the following martingale problem. Let $\{c^{jk},e^j\}$ be progressively measurable functions $[0,T]\times\mathcal{T}\to\mathbb{R}$. We suppose that $c^{jk}=c^{kj}$. For all $1\le j,k\le d$, $c^{jk}(t,x)$ and $e^j(t,x)$ are assumed to be bounded for $x\in L$ (where L is compact) and all t ∈ [0, T]. For $f\in C_0^2(\mathbb{R}^d)$ and $x\in\mathcal{T}$, let

$$\mathcal{M}_u(f)(x)=\sum_{1\le j,k\le d}c^{jk}(u,x)\frac{\partial^2f}{\partial y^j\,\partial y^k}(x_u)+\sum_{1\le j\le d}e^j(u,x)\frac{\partial f}{\partial y^j}(x_u).$$

We assume that for all such f, the following is a continuous martingale (relative to the canonical filtration) under μ

$$f(X_t)-f(X_0)-\int_0^t\mathcal{M}_uf(X)\,du. \tag{28}$$

The law governing $X_0$ is stipulated to be $\nu\in\mathcal{M}(\mathbb{R}^d)$.

From now on we switch from our earlier convention and we consider $\mu_{|[0,s],x}$ to be a measure on $\mathcal{T}$ such that, for μ a.e. $x\in\mathcal{T}$, $\mu_{|[0,s],x}(A_{s,x})=1$, where $A_{s,x}$ is the set of all $X\in\mathcal{T}$ satisfying $X_t=x_t$ for all $0\le t\le s$. This is a property of a regular conditional probability (see Theorem 3.18 in [23]). Similarly, $\mu_{|s,x;t,y}$ is considered to be a measure on $\mathcal{T}$ such that for μ a.e. $x\in\mathcal{T}$, $\mu_{|s,x;t,y}(B_{s,x;t,y})=1$, where $B_{s,x;t,y}$ is the set of all $X\in A_{s,x}$ such that $X_t=y$. We may apply Fubini’s Theorem (since f is compactly supported and bounded) to (28) to find that

$$\langle\mu_{t|[0,s],x},f\rangle-f(x_s)=\int_s^tE^{\mu_{|[0,s],x}}\left[\mathcal{M}_uf\right]du.$$

This ensures that $\mu_{|[0,s],x}$ is absolutely continuous over [s, T], and that

$$\langle\dot\mu_{t|[0,s],x},f\rangle=E^{\mu_{|[0,s],x}}\left[\mathcal{M}_tf\right].$$

Lemma 5. If R(μ||P) < ∞ then for Lebesgue a.e. t ∈ [0, T] and μ a.e. $x\in\mathcal{T}$,

$$a(t,x_t)=c(t,x). \tag{30}$$

If R(μ||P) < ∞ then

$$R(\mu\|P)=R(\nu\|\mu_I)+\frac12E^{\mu(x)}\left[\int_0^T\left|b(s,x_s)-e(s,x)\right|^2_{s,x_s}ds\right]. \tag{31}$$

Proof. It follows from R(μ||P) < ∞, (21) and (22) that for all s and μ a.e. x, for Lebesgue a.e. $t\ge s$

$$E^{\mu_{|s,x;t,x_t}}\left[c(t,\cdot)\right]=a(t,x_t).$$

Let us take a countable dense subset $Q_{0,T}$ of [0, T]. There thus exists a null set $N\subseteq[0,T]$ such that for every $s\in Q_{0,T}$, μ a.e. x and every $t\notin N$ the above equation holds. We may therefore conclude (30) using [27] (Corollary 2.4) and taking $s\to t$. From (29), we observe that for all s ∈ [0, T] and μ a.e. x, for Lebesgue a.e. t

$$h^\mu_{s,x}(t,x_t)=E^{\mu_{|[0,s],x;t,x_t}}\left[e(t,\cdot)\right]-b(t,x_t).$$

Equation (31) thus follows from Corollary 2.
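As a sanity check of the final formula in Lemma 5, consider a one-dimensional example that is not taken from the paper. Assume P is the law of a scaled Brownian motion (b = 0, a = σ²) and μ is the law of an Ornstein–Uhlenbeck process with drift e(s, x) = −θ x_s and the same diffusion coefficient, both started from the same point, so that the initial entropy term vanishes. The formula then reduces to a time integral of the second moment of the Ornstein–Uhlenbeck process, which is known in closed form. The Python sketch below compares a numerical quadrature of that integral with the analytic value; all function and parameter names are hypothetical.

```python
import math

def ou_second_moment(s, theta, sigma, x0):
    """E[X_s^2] for dX = -theta * X dt + sigma dW started at X_0 = x0."""
    decay = math.exp(-2.0 * theta * s)
    return x0 * x0 * decay + sigma * sigma * (1.0 - decay) / (2.0 * theta)

def entropy_by_quadrature(theta, sigma, x0, T, n=20000):
    """0.5 * int_0^T E[(theta * X_s)^2] / sigma^2 ds via the trapezoidal rule."""
    h = T / n
    total = 0.0
    for k in range(n + 1):
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * ou_second_moment(k * h, theta, sigma, x0)
    return 0.5 * theta * theta / (sigma * sigma) * total * h

def entropy_closed_form(theta, sigma, x0, T):
    """Same quantity with the time integral evaluated analytically."""
    g = (1.0 - math.exp(-2.0 * theta * T)) / (2.0 * theta)   # int_0^T exp(-2 theta s) ds
    integral = x0 * x0 * g + sigma * sigma * (T - g) / (2.0 * theta)
    return 0.5 * theta * theta / (sigma * sigma) * integral

print(entropy_by_quadrature(1.5, 0.8, 1.0, 2.0), entropy_closed_form(1.5, 0.8, 1.0, 2.0))
```

For large T the value grows linearly in T at rate θ/4, reflecting the stationary variance σ²/(2θ) of the Ornstein–Uhlenbeck process.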

5.1. Comparison of our Results to Those of Fischer et al. [19,20]

We have already noted in the introduction that one may infer a variational representation of the relative entropy from [19,20] by assuming that the coefficients of the underlying stochastic process are independent of the empirical measure in these papers. The assumptions in [20] on the underlying process P are in some respects more general and in others more restrictive than ours. They are more general insofar as the coefficients of the SDE may depend on the past history of the process and the diffusion coefficient is allowed to be degenerate. However, our assumptions are more general insofar as we only require P to be the unique (in the sense of probability law) weak solution of the SDE, whereas [20] requires P to be the unique strong solution of the SDE. Of course, when both sets of assumptions are satisfied, one may infer that the expressions for the relative entropy are identical.

6. Proof of Theorem 2

The following is an alternative proof to that of [25] (Theorem 6.6) employing the theory of Large Deviations. The fact that, if $\alpha \subseteq \sigma$, then $R_\alpha(\mu\|P) \le R_\sigma(\mu\|P)$, follows from Lemma 1. We prove the first expression (4) in the case s = 0, t = T (the proof of the second identity (5) is analogous).

Definition 4. A sequence of probability laws $\Gamma^N$ on some topological space Ω equipped with its Borel σ-algebra is said to satisfy a strong Large Deviation Principle with rate function I : Ω → ℝ if for all open sets O,

$$\varliminf_{N \to \infty} N^{-1} \log \Gamma^N(O) \ge -\inf_{x \in O} I(x)$$
and for all closed sets F
$$\varlimsup_{N \to \infty} N^{-1} \log \Gamma^N(F) \le -\inf_{x \in F} I(x).$$

If furthermore the set {x : I(x) ≤ α} is compact for all α ≥ 0, we say that I is a good rate function.
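
To make Definition 4 concrete, here is a standard textbook illustration (Cramér's theorem in the Gaussian case), which is not part of the paper's argument. We assume that $\Gamma^N$ is the law on Ω = ℝ of the sample mean of N independent standard normal random variables, so that $\Gamma^N = \mathcal{N}(0, 1/N)$. Then $\Gamma^N$ satisfies a strong Large Deviation Principle with rate function

$$I(x) = \frac{x^2}{2}, \qquad \text{for example} \quad \lim_{N \to \infty} N^{-1} \log \Gamma^N\big([a, \infty)\big) = -\frac{a^2}{2} = -\inf_{x \ge a} I(x) \quad \text{for } a > 0.$$

The level sets $\{x : I(x) \le \alpha\} = [-\sqrt{2\alpha}, \sqrt{2\alpha}]$ are compact, so I is a good rate function in the sense above.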

We define the following empirical measures.

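Definition 5. For $x \in \mathcal{T}^N$ and $y \in \mathcal{T}_\sigma^N$, let

$$\hat{\mu}^N(x) = \frac{1}{N} \sum_{1 \le j \le N} \delta_{x^j} \in \mathcal{M}(\mathcal{T}), \qquad \hat{\mu}^N_\sigma(y) = \frac{1}{N} \sum_{1 \le j \le N} \delta_{y^j} \in \mathcal{M}(\mathcal{T}_\sigma).$$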

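Clearly $\hat{\mu}^N_\sigma(x_\sigma) = \pi_\sigma(\hat{\mu}^N(x))$. The image law $P^{\otimes N} \circ (\hat{\mu}^N)^{-1}$ is denoted by $\Pi^N_{s,t} \in \mathcal{M}(\mathcal{M}(\mathcal{T}))$; since we work with s = 0 and t = T, we write simply $\Pi^N$. Similarly, for $\sigma \in J$, the image law $P_\sigma^{\otimes N} \circ (\hat{\mu}^N_\sigma)^{-1}$ on $\mathcal{M}(\mathcal{T}_\sigma)$ is denoted by $\Pi^N_\sigma \in \mathcal{M}(\mathcal{M}(\mathcal{T}_\sigma))$. Since $\mathcal{T}$ and $\mathcal{T}_\sigma$ are Polish spaces, we have by Sanov's theorem (see Theorem 6.2.10 in [14]) that $\Pi^N$ satisfies a strong Large Deviation Principle with good rate function $R(\cdot\|P)$. Similarly, $\Pi^N_\sigma$ satisfies a strong Large Deviation Principle on $\mathcal{M}(\mathcal{T}_\sigma)$ with good rate function $R_{\mathcal{F}_\sigma}(\cdot\|P)$.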

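We now define the projective limit $\underline{\mathcal{M}}(\mathcal{T})$. If $\alpha, \gamma \in J$ and $\alpha \subseteq \gamma$, then we may define the projection $\pi_{\alpha\gamma} : \mathcal{M}(\mathcal{T}_\gamma) \to \mathcal{M}(\mathcal{T}_\alpha)$ as $\pi_{\alpha\gamma}(\xi) := \xi \circ \pi_{\alpha\gamma}^{-1}$ (on the right-hand side, $\pi_{\alpha\gamma}$ denotes the corresponding projection of paths from $\mathcal{T}_\gamma$ to $\mathcal{T}_\alpha$). An element of $\underline{\mathcal{M}}(\mathcal{T})$ is then a member $\otimes_\sigma \zeta(\sigma)$ of the Cartesian product $\otimes_{\sigma \in J} \mathcal{M}(\mathcal{T}_\sigma)$ satisfying the consistency condition $\pi_{\alpha\gamma}(\zeta(\gamma)) = \zeta(\alpha)$ for all $\alpha \subseteq \gamma$. The topology on $\underline{\mathcal{M}}(\mathcal{T})$ is the minimal topology necessary for the natural projection $\underline{\mathcal{M}}(\mathcal{T}) \to \mathcal{M}(\mathcal{T}_\alpha)$ to be continuous for all $\alpha \in J$. That is, it is generated by open sets of the form

$$A_{\gamma,O} = \left\{ \otimes_\sigma \zeta(\sigma) \in \underline{\mathcal{M}}(\mathcal{T}) : \zeta(\gamma) \in O \right\}, \tag{33}$$
for some $\gamma \in J$ and open O (with respect to the weak topology of $\mathcal{M}(\mathcal{T}_\gamma)$).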

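We may continuously embed $\mathcal{M}(\mathcal{T})$ into the projective limit $\underline{\mathcal{M}}(\mathcal{T})$ of its marginals, letting ι denote this embedding. That is, for any $\sigma \in J$, $(\iota(\mu))(\sigma) = \mu_\sigma$. We note that ι is continuous because $\iota^{-1}(A_{\gamma,O})$ is open in $\mathcal{M}(\mathcal{T})$, for all $A_{\gamma,O}$ of the form in (33). We equip $\underline{\mathcal{M}}(\mathcal{T})$ with the Borel σ-algebra generated by this topology. The embedding ι is measurable with respect to this σ-algebra because the topology of $\mathcal{M}(\mathcal{T})$ has a countable base. The embedding induces the image laws $(\Pi^N \circ \iota^{-1})$ on $\mathcal{M}(\underline{\mathcal{M}}(\mathcal{T}))$. For $\sigma \in J$, it may be seen that $\Pi^N_\sigma = \Pi^N \circ \iota^{-1} \circ (\pi_\sigma)^{-1} \in \mathcal{M}(\mathcal{M}(\mathcal{T}_\sigma))$, where $\pi_\sigma(\otimes_\alpha \mu(\alpha)) = \mu(\sigma)$.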

It follows from [22] (Thm 3.3) that $\Pi^N \circ \iota^{-1}$ satisfies a Large Deviation Principle with rate function $\sup_{\sigma \in J} R_\sigma(\mu\|P)$. However, we note that ι is one-to-one, because any two measures $\mu, \nu \in \mathcal{M}(\mathcal{T})$ such that $\mu_\sigma = \nu_\sigma$ for all $\sigma \in J$ must be equal. Furthermore, ι is continuous. Because of Sanov's theorem, $(\Pi^N)$ is exponentially tight (see Defn 1.2.17, Exercise 1.2.19 in [14] for a definition of exponential tightness and proof of this statement). These facts mean that we may apply the inverse contraction principle [14] (Thm 4.2.4) to infer that $\Pi^N$ satisfies a Large Deviation Principle with the rate function $\sup_{\sigma \in J} R_\sigma(\mu\|P)$. Since rate functions are unique [14] (Lemma 4.1.4), we obtain the first identity in conjunction with Sanov's theorem. The second identity (5) follows similarly. We may repeat the argument above, while restricting to $\sigma \subset Q_{s,t}$. We obtain the same conclusion because the σ-algebra generated by $(\mathcal{F}_\sigma)_{\sigma \subset Q_{s,t}}$ is the same as $\mathcal{F}_{s,t}$. The last identity follows from the fact that, if $\alpha \subseteq \sigma$, then $R_\alpha(\mu\|P) \le R_\sigma(\mu\|P)$.
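
The identification of the relative entropy with the supremum of the marginal relative entropies over finite sets of times can be checked numerically in a simple Gaussian case. The sketch below is not from the paper and rests on several assumptions: d = 1, P is the Wiener measure started at 0, and μ is the law of an Ornstein-Uhlenbeck process dX = -θX dt + dW started at 0, with a hypothetical parameter `theta`. All function names are illustrative. The finite-dimensional marginals are then centred Gaussians, so the marginal relative entropy has a closed form, and the limiting value is computed from the representation in Lemma 5 with b = 0 and e(s, x) = -θ x_s (assuming e coincides with the drift when μ is itself a diffusion).

```python
import numpy as np


def bm_cov(times):
    """Covariance min(s, t) of Brownian motion started at 0."""
    t = np.asarray(times, dtype=float)
    return np.minimum.outer(t, t)


def ou_cov(times, theta):
    """Covariance of dX = -theta * X dt + dW with X_0 = 0."""
    t = np.asarray(times, dtype=float)
    t_sum = np.add.outer(t, t)
    t_diff = np.abs(np.subtract.outer(t, t))
    return (np.exp(-theta * t_diff) - np.exp(-theta * t_sum)) / (2.0 * theta)


def gaussian_kl(cov_mu, cov_p):
    """Relative entropy of N(0, cov_mu) with respect to N(0, cov_p)."""
    n = cov_mu.shape[0]
    trace_term = np.trace(np.linalg.solve(cov_p, cov_mu))
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_mu = np.linalg.slogdet(cov_mu)
    return 0.5 * (trace_term - n + logdet_p - logdet_mu)


def marginal_entropies(theta=1.0, T=1.0, levels=7):
    """Marginal relative entropies on nested dyadic time grids of [0, T].

    Time 0 is omitted because both processes start at 0 (the assumption
    made here), so it contributes nothing.
    """
    values = []
    for k in range(levels):
        n = 2 ** k
        times = T * np.arange(1, n + 1) / n
        values.append(gaussian_kl(ou_cov(times, theta), bm_cov(times)))
    return values


def limit_entropy(theta=1.0, T=1.0):
    """(1/2) E[ int_0^T theta^2 X_s^2 ds ] for the OU process above."""
    return 0.25 * theta * (T - (1.0 - np.exp(-2.0 * theta * T)) / (2.0 * theta))


if __name__ == "__main__":
    for k, value in enumerate(marginal_entropies()):
        print(f"{2 ** k:3d} grid points: marginal relative entropy = {value:.6f}")
    print(f"path-space value from the integral formula = {limit_entropy():.6f}")
```

Because the dyadic grids are nested, the printed values should be nondecreasing and bounded above by the path-space value, in line with the monotonicity property quoted from Lemma 1.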

Acknowledgments

This work was supported by INRIA FRM, ERC-NERVI number 227747, European Union Project # FP7-269921 (BrainScales), and Mathemacs # FP7-ICT-2011.9.7.

Author Contributions

Both authors contributed to all parts of the article. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Plastino, A.; Miller, H.; Plastino, A. Minimum Kullback entropy approach to the Fokker-Planck equation. Phys. Rev. E. 1997, 56, 3927–3934.
  2. Desvillettes, L.; Villani, C. On the trend to global equilibrium in spatially inhomogeneous entropy-dissipating systems. Part 1: The Linear Fokker-Planck Equation. Comm. Pure Appl. Math. 2001, 54, 1–42.
  3. Yu, S.; Mehta, P. The Kullback-Leibler Rate Metric for Comparing Dynamical Systems. In Proceedings of Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, Shanghai, China, 16–18 December 2009; pp. 8363–8368.
  4. Georgiou, T.T.; Lindquist, A. Kullback-Leibler approximation of spectral density functions. IEEE Trans. Inf. Theory. 2003, 49, 2910–2917.
  5. Frittelli, M. The minimal entropy martingale measure and the valuation problem in incomplete markets. Math. Finance. 2000, 10, 39–52.
  6. Grandits, P.; Rheinlander, T. On the minimal entropy martingale measure. Ann. Prob. 2002, 30, 1003–1038.
  7. Miyahara, Y. Minimal Relative Entropy Martingale Measure of Birth and Death Process. In Discussion Papers in Economics; Nagoya City University: Nagoya, Japan, 2000.
  8. Miyahara, Y. On the Minimal Entropy Martingale Measures for Geometric Lévy Processes. In Discussion Papers in Economics; Nagoya City University: Nagoya, Japan, 2000.
  9. Ustunel, A.S. Entropy, invertibility and variational calculus of the adapted shifts on Wiener Space. J. Funct. Anal. 2009, 257, 3655–3689.
  10. Lassalle, R. Invertibility of adapted perturbations of the identity on abstract Wiener space. J. Funct. Anal. 2012, 262, 2734–2776.
  11. Akaike, H. Likelihood of a model and information criteria. J. Econometrics. 1981, 16, 3–14.
  12. Do, M.; Vetterli, M. Wavelet-based Texture Retrieval Using Generalized Gaussian Density and Kullback–Leibler Distance. IEEE Trans. Image Process. 2002, 11, 146–158.
  13. Bozdogan, H. Akaike's Information Criterion and Recent Developments in Information Complexity. J. Math. Psychol. 2000, 44, 62–91.
  14. Dembo, A.; Zeitouni, O. Large Deviations Techniques, 2nd ed.; Springer: Berlin, Germany, 1997.
  15. Ben-Arous, G.; Guionnet, A. Large deviations for Langevin spin glass dynamics. Probab. Theor. Relat. Field. 1995, 102, 455–509.
  16. Moynot, O.; Samuelides, M. Large deviations and mean-field theory for asymmetric random recurrent neural networks. Probab. Theor. Relat. Field. 2002, 123, 41–75.
  17. Faugeras, O.; MacLaurin, J. A large deviation principle for networks of rate neurons with correlated synaptic weights 2013, arXiv, 1302.1029.
  18. Faugeras, O.; MacLaurin, J. Large Deviations of an Ergodic Synchronous Neural Network with Learning 2014, arXiv, 1404.0732v3, math.PR.
  19. Budhiraja, A.; Dupuis, P.; Fischer, M. Large deviation properties of weakly interacting processes via weak convergence methods. Ann. Prob. 2012, 40, 74–102.
  20. Fischer, M. On the form of the large deviation rate function for the empirical measures of weakly interacting systems. Bernoulli 2014, 20, 1765–1801.
  21. Baladron, J.; Fasoli, D.; Faugeras, O.; Touboul, J. Mean field description of and propagation of chaos in recurrent multipopulation networks of Hodgkin-Huxley and Fitzhugh-Nagumo neurons 2011, arXiv, 1110.4294.
  22. Dawson, D.A.; Gärtner, J. Large deviations from the McKean-Vlasov limit for weakly interacting diffusions. Stochastics 1987, 20, 247–308.
  23. Karatzas, I.; Shreve, S.E. Brownian Motion and Stochastic Calculus, 2nd ed.; Graduate Texts in Mathematics; Volume 113, Springer-Verlag: New York, NY, USA, 1991.
  24. Donsker, M.; Varadhan, S. Asymptotic Evaluation of Certain Markov Process Expectations for Large Time, IV. Comm. Pure Appl. Math. 1983, XXXVI, 183–212.
  25. Xanh, N.X.; Zessin, H. Ergodic Theorems for Spatial Processes. Z. Wahrscheinlichkeitstheorie verw. Gebiete 1979, 48, 133–158.
  26. Dupuis, P.; Ellis, R.S. A Weak Convergence Approach to the Theory of Large Deviations; John Wiley & Sons: London, UK, 1997.
  27. Revuz, D.; Yor, M. Continuous Martingales and Brownian Motion, 2nd ed.; Springer-Verlag: Berlin, Germany, 1991.
Entropy EISSN 1099-4300 Published by MDPI AG, Basel, Switzerland