
*Entropy* **2014**, *16*(12), 6705-6721; doi:10.3390/e16126705

## Abstract

In this paper we derive an integral (with respect to time) representation of the relative entropy (or Kullback–Leibler divergence) R(μ||P), where μ and P are measures on C([0, T]; ℝ^{d}). The underlying measure P is a weak solution to a martingale problem with continuous coefficients. Our representation is in the form of an integral with respect to its infinitesimal generator. This representation is of use in statistical inference (particularly involving medical imaging). Since R(μ||P) governs the exponential rate of convergence of the empirical measure (according to Sanov’s theorem), this representation is also of use in the numerical and analytical investigation of finite-size effects in systems of interacting diffusions.

## 1. Introduction

In this paper we derive an integral representation of the relative entropy R(μ||P), where μ is a measure on C([0, T]; ℝ^{d}) and P governs the solution to a stochastic differential equation (SDE). The relative entropy is used to quantify the distance between two measures. It has considerable applications in statistics, imaging, information theory and communications. It has been used in the long-time analysis of Fokker–Planck equations [1,2], the analysis of dynamical systems [3] and the analysis of spectral density functions [4]. It has been used in financial mathematics to quantify the difference between martingale measures [5,6]. It has also been shown in [7] that the existence problem for the minimal relative entropy martingale measure of birth and death processes can be reduced to the problem of solving the Hamilton–Jacobi–Bellman equation; furthermore the minimal entropy martingale measures (MEMMs) for geometric Lévy processes are investigated in [8]. The finiteness of R(μ||P) has been shown to be equivalent to the invertibility of certain shifts on Wiener space, when P is the Wiener measure [9,10]. However, one of the most frequent uses of the relative entropy is in statistical inference (particularly in medical imaging) [11,12]. For example, in data fitting, it is a standard technique to select the parameters that minimise the relative entropy of two conditional probability distributions [13]. Modelling in medical imaging increasingly involves diffusion processes with state space C([0, T]; ℝ^{d}), for which the expression $R(\mu \Vert P)=E^{\mu}\left[\log \frac{d\mu}{dP}\right]$ or the variational definition in Definition 1 may not always be tractable. Furthermore, it is not always clear that one may simply approximate the relative entropy by successively calculating it for the marginals over increasingly fine time-discretisations, since these expressions may asymptotically approach infinity (see (4) below).
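The behaviour of time-discretised marginals can be illustrated with a small sketch. The example below is an illustrative assumption and is not taken from the paper: it compares one-dimensional Gaussian processes started at 0 on a uniform grid of m steps over [0, T]. When μ is Brownian motion with constant drift c and P is standard Brownian motion, the relative entropy of the marginals equals c²T/2 for every grid. When μ is instead Brownian motion with variance parameter s² ≠ 1, every increment contributes the same positive amount, so the total grows linearly in m. All function names are hypothetical.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians with variances v1, v2 > 0."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_drift_marginals(c, T, m):
    """Relative entropy of the m-step marginals of (c*t + W_t) with respect to W_t.

    Increments are independent under both laws, so the relative entropy of the
    joint law is the sum over increments: N(c*dt, dt) versus N(0, dt).
    """
    dt = T / m
    return m * kl_gauss(c * dt, dt, 0.0, dt)

def kl_scale_marginals(s, T, m):
    """Relative entropy of the m-step marginals of (s*W_t) with respect to W_t.

    Each increment compares N(0, s^2*dt) with N(0, dt); the result per step
    does not depend on dt, so the sum is proportional to m.
    """
    dt = T / m
    return m * kl_gauss(0.0, s * s * dt, 0.0, dt)

for m in (10, 100, 1000):
    print(m, kl_drift_marginals(2.0, 1.0, m), kl_scale_marginals(2.0, 1.0, m))
```

In the first case the discretised values are already equal to the limit; in the second the two path measures are mutually singular, which is consistent with the discretised values diverging.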

Another very important application of the relative entropy is in the field of Large Deviations. Sanov’s theorem dictates that the empirical measure induced by independent samples governed by the same probability law P converges towards its limit exponentially fast, and the constant governing the rate of convergence is the relative entropy [14]. Large Deviations have been applied for example to spin glasses [15], neural networks [16–18] and mean-field models of interacting particles [19,20]. In the mean-field theory of neuroscience in particular, there has been a recent interest in the modelling of “finite size effects” [18,21], that is, the deviations from the limiting behaviour for a population of a particular size. Large Deviations provides a mathematically rigorous tool to do this. In this setting, the limiting system is typically the law P of a stochastic process, and therefore the exponential rate of decay of the probability that the empirical measure of the system is “near” some measure μ is governed by the relative entropy R(μ||P). However the numerical calculation of R(μ||P) is not straightforward: the results of this paper provide an alternative characterization of R(μ||P), which assists in this calculation.
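The exponential rate in Sanov’s theorem can be observed in the simplest possible setting. The sketch below is an illustration under our own assumptions (fair coin flips, hypothetical function names): it computes exactly the probability that the empirical mean of n Bernoulli(p) samples is at least a, and compares −(1/n) log of that probability with R(Bernoulli(a)||Bernoulli(p)).

```python
import math

def kl_bernoulli(a, p):
    """Relative entropy R(Bernoulli(a) || Bernoulli(p)) for 0 < a, p < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def log_binomial_tail(n, p, k_min):
    """log P(S_n >= k_min) for S_n ~ Binomial(n, p), computed via log-sum-exp."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k_min, n + 1)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def empirical_rate(n, p, a):
    """-(1/n) log P(empirical mean >= a); this approaches kl_bernoulli(a, p) as n grows."""
    return -log_binomial_tail(n, p, math.ceil(a * n)) / n

target = kl_bernoulli(0.6, 0.5)
for n in (500, 4000):
    print(n, empirical_rate(n, 0.5, 0.6), target)
```

The gap between the two printed values at finite n is a sub-exponential correction, which is the kind of finite-size effect referred to above.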

For example, the rate function for the Large Deviation Principle of the interacting particle model of [20] is expressed directly in terms of the relative entropy between two measures on the space of continuous functions (see in particular Theorem 5.2 of [20]). Similarly, the rate function in [18] (Theorem 10) may be expressed as a function of the relative entropy. In more detail, the rate function $\stackrel{\smile}{J}$ in [18] (Theorem 10) is of the form $\stackrel{\smile}{J}(\mu )=\lim_{n\to \infty}\frac{1}{|V_{n}|}R(\mu^{V_{n}}\Vert \Xi^{V_{n}})$. Here Ξ is the law of the process in [18] (Equation (31)), i.e., the law of a ℤ^{d}-indexed stochastic process, and μ^{V_{n}} and Ξ^{V_{n}} are the marginals over the finite hypercube V_{n} of side length (2n + 1). The results of this paper give a means of evaluating R(μ^{V_{n}}||Ξ^{V_{n}}) and therefore $\stackrel{\smile}{J}(\mu )$.

In this paper we derive a specific integral (with respect to time) representation of the relative entropy R(μ||P) when P is the law of a diffusion process. The representation is in terms of the infinitesimal generator of P. This P is the same as in [22] (Section 4). The representation makes use of regular conditional probabilities. We expect that in some circumstances, it ought to be more tractable than the standard definition in Definition 1, and thus it might be of practical use in the applications listed above.

## 2. Outline of Main Result

Let $\mathcal{T}$ be the Banach space C([0, T]; ℝ^{d}) equipped with the norm

$$\Vert x\Vert =\sup_{s\in [0,T]}|x_{s}|,$$

where |·| is the Euclidean norm on ℝ^{d}. We let (F_{t}) be the canonical filtration over ($\mathcal{T}$, B($\mathcal{T}$)). For some topological space $\mathcal{X}$, we let B($\mathcal{X}$) be the Borelian σ-algebra and $\mathcal{M}$($\mathcal{X}$) the space of all probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$. Unless otherwise indicated, we endow $\mathcal{M}$($\mathcal{X}$) with the topology of weak convergence. Let σ = {t_{1}, t_{2}, …, t_{m}} be a finite set of elements such that t_{1} ≥ 0, t_{m} ≤ T and t_{j} < t_{j+1}. We term σ a partition, and denote the set of all such partitions by J. The set of all partitions of the above form such that t_{1} = 0 and t_{m} = T is denoted J_{*}. We define |σ| = sup_{1≤j≤m−1}{t_{j+1} − t_{j}}. For some t ∈ [0, T] and σ ∈ J_{*}, we define $\underline{\sigma}(t)=\sup\{s\in \sigma \mid s\le t\}$. The following definition of relative entropy is standard.

**Definition 1.** Let (Ω, $\mathscr{H}$) be a measurable space, and μ, ν probability measures.
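On a finite space the variational form of the relative entropy can be checked directly. The sketch below assumes the standard Donsker–Varadhan form $R(\mu \Vert \nu )=\sup_{f}\{E^{\mu}[f]-\log E^{\nu}[e^{f}]\}$, with the supremum taken over bounded measurable f, which we take to be the variational definition intended here. The two distributions and the function names are arbitrary illustrative choices.

```python
import math
import random

def kl(mu, nu):
    """sum_i mu_i log(mu_i / nu_i); requires nu_i > 0 wherever mu_i > 0."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def dv_objective(f, mu, nu):
    """E^mu[f] - log E^nu[exp(f)] for a function f given as a list of values."""
    mean_f = sum(m * v for m, v in zip(mu, f))
    log_mgf = math.log(sum(n * math.exp(v) for n, v in zip(nu, f)))
    return mean_f - log_mgf

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]

# The supremum is attained at the log-density f* = log(dmu/dnu).
f_star = [math.log(m / n) for m, n in zip(mu, nu)]
best = dv_objective(f_star, mu, nu)

# Any other test function gives a value no larger than the relative entropy.
rng = random.Random(0)
trials = [dv_objective([rng.uniform(-3, 3) for _ in mu], mu, nu) for _ in range(1000)]
print(best, kl(mu, nu), max(trials))
```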

Let P ∈ $\mathcal{M}$($\mathcal{T}$) be the following law governing a Markov–Feller diffusion process on $\mathcal{T}$. We stipulate P to be a weak solution (with respect to the canonical filtration) of the local martingale problem with infinitesimal generator

$$\mathcal{L}_{t}f(x)=\frac{1}{2}\sum_{j,k=1}^{d}a^{jk}(t,x)\frac{\partial^{2}f}{\partial x^{j}\partial x^{k}}+\sum_{j=1}^{d}b^{j}(t,x)\frac{\partial f}{\partial x^{j}},$$

for f in C^{2}(ℝ^{d}), i.e., the space of twice continuously differentiable functions. The initial condition (governing P_{0}) is μ_{I} ∈ $\mathcal{M}$(ℝ^{d}). The coefficients a^{jk}, b^{j} : [0, T] × ℝ^{d} → ℝ are assumed to be continuous (over [0, T] × ℝ^{d}), and the matrix a(t, x) is strictly positive definite for all t and x. Here P is assumed to be the unique weak solution. We note that the above infinitesimal generator is the same as in [22] (p. 269) (note particularly its Remark 4.4). We note that P is the law of the solution Y = (Y^{j}) to the following stochastic differential equation: for j ∈ {1, …, d},

Here (W^{k}) are independent Wiener processes.
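Sample paths of a diffusion of this kind can be approximated with the Euler–Maruyama scheme. The sketch below is one-dimensional and assumes the stochastic differential equation takes the usual form dY_t = b(t, Y_t) dt + s(t, Y_t) dW_t with s² = a. The coefficient s, the function name and the example coefficients are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def euler_maruyama(b, s, y0, T, n_steps, rng):
    """Approximate one path of dY = b(t, Y) dt + s(t, Y) dW on [0, T]."""
    dt = T / n_steps
    path = [y0]
    for i in range(n_steps):
        t, y = i * dt, path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment, distributed N(0, dt)
        path.append(y + b(t, y) * dt + s(t, y) * dw)
    return path

rng = random.Random(1)

# Ornstein-Uhlenbeck-type example: b(t, y) = -y and a(t, y) = s^2 = 0.25.
ou_path = euler_maruyama(lambda t, y: -y, lambda t, y: 0.5, 1.0, 1.0, 200, rng)

# With zero noise the scheme reduces to the explicit Euler method for y' = -y.
det_path = euler_maruyama(lambda t, y: -y, lambda t, y: 0.0, 1.0, 1.0, 200, rng)
print(ou_path[-1], det_path[-1], math.exp(-1.0))
```

The law of such a discretised path only approximates P; the scheme is shown to fix ideas about the role of the drift b and the diffusion matrix a.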

Our major result is the following. Let μ ∈ $\mathcal{M}$($\mathcal{T}$) govern a random variable X ∈ $\mathcal{T}$. For some x ∈ $\mathcal{T}$, we denote by μ_{|[0,s],x} the regular conditional probability (rcp) given X_{r} = x_{r} for all r ∈ [0, s]. The marginal of μ_{|[0,s],x} at some time t ≥ s is denoted μ_{t|[0,s],x}.

**Theorem 1.** Let (σ^{(m)})_{m∈ℤ^{+}} be any sequence of partitions such that σ^{(m)} ⊆ σ^{(m+1)} and |σ^{(m)}| → 0 as m → ∞. For μ ∈ $\mathcal{M}$($\mathcal{T}$),

Here D is the Schwartz space of compactly supported functions ℝ^{d} → ℝ, possessing continuous derivatives of all orders. If $\frac{\partial}{\partial t}E^{\mu_{t|[0,\underline{\sigma}(t)],x}}[f]$ does not exist, then we consider it to be ∞.

Our paper has the following format. In Section 3 we make some preliminary definitions, defining the process P against which the relative entropy is taken in this paper. In Section 4 we employ the projective limits approach of [22] to obtain the chief result of this paper: Theorem 1. This gives an explicit integral representation of the relative entropy. In Section 5 we apply the result in Theorem 1 to various corollaries, including the particular case when μ is the solution of a martingale problem. We finish by comparing our results to those of [19] and [20].

## 3. Preliminaries

We outline some necessary definitions. For σ ∈ J of the form σ = {t_{1}, t_{2}, …, t_{m}}, let σ_{;j} = {t_{1}, …, t_{j}}. We denote the number of elements in a partition σ by m(σ). We let J_{s} be the set of all partitions lying in [0, s]. For 0 < s < t ≤ T, we let J_{s;t} be the set of all partitions of the form σ ∪ {t}, where σ ∈ J_{s}.
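The partition notation can be made concrete with a few helper functions. The sketch below represents a partition as a sorted Python list; the helper names are hypothetical, and the midpoint refinement is only one way of producing nested partitions whose mesh tends to zero, as required of the sequence in Theorem 1.

```python
import bisect

def mesh(sigma):
    """|sigma|: the largest gap between consecutive partition points."""
    return max(t_next - t for t, t_next in zip(sigma, sigma[1:]))

def floor_point(sigma, t):
    """underline-sigma(t) = sup{s in sigma : s <= t}, for sorted sigma with sigma[0] <= t."""
    i = bisect.bisect_right(sigma, t)
    if i == 0:
        raise ValueError("t lies before the first partition point")
    return sigma[i - 1]

def truncate(sigma, j):
    """sigma_{;j} = {t_1, ..., t_j}, with j counted from 1 as in the text."""
    return sigma[:j]

def refine(sigma):
    """Insert all midpoints: a nested refinement that halves the mesh."""
    out = []
    for t, t_next in zip(sigma, sigma[1:]):
        out += [t, (t + t_next) / 2]
    return out + [sigma[-1]]

sigma = [0.0, 0.25, 0.5, 1.0]           # an element of J_* with T = 1
print(mesh(sigma), floor_point(sigma, 0.3), truncate(sigma, 2))
print(refine(sigma), mesh(refine(sigma)))
```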

Let π_{σ} : $\mathcal{T}$ → $\mathcal{T}_{\sigma}$ := ℝ^{d×m(σ)} be the natural projection, i.e., such that $\pi_{\sigma}(x)=(x_{t_{1}},\dots ,x_{t_{m(\sigma )}})$. We similarly define the natural projection $\pi_{\alpha \gamma}:\mathcal{T}_{\gamma}\to \mathcal{T}_{\alpha}$ (for α ⊆ γ ∈ J), and we define $\pi_{[s,t]}:\mathcal{T}\to C([s,t];\mathbb{R}^{d})$ to be the natural restriction of x ∈ $\mathcal{T}$ to [s, t]. The expectation of some measurable function f with respect to a measure μ is written as E^{μ(x)}[f(x)], or simply E^{μ}[f] when the context is clear.

For s < t, we write $F_{s,t}=\pi_{[s,t]}^{-1}B(C([s,t];\mathbb{R}^{d}))$ and $F_{\sigma}=\pi_{\sigma}^{-1}B(\mathcal{T}_{\sigma})$. We define F_{s;t} to be the σ-algebra generated by F_{s} and F_{γ} (where γ = [t]). For μ ∈ $\mathcal{M}$($\mathcal{T}$), we denote its image laws by $\mu_{\sigma}:=\mu \circ \pi_{\sigma}^{-1}\in \mathcal{M}(\mathcal{T}_{\sigma})$ and $\mu_{[s,t]}:=\mu \circ \pi_{[s,t]}^{-1}\in \mathcal{M}(C([s,t];\mathbb{R}^{d}))$.

Let μ ∈ $\mathcal{M}$($\mathcal{T}$) govern a random variable X = (X_{s}) ∈ $\mathcal{T}$. For z ∈ ℝ^{d}, the rcp given X_{s} = z is denoted by μ_{|s,z}. For x ∈ C([0, s]; ℝ^{d}) or $\mathcal{T}$, the rcp given that X_{u} = x_{u} for all 0 ≤ u ≤ s is written as μ_{|[0,s],x}. The rcp given that X_{u} = x_{u} for all u ≤ s, and X_{t} = z, is written as μ_{|s,x;t,z}. For σ ∈ J_{s} and z ∈ (ℝ^{d})^{m(σ)}, the rcp given that X_{u} = z_{u} for all u ∈ σ is written as μ_{|σ,z}. All of these measures are considered to be in $\mathcal{M}$(C([s, T]; ℝ^{d})) (unless indicated otherwise in particular circumstances). The probability laws governing X_{t} (for t ≥ s), for each of these, are respectively μ_{t|s,z}, μ_{t|[0,s],x} and μ_{t|σ,z}. We clearly have μ_{s|s,z} = δ_{z}, for μ_{s} a.e. z, and similarly for the others.

Remark. See [23] (Definition 5.3.16) for a definition of a rcp. Technically, if we let $\mu_{|s,z}^{*}$ be the rcp given X_{s} = z according to this definition, then $\mu_{|s,z}=\mu_{|s,z}^{*}\circ \pi_{[s,T]}^{-1}$ and $\mu_{t|s,z}=\mu_{|s,z}^{*}\circ \pi_{[t]}^{-1}$. By [23] (Theorem 3.18), μ_{|s,z} is well-defined for μ_{s} a.e. z. Similar comments apply to the other rcp’s defined above.

In the definition of the relative entropy, we abbreviate R_{F_{σ}}(μ||P) by R_{σ}(μ||P). If σ = {t}, we write R_{t}(μ||P).

## 4. The Relative Entropy R(⋅||P ) Using Projective Limits

In this section we derive an integral representation of the relative entropy R(μ||P), for arbitrary μ ∈ $\mathcal{M}$(T). We start with the standard result in Theorem 2, before adapting the projective limits approach of [22] to obtain the central result (Theorem 1).

We begin with a standard decomposition result for the relative entropy [24].
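As a finite-space illustration of this kind of decomposition, the sketch below assumes that it takes the usual chain-rule form: the relative entropy on the larger σ-algebra equals the relative entropy on the smaller one plus the μ-expectation of the relative entropy between the regular conditional probabilities. Here the larger σ-algebra is generated by a pair of coordinates and the smaller one by the first coordinate alone. The two joint laws are arbitrary illustrative choices.

```python
import math

def kl(p, q):
    """sum_i p_i log(p_i / q_i); requires q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Joint laws on {0, 1} x {0, 1, 2}; rows are indexed by the first coordinate.
mu = [[0.10, 0.20, 0.10], [0.20, 0.15, 0.25]]
nu = [[0.20, 0.20, 0.10], [0.10, 0.30, 0.10]]

def marginal(joint):
    """Law of the first coordinate."""
    return [sum(row) for row in joint]

def conditional(joint, i):
    """Law of the second coordinate given that the first equals i."""
    total = sum(joint[i])
    return [v / total for v in joint[i]]

joint_kl = kl([v for row in mu for v in row], [v for row in nu for v in row])
marg_kl = kl(marginal(mu), marginal(nu))
cond_kl = sum(
    marginal(mu)[i] * kl(conditional(mu, i), conditional(nu, i)) for i in range(2)
)
print(joint_kl, marg_kl + cond_kl)
```

Since the conditional term is non-negative, the same computation shows that passing to a smaller σ-algebra can only decrease the relative entropy.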

**Lemma 1.** Let X be a Polish space with sub σ-algebras G ⊆ F ⊆ B(X). Let μ and ν be probability measures on (X, F), and their regular conditional probabilities over G be (respectively) μ_{ω} and ν_{ω}. Then

The following Theorem is a straightforward consequence of [25] (Theorem 6.6): we provide an alternative proof using the theory of Large Deviations in Section 6.

**Theorem 2.** If α, σ ∈ J and α ⊆ σ, then R_{α}(μ||P) ≤ R_{σ}(μ||P). Furthermore,

It suffices for the supremums in (4) to take σ ⊂ Q_{s,t}, where Q_{s,t} is any countable dense subset of [s, t]. Thus we may assume that there exists a sequence σ^{(n)} ⊂ Q of partitions such that σ^{(n)} ⊆ σ^{(n+1)}, |σ^{(n)}| → 0 as n → ∞ and

We now provide a technical lemma.

**Lemma 2.** Let t > s, α, σ ∈ J_{s}, σ ⊂ α and s ∈ σ. Then for μ_{σ} a.e. x, R_{t}(μ_{|σ,x}||P_{|s,x_{s}}) = R(μ_{t|σ,x}||P_{t|s,x_{s}}). Secondly,

**Proof.** The first statement is immediate from Definition 1 and the Markovian nature of P. For the second statement, it suffices to prove this in the case that α = σ ∪ {u}, for some u < s. We note that, using a property of regular conditional probabilities, for μ_{σ} a.e. x,

Here v(x, ω) ∈ $\mathcal{T}_{\alpha}$, v(x, ω)_{u} = ω, v(x, ω)_{r} = x_{r} for all r ∈ σ.

We consider A to be the set of all finite disjoint partitions a ⊂ B(ℝ^{d}) of ℝ^{d}. The expression for the entropy in [26] (Lemma 1.4.3) yields

Here the summand is considered to be zero if μ_{t|σ,x}(A) = 0, and infinite if μ_{t|σ,x}(A) > 0 and P_{t|s,x_{s}}(A) = 0. Making use of (7), we find that

We note that, for μ_{α} a.e. z, if
${\mu}_{t|\sigma ,{\pi}_{\sigma \alpha}z}(A)=0$ in this last expression, then μ_{t|α,z}(A) = 0 and we consider the summand to be zero. To complete the proof of the lemma, it is thus sufficient to prove that for μ_{α} a.e. z

However, in turn, the above inequality will be true if we can prove that for each partition a such that ${P}_{t|s,{z}_{s}}(A)>0$ and ${\mu}_{t|\sigma ,{\pi}_{\sigma \alpha}z}(A)>0$ for all A ∈ a,

The left hand side is equal to ${\sum}_{A\in \mathrm{a}}{\mu}_{t|\alpha ,z}(A)\mathrm{log}\frac{{\mu}_{t|\alpha ,z}(A)}{{\mu}_{t|\sigma ,{\pi}_{\sigma \alpha}z}(A)}.$ An application of Jensen’s inequality demonstrates that this is greater than or equal to zero. □
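The partition sums used in this proof can be illustrated on a finite state space. The sketch below assumes that the expression from [26] (Lemma 1.4.3) is the supremum over finite partitions of the sums of μ(A) log(μ(A)/P(A)), with the conventions for zero cells stated above. It shows that each such sum is non-negative and that merging cells can only decrease it, which is the Jensen step in a discrete form. The measures and partitions are arbitrary illustrative choices.

```python
import math

def partition_sum(mu, p, cells):
    """Sum over cells A of mu(A) log(mu(A) / P(A)), with the conventions of the text."""
    total = 0.0
    for cell in cells:
        m = sum(mu[i] for i in cell)
        q = sum(p[i] for i in cell)
        if m == 0:
            continue            # summand taken to be zero
        if q == 0:
            return math.inf     # mu(A) > 0 while P(A) = 0
        total += m * math.log(m / q)
    return total

mu = [0.1, 0.4, 0.2, 0.3]
p = [0.25, 0.25, 0.25, 0.25]

fine = [[0], [1], [2], [3]]
coarse = [[0, 2], [1, 3]]       # obtained from `fine` by merging cells
trivial = [[0, 1, 2, 3]]

for cells in (trivial, coarse, fine):
    print(cells, partition_sum(mu, p, cells))
```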

Remark. If, contrary to the definition, we briefly consider $\mu_{|[0,t],x}$ to be a probability measure on $\mathcal{T}$, such that $\mu_{|[0,t],x}(A)=1$ where A is the set of all points y such that y_{s} = x_{s} for all s ≤ t, then it may be seen from the definition of R that

We have also made use of the Markov property of P. This is why our convention, to which we now return, is to consider $\mu_{|[0,t],x}$ to be a probability measure on (C([t, T]; ℝ^{d}), F_{t,T}).

This leads us to the following expressions for R(μ||P).

**Lemma 3.** Each σ in the supremums below is of the form {t_{1} < t_{2} < … < t_{m(σ)−1} < t_{m(σ)}} for some integer m(σ).

**Proof.** Consider the sub σ-algebra
${F}_{0,{t}_{m(\sigma )-1}}$. We then find, through an application of Lemma 1 and (8), that

We may continue inductively to obtain the first identity.

We use Theorem 2 to prove the second identity. It suffices to take the supremum over J_{*}, because R_{σ}(μ||P) ≥ R_{γ}(μ||P) if γ ⊂ σ. It thus suffices to prove that

However, this also follows from repeated application of Lemma 1. To prove the third identity, we firstly note that

The proof of this is entirely analogous to that of the second identity, except that it makes use of (5) instead of (4). However, after another application of Lemma 1, we also have that

On equating these two different expressions for ${R}_{{F}_{s;t}}(\mu \Vert P)$, we obtain

Let (σ^{(k)}) ⊂ J_{s}, σ^{(k−1)} ⊆ σ^{(k)} be such that $\lim_{k\to \infty}R_{\sigma^{(k)}}(\mu \Vert P)=R_{F_{0,s}}(\mu \Vert P)$. Such a sequence exists by (4). Similarly, let (γ^{(k)}) ⊆ J_{s} be a sequence such that $E^{\mu_{\gamma^{(k)}}(x)}\left[R_{t}\left(\mu_{t|\gamma^{(k)},x}\Vert P_{t|s,x_{s}}\right)\right]$ is non-decreasing and, as k → ∞, asymptotically approaches $\sup_{\sigma \in \mathrm{J}_{s}}E^{\mu_{\sigma}(x)}\left[R_{t}\left(\mu_{t|\sigma ,x}\Vert P_{t|s,x_{s}}\right)\right]$. Lemma 2 dictates that

#### 4.1. Proof of Theorem 1

In this section we work towards the proof of Theorem 1, making use of some results in [22]. However, we first require some more definitions.

If K ⊂ ℝ^{d} is compact, let D_{K} be the set of all f ∈ D whose support is contained in K. The corresponding space of real distributions is D′, and we denote the action of θ ∈ D′ by 〈θ, f〉. If θ ∈ $\mathcal{M}$(ℝ^{d}), then clearly 〈θ, f〉 = E^{θ}[f]. We let $C_{0}^{2,1}(\mathbb{R}^{d})$ denote the set of all continuous functions of compact support possessing continuous spatial derivatives of first and second order and a continuous time derivative of first order. For f ∈ D and t ∈ [0, T], we define the random variable ∇_{t}f : ℝ^{d} → ℝ^{d} such that $(\nabla_{t}f(y))^{i}=\sum_{j=1}^{d}a^{ij}(t,y)\frac{\partial f}{\partial y^{j}}$ (for x ∈ $\mathcal{T}$ we may also understand ∇_{t}f(x) := ∇_{t}f(x_{t})). Let a_{ij} be the components of the matrix inverse of a^{ij}. For random variables X, Y : $\mathcal{T}$ → ℝ^{d}, we define the inner product $(X,Y)_{t,x}=\sum_{i,j=1}^{d}X^{i}(x)Y^{j}(x)a_{ij}(t,x_{t})$ with associated norm $|X|_{t,x}^{2}=(X(x),X(x))_{t,x}$. We note that $|\nabla_{t}f|_{t,x}^{2}=\sum_{i,j=1}^{d}a^{ij}(t,x_{t})\frac{\partial f}{\partial z^{i}}(x_{t})\frac{\partial f}{\partial z^{j}}(x_{t})$.
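The last identity can be checked numerically for d = 2, taking the squared norm of X to be the inner product of X with itself. In the sketch below an arbitrary symmetric positive-definite matrix stands in for a(t, x_t) and an arbitrary vector stands in for the Euclidean gradient of f at x_t; both are illustrative values, and the helper names are hypothetical.

```python
def mat_vec(a, v):
    """Product of a 2x2 matrix with a vector of length 2."""
    return [sum(a[i][j] * v[j] for j in range(2)) for i in range(2)]

def inverse_2x2(a):
    """Inverse of an invertible 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def inner(x, y, a_inv):
    """(X, Y)_{t,x} = sum_{i,j} X^i Y^j a_ij, where a_ij are the entries of a^{-1}."""
    return sum(x[i] * y[j] * a_inv[i][j] for i in range(2) for j in range(2))

a = [[2.0, 0.5], [0.5, 1.0]]      # stands in for a(t, x_t)
grad_f = [0.3, -1.2]              # stands in for the partial derivatives of f at x_t

nabla_t_f = mat_vec(a, grad_f)    # (nabla_t f)^i = sum_j a^{ij} df/dy^j
lhs = inner(nabla_t_f, nabla_t_f, inverse_2x2(a))
rhs = sum(a[i][j] * grad_f[i] * grad_f[j] for i in range(2) for j in range(2))
print(lhs, rhs)
```

The two factors of a introduced by ∇_t f are reduced to one by the inverse matrix in the inner product, which is why only a^{ij} appears on the right hand side.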

Let M be the space of all continuous maps [0, T] → M(ℝ^{d}), equipped with the topology of uniform convergence. For s ∈ [0, T], ϑ ∈ M and ν ∈ $\mathcal{M}$(ℝ^{d}) we define n(s, ϑ, ν) ≥ 0 and such that

This definition is taken from [22] (Equation (4.7))—we note that n is convex in ϑ. For γ ∈ $\mathcal{M}$(T), we may naturally write n(s, γ, ν) := n(s, ω, ν), where ω is the projection of γ onto M, i.e., ω(s) = γ_{s}. It is shown in [22] that this projection is continuous. The following two definitions, lemma and two propositions are all taken (with some small modifications) from [22].

**Definition 2.** Let I be an interval of the real line. A measure μ ∈ $\mathcal{M}$(T) is called absolutely continuous if for each compact set K ⊂ ℝ^{d} there exists a neighbourhood U of 0 in K and an absolutely continuous function H_{K} : I → ℝ such that

_{K}.

**Lemma 4.** [22] (Lemma 4.2) If μ is absolutely continuous over an interval I, then its derivative exists (in the distributional sense) for Lebesgue a.e. t ∈ I. That is, for Lebesgue a.e. t ∈ I, there exists $\dot{\mu}_{t}\in D^{\prime}$ such that for all f ∈ D

**Definition 3.** For ν ∈ $\mathcal{M}$(C([s, t]; ℝ^{d})), and 0 ≤ s < t ≤ T, let
${L}_{s,t}^{2}(\nu )$ be the Hilbert space of all measurable maps h : [s, t] × ℝ^{d} → ℝ^{d} with inner product

We denote by
${L}_{s,t,\nabla}^{2}(\nu )$ the closure in
${L}_{s,t}^{2}(\nu )$ of the linear subset generated by maps of the form (x, u) → ∇_{u}f, where
$f\in {C}_{0}^{2,1}([s,t],{R}^{d})$. We note that functions in
${L}_{s,t,\nabla}^{2}(\nu )$ only need to be defined du⊗ν_{u}(dx) almost everywhere.

Recall that n is defined in (13), and note that $\langle *{\mathcal{L}}_{t}{\mu}_{t},f\rangle :=\langle {\mu}_{t},{\mathcal{L}}_{t}f\rangle $.

**Proposition 1.** Assume that μ ∈ $\mathcal{M}$(C([r, s]; ℝ^{d})), such that μ_{r} = δ_{y} for some y ∈ ℝ^{d} and 0 ≤ r < s ≤ T. We have that [22] (Equation 4.9 and Lemma 4.8)

It clearly suffices to take the supremum over a countable dense subset. Assume now that $\int_{r}^{s}\mathfrak{n}(t,\dot{\mu}_{t}-*\mathcal{L}_{t}\mu_{t},\mu_{t})^{2}\,dt<\infty$. Then for Lebesgue a.e. t, $\dot{\mu}_{t}=*\mathcal{K}_{t}\mu_{t}$, where [22] (Lemma 4.8(3))

Remark. We reach (17) from the proof of Lemma 9 in [22] (Eq 4.10). One should note also that in equation (4.10) of [22] the relative entropy R is written as $L_{\nu}^{(1)}$. To reach (18), we also use the equivalence between (4.7) and (4.8) in [22].

**Proposition 2.** Assume that μ ∈ $\mathcal{M}$($\mathcal{T}$), such that μ_{r} = δ_{y} for some y ∈ ℝ^{d} and 0 ≤ r < s ≤ T. If $R_{F_{r,s}}(\mu \Vert P_{|r,y})<\infty$, then μ is absolutely continuous on [r, s], and [22] (Lemma 4.9)

Here the derivative $\dot{\mu}_{t}$ is defined in Lemma 4. For all f ∈ D, [22] (Eq. (4.35))

We are now ready to prove Theorem 1 (the central result).

**Proof.** Fix a partition σ = {t_{1}, …, t_{m}}. We may conclude from (9) and (17) that

The integrand on the right hand side is measurable with respect to $E^{\mu_{[0,t_{j}]}(x)}$ due to the equivalent expression (14). We may infer from (18) that

This last step follows by noting that if ν ∈ $\mathcal{M}$(ℝ^{d}), and f ∈ C_{b}(ℝ^{d}), and the expectation of f with respect to ν is finite, then there exists a sequence (K_{n}) of compact subsets of ℝ^{d} such that

In turn, for each n there exist $(f_{n}^{(m)})\subset D_{K_{n}}$ such that we may write

This allows us to conclude that the two supremums are the same. The last expression in (20) is merely

By (11), this is greater than or equal to

We thus obtain the theorem using (10).

## 5. Some Corollaries

We state some corollaries of Theorem 1. In the course of this section we make progressively stronger assumptions on the nature of μ, culminating in the elegant expression for R(μ||P) when μ is a solution of a martingale problem. We finish by comparing our work with that of [19,20].

**Corollary 1.** Suppose that μ ∈ $\mathcal{M}$($\mathcal{T}$) and R(μ||P) < ∞. Then for all s and μ a.e. x, μ_{|[0,s],x} is absolutely continuous over [s, T]. For each s ∈ [0, T] and μ a.e. x ∈ $\mathcal{T}$, for Lebesgue a.e. t ≥ s

Furthermore,

For any dense countable subset Q_{0,T} of [0, T], there exists a sequence of partitions σ^{(n)} ⊂ σ^{(n+1)} ⊂ Q_{0,T}, such that as n → ∞, |σ^{(n)}| → 0, and

Remark. It is not immediately clear that we may simplify (23) further (barring further assumptions). The reason for this is that we only know that $E^{\mu_{|[0,\underline{\sigma}(t)],w}(z)}\left[\left|h_{\underline{\sigma}(t),w}^{\mu}(t,z)\right|_{t,z}^{2}\right]$ is measurable (as a function of w), but it has not been proven that $h_{\underline{\sigma}(t),w}^{\mu}(t,z)$ is measurable (as a function of w).

**Proof.** Let σ = {0 = t_{1}, …, t_{m} = T} be an arbitrary partition. For all j < m, we find from Lemma 3 that $R_{\mathcal{F}_{t_{j},t_{j+1}}}\left(\mu_{|[0,t_{j}],x}\Vert P_{|t_{j},x_{t_{j}}}\right)<\infty$ for $\mu_{[0,t_{j}]}$ a.e. x ∈ C([0, t_{j}]; ℝ^{d}). We thus find that, for all such x, $\mu_{|[0,t_{j}],x}$ is absolutely continuous on [t_{j}, t_{j+1}] from Proposition 2. We are then able to obtain (21) and (22) from Propositions 1 and 2. From (2), (16) and (21) we find that

The above integral must be finite (since we are assuming R(μ||P) is finite). Furthermore $E^{\mu_{t|[0,\underline{\sigma}(t)],x}(z)}\left[\left|h_{\underline{\sigma}(t),x}^{\mu}(t,z)\right|_{t,z}^{2}\right]$ is (t, x) measurable as a consequence of the equivalent form (14). This allows us to apply Fubini’s theorem to obtain (23). The last statement on the sequence of maximising partitions follows from Theorem 2.

**Corollary 2.** Suppose that R(μ||P) < ∞. Suppose that for all s ∈ Q_{0,T} (any countable, dense subset of [0, T]), for μ a.e. x and Lebesgue a.e. t, $h_{s,x}^{\mu}(t,x_{t})=E^{\mu_{|[0,s],x;t,x_{t}}(w)}[h^{\mu}(t,w)]$ for some progressively measurable random variable h^{μ} : [0, T] × $\mathcal{T}$ → ℝ^{d}. Then

**Proof.** Let G^{s,x;t,y} be the sub σ-algebra consisting of all B ∈ B($\mathcal{T}$) such that for all w ∈ B, w_{r} = x_{r} for all r ≤ s and w_{t} = y. Thus $h_{s,x}^{\mu}(t,y)=E^{\mu_{|[0,s],x;t,y}(w)}[h^{\mu}(t,w)]=E^{\mu}[h^{\mu}(t,\cdot )|G^{s,x;t,y}]$. By [27] (Corollary 2.4), since $\cap_{s<t}G^{s,x;t,x_{t}}=G^{t,x;t,x_{t}}$ (restricting to s ∈ Q_{0,T}), for μ a.e. x,

By the properties of the regular conditional probability, we find from (24) that

By assumption, the above limit is finite. Thus by Fatou’s lemma, and using the properties of the regular conditional probability,

Through use of (26),

Conversely, through an application of Jensen’s inequality to (27)

A property of the regular conditional probability yields

Remark. The condition in the above corollary is satisfied when μ is a solution to a martingale problem—see Lemma 5.

We may further simplify the expression in Theorem 1 when μ is a solution to the following martingale problem. Let {c^{jk}, e^{j}} be progressively measurable functions [0, T] × T → ℝ. We suppose that c^{jk} = c^{kj}. For all 1 ≤ j, k ≤ d, c^{jk}(t, x) and e^{j}(t, x) are assumed to be bounded for x ∈ L (where L is compact) and all t ∈ [0, T]. For
$f\in {C}_{0}^{2}({\mathrm{R}}^{d})$ and x ∈ T, let

We assume that for all such f, the following is a continuous martingale (relative to the canonical filtration) under μ

The law governing X_{0} is stipulated to be ν ∈ $\mathcal{M}$(ℝ^{d}).

From now on we switch from our earlier convention and we consider μ_{|[0,s],x} to be a measure on $\mathcal{T}$ such that, for μ a.e. x ∈ $\mathcal{T}$, μ_{|[0,s],x}(A_{s,x}) = 1, where A_{s,x} is the set of all X ∈ $\mathcal{T}$ satisfying X_{t} = x_{t} for all 0 ≤ t ≤ s. This is a property of a regular conditional probability (see Theorem 3.18 in [23]). Similarly, μ_{|s,x;t,y} is considered to be a measure on $\mathcal{T}$ such that for μ a.e. x ∈ $\mathcal{T}$, μ_{|s,x;t,y}(B_{s,x;t,y}) = 1, where B_{s,x;t,y} is the set of all X ∈ A_{s,x} such that X_{t} = y. We may apply Fubini’s Theorem (since f is compactly supported and bounded) to (28) to find that

This ensures that μ_{|[0,s],x} is absolutely continuous over [s, T], and that

**Lemma 5.** If R(μ||P) < ∞ then for Lebesgue a.e. t ∈ [0, T] and μ a.e. x ∈ T,

If R(μ||P) < ∞ then

**Proof.** It follows from R(μ||P) < ∞, (21) and (22) that for all s and μ a.e. x, for Lebesgue a.e. t ≥ s

Let us take a countable dense subset Q_{0,T} of [0, T]. There thus exists a null set N ⊆ [0, T] such that for every s ∈ Q_{0,T}, μ a.e. x and every t ∉ N the above equation holds. We may therefore conclude (30) using [27] (Corollary 2.4) and taking s → t^{−}. From (29), we observe that for all s ∈ [0, T] and μ a.e. x, for Lebesgue a.e. t

Equation (31) thus follows from Corollary 2.

#### 5.1. Comparison of our Results to Those of Fischer et al. [19,20]

We have already noted in the introduction that one may infer a variational representation of the relative entropy from [19,20] by assuming that the coefficients of the underlying stochastic process are independent of the empirical measure in these papers. The assumptions in [20] on the underlying process P are in some respects more general and in others more restrictive than ours. They are more general insofar as the coefficients of the SDE may depend on the past history of the process and the diffusion coefficient is allowed to be degenerate. However, our assumptions are more general insofar as we only require P to be the unique (in the sense of probability law) weak solution of the SDE, whereas [20] requires P to be the unique strong solution of the SDE. Of course when both sets of assumptions are satisfied, one may infer that the expressions for the relative entropy are identical.

## 6. Proof of Theorem 2

The following is an alternative proof to that of [25] (Theorem 6.6) employing the theory of Large Deviations. The fact that, if α ⊆ σ, then R_{α}(μ||P) ≤ R_{σ}(μ||P), follows from Lemma 1. We prove the first expression (4) in the case s = 0, t = T (the proof of the second identity (5) is analogous).

**Definition 4.** A sequence of probability laws Γ^{N} on some topological space Ω equipped with its Borelian σ-algebra is said to satisfy a strong Large Deviation Principle with rate function I : Ω → ℝ if for all open sets O,

If furthermore the set {x : I(x) ≤ α} is compact for all α ≥ 0, we say that I is a good rate function.

We define the following empirical measures.

**Definition 5.** For x ∈ $\mathcal{T}^{N}$, $y\in \mathcal{T}_{\sigma}^{N}$, let

Clearly $\widehat{\mu}_{\sigma}^{N}(x_{\sigma})=\pi_{\sigma}(\widehat{\mu}^{N}(x))$. The image law $P^{\otimes N}\circ (\widehat{\mu}^{N})^{-1}$ is denoted by $\Pi_{s,t}^{N}\in \mathcal{M}(\mathcal{M}(\mathcal{T}))$. Similarly, for σ ∈ J, the image law $P_{\sigma}^{\otimes N}\circ (\widehat{\mu}_{\sigma}^{N})^{-1}$ on $\mathcal{M}$($\mathcal{T}_{\sigma}$) is denoted by $\Pi_{\sigma}^{N}\in \mathcal{M}(\mathcal{M}(\mathcal{T}_{\sigma}))$. Since $\mathcal{T}$ and $\mathcal{T}_{\sigma}$ are Polish spaces, we have by Sanov’s theorem (see Theorem 6.2.10 in [14]) that Π^{N} satisfies a strong Large Deviation Principle with good rate function R(⋅||P). Similarly, $\Pi_{\sigma}^{N}$ satisfies a strong Large Deviation Principle on $\mathcal{M}$($\mathcal{T}_{\sigma}$) with good rate function $R_{F_{\sigma}}(\cdot \Vert P)$.

We now define the projective limit $\underline{\mathcal{M}}(\mathcal{T})$. If α, γ ∈ J, α ⊂ γ, then we may define the projection $\pi_{\alpha \gamma}^{\mathcal{M}}:\mathcal{M}(\mathcal{T}_{\gamma})\to \mathcal{M}(\mathcal{T}_{\alpha})$ as $\pi_{\alpha \gamma}^{\mathcal{M}}(\xi ):=\xi \circ \pi_{\alpha \gamma}^{-1}$. An element of $\underline{\mathcal{M}}(\mathcal{T})$ is then a member ⊗_{σ}ζ(σ) of the Cartesian product ⊗_{σ∈J} $\mathcal{M}$($\mathcal{T}_{\sigma}$) satisfying the consistency condition $\pi_{\alpha \gamma}^{\mathcal{M}}(\zeta (\gamma ))=\zeta (\alpha )$ for all α ⊂ γ. The topology on $\underline{\mathcal{M}}(\mathcal{T})$ is the minimal topology necessary for the natural projection $\underline{\mathcal{M}}(\mathcal{T})\to \mathcal{M}(\mathcal{T}_{\alpha})$ to be continuous for all α ∈ J. That is, it is generated by open sets of the form

$$A_{\gamma ,O}=\{\zeta \in \underline{\mathcal{M}}(\mathcal{T}):\zeta (\gamma )\in O\},\qquad (33)$$

for γ ∈ J and O open in $\mathcal{M}$($\mathcal{T}_{\gamma}$).

We may continuously embed $\mathcal{M}$($\mathcal{T}$) into the projective limit $\underline{\mathcal{M}}(\mathcal{T})$ of its marginals, letting ι denote this embedding. That is, for any σ ∈ J, (ι(μ))(σ) = μ_{σ}. We note that ι is continuous because ι^{−1}(A_{γ,O}) is open in $\mathcal{M}$($\mathcal{T}$), for all A_{γ,O} of the form in (33). We equip $\underline{\mathcal{M}}(\mathcal{T})$ with the Borelian σ-algebra generated by this topology. The embedding ι is measurable with respect to this σ-algebra because the topology of $\mathcal{M}$($\mathcal{T}$) has a countable base. The embedding induces the image laws (Π^{N} ∘ ι^{−1}) on $\mathcal{M}(\underline{\mathcal{M}}(\mathcal{T}))$. For σ ∈ J, it may be seen that $\Pi_{\sigma}^{N}=\Pi^{N}\circ \iota^{-1}\circ (\pi_{\sigma}^{\mathcal{M}})^{-1}\in \mathcal{M}(\mathcal{M}(\mathcal{T}_{\sigma}))$, where $\pi_{\sigma}^{\mathcal{M}}(\otimes_{\alpha}\mu (\alpha ))=\mu (\sigma )$.

It follows from [22] (Thm 3.3) that Π^{N} ∘ ι^{−1} satisfies a Large Deviation Principle with rate function sup_{σ∈J} R_{σ}(μ||P). However, we note that ι is one-to-one, because any two measures μ, ν ∈ $\mathcal{M}$($\mathcal{T}$) such that μ_{σ} = ν_{σ} for all σ ∈ J must be equal. Furthermore, ι is continuous. Because of Sanov’s theorem, (Π^{N}) is exponentially tight (see Defn 1.2.17, Exercise 1.2.19 in [14] for a definition of exponential tightness and proof of this statement). These facts mean that we may apply the inverse contraction principle [14] (Thm 4.2.4) to infer that Π^{N} satisfies a Large Deviation Principle with the rate function sup_{σ∈J} R_{σ}(μ||P). Since rate functions are unique [14] (Lemma 4.1.4), we obtain the first identity in conjunction with Sanov’s theorem. The second identity (5) follows similarly. We may repeat the argument above, while restricting to σ ⊂ Q_{s,t}. We obtain the same conclusion because the σ-algebra generated by (F_{σ})_{σ⊂Q_{s,t}} is the same as F_{s,t}. The last identity follows from the fact that, if α ⊆ σ, then R_{α}(μ||P) ≤ R_{σ}(μ||P).

## Acknowledgments

This work was supported by INRIA FRM, ERC-NERVI number 227747, European Union Project # FP7-269921 (BrainScales), and Mathemacs # FP7-ICT-2011.9.7.

## Author Contributions

Both authors contributed to all aspects of the article. Both authors have read and approved the final manuscript.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

- Plastino, A.; Miller, H.; Plastino, A. Minimum Kullback entropy approach to the Fokker-Planck equation. Phys. Rev. E **1997**, 56, 3927–3934.
- Desvillettes, L.; Villani, C. On the trend to global equilibrium in spatially inhomogeneous entropy-dissipating systems. Part 1: The Linear Fokker-Planck Equation. Comm. Pure Appl. Math. **2001**, 54, 1–42.
- Yu, S.; Mehta, P. The Kullback-Leibler Rate Metric for Comparing Dynamical Systems. In Proceedings of Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, Shanghai, China, 16–18 December 2009; pp. 8363–8368.
- Georgiou, T.T.; Lindquist, A. Kullback-Leibler approximation of spectral density functions. IEEE Trans. Inf. Theory **2003**, 49, 2910–2917.
- Frittelli, M. The minimal entropy martingale measure and the valuation problem in incomplete markets. Math. Finance **2000**, 10, 39–52.
- Grandits, P.; Rheinlander, T. On the minimal entropy martingale measure. Ann. Prob. **2002**, 30, 1003–1038.
- Miyahara, Y. Minimal Relative Entropy Martingale Measure of Birth and Death Process. In Discussion Papers in Economics; Nagoya City University: Nagoya, Japan, 2000.
- Miyahara, Y. On the Minimal Entropy Martingale Measures for Geometric Lévy Processes. In Discussion Papers in Economics; Nagoya City University: Nagoya, Japan, 2000.
- Ustunel, A.S. Entropy, invertibility and variational calculus of the adapted shifts on Wiener Space. J. Funct. Anal. **2009**, 257, 3655–3689.
- Lassalle, R. Invertibility of adapted perturbations of the identity on abstract Wiener space. J. Funct. Anal. **2012**, 262, 2734–2776.
- Akaike, H. Likelihood of a model and information criteria. J. Econometrics **1981**, 16, 3–14.
- Do, M.; Vetterli, M. Wavelet-based Texture Retrieval Using Generalized Gaussian Density and Kullback–Leibler Distance. IEEE Trans. Image Process. **2002**, 11, 146–158.
- Bozdogan, H. Akaike’s Information Criterion and Recent Developments in Information Complexity. J. Math. Psychol. **2000**, 44, 62–91.
- Dembo, A.; Zeitouni, O. Large Deviations Techniques, 2nd ed.; Springer: Berlin, Germany, 1997.
- Ben-Arous, G.; Guionnet, A. Large deviations for Langevin spin glass dynamics. Probab. Theor. Relat. Field. **1995**, 102, 455–509.
- Moynot, O.; Samuelides, M. Large deviations and mean-field theory for asymmetric random recurrent neural networks. Probab. Theor. Relat. Field. **2002**, 123, 41–75.
- Faugeras, O.; MacLaurin, J. A large deviation principle for networks of rate neurons with correlated synaptic weights. **2013**, arXiv, 1302.1029.
- Faugeras, O.; MacLaurin, J. Large Deviations of an Ergodic Synchronous Neural Network with Learning. **2014**, arXiv, 1404.0732v3, math.PR.
- Budhiraja, A.; Dupuis, P.; Fischer, M. Large deviation properties of weakly interacting processes via weak convergence methods. Ann. Prob. **2012**, 40, 74–102.
- Fischer, M. On the form of the large deviation rate function for the empirical measures of weakly interacting systems. Bernoulli **2014**, 20, 1765–1801.
- Baladron, J.; Fasoli, D.; Faugeras, O.; Touboul, J. Mean field description of and propagation of chaos in recurrent multipopulation networks of Hodgkin-Huxley and Fitzhugh-Nagumo neurons. **2011**, arXiv, 1110.4294.
- Dawson, D.A.; Gärtner, J. Large deviations from the McKean-Vlasov limit for weakly interacting diffusions. Stochastics **1987**, 20, 247–308.
- Karatzas, I.; Shreve, S.E. Brownian Motion and Stochastic Calculus, 2nd ed.; Graduate Texts in Mathematics, Volume 113; Springer-Verlag: New York, NY, USA, 1991.
- Donsker, M.; Varadhan, S. Asymptotic Evaluation of Certain Markov Process Expectations for Large Time, IV. Comm. Pure Appl. Math. **1983**, XXXVI, 183–212.
- Xanh, N.X.; Zessin, H. Ergodic Theorems for Spatial Processes. Z. Wahrscheinlichkeitstheorie verw. Gebiete **1979**, 48, 133–158.
- Dupuis, P.; Ellis, R.S. A Weak Convergence Approach to the Theory of Large Deviations; John Wiley & Sons: London, UK, 1997.
- Revuz, D.; Yor, M. Continuous Martingales and Brownian Motion, 2nd ed.; Springer-Verlag: Berlin, Germany, 1991.

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).