Variational Characterization of Free Energy: Theory and Algorithms

Carsten Hartmann; Lorenz Richter; Christof Schütte; Wei Zhang

doi:10.3390/e19110626

,

and

¹

Institut für Mathematik, Brandenburgische Technische Universität Cottbus-Senftenberg, D-03046 Cottbus, Germany

²

Institut für Mathematik, Freie Universität Berlin, D-14195 Berlin, Germany

³

Zuse Institute Berlin, D-14195 Berlin, Germany

^*

Author to whom correspondence should be addressed.

Entropy2017, 19(11), 626;https://doi.org/10.3390/e19110626

This article belongs to the Special Issue Understanding Molecular Dynamics via Stochastic Processes

Version Notes

Order Reprints

Abstract

The article surveys and extends variational formulations of the thermodynamic free energy and discusses their information-theoretic content from the perspective of mathematical statistics. We revisit the well-known Jarzynski equality for nonequilibrium free energy sampling within the framework of importance sampling and Girsanov change-of-measure transformations. The implications of the different variational formulations for designing efficient stochastic optimization and nonequilibrium simulation algorithms for computing free energies are discussed and illustrated.

Keywords:

importance sampling; Donsker–Varadhan principle; thermodynamic free energy; nonequilibrium molecular dynamics; stochastic approximation; cross-entropy method

1. Introduction

It is one of the standard problems in statistical physics and its computational applications, e.g., in molecular dynamics, that one desires to compute expected values of an observable f with respect to a given (equilibrium) probability density

π

,

E_{π} [f] = \int f (x) π (x) d x

Even if samples from the density

π

are available, the simplest Monte Carlo estimator, the mean value, may suffer from a large variance (compared to the quantity that one tries to estimate), such that the accurate estimation of

E_{π} [f]

requires an unreasonably large sample size. Various approaches to circumvent this problem and to reduce the variance of an estimator are available, one of the most prominent representatives being importance sampling where samples are drawn from another probability density

ρ

and reweighted with the likelihood ratio

π / ρ

[1,2]. It is well-known that theoretically (and under certain assumptions) there exists an optimal importance sampling density

ρ^{*}

such that the resulting estimator has variance zero. By a clever choice of the importance sampling proposal density, it is thus possible to completely remove the stochasticity from the problem and to obtain what is sometimes called certainty equivalence. Yet, drawing samples from the optimal density (or an approximation of it) is a difficult problem in itself, so that the striking variance-reduction due to importance sampling often does not pay off in practice.

The zero variance property of importance sampling and the challenge to utilize it algorithmically is the starting point of this article where the focus is on its generalization to path sampling problems and its algorithmic realization. Regarding the former, we will show that the Donsker–Varadhan variational principle, a well-known measure-theoretic characterization of the cumulant generating functions [3] that gives rise to a variational characterization of the thermodynamic free energy [4,5] permits several stunning utilizations of the importance sampling framework for path sampling problems; examples involve trajectory-dependent expectations like expected hitting times or free energy differences [6,7]. We will see that finding the optimal change of measure in path space is equivalent to solving an optimal control problem for the underlying dynamical system in which the dynamics is controlled by external driving forces and thus driven out of equilibrium [8,9].

One of the central contributions of this paper is that we prove that the resulting path space importance sampling scheme features zero variance estimators under quite general assumptions. We furthermore elaborate on the connection between optimized importance sampling and the famous Jarzynski fluctuation relation for the thermodynamic free energy [10]. In particular we will explore this connection and describe how to devise better non-equilibrium free energy algorithms, and hopefully obtain a better understanding of Jarzynski-based estimators; cf. [11,12,13,14].

Regarding the algorithmic realization, the theoretical insight into the relation between (adaptive) importance sampling and optimal control leads to novel algorithms that aim at utilizing the zero variance property without having to sample from the optimal importance sampling density. We will demonstrate how this can be achieved by discretizing the optimal control problem, using ideas from stochastic approximation and stochastic optimization [9,15]; see [16,17,18] for an alternative approach using ideas from the theory of large deviations. The examples we present are mainly pedagogical and admittedly very simple, but they highlight important features of the importance sampling scheme, such as the exponential tilting of the (path space) probability measure or the uniqueness of the solution to the stochastic approximation problem within a certain parametric family of trial probability measures, and this is why we confine our attention to such low-dimensional examples. Regarding the application of our approach to molecular dynamics simulation, we allude to the relevant literature.

Outline

The article is organized as follows: Firstly, in Section 2 we review certainty equivalence and the zero variance property of optimized importance sampling in state space, starting from the Donsker–Varadhan principle and its relation to importance sampling, and comment on some algorithmic issues. Then, in Section 3, we consider the generalization to path space, discuss the relation to stochastic optimal control and revisit Jarzynski-based estimators for thermodynamic free energies. Section 4 surveys and discusses novel algorithms that are exploiting the theoretical properties of the control-based importance sampling scheme. We briefly discuss some of these algorithms with simple toy examples in Section 5, before the article concludes in Section 6 with a brief summary and a discussion of open issues. The article contains four appendices that record various technical identities, including a brief derivation of Girsanov’s change of measure formula, and the proof of the main theorem: the zero-variance property of optimized importance sampling on path space.

2. Certainty Equivalence

In mathematical finance, the guaranteed payment that an investor would accept instead of a potentially higher, but uncertain return on an asset is called a certainty equivalent. In physics, certainty equivalence amounts to finding a deterministic surrogate system that reproduces averages of certain fluctuating thermodynamic quantities with probability one. One such example is the thermodynamic free energy difference between two equilibrium states that can be either computed by an exponential average over the fluctuating nonequilibrium work done on the system or by measuring the work of an adiabatic transformation between these states.

2.1. Donsker–Varadhan Variational Principle

Before getting into the technical details, we briefly review the classical Donsker–Varadhan variational principle for the cumulant generating function of a random variable. To this end, let X be a real-valued, n-dimensional random variable with smooth probability density

π

and call

E_{π} [f (X)] = \int_{R^{n}} f (x) π (x) d x

(1)

the expectation with respect to

π

for any integrable function

f : R^{n} \to R

.

Definition 1.

Let

W : R^{n} \to R

be a bounded random variable. The quantity

γ : B (R^{n}) \to R, W \mapsto - log E_{π} [exp (- W)]

(2)

is called the free energy of the random variable

W = W (X)

with respect to π, where

B (R^{n})

is the set of bounded and measurable, real-valued functions on

R^{n}

(“If you can write it down, it’s measurable!”, S. R. Srinivasa Varadhan).

Definition 2.

Let ρ be another probability density on

R^{n}

. Then

D (ρ | π) = \int_{R^{n}} log \frac{ρ (x)}{π (x)} ρ (x) d x

(3)

is called the relative entropy of ρ with respect to π (or: Kullback–Leibler divergence), provided that

π (x) = 0

implies that

ρ (x) = 0

for every

x \in R^{n}

. Otherwise, we set

D (ρ | π) = + \infty

.

The requirement that

π (x)

must not be zero without

ρ (x)

being zero is known as absolute continuity and guarantees that the likelihood ratio

L = ρ / π

is well defined. In what follows, we may assume without loss of generality that

π > 0

. (Otherwise we may exclude those states

x \in R^{n}

for which

π (x) = 0

.)

A well-known thermodynamic principle states that the free energy is the Legendre transform of the entropy. The following variant of this principle is due to Donsker and Varadhan and says that (e.g., see [3] and the references therein)

- log E_{π} [exp (- W)] = min_{ρ \geq 0} \{E_{ρ} [W] + D (ρ | π)\},

(4)

where the minimum is over all probability density functions

ρ

on

R^{n}

. The last equality easily follows from Jensen’s inequality by noting that

\begin{matrix} - log \int_{R^{n}} exp (- W) π d x & = - log \int_{R^{n}} exp (- W) \frac{π}{ρ} ρ d x \\ = - log \int_{R^{n}} exp (- W - log \frac{ρ}{π}) ρ d x \\ \leq \int_{R^{n}} (W + log \frac{ρ}{π}) ρ d x \\ = \int_{R^{n}} W ρ d x + \int_{R^{n}} log \frac{ρ}{π} ρ d x . \end{matrix}

Additionally, it can be readily seen that equality is attained if and only if

ρ^{*} = exp (γ - W) π,

(5)

which defines a probability measure with

γ

given in Equation (2).

Importance Sampling

The relevance of Equations (4) and (5) lies in the fact that, by sampling X from the probability distribution with density

ρ^{*}

, one removes the stochasticity from the problem, since the random variable

U^{*} = W (X) + log \frac{ρ^{*} (X)}{π (X)}

is almost surely (a.s.) constant. As a consequence, the Monte Carlo scheme for computing the free energy on the left hand side of Equation (4) based on the empirical mean of the independent draws of

U^{*} = U^{*} (X)

with

X \sim ρ^{*}

, will have a zero variance. This zero-variance property is a consequence of Jensen’s inequality and the strict concavity of the logarithmic function which implies that equality is attained if and only if the random variable inside the expectation is almost surely constant. The next statement makes this precise.

Theorem 1 (Optimal importance sampling).

Let

ρ^{*}

be the probability density given in Equation (5). Then the random variable

Z = exp (- W) π / ρ^{*}

has zero variance under

ρ^{*}

, and we have:

Z = E_{π} [exp (- W)], ρ^{*} - a . s .

Proof.

We need to show that

{Var}_{ρ^{*}} (Z) = E_{ρ^{*}} [Z^{2}] - {(E_{ρ^{*}} [Z])}^{2} = 0

. Using Equation (5) and noting that

ρ^{*} > 0

since W is bounded and

π > 0

, it follows that Z has finite second moment and

\begin{matrix} E_{ρ^{*}} [Z^{2}] - {(E_{ρ^{*}} [Z])}^{2} & = E_{π} [exp (- 2 W) π / ρ^{*}] - {(E_{π} [exp (- W)])}^{2} \\ = exp (- γ) E_{π} [exp (- W)] - exp (- 2 γ) \\ = 0, \end{matrix}

where we have used that

exp (- γ) = E_{π} [exp (- W)]

. ☐

The above theorem asserts that

ρ^{*}

-almost surely (

ρ^{*}

-a.s.)

Z = E_{ρ^{*}} [Z],

(6)

which means that the importance sampling scheme based on estimating

E_{ρ^{*}} [Z]

using draws from the density

ρ^{*}

is a zero-variance estimator of

E_{π} [exp (- W)]

. We will discuss the problem of drawing from an approximation of the optimal distribution

ρ^{*}

later on in Section 4.

Remark 1.

Equation (4) furnishes the famous relation

F = U - T S

for the Helmholtz free energy F, with U being the internal energy, T the temperature and S denoting the Gibbs entropy. If we modify the previous assumptions by setting

π \equiv 1

and

W = β E

where

β = {(k_{B} T)}^{- 1}

with

k_{B} > 0

being Boltzmann’s constant and E denoting a smooth potential energy function that is bounded from below and growing at infinity, then

\underset{= F}{\underset{︸}{- β^{- 1} log \int exp (- β E) d x}} = min_{ρ > 0} \{\underset{= U}{\underset{︸}{\int E ρ d x}} + \underset{= - T S}{\underset{︸}{β^{- 1} \int ρ log ρ d x}}\},

with the unique minimizer being the Gibbs-Boltzmann density

ρ^{*} = exp (- β E) / Z

with normalization constant

Z = exp (- β F)

. In the language of statistics,

ρ^{*}

is a probability distribution from the exponential family with sufficient statistic

E (X)

and parameter

β > 0

.

An alternative variational characterization of expectations is discussed in Appendix A.

2.2. Computational Issues

In practice, the above result is of limited use because the optimal importance sampling distribution is only known up to the normalizing constant C where the latter is just the sought quantity

C = exp (- γ)

. Clearly we can resort to Markov Chain Monte Carlo (MCMC) to generate samples

{({\hat{Y}}_{i})}_{i \geq 1}

that are asymptotically distributed according to (see, e.g., [19])

ρ^{*} (y) = \frac{exp (- Φ (y))}{C}, Φ (y) = V (y) + W (y) .

However, in the situation at hand, we wish estimate

E_{ρ^{*}} (Z)

in Equation (6), where

Z = exp (- W) π / ρ^{*}

is given in Theorem 1, and the problem is that the likelihood ratio

π / ρ^{*}

is only known up to the normalizing factor. In this case, the self-normalized importance sampling estimator must be used (see, e.g., [20]):

{\hat{C}}_{N} = \frac{\sum_{i = 1}^{N} exp (- W ({\hat{Y}}_{i})) exp (W ({\hat{Y}}_{i}))}{\sum_{i = 1}^{N} exp (W ({\hat{Y}}_{i}))} = \frac{N}{\sum_{i = 1}^{N} exp (W ({\hat{Y}}_{i}))},

(7)

which is a consistent estimator for

C = exp (- γ)

. Note that unlike in the case of the importance sampling estimators with known likelihood ratio, the self-normalized estimator is only asymptotically unbiased, even if we can draw exactly from

ρ^{*}

(see Appendix B for details.)

To avoid the bias due to the self-normalization, it is helpful to note that

exp (γ) = E_{ρ^{*}} [exp (W)]

holds

ρ^{*}

-a.s. As a consequence,

{\hat{C}}_{N}^{- 1} = \frac{\sum_{i = 1}^{N} exp (W ({\hat{Y}}_{i}))}{N}

(8)

is an unbiased estimator of

C^{- 1} = exp (γ)

, provided that we can generate i.i.d. samples from

ρ^{*}

. Taking the logarithm, it follows that

{\hat{γ}}_{N} = - log N + log (\sum_{i = 1}^{N} exp (W ({\hat{Y}}_{i})))

(9)

is a consistent estimator for

γ

, which by Jensen’s inequality and the strict concavity of the logarithm will again be only asymptotically unbiased.

Comparison with the Standard Monte Carlo Estimator

In most cases, the samples from

ρ^{*}

will be generated by MCMC or the like. If we consider the advantages of

{\hat{γ}}_{N}

as compared to the plain vanilla Monte Carlo estimator

{\tilde{γ}}_{N} = log N - log (\sum_{i = 1}^{N} exp (- W ({\hat{X}}_{i}))),

(10)

with

{({\hat{X}}_{i})}_{i \geq 1}

being a sample that is drawn from the reference distribution

π

, there are two aspects that will influence the efficiency of Equation (9) relative to Equation (10), namely:

(a): the speed of convergence towards the stationary distribution and
(b): the (asymptotic) variance of the estimator.

By construction, the asymptotic variance of the importance sampling estimator is zero or close to zero if we take numerical discretization errors into account, hence the efficiency of the estimator (9) is solely determined by the speed of convergence of the corresponding MCMC algorithm to the stationary distribution

ρ^{*}

, which, depending on the problem at hand, may be larger or smaller than the speed of convergence to

π

. It may even happen that

π

is unimodal, whereas

ρ^{*} \propto e^{- W} π

is multimodal and hence difficult to generate, for example when

π

is the standard Gaussian density and

W = {(x^{2} - d)}^{2}

with

d ≫ 0

is a bistable (energy) function. We refrain from going into details here and instead refer to the review article [21] for an in-depth discussion of the asymptotic properties of reversible diffusions.

In Section 4 and Section 5, we discuss alternatives to Monte Carlo sampling based on stochastic optimization and approximation algorithms that are feasible even for large-scale systems.

Remark 2.

The comparison of Equations (9) and (10) suggests that the importance sampling estimator (9) is an instance of Bennett’s bidirectional estimator for a positive random variable, called “weighting function” in Bennett’s language [22]. As a consequence, Theorem 1 implies that Bennett’s bidirectional estimator has zero variance when the negative logarithm of the weighting function equals the bias potential.

3. Certainty Equivalence in Path Space

The previous considerations nicely generalize from the case of real-valued random variables to time-dependent problems and path functionals.

3.1. Donsker–Varadhan Variational Principle in Path Space

Let

{(X_{s})}_{s \geq 0}

with

X_{0} = x \in R^{n}

be the solution of the stochastic differential equation (SDE)

d X_{s} = b (X_{s}, s) d s + σ (X_{s}) d B_{s}, X_{0} = x,

(11)

where

b : R^{n} \times [0, \infty) \to R^{n}

is a smooth, possibly time-dependent vector field,

σ : R^{n} \to R^{n \times m}

is a smooth matrix field and B is an m-dimensional Brownian motion. Our standard example will be an SDE with

b (x, s) = - \nabla V (x)

for a smooth potential energy function V and

σ (x) = \sqrt{2} I_{n \times n}

, so that

X_{s}

satisfies a gradient dynamics. We assume throughout this paper that the functions

b, σ, V

are such that Equation (11) or the corresponding gradient dynamics have unique strong solutions for all

s \geq 0

.

Now, suppose that we want to compute the free energy (2) where W is now considered to be a functional of the paths

X = {X_{s} : 0 \leq s \leq τ}

for some bounded stopping time

τ

:

W_{τ} (X) = \int_{0}^{τ} f (X_{s}, s) d s + g (X_{τ}),

(12)

for some bounded and sufficiently smooth, real valued functions

f, g

. We assume throughout the rest of the paper that

f, g

are bounded from below and that W is integrable.

We define P to be the probability measure on the space

Ω = C ([0, \infty), R^{n})

of continuous trajectories that is induced by the Brownian motion

{(B_{s})}_{s \geq 0}

that drives the SDE (11). We call P a path space measure, and we denote the expectation with respect to P by

E_{P} [\cdot]

.

Definition 3 (Path space free energy).

Let

{(X_{s})}_{s \geq 0}

be the solution of Equation (11) and

W_{τ} = W_{τ} (X) \geq 0

be integrable and defined by Equation (12). The quantity

γ = - log E_{P} [exp (- W_{τ})] = - log E_{P} [exp (- \int_{0}^{τ} f (X_{s}, s) d s - g (X_{τ}))]

(13)

is called the free energy of

W_{τ}

with respect to the path space measure P.

Note that Equation (13) simply is the path space version of Equation (2) which now implicitly depends on the initial condition

X_{0} = x

. The Donsker–Varadhan variational principle now reads

γ = inf_{Q ≪ P} \{E_{Q} [\int_{0}^{τ} f (X_{s}, s) d s + g (X_{τ})] + D (Q | P)\},

(14)

where

Q ≪ P

stands for absolute continuity of Q with respect to P, which means that

P (E) = 0

implies that

Q (E) = 0

for any measurable set

E \subset Ω

, as a consequence of which

D (Q | P) = \int_{Ω} log \frac{d Q}{d P} (ω) d Q (ω),

(15)

exists. Note that Equation (15) is just the generalization of the relative entropy (3) from probability densities on

R^{n}

to probability measures on the measurable space

(Ω, E)

, with

E

being a

σ

-algebra containing measurable subsets of

Ω

, where we again declare that

D (Q | P) = \infty

when Q is not absolutely continuous with respect to P. Therefore it is sufficient that the infimum in Equation (14) is taken over all path space measures

Q ≪ P

.

If

W_{τ} \geq 0

, it is again a simple convexity argument (see, e.g., [4]), which shows that the minimum in Equation (14) is attained at

Q^{*}

given by:

\frac{d Q^{*}}{d P} |_{[0, τ]} = exp (γ - W_{τ}),

(16)

where

{φ |}_{[0, τ]}

denotes the restriction of the path space density

φ (X) = (d Q^{*} / d P) (X)

to trajectories

X = {(X_{s})}_{s \geq 0}

of length

τ

. More precisely,

{φ |}_{[0, τ]}

is understood as the restriction of the measure

Q^{*}

defined by

d Q^{*} = φ d P

to the

σ

-algebra

F_{τ}

that contains all measurable sets

E \in E

, with the property that for every

t \geq 0

the set

E \cap {τ \leq t}

is an element of the

σ

-algebra

F_{t} = σ (X_{s} : 0 \leq s \leq t)

that is generated by all trajectories

{(X_{s})}_{0 \leq s \leq t}

of length t. In other words,

F_{τ} \subset E

is a

σ

-algebra that contains the history of the trajectories of (the random) length

τ

.

Even though Equation (16) is the direct analogue of Equation (5), this result is not particularly useful if we do not know how to sample from

Q^{*}

. Therefore, let us first characterize the admissible path space measures

Q ≪ P

and discuss the practical implications later on.

3.1.1. Likelihood Ratio of Path Space Measures

It turns out that the only admissible change of measure from P to Q such that

D (Q | P) < \infty

results in a change of the drift in Equation (11). Let

{(u_{s})}_{s \geq 0}

be an

R^{m}

-valued stochastic process that is adapted, in that

u_{t}

depends only on the Brownian motion

B_{s}

up to time

s \leq t

, and that satisfies the Novikov condition (see, e.g., [23]):

E_{P} [exp (\frac{1}{2} \int_{0}^{τ} {| u_{s} |}^{2} d s)] < \infty .

(17)

Now, define the auxiliary process

B_{t}^{u} = B_{t} - \int_{0}^{t} u_{s} d s .

Using the definition of

B_{t}^{u}

, we may write Equation (11) as

d X_{s} = (b (X_{s}, s) + σ (X_{s}) u_{s}) d s + σ (X_{s}) d B_{s}^{u}, X_{0} = x .

(18)

Note that Equations (11) and (18) govern the same process

{(X_{s})}_{s \geq 0}

, because the extra drift

σ (\cdot) u

is absorbed by the shifted mean of the process

{(σ (X_{s}) B_{s}^{u})}_{s \geq 0}

. By construction,

{(B_{s}^{u})}_{s \geq 0}

is not a Brownian motion under P, because its expectation with respect to P is not zero in general. On the other hand,

{(B_{s})}_{s \geq 0}

is a Brownian motion under the measure P, and our aim is to find a measure

Q ≪ P

under which

{(B_{s}^{u})}_{s \geq 0}

is a Brownian motion. To this end, let

{(Z_{s}^{u})}_{s \geq 0}

be the process defined by

Z_{t}^{u} = \int_{0}^{t} u_{s} \cdot d B_{s} - \frac{1}{2} \int_{0}^{t} {| u_{s} |}^{2} d s,

(19)

or, equivalently,

Z_{t}^{u} = \int_{0}^{t} u_{s} \cdot d B_{s}^{u} + \frac{1}{2} \int_{0}^{t} {| u_{s} |}^{2} d s .

(20)

Girsanov’s theorem (see, e.g., [23], Theorem 8.6.4, or Appendix C) now states that

{(B_{s}^{u})}_{0 \leq s \leq τ}

is a standard Brownian motion under the probability measure Q with likelihood ratio

\frac{d Q}{d P} |_{[0, τ]} = exp (Z_{τ}^{u})

(21)

with respect to P where the Novikov condition (17) guarantees that

E_{P} [exp (Z_{τ}^{u})] = 1

, i.e., that Q is a probability measure. Inserting Equations (20) and (21) into the Donsker–Varadhan formula Equation (14), using that

B_{s}^{u}

is Brownian motion with respect to Q, it follows that the term in

Z_{τ}^{u}

in the expression of the relative entropy, which is linear in u, drops out, and what remains is (cf. [4,6]):

γ = inf_{u} E_{Q} [\int_{0}^{τ} (f (X_{s}, s) + \frac{1}{2} {| u_{s} |}^{2}) d s + g (X_{τ})],

with

X_{s}

being the solution of Equation (18). Since the distribution of

B^{u}

under Q is the same as the distribution of B under P, an equivalent representation of the last equation is

γ = inf_{u} E_{P} [\int_{0}^{τ} (f (X_{s}^{u}, s) + \frac{1}{2} {| u_{s} |}^{2}) d s + g (X_{τ}^{u})] .

(22)

where

X_{s}^{u}

is the solution of the controlled SDE

d X_{s}^{u} = (b (X_{s}^{u}, s) + σ (X_{s}^{u}) u_{s}) d s + σ (X_{s}^{u}) d B_{s}, X_{0}^{u} = x,

(23)

with

B_{s}

being our standard, m-dimensional Brownian motion (under P). See Appendix C for a sketch of derivation of Girsanov’s formula.

3.1.2. Importance Sampling in Path Space

Similarly to the finite dimensional case considered in the last section, we can derive optimal importance sampling strategies from the Donsker–Varadhan principle. To this end, we consider the case that

τ

is a random stopping time, which is a case that is often relevant in applications (e.g., when computing transition rates or committor functions [7]), but that is rarely considered in the importance sampling literature. Let

T > 0

and

O \subset R^{n}

be an open and bounded set with smooth boundary

\partial O

. We define

τ_{O} = inf {s > 0 : X_{s}^{u} \notin O},

as the first exit time of the set O and define the stopping time

τ = τ_{O} \land T

(24)

to be the minimum of

τ_{O}

and T, i.e., the exit from the set O or the end of the maximum time interval, whatever comes first. For the ease of notation, we will use the same symbol

τ

to denote the stopping time under the controlled or uncontrolled process (i.e., for

u = 0

) throughout the article. Unless otherwise noted, it should be clear from the context whether

τ

is understood with respect to

X^{u}

or

X = X^{u = 0}

. Here,

X_{s}^{u}

satisfies the controlled SDE (23).

We will argue that the optimal

Q^{*}

, which yields zero variance in the reweighting scheme

E_{Q} [Y_{τ}^{u}] = E_{P} [exp (- W_{τ})]

via

Y_{τ}^{u} = exp (- Z_{τ}^{u} - W_{τ}^{u})

, can be generated by a feedback control of the form

u_{s}^{*} = c (X_{s}^{u^{*}}, s),

with a suitable function

c : R^{n} \times [0, \infty) \supset D \to R^{m}

. Finding

u^{*}

turns the Donsker–Varadhan variational principle (14) into an optimal control problem by virtue of Equations (22) and (23). The following statement characterizes the optimal control by which the infimum in Equation (14) is attained and which, as a consequence, provides a zero variance reweighting scheme (or: change of measure).

Theorem 2.

Let

Ψ (x, t) = E_{P} [exp (- \int_{t}^{τ} f (X_{s}, s) d s - g (X_{τ})) | X_{t} = x]

(25)

be the exponential of the negative free energy, considered as a function of the initial condition

X_{t} = x

with

0 \leq t \leq τ \leq T

. Then, the path space measure

Q^{*}

induced by the feedback control

u_{s}^{*} = σ {(X_{s}^{u^{*}})}^{T} \nabla_{x} log Ψ (X_{s}^{u^{*}}, s)

(26)

yields a zero variance estimator, i.e.,

Y_{τ}^{u^{*}} = Ψ (x, 0), Q^{*} - a . s .

(27)

Proof.

See Appendix D. ☐

Remark 3.

We should mention that Theorem 2 covers also the special cases that either

τ = T

is a deterministic stopping time (see, e.g., [24], Proposition 5.4.4) or, by sending

T \to \infty

, that

τ = τ_{O}

is the first exit time of the set O, assuming that the stopping time

τ_{O}

is a.s. finite (but not necessarily bounded).

3.2. Revisiting Jarzynski’s Identity

The Donsker–Varadhan variational principle shares some features with the nonequilibrium free energy formula of Jarzynski [10], and, in fact, the variational form makes this formula amenable to the analysis of the previous paragraphs, with the aim of improving the quality of the corresponding statistical estimators. Jarzynski’s identity relates the Helmholtz equilibrium free energy to averages that are taken over an ensemble of non-equilibrium trajectories generated by forcing the dynamics.

We discuss a possible application of importance sampling to free energy calculation à la Jarzynski with a simple standard example, but we stress that all considerations easily generalize to more general situations than the one treated below.

As an example, let

{(V_{λ})}_{0 \leq λ \leq 1}

be a parametric family of smooth potential energy functions

V_{λ} : R^{n} \to R

and define the free energy difference between the two equilibrium densities

π_{0} \propto exp (- V_{0})

and

π_{1} \propto exp (- V_{1})

as the log-ratio

Δ F = - log \frac{\int_{R^{n}} exp (- V_{1} (x)) d x}{\int_{R^{n}} exp (- V_{0} (x)) d x} .

(28)

(Often

π_{0}

and

π_{1}

are called thermodynamic states.) Defining the energy difference

V_{diff} = V_{1} - V_{0}

and the equilibrium probability density

π_{0} (x) = \frac{exp (- V_{0} (x))}{\int_{R^{n}} exp (- V_{0} (x)) d x},

the Helmholtz free energy is seen to be an exponential average of the familiar form (2):

Δ F = - log E_{π_{0}} [exp (- V_{diff})] .

(29)

Jarzynski’s formula [10] states that the last equation can be represented as an exponential average over non-stationary realizations of a parameter-dependent process

X^{λ} = {(X_{s}^{λ})}_{0 \leq s \leq T}

. Specifically, letting

W_{T}^{λ} = W_{T}^{λ} (X^{λ})

denote the nonequilibrium work done on the system by varying the parameter from

λ = 0

to

λ = 1

within time

[0, T]

, Jarzynski’s equality states that

Δ F = - log E [exp (- W_{T}^{λ})],

(30)

where

W_{T}^{λ}

will be specified below. In the last equation the expectation is taken over all realizations of

X^{λ}

, with initial conditions distributed according to the equilibrium density

π_{0}

. To be specific, we assume that the parametric process

X_{s}^{λ}

is the solution of the SDE

d X_{s}^{λ} = - (1 - λ_{s}) \nabla V_{0} (X_{s}^{λ}) d s - λ_{s} \nabla V_{1} (X_{s}^{λ}) d s + \sqrt{2} d B_{s},

(31)

with

{(λ_{s})}_{0 \leq s \leq T}

being a differentiable parameter process (called: protocol) that interpolates between

λ_{0} = 0

and

λ_{T} = 1

. Further, let the work exerted by the protocol be given by

W_{T}^{λ} = \int_{0}^{T} {\frac{\partial V_{λ}}{\partial λ}|}_{X_{s}^{λ}} {\dot{λ}}_{s} d s = \int_{0}^{T} V_{diff} (X_{s}^{λ}) {\dot{λ}}_{s} d s,

(32)

where

V_{λ} (x) = (1 - λ) V_{0} (x) + λ V_{1} (x)

is the interpolated potential, and

\dot{λ_{s}} = d λ_{s} / d s

denotes the time derivative of

λ_{s}

. Note that

W_{τ}^{λ}

is a path functional of the standard form (12), with bounded deterministic stopping time

τ = T

and cost functions

f (X_{s}, s) = V_{diff} (X_{s}^{λ}) {\dot{λ}}_{s}, g \equiv 0 .

Letting now P denote the path space measure that is generated by the Brownian motion

{(B_{s})}_{s \geq 0}

in the parameter dependent SDE (31), we can express Jarzynski’s equality (30) by

Δ F = - log E [exp (- W_{T}^{λ})] = - log \int_{R^{n}} E_{P} [exp (- W_{T}^{λ})] π_{0} (x) d x,

(33)

where the (conditional) expectation

E_{P} [exp (- W_{T}^{λ})] = E_{P} [exp (- W_{T}^{λ} (X^{λ})) | X_{0}^{λ} = x]

is understood over all realizations of (31) with initial condition

X_{0}^{λ} = x

.

Optimized Protocols by Adaptive Importance Sampling

The applicability of Jarzynski’s formula heavily depends on the choice of the protocol

λ_{s}

. The observation that an uneducated choice of a protocol may render the corresponding statistical estimator virtually useless because of a dramatic increase of its variance is in accordance with what one observes in importance sampling. An attempt to optimize the protocol by minimizing the variance of the estimator has been carried out in [12], but here, we shall follow an alternative route, exploiting the fact that Jarzynski’s formula has the familiar exponential form considered in this paper; cf. also [11,13,14].

Having this said and recalling Theorem 2, it is plausible that there exists a zero variance estimator for

E_{P} [exp (- W_{T}^{λ})]

which appeared in the integrand of Jarzynski’s equality (33), under certain assumptions on the functional

W_{T}^{λ}

. For simplicity, we confine the following considerations to the above example of a diffusion process of the form (31) with a deterministic protocol

{(λ_{s})}_{s \in [0, T]}

. To make the idea of optimizing the protocol more precise, we introduce the shorthand

Y_{s} = X_{s}^{λ}

for the solution of Equation (31) and define

γ (x, t) = min_{v} E_{P} [\int_{t}^{T} (f (Y_{s}^{v}, s) + \frac{1}{2} {| v_{s} |}^{2}) d s | Y_{t}^{v} = x],

(34)

with

f (x, s) = V_{diff} (x) {\dot{λ}}_{s}

and the expectation taken over all realizations of

Y_{s}^{v}

. The process

Y_{s}^{v}

solves a controlled variant of the SDE (31), specifically,

d Y_{s}^{v} = (b_{λ} (Y_{s}^{v}, s) + \sqrt{2} v_{s}) d s + \sqrt{2} d B_{s}, Y_{t}^{v} = x .

(35)

Here, we have used the shorthand

b_{λ} (x, s) = - (1 - λ_{s}) \nabla V_{0} (x) - λ_{s} \nabla V_{1} (x)

. Theorem 2 that specifies the zero-variance importance sampling estimator in terms of a feedback control policy can be adapted to our situation (see, e.g., [7,18]) by letting

O ↑ R^{n}

so that

τ = τ_{O} \land T \to T

a.s. The zero-variance estimator is generated by the feedback control

v_{s}^{*} = - \sqrt{2} \nabla_{y} γ (Y_{s}^{v^{*}}, s),

with

γ (x, t)

given by Equation (34), and thus, by the SDE

d Y_{s}^{v^{*}} = (b_{λ} (Y_{s}^{v^{*}}, s) - 2 \nabla_{y} γ (Y_{s}^{v^{*}}, s)) d s + \sqrt{2} d B_{s}, Y_{0}^{v^{*}} = x .

(36)

Specifically, given N independent draws

x_{1}, \dots, x_{N} \sim π_{0}

from the equilibrium distribution and corresponding N independent trajectories

{(Y_{s}^{v^{*}})}_{s \in [0, T]}

of the SDE (36) with initial conditions

Y_{0}^{v^{*}} = x_{i}

, an asymptotically unbiased, the minimum variance estimator of the free energy is given by

{\hat{Δ F}}_{N} = - log \frac{1}{N} \sum_{i = 1}^{N} G_{T}^{*} (x_{i}),

(37)

where

G_{T}^{*} = exp (- Z_{T}^{v^{*}} - W_{T}^{λ} (Y^{v^{*}}))

with

Z_{T}^{u = v^{*}}

given by Equation (19) and

W_{T}^{λ} (Y^{v^{*}})

being the nonequilibrium work (32) under the controlled process (36).

Remark 4.

Generally, the discretization of the work

W_{T}^{λ}

requires some care, because the discretization error may introduce some “shadow work” that may spoil the properties of the importance sampling estimator [25]. Further note that, even if time-discretization errors are ignored, the estimator (37) is not a zero-variance estimator because we have minimized only the conditional estimator (for fixed initial condition). Moreover the estimator is only asymptotically unbiased by Jensen’s inequality and the strict concavity of the logarithm.

Further notice that the estimator hinges on the availability of

γ (x, t)

which is typically difficult to compute. An idea, inspired by the adaptive biasing force (ABF) algorithm [26,27,28] is to estimate γ on the fly and then iteratively refine the estimate in the course of the simulation using a suitable parametric representation [29,30]. If good collective variables or reaction coordinates are known, it is further possible to choose a representation that depends only on these variables and still obtain low variance estimators [31,32].

4. Algorithms: Gradient Descent, Cross Entropy Minimization and beyond

According to Theorem 2 designing reweighting (importance sampling) schemes on path space that feature zero variance estimators comes at the price of solving an optimal control problem of the following form: minimize the cost functional

J (u) = E_{P} [\int_{0}^{τ} (f (X_{s}^{u}, s) + \frac{1}{2} {| u_{s} |}^{2}) d s + g (X_{τ}^{u})]

(38)

over all admissible controls and subject to the dynamics

d X_{s}^{u} = (b (X_{s}^{u}, s) + σ (X_{s}^{u}) u_{s}) d s + σ (X_{s}^{u}) d B_{s} .

(39)

Here, admissible controls are Markovian feedback controls

u_{s} = c (X_{s}^{u}, s)

such that Equation (39) has a unique strong solution. Leaving all technical details aside see Section IV.3 in [8], it can be shown that the value function (or: optimal cost-to-go)

γ (x, t) = E_{P} [\int_{t}^{τ} (f (X_{s}^{u^{*}}, s) + \frac{1}{2} {| u_{s}^{*} |}^{2}) d s + g (X_{τ}^{u^{*}}) | X_{t}^{u^{*}} = x],

(40)

with

u^{*}

being the unique optimal control given by Equation (26), is the solution of a nonlinear partial differential equation of Hamilton–Jacobi–Bellman type. Solving this equation numerically is typically even more difficult than solving the original sampling problem by brute-force Monte Carlo (especially when the state space dimension n is large).

Note that Equations (38) and (39) is simply the concrete form of the Donsker–Varadhan principle when the path space measure is generated by a diffusion. Therefore the equivocation with the path space free energy (13) or (34) is not a coincidence, because by definition the value function is the free energy, considered as a function of the initial conditions. In other words and in view of Theorem 2, there is no need for further sampling once the value function is known.

We will now discuss concrete numerical algorithms to minimize Equations (38) and (39) without resorting to the associated Hamilton–Jacobi–Bellman equation.

4.1. Gradient Descent

The fact that solving the optimal control problem can be as difficult as solving the sampling problem suggests to combine the two in an iterative fashion using a parametric representation of the value function (or: free energy). To this end, notice that the optimal control is essentially a gradient force that can be approximated by

{\hat{u}}_{s} = - σ {(X_{s}^{u})}^{T} \sum_{i = 1}^{N} α_{i} \nabla_{x} ϕ_{i} (X_{s}^{u}, s),

(41)

based on a finite-dimensional approximation

\hat{γ} (x, t) = \sum_{i = 1}^{N} α_{i} ϕ_{i} (X_{s}^{u}, s)

(42)

of the value function with suitable smooth basis functions

{ϕ_{i} : \bar{D} \to R^{m} : i = 1, \dots, N}

that span an N-dimensional subspace of the space

C^{2, 1} (D) \cap C (\bar{D})

of classical solutions of the associated Hamilton–Jacobi–Bellman equation. Here we denote by

C^{r, s} (R^{n} \times [0, \infty)

the Banach space of functions that are r and s times continuously differentiable in their first and second arguments, respectively, and

C (R^{n} \times [0, \infty)) = C^{0, 0} (R^{n} \times [0, \infty))

for continuous functions. Plugging the above representation into Equations (38) and (39) yields the following finite-dimensional optimization problem: minimize

J (\hat{u}) = E_{P} [\int_{0}^{τ} (f (X_{s}^{\hat{u}}, s) + \frac{1}{2} {| {\hat{u}}_{s} |}^{2}) d s + g (X_{τ}^{\hat{u}})]

(43)

over the controls

\hat{u}

where

X^{\hat{u}}

is the solution of the SDE (23) with control

u = \hat{u}

.

Let us define

\hat{J} (α) = J (\hat{u} (α))

, with the shorthand

α = {(α_{1}, \dots, α_{N})}^{T} \in R^{N}

. Because of the dependence of the process

X^{α}

and the random stopping time

τ = τ^{α}

on the parameter

α

, the functional

\hat{J}

is not quadratic in

α

, but it has been shown [33] that it is strongly convex if the basis functions

ϕ_{i}

are non-overlapping. In this case

\hat{J}

has a unique minimum, which suggests to do a gradient descent in the parameter

α

:

α^{(m + 1)} = α^{(m)} - h_{m} \nabla \hat{J} (α^{(m)}) .

(44)

Here,

{(h_{m})}_{m \geq 0}

is a sequence of step sizes that goes to zero as

m \to \infty

, and the gradient

\nabla \hat{J} (α)

must be interpreted in the sense of a functional derivative:

δ J (\hat{u}) \cdot ξ = {\frac{d}{d ϵ} J (\hat{u} + ϵ ξ)|}_{ϵ = 0},

(45)

for suitable test functions

ξ \in V

(i.e., square-integrable and adapted to the Brownian motion). Then, the gradient

\nabla \hat{J} (α)

has the components

\frac{\partial \hat{J}}{\partial α_{k}} = - δ J (\hat{u} (α)) \cdot (σ^{T} \nabla_{x} ϕ_{k}) .

(46)

Introducing the shorthand

ℓ (X^{\hat{u}}, \hat{u}) = \int_{0}^{τ} (f (X_{s}^{\hat{u}}, s) + \frac{1}{2} {| {\hat{u}}_{s} |}^{2}) d s + g (X_{τ}^{\hat{u}})

for the cost and the convention

E [\cdot] = E_{P} [\cdot]

for the expectation with respect to P, the derivative (45) can again be found by means of Girsanov’s formula: there exists a measure

Q^{ϵ}

that is absolutely continuous with respect to the reference measure P, such that

{\frac{d}{d ϵ} J (\hat{u} + ϵ ξ)|}_{ϵ = 0} = {\frac{d}{d ϵ} E [ℓ (X^{\hat{u} + ϵ ξ}, \hat{u} + ϵ ξ)]|}_{ϵ = 0} = {\frac{d}{d ϵ} E [ℓ (X, \hat{u} + ϵ ξ) \frac{d Q^{ϵ}}{d P}]|}_{ϵ = 0},

(47)

with the likelihood ratio

{\frac{d Q^{ϵ}}{d P}|}_{[0, τ]} = exp (Z_{τ}^{u + ϵ ξ}) .

Assuming that the derivative and the expectation in Equation (47) commute, we can differentiate inside the expectation

E [\cdot]

which is independent of the parameter

ϵ

and then switch back to the controlled process

X^{u}

under the reference measure P, by which we obtain (see [33]):

\begin{matrix} δ J (\hat{u}) \cdot ξ & = E [ℓ (X^{\hat{u}}, \hat{u}) \int_{0}^{τ} ξ_{s} \cdot d B_{s} + \int_{0}^{τ} {\hat{u}}_{s} \cdot ξ_{s} d s] . \end{matrix}

(48)

Hence, using Equation (46), we find

\frac{\partial \hat{J}}{\partial α_{k}} = - E [ℓ (X^{\hat{u}}, \hat{u}) \int_{0}^{τ} (σ^{T} \nabla_{x} ϕ_{k}) (X_{s}^{\hat{u}}, s) \cdot d B_{s} + \int_{0}^{τ} {\hat{u}}_{s} \cdot (σ^{T} \nabla_{x} ϕ_{k}) (X_{s}^{\hat{u}}, s) d s] .

(49)

where the last expression can be estimated by Monte Carlo, possibly in combination with variance minimizing strategies to improve the convergence of the gradient estimation in the course of the gradient descent [33,34]. Before we conclude, we shall briefly explain why the gradient vanishes when the variance is zero.

Lemma 1.

Under the optimal control

u^{*}

, it holds that:

δ J (u^{*}) \cdot ξ = 0 \forall ξ \in V .

Proof.

By the Itô isometry (see [23], Corollary 3.1.7), we can recast Equation (48) as:

\begin{matrix} δ J (u^{*}) \cdot ξ & = E [(ℓ (X^{u^{*}}, u^{*}) + \int_{0}^{τ} u_{s}^{*} \cdot d B_{s}) \int_{0}^{τ} ξ_{s} \cdot d B_{s}] \\ = E [(W (X^{u^{*}}) + Z_{τ}^{u^{*}}) \int_{0}^{τ} ξ_{s} \cdot d B_{s}] \\ = (W (X^{u^{*}}) + Z_{τ}^{u^{*}}) E [\int_{0}^{τ} ξ_{s} \cdot d B_{s}], \end{matrix}

(50)

where in the last equality we have used that

W + Z

is a.s. constant under the optimal control. Since B is a Brownian motion under P, the expectation is zero, and it follows that

δ J (u^{*}) \cdot ξ = 0 \forall ξ \in V,

and hence, the assertion is proven. ☐

We summarize the above considerations in Algorithm 1 below.

Algorithm 1 Gradient descent

Set maximum no. of iteration $M_{axit}$ and $ϵ > 0$
Initialize $m = 0$ , $α^{(0)} \in R^{N}$ and $h_{0} > 0$
Evaluate $G_{0} = \nabla \hat{J} (α^{(0)})$
while $m < M_{axit}$ & $h_{m} > ϵ$
$α^{(m + 1)} = α^{(m)} - h_{m} G_{m}$
Evaluate $G_{m + 1} = \nabla \hat{J} (α^{(m + 1)})$
Evaluate step size, e.g., $h_{m + 1} = \frac{(α^{(m + 1)} - α^{(m)}) \cdot (G_{m + 1} - G_{m})}{| G_{m + 1} - G_{m} |^{2}}$
$m \mapsto m + 1$
end while

Remark 5.

The step size control in Algorithm 1 follows the Barzilai-Borwein procedure that guarantees convergence as

m \to \infty

when the functional is convex. Another alternative is to do a line search after each iterate in the descent direction and then determine

h_{m + 1}

so that it satisfies the Wolfe condition; see [35] for further details.

Remark 6.

In practice, it may be advantageous to pick the basis functions so that they are not explicitly time-dependent (e.g., Gaussians, Chebyshev polynomials or the alike). If the associated control problem is stationary, as is for example the case when the SDE is homogeneous and the stopping time is a hitting time, the value function will be stationary too and, as a consequence, the control policy will be stationary. If, however, the problem is explicitly time-dependent, one may change the ansatz (42) to have stationary basis functions, but time-dependent coefficients

α_{i}

, where the time-dependence is mediated by the initial data; see [29] for a discussion.

4.2. Cross-Entropy Minimization

Another algorithm for minimizing

J (\hat{u})

is based on an entropy representation of

J (u)

, namely,

J (u) = J (u^{*}) + D (Q | Q^{*})

(51)

where u is any admissible control for Equations (38) and (39),

u^{*}

is the optimal control, and

Q = Q (u)

and

Q^{*} = Q (u^{*})

are the corresponding path space measures. Equation (51) is a consequence of the zero-variance property of the optimal change of measure, since Equation (27) implies that

exp (- W) \frac{d P}{d Q^{*}} = ψ (x, 0) = exp (- J (u^{*}))

(52)

and hence

W - log (\frac{d P}{d Q} \frac{d Q}{d Q^{*}}) = J (u^{*}) .

(53)

Taking the expectation with respect to Q and using that both Q and

Q^{*}

are absolutely continuous with respect to P and vice versa yields Equation (51).

The idea now is to seek a minimizer of

D (Q | Q^{*})

in the set of probability measures

Q \in \hat{M}

that are generated by the discretized controls

\hat{u}

, i.e., one would like to minimize

\hat{I} (α) = D (Q (\hat{u} (α)) | Q^{*})

(54)

over

α \in R^{N}

, such that

\hat{Q} = Q (\hat{u} (α))

is absolutely continuous with respect to

Q^{*}

. By Equation (16) the optimal change of measure is only known up to the normalizing factor

exp (γ)

, which enters Equation (54) only as an additive constant; note that we call

exp (γ)

or

exp (J (u^{*}))

a normalizing factor, even though it is clearly a function of the initial conditions

(x, t)

or

(x, 0)

. Nevertheless minimizing

\hat{I}

is not easily possible since the functional may have several local minima. With a little trick, however, we can turn the minimization of Equation (54) into a feasible minimization problem, simply by flipping the arguments. To this end, we define:

\hat{H} (α) = D (Q^{*} | Q (\hat{u} (α))) .

(55)

Clearly, Equation (51) does not hold with arguments in the Kullback-Leibler (or: KL) divergence term reversed, since

D (\cdot | \cdot)

is not symmetric; nevertheless, it holds that

\hat{I} (α) \geq 0, \hat{H} (α) \geq 0 and \hat{I} (α) = 0 if and only if \hat{H} (α) = 0,

(56)

where the minimum is attained if and only if

\hat{Q} = Q^{*}

. Hence, by continuity of the relative entropy, we may expect that by minimizing the “wrong” functional

\hat{H}

we get close to the optimal change of measure, provided that the optimal

Q^{*}

can be approximated by our parametric family

\hat{Q}

. We have the following handy result (see [15]).

Lemma 2 (Cross-entropy minimization).

The minimization of (55) is equivalent to the minimization of the cross-entropy functional

C E (α) = - E [log φ (\hat{u}) e^{- W (X)}]

(57)

where the log likelihood ratio

log φ = log (d Q / d P)

between controlled and uncontrolled trajectories is quadratic in the unknown

α

and can be computed via Girsanov’s theorem.

Proof.

By definition of the KL divergence, we have

\hat{H} (α) = \int log (\frac{d Q^{*}}{d P}) \frac{d Q^{*}}{d P} d P - \int log (\frac{d \hat{Q}}{d P}) \frac{d Q^{*}}{d P} d P,

since all measures are mutually absolutely continuous. The first term in the last equation is independent of

α

, and the second term is proportional to the cross-entropy functional

- E [log φ exp (- W)]

up to the unknown normalizing factor

exp (γ)

. ☐

The fact that the cross-entropy functional is quadratic in

α

implies that the necessary optimality condition is of the form

S α = b,

(58)

where

S = {(S_{i j})}_{1 \leq i, j \leq N}

and

b = {(b_{i})}_{1 \leq i \leq N}

are given by:

\begin{matrix} \begin{matrix} S_{i j} & = E [e^{- W (X)} \int_{0}^{τ} (σ^{T} \nabla_{x} ϕ_{i}) (X_{s}, s) (σ^{T} \nabla_{x} ϕ_{j}) (X_{s}, s) d s] \\ b_{i} & = - E [e^{- W (X)} \int_{0}^{τ} (σ^{T} \nabla_{x} ϕ_{i}) (X_{s}, s) \cdot d B_{s}] . \end{matrix} \end{matrix}

(59)

Note that the average in Equation (59) is over the uncontrolled realizations X. It is easy to see that the matrix S is positive definite if the basis functions

ϕ_{i}

are linearly independent, which implies that Equation (58) has a unique solution and our necessary condition is in fact sufficient. Nevertheless it may happen in practice that the coefficient matrix S is badly conditioned, in which case it may be advisable to evaluate the coefficients using importance sampling or a suitable annealing strategy; see [15,29] for further details.

A simple, iterative variant of the cross-entropy algorithm is summarized in Algorithm 2.

Algorithm 2 Simple cross-entropy method

Set maximum no. of iteration $M_{axit}$ and $α^{(0)} = 0$
Evaluate $S = S^{(0)}$ and $b = b^{(0)}$ according to (59)
for $m = 0$ to $M_{axit}$ do
Solve linear system of equations $S^{(m)} α^{(m + 1)} = b^{(m)}$
Evaluate $S^{(m + 1)}$ and $b^{(m + 1)}$ by importance sampling using realizations of $X^{α^{(m + 1)}}$
end for

4.3. Other Monte Carlo-Based Methods

We refrain from listing all possibilities to compute the optimal change of measure or the optimal control, and mention only two more possibilities that are functional in situations in which grid-based discretization methods (e.g., for solving the nonlinear Hamilton–Jacobi–Bellman equation) are unfeasible. The strength of the methods described below is that they can be combined with model reduction methods such as averaging, homogenization or Markov state modeling if either suitable collective variables, a reaction coordinate or some dominant metastable sets are known; see, e.g., [7,15,29,32,36,37] for the general approach and the application to molecular dynamics.

4.3.1. Approximate Policy Iteration

The first option is based on successive linearization of the Hamilton–Jacobi–Bellman equation of the underlying optimal control problem. The idea is the following: Given any admissible control

u_{s} = c (X_{s}^{u}, s)

, the Feynman–Kac theorem [23] (Theorem 8.2.1) states that the cost functional

J (u)

, considered as a function of the initial data

(x, t)

of the controlled process

X^{u} = {(X_{s}^{u})}_{s \geq t}

with

X_{t}^{u} = x

, solves a linear boundary value problem of the form

A (c) Θ (c) = ℓ (x, c),

(60)

where

A (c)

is a linear differential operator that depends on the chosen control policy and which precise form (e.g., parabolic, elliptic or hypoelliptic) depends on the problem at hand. Clearly,

γ (x, t) = {min}_{c} Θ (c; x, t)

is the value function (or free energy), i.e., the solution we seek. For an arbitrary initial choice of a control policy

c_{0} \neq c^{*}

we have

γ < Θ (c_{0})

, and a successive improvement of the policy can be obtained by iterating

c_{n + 1} (x, s) = - σ {(x, s)}^{T} \nabla_{x} Θ (c_{n}; x, t), n \geq 0 .

(61)

Under suitable assumptions on the drift and diffusion coefficients, iteration of Equations (60) and (61) yields a convergent series of control policies

c_{n}

that converges to the unique optimal control, hence the name of the method is policy iteration. Clearly, solving the linear partial differential Equation (60) by any grid-based method will be unfeasible if the state space dimension is larger than, say, three or four. In this case, it is possible to approximate the infinitesimal generators

A (c)

by a sparse and grid-free Markov State Model (MSM) that captures the underlying dynamics

X^{u} = X^{u (c)}

; see, e.g., [36] for the error analysis of the corresponding nonequilibrium MSM and an application to molecular dynamics. In this case, one speaks of an approximate policy iteration. For further details on approximate policy iteration algorithms we refer to the article [38] and the references therein.

4.3.2. Least-Squares Monte Carlo

If

τ = T

is a finite stopping time, another alternative is to exploit that the value function of the control problem Equations (38) and (39) can be computed as the solution to a forward-backward stochastic differential equation (FBSDE) of the form

\begin{matrix} d X_{s} & = b (X_{s}, s) d s + σ (X_{s}) d B_{s}, X_{t} = x \\ d Y_{s} & = - f (X_{s}, s) d s + \frac{1}{2} {| Z_{s} |}^{2} + Z_{s} \cdot d B_{s}, Y_{T} = g (X_{T}), \end{matrix}

(62)

where

t \leq s \leq T

and the second equation must be interpreted as an equation that runs backwards in time. A solution of the FBSDE (62) is a triplet

(X_{s}, Y_{s}, Z_{s})

, with the property that

Y_{s^{'}}

and

Z_{s^{'}}

at time

s^{'} \in [t, T]

depend only on the history of the forward process

{(X_{s})}_{t \leq s \leq s^{'}}

up to time

s^{'}

. In particular, since

X_{t} = x

, the backward process

Y_{t}

is a deterministic function of the initial data

(x, t)

only, and it holds that (e.g., [39])

γ (x, t) = Y_{t} .

(63)

The specific structure of the control problem Equations (38) and (39) implies that the forward equation is decoupled from the backward equation and that the backward process

(Y_{s}, Z_{s})

can be expressed by

Y_{s} = γ (X_{s}, s), Z_{s} = - σ {(X_{s})}^{T} \nabla_{x} γ (X_{s}, s),

where

X_{s}

is the uncontrolled forward process. Since we can simulate the forward process and we know the functional dependence of

(Y_{s}, Z_{s})

on

X_{s}

, the idea here is again to use the representation Equation (42) of

γ

in terms of a finite basis. It turns out that the coefficient vector

α \in R^{N}

can be computed by solving a least-squares problem in every time step of the time-discretized backward SDE, which is why methods for solving an FBSDE like Equation (62) are termed least squares Monte Carlo; for the general approach we refer to [40,41]; details for the situation at hand will be addressed in a forthcoming paper [42].

5. Illustrative Examples

From a measure-theoretic viewpoint, changing the drift of an SDE (also known as Girsanov transformation) is an exponential tilting of a Gaussian measure on an infinite-dimensional space. Here, for illustration purposes, we consider a one-dimensional paradigm that is in the spirit of Section 2 and that illustrates the basic features of Gaussian measure changes, Girsanov transformations and the cross-entropy method.

To this end, let

π = N (0, 1)

the density of the standard Gaussian distribution on

R

, and define an exponential family of “tilted” probability densities by

ρ_{α} (x) = exp (α x - \frac{α^{2}}{2}) π (x) .

(64)

It can be readily checked that

ρ_{α}

is the density of the normal distribution

N (α, 1)

with mean

α

and unit variance, in other words, the exponential tilting results in a shift of the mean, which represents a change of the drift in the case of an SDE (compare Equations (19) and (21)).

5.1. Example 1 (Moment Generating Function)

Let

β > 0

and define

ψ_{β} = E_{π} [exp (- β X)] .

(65)

By Jensen’s inequality, it follows that

- β^{- 1} log E_{π} [exp (- β X)] \leq E_{α} [X + β^{- 1} log \frac{ρ_{α}}{π}],

(66)

where

E_{α} [\cdot]

denotes the expectation with respect to

ρ_{α}

. A simple calculation shows that the inequality is sharp where equality is attained for

α = - β

, i.e., when

ρ_{α} = ρ^{*}

, with

ρ^{*} = N (- β, 1) .

(67)

As a consequence, the Donsker–Varadhan variational principle (4) holds when the minimum is taken over the exponential family (64) with sufficient statistic X.

We will now show that

ρ^{*}

can be computed by the cross-entropy method. To this end, let

J (α) = E_{α} [X + β^{- 1} log \frac{ρ_{α}}{π}] .

(68)

As we have just argued, there exists a unique minimizer

α^{*} = - β

of J that by Theorem 1 has the zero variance property, which implies that

J (α) = J (α^{*}) + β^{- 1} D (ρ_{α} | ρ^{*}) .

(69)

The associated cross-entropy functional has the form (see Section 4.2):

C E (α) = - E_{π} [log \frac{ρ_{α}}{π} exp (- β X)] .

(70)

Using Equation (64), it is easily seen that the cross-entropy functional is quadratic,

C E (α) = E_{π} [(\frac{α^{2}}{2} - α X) exp (- β X)],

(71)

with unique minimizer:

\hat{α} = \frac{E_{π} [X exp (- β X)]}{E_{π} [exp (- β X)]} = - \frac{\partial}{\partial β} log ψ_{β},

(72)

where the second equality follows from Equation (65), using the fact that the derivative and the expectation commute because

π

is Gaussian and hence the moment-generating function

ψ_{β}

exists for all

β \in R

. Rearranging the terms in the last equation, we obtain

- \frac{\partial}{\partial β} log ψ_{β} = E_{π} [X exp (- β X - \frac{β^{2}}{2})] = E_{- β} [X] = - β,

(73)

showing that:

\hat{α} = α^{*} = - β .

The above consideration readily generalize to the multidimensional Gaussian case, and hence this simple example illustrates that the cross-entropy method yields the same result as direct minimization of the functional (68)—at least in the finite-dimensional case.

5.2. Example 2 (Rare Event Probabilities)

The following example illustrates that the cross-entropy method can be used and produces meaningful results, even though the Donsker–Varadhan principle does not hold. To this end consider again the case of a real-valued random variable

X \sim P

with density

π = N (0, 1)

and

W = - log 1_{{X > d}}

with

d ≫ 0

. Then

P (X > d) = E_{π} [exp (- W)]

(74)

is a small probability that is difficult to compute by brute-force Monte Carlo. In this case, a zero-variance change of measure exists, but it is not of the form (64). As a consequence, equality in Equation (66) cannot be attained within the exponential family

{ρ_{α} : α \in R}

given by Equation (64). Instead, the optimal density in this case would be the conditional density

ρ^{*} (x) = \frac{1_{{x > d}}}{p} π (x),

(75)

where the normalization constant

p = P (X > d)

is of course the quantity we want to compute (cf. Section 2.1). Note that this expression formally agrees with the optimal density (5), which was derived under different assumptions though.

The idea now is to minimize the distance between

ρ_{α}

and

ρ^{*}

in the sense of relative entropy, i.e., we seek a minimizer of the Kullback–Leibler divergence

D (ρ^{*} | ρ_{α})

in the exponential family

{ρ_{α} : α \in R}

. The associated cross-entropy functional is given by

C E (α) = E_{π} [(\frac{α^{2}}{2} - α X) 1_{{X > d}}],

(76)

with unique minimizer

α^{*} = \frac{E_{π} [X 1_{{X > d}}]}{E_{π} [1_{{X > d}}]} = E_{π} [X | X > d] .

(77)

Comparing Equations (77) and (75), we observe that both densities

ρ_{α^{*}}

and

ρ^{*} \neq ρ_{α^{*}}

have the same mean (namely

α^{*}

), hence the suboptimal density

ρ_{α^{*}}

is concentrated around the typical values that the optimal density

ρ^{*}

would produce when samples were drawn from it.

Clearly, the optimal tilting parameter (77) is probably as difficult to compute by brute-force Monte Carlo as the probability

p = P (X > d)

since

{X > d}

is a rare event when

d ≫ 0

is far away from the mean. The strength of both gradient descent and cross-entropy method is, however, that the optimal tilting parameter can be computed iteratively. This is illustrated numerically in Figure 1 for the choice

d = 5

, where we use Algorithm 1 with a constant stepsize and Algorithm 2 as specified. In each iteration m we draw a sample of size

N = 10^{8}

from the density

ρ^{α^{(m)}}

, and estimate the mean,

\hat{p} = \frac{1}{N} \sum_{i = 1}^{N} 1_{{X_{i} > d}} \frac{π}{ρ^{α^{(m)}}} (X_{i})

and the sample variance in each sample. The latter is proportional to the normalized variance

K Var (\hat{p})

of an estimator that has been estimated K times.

Figure 1. Comparison of the cross-entropy (green) and the gradient descent method (blue) for a rare event with probability

p \approx 2.867 \times 10^{- 7}

for fixed sample size

N = 10^{8}

. Both algorithms quickly converge to the optimal tilting parameter

α^{*} \approx 5.187

for the family

N (α, 1)

of importance sampling distributions (left panel) and lead to a drastic reduction of the normalized relative error by a factor of 1000, from about 2000 to 2.38 after few iterations (right panel).

For this (admittedly simple) example both gradient descent and cross-entropy method converge well and lead to a drastic reduction of the normalized relative error

δ = \sqrt{Var} (\hat{p}) / \hat{p}

of the estimator by a factor of about 1000, from about 2000 without importance sampling to about

δ \approx 2.38

under (suboptimal) importance sampling with exponential tilting, indicating that both methods can handle situations in which the optimal (i.e.,

δ = 0

) change of measure is not available within the set of trial densities.

6. Conclusions

We have presented a method for constructing minimum-variance importance sampling estimators. The method is based on a variational characterization of the thermodynamic free energy and essentially replaces a Monte Carlo sampling problem by a stochastic approximation problem for the optimal importance sampling density. For path sampling, the stochastic approximation problem boils down to a Markov control problem, which again can be solved by stochastic optimization techniques. We have proved that for a large class of path sampling problems that are relevant in, e.g., molecular dynamics or rare events simulation, the (unique) solution to the optimal control problem can yield zero-variance importance sampling schemes.

The computational gain when replacing the sampling problem by a variational principle is, besides improved convergence due to the variance reduction and often a higher hitting rate of the relevant events, due to the fact that the variational problem can be solved iteratively, which makes it amenable to multilevel approaches. The cross-entropy method as an example of such an approach has been presented in some detail. A substantial difficulty still is a clever choice of basis functions that is highly problem-specific, and hence, future research should address non-parametric approaches, as well as model reduction methods in combination with the stochastic optimization/approximation tools that can be used to solve the underlying variational problems.

Acknowledgments

This work was funded by the Einstein Center for Mathematics (ECMath) and by the Deutsche Forschungsgemeinschaft (DFG) under Frant DFG-SFB 1114 “Scaling Cascades in Complex Systems”.

Author Contributions

C.H. and W.Z. conceived and designed the research; L.R. performed the numerical experiments; C.H. and C.S. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Yet Another Certainty Equivalence

A similar variational characterization of expected values as in Equation (4), based on convexity arguments and Jensen’s inequality, can be formulated for non-negative random variables

W = W (X) \geq 0

. For simplicity and as in Section 2.1, we assume W to be bounded and measurable and

π > 0

. Then, for all

p \in [1, \infty)

, it holds that

{(E_{π} [W^{p}])}^{1 / p} = max_{η \geq 0} E_{η} [W {(\frac{η}{π})}^{- 1 / p}],

(A1)

where

η

is any non-negative probability density. If we exclude the somewhat pathological case

W \equiv 0

a.s., it follows that

E_{π} [W^{p}] > 0

and the supremum is attained for

η^{*} = \frac{W^{p}}{E_{π} [W^{p}]} π .

(A2)

The proof is along the lines of the proof of the Donsker–Varadhan principle Equation (4). Indeed, applying Jensen’s inequality and noting that

η / π

is non-zero

η

-a.s. it readily follows that

{(E_{π} [W^{p}])}^{1 / p} \geq E_{η} [W {(\frac{η}{π})}^{- 1 / p}],

and it is easy to verify that the supremum in Equation (A1) is attained at

η = η^{*}

given by Equation (A2).

Similar as in Section 2.1, the above discussions can be applied to study the importance sampling schemes for the p-th moment of random variable W. We have:

Theorem A1 (Optimal importance sampling, cont’d).

Let

p > 1

and

η^{*}

be defined in Equation (A2). Then the random variable

Y = W^{p} {(η^{*} / π)}^{- 1}

has zero variance under

η^{*}

, and we have:

Y = E_{π} [W^{p}], η^{*} - a . s .

Again, Theorem A1 implies that drawing random variable X from

η^{*}

and then estimating the reweighted expectation

E_{η^{*}} [Y]

provides a zero variance estimator for the quantity

E_{π} [W^{p}]

.

Remark A1.

When

0 < p \leq 1

the function

f (u) = u^{p}

is concave for

u \geq 0

and the variational principle (A1) needs to be modified as (see [4])

{(E_{π} [W^{p}])}^{1 / p} = min_{η \geq 0} E_{η} [W {(\frac{η}{π})}^{- 1 / p}],

(A3)

where the minimizer

η^{*}

is given by Equation (A2). If

W > 0

a.s., then

η^{*}

belongs to the exponential family with sufficient statistic

S (X) = p log W (X)

and reference density π.

Appendix B. Ratio Estimators

We shall briefly discuss the properties of the self-normalized importance sampling estimator (7) that is based on estimating a ratio of expected values

q_{N} = \frac{1}{N} \sum_{i = 1}^{N} Q_{i}, p_{N} = \frac{1}{N} \sum_{i = 1}^{N} P_{i}

where

Q_{i}

and

P_{i}

are i.i.d. random variables living on a joint probability space and having finite variances

σ_{Q}^{2}

and

σ_{P}^{2}

and covariance

σ_{Q P}

. Further assume that

q = E [Q_{1}] \neq 0

, then, by the strong law of large numbers, the ratio

p_{N} / q_{N}

converges a.s. to

p / q

where

p = E [P_{1}]

.

Appendix B.1. The Delta Method

We can apply the delta method (e.g., [20] (Section 4.1)) to analyze the behavior of the ratio estimator in more detail. Roughly speaking, the delta method says that for a sum

S_{N} = X_{1} + \dots + X_{N}

,

N \in N

of square-integrable, i.i.d. random variables

X_{k}

with mean

μ \in R^{n}

, covariance matrix

Σ \in R^{n \times n}

and a sufficiently smooth function

ϕ : R^{n} \to R

which can be Taylor expanded about

μ

, the central limit theorem applies. Specifically, using mean value theorem, it is easily seen that

ϕ (S_{N} / N) - ϕ (μ) = \nabla ϕ (ζ_{N}) (S_{N} / N - μ)

for some

ζ_{N} \in R^{n}

lying component-wise in the half open interval between

S_{N} / N

and

μ

. By the continuity of

\nabla ϕ

at

μ

, the fact that

S_{N} / N \to μ

a.s. as

N \to \infty

and that

\sqrt{N} (S_{N} / N - μ)

is approximately Gaussian with mean zero and covariance

Σ

, we have

\sqrt{N} (ϕ (S_{N} / N) - ϕ (μ)) \overset{L}{\to} N (0, \nabla ϕ {(μ)}^{T} Σ \nabla ϕ (μ)), N \to \infty,

where “

\overset{L}{\to}

” denotes convergence in law (or: convergence in distribution), and

N (m, C)

denotes a Gaussian distribution with mean m and covariance C.

Appendix B.2. Asymptotic Properties of Ratio Estimators

Applying the delta method to the function

ϕ : R^{2} \to R, (u, v) \mapsto u / v

, and assuming that

| v |

is bounded away from zero, we find that the ratio estimator satisfies a central limit theorem too. Specifically, assuming that

q \neq 0

so that

| q_{N} |

is asymptotically bounded away from zero, the delta method yields

\sqrt{N} (\frac{p_{N}}{q_{N}} - \frac{p}{q}) \overset{L}{\to} N (0, σ^{2}),

with variance

σ^{2} = \frac{Var (P_{1} - \frac{p}{q} Q_{1})}{q^{2}} .

In particular, the estimator is asymptotically unbiased.

Appendix C. Finite-Dimensional Change of Measure Formula

We will explain the basic idea behind Girsanov’s theorem and the change of measure Formula (21). To keep the presentation easily accessible, we present only a vanilla version of the theorem based on finite-dimensional Gaussian measures, partly following an idea in [43].

Appendix C.1. Gaussian Change of Measure

Let P be a probability measure on a measurable space

(Ω, E)

, on which an m-dimensional random variable

B : Ω \to R^{m}

is defined. Further suppose that B has standard Gaussian distribution

P_{B} = P \circ B^{- 1}

. Given a (deterministic) vector

b \in R^{n}

and a matrix

σ \in R^{n \times m}

, we define a new random variable

X : Ω \to R^{n}

by

X (ω) = b + σ B (ω) .

(A4)

The similarity to the SDE (11) is no coincidence. Since B is Gaussian, so is X, with mean b and covariance

C = σ σ^{T}

. Now, let

u \in R^{n}

and define the shifted Gaussian random variable

B^{u} (ω) = B (ω) - u,

and consider the alternative representation

X (ω) = b^{u} + σ B^{u} (ω)

(A5)

of X that is equivalent to Equation (A4) if and only if

σ u = b^{u} - b

has a solution (that may not be unique though). Following the line of Section 3.1, we seek a probability measure

Q ≪ P

such that

B^{u}

is standard Gaussian under Q, and we claim that such a Q should have the property:

\frac{d Q}{d P} (ω) = exp (u \cdot B (ω) - \frac{1}{2} {| u |}^{2})

(A6)

or, equivalently,

\frac{d Q}{d P} (ω) = exp (u \cdot B^{u} (ω) + \frac{1}{2} {| u |}^{2}),

(A7)

in accordance with Equations (19)–(21). To show that

B^{u}

is indeed standard Gaussian under the above defined measure Q, it is sufficient to check that for any measurable (Borel) set

A \subset R^{m}

, the probability

Q (B^{u} \in A)

is given by the integral against the standard Gaussian density:

Q (B^{u} \in A) = \frac{1}{{(2 π)}^{m / 2}} \int_{A} exp (- \frac{{| x |}^{2}}{2}) d x .

Indeed, since B is standard Gaussian under P, it follows that:

\begin{matrix} Q (B^{u} \in A) & = \int_{{ω : B^{u} (ω) \in A}} exp (u \cdot B (ω) - \frac{1}{2} {| u |}^{2}) d P (ω) \\ = \int_{{ω : B (ω) - u \in A}} exp (u \cdot B (ω) - \frac{1}{2} {| u |}^{2}) d P (ω) \\ = \frac{1}{{(2 π)}^{m / 2}} \int_{{x : x - u \in A}} exp (u \cdot x - \frac{1}{2} {| u |}^{2} - \frac{1}{2} {| x |}^{2}) d x \\ = \frac{1}{{(2 π)}^{m / 2}} \int_{{x : x - u \in A}} exp (- \frac{{| x - u |}^{2}}{2}) d x \\ = \frac{1}{{(2 π)}^{m / 2}} \int_{A} exp (- \frac{{| y |}^{2}}{2}) d y, \end{matrix}

showing that

B^{u}

has a standard Gaussian distribution under Q.

Appendix C.2. Reweighting

Clearly, by the definition of Q, it holds that:

E [f (X)] = E_{Q} [f (X) exp (- u \cdot B^{u} (ω) - \frac{1}{2} {| u |}^{2})]

(A8)

for any bounded and measurable function

f : R^{n} \to R

, where

E [\cdot] = E_{P} [\cdot]

denotes the expectation with respect to the reference measure P. Now, let

X^{u} (ω) = b^{u} + σ B (ω) .

Since the distribution of the pair

(X^{u}, B)

under P is the same as the distribution of the pair

(X, B^{u})

with

X = b^{u} + σ B^{u}

under Q, the reweighting identity Equation (A8) entails that

E [f (X)] = E [f (X^{u}) exp (- u \cdot B (ω) - \frac{1}{2} {| u |}^{2})],

(A9)

with

E [B] = 0

. Equation (A9) is the finite dimensional analogue of the reweighting identity that has been used to convert the Donsker–Varadhan Formula (14) into its final form (22).

Remark A2.

If

σ (\cdot)

in the SDE (18) is square and invertible, then an alternative derivation of Girsanov’s theorem and Equation (21) can be based on the Euler-Maruyama discretization of the SDE and a change of measure for the corresponding Markov chain.

Appendix D. Proof of Theorem 2

The proof is based on the Feynman–Kac formula and Itô’s lemma. Here, we give only a sketch of the proof and leave aside all technical details regarding the regularity of solutions of partial differential equations, for which we refer to [8] (Section VI.5). Recall the definition

Ψ (x, t) = E_{P} [exp (- \int_{t}^{τ} f (X_{s}) d s - g (X_{τ})) | X_{t} = x] .

By the Feynman–Kac formula, function

Ψ

solves the parabolic boundary value problem

\begin{matrix} (A - f) Ψ & = 0, for (x, t) \in O \times [0, T), \\ Ψ & = exp (- g), for (x, t) \in D^{+} \end{matrix}

(A10)

on the domain

D = O \times [0, T)

where

D^{+} = (\partial O \times [0, T)) \cup (O \times {T})

denotes the terminal set of the augmented (control-free) process

(X_{t}, t)

and

A = \frac{\partial}{\partial t} + \frac{1}{2} σ σ^{T} : \nabla_{x}^{2} + b \cdot \nabla_{x}

is its infinitesimal generator under the probability measure P.

By construction, the stopping time

τ

is bounded, and we assume that

Ψ

is of class

C^{2, 1}

on D, and continuous and uniformly bounded away from zero on the closure

\bar{D}

. Now, let us define the process

ζ_{s}^{u} = - log Ψ (X_{s}^{u}, s),

with

X_{s}^{u}

given by Equation (18). Then, using Itô’s lemma (e.g., [23] (Theorem 4.2.1)) and introducing the shorthands

Ψ_{s}^{u} = Ψ (X_{s}^{u}, s), b_{s}^{u} = b (X_{s}^{u}, s), σ_{s}^{u} = σ (X_{s}^{u}),

we see that

{(ζ_{s}^{u})}_{0 \leq s < τ}

satisfies the SDE

\begin{matrix} d ζ_{s}^{u} = & - \frac{\partial}{\partial t} log Ψ_{s}^{u} d s - \nabla_{x} log Ψ_{s}^{u} \cdot (b_{s}^{u} + σ_{s}^{u} u_{s}) d s \\ - \frac{1}{2} σ_{s}^{u} {(σ_{s}^{u})}^{T} : \nabla_{x}^{2} (log Ψ_{s}^{u}) d s - ({(σ_{s}^{u})}^{T} \nabla_{x} log Ψ_{s}^{u}) \cdot d B_{s}^{u} \\ = & - [\frac{A Ψ_{s}^{u}}{Ψ_{s}^{u}} + ({(σ_{s}^{u})}^{T} \frac{\nabla_{x} Ψ_{s}^{u}}{Ψ_{s}^{u}}) \cdot u_{s} - \frac{1}{2} \frac{| {(σ_{s}^{u})}^{T} \nabla_{x} Ψ_{s}^{u} |^{2}}{{(Ψ_{s}^{u})}^{2}}] d s - ({(σ_{s}^{u})}^{T} \frac{\nabla_{x} Ψ_{s}^{u}}{Ψ_{s}^{u}}) \cdot d B_{s}^{u} \\ = & - [f (X_{s}^{u}, s) + ({(σ_{s}^{u})}^{T} \frac{\nabla_{x} Ψ_{s}^{u}}{Ψ_{s}^{u}}) \cdot u_{s} - \frac{1}{2} \frac{| {(σ_{s}^{u})}^{T} \nabla_{x} Ψ_{s}^{u} |^{2}}{{(Ψ_{s}^{u})}^{2}}] d s - ({(σ_{s}^{u})}^{T} \frac{\nabla_{x} Ψ_{s}^{u}}{Ψ_{s}^{u}}) \cdot d B_{s}^{u} . \end{matrix}

In the last equation, we have used that the first equality in Equation (A10) holds in the interior of the bounded domain D, i.e., for

s < τ

. Choosing

u_{s} = u_{s}^{*}

for

0 \leq s < τ

to be the optimal control

u_{s}^{*} = σ {(X_{s}^{u^{*}}, s)}^{T} \nabla_{x} log Ψ (X_{s}^{u^{*}}, s),

as in Equation (26), the last equation can be recast as

d ζ_{s}^{u^{*}} = - (f (X_{s}^{u^{*}}, s) + \frac{1}{2} {| u_{s}^{*} |}^{2}) d s - u_{s}^{*} \cdot d B_{s}^{u^{*}} .

Similar to Equation (20), if we introduce

Z_{s, τ}^{u^{*}} = \int_{s}^{τ} u_{r}^{*} \cdot d B_{r}^{u^{*}} + \frac{1}{2} \int_{s}^{τ} {| u_{r}^{*} |}^{2} d r,

then

Z_{0, τ}^{u^{*}} = Z_{τ}^{u^{*}}

, and we have:

d ζ_{s}^{u^{*}} = - f (X_{s}^{u^{*}}, s) d s - d Z_{s, τ}^{u^{*}} .

As a consequence, using the continuity of the process as

s ↓ 0

,

ζ_{τ}^{u^{*}} = ζ_{0}^{u^{*}} - Z_{τ}^{u^{*}} - \int_{0}^{τ} f (X_{s}^{u^{*}}, s) d s .

(A11)

By definition of

ζ_{s}^{u}

, the initial value

ζ_{0}^{u^{*}} = - log Ψ (X_{0}^{u^{*}}, 0) = - log Ψ (x, 0)

is deterministic. Moreover

ζ_{τ}^{u^{*}} = - log Ψ (X_{τ}^{u^{*}}, τ) = g (X_{τ}^{u^{*}})

, which in combination with Equation (A11) yields:

- log Ψ (x, 0) = g (X_{τ}^{u^{*}}) + \int_{0}^{τ} f (X_{s}^{u^{*}}, s) d s - Z_{τ}^{u^{*}} .

(A12)

Rearranging the terms in the last equation, we find

Ψ (x, 0) = exp (- Z_{τ}^{u^{*}} - \int_{0}^{τ} f (X_{s}^{u^{*}}, s) d s - g (X_{τ}^{u^{*}})),

(A13)

with probability one, which yields the assertion of Theorem 2.

Remark A3.

Letting

T \to \infty

in the proof of the theorem it follows that

τ \to τ_{O}

a.s. where

τ_{O}

is the first exit time of the set O. As a consequence, the zero variance property of the importance sampling estimator carries over to the case of a.s. finite (but not necessarily bounded) hitting times or first exit times.

References

Hammersely, J.M.; Morton, K.W. Poor Man’s Monte Carlo. J. R. Stat. Soc. Ser. B 1954, 16, 23–38. [Google Scholar]
Rosenbluth, M.N.; Rosenbluth, A.W. Monte Carlo Calculations of the Average Extension of Molecular Chains. J. Chem. Phys. 1955, 23, 356–359. [Google Scholar] [CrossRef]
Deuschel, J.D.; Stroock, D.W. Large Deviations; Academic Press: New York, NY, USA, 1989. [Google Scholar]
Dai Pra, P.; Meneghini, L.; Runggaldier, W.J. Connections between stochastic control and dynamic games. Math. Control Signals Syst. 1996, 9, 303–326. [Google Scholar] [CrossRef]
Delle Site, L.; Ciccotti, G.; Hartmann, C. Partitioning a macroscopic system into independent subsystems. J. Stat. Mech. Theory Exp. 2017, 2017, 83201. [Google Scholar] [CrossRef]
Boué, M.; Dupuis, P. A variational representation for certain functionals of Brownian motion. Ann. Probab. 1998, 26, 1641–1659. [Google Scholar]
Hartmann, C.; Banisch, R.; Sarich, M.; Badowski, T.; Schütte, C. Characterization of rare events in molecular dynamics. Entropy 2014, 16, 350–376. [Google Scholar] [CrossRef]
Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions; Springer: New York, NY, USA, 2006. [Google Scholar]
Hartmann, C.; Schütte, C. Efficient rare event simulation by optimal nonequilibrium forcing. J. Stat. Mech. Theory Exp. 2012, 2012. [Google Scholar] [CrossRef]
Jarzynski, C. Nonequilibrium equality for free energy differences. Phys. Rev. Lett. 1997, 78, 2690–2693. [Google Scholar] [CrossRef]
Sivak, D.A.; Crooks, G.A. Thermodynamic Metrics and Optimal Paths. Phys. Rev. Lett. 2012, 109, 190602. [Google Scholar] [CrossRef] [PubMed]
Oberhofer, H.; Dellago, C. Optimum bias for fast-switching free energy calculations. Comput. Phys. Commun. 2008, 179, 41–45. [Google Scholar] [CrossRef]
Rotskoff, G.M.; Crooks, G.E. Optimal control in nonequilibrium systems: Dynamic Riemannian geometry of the Ising model. Phys. Rev. E 2015, 92, 60102. [Google Scholar] [CrossRef] [PubMed]
Vaikuntanathan, S.; Jarzynski, C. Escorted Free Energy Simulations: Improving Convergence by Reducing Dissipation. Phys. Rev. Lett. 2008, 100, 109601. [Google Scholar] [CrossRef] [PubMed]
Zhang, W.; Wang, H.; Hartmann, C.; Weber, M.; Schütte, C. Applications of the cross-entropy method to importance sampling and optimal control of diffusions. SIAM J. Sci. Comput. 2014, 36, A2654–A2672. [Google Scholar] [CrossRef]
Dupuis, P.; Wang, H. Importance sampling, large deviations, and differential games. Stoch. Int. J. Probab. Stoch. Proc. 2004, 76, 481–508. [Google Scholar] [CrossRef]
Dupuis, P.; Wang, H. Subsolutions of an Isaacs equation and efficient schemes for importance sampling. Math. Oper. Res. 2007, 32, 723–757. [Google Scholar] [CrossRef]
Vanden-Eijnden, E.; Weare, J. Rare Event Simulation of Small Noise Diffusions. Commun. Pure Appl. Math. 2012, 65, 1770–1803. [Google Scholar] [CrossRef]
Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363. [Google Scholar] [CrossRef]
Glasserman, P. Monte Carlo Methods in Financial Engineering; Springer: New York, NY, USA, 2004. [Google Scholar]
Lelièvre, T.; Stolz, G. Partial differential equations and stochastic methods in molecular dynamics. Acta Numer. 2016, 25, 681–880. [Google Scholar] [CrossRef]
Bennett, C.H. Efficient estimation of free energy differences from Monte Carlo data. J. Comput. Phys. 1976, 22, 245–268. [Google Scholar] [CrossRef]
Øksendal, B. Stochastic Differential Equations: An Introduction with Applications; Springer: Berlin, Germany, 2003. [Google Scholar]
Lapeyre, B.; Pardoux, E.; Sentis, R. Méthodes de Monte Carlo Pour les Équations de Transport et de Diffusion; Springer: Berlin, Germany, 1998. (In French) [Google Scholar]
Sivak, D.A.; Chodera, J.D.; Crooks, G.A. Using Nonequilibrium Fluctuation Theorems to Understand and Correct Errors in Equilibrium and Nonequilibrium Simulations of Discrete Langevin Dynamics. Phys. Rev. X 2013, 3, 11007. [Google Scholar] [CrossRef]
Darve, E.; Rodriguez-Gomez, D.; Pohorille, A. Adaptive biasing force method for scalar and vector free energy calculations. J. Chem. Phys. 2008, 128, 144120. [Google Scholar] [CrossRef] [PubMed]
Lelièvre, T.; Rousset, M.; Stoltz, G. Computation of free energy profiles with parallel adaptive dynamics. J. Chem. Phys. 2007, 126, 134111. [Google Scholar] [CrossRef] [PubMed]
Lelièvre, T.; Rousset, M.; Stoltz, G. Long-time convergence of an adaptive biasing force methods. Nonlinearity 2008, 21, 1155–1181. [Google Scholar] [CrossRef]
Hartmann, C.; Schütte, C.; Zhang, W. Model reduction algorithms for optimal control and importance sampling of diffusions. Nonlinearity 2016, 29, 2298–2326. [Google Scholar] [CrossRef]
Zhang, W.; Hartmann, C.; Schütte, C. Effective dynamics along given reaction coordinates, and reaction rate theory. Faraday Discuss. 2016, 195, 365–394. [Google Scholar] [CrossRef] [PubMed]
Hartmann, C.; Latorre, J.C.; Pavliotis, G.A.; Zhang, W. Optimal control of multiscale systems using reduced-order models. J. Comput. Nonlinear Dyn. 2014, 1, 279–306. [Google Scholar] [CrossRef]
Hartmann, C.; Schütte, C.; Weber, M.; Zhang, W. Importance sampling in path space for diffusion processes with slow-fast variables. Probab. Theory Relat. Fields 2017. [Google Scholar] [CrossRef]
Lie, H.C. On a Strongly Convex Approximation of a Stochastic Optimal Control Problem for Importance Sampling of Metastable Diffusions. Ph.D. Thesis, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany, 2016. [Google Scholar]
Richter, L. Efficient Statistical Estimation Using Stochastic Control and Optimization. Master’s Thesis, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany, 2016. [Google Scholar]
Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 1999. [Google Scholar]
Banisch, R.; Hartmann, C. A sparse Markov chain approximation of LQ-type stochastic control problems. Math. Control Relat. Fields 2016, 6, 363–389. [Google Scholar] [CrossRef]
Schütte, C.; Winkelmann, S.; Hartmann, C. Optimal control of molecular dynamics using Markov state models. Math. Program. Ser. B 2012, 134, 259–282. [Google Scholar] [CrossRef]
Bertsekas, D.P. Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 2011, 9, 310–355. [Google Scholar] [CrossRef]
El Karoui, N.; Hamadène, S.; Matoussi, A. Backward stochastic differential equations and applications. Appl. Math. Optim. 2008, 27, 267–320. [Google Scholar]
Bender, C.; Steiner, J. Least-Squares Monte Carlo for BSDEs. In Numerical Methods in Finance; Carmona, R., Del Moral, P., Hu, P., Oudjane, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 257–289. [Google Scholar]
Gobet, E.; Turkedjiev, P. Adaptive importance sampling in least-squares Monte Carlo algorithms for backward stochastic differential equations. Stoch. Proc. Appl. 2005, 127, 1171–1203. [Google Scholar] [CrossRef]
Hartmann, C.; Kebiri, O.; Neureither, L. Importance sampling of rare events using least squares Monte Carlo. 2018; under preparation. [Google Scholar]
Papaspiliopoulos, O.; Roberts, G.O. Importance sampling techniques for estimation of diffusions models. In Centre for Research in Statistical Methodology; Working Papers, No. 28; University of Warwick: Coventry, UK, 2009. [Google Scholar]

Figure 1. Comparison of the cross-entropy (green) and the gradient descent method (blue) for a rare event with probability

p \approx 2.867 \times 10^{- 7}

for fixed sample size

N = 10^{8}

. Both algorithms quickly converge to the optimal tilting parameter

α^{*} \approx 5.187

for the family

N (α, 1)

of importance sampling distributions (left panel) and lead to a drastic reduction of the normalized relative error by a factor of 1000, from about 2000 to 2.38 after few iterations (right panel).

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Variational Characterization of Free Energy: Theory and Algorithms

Abstract

1. Introduction

Outline

2. Certainty Equivalence

2.1. Donsker–Varadhan Variational Principle

Importance Sampling

2.2. Computational Issues

Comparison with the Standard Monte Carlo Estimator

3. Certainty Equivalence in Path Space

3.1. Donsker–Varadhan Variational Principle in Path Space

3.1.1. Likelihood Ratio of Path Space Measures

3.1.2. Importance Sampling in Path Space

3.2. Revisiting Jarzynski’s Identity

Optimized Protocols by Adaptive Importance Sampling

4. Algorithms: Gradient Descent, Cross Entropy Minimization and beyond

4.1. Gradient Descent

4.2. Cross-Entropy Minimization

4.3. Other Monte Carlo-Based Methods

4.3.1. Approximate Policy Iteration

4.3.2. Least-Squares Monte Carlo

5. Illustrative Examples

5.1. Example 1 (Moment Generating Function)

5.2. Example 2 (Rare Event Probabilities)

6. Conclusions

Acknowledgments

Author Contributions

Conflicts of Interest

Appendix A. Yet Another Certainty Equivalence

Appendix B. Ratio Estimators

Appendix B.1. The Delta Method

Appendix B.2. Asymptotic Properties of Ratio Estimators

Appendix C. Finite-Dimensional Change of Measure Formula

Appendix C.1. Gaussian Change of Measure

Appendix C.2. Reweighting

Appendix D. Proof of Theorem 2

References

Article Metrics

Citations

Article Access Statistics