Konrad-zuse-zentrum F ¨ Ur Informationstechnik Berlin Characterization of Rare Events in Molecular Dynamics Characterization of Rare Events in Molecular Dynamics

A good deal of molecular dynamics simulations aims at predicting and quantifying rare events, such as the folding of a protein or a phase transition. Simulating rare events is often prohibitive, especially if the equations of motion are high-dimensional, as is the case in molecular dynamics. Various algorithms have been proposed for efficiently computing mean first passage times, transition rates or reaction pathways. This article surveys and discusses recent developments in the field of rare event simulation and outlines a new approach that combines ideas from optimal control and statistical mechanics. The optimal control approach described in detail resembles the use of Jarzynski's equality for free energy calculations, but with an optimized protocol that speeds up the sampling, while (theoretically) giving variance-free estimators of the rare events statistics. We illustrate the new approach with two numerical examples and discuss its relation to existing methods.


Introduction
Rare but important transition events between long-lived states are a key feature of many systems arising in physics, chemistry, biology, etc.Molecular dynamics (MD) simulations allow for analysis and understanding of the dynamical behavior of molecular systems.However, realistic simulations for interesting (large) molecular systems in solution on timescales beyond microseconds are still infeasible even on the most powerful general purpose computers.This significantly limits the MD-based analysis of many biological equilibrium processes, because they often are associated with rare events.These rare events require prohibitively long simulations because the average waiting time between the events is orders of magnitude longer than the timescale of the transition characterizing the event itself.Therefore, the straightforward approach to such a problem via direct numerical simulation of the system until a reasonable number of events has been observed is impractically excessive for most interesting systems.As a consequence, rare event simulation and estimation are among the most challenging topics in molecular dynamics.
In this article, we consider typical rare events in molecular dynamics for which conformation changes or protein folding may serve as examples.They can be described in the following abstract way: The molecular system under consideration has the ability to go from a reactant state given by a set A in its state space (e.g., an initial conformation) to a product state described by another set B (e.g., the target conformation).Dynamical transitions from A to B are rare.The general situation we will address is as follows: • The system is (meta)stable, with the sets A and B being two of its metastable sets in the sense that if the system is put there, it will remain there for a long time; transitions between A and B are rare events.• The sets A and B are separated by an unknown and, in general, rough or diffusive energy landscape (that will be denoted by V ).
In addition, we will assume that the system under consideration is in equilibrium with respect to the stationary Gibbs-Boltzmann density We are interested in characterizing the transitions leading from A into B, that is, we are interested in the statistical properties of the ensemble of reactive trajectories that go directly from A to B (i.e., start in A without returning to A before going to B).In other words, we are interested in all trajectories comprising the actual transition.We would like to: • know which parts of state space such reactive trajectories visit most likely, i.e., where in state space do we find transition pathways or transition channels through which most of the probability current generated by reactive trajectories flows and • characterize the rare event statistically, i.e., compute the transition rate, the free energy barrier, the mean first passage time or even more elaborated statistical quantities.
The molecular dynamics literature on rare event simulations is rich.Since the 1930s, transition state theory (TST) [1,2] and extensions thereof based on the reactive flux formalism have provided the main theoretical framework for the description of transition events.TST can, however, at best deliver rates and does not allow one to characterize transition channels.It is based on partitioning the state space into two sets with a dividing surface in between, leaving set A on one side and the target set B on the other, and the theory only tells how this surface is crossed during the reaction.Often, it is difficult to choose a suitable dividing surface, and a bad choice will lead to a very poor estimate of the rate.The TST estimate is then extremely difficult to correct, especially if the rare event is of the diffusive type, where many different reaction channels co-exist.Therefore, many techniques have been proposed that try to go beyond TST.
These different strategies approach the problem by sampling the ensemble of reactive trajectories or by directly searching for the transition channels of the system.Most notable among these techniques are (1) Transition Path Sampling (TPS) [3]; (2) the so-called String Methods [4], or optimal path approaches [5][6][7] and variants thereof; (3) techniques that follow the progress of the transition through interfaces, like Forward-Flux Simulation (FFS) [8], Transition Interface Sampling (TIS) [9] or the Milestoning techniques [10,11]; and (4) methods that drive the molecular system by external forces with the aim of making the required transition more frequent while still allowing one to compute the exact rare event statistics for the unforced system, e.g., based on Jarzynski's and Crook's identity [12,13].All of these methods consider the problem in continuous state space, i.e., through reactive trajectories or transition channels in the original state space of the molecular system.They all face substantial problems, e.g., if the ensemble of reactive trajectories and/or transition channels of the system under consideration are too complicated (multi-modal, irregular, essentially high dimensional) or they suffer from too large variance of the underlying statistical estimators.We should moreover stress that each of these methods has its specific scope of application; some methods are mainly useful for computing transition rates, whereas others can be used to compute transition pathways or free energy differences.
Our aim is (A) to review some of these methods based on a joint theoretical basis and (B) to outline a new approach to the estimation of rare event statistics based on a combination of ideas from optimal control and statistical mechanics.In principle, this approach allows for a variance-free estimation of rare event statistics in combination with much reduced simulation time.The rest of the article is organized as follows: We start with a precise characterization of reactive trajectories, transition channels and related quantities in the framework of Transition Path Theory (TPT) in Section 2.Then, in Sections 3 and 4, we discuss the methods from classes (1)-( 3) and characterize their potential problems in more detail.In Section 5, we consider methods of type (4) as an introduction to the presentation of the new optimal control approach that is outlined in detail in Sections 6 and 7, including some numerical experiments.
Alternative, inherently discrete methods, like Markov State Modeling, that discretize the state space appropriately and try to compute transition channels and rates a posteriori based on the resulting discrete model of the dynamics will not be discussed herein and are considered in the article [14] in a way related to the presentation at hand.We should further mention that not all rare event problems in molecular dynamics are related to sampling the underlying Gibbs-Boltzmann statistics, e.g., nucleation events under shear [15] or genuine nonequilibrium systems without a stationary probability distribution [16].

Reactive Trajectories, Transition Rates and Transition Channels
Since our results are rather general, it is useful to set the stage somewhat abstractly.To this end, we borrow some notation from [17] and consider a system whose state space is R n and denote by X t the current state of the system at time t.For example, X t may be the set of instantaneous positions and momenta of the atoms of a molecular system.We assume that the system is ergodic with respect to a probability (equilibrium) distribution µ and that we can generate an infinitely long equilibrium trajectory {X t } t∈R where, for technical reasons, we let the trajectory start at time t = −∞.The trajectory will go infinitely many times from A to B and each time the reaction happens.This reaction involves reactive trajectories that can be defined as follows: Given the trajectory {X(t)} t∈R , we say that its reactive pieces are the segments during which X t is neither in A or B, came out of A last and will go to B next.To formalize things, let Then, the trajectory {X(t)} t≥0 is reactive for all t ∈ R where R ⊂ [0, ∞) is defined by the requirements ∈ A and the ensemble of reactive trajectories is given by the set where each specific continuous piece of trajectory going directly from A to B in the ensemble belongs to a specific interval [t 1 , t 2 ] ⊂ R.
Given the ensemble of reactive trajectories, we want to characterize it statistically by answering the following questions: (Q1) What is the probability of observing a trajectory at x ∈ (A ∪ B) at time t, conditional on t ∈ R? (Q2) What is the probability current of reactive trajectories?This probability current is the vector field j AB (x) with the property that given any separating surface S between A and B (i.e., the boundary of a region that contains A but not B), the surface integral of j AB over S gives the probability flux of reactive trajectories between A and B across S. (Q3) What is the transition rate of the reaction, i.e., what is the mean frequency k AB of transitions from A to B? (Q4) Where are the main transition channels used by most of the reactive trajectories?Question (Q1) can be answered easily, at least theoretically: The probability density to observe any trajectory (reactive or not) at point x is µ(x).Let q(x) be the so-called committor function, that is the probability that the trajectory starting from x reaches first B rather than A. If the dynamics are reversible, then the probability that a trajectory we observe at state x is reactive is q(x)(1 − q(x)), where the first factor appears since the trajectory must go to B rather than A next, and the second factor appears since it needs to come from A rather than B last.Now, the Markov property of the dynamics implies that the probability density to observe a reactive trajectory at point x is which is the probability of observing any trajectory in x times the probability that it will be reactive (the proportionality symbol ∝ is used to indicate identity up to normalization).

Transition Path Theory (TPT)
In order to give answers to the other questions, we will exploit the framework of transition path theory (TPT), which has been developed in [17][18][19][20] in the context of diffusions and has been generalized to discrete state spaces in [21,22].In order to review the key results of TPT, let us consider diffusive molecular dynamics in an energy landscape V : R n → R: Here, B t denotes standard n-dimensional Brownian motion, and > 0 is the temperature of the system.Under mild conditions on the energy landscape function V , we have ergodicity with respect to the stationary distribution µ(x) = Z −1 exp(−βV (x)) with β = 1/ .The dynamics are reversible with respect to this distribution, i.e., the detailed balance condition holds.We assume throughout that the temperature is small relative to the largest energy barriers, i.e., ∆V max .As a consequence, the relaxation of the dynamics towards equilibrium is dominated by the rare transitions over the largest energy barriers.
For these kind of dynamics, Questions (Q2) and (Q3) have surprisingly simple answers: The reactive probability current is given by where ∇q denotes the gradient of the committor function q.Based on this, the transition rate can be computed by the total reactive current across an arbitrary separating surface S: where n S denotes the unit normal vector on S pointing towards B and σ S the associated surface element.
The rate can also be expressed by where (A ∪ B) c denotes the entire state space excluding A and B. Given the reactive current, we can even answer Question (Q4): The transition channels of the reaction A → B are the regions of (A ∪ B) c in which the streamlines of the reactive current, i.e., the solutions of are exceptionally dense.Figure 1 illustrates these quantities for the case of a 2D three well potential with two main wells (the bottoms of which we take as A and B in the following) and a less significant third well.The three main saddle points separating the wells are such that the two saddle points between the main wells and the third well are lower in energy than the saddle point between the main wells, such that in the zero temperature limit, we expect that almost all reactive trajectories take the route through the third well across the two lower saddle points.We observe that the committor functions for low and higher temperatures exhibit smooth isocommittor lines separating the sets A and B, as expected.The transition channels computed from the associated reactive current also show what one should expect: For a lower temperature, the channel through the third well and across the two lower saddle points is dominant, while for a higher temperature, the direct transition from A to B across the higher saddle point is preferred.For details of the computations underlying the pictures, see [22].
These considerations can be generalized to a wide range of different kinds of dynamics in continuous state spaces, including, e.g., full Langevin dynamics, see [17][18][19][20].
This example illustrates that TPT in principle allows us to quantify all aspects of the transition behavior underlying a rare event.We can compute transition rates exactly and even characterize the transition mechanisms if we can compute the committor function.Deeper insight using the Feynman-Kac formula yields that the committor function can be computed as the solution of a linear boundary value problem, which for diffusive molecular dynamics reads where the generator L has the following form where ∆ = i ∂ 2 /∂x 2 i denotes the Laplace operator.This equation allows the computation of q AB in relatively low-dimensional spaces, where the discretization of L is possible based on finite element methods or comparable techniques.In realistic biomolecular state spaces, this is infeasible because of the curse of dimensionality.Therefore, TPT gives a complete theoretical background for rare event simulation, but its application in high dimensional situations is still problematic.As a remedy, a discrete version of TPT has been developed [21,22], which can be used in combination with Markov State Modeling; see [23].

Transition Path Sampling (TPS)
TPS has been developed in order to sample from the probability distribution of reactive trajectories in so-called "path space", which means nothing else than the space of all discrete or continuous paths starting in A and ending up in B equipped with the probability distribution generated by the dynamics through the ensemble of associated reactive trajectories.Let P T denote the path measure on the space of discrete or continuous trajectories {X t } 0≤t≤T of length T .The path measure of reactive trajectories then is where TPS is a Metropolis Monte-Carlo (MC) method for sampling P AB T ({X t } 0≤t≤T )) that uses explicit information regarding the path measure P T , such as Equation ( 5), with MC moves that are based on a perturbation of a precomputed reactive trajectory [3,24].It delivers an ensemble of reactive trajectories of length T that (under the assumption of convergence of the MC scheme) is representative for P AB T and thus allows one to compute respective expectation values, like the probability to observe a reactive trajectory or the reactive current.However, its potential drawbacks are obvious: (1) A typical reactive trajectory is very long and rather uninformative (cf. Figure 1), i.e., the computational effort of generating an entire ensemble of long reactive trajectories can be prohibitive; (2) convergence of the MC scheme in the infinite dimensional path space can be very poor; and (3) the limitation to a pre-defined trajectory length T can lead to biased statistics of the TPS ensemble.Advanced TPS schemes try to remedy these drawbacks by combining the original TPS idea with interface methods [9].Even though TPS can be used no matter whether the underlying dynamics is deterministic or stochastic, the algorithm is usually used in connection with deterministic Hamiltonian dynamics [3].

Finding Transition Channels
Whenever a transition channel exists, one can try to approximate the center curve of the transition channel instead of sampling the ensemble of reactive trajectories.If the center curve (also: principal curve) is a rather smooth object, then such a method would not suffer from the extensive length of reactive trajectories.Several such methods have been introduced; they differ with respect to the definition of the transition channel and the corresponding center or principal curve.

Action-Based Methods
Rather than sampling the probability distribution of reactive pathways, such as Equation ( 4), one can try to obtain a representative or dominant pathway, e.g., by computing the pathway that has maximum probability under P T .For the case of diffusive molecular dynamics, the path measure P T has a probability density relative to a (fictitious) uniform measure on the space of all continuous paths in R n of length T that are generated by Brownian motion; the relative density reads where I is the Onsager-Machlup action More precisely, (ϕ) is the limiting ratio between the probability that the solution of Equation ( 2) remains in a small tubular neighborhood of a smooth path ϕ(•) and the probability that √ 2 B t remains in a small neighborhood of the initial value x = ϕ(0), as the size of the neighborhoods go to zero [25].
The fact that the Euler discretization of the path density , with I interpreted in the sense of Itô integrals, corresponds to the probability density of the Euler-discretized reaction path with respect to Lebesgue measure has led to the idea that by minimizing the Onsager-Machlup action over all continuous paths ϕ : [0, T ] → R n going from A to B, one can find the dominant reactive path ϕ * = argmin ϕ I (ϕ) in the sense of a maximum likelihood estimator.The hope is that this path, often also called the optimal path or most probable path, on the one hand, contains information on the transition mechanism and, on the other hand, is much smoother and easier to interpret than a typical reactive trajectory.Note, however, that the actual probability that the solution of Equation ( 2) remains in a small neighborhood of a given path ϕ(•) is exponentially small in the size of the neighborhood.
In [7], a comparison between the Onsager-Machlup action and its zero temperature limit has been given using gradient descent methods, raising issues regarding the correct interpretation of the minimizers of I (that need not exist) as most probable paths.In [5], the dominant reaction pathway method has been outlined, which uses a simplified version of the Onsager-Machlup functional that leads to a computationally simpler optimization problem and is applicable to large-scale problems, e.g., protein folding [6].However, even if the globally dominant pathways can be computed, such that the optimization does not get stuck in local minima, and even if we ignore the issues regarding the correct interpretation of minimizers, the resulting pathways in general do not allow one to gain statistical information on the transition (like rates, currents, mean first passage times).
Another action-based method that has been introduced in [26] is the MaxFlux method, which seeks the path that carries the highest reactive flux among all reactive trajectories of a certain length.The idea is to compute the path of least resistance by minimizing the functional Several algorithmic approaches for the minimization of the resistance functional L have been proposed, e.g., a path-based method [27], discretization of the corresponding Euler-Lagrange equation based on a mean-field approximation of it [28] or a Hamilton-Jacobi-based approach using the method of characteristics [29].Minimizing L for different values of T then yields a collection of paths, each of which carries a certain percentage of the total reactive flux.The method is useful if the temperature is small, so that the reactive flux concentrates around a sufficiently small number of reactive pathways.

String Method and Variants
There are several other methods that entirely avoid the computation of reactive trajectories, but try to reconstruct the less complex transition channels or pathways instead, analyzing the energy landscape of the system.One group of such techniques, like the Zero Temperature String method [4], the Geometric Minimum Action method [30] or the Nudged Elastic Band method [31], concentrate on the computation of the minimal energy path (MEP), i.e., the path of lowest potential energy between (a point in) A and (a point in) B. Under diffusive molecular dynamics and for vanishing temperature, the MEP is the path that transitions take with probability one [32].It turns out that the MEP in this case is the minimizer of the Onsager-Machlup action (5) in the limit → 0. For non-zero temperature and a rugged energy landscape, the MEP will in general be not very informative and must be replaced by a finite-temperature transition channel.This is done by the finite-temperature string (FTS) method [33] based on the following considerations: Firstly, the isocommittor surfaces Γ α , α ∈ [0, 1], of the committor q are taken as natural interfaces that separate A from B. Secondly, each Γ α is weighted with the stationary distribution µ to find reactive trajectories crossing it at a certain point x ∈ Γ α , The idea of the FTS method is that the ensemble of reactive trajectories can be characterized by this distribution on the isocommittor surfaces.Third, one assumes that for each α, the probability density ρ α is peaked in just one point ϕ(α) and that the curve ϕ = ϕ(α), α ∈ [0, 1] defined by the sequence of these points forms the center of the (single) transition channel.More precisely, one defines ϕ(α) = x Γα where the average is taken according to ρ α along the respective isocommittor surface Γ α .Fourth, it is assumed that the covariance C α = (x − ϕ(α)) ⊗ (x − ϕ(α)) Γα , which defines the width of the transition channel, is small, which implies that the isocommittor surfaces can be locally approximated by hyperplanes P α .The computation of the FTS string ϕ then is done by approximating it via ϕ(α) = x Pα , where the average is computed by running constrained dynamics on P α while iteratively refining the hyperplanes P α ; see [34] for details.Later extensions [35] remove the restrictions resulting from the hyperplanes by using Voronoi tessellations instead.
The FTS method allows one to compute single transition channels in rugged energy landscapes as long as these are not too extended and rugged.Compared to methods that sample the ensemble of reactive trajectories, it has the significant advantage that the string, that is, the principal curve inside the transition channel, is rather smooth and short, as compared to the typical reactive trajectories.The FTS further allows one to compute the free energy profile F = F (α) along the string, that characterizes the transition rates associated with the transition channel (at least in the limits of the approximations invoked by the FTS).

Computing Transition Rates
The computation of transition rates can be performed without computing the dominant transition channels or similar objects.There is a list of rather general techniques, with Forward Flux Sampling (FFS) [8], Transition Interface Sampling (TIS) [9] and Milestoning [10] as examples, that approximate transition rates by exploring how the transition progresses from one to the next interface that separates A from B.

Forward Flux Sampling (FFS)
The first step of FFS is the choice of a finite sequence of interfaces I k , k = 1, . . ., N , in state space between A and B = I N .The transition rate k AB comes as the product of two factors: (1) the probability current J A of all trajectories leaving A and hitting I 1 ; and (2) the probability that a trajectory that leaves I 1 makes it to B before it returns to A; here, P(I k+1 |I k ) denotes the probability that a trajectory starting in I k makes it to I k+1 before it returns to A. FFS first performs a brute-force simulation starting in A, which yields an ensemble of points at the first interface I 1 , yielding an estimate for the flux J A (the number of trajectories hitting I 1 per unit of time).Second, a point from this ensemble on I 1 is selected at random and used to start a trajectory, which is followed until it either hits the next interface I 2 or returns to A; this gives P(I 2 |I 1 ).This procedure then is iterated from interface to interface.Finally, the rate k AB = J A • P(B|I 1 ) is computed.Variants of this algorithm are described in [36,37], for example.
FFS has been demonstrated to be quite general in approximating the flux of reactive trajectories through a given set of interfaces; it can be applied to equilibrium, as well as nonequilibrium systems, and its implementation is easy (see [16,38]).The interfaces used in FFS are, in principle, arbitrary.However, the efficiency of the sampling of the reactive hitting probabilities P(I k+1 |I k ) crucially depends on the choice of the interfaces.In practice, the efficiency of FFS will drop dramatically if one does not use appropriate surfaces, and totally misleading rates may result from this.Ideally, one would like to choose these surfaces, so that the computational gain offered by FFS in optimized, but in practice, this is not a trivial task; see [39].The same is true for TIS that couples TPS with progressing from interface to interface.

Milestoning
Milestoning [10] is similar to FFS in so far as it also uses a set of interfaces I k , k = 1, . . ., N that separate A and B = I N .In contrast to FFS and TIS, the fundamental quantities in Milestoning are the hitting time distributions K ± i (τ ), i = 1, . . ., N − 1, where K ± i (τ ) is the probability that a trajectory starting at t = 0 at interface I i hits I i±1 before time τ .Trajectories that make it to milestone I i must come from milestones I i±1 and vice versa.In the original algorithm, these distributions are approximated as follows [10]: For each milestone I i , one first samples the distribution µ constrained to I i .Based on the resulting sample, we start a trajectory from each point, which is terminated when it reaches one of its two neighboring milestones I i±1 .The hitting times are recorded and collected into two distributions K ± i (τ ).These local kinetics are then compiled into the global kinetics of the process: For each i, one defines P i (t) as the probability that the process is found between I i−1 and I i+1 at time t and that the last milestone hit was I i .Milestoning is based on a (non-Markovian) construction of P i (t) from the K ± i (τ ).Its efficiency comes from two sources: (1) It does not require the computation of long reactive trajectories but only short ones between milestones (which therefore should be 'close enough'); (2) It is easily parallelizable.Its disadvantage is the dependence on the milestones that have to be chosen in advance: It can be shown that Milestoning with perfect sampling allows one to compute exact transition rates or mean first passage times if the interfaces are given by the isocommittor surfaces (which in general are not known in advance) [40]; if the interfaces are chosen inappropriately, the results can be rather misleading.

Nonequilibrium Forcing and Jarzynski's Identity
The computation of reliable rare event statistics suffers from the enormous lengths of reactive trajectories.One obvious way to overcome this obstacle is to force the system to exhibit the transition of interest on shorter timescales.Therefore, can we drive the molecular system to make the required transition more frequently but still compute the exact rare event statistics for the unforced system?
As was shown by Jarzynski and others, nonequilibrium forcing can in fact be used to obtain equilibrium rare event statistics.The advantage seems to be that the external force can speed up the sampling of the rare events by biasing the equilibrium distribution towards a distribution under which the rare event is no longer rare.We will shortly review Jarzynski's identity before discussing the matter in more detail.

Jarzynski's Identity
Jarzynski's and Crook's formulae [12,13] relate the equilibrium Helmholtz free energy to the nonequilibrium work exerted under external forcing: Given a system with energy landscape V (x), the total Helmholtz free energy can be defined as Jarzynski's equality [12] then relates the free energy difference ∆F = −β −1 log(Z 1 /Z 0 ) between two equilibrium states of a system given by an unperturbed energy V 0 and its perturbation V 1 with the work W applied to the system under the perturbation: Suppose we set V ξ = (1 − ξ)V 0 + ξV 1 with ξ ∈ [0, 1], and assume we set a protocol that describes how the system evolves from ξ = 0 to ξ = 1.If, initially, the system is distributed according to exp(−βV 0 ), then, by the second law of thermodynamics, it follows that E(W ) ≥ ∆F where W is the total work applied to the system and E denotes the average overall possible realizations of the transition from ξ = 0 to ξ = 1; equality is attained if the transition is infinitely slow (i.e., adiabatic).Jarzynski's identity now asserts that the free energy is always equal to the exponential average of the nonequilibrium work, arbitrarily far away from the adiabatic regime.Many generalizations exist: In [13], a generalized version of this fluctuation theorem, the so-called Crook's formula, for stochastic, microscopically reversible dynamics, is derived.In [41,42], it is shown how one can compute conditional free energy profiles along a reaction coordinate for the unperturbed system, rather than total free energy differences between a perturbed and unperturbed system.
Algorithmic application prohibitive.Despite the fact that Jarzynski's and Crook's formulae are used in molecular dynamics applications [43], their algorithmic usability is limited by the fact that the likelihood ratio between equilibrium and nonequilibrium trajectories is highly degenerate, and the overwhelming majority of nonequilibrium forcings generate trajectories that have almost zero weight with respect to the equilibrium distribution that is relevant for the rare event.This leads to the fact that most rare event sampling algorithms based on Jarzynski's identity have prohibitively large variance.Recent developments have reduced this problem by sampling just the reversible work processes based on Crook's formula, but could not fully remove the problem of large variance [44]; see also [45].Because of this, we will approach the problem of variance reduction subsequently.

Cumulant Generating Functions
In order to demonstrate how to improve approaches based on the idea of driving molecular systems to make rare events frequent, we first have to introduce some concepts and notation from statistical mechanics: Let W be a random variable that depends on the sample paths of (X t ) t≥0 , i.e., on molecular dynamics trajectories of the system under investigation.Further, let P be the underlying probability measure on the space of continuous trajectories as introduced in Section 2.2 (but without the restriction to a given length T ).We define the cumulant generating function (CGF) of W by where σ is a non-zero scalar parameter and E[f ] = f dP denotes the expectation value with respect to P .Note that the CGF is basically the free energy at inverse temperature β as in Jarzynski's formula, but here, it is considered as a function of the independent parameter σ. (Definition (6) differs from the standard CGF only by the prefactor σ −1 in front.)Taylor expanding the CGF about σ = 0, we observe ; hence, for sufficiently small σ, the variance is decoupled from the mean.Moreover, it follows by Jensen's inequality that where equality is achieved if and only if W is almost surely constant, in accordance with the second law of thermodynamics.(This is the case, e.g., when W is the work associated with an adiabatic transition between thermodynamic equilibrium states.) Optimal reweighting.
The CGF admits a variational characterization in terms of relative entropies.To this end, let Q be another probability measure, so that P is absolutely continuous with respect to Q, i.e., the likelihood ratio dP/dQ exists and is Q-integrable.Then, using Jensen's inequality again, which, noting that the logarithmic term is the relative entropy (or Kullback-Leibler divergence) between Q and P , can be recast as where and we declare that H(Q P ) = ∞ if Q does not have a density with respect to P .Again, it follows from the strict convexity of the exponential function that equality is achieved if and only if the new random variable is Q-almost surely constant.This gives us the following variational characterization of the cumulant generating function that is due to [46]: Variational formula for the cumulant generating function.
Let W be bounded from above, with where the infimum runs over all probability measures Q that have a density with respect to P .Moreover, the minimizer Q * exists and is given by dQ * = e γ(σ)−σW dP .

Optimal Driving from Control Theory
When X t denotes stochastic dynamics, such as Equation ( 2), the above variational formula admits a nice interpretation in terms of an optimal control problem with a quadratic cost.To reveal it, we first need some technical assumptions.
a bounded open set with smooth boundary ∂O.Further, let τ < ∞ be the stopping time i.e., τ is the stopping time that either t = T or X t leaves the set O, whichever comes first.(A2) The random variable W is of the form for some continuous and nonnegative functions f, g : R n → R, which are bounded from above and at most polynomially growing in x (compare Jarzynski's formula).(A3) The potential V : R n → R in Equation ( 2) is smooth, bounded below and satisfies the usual local Lipschitz and growth conditions.
We consider the conditioned version of the moment generating function (which is just the exponential of the cumulant generating function): By the Feynman-Kac theorem, ψ σ solves the linear boundary value problem where E + is the terminal set of the augmented process (t, X t ), precisely is the backward evolution operator associated with X t , with the shorthand introduced in Equation (3).Assumptions (A1)-(A3) guarantee that Equation ( 11) has a unique smooth solution ψ σ for all σ > 0.Moreover, the stopping time τ is almost surely finite, which implies that for some constant c ∈ (0, 1).
Log transformation of the cumulant generating function.In order to arrive at the optimal control version of the variational formula (9), we introduce the logarithmic transformation of ψ σ as which is analogous to the CGF γ, except for the leading factor and the dependence on the initial condition x.As we will show below, v σ is related to an optimal control problem.To see this, remember that ψ σ is bounded away from zero and note that which implies that Equation ( 11) is equivalent to Equivalently, where we have used that (For the general framework of change-of-measure techniques and Girsanov transformations and their relation to logarithmic transformations, we refer to ( [47] (Section VI.3)).)Optimal control problem.Equation ( 12) is a Hamilton-Jacobi-Bellman (HJB) equation and is recognized as the dynamic programming equation of the following optimal control problem: minimize over a suitable space of admissible control functions u : [0, ∞) → R n and subject to the dynamics Form of optimal control.In more detail, one can show (e.g., see ( [47] (Section IV.2))) that assumptions (A1)-(A3) above imply that Equation ( 12) has a classical solution (i.e., twice differentiable in x, differentiable in t and continuous at the boundaries).Moreover, v σ satisfies where u * is the unique minimizer of J(u) that is given by the Markovian feedback law The function v σ is called the value function or optimal-cost-to-go for the optimal control problems ( 13) and (14).Specifically, v σ (x, t) measures the minimum cost needed to drive the system to the terminal state when started at x at time t.We briefly mention the two most relevant special cases of ( 13) and ( 14).

Case I: The Exit Problem
We want to consider the limit T → ∞.To this end, call τ O = inf{t > 0 : As a consequence (using monotone convergence), v σ converges to the value function of an optimal control problem with cost functional It can be shown that the value function with u * = argmin J ∞ (u) is independent of the initial time t; hence, we can drop the dependence on t and redefine v σ (x) := v σ (x, t).The value function now solves the boundary value HJB equation

Case II: Finite Time Horizon Optimal Control
In this case, v σ converges to the value function with a finite time horizon and cost functional Now, v σ is again a function on R n × [0, T ] and given by with u * being the minimizer of J T (u).The value function solves the backward evolution HJB equation with a terminal condition at time t = T .

Optimal Control Potential and Optimally Controlled Dynamics
The optimal control u * that minimizes the functional in Equation ( 13) is again of gradient form and given by u * t = −2σ∇v σ (X t , t) as can be readily checked by minimizing the corresponding expression in Equation ( 12) over α.Given v σ , the optimally controlled dynamics reads with the optimal control potential In the case when T → ∞ (Case I, above), the biasing potential is independent of t.
Remarks.Some remarks are in order.
(a) Monte-Carlo estimators of the conditional CGF that are based on the optimally controlled dynamics have zero variance.This is so because the optimal control minimizes the variational expression in Equation ( 9), but at the minimum, the random variable inside the expectation must be almost surely constant (as a consequence of Jensen's inequality and the strict convexity of the exponential function).Hence, we have a zero-variance estimator of the conditional CGF.(b) The reader may now wonder as to whether it is possible to extract single moments from the CGF (e.g., mean first passage times).In general, this question is not straightforward to answer.One of the difficulties is that extracting moments from the CGF requires one to take derivatives at σ = 0, but small values of σ imply strong penalization, which renders the control inactive and, thus, makes the approach inefficient.Another difficulty is that reweighting the controlled trajectories back to the original (equilibrium) path measure can increase the variance of a rare event estimator, as compared to the corresponding estimator based on the uncontrolled dynamics.As yet, the efficient calculation of moments from the CGF by either extrapolation methods or reweighing is an open question and currently a field of active research (see, e.g., [48,49]).(c) Jarzynski's identity relates equilibrium free energies to averages that are taken over an ensemble of trajectories generated by controlled dynamics, and the reader may wonder whether the above zero-variance property can be used in connection with free energy computations à la Jarzynski (cf.[45]).Indeed, we can interpret the CGF as the free energy of the nonequilibrium work where f is the nonequilibrium force exerted on the system under driving it with some prescribed protocol ξ : [0, T ] → R; in this case, the dynamics X t depend on ξ t , as well, and writing down the HJB equation according to Equation ( 19) is straightforward.However, even if we can solve Equation ( 19), we do not get zero-variance estimators for the free energy The reason for this is simple: Jarzynski's formula requires that the initial conditions are chosen from an equilibrium distribution, say, π 0 the equilibrium distribution corresponding to the initial value ξ 0 of the protocol, but optimal controls are defined point-wise for each state (t, X t ) and In other words: (d) A similar argument as the one underlying the derivation of the HJB equation from the linear boundary value problem yields that Jarzynski's formula can be interpreted as a two-player zero-sum differential game (cf.[50]).

Characterize Rare Events by Optimally Controlled MD
Now, we illustrate how to use the results of the last section in practice.We will mainly consider the case discussed in Section 6.1 regarding the statistical characterization of hitting a certain set.

First Passage Times
Roughly speaking, the CGF encodes information about the moments of any random variable W that is a functional of the trajectories (X t ) t≥0 .For example, for f = and T → ∞, we obtain the CGF of the first exit time from O, i.e., where we have introduced the shorthand to denote the conditional expectation when starting at X 0 = x and the superscript "u" to indicate that the expectation is understood with respect to the controlled dynamics where E = E 0 denotes expectation with respect to the unperturbed dynamics.

Committor Probabilities Revisited
It is not only possible to use the moment generating function to collect statistics about rare events in terms of the cumulant generating function, but also to express the committor function directly in terms of an optimal control problem (see Section 2.1 for the definition of the committor q AB between to sets A and B).To this end, let σ = 1, and suppose we divide ∂O into two sets B ⊂ ∂O and A = ∂O \ B (i.e., τ O is the stopping time that is defined by hitting either A or B).Setting f = 0 and g(x) = − log 1 B (x) reduces the moment generating function (10) to or, in more familiar terms, According to Equation ( 16) the corresponding optimal control problem has the cost functional which amounts to a control problem with zero terminal cost when ending up in B and an infinite terminal cost for hitting A. Therefore, the HJB equation for v = v 1 has a singular boundary value at A; it reads Setting v(x) = − log q AB (x) yields the equality In this case, the optimally controlled dynamics (20) is of the form with optimal control potential U AB (x) = V (x) − 2 log q AB (x) .
Remarks.Some remarks on the committor equation follow: (a) The logarithmic singularity of the value function at "reactant state" A has the effect that the control will try to avoid running back into A, for there is an infinite penalty on hitting A. In other words, by controlling the system, we condition it on hitting the "product state" B at time t = τ O .Conditioning a diffusion (or general Markov) process on an exit state has a strong connection with Doob's h-transform, which can be considered a change-of-measure transformation of the underlying path measure that forces the diffusion to hit the exit state with probability one [51].(b) The optimally controlled dynamics has a stationary distribution with a density proportional to exp(−βU AB (x)) = q 2 AB (x) exp(−βV (x)) where we used β = 1/ .

Algorithmic Realization
For the exit problem ("Case I", above), one can find an efficient algorithm for computing the conditional CGF γ(σ; x) or, equivalently, the value function v σ (x) in [52].The idea of the algorithm is to exploit that, according to Equations ( 20) and ( 21), the optimal control is of gradient form.The latter implies that the value function can be represented as a minimization of the cost functional over time-homogeneous candidate functions C for the optimal bias potential, in other words, where the expectation E is understood with respect to the path measure generated by Once the optimal C has been computed, both value function and CGF can be recovered by setting The algorithm that finds the optimal C works by iteratively minimizing the cost functional for potentials C from a finite-dimensional ansatz space, i.e., C(x) = M j=1 a j ϕ j (x) , with appropriately chosen ansatz functions ϕ j .The iterative minimization is then carried out on the M -dimensional coefficient space of the a 1 , . . ., a M .With this algorithm, we are able to compute the optimal control potential for the exit problem in the two interesting cases: first passage times and committor probabilities (as outlined in Sections 7.1 and 7.2).
Remarks.Let us briefly comment on some aspects of the gradient descent algorithm.
(a) The minimization algorithm for the value function belongs to the class of expectation-maximization algorithms (although, here, we carry out a minimization rather than a maximization), in that each minimization step is followed by a function evaluation that involves computing an expectation.In connection with rare events sampling and molecular dynamics problems, a close relative is the adaptive biasing force (ABF) method for computing free energy profiles, the latter being intimately linked with cumulant generating functions or value functions (cf.Section 5).In ABF methods (or its variants, such as metadynamics or Wang-Landau dynamics), the gradient of the free energy is estimated on the fly, running a molecular dynamics simulation, and then added as a biasing force to accelerate the sampling in the direction of the relevant coordinates [53,54].The biasing force eventually converges to the derivative of the free energy, which is the optimal bias for passing over the relevant energy barriers that are responsible for the rare events [55].
(b) The number of basis functions needed depends mainly on the roughness of the value function, but is independent of the system dimension.For systems with clear time scale separation, it has been moreover shown [56] that the optimal control is independent of the fast variables; hence, we expect that the algorithm can be efficient, even for large-scale systems, provided that some information about the relevant collective variables and a reasonable initial guess are available.Yet, the question remains: How many basis functions are needed to approximate the optimal control up to a given accuracy?Controlling the error in the value function and the resulting optimal control is particularly important, as a wrong (e.g., suboptimal) bias potential may lead to Monte-Carlo estimators that may have a larger variance than the vanilla rare event estimator, as has been pointed out in [57,58].The first results in this direction have been obtained in [59], in which error bounds for the CGF for suboptimal controls have been derived, and [60], which discusses the approximation error of the Galerkin approximation of the corresponding HJB equations; see also [61]for a related discussion regarding so-called log-efficient estimators for rare events.

Numerical Examples
In our first example, we consider diffusive molecular dynamics as in Equation ( 2) with = 0.1 and V being the five-well potential shown in Figure 2. We first compute the CGF of the first passage time as discussed in Section 7.1, using the gradient descent algorithm described in Section 7.3 with 10 Gaussian ansatz functions that are centered around the critical points of the potential energy function.The resulting optimal control potential (21) after roughly 20 iterations of the gradient descent is displayed in Figure 2 for different values of σ.As the set O, we take the whole state space, except a small neighborhood of its global minimum of V , so that its complement O c is identical to the vicinity of the global minimum and the exit time τ O is the first passage time to O c . Figure 2 shows that the optimal control potential alters the original potential V significantly in the sense that for σ > 0, the set O c is the bottom of the only well of the potential, so that all trajectories starting somewhere else will quickly enter O c .This case is instructive: For the unperturbed original dynamics, the mean first passage time E x (τ O ) takes values of around 10 4 for x > −2.For the optimally controlled dynamics, the mean first passage times into O c are less than five for σ = 0.1, 0.5, 1.0, so that the estimation of E x (τ O ) resulting from the optimal control approach requires trajectories that are a factor of at least 10 3 shorter than the ones we would have to use by direct numerical simulation of the unperturbed dynamics.
Figure 3 shows the optimal control potentials for computation of the committor q AB , as described in Section 7.2.We observe that the optimal control potential exhibits a singularity at the boundary of the basin of attraction of the set A. That is, it prevents the optimally controlled dynamics from entering the basin of attraction of A and, thus, avoids the waste of computational effort by unproductive returns to A.
In our second example, we consider two-dimensional diffusive molecular dynamics as in Equation (2) with the energy landscape V being the three-well potential shown in Figure 1.In Figure 4, the optimal control potential for computing the committors q AB between the two main wells for two different temperatures = 0.15 and = 0.6 are displayed.The numerical solution is based on a Galerkin approximation of the log-transformed HJB equation, using precomputed committor functions as the basis set; see [60] for details.As in our former experiment, we observe that the optimal control potential prevents the dynamics from returning to A; in addition, it flattens the third well significantly, such that the optimally controlled dynamics in any case quickly goes into B. For = 0.15, a TPS sampling of reactive trajectories between the two main wells, precisely from A to B with A and B, as indicated in Figure 4, results in an average length of 367 for reactive trajectories based on the original dynamics.For the optimally controlled dynamics, we found an average length of 1.3.Optimally-corrected potential for the three-well potential shown in Figure 1 for the committor q AB for the medium temperature = 0.6 case (left), the low temperature = 0.15 case (right) and for the sets A (ellipse in main well, right-hand side) and B (ellipse in main well, left-hand side).Note that the committor basis is not smooth at the boundaries of the initial and target sets (see Figure 1 for comparison), which explains the roughness of the control potential in the neighborhood of the sets A and B.

Conclusions
We have surveyed various techniques for the characterization and computation of rare events occurring in molecular dynamics.Roughly, the approaches fall into two categories: (a) methods that approach the problem by characterizing the ensemble of reactive trajectories between metastable states or (b) path-based methods that target dominant transition channels or pathways by minimization of suitable action functionals.Methods of the first type, e.g., Transition Path Theory, Transition Path Sampling, Milestoning or variants thereof, are predominantly Monte-Carlo-type methods for generating one very long or many short trajectories, from which the rare event statistics can then be estimated.Methods that belong to the second category, e.g., MaxFlux, Nudged-Elastic Band or the String Method, are basically optimization methods (sometimes combined with a Monte-Carlo scheme); here, the objectives are few (single or multiple) smooth pathways that describe, e.g., a transition event.It is clear that this classification is not completely unambiguous, in that action-based methods for computing most probable pathways can be also used to sample an ensemble of reactive trajectories.Another possible classification (with its own drawbacks) is along the lines of the biased-unbiased dichotomy that distinguishes between methods that characterize rare events based on the original dynamics and methods that bias the underlying equilibrium distribution towards a new probability distribution under which the rare events are no longer rare.Typical representatives of the second class range from biasing force methods, such as ABF or metadynamics, up to genuine nonequilibrium approaches based on Jarzynski's identity for computing free energy profiles.The problem often is that rare event estimators based on an ensemble of nonequilibrium trajectories suffer from large variances, unless the bias is cleverly chosen.
We have described a strategy to find such a cleverly chosen perturbation, based on ideas from optimal control.The idea rests on the fact that the cumulant generating function of a certain observable, e.g., the first exit time from a metastable set, can be expressed as the solution to an optimal control problem, which yields a zero variance estimator for the cumulant generating function.The control acting on the system has essentially two effects: (1) Under the controlled dynamics, the rare events are no longer rare, as a consequence of which the simulations become much shorter; (2) The variance of the statistical estimators is small (or even zero if the optimal control is known exactly).We should stress that, depending on the type of observable, the approach only appears to be a nonequilibrium method, for the optimal control is an exact gradient of a biasing potential; hence, the optimally perturbed system satisfies a detailed balance, which is one criterion for thermodynamic equilibrium.Future research should address the question as to whether the approach is competitive for realistic molecular systems, how to efficiently and robustly extract information about specific moments rather than cumulant generating functions and how to extend it to the more general observables or the calculation of free energy profiles.

Figure 1 .
Figure 1.(Top left panel) Three-well energy landscape V as described in the text.(Top right panel) Typical reactive trajectory in the three-well landscape.(Middle left panel) Committor functions q AB for diffusion molecular dynamics with relatively high temperature = 0.6 for the sets A (main well, right-hand side) and B (main well, left-hand side).(Middle right panel) Committor q AB for the low temperature case = 0.15.(Bottom left panel) Transition channels for = 0.6.(Bottom right panel) Transition channels for = 0.15.For details of the computations underlying the pictures, see [22].

Figure 2 .
Figure2.Five-well potential (left) and associated optimal control potential for the first passage time to the target set O c given by a small interval around the main minimum x 1 (right) for different values of σ (right).= 0.1; the gradient descent solution fully agrees with the reference finite element solution (that is not shown) in the "eye-norm".

Figure 3 .
Figure 3. Optimally-corrected potential for the case of J being the committor q AB for B being the ±0.1-interval around the main minimum x 1 of the potential.(Left panel) A =]x 3 − 0.1, x 3 + 0.1[ the ±0.1 interval around the highest minimum x 3 .(Right panel) A =]x 2 − 0.1, x 2 + 0.1[ the ±0.1 interval around the second lowest minimum x 2 .

Figure 4 .
Figure 4.Optimally-corrected potential for the three-well potential shown in Figure1for the committor q AB for the medium temperature = 0.6 case (left), the low temperature = 0.15 case (right) and for the sets A (ellipse in main well, right-hand side) and B (ellipse in main well, left-hand side).Note that the committor basis is not smooth at the boundaries of the initial and target sets (see Figure1for comparison), which explains the roughness of the control potential in the neighborhood of the sets A and B.