In recent work, we have illustrated the construction of an exploration geometry on free energy surfaces: the adaptive, computer-assisted discovery of an approximate low-dimensional manifold on which the effective dynamics of the system evolves. Constructing such an exploration geometry involves geometry-biased sampling (through both appropriately initialized unbiased molecular dynamics and restraining potentials) and machine learning techniques to organize the intrinsic geometry of the data resulting from the sampling (in particular, diffusion maps, possibly enhanced through an appropriate Mahalanobis-type metric). In this contribution, we detail a method for exploring the conformational space of a stochastic gradient system whose effective free energy surface depends on a smaller number of degrees of freedom than the dimension of the phase space. Our approach comprises two steps. First, we study the local geometry of the free energy landscape using diffusion maps on samples computed through stochastic dynamics. This allows us to automatically identify the relevant coarse variables. Next, we use the information garnered in the previous step to construct a new set of initial conditions for subsequent trajectories. These initial conditions are computed so as to explore the accessible conformational space more efficiently than by continuing the previous, unbiased simulations. We showcase this method on a representative test system.

In its most straightforward formulation, Molecular Dynamics (MD) consists of solving Newton’s equations of motion for a molecular system described with atomic resolution. The goal of performing MD simulations is twofold: on the one hand, we want to gather samples from a given thermodynamic ensemble, while, on the other hand, we may seek to gain insight into time-dependent behavior. The first objective leads us to equilibrium properties. The second yields kinetic properties and is the reason why it is said that MD acts as a computational microscope. Recent success stories involving systems having more than one million atoms [

The possibility of using MD to study larger biomolecules at longer time scales is hindered by the problem of time scale separation. While the processes of interest (protein folding, permeation of cellular membranes, etc.) act on timescales of milliseconds to minutes, we are currently restricted by limitations in available computer capabilities and algorithms to simulations spanning timescales of microseconds. Moreover, to ensure stability when numerically integrating the equations of motion, we need to take steps of just a few femtoseconds. The reader interested in the numerical analysis of integration schemes in MD is referred to the excellent treatise [

The inherent difficulty behind the problem of timescale separation lies in the fact that many biophysical systems display metastability. That is, the solutions of the equations of motion spend large amounts of time trapped in the basins of attraction of local free energy minima, called metastable states [

It is often possible to identify a suitable set of so-called collective or coarse variables describing the progress of the process being studied (i.e., a “slow manifold”). The simplest such “coarse variable” is perhaps the interatomic distance in the process of the dissociation of a diatomic molecule. In other cases, a subset of dihedral angles on the amino acids of a peptide proves to be a good choice. In practice, it is not always clear how to devise good coarse variables a priori, and it is necessary to rely on the expertise of computational chemists to postulate these variables with varying degrees of success. Of course, the quality of the coarse variables can be assessed a posteriori by methods such as the histogram test, etc. [

In this paper, we present a detailed account of the iMapD [

The paper is organized as follows.

Consider a mechanical system whose conformational space is denoted by

Let:

We will refer to the operator on the right-hand side of the partial differential equation in (

For systems with time scale separation, there will arise a spectral gap; that is,

The components of

Observe that the parameter

The explicit computation of the eigenfunctions

By multiplying

The matrix
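The eigencomputation just described can be sketched in a few lines. The following is a minimal illustration only (Gaussian kernel, simple row normalization, no Mahalanobis metric or density correction), not the production pipeline:

```python
import numpy as np

def diffusion_maps(X, eps, n_evecs=3):
    """Minimal diffusion-maps sketch: Gaussian kernel, Markov (row)
    normalization, and eigendecomposition of the random-walk matrix.
    X is an (n_points, n_dims) array of samples; eps is the kernel scale."""
    # Pairwise squared distances and Gaussian kernel.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)
    # Row-normalize to obtain a Markov (random-walk) matrix.
    P = K / K.sum(axis=1, keepdims=True)
    # Right eigenvectors of P supply the diffusion-map coordinates;
    # the top eigenvalue is 1 with a constant (trivial) eigenvector.
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    return evals.real[order[:n_evecs]], evecs.real[:, order[:n_evecs]]
```

In practice one discards the trivial first eigenvector and uses the next few nontrivial eigenvectors as coarse variables.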

As a more realistic example, we analyze a one microsecond-long simulation of the catalytic domain of the human tyrosine protein kinase ABL1 [

We obtain the first two coarse variables,

The previous considerations are motivated by, and consistent with, statistical mechanics; however, it is important to emphasize that the DMAPS method will work, in the sense that it will provide a parameterization of the manifold, just as well with data points on the manifold not necessarily coming from sampling the solution of (

As we stated in the Introduction, the iMapD method is aimed at enhancing the sampling of unexplored regions of the conformational space of a system. The method works by first running an ensemble of independent trajectories initialized from an initial configuration for a short time (e.g., a few nanoseconds). The points comprising the trajectories are actually samples of the local free energy minimum to which the initial configuration belongs. Next, we perform a diffusion map computation, giving us a set of coarse variables that parameterize the current basin of attraction, and we locate (in DMAP coordinates) the boundary of the region that our set of points has explored so far. By extending the boundary outwards in its normal direction, we get a new tentative boundary whose points we realize in the original, high-dimensional conformational space (typically by resorting to a suitable biasing potential). Finally, the new points are used as initial conditions in a new batch of simulations. By actively restarting simulations from the extrapolated points, we enhance the ability of the system to exit local free energy minima and to explore new regions of conformational space.

In order to illustrate the applicability of our method, we demonstrate how the algorithm works on a simple, yet non-trivial model system, which can be studied in-depth by numerically solving the stochastic differential equations involved.

Let:

The above system of SDEs exhibits the most meaningful qualitative aspect of the type of problems that iMapD is designed for: a phase space of higher dimensionality than the manifold on which the effective dynamics occurs. Indeed, our system, despite being three-dimensional, has by construction a two-dimensional attractor located on the surface of a cylinder with radius
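Since the specific drift and noise terms are given elsewhere in the text, the sketch below substitutes a hypothetical three-dimensional gradient system with the same qualitative feature: a stiff radial potential confining trajectories to a cylinder of radius 1, plus a slow double well along the axis, integrated with the Euler–Maruyama scheme. Both the potential and the parameter values are illustrative assumptions.

```python
import numpy as np

def euler_maruyama(grad_v, x0, dt, n_steps, beta, rng):
    """Euler-Maruyama integration of the overdamped Langevin equation
    dX = -grad V(X) dt + sqrt(2/beta) dW; grad_v returns the gradient of V."""
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps + 1, x.size))
    traj[0] = x
    sigma = np.sqrt(2.0 * dt / beta)
    for k in range(n_steps):
        x = x - grad_v(x) * dt + sigma * rng.standard_normal(x.size)
        traj[k + 1] = x
    return traj

def grad_v(x):
    """Hypothetical potential with a 2D cylindrical attractor of radius 1:
    stiff radial confinement plus a slow double well in the axial variable."""
    r = np.hypot(x[0], x[1])
    radial = 50.0 * (r - 1.0)  # stiff term pulling toward the cylinder r = 1
    return np.array([radial * x[0] / r,
                     radial * x[1] / r,
                     4.0 * x[2] * (x[2] ** 2 - 1.0)])  # double well in z
```

A trajectory started off the cylinder quickly relaxes onto it and then diffuses slowly along the cylinder surface, mimicking the time scale separation discussed above.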

In order to sample the conformational space of the system, we begin by running a single trajectory long enough that it becomes trapped in one of the metastable sets. We process the trajectory so that the initial transient descent is removed and the points on the manifold have a more uniform distribution (e.g., by removing nearest neighbors that are closer than a fixed minimum distance). We then locate the boundary of the currently sampled area by running the alpha-shapes boundary detection method, which will be described in
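The minimum-distance pruning mentioned above can be sketched as a single greedy pass over the trajectory (a simple stand-in; `d_min` is a user-chosen threshold):

```python
import numpy as np

def subsample_min_distance(points, d_min):
    """Greedy pruning: keep a point only if it lies at least d_min away
    from every point kept so far, yielding a more uniform point cloud."""
    kept = [points[0]]
    for p in points[1:]:
        dists = np.linalg.norm(np.asarray(kept) - p, axis=1)
        if dists.min() >= d_min:
            kept.append(p)
    return np.asarray(kept)
```

The quadratic cost of this pass is acceptable for the moderate point-cloud sizes arising between iterations; a spatial index could be substituted for larger datasets.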

To create

To further illustrate the expansion of the point-cloud throughout the iterative process, we show in

In this section we introduce several techniques on which the iMapD method relies.

As we previously mentioned, conventional diffusion maps are obtained by solving the eigenproblem corresponding to the Laplace–Beltrami operator on a domain with reflecting (Neumann) boundary conditions [

The independent eigenfunctions,

Extrapolating directly in cosine-diffusion map space presents some difficulties. This is because the parameterization near the edges of the currently explored region is flat, and extending functions in the diffusion map coordinates gives rise to ambiguities [

To approximate these eigenfunctions using the samples

We make two observations. First, note that only the first nontrivial sine-coordinate is of importance: the subsequent eigenvectors are simply higher harmonics of the first. Because of this, the parameterization of a 2D nonlinear manifold can be accomplished with one sine-coordinate and one cosine-coordinate. Automatic detection of higher harmonics can be carried out in a variety of ways; here, we will just mention that we can accomplish this by studying the functional dependence between the eigenfunctions, and we refer the reader to the treatment in [

To systematically determine which cosine-coordinate is poorly behaved for each boundary point, we examine the

Once the data are divided into groups based on which cosine-coordinate to replace and the sign of its eigenvector, the boundary points can be extended and mapped to the original conformational space using the same techniques as for cosine-diffusion maps. A sample manifold extended via sine-diffusion maps with geometric harmonics is shown in

Rather than using DMAPS coupled with geometric harmonics, one could also use LPCA to extend the manifold. LPCA is simpler than DMAPS, but it requires a local set of collective variables for each boundary point rather than a single, global set of collective variables for the entirety of the data.

LPCA is based on PCA, a widely-used dimensionality reduction technique [

Given

Since

Dimensional reduction occurs when only the first
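As a concrete illustration, PCA on a local patch can be performed via the singular value decomposition of the centered data; this sketch returns the top-k principal directions and the reduced coordinates:

```python
import numpy as np

def local_pca(patch, k):
    """PCA on a local patch of points: center the data, take the SVD,
    and return the mean, the top-k principal directions, and the
    coordinates of the patch in the reduced basis."""
    mean = patch.mean(axis=0)
    centered = patch - mean
    # Rows of vt are principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:k]
    coords = centered @ directions.T  # coordinates in the reduced basis
    return mean, directions, coords
```

Points in the reduced basis can be mapped back to the ambient space via `mean + coords @ directions`, which is how extended boundary points are realized in conformational space under LPCA.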

For use in the proposed exploration algorithm, we must first locate the edge points of the underlying manifold. Then, to obtain a reduced description, we can perform LPCA on small “patches” surrounding each boundary point [

Note that extended points within the manifold of

The success of our proposed algorithm is contingent on the ability to identify the boundary of the set of samples collected so far in the currently visited metastable state. There exist at least two types of boundary detection algorithms: methods that find the concave hull around the sampled points, that is, the tightest piecewise-linear surface containing all of the points; and more general methods that attempt to classify all of the data points so as to determine which samples belong to the boundaries. For a

The first set of algorithms constructs the concave hull of the dataset (an optimal polytope that contains all points while minimizing volume) and includes, e.g., the swinging arm [

In the alpha-shapes algorithm [
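A common way to realize the alpha-shapes criterion in two dimensions is to filter Delaunay triangles by their circumradius; the sketch below (using `scipy.spatial.Delaunay`, and returning edges that belong to exactly one retained triangle) is one such illustrative implementation, not necessarily the exact variant used in our computations:

```python
import numpy as np
from scipy.spatial import Delaunay

def alpha_shape_edges(points, alpha):
    """2D alpha-shape sketch: keep Delaunay triangles whose circumradius
    is below 1/alpha; boundary edges are those that belong to exactly one
    kept triangle. Returns the boundary edges as sorted index pairs."""
    tri = Delaunay(points)
    edge_count = {}
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        # Circumradius R = (la * lb * lc) / (4 * area) for side lengths.
        la = np.linalg.norm(b - c)
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        u, v = b - a, c - a
        area = 0.5 * abs(u[0] * v[1] - u[1] * v[0])
        if area == 0.0:
            continue  # skip degenerate triangles
        if la * lb * lc / (4.0 * area) < 1.0 / alpha:
            for i, j in ((0, 1), (1, 2), (0, 2)):
                e = tuple(sorted((simplex[i], simplex[j])))
                edge_count[e] = edge_count.get(e, 0) + 1
    return {e for e, n in edge_count.items() if n == 1}
```

Interior edges are shared by two retained triangles and are discarded; the surviving edges trace the boundary of the explored region for the chosen value of alpha.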

Let

Consider the point

Let

In this section, we review the construction of geometric harmonics introduced in [

Let us define the kernel:

For

The accuracy of the extrapolation method described above depends on the relative error between
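The Nyström-type extension underlying geometric harmonics can be sketched as follows; here `delta` plays the role of the condition-number cutoff on the retained eigenvalues (the kernel choice and default values are illustrative):

```python
import numpy as np

def geometric_harmonics_extend(X, f, X_new, eps, delta=1e-8):
    """Nystrom-style extension underlying geometric harmonics: expand f
    in the eigenfunctions of the kernel on the samples X, retain only
    eigenvalues above delta * lambda_max, and evaluate the truncated
    expansion at the new points X_new."""
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / eps)
    evals, evecs = np.linalg.eigh(K)
    keep = evals > delta * evals.max()
    lam, phi = evals[keep], evecs[:, keep]
    coeffs = phi.T @ f  # projection of f onto retained eigenfunctions
    # Kernel between new points and samples; extend each eigenfunction.
    K_new = np.exp(-((X_new[:, None, :] - X[None, :, :]) ** 2).sum(-1) / eps)
    phi_new = K_new @ phi / lam  # Nystrom extension of eigenfunctions
    return phi_new @ coeffs
```

Raising `delta` trades extension accuracy on the sampled set for numerical stability away from it, which is exactly the trade-off discussed above.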

A complete treatment of geometric harmonics can be found in [

The algorithm we propose performs a systematic search for unknown metastable states on the attracting manifold of a high-dimensional molecular system without a priori knowledge of coarse variables. The method relies on an external molecular dynamics package to numerically solve the equations of motion in (typically short) simulations, starting from a single set of initial conditions as input. There are also a number of problem-dependent algorithmic parameters (e.g., alpha-shape parameters, extrapolation step lengths, etc.); the ones germane to iMapD are reported. The steps of the algorithm are detailed below:

Collection of an initial set of samples: The molecular system is initialized and evolved long enough that it arrives at some basin of attraction. After removing the initial transient points generated while the trajectory approaches the attracting manifold, the remaining data points constitute the initial set of samples (point cloud) on the manifold. These samples will be used in the subsequent steps of the method.

Parameterization of point cloud in lower dimensions: Using the set of samples from the previous step, we extract an optimal (and typically low-dimensional) set of coarse variables using DMAPS (for example, with cosine-diffusion maps). This process yields a parameterization of the local geometry of the free energy landscape around the region being currently visited by our system. All of our points are then mapped to the new set of coarse variables, thereby reducing the dimensionality of the system.

Outward extrapolation in low-dimensional space: After identifying the current generation of boundary points in the space of coarse variables (for example, via the alpha-shapes algorithm), we obtain additional points by extrapolating in the direction normal to the boundary.

Lifting of points from the (local) space of coarse variables to the conformational space: In order to continue the simulation, we must obtain a realization in conformational space of the newly-extended points in DMAP (or other reduced) space. In other words, we need a sufficient number of points in conformational space that are consistent with the DMAP (reduced) coordinates of the newly-extrapolated points. In the present paper, we use geometric harmonics, but in general, this task can be accomplished using biasing potentials, such as those available in PLUMED [

Repetition until the landscape is sufficiently explored: The lifted points serve as guesses for regions of the manifold that are yet to be probed. The system is reinitialized at these points (usually by running new parallel simulations), and the unexplored space is progressively discovered. This process is then repeated, effectively growing the set of sampled points on the free energy landscape.
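Putting the five steps together, the overall loop can be sketched on a toy two-dimensional cloud. In this sketch, the "simulator" is a diffusive jitter, the DMAPS reduction is the identity (the cloud is already low-dimensional), the convex hull stands in for alpha-shapes, and steps away from the centroid stand in for boundary-normal extrapolation; all of these are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import ConvexHull

def explore(simulate, x0, n_iters, step):
    """Skeleton of the exploration loop: grow the point cloud, detect its
    boundary, extrapolate outward, and restart simulations there."""
    cloud = simulate(np.atleast_2d(x0))
    for _ in range(n_iters):
        # Boundary detection: convex hull stands in for alpha-shapes;
        # dimensionality reduction is the identity in this 2D toy.
        hull = ConvexHull(cloud)
        boundary = cloud[hull.vertices]
        # Outward extrapolation: step away from the cloud centroid as a
        # crude surrogate for the boundary-normal direction.
        center = cloud.mean(axis=0)
        normals = boundary - center
        normals /= np.linalg.norm(normals, axis=1, keepdims=True)
        new_ics = boundary + step * normals
        # Lifting is the identity here; restart simulations at new_ics.
        cloud = np.vstack([cloud, simulate(new_ics)])
    return cloud

def jitter_simulate(ics, rng=np.random.default_rng(2)):
    """Toy 'simulator': short diffusive bursts around each initial condition."""
    return np.vstack([ic + 0.05 * rng.standard_normal((20, 2)) for ic in ics])
```

Running a few iterations of `explore(jitter_simulate, [0.0, 0.0], 3, 0.5)` produces a point cloud whose extent grows with each generation of extrapolated initial conditions, mirroring the intended behavior of the full method.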

In practice, this process begins with the initial simulation. The outcome is a set of samples within some basin of attraction that are then used in order to identify a few coarse variables via DMAPS. Once the points are mapped to the coarse variables, we run a boundary detection algorithm to identify points at the edges of the dataset. Then, for each boundary point

In implementations of the algorithm, there arise various practical questions that affect the exploration of the attracting manifold, including:

Simulation run time: Though system dependent, simulations should be run until (a) the trajectory enters a region already explored, (b) a new basin is discovered, or (c) a reasonable amount of time has passed for the trajectory to have explored “new ground” within the current basin. These conditions can be tested by detecting whether the trajectory remains within a certain radius for a given amount of time (it has most likely found a potential well) or whether the trajectory has a nontrivial number of nearest neighbors from already explored regions.

Selection of trajectory points: Only “on manifold” points that belong to the basin of attraction should be collected. We implement this by removing a fixed number of points early in the trajectory that correspond to the initial approach to the attracting manifold. Discarding them will have the beneficial effect of preventing the exploration in directions orthogonal to the attractor. The exploration among the remaining points will lead to better sampling of basins and around saddle points within the attracting manifold.

Memory storage of data points: Observe that the samples gathered throughout the exploration process need not be kept in memory and can instead be stored on disk. In principle, the file system or an appropriate database can be used to keep the corresponding files, but if storage space becomes an issue, then it is possible to randomly prune points whenever a (user-specified) maximum number of data points is exceeded. Note that if, between random pruning and preprocessing of the data, distinct patches of explored regions appear, each patch of the manifold must be expanded separately so as not to discard samples that may have reached new metastable states.

We have presented, illustrated and discussed several components of an algorithm for the exploration of effective free energy surfaces. The algorithm links machine learning (manifold learning, in particular, diffusion maps) with established simulation tools (e.g., molecular dynamics). The main idea is to discover, in a data-driven fashion, coordinates that parametrize the intrinsic geometry of the free energy surface and that can help usefully bias the simulation so that it does not revisit already explored basins, but rather extends into new, unexplored regions. Smoothness and low dimensionality of the effective free energy surface are the two main features underpinning the algorithm. Its implementation involves several components (like point-cloud edge detection) that are the subject of current computer science research and has led to the development of certain “twists” in data mining (like the sine-diffusion maps presented here). We believe that such a data-driven approach holds promise for the parsimonious exploration of effective free energy surfaces. The algorithm is (in its current implementation) based on the assumption that the effective free energy surface retains its dimension throughout the computation. The systematic recognition of points at which this dimensionality may change, and the classification of the ways this can occur, are areas of current research that could expand the scope and applicability of this new tool.

The work of Anastasia S. Georgiou, C. William Gear and Ioannis G. Kevrekidis was partially supported by the U.S. National Science Foundation and the U.S. Air Force Office of Scientific Research (F. Darema). Ioannis G. Kevrekidis was also partially supported through DARPA Contract HR0011-16-C0016. Eliodoro Chiavazzo acknowledges partial support of Italian Ministry of Education through the NANO-BRIDGE project (PRIN 2012, grant number 2012LHPSJC). We thank R. Covino and G. Hummer for many fruitful discussions and their collaboration.

Eliodoro Chiavazzo and Ioannis G. Kevrekidis conceived of and designed the illustrative example computations, which were performed mainly by Anastasia S. Georgiou with Eliodoro Chiavazzo’s assistance. C. William Gear provided crucial insights in point cloud edge detection, as well as the development of sine-diffusion map algorithms. Hau-tieng Wu devised the differential geometric scheme for manifold extension. Juan M. Bello-Rivas wrote part of the paper, conducted the ABL1 protein analysis and integrated the remaining material with contributions from Eliodoro Chiavazzo and Ioannis G. Kevrekidis, as well as with input from all other authors. All authors have read and approved the final manuscript.

The authors declare no conflict of interest.


First eigenvalues and the corresponding eigenfunctions (represented by continuous lines) of the operator

Data-driven computation of the right eigenvectors of the random walk Laplacian

Joint density plot of visited points mapped onto the first two diffusion map coordinates,

Two particular conformations from the two local maxima shown in

A trajectory “descends” from its initial condition onto the attracting manifold, the cylinder with radius

At each iteration, the algorithm extends the set of samples in the basin of attraction in order to better explore the underlying manifold and increase the likelihood of exiting the metastable state through one of the boundary points. The point cloud in conformational space is shown on the left, and the corresponding points in Diffusion Map (DMAP) space are displayed on the right. Green points represent the boundary of the so-far explored region. The system is reinitialized from the extended points, shown in magenta in both DMAP and conformational space. (

The goal of reaching the second metastable state is attained here at Step 11.

Cosine-diffusion maps on a 2D strip. (

Sine-diffusion map on a 2D plane. Solving the eigenproblem associated with the Laplace–Beltrami operator with absorbing boundary conditions results in diffusion coordinates with sine-like behavior. (

Extending and lifting using one sine-coordinate and one cosine-coordinate. Geometric harmonics is used as the lifting technique. Blue points represent the original point cloud, while red points depict the newly extended points.

Extended manifolds using local PCA. Points extended into the manifold are a function of the boundary detection algorithm. Blue points represent the original point cloud, while red points depict the newly extended points.

An illustration of

Difference between maximum and minimum values of the azimuthal angle

Iteration | |
---|---|
0 | 0.36 |
1 | 0.55 |
2 | 0.75 |
3 | 0.99 |
4 | 1.21 |
5 | 1.49 |
6 | 1.76 |
7 | 2.00 |
8 | 2.25 |
9 | 2.60 |
10 | 2.95 |
11 | 3.33 |
12 | 3.49 |
13 | 3.98 |