Non-Markov-Type Analysis and Diffusion Map Analysis for Molecular Dynamics Trajectory of Chignolin at a High Temperature

We apply the non-Markov-type analysis of state-to-state transitions to nearly microsecond molecular dynamics (MD) simulation data at a folding temperature of a small artificial protein, chignolin, and we found that the time scales obtained are consistent with our previous result using the weighted ensemble simulations, which is a general path-sampling method to extract the kinetic properties of molecules. Previously, we also applied diffusion map (DM) analysis, which is one of a manifold of learning techniques, to the same trajectory of chignolin in order to cluster the conformational states and found that DM and relaxation mode analysis give similar results for the eigenvectors. In this paper, we divide the same trajectory into shorter pieces and further apply DM to such short-length trajectories to investigate how the obtained eigenvectors are useful to characterize the conformational change of chignolin.


Introduction
The kinetic description of (bio)molecules is inevitable to understand their chemical reactions or conformational change, but it is still difficult to thoroughly understand such transition processes due to the limitations of experimental and computational means. Recently, in particular for numerical simulations of biomolecules, the Markov state model (MSM) [1] has been often employed to analyze the kinetic properties of molecules, such as reaction rates and reaction pathways. A good thing about the MSM is its conceptual simplicity and ease of application. By calculating the so-called "transition matrix" (as described below), we can estimate the rate as an inverse of a mean first passage time (MFPT) between two states of concern. Furthermore, using transition path theory [2], we can also estimate dominant pathways using a committor function [3,4]. However, there are at least three issues in the MSM, which are (1) the "lag time" (this represents a time interval between observations of some quantities in a trajectory) dependence of the result, (2) the state definition dependence of the result, and (3) the effects of the finiteness of the trajectory data. These are why many researchers have been developing new or improved methods for calculating reaction rates and other kinetic properties. One such method is non-Markov-type analysis recently introduced by Zuckerman and coworkers [5,6]. This method is an extension of the conventional MSM, which lifts Problems (1) and (2) above, and the rate estimations can be robust, as shown in [5,6]. However, this type of analysis needs a very long-time trajectory or bunches of trajectories to robustly estimate kinetic properties. To overcome this issue, transition path sampling (TPS) [4] is often employed; however, the original idea of TPS is too demanding, and we need to employ "easier" path sampling methods based on collective variables (CVs), such as partial path sampling [7], forward flux sampling [8], or the weighted ensemble (WE) method [9]. We have applied the WE method to several proteins including chignolin [10][11][12] and estimated the rate constants between two metastable states. Hence, it is interesting to compare the rate constants using different computational methods, which is one of the concerns in this paper.
Another concern is how to choose "optimal" CVs. For biomolecules, CVs are often chosen based on chemical intuitions or traditional ideas, but recently, machine learning or manifold learning techniques have become popular to extract CVs. Historically, principal component analysis (PCA) has been used over the years, but there are several problems in PCA, so many researchers have been devising more advanced approaches such as relaxation mode analysis (RMA) [13][14][15][16][17], time-structured independent component analysis (tICA) [18][19][20], the isomap [21][22][23], the diffusion map (DM) [24][25][26], and many others. It is assumed that the kinetic properties are not so sensitive to the choice of the CVs (as exemplified in the reaction flux formalism [4,27]), but "optimal" CVs should be better for both the convergence of calculations and the interpretations of the results. Previously, we used DM for a long-time trajectory of chignolin at a high temperature (420 K) [10] and found that (1) the first few DM eigenvectors well correlated with eigenvectors calculated from RMA, (2) the second DM eigenvector correlated most with the dihedral angle of glycine in chignolin, and (3) the efficiency to calculate the kinetic properties of chignolin does not seem to depend on whether we choose hydrogen bond distances or DMs as CVs. The trajectory analyzed was long enough (∼750 ns) to sample the whole conformational space at the folding temperature, but it is not always the case when we attack bigger or longer-time scale problems. Hence, it is always important to consider what we can learn from shorter trajectories about the global information of the conformational space. Dividing the same trajectory into shorter pieces, we here investigate the kinetic properties of chignolin from shorter-time perspectives, hoping to connect with the enhanced sampling techniques such as the weighted ensemble method [9]. This paper is organized as follows. In Section 2, we briefly describe the methodologies (non-Markov-type analysis and diffusion map analysis) used here for the investigation of the kinetic properties of a small protein (chignolin). In Section 3, after describing the simulated system, we present numerical results for the kinetic properties of chignolin and discuss the connection with the previous results. In Section 4, we conclude the paper.

Non-Markov-Type Analysis
Recently, Zuckerman and coworkers advocated for a new trajectory analysis method called non-Markov-type analysis [5,6], which is an extension of the conventional MSM [1]. We will briefly summarize it here for completeness.
The basic quantity for the Markov-type analysis is the transition matrix T ij calculated as [1]: where N ij is the counting matrix between states i and j, which is directly enumerated from a given trajectory with a given lag time τ (hereafter, we omit the τ dependence for variables such as N ij and T ij ). It is well known that from this transition matrix, we can calculate the equilibrium population P eq i for each state i as: (2) and the mean first passage time (MFPT) F i f from state i to f as where time is measured in units of the lag time τ. These two relations are most relevant for the practical application of the MSM. The latter relation is proven as follows. We can define the first passage time distribution from state i to j over n steps f where f (n) ij is recursively defined using the following relations: From these distributions, the MFPT is defined as and just the rearrangement of the terms leads to Equation (3).
In the non-Markov-type analysis [5,6], we keep track of which state a trajectory is in until it transits to other states, so there remains a kind of memory in the analysis (this is why we call it non-Markov). For concreteness, we take a three-state model (A, I, B) and construct the transition matrix from state A to B, T A→B , as Here, T µν ij is a conditional transition matrix where the last state is µ(=A) and the next entering state is ν(=A,B), and i, j runs through (1,2,3), which are identified as (A,I,B). From this transition matrix, we can calculate the first passage time distribution and MFPT as in the case of the conventional MSM.
There is a similar method called core-set MSM [28], which is an extension of the milestoning method [29] using the idea of a "core set". We found that the results obtained are similar for the system analyzed, so we here only show the numerical results using the non-Markov-type analysis.

Diffusion Map Analysis
The diffusion map (DM) is a manifold-learning method, which was invented by Coifman and coworkers [24] and since then has been applied to many problems including image classification, speaker classification, and so on. The basic idea is to extract a lowdimensional manifold embedded in a high-dimensional data space, and to this end, we construct a matrix, which will be diagonalized. The construction goes as follows. Given we have time series data or a data ensemble, where the dimension of the data space is M and the number of samples is N, i.e., we have x i ∈ R M (i = 1, · · · , N), we firstly consider the following Gaussian kernel: where || · || is a metric and a normal Cartesian metric is usually employed. is a hyperparameter, which is tuned by some criteria. From this kernel, we next construct the N × N "transition matrix" M ij as follows: This form looks like a transition matrix T ij in the MSM defined above because both N ij and k(x i , x j ) represent a propensity to make a move from state i to j. Another construction starts from defining the following matrix: and in this case, a transition matrix M ij is defined as It is known that regarding this form as a propagator for a density function, the backward Fokker-Planck (FP) equation is obtained in the N → ∞, → 0 limit. However, notice that the time series data analyzed do not necessarily have such a stochastic character that the data are generated by the backward FP equation. Since the eigenvalues and eigenvectors calculated from M ij and M * ij are qualitatively similar, we use the first transition matrix M ij (9) for the numerical analysis of the trajectory data.
By diagonalizing M ij with ∑ j M ij u α (x j ) = λ α u α (x i ), we obtain the eigenfunctions u α (x i ) and eigenvalues λ α . There is the following property that λ 1 = 1 > λ 2 > λ 3 > · · · , and u 1 (x) represents the equilibrium distribution as in the case of the MSM. As the CVs in this paper, we decided to take the second and third DM coordinates (λ t 2 u 2 (x), λ t 3 u 3 (x)) where t is time measured in units of the lag time (for simplicity, we take t = 1 in this paper).

On Chignolin and Simulation Setup
The molecular system we used here is a small peptide, chignolin (PDBID: 1UAO), which is an artificially synthesized peptide [30] with only 10 amino acids (GYDPETGTWG). It is known that this is one of the smallest peptides that has a unique fold (native state) [30], so it can be regarded as a "mini-protein". After its discovery, chignolin has been studied by many researchers with MD simulations and has been used to examine new simulation algorithms and protocols. The free energy landscape using two hydrogen bond (HB) distances was calculated by Terada and coworkers using the multicanonical sampling method [31] and multiscale enhanced sampling method [32], and it was found that there is a misfolded state where the HB configuration is different from that in the native state ( Figure 1). These native and misfolded states were obtained by other researchers [33][34][35].
Note that there are two types of chignolin, the above "original" chignolin [30] and a mutated one called CLN025 (PDBID: 2RVD, 5AWL) with amino acid sequence YYDOPET-GTWY [36]. The dynamics of CLN025, as well as other small peptides and proteins was extensively studied by D.E. Shaw's research group using their Anton hardware [37]. Zuckerman's group used the Anton data to clarify the folding mechanism and folding rate of CLN025 and other peptides at room temperature [5]. The MD simulations of CLN025 showed that CLN025 has a simple two-state folding (native and unfolding) mechanism without a misfolded state. Here, we will examine the "original" chignolin, which has somewhat complicated folding pathways.
Directly related to our study, one of the authors (A.M.) performed an MD simulation of the original chignolin near its folding transition temperature and showed the effectiveness of relaxation mode analysis (RMA) [38], which extracts slow relaxation modes and their associated timescales from simulation data. Historically, RMA was developed to examine the "dynamic" properties of spin systems [13] and homopolymer systems [14,15], but has also been recently applied to biomolecular systems [16,17,38]. (RMA is similar to timestructure-based independent component analysis (tICA) in [18][19][20], but tICA is a special case of RMA with t 0 = 0. From RMA, the concept of slow relaxation is naturally introduced. See the conclusion of [16] for more details about the difference between tICA and RMA.) In [38], the free energy landscape using slow modes obtained by RMA was calculated and an intermediate state was found in addition to the previously found native and misfolded states, as shown in Figure 1. Here, we use the same trajectory data of chignolin in [38], so the setup of the molecular dynamics calculation is the same as well [38]. An MD simulation, augmented by a GPGPU, was performed using the AMBER package (AMBER 11.0) with the ff99SB force field and TIP3P water model. An extended structure was solvated with a 15 Å buffer of TIP3P water around the peptide in each dimensional direction. The numbers of atoms of the peptide and water molecules are 138 and 10,941 (3647 water molecules), respectively. Two potassium ions (Na + ) are included in the system, resulting in a net-neutral system. The total number of atoms in the system is 11,081, and a 750 ns MD production run at 1 atm pressure and a 420 K temperature (near folding temperature) was performed with a time step of 2 fs. The Langevin thermostat with a friction constant γ = 2.0 ps −1 was used for temperature control. For analysis, the coordinates were saved every 10 ps, and the total number of samples was 75,000. The free energy landscape of chignolin along the first and second slowest relaxation mode (RM) directions is shown in Figure 1.

First Passage Time Distributions and Transition Rates
We here evaluate the first passage time distributions (FPTDs) using the non-Markovtype analysis introduced above. From the free energy landscape in Figure 1 (3) and (4), we can calculate the first passage time distribution and MFPT, respectively. For comparison, we also employ a "naive" method to calculate the first passage time distribution as follows: We pick x i , which is classified as state A, and then, search along the time series when it makes a first transition to state B. When it happens at x j , we then calculate the FPT from state A to B as (j − i)τ where τ is the lag time.
We show the numerical results for FPTDs among F, M, and I in Figure 2. Basically, the order of the time scales are ∼10 ns (for detail, see the caption in Figure 2), and there are slight differences between the forward and backward transitions. We also notice that the naive method agrees well with the non-Markov-type analysis, though there are large fluctuations in the naive method. We believe that we need much longer simulations to obtain fully converged results when we use the naive method. On the other hand, when we use the non-Markov analysis, the convergence seems to be faster (shorter simulations give a reasonable result), as shown in Figure 3.
We previously estimated the MFPTs for the conformational change dynamics of chignolin using the weighted ensemble (WE) method [10], and it turned out that the time scales for MFPTs obtained were also ∼10 ns when we assumed a linear regression for the population dynamics. (In the previous paper, we obtained shorter time scales for relaxation using a three-state kinetic model, but such time scales are not directly related to the MFPTs.) Hence, we conclude that the previous WE simulation is consistent with the present analysis.

Correlations between Dihedral Angles of Chignolin and Collective Variables
Previously, we employed the diffusion map (DM) method to extract the collective variables (CVs) of chignolin [10] and discussed the correlation between a dihedral angle of glycine and the DM coordinates and relaxation mode (RM) coordinates. To look into more detail of such correlations, we calculated the Pearson correlation coefficients between several collective variables (second DM coordinate, first RM coordinate, and two hydrogen bond distances between Asp3O and Gly7N named HB1 or between Asp3N and Thr8O named HB2) and 16 dihedral angles (φ, ψ) of chignolin in Figure 4. For the numerical protocols for the DM and RM coordinates, see the previous papers [10,38]. We see that except HB2, the correlations are good, indicating that HB1 is a "good" CV since we know that the second DM and first RM coordinates are good CVs. In addition, the absolute values of the coefficients are the largest at the 12th and 14th angles (except HB2), which are the ψ's of glycine and threonine, indicating the importance of these two residues for conformational change.
The cross-correlations between the DM and RM coordinates are shown below. We see that the correspondence between DM2 and RM1 or DM3 and RM2 is good, but that between DM4 and RM3 is less significant.

Short-Time Diffusion Map Analysis for Chignolin
Finally, we show a different type of analysis using the DM method, that is short-time diffusion map analysis. Clementi et al. [25] and Trstanova et al. [26] have used this type of analysis for different molecular systems, and we here apply this method to chignolin dynamics. The basic idea is simple and trivial: we chop a long-time trajectory into shorter pieces, and apply the DM method to each short piece of the trajectories. As shown in [26], DM coordinates extracted by such short-time DM can approximate the local equilibrium dynamics of the system, and furthermore (more importantly), as shown in [25], the shorttime DM might be able to extract the directions of conformational change, which can be further used for sampling. This kind of idea was recently elucidated by Morishita [40], but his idea is to combine short-time principal component analysis [41] with sampling. Here, we examine chignolin dynamics in terms of short-time DM analysis.
We tested two time intervals to calculate the DM coordinates, which are 7.5 ns and 0.75 ns. Since 10 ps is the time interval to save the trajectory for chignolin, each DM matrix is 750 × 750 and 75 × 75, respectively. In Figure 5, we depict the time courses of the second and third short-time DM coordinates, as well as the conventional DM coordinates (calculated by the full trajectory, but with a time interval of 100 ps) and the glycine dihedral angle. We see if we use a longer time interval (7.5 ns), the behavior of the short-time DM is similar to that of the long-time DM, though the correlations between short-time and long-time DM can interchange between DM2 and DM3 (for example, the time duration between 25 and 30 ns). If we use a shorter time interval (0.75 ns), such a correspondence is less significant, but we can see that at the transitions, the fluctuations of the DM2 and DM3 coordinates become large, indicating the usefulness of the short-time DM to detect rare events. Hence, if we use the short-time DM to extract the tentative CVs for conformational change or rare events, it will be possible to enhance the sampling of the conformational space or the calculations of the kinetic properties using these coordinates.

Concluding Remarks
We analyzed a 750 ns-long molecular dynamics trajectory of chignolin, a small peptide with 10 amino acids, in terms of kinetic properties. There are three particular metastable states in chignolin, folded, misfolded, and intermediate states, and the first passage time distributions between these states were estimated using the non-Markov-type analysis and the naive method. The estimated mean first passage times are ∼10 ns, which is comparable to the time scales calculated by the weighted ensemble method. We also applied the shorttime diffusion map analysis to the same trajectory and found that the DM coordinates calculated from short-time trajectories correlate well with those calculated from a long-time trajectory, and even if we use a short-time interval (0.75 ns), the conformational change or rare events can be detected as the large fluctuations of the DM coordinates.
We mention further issues related to this study: We here used the previous trajectory at a high temperature to accelerate the convergence and computation, but if the temperature decreases, the computation becomes harder because of the existence of high free energy barriers. For lower temperatures or bigger systems, we should use more powerful computational resources such as Anton [37] or accelerated simulation methods such as the weighted ensemble method [9], keeping the kinetic properties of the system intact. One idea is to use short-time DM analysis to extract good CVs, which are further combined with the weighted ensemble method for more efficient sampling of the kinetics.