Einstein Model of a Graph to Characterize Protein Folded/Unfolded States

The folded structures of proteins can be accurately predicted by deep learning algorithms from their amino-acid sequences. By contrast, in spite of decades of research studies, the prediction of folding pathways and the unfolded and misfolded states of proteins, which are intimately related to diseases, remains challenging. A two-state (folded/unfolded) description of protein folding dynamics hides the complexity of the unfolded and misfolded microstates. Here, we focus on the development of simplified order parameters to decipher the complexity of disordered protein structures. First, we show that any connected, undirected, and simple graph can be associated with a linear chain of atoms in thermal equilibrium. This analogy provides an interpretation of the usual topological descriptors of a graph, namely the Kirchhoff index and Randić resistance, in terms of effective force constants of a linear chain. We derive an exact relation between the Kirchhoff index and the average shortest path length for a linear graph and define the free energies of a graph using an Einstein model. Second, we represent the three-dimensional protein structures by connected, undirected, and simple graphs. As a proof of concept, we compute the topological descriptors and the graph free energies for an all-atom molecular dynamics trajectory of folding/unfolding events of the proteins Trp-cage and HP-36 and for the ensemble of experimental NMR models of Trp-cage. The present work shows that the local, nonlocal, and global force constants and free energies of a graph are promising tools to quantify unfolded/disordered protein states and folding/unfolding dynamics. In particular, they allow the detection of transient misfolded rigid states.


Introduction
In spite of significant advances in experimental [1][2][3][4][5][6][7], theoretical [8][9][10][11][12][13][14][15][16][17][18][19], and computational research [20][21][22][23][24][25][26][27][28][29], many questions related to protein folding remain unanswered [19].In particular, a complete understanding of preferred folding pathways and misfolding and protein aggregation, which are related to neurodegenerative diseases, still remains a challenge.So far, none of these problems can be tackled by current deep-learning and other recent successful computational approaches of protein folding [25], as these methods relate two ensembles of end structures, the linear sequences of amino acids (completely unfolded unstable structures) and the protein folded geometries extracted from an experimental database, and have no reliable information on the ensemble of intermediate structures.It is worth noting that the inclusion of sequence databases in these methods only improves the results by a few percent, emphasizing the importance of information arising from the physical laws: the database of equilibrium experimental structures.
According to Anfinsen's principle, the equilibrium protein structure (native state in vivo ) is the global minimum of the protein free energy in vitro [8], i.e., it is governed by the second law of thermodynamics.More precisely, the protein equilibrium structure is defined by the amino-acid sequence and the thermodynamical parameters of the environment (T, p, pH) [9].The protein folding phenomenon is thus cast in thermodynamic terms as a phase transition between an unfolded state and a folded native state.From the Boltzmann law, we know that free energy always involves all microstates.The partition of microstates in folded and unfolded ensembles is at the heart of Landau's order parameter theory for which the free energy is expanded as a function of one macroscopic parameter that varies between the two phases.This two-state view is challenged by the nanoscopic nature of the macromolecules.Unlike a macroscopic system, there is no sharp unique melting temperature of the transition for a protein, as it depends on the physical property (order parameter) measured and its spatial localization [4,17,18].Moreover, protein folding may occur through intermediate states or even via a continuum of states (barrierless folding) [6].A predominant description of protein folding is thus the consideration of an expansion of free energy as a continuous function of order parameter(s): the protein free-energy landscape (FEL) [12].As for glasses, the protein FEL has multiple local minima [14,[30][31][32].It evolves as a function of temperature, as often pictured as a funnel [12].The protein FEL concept is essential to understanding the misfolding and aggregation of these heterogeneous polymers.A challenging problem is to define appropriate order parameters to describe the folded, misfolded, unfolded, and intrinsically disordered ensembles of protein structures.The nonfolded state of proteins is not necessarily random, nor does it resemble a Gaussian chain model, and must be characterized.For example, we showed recently that the α-synuclein monomer, a prototypical intrinsically disordered protein involved in Parkinson's disease, occurs in two distinct disordered states by using an FEL representation based on two order parameters [33].Therefore, there is still a need to develop useful representations (order parameters) of the folding process based on fundamental laws.The present work aims to develop and test order parameters derived from graph theory [34,35] to contribute to the characterization of protein-disordered ensembles.The theoretical concepts developed in the present work will be tested for protein structures extracted from all-atom molecular dynamics (MD) simulations.Small-and medium-size proteins have been successfully folded using physical laws by all-atom MD simulations [22].
To associate a graph with a protein structure, we represent the amino acids by just a set of points (vertices of the protein graph) together with lines (edges of the graph) joining pairs of these points according to some rules (see Section 3).We select the C α atom in the protein structure as a vertex representing each amino acid in the graph.For example, the model protein studied here, TRP-cage, will be represented by a graph of 20 vertices.The graph derived from atom positions in a 3D protein structure is hereby called a protein graph (PG).Unlike 3D models, where each C α atom has a defined position in space and links between the C α represent pseudo bonds, the relative positions of the vertices and of the shape and length of lines representing edges of a 2D representation of the PG have a priori no significance.The selection of a specific representation of a PG in 2D will depend on some additional descriptors which will serve to cluster the vertices in groups to reveal hidden information in structural or dynamical properties.
The applications of graphs and simplified three-dimensional networks to analyze protein structures and functions have been widely developed [36][37][38][39].It was established early that graphs representing protein structures share the characteristics of small-world networks [40][41][42][43].Critical amino acids [44], conserved amino-acids networks [45] in proteins, and signal propagation within the macromolecule were identified by using graphs [42].Network models were applied to study protein flexibility [46,47], protein unfolding [48], and protein folding pathways [49][50][51].The complex network of folding pathways can be represented by a graph where each vertex is a microstate or ensemble of microstates, and the edges represent the transitions between them [49].
Here, we show that two topological descriptors of a PG, the Kirchhoff index and the average shortest path between two vertices, can be used to cluster folded and disordered protein structures.Using a linear chain model and statistical physics, we demonstrate that the Kirchhoff index has the physical meaning of the inverse global force constant of the network, and we introduce the local force constant of a vertex, which can be related to Einstein's seminal model of crystal heat capacity.The free-energy models of the PG are thus defined based on Einstein's hypothesis and normal mode analysis.As a proof of concept, the present order parameters are used to analyze an all-atom molecular dynamics folding/unfolding trajectory and the ensemble of experimental NMR models of the fastfolder Trp-cage protein.To test the robustness of the findings, the analysis of the topological parameters was repeated for an all-atom molecular dynamics folding/unfolding trajectory of the fast-folder HP-36 protein.
This paper is organized as follows.In Section 2.1, we present the theory based on the analogy between a PG and a 1D chain with harmonic spring force constants.An analytical relation between the Kirchhoff index and the average shortest path length of a graph is derived for a fully unfolded protein.The definitions of the local, nonlocal, and global force constants of a PG and their relation with the Randić resistance of a graph, as well as the definition of the free energies of a PG, are given.In Section 2.2, numerical results are presented for the MD trajectory of Trp-cage.The results for HP-36 are presented in the Section S2 of the Supplementary Materials, as they are similar to those presented in the main text for Trp-cage protein.Technical details on the numerical construction of the PG and on the MD are reported in Section 3.This paper concludes with Section 4.

Results and Discussion
2.1.Theory 2.1.1.Mechanical Interpretation of a Simple, Connected, and Undirected Graph Here, we introduce the topological equivalence between a simple, connected, and undirected graph and a linear chain of atoms with interatomic harmonic potentials.
First, we consider the Hamilonian of the linear chain with n atoms in the harmonic approximation: where φ is the force constant matrix (Hessian), u i and u j are the displacements of the ith and jth atoms of the chain, and 0 is the ground-state energy.Because the chain is linear, the displacements are scalar numbers, which can take positive or negative values.By construction, the chain is connected: there is no atom not linked to another.Newton's equations of motion are: where m i is the mass of atom i.The force f i is conservative: From Equation (3), one finds: because for a rigid translation, i.e., u i = U (∀i), all the forces must be zero.Second, we consider a graph G = (V, E) with vertex set Vand edge set E. The number of vertices in V is n.We assume the graph to be undirected, simple, and connected.A pair of vertices v i and v j has edge weight w ij , which is defined to be 0 if there is no edge between v i and v j .Because the graph is assumed undirected, w ij = w ji .The adjacency matrix A has elements: The degree d i of the ith vertex is the sum of the weights of all edges having v i as one end: The Laplacian L of the graph is defined as usual by: There is a complete equivalence between L and φ if we interpret the nondiagonal elements of the force constant matrix as proportional to minus the weights of the edges of a graph connecting the atoms, i.e., where c is an arbitrary force constant, which ensures the proper physical dimension of φ.Therefore, we have: For a particular case of an unweighted graph for which A is a binary matrix (all nonzero weights are equal to 1), the associated linear chain has atoms connected with the same spring force constant c.For the particular case of the PG, A is binary.The vertices of the PG represent all the C α atoms of the protein.An edge within the PG represents a contact between two C α atoms, i.e., they are at a distance in the 3D protein structure shorter than a cut-off radius (see Section 3 for the construction of the PG).The PG Laplacian is thus equivalent to the force constant matrix of a linear chain where the atoms are connected (according to A) by the same harmonic spring strength c.
The spectral properties of L are equivalent (to a constant factor) to the spectral properties of the Hessian (Equation ( 9)).For a chain of atoms having the same mass, i.e., (∀i) : m i = m, the eigenvalues of L are related to the vibrational modes of the chain.Indeed, assuming a harmonic solution at the frequency ω for the displacement of the ith atom, i.e., u i (t) = u i (0)e iωt , then üi (t) = (−ω 2 )u i (0)e iωt .Using this in Equation (3), we find the following usual eigenvalue equation: The diagonalization of B gives the frequencies ω l and eigenvectors e l of the vibrational modes of the chain: where e l (i) is the component of the eigenvector of the ith atom in the lth vibrational mode.We sort the modes by increasing frequency ω l+1 > ω l .The mode 1 corresponds to a translation with ω 1 = 0. From Equations ( 9) and (10), the eigenvectors of L are identical to those of B, and thus the spectral decomposition of L is: λ l e l (i)e l (j) (12) with eigenvalues given by: with Ω 2 = c/m.One has λ 1 = 0 as expected for a connected graph.The lowest nonzero eigenvalue λ 2 is named the algebraic connectivity, and the eigenvector e 2 is the Fiedler vector, which can be used to partition the graph.The corresponding vibrational mode 2 of the atomic chain is the mode with the largest wavelength, i.e., the components of its eigenvector are those that fluctuate less along the chain, i.e., less varying among the vertices in a graph.

Thermostatistical Interpretation of Topological Descriptors of a Simple, Connected, and Undirected Graph
We define three descriptors of the thermal fluctuations of a linear chain of atoms with harmonic interatomic potentials and show their equivalence with topological descriptors of a simple, connected, and undirected graph.
The most general solution of the equations of motions is a linear combination of the eigenmodes: where Q l (t) are the weights of the modes, the so-called normal coordinates.Using Equations ( 2) and (3), we have as usual for ∀l = 1: with the solution Q l (t) = Q l (0)cos(ω l t) and where µ l is an arbitrary effective mass.For l = 1, Q l = constant, and ω 1 = 0.In the microcanonical ensemble (n, V, E) for which the energy E is constant, i.e., Q l (0), it is easy to show that where . . .= lim τ→∞ 1 τ τ 0 . . . is a time average.In the canonical ensemble (n, V, T), the energy of the microstates are , with P l being the momentum associated with mode l.Using the equipartition theorem for the chain in thermal equilibrium, we have ∀l = 1: where . . .T is the average over all microstates in the canonical ensemble, T is the absolute temperature, and k B is the Boltzmann constant.Because the normal coordinates are statistically independent in the harmonic approximation (as stated by Equation ( 15), there is no coupling between the modes), thus we have: and finally, from Equation ( 14), where we have introduced an effective local force constant (local stiffness) ki .Equation ( 19) has a clear physical meaning: it represents the statistical fluctuations of the displacement of the ith atom in a local harmonic potential with a curvature ki .Summing the atom thermal fluctuations of the entire chain defines an effective global force constant (global stiffness) K: Equation ( 20) represents the entire chain as fluctuating in a harmonic ground-state potential of curvature K.A third effective nonlocal force constant (nonlocal stiffness) can be defined for the thermal fluctuations of a pair of atoms relative to each other: Equation ( 21) represents the fluctuations of each pair of atoms ij as if they were in a harmonic potential with curvature Kij .The global force constant can also be defined and measured for actual protein structures, where it is related to the dynamical transition of proteins [52].
The relation between the force constants and the topological descriptors of the corresponding graph is deduced from Equations ( 19) to ( 21) by using m i = m ∀i, the normalization of eigenvectors ∑ n i=1 |e l (i)| 2 = 1, and Equation ( 13) for the Laplacian eigenvalues: where γ ≡ k B T c and with the dimensionless local (k i ), global (K), and nonlocal (K ij ) force constants of the graph defined by The three force constants are related to each other through three sum rules obtained by using the normalization condition The formulation of the topological descriptors in the context of a chain in thermal equilibrium provides an interesting physical interpretation of the known topological descriptors of a graph as follows.The robustness of a graph is an important concept to measure the quality of a physical network represented by a graph, as in for example a telecommunication system.One measure of the robustness is related to the effective (Randić) resistances of a graph [53].For a pair of vertices (i, j), this quantity, noted Ω ij , is defined through the Moore-Penrose inverse of the Laplacian, L −1 : From the spectral decomposition of the Moore-Penrose inverse of the Laplacian, i.e., L −1 = ∑ n i=1 e l (i)e l (j)/λ l , one immediately find that the nonlocal force constant K ij and the Randić resistance Ω ij are simply inversely related: Therefore, all the other force constants of a graph can be formulated in terms of Ω ij because of the sum rules (Equations ( 28)-( 30)).As shown in Randić's seminal paper, if a connected graph is associated with an electric network with resistances equal to the inverse of the weight w ij between two nodes, Ω ij represents the effective resistance between the nodes i and j if a voltage difference is applied between these two nodes.By analogy, if the connected graph is associated with a linear chain with interatomic force constants equal to the weights w ij (normalized by a constant c), K ij represents the effective force constant between the atoms i and j if a couple of forces are applied to this pair of nodes, as explicitly demonstrated in Section S4 of the Supplementary Materials.For a linear chain, the Randić resistance Ω ij of an atom pair is also exactly its compliance C ij [54] and equal to 1/K ij .For a three-dimensional elastic network, the compliance is a 3 × 3 tensor representing the elastic response of an atom pair to a couple of forces.A scalar compliance of a pair of nodes of a three-dimensional elastic network, similar to the Randić resistance or nonlocal force constant, can be computed by applying a couple of forces to the atoms in the direction of the vector that joins them [54].As demonstrated in Section S4 of the Supplementary Materials, this scalar compliance [54] can be related analytically to the tensorial Randić resistance of the atom pair (equal to the inverse of the nonlocal force constant matrix).
Another usual measure of the robustness of a graph is the Kirchhoff index of a graph, K f , defined by the sum of the eigenvalues of the Moore-Penrose inverse of the Laplacian and is simply the inverse of the global force constant: The K f is proportional to the average Randić resistance of the graph.Indeed, from Equation (28), For a linear chain, Ω ij is also its average compliance.
In the present thermal statistical interpretation of the topological descriptors, each descriptor has the same physical meaning as the stiffness of a specific harmonic potential.

Relation between the Global Force Constant and the Average Shortest Path Length: Analytical Results
The path length between two vertices is defined as the sum of the weights of edges constituting the path.For a binary adjacency matrix, the length of a path between two vertices is the number of edges of the path connecting them.For a path α(i, j) between the vertices i and j, the length l α(i,j) is φ rs (34) where (r, s) is an edge of the path α(i, j).The shortest path length between two vertices i and j is an important topological descriptor.We use the notation l 0 ij : The average over all shortest path lengths between all pairs of vertices of the graph, l 0 , is another well-studied topological descriptor of the graph robustness and is defined by In Equation (36), the double sum is the so-called graph Wiener index.
As both the Kirchhoff index and the average shortest path length describe the robustness of a network in the literature, a natural question to ask is if and how they are related.An analytical answer can be found for a binary adjacency matrix A of a graph for which the first (i = 1) and last (i = n) vertices have degree 1, and all the others have degree 2. This graph corresponds to a linear chain with spring force constants between nearest-neighbor atoms only, and the PG is one of a completely unfolded protein (straight polypeptide).For this graph, the average over all shortest path lengths is simply: From Equation (37) and simple but tedious algebra (see Section S1 of the Supplementary Materials), one finds The spectral properties of the Laplacian of this graph can be found analytically by analogy with the corresponding linear chain of atoms where φ ij = −c between nearestneighbor atoms only.The vibrational frequencies and eigenvectors of such a chain are well known [55] from which we find the eigenvalues of the Laplacian: with l = 1, 2, 3, . . .n.Using Equations ( 26) and (39), the global force constant for this graph is given by 1 where the last equality is found as follows.Using 2sin 2 lπ 2n = 1 − cos lπ n , we have We observe that x l ≡ cos lπ n are the roots of the derivative of the Chebyshev polynomial of the first kind T n (x) = cos(nθ) with x = cos(θ).The derivative of T n (x) is a polynomial of degree n − 1 : [56,57].Using the chain rule for the derivatives, , where U n (x) is the Chebyshev polynomial of the second kind [57].Then, the sums in the right-hand-side of Equation ( 41) can be found by using the general rule where x l are the roots of P n−1 .From Equation ( 42), one has Using l'Hôspital's rule, one easily finds P(1) = n 2 , P (1) = n 4 −n 2 3 .Inserting these expressions into Equation ( 43) leads to the announced result in Equation (40).Combining Equations ( 38) and ( 40), we derive the analytical relation between K and l 0 for this particular graph: which shows an inverse relation between the global force constant (1/Kirchhoff index) and the average of the shortest path length.Equation ( 44) defines the lowest possible value in the diagram (K, l) of a PG.Indeed, for any PG, there is always a path from vertex i to vertex j with a length equal to |i − j| because the corresponding C α atoms form pseudo-bonds at a distance smaller than the cut-off radius defining a contact (see Section 3).Any contact between the C α of amino acids not adjacent in the protein sequence in the 3D protein structure will add an edge that either does not change the length |i − j| or reduces the length |i − j|.Therefore, the average shortest length given by Equation (37) is the largest possible among all PGs having the same number of vertices (n amino acids).Consequently, the smallest value of K is given by Equation (40) and is proportional to n −2 for large n.
It is worth noting that the graph having the smallest average shortest path length is a complete graph where all vertices are related to all the others by a single edge for which l 0 = 1.This graph is unrealistic for a macromolecule.The eigenvalues of the Laplacian of a complete graph are well known: λ 1 = 0 and λ i = n ∀i = 1.Then, we have for such an extreme case: For a large n, the K of a complete graph converges to its minimum mathematical value of 1.

Einstein's Model of a Graph
We build an Einstein model [55] of a simple, connected, and undirected graph by using the mechanical analogy described in Section 2.1.2.The Einstein model is applied here to protein structures recorded every picosecond.On this short timescale, each structure can be considered as fluctuating harmonically on a frozen energy landscape both in the folded and unfolded states.Experimentally, the fluctuations of the vibrational modes of a protein as a function of time can be measured by single-molecule Raman spectroscopy [58].Following the Einstein model hypothesis, one assumes that each atom i (for i = 1, . . .n) of the linear chain with identical masses and force constants (c) has a position fluctuating in a local harmonic potential with a local force constant ki (Equation ( 19)) with a local frequency ω2 i ≡ ( ki /m).Unlike the original Einstein model, one assumes a frequency difference for each atom.The energy of each atom is given by Êi where q is the number of vibrational quanta, and Êoi is the potential energy minimum of atom i.Assuming thermal equilibrium at a temperature T in the (NVT) ensemble, the atom internal energy is [55] Ûi with ẑi ≡ h ωi k B T .The atom entropy is The classical limit of enthalpy and entropy are found at high temperatures by expanding the exponential around ẑi 1: The classical limit of the free energy is simply The constant ki (local frequency ωi ) is associated with a conformation of the chain, i.e., a PG built from the three-dimensional structure of the protein.Assuming some reference conformation of the chain with a free energy Fi (0) corresponding to a local force constant ki (0) and frequency ωi (0), the free-energy difference ∆ Fi = Fi − Fi (0) in the classical limit is lim ẑi 1 where the first term is the difference between potential energy minima Equation ( 52) can be simplified where ∆ Ŝi is the entropy variation, which is the only term depending on the local force constants.
We further make the hypothesis that each atom oscillates independently (as in the Einstein model).Therefore, for n amino acids, we have, Thus, from Equation (55), we define the free energy ∆F local of a PG by where the dimensionless parameter < 0 is unknown and controls the enthalpic (potential energy) contribution to the graph free energy.The reference conformation can be the graph examined in the previous section for which K is given by Equation ( 40) (completely un-folded chain) or any other reference, for example, the PG built from the native experimental structure of the protein as in the numerical applications of Section 2.2.Another formula for the free energy of a graph, ∆F nonlocal , can be built similarly: and finally a coarse-grained expression, ∆F global , is defined: An important property of the local and nonlocal dimensionless graph free energies of a PG is that the entropic contribution is dominated by the smallest force constants of the graph.For a PG, ∆F is of course ∆ F/k B T, but Equations ( 56)-( 58) can be applied to any graph where the temperature has no meaning, such as, for example, a communication network.
An alternative to the Einstein model is the graph free energy built from the collective modes of the chain (Equation ( 15)).Each mode is associated with a collective frequency ω l (Equation ( 11)).According to Equation ( 51), the collective graph free energy is thus defined as: where λ l (0) are the eigenvalues of the Laplacian of the reference conformation (Equation ( 13)).
For the PG of a completely unfolded chain, they are given by Equation ( 39).If we neglect the degree term, given an ensemble of graphs with the same number of vertices, the one that has the lowest free energy is the one with the smallest product of the eigenvalues of its Laplacian.

Two-State Definition
We evaluate and investigate the application of graph force constants and free energies presented in Section 2.1 to folding/unfolding.As a proof of concept, we present here numerical applications for one MD trajectory of the mini-protein: Trp-cage [59].The Trpcage is a well-known toy model to study protein folding.This 20-amino-acid peptide is a C-terminal fragment of exendin-4.This construct folds within 4 microseconds in water at physiological pH and exhibits a tightly folded tertiary structure in solution.It consists of a short helix, a 3/10 helix, and a C-terminal poly-proline that packs against a Trp in the alpha helix [59].The MD trajectory is 500 ns in duration and consists of snapshots calculated on every picosecond when the temperature is 380 K.More details of the MD trajectory are given in Section 3. The strategy is to build the PG of each snapshot and compute the parameters K and < l 0 >.In this way, we capture the topological information of the protein structures during the folding/unfolding dynamics.
A two-state (folded/unfolded) description of protein folding dynamics hides the complexity of unfolded and misfolded microstates [18].To decipher the complexity behind these two macrostates, we need first to define them.Many usual global order parameters can be used to partition protein structures in folded and unfolded ensembles.Here, we use the fraction of the native contacts ξ(t) computed for each snapshot at time t in the MD trajectories (see Section 3).At time t = 0 by construction, ξ(0) = 1 and fluctuates below 1 at 380 K (above the unfolding temperature) in the MD trajectory of Trp-cage, as shown in Figure 1.From this figure, we divide the snapshots into a folded state ξ ≥ 0.6 and an unfolded state ξ < 0.6.Based on this criterion, we identify an interesting region 100 ns < t < 400 ns where the behavior of descriptors obtained from graph representations can be studied.It contains a folding transition in the first half and an unfolding transition in the next half.It is important to note here that for a protein to function, apart from the kinetic criterion of it folding to its native structure, it should also populate its native state for a significant fraction of time which can be mentioned as the thermodynamic criterion.Hence, even if we can observe more instances where the fraction of native contacts is above 0.6, the above-mentioned time interval becomes the most important since it pushes the structure to situations where the thermodynamic criterion is favored.

Force Constants and Shortest Path Length
First, we computed the global force constant K of the PG as a function of time, as shown in Figure 2. A visual inspection of the curves shows that the global force constant is somehow related to the fraction of native contacts, but as K is more fluctuating than ξ, the Pearson correlation coefficient is not large: 0.4684.As intuitively expected, the time average value of the global force constant of folded structures (ξ ≥ 0.6) <K f olded > = 0.0882 is significantly larger than its value for unfolded structures (ξ < 0.6), <K un f olded > = 0.0631.According to Equation ( 40), the smallest possible value for Trp-cage is K = 0.0150, and the maximum hypothetical value is 1.0526 (Equation ( 45)).From Figure 2, the minimum and maximum values observed in the MD trajectory are K = 0.0150 and K = 0.1940, respectively.Although the folded protein is expected to be more rigid than an unfolded polymer chain, disordered or misfolded structures are also expected to be rigid.For example, in the time window 201 ns-208 ns, structures with ξ ≈ 0.4-0.5 have K ≈ 0.12 much larger than <K f olded > and twice the value of <K un f olded >.Thus, the descriptor K contains more information on the unfolded state than the global ξ order parameter.The two-dimensional probability density function (PDF) of the (ξ, K) values computed from the trajectory is represented in Figure 3a and revealed the existence of two unfolded substates at (ξ ≈ 0.3, K ≈ 0.062) and (ξ ≈ 0.4, K ≈ 0.04) and two folded substates at (ξ ≈ 0.8, K ≈ 0.052) and (ξ ≈ 0.8, K ≈ 0.100).
The time variation of l 0 is shown in Figure 4, where it is compared to K. The minimum and maximum values observed in the MD trajectory of l 0 are 2.0263 and 7.0, respectively.The maximum observed value corresponds to a completely unfolded chain, as predicted by Equation (38).The means of l 0 computed for folded (ξ ≥ 0.6) and unfolded structures (ξ < 0.6) are 2.8465 and 3.2751, respectively.As expected, the paths in the folded PG are shorter on average.As for K, l 0 is not significantly correlated with the nativeness characterized by ξ (shown in Figure 1), as the Pearson correlation coefficient is only −0.3920.The variations of l 0 thus provide additional information on the different protein substates, as shown by the three local minima of the (ξ, l 0 ) PDF in Figure 3b.  Figure 4 clearly shows that l 0 ∝ 1/K as for a completely unfolded polymer chain (Equation ( 44)) (Pearson correlation coefficient is −0.8398).However, except for extremal values of l 0 , a given average shortest path length corresponds to a range of values for K, as can be seen from Figure 5a.This can be explained because an intermediate protein size corresponds to a large number of possible conformations with different K values.For example, we show three selected structures s1, s2, and s3 (named by increasing K value) with the same value l 0 = 3 in Figures 5c-e, respectively.They correspond to graphs with different robustness.In particular, the structures s1 and s2 have N-term and C-term which remain flexible, unlike the s3 structure.The nonuniqueness of the relation between K and l 0 explains why the PDF of the (ξ, l 0 ) values computed from the trajectory (Figure 3b) shows only one substate in the unfolded region, whereas the PDF of (ξ, K) (Figure 3a) has two substates.A striking property of the (K, l 0 ) plot is that the ensemble of points draws nearly continuous lower and upper limits.These maximum and minimum values must be degenerated for l 0 = 7 which corresponds to a completely unfolded chain of n amino acids according to Equation (38).Indeed, the value of K predicted by Equation (44), shown by the black dot with label 1 in Figure 5a, agrees with the MD result.Although Equation (44) was derived for an unfolded chain, we applied it to predict a value of K for each value of l 0 observed in the MD trajectory.Surprisingly, it predicts nearly perfectly the upper limit of K for all the values of l 0 , as shown in Figure 5a.This unexpected result seems at first glance in contradiction with the fact that for a chain of length n, Equation (44) predicts the absolute possible minimum value of K as explained in Section 2.1.3.
This apparent contradiction is explained as follows.At each value of l 0 of the PG of Trp-cage (with n = 20 amino acids), we can associate the PG of a completely unfolded shorter protein chain with n < 20 amino acids.For example, the value l 0 = 3.66 is the average shortest path length of the PG of an unfolded chain with n = 10 vertices according to Equation (38).This unfolded shorter chain can be built from the unfolded chain of n = 20 amino acids by eliminating every other amino acid and by connecting the remaining ones by first nearest-neighbor contacts.Therefore, a good approximation of this shorter chain (n = 10) by the PG of Trp-cage (n = 20) is a structure having contacts only between second nearest-neighbor C α atoms in addition to contacts representing the peptide bonds between the first nearest-neighbor atoms.We name this model (20,2).The value of K of (20, 2) should be close to the minimum value of K for a chain with n = 10 amino acids predicted by Equation (44).The value of K of the (20, 2) chain is shown by the black dot 2 in the (K, l 0 ) plot and is indeed very close to the analytical prediction.We have built a series of models of completely unfolded chains (20, j) with contacts only between the third (j = 3), fourth (j = 4), fifth (j = 5), etc., nearest neighbors represented by the black dots numbered, 3, 4, 5, ..., respectively.These points follow the predictions of Equation ( 44) perfectly confirming the reasoning.It can also be seen in Figure 5d that the structure s3, close to the upper limit for the value l 0 = 3, corresponds approximately to a three-dimensional structure having third-neighbor contacts only.From a topological point of view, the PG of s3 is equivalent to the PG of the (20, 3) structure having a value of K close to a chain with n = 8.This reasoning explains the predictions of Equation ( 44) but not why s3 is an upper limit for a chain of n = 20 for that value of l 0 = 3.This can be understood qualitatively because a PG where each vertex is connected similarly as in (20, j) structures corresponds to a PG where there is no vertex with a low degree, i.e., no weak local force constant which would significantly lower K as stated by Equation (30).On the opposite end, as we can see in Figure 5c, the s1 structure with a low K has end amino acids connected with only peptide bonds and thus has low local force constants.Although we can figure out the reason for the lower bound in the (K, l 0 ), at the time of writing, we have not found an analytical formula to predict it.
It is worth comparing the (K, l 0 ) plot extracted for the MD trajectory to the one computed from the 38 experimental NMR models of Trp-cage (PDB code: 1L2Y), as shown in Figure 5b.Surprisingly, the NMR data reveal two distinct groups separated by a gap along the axis l 0 .The first and second groups are in the regions 2.5 < l 0 < 2.75 and 3.05 < l 0 < 3.38, respectively.The first group corresponds to more robust structures with K ≈ 0.10-0.12,whereas the second group has softer structures with K ≈ 0.06-0.07.The native NMR structure used as a reference in the present work (marked t 0 for which ξ = 1 and K = 0.0632) is in the second group.Averaging the values of K and of l 0 of the NMR models gives 0.075 and 3.06, respectively.This is in good agreement with the average values of these quantities computed for folded snapshots (ξ ≥ 0.6 relative to the model chosen at t 0 ), which are, respectively, 0.0882 and 3.2751, as mentioned above.The existence of two substates in the native state of Trp-cage was discussed above and are visible in the PDF (K, ξ) (Figure 3a) with a major substate identified as the softest second experimental group and a minor state as the first hardest one.The PDF ( l 0 , ξ) (Figure 3b) also shows the two groups but not with the correct weight as many unfolded structures populated the region of the softer group.Indeed, we recall that the average value l 0 computed from unfolded snapshots (ξ < 0.6 relative to the model chosen at t 0 ) is 2.8465.
The topological descriptors K and l 0 are global properties of the different protein microstates represented by PGs.A more detailed topological description of these microstates is the sequence of their local force constants k i .To illustrate how these sequences vary in the two folding/unfolding transitions (defined here by crossing the limit ξ = 0.6 in Figure 1), we selected four representative snapshots in the MD trajectory at t 1 = 150 ns (ξ = 0.3333, K = 0.0422), t 2 = 230 ns (ξ = 0.8333, K = 0.0967), t 3 = 300 ns (ξ = 0.9167, K = 0.1128), and t 4 = 400 ns (ξ = 0.3333, K = 0.0380), as indicated in Figure 4.The snapshots at different times are shown in Figure 5f-k.As we can see in Figure 5b, both structures in the folded state at t 2 and t 3 are in the first experimental group (hardest structures).We also selected an unfolded structure at t 5 = 204 ns (ξ = 0.4583) corresponding to a snapshot with high rigidity, i.e., K = 0.1236.
A representation of sequences of k i is shown in Figure 6.The sequence of k i of the folded structures at times t 2 and t 3 are similar.All k i at these times are larger or equal to the values of k i at t 0 (reference native state).However, this is not sufficient to explain why the native structure has a global force constant nearly twice as small as these two.In fact, the very low k i of PRO18, PRO19, and SER20 at t 0 decrease K significantly (and thus increase the entropy) more for the native structure than for the structures at t 2 and t 3 .The sums of inverse k i (Equation ( 30)) from ASN1 to PRO17 at times t 0 , t 2 , t 3 give a value of K equal to 0.1267 [0.0632], 0.1518 [0.0967] and 0.1727 [0.1128] to compare with the values for the complete chain recalled in brackets.The unfolded structures at t 1 and t 4 have nearly all their k i smaller than the ones at t 0 , but their low K global force constant is mainly due to the very low values of k i at the N-term and C-term regions.Indeed, calculations of the sums of inverse k i of only ASN1, LEU2, TYR3, PRO17, PRO18, PRO19, and SER20 at times t 1 and t 4 give values of K equal to 0.0620 [0.0422] and 0.0547 [0.0380] which are relatively close to the K values of the complete chain recalled in brackets.Low k i at times t 0 to t 4 are due to vertices with a low degree, as shown in the representations of the snapshots at different times in Figure 5. On the contrary, the structure at t 5 has no small k i in the N-term and C-term regions, which explains the strong rigidity of this unfolded state.As can be seen in the representation in Figure 5k, the PG of this snapshot has no vertex with a low degree.In addition, this PG is close to the model structure (20, 5) (Figure 3b) and indeed has long-distance contacts.The contributions of residues at the C-term region (PRO17, PRO18, PRO19, and SER20) to K explain the large difference of rigidity between the structures at t 0 and t 5 .Indeed, the calculation of K for ASN1 to ARG16 for t 0 and t 5 give similar global force constants: 0.1475 [0.0632] and 0.1481 [0.1236], respectively (values for the complete chain are in brackets).Metastable states competing with the native structure can be related to residual frustration.Frustration in condensed matter physics means that the system cannot simultaneously minimize the competing interactions between its different parts [60].Proteins evolved in order to minimize frustration, which shapes a funnel free-energy landscape [60].Although the study of the relations between residual frustration and topological descriptors (global, local, and nonlocal force constants) is far beyond the scope of the present study, we can make some qualitative observations.We computed the local residual frustration configurational index of each amino acid for the native structure of Trp-cage (PDB ID: 1L2Y, model 1) using the protein frustratometer 2 program [61].The program predicts that amino acids in the N-term (from 1 to 9) are about 20% highly frustrated.This might be qualitatively related to the folded states at t 2 and t 3 for which contact between the N-term and C-term stabilizes these non-native folded configurations, as can be seen in Figure 5e,f, respectively.The difference between the sequence of k i of these two configurations with the one of the native structure is also larger in the N-term, as shown in Figure 6.

Calculation of Free Energies Using the Einstein Model
In the graph free-energy formula derived from the Einstein model, the force constant term is purely entropic (Equation ( 54)).This contribution is parameter free.The enthalpic part (first term in Equation ( 56)) depends on an energy scale defined by the single parameter .As we do not have information on , we treat it here as a variable.First, we compare the entropic contribution (i.e., for = 0) of the local (Equation ( 56)), nonlocal (Equation ( 57)), global (Equation ( 58)), and collective (Equation ( 59)) models of the graph free energy in Figure 7a.The local, nonlocal, and collective models agree remarkably with each other with only a change in scale.The coarse-grained global model has the smallest scale and is also very similar to the other models, with a high Pearson correlation coefficient of 0.97 compared to the local model for example.In all models, the entropy change is positive in the folded parts of the MD trajectory, as expected, since the folding reduces possible structural fluctuations.In unfolded parts of the trajectory, the entropy change is mostly negative as expected.There are a few exceptions, for example, times around t 5 .The time parts with positive entropy indicate unfolded very rigid structures.The calculation of the entropic term is thus a means to identify misfolded structures in the trajectory.In Figure 7b, we represent an enthalpic term for different values of .This term is positive in the unfolded parts of the trajectory, as expected, as the unfolded structures have vertices with a lower degree (fewer contacts), with the structures around t 5 being an exception.The enthalpic term is small in the folded parts, which indicates that folded structures are on average as connected as the reference structure at t 0 .The enthalpic term is only roughly anticorrelated with the entropic term (the Pearson correlation coefficient between the two terms for the local model is −0.31).We observe structures with a positive entropy (rigid) but with fewer contacts than in the reference folded structure at t 0 (such as, for example, in the region 80-90 ns).The examination of the enthalpic and entropic parts of the free-energy models permits one to characterize the different rigid misfolded structures.The addition of the two terms is represented for a value of = −5 in Figure 7c.With this value of , the structures in time ranges where the folded structure is stable on average (marked red in Figure 1) have zero or negative free energies.The metastable rigid structure at time t 5 also has negative free energy, whereas most of the unfolded structures have positive free energy.

Contacts and Protein Graph (PG)
Although a PG might be built from the all-atom protein structure, we focus here on a coarse-grained representation of the protein main chain, which only has proven to be useful in describing protein folding [18].Namely, we represent the protein's three-dimensional structure by the sole positions of its C α atoms.Each vertex of the PG thus represents the C α atom of an amino acid, and the vertices are ordered as in the amino-acid sequences from i = 1 to n, where n is the number of amino acids.An edge between two vertices is drawn if the distance between the two C α atoms is a contact.A contact is defined as usual for two C α atoms belonging to nonadjacent amino acids in the protein sequence and which are at a distance in the three-dimensional protein structure below a cut-off radius R = 0.6 nm.This typical value includes the peak of the first nearest neighbors of the C α atoms in folded protein structures.In the present work, a PG is always connected because we add an edge between two C α atoms, which are nearest neighbors in the amino-acid sequence.These additional edges represent the peptide bonds.The PG is simple, i.e., there is no edge connecting a single vertex (graph loop) or multiple edges between two vertices.We do not make any distinction between the different edges and assume their weight is equal to 1. PG with no contact corresponds to the straight unfolded chain examined in Section 2.1.3and has the minimum number of edges, i.e., n − 1.We define also as usual the native contacts as the contacts present in the experimental folded structure (PDB ID: 1L2Y, model 1).Say nc native (t), the number of native contacts in the structure of the snapshot at time t in the MD trajectory, then we define the fraction of native contacts ξ(t) as follows: where nc * native corresponds to the number of contacts in the experimental native structure.In the MD trajectories studied here, it is also equal to nc native (t = 0) because the initial structure is the experimental one (see Section 3.2).We consider the fraction of native contacts at time t to obtain a measure of the structure's nativeness as a function of time (see text Section 2.2).

Molecular Dynamics Trajectories
The MD trajectory of Trp-cage was generated in a previous unrelated work using an all-atom force field in explicit water at 380 K (above the folding transition temperature) [62].This MD trajectory was chosen as it clearly shows folding/unfolding events.The MD trajectory is 500 ns in duration and consists of snapshots stored every picosecond (500,000 structures/protein).The initial structure at time t = 0 in the MD trajectory is an experimental native structure (PDB ID: 1L2Y, model 1).More details on the MD trajectory can be found in the original paper [62].

Statistics
All statistical calculations (averages, probability densities, Pearson correlation coefficients) were computed from raw data (not from moving average data).The number of bins for both axes in the PDF calculations is 25 for Figure 3 and 100 for Figure 5.The average shortest path length between two vertices was computed with the aver-age_shortest_path_length function of the NetworkX Python library [63].

Figure 1 .
Figure 1.MD trajectory of Trp-cage at 380 K. Time t in red (∀t) : ξ(t) > 0.6.The yellow curve is computed for a moving mean with a window size of 1 ns.

Figure 2 .
Figure 2. Evolution of the global force constant K for the MD trajectory shown in Figure 1.The bold green curve is computed for a moving mean with a window size of 1 ns.

Figure 3 .
Figure 3. Panels (a,b) represent respectively the PDF of (ξ, K) values and (ξ, l 0 ) computed from the trajectory shown in Figure 1.

Figure 4 .
Figure 4. Comparison between the average shortest path length (blue) and global force constant (green) for the MD trajectory shown in Figure1.The bold green curve is computed for a moving mean with a window size of 1 ns.Times t 0 , t 1 , t 2 , t 3 , t 4 , and t 5 discussed in the text are indicated.

Figure 5 .
Figure 5. Relationship between K and l computed for the MD trajectory in Figure 1.Panel (a) PDF of (K, l 0 ) (blue dots) and pairs of values (K, l 0 ) for three selected snapshots named s 1 (green dot), s 2 (pink dot), and s 3 (red dot) with the same value of l 0 as discussed in the text.Red line is the result of application of Equation (44).Black dots are the results of model chains with regular long distance spring force constants of different lengths named (20, j = 1, 2, 3...) in the main text.Panel (b) PDF of (K, l 0 ) from all snapshots with ξ > 0.6 (blue).Red line and black dots are as in Panel (a).Orange dots are the (K, l 0 ) values of the experimental NMR models of Trp-cage (PDB ID: 12lY).Colors dots correspond to the values computed for the snapshots at times t 0 to t 5 indicated at Figure 4. Panels (c-k) are three-dimensional representations of the structures s 1 , s 2 , s 3 in Panel (a) and of the structures at times t 0 , t 1 , t 2 , t 3 , t 4 , t 5 , respectively.The spheres are the positions of the C α atoms, and the tube represents the backbone.The black lines are the contacts considered to build the PG.

Figure 6 .
Figure 6.Distribution of the local force constants at times t 0 (bold green), t 1 (light blue), t 2 (red), t 3 (brown), t 4 (dark blue) and t 5 (orange) indicated in Figure 4 and discussed in the text.The gray area limited by dashed lines represents the range of values observed in the MD trajectory.

Figure 7 .
Figure 7. Free-energy graph calculations for the trajectory of Figure 1.(a) Local (blue), nonlocal (green), global (red), and collective (black) free energy with = 0. Horizontal dashed lines indicate the zero baselines of the free energies with the corresponding colors.(b) Enthalpy term of the free-energy graph