Folding Rate Optimization Promotes Frustrated Interactions in Entangled Protein Structures

Many native structures of proteins accomodate complex topological motifs such as knots, lassos, and other geometrical entanglements. How proteins can fold quickly even in the presence of such topological obstacles is a debated question in structural biology. Recently, the hypothesis that energetic frustration might be a mechanism to avoid topological frustration has been put forward based on the empirical observation that loops involved in entanglements are stabilized by weak interactions between amino-acids at their extrema. To verify this idea, we use a toy lattice model for the folding of proteins into two almost identical structures, one entangled and one not. As expected, the folding time is longer when random sequences folds into the entangled structure. This holds also under an evolutionary pressure simulated by optimizing the folding time. It turns out that optmized protein sequences in the entangled structure are in fact characterized by frustrated interactions at the closures of entangled loops. This phenomenon is much less enhanced in the control case where the entanglement is not present. Our findings, which are in agreement with experimental observations, corroborate the idea that an evolutionary pressure shapes the folding funnel to avoid topological and kinetic traps.


Introduction
The biological function of most proteins requires them to fold into a well-defined native state, implying that both structure maintenance and efficient folding are kept under selective pressure by evolutionary processes [1]. In particular, a direct experimental evidence, pointing to some degree of folding rate optimization throughout evolution, was recently provided for ribonuclease H, using ancestral sequence reconstruction [2]. Bio-informatics analyses had also uncovered similar evolutionary signals already two decades ago for several folds [3], and more recently for a large catalog of protein domains [4].
The latter study was based on the well known empirical correlation between experimentally measured folding rates of proteins and simple descriptors of the structural organization of the native state [5]. More general features of the folding mechanism are as well dictated by the overall topology of the native state [6]. In fact, contact order [7] and other related descriptors are based on the topological properties of the network formed by pairs of residues that are nearby in the three-dimensional space [8].
The simpler the network, the faster the predicted folding. The topology of the network of contacts, however, does not necessarily capture the topology of the protein backbone seen as a curve in the three-dimensional space, and the possible formation of knots and other entangled motifs.
The discovery of knots in few proteins [9] came indeed as a surprise because they seem an unnecessary complication for the folding. Their presence could be related to some biological function

Results
In our toy model for protein chains, we consider a structure-based energy function with a sequence-dependent energy (Section 4.1 for details). Once a native structure is chosen to define the energy function, all sequences in the model will have that structure as a ground state, with a sequence-dependent ground state energy. In this study, we consider two putative native states. One state is entangled, with two concatenated loops, whereas the second "twin" state is non-entangled ( Figure 1).
Despite the different overall topology, the two twin states display a very similar contact network topology (compare Table 1 and Table 2). In fact, one of the two states can be converted into the other by just switching the spatial positions of two particular amino acids, so that only a few contacts are rewired. The overall energy, however, can be kept exactly the same, upon also switching the corresponding amino acid types (Section 4.2 for details).
We focus on the two energies of the contacts involved in the closures of the loops, which can be either the concatenated loops in the entangled native state or the corresponding loops in the "twin" non-entangled state. These are represented as dashed lines in Figure 1, joining amino acid 3 with 8 and amino acid 11 with 16. Due to the symmetry of the conformations, the two contacts are in equivalent positions. Therefore, for any sequence s, we can distinguish the energies V 1 (s) > V 2 (s) of the contacts with, respectively, the weaker (higher energy) and the stronger interaction (more stable due to lower energy). Non-entangled native state. The putative "twin" native states are two self-avoiding walks on the fcc lattice with N = 18 sites. The two loops shown in red and blue are concatenated in the entangled structure, and not concatenated in the non-entangled structure. The amino acids at the ends of each loop form the contacts (dashed yellow and green lines) whose energy is studied in this work. Table 1. For the entangled protein structure (Figure 1a), pairs i ÷ j for which ∆ i,j = 1. Underlined pairs refer to the contacts at the ends of loops discussed in this work. Pairs in bold-face (10 over 35) refer to the contacts that are not present in the non-entangled "twin" (Table 2).

Concatenated Loops Slow Down the Folding of Random Sequences
We begin by comparing the average folding times of random sequences that fold onto the entangled native state, shown in Figure 1a, with those of their twin sequences that fold onto the twin non-entangled native state, shown in Figure 1b, with the same ground state energy. The random sequences {a 1 , . . . , a p , . . . , a q , . . . , a N }, assigned to the entangled native state, are sampled with a uniform distribution over all 20 possible amino acid types. The related twin sequences can then be defined as {a 1 , . . . , a q , . . . , a p , . . . , a N }. We perform n = 100 runs to estimate the average folding times for each considered sequence (Section 4.3 for simulation details). This procedure is repeated for 15 random sequences and their twins. Figure 2a shows the average folding time as a function of the weaker energy V 1 (s). As expected, the proteins that fold onto the entangled state need on average much more time than their twins. For random sequences, in the absence of an evolutionary process, the folding time is not correlated with the loop closure energy in both the link and the no-link case. A similar conclusion can be drawn for the stronger energy V 2 (s). We present the results for 15 random sequences only, because the difference between the two ensembles shown in Figure 2a is already apparent (note the log scale on the folding time axis).

Folding Time Optimization Results in Slightly Lower Average Energies for the Concatenated Protein Structure
After having established the folding kinetic properties of random sequences, we now study the outcome of an evolutionary process that optimizes the average folding time for the resulting sequences (Section 4.4 for more details on the simulated evolutionary process).
For a given choice of the putative native state, the evolutionary process is simulated for S = 31 independent replicas. In each replica, Z = 100 proteins with random initial sequences are evolved for a total of G = 1000 evolutionary steps, or generations. At the end of the process, the final ensemble of each replica consists of Z optimized sequences with statistical properties distinguished from the initial random ones.
We simulate the evolutionary process with either the entangled conformation (Figure 1a), or its non-entangled "twin" (Figure 1b), chosen as the respective native state. We focus on the properties of the "best" protein ω * s , i.e., the protein with the lowest average folding time τ s = τ(ω * s ) , within each system 1 ≤ s ≤ S.
The resulting native energy per residue E of such proteins, averaged on the ensemble of S independent replicas, is slightly lower in the link case ( E = −22.13 ± 0.16) with respect to the no-link case ( E = −21.76 ± 0.15). This energy difference is significant at the level of 1.7 standard deviations (p-value 0.045 with a one-tailed test); it may be needed to compensate for the entropy loss caused by the rigidity due to loop concatenation [25].

Folding Time Optimization Promotes Weak Interactions At the End of Concatenated Loops
For all proteins with the lowest average folding time, the latter is shown in Figure 2b as a function of the energy V 1 (s) of the weakest closure (a similar pattern is found for V 2 (s)), for the S fastest proteins evolved on the entangled native state and for the S fastest proteins in the no-link case. Clearly, the former ones on average fold more slowly than the latter ones. Most importantly, proteins evolved on the entangled native state are characterized by higher V 1 values, i.e., the closures of concatenated loops are less stable as a result of the evolutionary process, even if their overall native energy is lower (Section 2.2).
On average, over the S independently replicated evolutionary processes, we find Consistently, our findings reveal also that folding time optimization on the entangled native structure leads to sequences where concatenated loops are closed by amino acids that are less hydrophobic than the average one. This is made apparent in Figure 3a, where we focus again on the ensemble of the S fastest proteins found at the end of the corresponding independent replicas of the evolutionary process. We plot the frequency observed for each amino acid type a (the corresponding hydrophobicities are negative integers, −20 ≤ E a ≤ −1 in our toy model, Section 4.1): (i) regardless of its position along the chain, (ii) at one of the 4 sites at the ends of the two concatenated loops (dashed lines in Figure 1a), and (iii) at the complementary N − 4 sites. Case (i) and (iii) show that, on average, all amino acids are equally frequent, consistently with the sampling of amino acid types used in the evolutionary process. Note that, in principle, one might have expected an overall bias towards more hydrophobic residues, based on the naive expectation that the stronger the interactions the faster they form, but this is not the case. The presence of an essentially flat distribution over all residue types also suggests that our results do not depend on the specific choice of the simulation temperature nor on possibly different definitions of folding temperature [32] on which that choice is based (Section 4.3). The slight difference between (i) and (iii) is due to the inclusion in (i) of the statistics (ii) of the 4 special sites, which shows a significant departure from the flat profile. Indeed, due to the evolutionary pressure promoting fast folding, the most hydrophobic residues are selected against at the end of concatenated loops (amino acids with E a < −10 are nearly absent), whereas hydrophilic ones are instead found much more frequently (the distribution has a large peak in correspondence of the more hydrophilic amino acid with E a = −1).
It is important to check that this trend is actually due to the presence of two concatenated loops and not simply to the overall arrangement of the native structure. The corresponding frequencies observed for the amino acid types in the no-link case, where the 4 "twin" sites in the non-entangled state (connected by dashed lines in Figure 1b) are either singled out or excluded, are shown in Figure 3b. The residue type selection observed in the link case is much stronger than in the no-link case. Nevertheless, a similar, albeit much slighter, trend is present also in the latter case, with hydrophilic residues found more frequently than hydrophobic ones at the 4 special sites. The most hydrophobic residues can anyhow still be found in a significant amount.

Discussion
Within a toy model for short protein chains, we simulated an evolutionary process where folding time is optimized for a given native structure. Coarse-grained structure-based models are commonly used to study folding kinetics, in particular in the context of knotted and entangled proteins [24,31,[33][34][35]. We focus on folding time optimization because we want to test the effect of a specific evolutionary pressure on the folding kinetics due to the presence of entangled motifs. In general, the evolutionary fitness of a protein is obviously related to its biologic functionality, of which fast and reproducible folding is only a prerequisite.
In order to better understand the role of entanglement, we considered two similar native structures that are related by a subtle rewiring of few chain bonds and thereby differ only for the presence of a pair of concatenated loops in just one of them ( Figure 1). For any given sequence that folds onto the entangled native structure, a "twin" sequence can be obtained by switching two amino acid types, having the same ground state energy onto the non-entangled native structure.
Given the fundamental role played by the topology of protein structures [6], we expect that even the extremely coarsened approach used in this paper should detect the differences in the folding properties due to the presence of entangled motifs. Ultimately, the reliability of any modeling approach needs to be evaluated against its ability in reproducing observed data. As a matter of fact, our results reproduce sequence patterns related to the presence of entangled motifs that were detected by analyzing single-domain protein structures [25].
Namely, the evolutionary process leads to optimized sequences whose amino acids are enriched in hydrophilic residue types at the end of concatenated loops, when the latter are present in the native structure ( Figure 3). The results obtained within the toy model thus corroborate the hypothesis that the need to perform a fast and smooth folding process has selected amino acid sequences where some degree of frustration, in the form of unfavorable amino acid pair stability, is allowed. This energetic frustration is localized at the ends of concatenated/entangled loops, allowing to overcome the topological frustration implied by the presence of entangled motifs, in keeping with the principle of minimal frustration [36,37].
The crucial role of loop concatenation is benchmarked against the results obtained for the non-entangled native structure. A residual selection of hydrophilic residues is observed also in this case, hinting that the evolution of weak interactions to allow fast folding may be a feature non restricted to the closures of entangled loops. At any rate, the observed enrichment in hydrophilic residues is markedly weaker than for the entangled structure (Figure 3). At the same time, a bias, if any, is observed instead towards lower overall native energies of the evolved sequences for the entangled native structure (Section 2.2).
In general, folding is on average much slower in the presence of concatenated loops, as expected. This holds true when comparing the ensemble of sequences evolved independently on the two native structures (Figure 2b), and also when comparing the folding of random sequences onto the entangled native structure with their "twin" sequences on the non-entangled structure (Figure 2a).
To sum up, our results support the following picture: given a specific three dimensional arrangement of residues in the native structure, if evolution selects sequences enhancing the folding rate, a crucial byproduct is the removal of strong stabilizing interactions at the ends of loops, in particular at the end of concatenated loops that presumably need to be formed in the latter stages of the folding process. Future work should obviously investigate whether this results hold for more fine-grained protein models and/or for bigger protein chains.

Protein Chain Model
We model a protein as a N-site self-avoiding walk on the fcc lattice. Each residue i ∈ {1, 2, . . . , N} carries an amino acid type a i with hydrophobicity quantified from E a = −a for a ∈ {1, 2, . . . , 19, 20}. Hydrophilic amino acids have E a closer to zero and thus form weaker binding with other residues, as described next.
Non consecutive residues i, j (|i − j| > 1) form a contact if they are nearest neighbors on the fcc lattice. We follow a structure-based Go-like approach [30] and assign an energy to any such contact as Here ∆ i,j = 0, 1 is the native connectivity matrix (the contact map) in the Go-like model, namely ∆ i,j = 1 only if the contact i ÷ j is present in the native state. The energetic contribution E a i ,a j ≡ E a i + E a j is the simplest linear combination of the amino acid hydrophobicities. Note that all contact energies are attractive in our model, so that any given structure with a non zero connectivity matrix is the ground state for all sequence choices. The overall energy for a given chain configuration Γ with amino acid sequence {a i } N i=1 is obtained by summing over all nearest-neighbor interactions on the lattice among non consecutive residues along the chain: where I i,j (Γ) = 1 if sites Γ i and Γ j are nearest neighbors and I i,j (Γ) = 0 otherwise. Note that the simple additive form Equation (1) of the pairwise interaction potential was shown to capture much of the statistical variability of the interaction parameters derived by Miyazawa and Jernigan in a knowledge-based approach [38].

Native Conformations
We consider two alternative native state configurations for short self-avoiding chains (N = 18), shown in Figure 1. Both structures are the ground states for any sequence choice, in the corresponding structure-based models defined by Equation (2). They are non-degenerate only up to mirror images, since we do not consider terms that break the chiral symmetry. The list of their contacts in the native conformation is presented in Table 1 for the entangled protein and in Table 2 for the twin without entanglement.
The choice of short chains allows fast simulations, at the same time leaving the possibility for a non trivial topological structure that exhibit a pair of concatenated loops. We choose this entangled structure, Γ l , as the main object of our study (see Figure 1a).
The fcc lattice allows to build an almost identical non-entangled structure, Γ nl (through the text we called it the "twin" structure), where no pair of loops is concatenated (Figure 1b). The twin non-entangled structure Γ nl can be obtained by just switching the spatial positions (r p ↔ r q ) of the two residues p = 6 and q = 13 in the entangled structure Γ l . Note that this entails a rewiring of chain connectivity, consistently with the change in the overall topology. With a simple notation, the two twin structures are labeled as "link" and "no link" in the figures.
If, in addition, one performs a similar switch (a p ↔ a q ) between the corresponding amino acid types in the sequence, all amino acid types turn out to be kept in the same three-dimensional positions. However, the values of 4 contact energies are modified under the combined structural and sequence switches, in correspondence of the interacting pairs affected by chain rewiring.
Nevertheless, the simple form Equation (1) for the amino acid pairwise interaction potential ensures that, the sequence {a i }, with ground state energy E (Γ l , {a i }) on the entangled structure, will have exactly the same ground state energy as the switched "twin" sequence a i on the non-entangled twin structure: E (Γ l , {a i }) = E Γ nl , a i . Thanks to the above property, the non-entangled structure does not have any energetic advantage with respect to the entangled structure Γ l during the folding of selected sequences.
Finally, we note that there is an additional symmetry for both structures, because the inversion of the chain direction (i ↔ N − i + 1) produces its mirror image. Accordingly, the detected evolutionary signals will not depend on the asymmetry in the location of the concatenated loops along the chain.

Folding Simulations
The time trajectory of each protein in the conformation space is initialized from a random high-temperature configuration. The temperature is switched at time t = 0 to a value T = 1/β = 1/0.071 14.1 (with units in k B = 1) that was determined by averaging the folding thermodynamics properties of random sequences (Figure 4). This temperature leads to the eventual folding of the protein, a stochastic process that we simulate n = 100 times. Each realization 1 ≤ α ≤ n for a given protein ω takes place in a time τ α (ω) and an average folding time is then evaluated as τ(ω) = 1 n ∑ n α=1 τ α (ω). . Each curve is for a given random sequence. The mean folding inverse temperature is found by averaging the points where the curves cross the 50%. The inverse temperature used in the folding simulations within the evolutionary process is fixed to this value.
Protein time dynamics is simulated thanks to a set of Monte Carlo moves, including both local (crankshaft and end-flip) moves and a global sliding/reptation move, which have been carefully implemented to satisfy detailed balance ( Figure 5). ). In reptation, to satisfy detailed balance, the frequency of attempted moves in the two directions satisfy W R = 3W F ("F" and "R" following the notation in the figure) because there are 12 possible points for the added end vs only 4 internal sites for the added corner. The internal flip is in one out of the 4 possible sites if the corner is of 60 • or 90 • , while only two sites are allowed when the corner is 120 • . All attempted moves are then validated with self-avoidance constraints and eventually accepted with a Metropolis algorithm.
Time is measured in units of Monte Carlo sweeps, one sweep containing a fixed amount M = N + 1 of Monte Carlo moves. Local crankshaft and end-flip moves are selected at random with probability N/M and the global reptation move is chosen with probability 1/M to meet the intuition that a global rearrangement of the backbone is less likely to occur than a random local displacement. We include the global sliding move to endow the dynamics with the chance of threading a portion of the backbone through an already formed loop. The use of lattice Monte Carlo moves to study folding kinetics and the identification of Monte Carlo sweeps with folding time are well established in the field [35].

Evolutionary Process
Z = 100 protein sequences, each representing an organism, are involved in the evolutionary process. All sequences are assumed to fold to a fixed native structure according to the Go-like model defined in Equation (2). The native structure can be chosen as either the entangled structure in Figure 1a, or the twin non-entangled structure in Figure 1b. At beginning of the process, protein sequences are initialized by choosing randomly each amino acid with a uniform probability across all 20 possible types. During the process, we evaluate the average folding times for all proteins as described in Section 4.3. The longest folding time is associated with the lowest fitness of the organism hosting that protein. For simplicity, this just leads to its extinction, and its position in the niche is occupied by another element. For its replacement we follow this procedure: with probability p r = 1/3 a completely new random sequence of amino acids, sampled with uniform probability as above, is assigned to the newborn; otherwise the sequence of the faster folder is copied with partial fidelity, i.e., amino acids are kept the same with probability p c = 0.9 and uniformly sampled at random otherwise. The protocol for generating the new sequence thus includes the priority gained by the organism with the best fitness (yet allowing mutations of its genome) to populate the empty slot, but also the possibility of a random entrance of brand new organisms in the empty niche.