Heterodimer Binding Scaffolds Recognition via the Analysis of Kinetically Hot Residues

Physical interactions between proteins are often difficult to decipher. The aim of this paper is to present an algorithm that is designed to recognize binding patches and supporting structural scaffolds of interacting heterodimer proteins using the Gaussian Network Model (GNM). The recognition is based on the (self) adjustable identification of kinetically hot residues and their connection to possible binding scaffolds. The kinetically hot residues are residues with the lowest entropy, i.e., the highest contribution to the weighted sum of the fastest modes per chain extracted via GNM. The algorithm adjusts the number of fast modes in the GNM’s weighted sum calculation using the ratio of predicted and expected numbers of target residues (contact and the neighboring first-layer residues). This approach produces very good results when applied to dimers with high protein sequence length ratios. The protocol’s ability to recognize near native decoys was compared to the ability of the residue-level statistical potential of Lu and Skolnick using the Sternberg and Vakser decoy dimers sets. The statistical potential produced better overall results, but in a number of cases its predicting ability was comparable, or even inferior, to the prediction ability of the adjustable GNM approach. The results presented in this paper suggest that in heterodimers at least one protein has interacting scaffold determined by the immovable, kinetically hot residues. In many cases, interacting proteins (especially if being of noticeably different sizes) either behave as a rigid lock and key or, presumably, exhibit the opposite dynamic behavior. While the binding surface of one protein is rigid and stable, its partner’s interacting scaffold is more flexible and adaptable.


Introduction
The advent of the Next Generation Sequencing (NGS) technologies and accompanying software tools offers an unprecedented ability to sequence and analyze genomes not only of whole species but of individual specimens also [2,3,4]. More than 90 million protein sequences have been deciphered so far and that number grows at an enormous rate [5]. The analysis and comparison of individual specimen genomes to a referent genome(s) enables the detection of genomic patterns responsible for disease outset and progression [6,7]. The analysis of human proteome reveals that almost half of human genes and more than 60% of metabolic enzymes are expressed in majority of tissues [8]. Genes, their expression levels, their position within cells and tissues are easily obtained, but that data is not sufficient if we want to fully grasp the genomic and metabolic processes. Therefore, our ability to connect the enormous amount of sequencing and expression data to the biological mechanisms on the molecular level lags behind the capacity to generate that data. An even more pressing issue is the high attrition rate in drug development reflected as "industry's lower revenue growth, poor stock performance, the lowest number of new chemical entities (NCE) approvals and the poor late-stage R&D pipelines prevalent throughout the industry", a conclusion drawn in 2004 [9]. The past decade did not rectify this issue, as explained in [10] with emphasis on suboptimal preclinical strategies and failure of animal model monotherapies to address complex diseases like cancer. For example, on average, only 5% of anticancer agents active in preclinical trials are approved, as opposed to 20% of compounds used for cardiovascular diseases treatments [10]. The difficulties encountered by big pharma companies led to their general reordering, the lower market cap of top ten players and disappearance of some major companies due to acquisitions and mergers [11]. Rising budgets did not remedy this issue, nor the increase in licensing based and collaborative projects [11]. A better preclinical filtering should improve the initial drug testing and evaluation [10], but there is no definite strategy to address that issue yet. New approaches, such as the analysis of ligand binding behavior within a framework of chemico-biological space [12], may be a way toward a much better compound filtering during preclinical trials and thus toward a more efficient drug design.
To fully comprehend the biological process on the molecular level we first have to understand physical laws governing the interactions of biological polymers. Protein-DNA and protein-lipid interactions had been successfully addressed [13,14,15,16,17], but the problems of protein folding [18,19] and protein-protein interactions [20,21,22,23,24,25] are issues still requiring full attention of the scientific community. Many attempts were made to develop a comprehensive protein-protein interaction theory. The recognition of binding residues using analysis of sequential and structural properties of heteromeric, transient protein-protein interactions produced very good overall results as shown by Neuvirth et al [26]. Chen and Zhou [27] used sequence profiles as well as solvent accessibility of spatially neighboring surface residues fed to neural networks to develop a successful binding sites recognition protocol. By applying a linear combination of the energy score, interface propensity and residue conservation score Liang et al [28] achieved a decent coverage and accuracy. Zhang et all focused their effort on the interface conservation across structure space [29], while Saccà at all introduced multilevel (protein, domain and residue) binding recognition using Semantic Based Regularization Machine Learning framework [30]. It was shown recently that three-dimensional structural information, either based on data from PDB [31] or obtained from homology modeling, produces robust and efficient prediction of protein-protein interactions when applied with information on structural neighbors of queried proteins and Bayesian classifiers [32]. The protein (co)expression also attracts researchers. For example, Bhardwaj and Lu [33] showed that the complexity of co-expression profiles in protein networks rises with the increase of the interaction/connectedness in those networks.
The application of coarse-grained force fields in the analysis of protein-protein associations also attracted the attention of research community [34,23]. Basdevant, Borgis and Ha-Duong analyzed dimer association using coarse-grained SCORPION force field model of protein and solvent [35]. The force field model was able to recognize near native decoys of three different protein complexes (out of thousands of decoys), and to efficiently simulate the dynamics of recognition of a protein complex starting from different initial structures. A similar approach, in a combination with a push-pull-release sampling strategy, was applied by Ravikumar, Huang, and Yang to examine protein-protein association in a number of complexes [36]. M. Zacharias combined bonded atomistic with coarse-grained nonbonded interactions in his force-field model to simulate peptide-protein docking and refinement from different stating geometries with acceptable accuracy [37]. Solernou and Fernandez-Recio developed pyDockCG [38], a coarse-grained potential for protein-protein docking scoring and refinement, based on the earlier UNRES model developed for the protein structure prediction [39]. A coarse-grained approach (one pseudo atom per every three residues) by Frembgen-Kesner and Elcock showed to be able to excellently reproduce the absolute association rate constants of wild-type and mutant protein pairs via Brownian motion simulations when hydrodynamic interactions between diffusing proteins are included [40].
The elucidation of physical interactions of proteins is appealing to pharmaceutical industry also [41], with emphasis on small molecule inhibitors of protein-protein interactions [42,43]. The inhibitors 4 "directly interfere with the interface of the protein complex, or bind away from the interface and cause or prevent conformational changes that preclude formation of the complex" [42]. The interest of pharmaceutical industry is not surprising because it is known that mutations, which disrupt threedimensional structure, can be cancer drivers and thus require adequate treatment [44].
My aim to address the physical interactions between individual protein chains forming protein dimers. I want to use the structural information only and the theory of phantom networks through its implementation via the Gaussian network model [45,46,47,48,49,50,51,52,53,54,55,56]. The connection between kinetically hot residues and interface residues has been established already [57].
The methodology I describe in this paper moves forward and introduces a self-adjusting approach aimed at recognizing binding surfaces and correspond structural scaffolds (contact and first layer residues).
That approach was first described in [1]. That manuscript was only a rough draft of the present manuscript. The results depicted here show that at least one of the chains forming a heterodimer has its contacting scaffold surrounded or bounded through its kinetically hot residues. One of the partners (usually the longer one), has binding areas and corresponding binding scaffolds defined by its kinetically hot residues. The other chain is often more flexible. It passes through structural adjustments, thus the recognition of its binding residues may be less successful with the methodology based on the distribution of kinetically hot residues. This pattern is more frequent with heterodimers composed of protein chains of similar sizes (chain lengths). However, with shorter chains the adjustable GNM approach may be less precise because their small size produces many false positive hits. The fact that at least one of the binding partners has binding areas defined by its kinetically hot residues (and thus less movable than other residues) may suggest that the heterodimer protein formation is entropically driven, i.e., that the protein chains are driven to interact by a common need to increase entropy (reduce the ordered/rigid binding scaffolds).
The term "kinetically hot residues" is similar to the term "hot spots" and they have a lot in common. Residues that often appear in structurally preserved interfaces (in more than 50% of cases) are termed hot spots. The hot spots are important because they are major contributors to the binding free energy. They were initially screened using the alanine-scanning mutagenesis, and therefore defined as spots where alanine mutation increases the binding free energy at least 2.0 kcal/mol [58,59,60,61,62,63,64]. Bogan and Thorn [58] showed that hot spot residues are enriched in tryptophan, tyrosine and arginine, and that they are surrounded with residues which role is to occlude solvent from the hot spots (O-ring residues hypothesis). They also observed that "(n)either the change in total side-chain solventaccessible surface area on complex formation (ΔASA) nor the sidechain ΔASA of hydrophobic atoms is well correlated to the change in free energy". They concluded that solvent occlusion is necessary, but not sufficient condition for a residue to be a hot spot. The hot spots have been addressed using various computational methods [60]. Tuncbag et al. used information on conservation, solvent accessibility area and statistical pairwise residue potentials of the interface residues to computationally determine hot spots. Their combined approach achieved both accuracy and precision between 64 and 73 % of the Alanine Scanning Energetics and Binding Interface Databases. They observed that "conservation does not have significant effect in hot spot prediction as a single feature". However, their results indicate that the "residue occlusions from solvent and pairwise potentials are found to be the main discriminative features in hot spot prediction". Lise at al. [61,62] combined machine learning and energy-based methods to predict hot spot residues. They applied standard energy terms (van der Waals potentials, solvation energy, hydrogen bonds and Coulomb electrostatics) as input features to Support Vector Machine (SVM) and Gaussian Processes learning protocols. They also attempted to predict the change in binding free energy ΔΔG upon alanine substitution but achieved only a limited success. Den et al. [65] also used Support Vector Machines with Random Forest selection and Sequential Backward feature elimination to predict hot spots. As features they used various molecular attributes (local structural entropy, side chain energy score, four-body pseudo-potential, weighted relative surface area burial) and residue neighborhood defined via Euclidian distances between residues/heavy atoms, with Voronoi diagram/Delaunay triangulation employed to describe residue's neighbors. They ended with 38 features which the SVM protocol utilized to very efficiently predict hot spots. Kozakov et al. [63] analyzed druggable hot-spots via computational method that places small organic molecules-probes (16 of them) on a grid around target protein. The spots on the surface of the target protein that favorably interact with a number of probes are clustered and those clusters ranked according the average free energy. The consensus regions (the ones that bind many probes) are taken to be hot spots. The authors concluded that according to their protocol, hot spots "possess a general tendency to bind organic compounds with a variety of structures, including key side chains of the partner protein". I emphasize this because my results show that the binding surfaces of proteins are often determined by the structure of the host proteins only. That may imply that they are receptive for other proteins and/or small molecules besides their general binding partners. Their method is able to recognize hot spots even in unbound cases. My approach was also able to distinguish dimer decoys that were created with structures of unbound chains (see Testing set analysis in this paper). Tuncbag et al. [64] observed that "globally different protein structures can interact via similar architectural motifs". They employed that fact through the PRISM 6 algorithm that "utilizes rigid-body structural comparisons of target proteins to known template proteinprotein interfaces and flexible refinement using a docking energy function." The paper starts with an overview of methods and tools (a short overview of the theory of and the Gaussian network model (GNM) is given in the Supplementary material). After that, the definition of target residues as well as the short description of training and testing sets is given. The first simple 1D prediction (sequential neighbors influence) based on five fastest modes only is given in the third chapter. In the same chapter, a prediction approach that uses modes that correspond to the upper 10% of the vibrational modes range is given. After that, I briefly describe the behavior of dimers with different lengths of their protein constituents and introduce a significant improvement of the simple modelprediction based on the adjustable number of modes. In the next chapter, I introduce the predication improvement, i.e., the movement from the one dimensional (1D) prediction to a full 3D approach.
Finally, I test the adjustable algorithms on the Sternberg [66] and Vakser decoy sets [67]. While doing so, I compare the structurally based, adjustable GNM method to the Lu and Skolnick's detailed, residue level statistical potential approach [68]. The paper ends with the Conclusion.

a. GNM code
The Adjustable Gaussian Network Model code is composed of several units. The first unit calculates contact maps and the corresponding eigenvectors and eigenvalues [69] for both chains forming a protein dimer (given as a PDB file). The code first calculates the Kirchhoff contact matrix . The matrix  calculation is based on the distances between C atoms only, and those distances have to be lesser or equal to 7 Å to consider two residues in a contact [51,52,53]. The code then calculates and sorts  matrix eigenvalues and eigenvectors. The eigenmodes are sorted according to their corresponding eigenvalues. Those eigenvalues and eigenvectors are used in the second part that (iteratively) calculates the weighted sum of modes [54] as An additional code extracts contact and first layer residues and reports the number of atoms per each contact or first layer residue, followed by the total number of heavy atoms per each residue. Finally, the third set of routines extracts neighboring residues and their distances for each residue per chain. That information is later used in the spatial spreading of the influence of kinetically hot residues.

b. Targets
My aim is to recognize contact patches on the surface and the corresponding scaffolds the interior of chains forming protein dimers. First, I want to recognize contact residues. Those are amino acid residues in which at least one atom is at the maximum distance of 4.5 Å from one or more atoms from the surface of the other chain. The distance of 4.5 Å corresponds to the size of one water molecule. I also use the notion of a first layer residues (FLR) to describe the scaffold(s) surrounding contact residues. Those are neighboring residues from the same chain as the chain's contact residues (at the maximum atom-atom distance of 4.5 Å) (for a visual description see Figure S1 in the Supplementary material).

c. Training set
The training set is comprised of 434 protein dimer complexes (see Supplementary material for the full list of dimers; this set is inspired by the Chen dimer set). The set is made of existing protein dimers and two chain subsets of various protein monomers. It is separated into heterodimer and homodimers, using two criteria: (1) if the ratio of chain lengths (chain length is the number of residues in that chain) in a dimer complex is greater than 2, that complex is considered to be heterodimer; (2) if the ratio of chain lengths is smaller than 2, the Smith-Waterman sequence alignment algorithm [71] was applied to recognize and separate dimers in which chain sequences are highly similar. This approach was applied following the logic of homology modeling principles that say that high sequence similarity implies high structural similarity [72,73,74,75]. Therefore, the first group contains dimers in which constituents do not bear structural similarity, while the second group has members that are sequentially and structurally highly similar. We separated dimers into two groups because we assumed that heterodimers and homodimers have different binding mechanism. Different behaviors of these two groups may imply that their kinetically hot residues may not be have the same role in protein binding. We used this approach 8 because the Gaussian Network Model is based on structural organization of residues, i.e., on the spatial distribution of C atoms in proteins.
Of 434 dimers, 135 are heterodimers, and the rest are homodimers. Majority of chains in our set are shorter than 300 residues, but we also have a number of chains longer than 400 residues. The distribution of chain lengths is shown in Figure S2 in the Supplementary material.

d. Testing sets
The Sternberg [66] and Vakser [67] decoy sets are numerically created decoy sets made with the intention to test protein binding prediction protocols. Each decoy set is based on a naturally occurring protein dimer complex with a known structure. Each individual decoy from a set is a protein complex numerically created by joining two (or more) individual chains based on the corresponding non-bound structures.
The Sternberg decoys sets [66] are comprised of 100 decoys each with first four being near native structures, and the first one being the native structure itself. The decoys are generated using unbound structures of the chains forming native dimer structures. I used only dimer sets (10 sets) and applied the adjustable GNM algorithms independently to both chains per decoy.
Every Vakser set contains 110 decoys. Certain number of those decoys are near native structures (in most cases, 10 of decoys in a set are near native, as determined by their root mean square deviations from the native structure(s)). I used only dimer sets (41 of the 61 decoy sets).

Results and Discussion
a. Simplest 1D prediction (sequential neighbors influence only) based on

fastest modes
The first method I tried is based on the approach of Demirel et all [54]. They used five fastest modes to recognize kinetically hot residues in proteins. We used this approach to separate the potential contact and first layer residues from the rest of residues in the chain. In my implementation of their scheme, the 9 first step was the calculation of the weighted sum (Eq. 1). That sum gives a kinetic contribution of each residue for a given set of modes. I normalized the sum and analyzed only hot residues with the normalized amplitude higher than 0.05. The number of hot residues is usually smaller than the number of contact or first layer residues. To account for that, I spread the influence of a hot residue to its sequential neighbors only using sequence information from the PDB file (I did that to account for possible missing residues). I spread the influence of hot residues linearly, to sequential neighbors because proteins are chain with physically connected residues. That means that sequentially neighboring residues exhibit correlated behavior. For chains longer than 100 amino acids, I labeled the hot residue and its 8 neighboring residues upstream and downstream as predictions (four upstream, four downstream). For shorter chains the influence is spread to 6 neighboring residues only. That means that I assumed that those residues are either contact or belong to the first layer residues. I used this approach on all 414 dimers regardless the chain length or the nature (hetero or homodimer) of a particular dimer complex.  predictions is still not good enough for the general purpose application. However, the distribution of good and bad predictions is not uniform over the chain lengths as the histogram in Figure S3 in the Supplementary material nicely depicts. The prediction method based on the five fastest modes is much more successful with shorter (and thus less voluminous) chains, than with longer ones. With chains longer than 100 residues, but shorter than 200, the prediction algorithm was not satisfying at all, because it put more predictions in the lower right quadrant than in upper left. However, for chains shorter than 100 residues, it put much more predictions in the upper left (good predictions), than in lower right quadrant, which means that 5 modes may be only good for smaller proteins.
To test the assumption that heterodimers behave differently from homodimers, I applied the above-described method on heterodimers only. Figure   The previous analysis and the corresponding distribution of good and bad predictions over the chain lengths indicate that five modes are not enough to decipher the contact patterns on the surface and in the interior of proteins via the weighted sum calculation (Eq. 1). With shorter chains (up to 100 residues), the weighted sum of five fastest modes is able to give a satisfying prediction of binding and first layer residues, but with longer chains, the prediction efficiency fails (Figs. S3 and S4 in the Supplementary material). Therefore, it can be assumed that the number of modes should be adapted to each individual protein chain. That number would be difficult to determine knowing only the length of a protein chain because, when sorted, the distribution of the mode intensities (eigenvalues) is not a linear function of the mode indexes (see Fig. S6 in the Supplementary material). The distribution of eigenvalues depends on the chain's length as well as on the chain's three-dimensional configuration. Therefore, I opted to apply modes that correspond to the top 10% of the eigenvalues span. Figure S6 nicely depicts that top ten percent of eigenvalues are covered by five modes for the protein 1ETT chain H, itself made of 231 amino acid residues. The chain P from dimer 1BVN (496 residues), has 14 modes covering top 10%, and the chain A from 1QGK (876 residues) has 29 fastest modes covering the top 10% of eigenvalues. There is a correlation between the number of residues and the number of fast modes, but it is not strictly linear.
When this approach is applied with heterodimers, the amount of true positives becomes increased. That can be observed with the four protein chains I used previously (see shorter than 200 residues. However, this approach is able to put a chain with very high number of residues (876, chain A from 1QGK) into a group of good predictions.

b2. Analysis of heterodimers with noticeably different sequence lengths
Heterodimers are protein complexes composed of two amino acid chains with no apparent sequential and structural similarity. What is the mechanism of formation of such entities? What kind of attraction connects two or more different proteins entities? In protein-DNA or protein-lipid interactions, electrostatic forces are the key binding ingredients. Such forces have a small influence in protein-protein interactions.
When a protein dimer is analyzed, one may wonder whether its two constituents evolved separately, or were they created by a mutation that broke a single protein chain into two separate parts?
If we expand this assumption, we can assume that such a mutation can easier survive if a point of separation is toward the end (or beginning) of the initial, single chain (single mutation is, of course, a euphemism for a much more complex random biological process). In that case, the longer sub-chain has a higher probability of preserving its fold and function. It will preserve homology with the initial chain, and high homology implies similar folding pattern [72,73,74,75]. The probability of surviving is much higher than with mutations that break a protein into constituents of similar sizes. Namely, a chain produced by an asymmetric breaking will have more chances of surviving the evolutionary pressures.
That may also imply that a longer chain, produced by that single mutation, when interacting with its shorter partner (if that partner survived throughout evolution) may preserve its fold during (and upon) the binding. On the other hand, if the dimer constituents evolved separately, longer partners may be less prone to significant structural changes during the binding. However, we can also apply the reverse logic, namely that shorter constituents produced by a sequence breaking mutation preserve their fold during the binding process, only because they are short and therefore easier to fold and fit into the longer partner's pocket(s). All this may imply that kinetically hot residues may determine the shape and the position of a scaffold that determines binding spots in individual heterodimer chains.
To test the assumption that kinetically hot residues determine the binding scaffolds in protein dimers I separated the list of 135 heterodimers into two groups according to the length ratios of their constituents and separately analyzed heterodimers with sequence length ratios higher than two. I eliminated chains with sequence lengths smaller than 80 from this group to eliminate examples with high percentage of both true and false positives. Figure

c. Prediction based on the adjustable number of modes
The previous attempts to recognize contact and first layer residues via the Gaussian Network Model were based on a static approach in which protein dimer structures were analyzed using either 5 fastest normal modes, or modes which correspond to the top 10 % of the eigenvalues range. Those approaches showed that kinetically hot residues may play a role in protein-protein interactions, but they did not offer enough proof for that assertion. With some chains, they produced excellent results, but with some they failed. More importantly, the percentage of good predictions (the amount of chains with more than 50% of true positives and less than 50% of false positives) was comparatively small (always less than 40 % of all the chains analyzed). Many of the chains had a very high percentage of booth true positives and false positives. In addition, a significant number of proteins had a very small percent of both true and false predictions. All that implied that prediction algorithm had to be improved.
The analysis of the average percent of targets per sequence length reveals that the amount of targets and the chain length are inversely proportional. Larger proteins with longer sequences have a smaller percentage of contact and first layer residues than shorter chains. Figure 4 depicts the distribution of targets over the protein sequence lengths. It clearly shows that small proteins (shorter sequence lengths) have much higher ratio of contact and first layer residues than larger proteins (longer chain lengths).
The information on the targets distribution can be used to improve the prediction approach. The prediction can be adjusted to each particular protein chain through a comparison of the current prediction 14 output, i.e., current percent of predictions, to the expected, i.e., average percent of targets for that protein's sequence length class. The improvement of the prediction algorithm can be performed as follows: -If the overall percent of predictions is too large for that protein's sequence length class (for example, if the percent of predictions is larger than 60% of the total number of residues), the number of fast modes should be reduced by one and the whole prediction procedure should be repeated (Eq. 1).
-If the percent of predictions is too small for the protein's sequence length class (e.g. less than 20% of all residues), the number of fast modes should be increased by one and the whole prediction procedure should be repeated (Eq. 1).
-The procedure should be repeated until the percent of predictions does not fall between maximum and minimum amount of predictions for a given sequence length.

c1. One-dimensional linear prediction
I adopted a simple strategy to adjust the number of modes for each particular chain. If the number of residues in a chain is less or equal than 300, too many predictions is taken to be 60%. In that case, i.e., if the amount of predictions is over 60% of all residues, the number of modes is reduced by one, and the prediction procedure is repeated (Eq. 1). Similarly, if the number of residues greater than 300, the too many predictions is taken to be 50%. Furthermore, if the chain length is less or equal to 500 residues, too few predictions is taken to be 40%. For cases like that, the number of modes is increased by one and the whole procedure is repeated. For longer chains, too few predictions is 20%. To avoid infinite loops, only one increase followed by a decrease is allowed, and vice versa. The prediction procedure itself spreads the influence of kinetically hot residues linearly upstream and downstream along the sequence, as with the previously described methods. The initial number of modes corresponds to the top 10% of eigenvalues range for the chain being analyzed. This approach ensures that longer chains have enough predictions, and that shorter ones are not saturated with too many predictions which produce increase of false positives. The analysis performed on the four chains used previously used to describe the prediction procedure confirms the above results ( Figure S14 in the Supplementary material). For the three longest chains, 1BVN chain P, 2SNI chain E, 1UDI chain E, the percent of true positives is over 50%, and the percent of false positives is less than 50 %. The chain E of 1UDI, has a highest difference between the true and false positives which is an indication of a high correlation between the kinetically hot residues and contact scaffolds for that chain. Only the shortest example, 1CXZ chain A, has both true and false positives over 50 %.
When this approach is applied to heterodimer proteins with low sequence length ratios (length ratio less than two, with the chains longer than 80 residues), the results are less than satisfactory (

c2. 3D spatial prediction -Variable influence of hot residues
The adjustable algorithm I introduced in the previous chapter uses sequential neighbors only to spread the influence of hot residues. It produces a good prediction of contact and first layer residues, but offers a room for improvement. The prediction can be improved if spatial neighbors, instead of sequential ones, are used to spread the influence of hot residues. This approach is much closer to the true nature of the GNM algorithm that uses only spatial distances between C atoms and disregards any sequential/connectivity information. To apply this approach, I defined the maximum C-C distance (a cutoff distance) to which the influence of hot residues can be spread. I applied the cutoff distance of 6 Å for shorter chains (for amino-acids sequences shorter than 250) and the cutoff of 8 Å for longer chains.
All residues which are within the sphere with the center in the C atom of the analyzed hot residue, and with the radius equal to the assigned cutoff distance, are considered to be "predictions", i.e., they are assumed to be either contact or first layer residues. All other residues are rejected (for that particular hot residue). I came to the two cutoff values empirically. To extract spatial neighbors, I calculated distances between residues (C-C distances) for each particular chain and sorted them in ascending order.

c3. Combining the sequential and spatial approaches
The two methods described previously base their prediction on the adjustable number of modes.
The first method spreads the influence of hot residues linearly, i.e., to sequential neighbors only, while the second method spread the influence to spatial neighbors within a sphere of a given radius. The first method, in which the influence of a hot residue is spread to sequential neighbors only, treats a protein chain only as a set of amino acids chained together and disregards its three dimensional structure. The second method, on the other hand, only sees the spatial-3D neighborhood of a hot residue and rejects the fact that the protein is a chain, i.e., an ordered set of amino acids that are physically connected. In this chapter, I will present my attempt at combining the sequential and 3D spatial approaches to boost the overall prediction. By combining the one-dimensional linear approach with the three-dimensional one, I include the residues connectivity information into the structure based method and thus take into account the chain-like nature of proteins (GNM method disregard chain connectivity and uses only physical distances between C atoms to calculate the protein connectivity matrix). In this combined approach, the influence of a hot residue is first spread linearly to its sequential neighbors (upstream and downstream along the sequence). After that, the influence is spread to the hot residue's spatial neighbors whose C atoms are within a sphere of a given cutoff radius with a center in the hot residue's Cα atom (the radius is 6 or 8 Å, depending on the sequence length). When both chains per dimer are addressed as a pair using the combined approach (adjustable GNM, plus 1D & 3D influence of kinetically hot residues), the analysis confirms that heterodimers with high sequence length ratios often behave quite differently from heterodimers with low sequence length ratios; see Figure 8. In majority of cases belonging to the former group, at least one chain has contact and first layer residues gathered around its kinetically hot residues (as recognized by the adjustable GNM). Figure 8a shows dimers has only one). On the other hand, 28.8 % of the low sequence length ratio dimers has both chains above the diagonal (43.9 % of low seq. heterodimers has only one chain above the diagonal). This analysis suggests that chains forming high sequence length ratio heterodimers (seq. length ration higher than 2) often behave like rigid lock and key. They have at least one rigid interfacial surface (often both surfaces are rigid). Chains forming low sequence length ratio dimers are more flexible in that respect and more often than not have one of the chains more flexible than the other. That chain adjusts its conformations for a tighter fit.

d. Prediction algorithms comparison
In the previous chapters, I have presented a number of methods for the contact and first layer residues prediction. I started with a very simple approach based on the fixed number of modes (5) Figure 9). The true prediction improvement is achieved only with the adjustable number of modes (prediction protocol d in the Figure 9). Additional improvement is achieved with the full 3D influence spread (protocol e in Fig. 9) and with the combination of 2D and 3D approaches (protocol f in Fig. 9).
The analysis of the relationship between the number of modes and sequence length (Table S1 in  The visualization of targets predictions can give a good overview of the ability of the adjustable approach (1D and 3D combined) to recognize binding scaffolds. Figure 9 nicely depicts than in some cases the adjustable GNM can be very accurate in predicting the binding scaffolds.

e. Vakser and Sternberg decoy sets
Previous chapters dealt with the development of the contact residues recognition protocol. In this one, I would like to depict how the protocol behaves with the Vakser [67] and Sternberg decoy sets [66].
Those decoy sets are numerically created protein structure sets made with the intention to test protein binding prediction protocols. Each decoy set is based on a naturally occurring protein dimer complex with a known structure. For each decoy, its contact and first layer residues are calculated, for each chain forming a dimer. When decoys are formed using the same structure, the fast mode calculation was performed only ones, as well as the calculation of neighboring residues per chain (for 3D prediction protocol). All adjustable GNM algorithms were tested, and the pure 3D adjustable approach showed the best overall results I also calculated the binding energy for each decoy pair using the statistical potential of Lu and Skolnick [68] and compared it to my adjustable GNM prediction protocol. The residue-residue based approach of Lu and Skolnick assesses the strength of each decoy (both chains together) using an empirical statistical potential (given as a 20×20 matrix). The binding affinity of each decoy is expressed as a potential energy of binding. The lower that energy is, the more probable the decoy is, according to the statistical potential method. Figure 11 depicts the behavior of decoys on the true/false scatter plot used in previous chapters via two decoy sets (1CHO and 2SIC).. In my analysis the standing of each decoy chain is calculated as its Cartesian distance from the point with coordinates 0, 1, i.e., the standing is the chain's "distance" from a point with 0% of false predictions and 100% true predictions (see Fig.   11). Figure 12 depicts the behavior of the adjustable GNM in the combination with the statistical potential using the same two decoy sets.
The quality of the assessment of both methods (adjustable GNM and statistical potential) is chains forming these dimers probably experience significant structural rearrangements during and upon the binding. In most cases, the longer chain is better assessed through the adjustable GNM than its shorter partner, which was to be expected following the assumption on the opposite behavior of binding partners, but in some cases (Vakser sets 1BVN_PT, 1GPQ_DA, 1HE1_CA, 1MA9_AB, 2BKR_AB, 2BTF_AP) the shorter partner has a better score. The adjustable GNM protocol is fairly successful in predicting near native structures. That information should be taken in the light of fact that near native decoys are based on nonbound structures, which makes the protocol even more successful. Similar ability was reported by Kozakov et al. [63] The statistical potential produces much better overall results, but in some cases (Vakser set 1PPF_EI, for example), the structural evaluation of the decoys set was better than the evaluation using the empirical statistical potential.

Conclusion
In this paper, I addressed the physical interactions of proteins. I was primarily interested in physical laws governing the protein binding, and not in correlation of expression profiles of proteins belonging to same metabolic pathways. I used the statistical mechanics tools, namely the theory of phantom networks [45,46,49], and its Gaussian Network Model expansion [51,52,53,54,57] in attempt to relate the kinetically hot residues and the binding scaffolds on the surface and in the interior of proteins. As the number of residues emphasized by GNM is usually smaller than the number of binding residues, I developed an algorithm which spreads the influence of a hot residue to its neighbors.
I started with a small, and fixed number of fastest modes, i.e. 5, as suggested in Demirel et al [54], and later expanded that to the number of modes corresponding to the upper 10% of the eigenvalues span.
Both approaches offered only a limited ability to correlate the kinetically hot residues and the binding and first layer residues. A limited success was produced when heterodimers with significant difference in chain length were analyzed (chain lengths ratio higher than 2). The true improvement was produced when the number of modes was allowed to fluctuate until the number of predictions matches the expected number of targets for a given sequence length. The further improvement was achieved when the influence was spread to spatial neighbors instead to sequential ones only. Their combination was also tested, and it showed to have the best ability to recognize the target residues. The two adjustable methods, the first with sequential spread of the influence of hot residues and the second with the spatial spread have similar abilities to recognize binding scaffolds. That may imply that connectivity information, 21 although not explicitly present in the Gaussian Network Model, partially determines kinetic behavior of residues and thus should be involved in the analysis of the behavior of contact and first layer residues (and in the analysis of the overall physical behavior of proteins).
The adjustable GNM with spatial influence of hot residues was tested on two set of decoys, Sternberg [66] and Vakser [67]. With both sets, the adjustable GNM algorithm achieved a noticeable success. In 19 out of 41 Vakser cases, and in 4 out of 10 Sternberg decoy sets it was able to correctly distinguish at least one near native monomer from the rest of false decoys (it was usually a longer chain in tested dimer). I also combined the adjustable GNM with the statistical potential of Lu and Skolnick [68], and that combination improved the prediction in comparison to the pure adjustable GNM (see last 6 rows in Tables 1 and 2).
In a high sequence length protein dimer (a heterodimer with a significant difference in chain size), at least one of the chains has a rather stable contacting scaffold (in 83% of cases at least one chain is in the upper left quadrant). Stable in a sense that it has a significant number of kinetically hot residues in its interacting scaffold. Its smaller partner is often more flexible in that respect. In low sequence length ratio heterodimers that propensity is much lower (only 57% of cases has at least one chain in the upper left quadrant, with 43% of cases having none). However, this should not be a general conclusion because smaller (sequentially shorter) protein chains have large number of false positives simply because they have relatively large total number of targets (contact and first layer residues, see Figure 4). A similar observation was made by Martin and Lavery [76]. They concluded that the surface of a small chain easily gets saturated with contacts when bound to a larger partner. With longer chains, contact residues are highly localized. They also observed that docking hits tend to accumulate closer to the geometrical center of the protein. That observation is in concordance with my approach in which I use first layer residues besides contact residues to distinguish near native decoys from false ones. Residues in the geometrical center are surrounded with a large number of neighbors and have higher packing density. They are, therefore, more stable and thus emphasized by the fast modes.
The results depicted here may imply that the protein-protein interactions are entropically driven [77]. Highly organized pockets delineated by kinetically hot residues attract (often smaller) partnering chain in attempt to increase the total entropy of the system (decrease the order defined by then unmovable, kinetically hot residues).
My approach, although able to dynamically connect the kinetically hot residues and binding patches and the corresponding structural scaffolds still has a space for improvement. For example, the application of surface area descriptors may reduce the false positives rate. A better estimation of the number of expected target residues may also help. The application of latest protein-protein and proteinligand databases [43] may help in tuning the algorithm.
The work depicted here was primarily focused on the behavior of heterodimers. Homodimers and were not explored in detail. Their behavior should be addressed more thoroughly, for example, using a combination of slow and fast modes [34]. Furthermore, slow modes describe global motions of chain segments and may lead toward a better understanding of conformational changes proteins undergo during and upon binding. Those changes still lack proper quantification [78].
The importance of this work is twofold. First, it gives an efficient algorithm able to decipher individual protein-protein interactions, and offers a theoretical insight into the mechanism of protein binding. Second, it can be an excellent guide to a more efficient drug design, especially for the design of small molecule inhibitors of protein-protein interactions [41,42,43,44]. The algorithm(s) depicted here can help in filtering "in-house" databases [42] and thus facilitate the drug screening process.
The main message this manuscript delivers is that kinetically hot residues often determine heterodimer interactions. The fact that in heterodimers, at least one of the chains has its binding scaffold determined by the kinetically hot residues may imply that protein-protein interactions are, at least partially, entropically driven. This observation opens an area for further research       It is clearly visible that in general the percent of targets is a decreasing function of the sequence length.

Main text tables
32

Gaussian Network Model Theory
Biological polymers can be perceived as canonical ensembles (NVT ensembles described by the number of particles N, volume V and temperature T). That implies that they should be treated as systems in statistical equilibrium that do not evolve over time. They are also in thermal equilibrium with each other. Two proteins brought into contact will retain the same ensembles; their combined ensemble will be canonical ensemble itself. They are guided by the Boltzmann distribution, i.e., molecular bonds can be seen as independent entities guided by the Boltzmann distribution (the probability of bond i having the energy Ei is . Further approximation treats polymers as phantom networks. The theory of phantom networks was introduced by James and Guth in 1947 [44]. It was further expanded by Flory [45,48]. The phantom network theory assumes that:  [45,48]. I will give a brief overview of the theory of phantom networks, for the sake of clarity of the presentation. The theory of phantom networks begins with the assumption that chains and junctions can move freely through each other. It is also assumed the equilibrium Gaussian distribution of polymer constituents as a density distribution W(r) W(r) can be expressed as   . Z is the configurational integral for the free chain, and Z is the configurational integral over the configurational space in which is r is restricted to a given value [45].
The configuration partition function ZN for the network may be written as a product of the partition functions of the network's v individual chains. The individual partition functions are fully determined by the end-to-end vectors rij that connect junctions i and j [45]. Therefore, ZN can be expressed as a product of all junction ij connected by a chain as [45]   3 The partition function ZN can be described by the end-to-end vectors rij as The sum of these vectors can be expressed through the distances Ri between junctions with coefficient γij being equal to if junctions i and j are connected by a chain, zero otherwise [45].
The sum of end-to-end vectors in the Eq. S4 can be expressed via the quadratic symmetric matrix Γ [45]     This transformation can be easily proved via the basic tools of linear algebra.
The elements of the matrix Γ are If all non-zero elements γij are equal, and that is the case when all chain links are equal, consequently all 0 ij r identical and matrix  is Kirchhoff contact matrix [45]. Therefore, Eq. S4 can be written as (S7) If one of the junctions is designated as zeroth, then all others can be measured from that one. In polymers we have two types of junctions. The matrix  can be represented as a composition of two sets of junctions, fixed junctions σ that usually give shape to the phantom network, and free junctions τ. The sum in the above partition function of the phantom network can be decomposed as [45]  In this equation   is the quadratic matrix composed of rows and columns of matrix  for the fixed junctions.   is the corresponding matrix for the free junctions, and   is the rectangular matrix composed of the rows from the set {τ} and columns from the set {σ}. 4 Eq. S8 can be further simplified, by separating the free junctions τ and the fixed junctions σ as [45]  (S12)    R , within this framework [45], define the most probable positions for the free junctions.
The partition function of the phantom network (Eq. S7) thus can be written as (S13) This function is a multivariate normal distribution. The integration of this function over the free junctions produces a form that does not depend on free junctions at all [45]  (S14) In 1997 Haliloglu, Bahar and Erman [50] applied the above-described theory of phantom networks to folded proteins and thus introduced the Gaussian Network Model (GNM). They removed fixed junctions σ following the assumption that the protein folding is not guided by the external constraints. In their approach the contact matrix  was calculated with the cutoff distance of 7 Angstroms, i.e., the residues are in contact only if their C -C distance is less or equal than 7 Å [50,51,52]. They also used the approximation of M. Tirion [49] which replaces non-bonded interactions with Hookean springs, and defines γ* to be constant. In their approach the Kirchhoff contact matrix  is defined via Heaviside's step function [53] as The diagonal elements of the matrix  in this approximation represent local packing densities around the residues in the protein [50]. In the native state, in equilibrium, protein assumes stable conformation with minimum energy in respect to all residue fluctuations (the protein is a canonical ensemble) [52]. The vibrational contribution to the Helmholtz free energy is [46,47,52]       The partition function ZN is the vibrational partition function given by , and γ* is γ/2kBT. The last equality in Eq. S16, originally derived by Flory [45], comes from the integration of the single parameter multivariate Gaussian function in the configurational integral. Therefore, the internal Hamiltonian of the protein, is expressed via the contact matrix Γ [45].
Within the GNM framework, ΔR are fluctuations of Cα atoms around their most probable positions [45,52].
The average Hamiltonian can be expressed in terms of the matrices U and Λ of the eigenvectors ui and eigenvalues λi of the matrix Γ as [52]   . The correlation of equilibrium fluctuations of two α carbons i and j, can be expressed as [50]    ij B j i (S18)  [52]. Eigenvectors U in this framework can be interpreted as fluctuation modes of Cα atoms and eigenvalues Λ as their corresponding mode intensities. Slow, large amplitude modes, with small i, correspond to polymer's (protein's) global motions, while fast, small amplitude modes, with large i, correspond to polymer's (protein's) localized motions (hot residues) [52,78,79]. Therefore, residues having high amplitude fast mode fluctuations are stableunmovable. My aim is to decipher the role of those kinetically hot residues. and k1 being variable. The above equation is very similar to the singular value decomposition method [69] used in the linear least squares optimization method.
The correspondence of GNM to real world experimental values was confirmed by showing that the vibrational spectrums obtained by GNM strongly correlates to crystallographic B factors [51,52,53,54] and NMR data [55], which means that equilibrium fluctuations are properties of static crystal.

Supplementary material figures
Green lines are first layer residues. Cyan dots are predictions. For the three longest chains from that group, 1BVN chain P, 2SNI chain E, 1UDI chain E, the percent of true positives is over 50%, and percent of false positives is less than 50 % (the chain E of 1UDI, has a highest difference between true and false positives which is an indication of a high correlation between the kinetically hot residues and contact scaffolds for that chain). Only the shortest example, 1CXZ chain A, has both true and false positives over 50 %.
24 Figure S15. Prediction output based on the approach that uses an adjustable number of fastest modes per chain and sequential influence of hot residues, for low sequence-length ratio dimer chains (length ratio less than two, chain length greater than 80 residues). The true positives mean true is 50.28 %, and the false positives mean is 49.23 %. There is 34.33 % of good predictions and 26.87 % of very bad predictions.
25 Figure S16. Prediction histogram over the sequence lengths for the prediction approach based on the adjustable number of fast modes, for the 1D influence of hot residues, for chains in dimers with low sequence length ratio (Length ratio <= 2, length > 80 residues). There is 34.33 % of good predictions and 26.87 % of very bad predictions.
26 Figure S17. Prediction histogram over the sequence lengths for the prediction approach based on the adjustable number of fast modes and variable 3D influence per hot residue, for chains in dimers with high sequence length ratio (Length ratio > 2, length > 80 residues). The true positives mean is 53.54 %, and the false positives mean is 42.05 %. There is 51.46 % of good predictions and 9.71 % of very bad predictions.
27 Figure S18. Examples of the prediction based on the adjustable number of fast modes and the sequential influence of hot residues. The four different chains are depicted (1BVN chain P, 2SNI chain E, 1UDI chain E and 1CXZ chain A). Red lines depict weighted sums. Blue lines designate contacts residues.
Green lines are first layer residues. Cyan dots are predictions. For the three longest chains from that group, 1BVN chain P, 2SNI chain E, 1UDI chain E, the percent of true positives is over 60%, and percent of false positives is about 50 % or less (the chain E of 1UDI, has a highest difference between true and false positives which is an indication of a high correlation between the kinetically hot residues and contact scaffolds for that chain). Only the shortest example, 1CXZ chain A, has both true and false positives over 50 %.
28 Figure S19. Prediction output for the prediction approach based on the adjustable number of fastest modes per chain and the variable 3D influence per hot residue, for chains in dimers with low sequence length ratios (Length ratio less than 2, chain length > 80 residues). The true positives mean is 51.72 %, and false positives mean is 48.96 %. There is 39.55 % of good predictions and 26.12 % of very bad predictions.
29 Figure S20. Prediction histogram over the sequence lengths for the prediction approach based on the adjustable number of fast modes and the variable 3D influence per hot residue, for chains in dimers with low sequence length ratio (Length ratio < 2, length > 80 residues). The true positives mean is 51.37 %, and false positives mean is 49.60 %. There is 37.50 % of good predictions and 27.34 % of very bad predictions.
30 Figure S21. Prediction histogram over the sequence lengths for the prediction approach based on the adjustable number of fast modes and combined 1D & fixed 3D influence per hot residue for chains in dimers with high sequence length ratio (length ratio higher than 2, length > 80 residues). The influence is first spread linearly, upstream and downstream along the sequence, and then the it is spread to residue's spatial neighbors, the ones closer than 6 or 8 Å). True positives mean is 56.77 %, and the false positives mean is 43.21 %. There is 63.11 % of good predictions and 11.65 % of very bad predictions. 31 Figure S22. Examples of the prediction based on the adjustable number of fast modes and combined 1D & 3D influence per hot residue. The four different chains are depicted (1BVN chain P, 2SNI chain E, 1UDI chain E and 1CXZ chain A). Red lines depict weighted sums. Blue lines designate contacts residues. Green lines are first layer residues. Cyan dots are predictions. For the three longest chains from that group, 1BVN chain P, 2SNI chain E, 1UDI chain E, the percent of true positives is over 60%, and percent of false positives is about 50 % or less (the chain E of 2SNI, has a highest difference between true and false positives which is an indication of a high correlation between the kinetically hot residues and contact scaffolds for that chain). Only the shortest example, 1CXZ chain A, has both true and false positives over 50 %.
32 Figure S23. Prediction output for the prediction approach based on the adjustable number of fastest modes per chain and combined 1D & 3D influences of hot residues, for chains in dimers with low sequence length ratio (Length ratio < 2, length > 80 residues). The true positives mean is 51.55 %, and the false positives mean is 49.71 %. There is 38.06 % of good predictions and 29.1 % of very bad predictions.
33 Figure S24. Prediction histogram over the sequence lengths for the prediction approach based on the adjustable number of fast modes and combined 1D & fixed 3D influence per hot residue for chains in dimers with low sequence length ratio (length ratio < 2, length > 80 residues). The influence is first spread linearly, upstream and downstream along the sequence, and then the it is spread to residue's spatial neighbors, the ones closer than 6 or 8 Å). The true positives mean is 51.55 %, and the false positives mean is 49.71 %. There is 38.06 % of good predictions and 29.1 % of very bad predictions. Figure S25. Linear and quadratic relationships of the number of modes for successfully predicted chains from heterodimers with high sequence length ratios.
35 Figure S26. Comparison of the abilities of the adjustable 3D GNM approach, the statistical potential and their combination to distinguish near native decoys from the false decoys, expressed as percent of correctly predicted near native structures among the first n structures, where n is the number of near native structures. The taller the bar, the better is the prediction. The upper plot correspond to longer chains, and the lower plot to their shorter partners. The plots on the left correspond to Vakser decoy sets and the plots on the right to Sternberg decoy sets. The upper plots correspond to longer chains, and the lower plot to their shorter partners.
36 Figure S27. Comparison of the abilities of the adjustable 3D GNM approach, the statistical potential and their combination to distinguish near native decoys from the false decoys. The status of the best near native structure for each decoys set is depicted as a vertical bar. The shorter the bar, the better the prediction.