A Comparative Evaluation of the Structural and Dynamic Properties of Insect Odorant Binding Proteins

Insects devote a major part of their metabolic resources to the production of odorant binding proteins (OBPs). Although initially, these proteins were implicated in the solubilisation, binding and transport of semiochemicals to olfactory receptors, it is now recognised that they may play diverse, as yet uncharacterised, roles in insect physiology. The structures of these OBPs, the majority of which are known as “classical” OBPs, have shed some light on their potential functional roles. However, the dynamic properties of these proteins have received little attention despite their functional importance. Structural dynamics are encoded in the native protein fold and enable the adaptation of proteins to substrate binding. This paper provides a comparative review of the structural and dynamic properties of OBPs, making use of sequence/structure analysis, statistical and theoretical physics-based methods. It provides a new layer of information and additional methodological tools useful in unravelling the relationship between structure, dynamics and function of insect OBPs. The dynamic properties of OBPs, studied by means of elastic network models, reflect the similarities/dissimilarities observed in their respective structures and provides insights regarding protein motions that may have important implications for ligand recognition and binding. Furthermore, it was shown that the OBPs studied in this paper share conserved structural ‘core’ that may be of evolutionary and functional importance.


Introduction
Insects are the dominant animal species on earth. Their remarkable evolutionary success is largely attributed to their highly intricate olfactory systems. Early olfactory processing takes place at the peripheral nervous system, and involves solubilisation, and transport of semiochemicals (pheromones, and odorant or sapid molecules) to olfactory receptors (ORs) [1,2]. ORs are located on the dendrites of sensory neurons and bathed in an aqueous medium known as the sensillar lymph. They mediate the detection of chemical cues. Further processing of olfactory signals occurs in the central nervous system resulting in odour perception and thence adapted behaviour. Other protein components of the peripheral olfactory system include odorant binding proteins (OBPs), odorant degrading enzymes (ODEs) involved in the inactivation of excess odorants following OR activation, sensory neuron membrane proteins (SNMPs) functioning in tandem with ORs [3], and ionotropic receptors (IRs), which are thought to be an evolutionarily different mechanism of chemoreception [4]. OBPs, which are the subject of this review, are implicated in the solubilisation, binding and transport of semiochemicals to ORs.
The first insect OBP was isolated from the giant moth Antheraea polyphemus in the early 1980s [5]. Since then, the number of insect proteins classified as OBPs has grown enormously. The PFAM database [6] lists over 3000 sequences from 104 species classified as pheromone-binding and general odorant-binding proteins (Pfam ID: PF01395). The number of OBPs per insect species varies widely. For example, 69 OBPs have been associated with A. gambiae, whereas 52 have been associated with D. melanogaster, 21 with A. mellifera and 13 with B. mori [7,8].
Insects dedicate an enormous part of their metabolic resources to synthesise OBPs in their antennae, where their concentration may be as high as 10 mM [9]. RNAseq analysis has shown that the levels of RNA for the most abundantly produced OBPs in the olfactory segment of the Drosophila antenna are three orders of magnitude higher than those of ORs [7]. For many years, the prevailing hypothesis regarding the role of OBPs in insect olfaction associated them with the solubilisation, binding and transport of semiochemicals to ORs. Strong support for this hypothesis came from studies involving pheromones (PBPs) and general odorant binding proteins (GOBPs) in Lepidopteran species demonstrating high specificity of these proteins for volatile compounds [10,11]. Added support for this hypothesis came from studies suggesting a possible release mechanism of bound odorants brought about by conformational changes in OBPs induced by the negatively charged environment at the dendritic membranes.
However, the above hypothesis has been challenged by research showing that OBPs: 1. display broad specificity of binding, and ability to bind more than one ligand simultaneously [12,13].

2.
are expressed not only in the insect olfactory organs but also in other body parts [14]. For example, it was shown in proteomic studies that of the 66 genes encoding OBPs in Anopheles gambiae only 18 are detectable in olfactory tissues [15]. These results raise questions regarding the actual number of OBPs that may have a purely olfactory function.
In the light of the above, it is now thought that OBPs may play diverse, as yet uncharacterized, roles in insect physiology. Thus, in addition to the odorant transport hypothesis, OBPs may be involved in the protection of odorants from degradative enzymes, odorant clearance from the sensillum lymph after OR activation, or buffering against sudden changes in odour levels after the termination of an odorant pulse [7]. For OBPs expressed in non-olfactory tissues, it has been proposed that they may play a role in the modulation of mating behaviour, haematopoiesis, humidity detection and attraction or aversion of gustatory cues [8].
The rapidly increasing numbers of OBP structures, solved by X-ray crystallography and NMR, serve as a solid foundation to draw functional inferences. The number of structures deposited at the Protein Data Bank (PDB) has grown from 4 in 2004 [16] to 25 derived from 17 species belonging to 7 insect orders. These OBPs, known as "classical" OBPs, are globular proteins consisting of alpha-helices of different topologies bearing a characteristic signature of three conserved disulfide bonds. Four of the cysteines are equispaced with three amino acids between the second and the third cysteines (C 2 , C 3 ), and eight amino acids between the fifth and the sixth cysteines (C 5 , C 6 ). The disulfide bonds form the pattern C 1 -C 3 , C 2 -C 5 , C 4 -C 6 . "Classical" OBPs vary in the size and topology of their respective helices, and belong in the same PFAM family (PF01395) and have same fold in SCOPe (Fold a.39: EF Hand-like) [17]. In addition to the "classical" OBPs, other subgroups differ in the number of disulfide bonds: the so-called C-plus OBPs with >3 disulphide bonds, and the so-called C-minus OBPs with <3 disulfide bonds. OBP structures have been described in detail by numerous papers as well as in several reviews [2,[18][19][20]. Yet, X-ray structural models fail to capture structural fluctuations due to natural thermal motions, which are induced by equilibrium fluctuations and nonequilibrium effects [21]. Equilibrium fluctuations determine protein action, and as such, the dynamics of protein motion are critical in facilitating protein interactions, or even driving them. The 3D architecture of a protein, or fold, characterise its conformational space under physiological conditions, and this, in turn, circumscribes the spectrum of motions that are intrinsically accessible to the interacting partners such as odorants or proteins. By implication, evolutionary pressure may have selected specific motions that enable proteins to perform their biological functions. This process is encapsulated in the evolutionary paradigm "structure-encodes-dynamics-encodes-function" [22].
The intrinsic motions of biomolecules are usually described in terms of the spectrum of the available vibrational frequencies or normal modes (see Appendix D). The low-frequency motions are highly collective, meaning that they involve cooperative movements of large portions of the structure, such as domain motions or structural rearrangements. These motions, frequently referred to as global or intrinsic motions, provide a useful measure of the relative rigidity in proteins. A small set of the lowest frequency vibrational modes suffices to quantify the large-scale protein motions [23]. The higher frequency modes are less collective and correspond to local changes in conformation and dynamics restricted to a few atoms. They are localised in the interior or surface of the protein and are strongly implicated in signal transmission or other internal processes [24]. Normal mode analysis (NMA) provides a powerful tool for the study of protein structure and dynamics. However, NMA is very expensive computationally and for this reason simplified coarse-grained models, such as the elastic network models (ENMs) have been developed for efficient normal mode computations [25] (see SI NMA). ENMs have been used extensively to study protein dynamics [23,26].
The aim of this review is to provide a comparative analysis of the similarities and dissimilarities of OBPs in terms of their structural and dynamic properties. The focus is not on individual OBPs but rather on this group of proteins as a whole.

Materials and Methods
The OBPs used in this study comprise exclusively the "classical" OBPs (Table 1). In order to ensure the quality of structural comparisons, classical OBPs with truncated structure (PDB IDs: 3l4a) or of low resolution (PDB IDs: 3l4a, 6vq5) or solved by NMR (PDB IDs: 2jpo, 1tuj, 6um9) were excluded. For the same reason, C-plus and C-minus OBPs were likewise excluded. The 20 OBPs in the dataset (Table 1) were compared in terms of sequence and structure similarity. They were then subjected to principal component analysis (PC) to identify patterns of structural variance, and finally to normal mode (NM) analysis using elastic network models (ENM) to determine their dynamic properties [27]. The Bio3d v2.4-2 suite of programs was used for sequence, structural, principal component and normal mode analyses [28,29]. The nma function of Bio3d was used to for normal mode calculations employing an elastic network model (ENM) using the 'calpha' force field. This force field employs a spring force constant differentiating nearest-neighbour pairs of atoms along the backbone from pairs in spatial proximity. The resulting mode vectors were scaled by the thermal fluctuation amplitudes.
UCSF Chimera (Structure Measurements panel, Axes/Planes/Centroids section) was used for structure measurements described in Section 3.2 [30]. The distance and angles measurements were made based on sets of selected helical residues. The protein structure visualisation was performed with PyMOL v2.5.2 (Schrodinger pymol-open-source).
The EvoCouplings method was used for to find the residues in each OBP with the highest number of coupled interactions to their neighbours. The method is based on the direct information (DI) to compute a set of direct residue couplings that best explains all pair correlations observed in the multiple sequence alignment [31]. The EvoCouplings implementation of the methods is achieved through an open-source Python package for coevolutionary analysis [32,33].
As a prerequisite for the structural comparison of OBPs, we constructed a multiple structural alignment (MSA). Structural alignments have been found to perform better for the deduction of the structural and dynamic relatedness of the OBPs than the corresponding sequence-based ones, particularly in cases of low sequence identity [34]. Furthermore, the choice of the structural alignment method is critical because different alignment algorithms make different assumptions regarding the length of the consensus sequence, and the pairwise root-mean-square deviations (RMSD) within the set of aligned structures [35]. A local installation of Mustang v.3.2.3, interoperable with Bio3d, was used to construct the MSA [36]. The length of the sequences in the dataset ranges between 115 to 142 amino acids (average length 124 amino acids). The resulting MSA consists of 198 columns, 93 of which have 100% occupancy, i.e., columns containing no gaps (Appendix A Figure A1). An MSA in which columns with gaps have been removed is shown in Appendix A Figure A2.

Sequence Analysis
A percent identity value, in the form of a numeric score, was determined for each pair of aligned sequences. The identity score, which was normalised to values between 0 and 1, measures the number of identical residues in relation to the length of the alignment ( Figure 1). The summary statistics of the calculated pairwise sequence identities are given in Table 2. Typically, the average sequence identity over all pairs in the dataset is around~24%. For most OBPs, the pairwise sequence identities are low, placing them in so-called "twilight-zone" where sequence identity falls between 10 and 30% ( Figure 1). The "twilightzone" is an operational term setting the boundaries of confidence for detecting evolutionary relatedness between proteins [37,38]. Despite the low sequence identities, the OBP structures have retained the same fold albeit with differences in the geometries of the packed secondary structure elements.

Structure Analysis
To compare the similarity of the OBP structures, we used the structure coordinates of the Cα-atoms and quantified it using the optimal RMSD over the sites that aligned in the MSA. RMSD is a similarity measure varying between 0 and ∞ [39]. The calculation was made from the 93 equivalent positions in the MSA. The summary statistics of the calculated differences in RMSD are given in Table 3, and the heatmap and related dendrogram derived from the pair-wise similarity matrix of RMSD values ( Figure 2) displays the structural relatedness of the OBPs.  The dendrogram shows two main clades that split at 3 Å. One of these clades consists of 11 OBPs from 4 different insect orders, namely Diptera [AaegOBP1(3k1e), Ag- , and Neuroptera [CpalOBP4 (6jpm)]. Three mosquito proteins (Ag-amOBP1, AaegOBP1, CquiOBP1) are homologous (sequence identity > 80%), displaying the greatest structural similarity (RMSD 0.3-0.36 Å). The range of RMSD differences amongst the rest of the OBPs in this clade is between 1.1 Å and 3.0 Å.
The second clade comprises OBPs that exhibit the greatest structural differences. They form the leaves of 4 distinct branches. One of the branches includes the aphid OBPs [MvicOBP3 (4z39), NribOBP3 (4z45)]. A second branch, the moth OBPs [BmorPBP1 (1dqe), BmorGOBP2 (2wc5), AtraPBP1 (4inw)]. The third branch, LmadPBP1 (1org) and DmelOBP28a (6qq4), and the fourth branch, AaegOBP22 (6oii), PregOBP56a (5dic). All of these proteins are expressed in olfactory tissues with the exception of the last two. PregOBP56a is specifically expressed in a cluster of cells on the oral disk of the insect. PregOBP56a solubilises fatty acids from meat meal during feeding and delivers them to the midgut where it may help the process of reproduction [40]. AaegOBP22 is expressed in multiple tissues including the antenna, and the male reproductive glands. Expressed in different antennal tissues, AaegOBP22 may modulate chemosensory responses, but as a secreted protein in the salivary gland is likely to be injected into the host during a blood meal [41]. The structural alignment of OBPs in the dataset involves a procedure in which (i) a score was calculated from pairwise residue-residue correspondences, followed by pairwise structural alignments; (ii) recalculation of scores in the context of multiple structures, and (iii) merging the series of pairwise alignments along a guide tree [36]. Visualisation of the superimposed structures shows helical regions where the RMSD difference of the backbone carbon atoms is the least (Figure 3, Panel a). To identify residues with the least RMSD difference, the aligned structures were subjected to an iterated superposition procedure, each round of which identified and eliminated residues displaying the largest positional differences. The subset of retained residues implies an evolutionarily conserved structural "core" consisting of 17 residues. In this case, 14 of these including OBP conserved cysteines C5 and C6, are found in the carboxy-terminal helix; the remaining 3 are found in the helix which contains C 2 . The pair-wise RMSD difference of the "core" residues is <1 Å. The angle between the two helices is in the range of ∼78-88 • and the distance between them is in the range of 6.8-8.1 Å (Figure 3, Panel b). The aphid OBPs (4z39, 4z45), and PregOBP56a (5dic) show the greatest deviations from this geometry (Appendix B, Table A1).  Panel a). Geometry of the residues comprising the conserved structural core (residues of the conserved core are depicted in the ribbon coloured in magenta). The two helices of the "conserved" geometry are shown in blue. AgamOBP1 (3n7h) was used in the example (Panel b).

Principal Component Analysis (PCA)
PCA was used to gain insights that best explain the underlying patterns of variance in the ensemble of aligned structures. When performed on an ensemble of interpolated X-ray structures, PCA captures concerted atomic displacements highlighting major structural differences between the structures. The basic premise of PCA is that it is possible to eliminate highly correlated variables in the data set without losing essential information (Appendix E).
The PCA calculations were based on equivalent residues in the MSA. Approximately 82% of the total mean-square displacement (variance) of Cα atom positional fluctuations is captured by six principal components, namely the axes of maximal variance of the distribution of structures (Figure 4). The first three principal components account for 71% of the structural variance. The eigenvalues capture the percentage of the total variance (i.e., the total mean square displacement) of atom positional fluctuations. The influence of the individual OBPs on the structural variation and correlations within the ensemble of structures can be captured by means of score plots (see Appendix E). The score plot of the first two PCs, accounting to~60% of the structural variance, maps the interrelation of the OBP structures (Figure 5a). In any given cluster depicted in the plot, the deviation of the equivalent Cα atom positions is the least. Comparing the PCA clustering to the corresponding one obtained from the RMSD similarity matrix (Figure 5b), it can be safely concluded that the score plot reflects the relatedness of OBPs as observed from the pairwise structural comparisons. The contribution of the individual residues, also known as loadings, to the structural variance covered by the first three PCs is presented in Figure 6. The figure shows that the least of the structural variance occurs in the 3rd and 6th helical regions where the residues of the structural "core" are located.

Normal Mode Analysis (NMA)
The intrinsic dynamics and flexibility of the OBPs were determined by the use of normal mode analysis based on the Elastic Network Models described in Section 2. The structural flexibility of proteins can be decomposed to high-frequency localised fluctuations of the individual amino acids, and to low-frequency motions of large rigid domains.

Covariance Similarity Analysis
To compare the similarity of the low-frequency motions of the OBPs, the 10 lowest modes for all pairs of structures were calculated. Two points are worth noting. First, there is no single well-accepted measure to describe dynamics similarity, and consequently linking the dynamics space to the structure space remains a challenge [39]. Second, dynamical similarity scores are extremely sensitive to MSA, particularly in cases where the sequence identities are very low. The root-mean square inner product (RMSIP was used as a similarity metric [39]. RMSIP is sensitive to the number of modes used. Finer separations are discernible with smaller number of modes. When the number of modes included in the calculation approaches the full number, the dynamic distance order approaches the RMSD order. The RMSIP distance matrix was used to determine the relatedness of OBPs in terms of low-frequency modes. The dynamics-based hierarchically-clustered heatmap or clustermap shows all pairwise comparisons for the aligned proteins (Figure 7). Comparison of a protein to itself along the diagonal have an RMSIP value equal to 1. Lower values indicate overlapping low-energy fluctuations of the aligned backbone carbon atoms. Fluctuations covering completely non-overlapping space have an RMSIP value equal to 0. The similarity clustering shown plot is in agreement with the structural classification of the OBPs as obtained from the pairwise RMSD measurements of the OBPs (cf. Figure 2).

Comparative Fluctuation and Deformation Analysis
The amplitude and directionality of residue fluctuations provide a measure of the local flexibility in any given protein. Fluctuations are defined as the sum of each atom's displacement along each mode, weighted by the reciprocal of the eigenvalues [42]. However, residue fluctuations are less informative regarding the amount of local flexibility of a given protein structure. Local deformation involves energy exchange arising from short-range atomic interactions. Atomic motions relative to neighbouring atoms can be calculated from the atomic energy contributions to deformation as a function of position. Thus, lower deformation energies may correspond to relatively rigid regions [24]. It is noted that deformation energies have no quantitative physical meaning, and therefore no units [43].
The fluctuation and deformation energy profiles of the conserved residues in the MSA of OBPs, averaged over all modes and normalised, are given in Figure 8. The plot in Figure 8 shows that the largest amplitude fluctuations occur almost exclusively in non-helical regions. Residues with the highest deformation energies lie in the helical regions that contain C 2 , C 3 , C 5 and C 6 and are coincident with the conserved structural core. The high deformation energy of these residues is indicative of interactions with spatially proximal residues [24,42]. Favourable interactions between spatially proximal residues are a major constraint in the evolutionary variation of proteins belonging to the same fold. Evolutionary coupling analysis of the individual OBPs showed C 1, C 2, C 5 , and residues adjacent to them in the sequence, form 'dense' coupling interactions with residues that are in close 3D proximity (data available in Supplementary Materials, evolutionary_couplings.zip).
Moving beyond the structural and dynamic classification of OBPs, single structure analysis using GNM can provide additional inferences. Examples are given below.

Single Analysis on Selected Examples
While the covariance matrix calculated from the low-frequency modes (see Figure 7) allows the classification of OBPs in terms of similarity of motions, it provides no information regarding their relative flexibilities. Protein motions entail the movement of relatively rigid structural domains, especially regions of secondary structure, with intervening regions of local flexibility [44]. During motion, these "dynamic" domains remain internally rigid in all conformations but move relative to each other. In OBPs such motions have been observed in studies of the A. mellifera antennal-specific protein, which at pH 7.0 is a domainswapped dimer [45]. The relative flexibility of the OBPs in the dataset was determined by means of the Bio3D implementation of the Geometrically Stable Substructures (GeoStaS) algorithm [46]. Although the evaluation of residue cross-correlation coefficients is the standard method to detect atomic motions, GeoStaS has the advantage of detecting both rotations and translations, in contrast with residue cross-correlation analysis which detects translations only. The GeoStaS calculations based on the 5 lowest-frequency non-trivial modes showed considerable variation amongst the OBPs both in terms of the sizes of the "dynamic" domains, as well as the correlation of domain movements. This variation is exemplified in Figure 9 and additional material can be found in Supplementary Materials, GeoStaS and cross-correlation analysis.zip; Videos S1 and S2: global motions. The colour bars at the x and y axes depict the residues assigned to the different domains that were found. The colour scale is the same in both matrices: red indicates strong correlations, and white strong anticorrelations.
Two additional examples are presented below. These involve AaegOBP22 (6oii) and DmelOBP28a (6qq4). The choice of these proteins was based on the availability of crystal structures both in free and liganded forms. It is noted that, in addition to these structures, few other structures in the dataset have been crystallised in ligand-free form, namely, BmorPBP1, BmorGOBP2, AgamOBP20 and AmelASP1. AaegOBP22 has been crystalised in open liganded form (PDB ID: 6oii) and closed ligand-free form (PDB ID: 6og0). Ligand binding has been associated with a conformational change at the carboxy-terminal end leading to the formation of an enlarged binding cavity [41]. The ligand-free structure is truncated missing 4 residues at the carboxy-terminus. The RMSD difference between the two structures is 1.35 Å. Subjecting the ligand-free structure to cross-correlation analysis shows significant anticorrelated motions involving residues of the 1st, 2nd and 6th helices ( Figure 10, panel a). These can be visualised in the corresponding three-dimensional model of the protein (Figure 10, panel b). These motions may result in conformational changes that could accommodate ligand binding and the concomitant widening of the opening to the internal cavity of the protein. In the case of DmelOBP28a, the protein in liganded and free (apo) forms revealed a large conformational reorganization induced by ligand binding. This conformational change involved the 1st, 4th and 5th helices [47]. The residue-residue cross-correlation matrix shown in Figure 11, panel a is consistent with the structural observations as it identifies residues helices 1, 4 and 5 involved in anti-correlated movements The normalised fluctuations and deformation energies of Cα atoms of AaegOBP22 (PDB ID: 6oii) and DmelOBP28a (PDB ID: 6qq4, chain A), are included in Figure 7. Comparison of the fluctuation and deformation energy profiles against the corresponding conformers in ligand-free form, showed an increase in the magnitude of Cα atom mobilities in some of the interhelical regions (Appendix C, Figures 1 and 2). Analysis of a greater number of models would be required to draw firm conclusions regarding the difference in fluctuations between ligand-free and ligand-bound forms.

Discussion
From the preceding analysis it is obvious that, in most cases, the sequence identities between pairs of OBPs fall below the twilight zone and, therefore, it is not possible to draw conclusions regarding the possibility of divergent evolutionary relatedness of these proteins [37]. In terms of structure, the pairwise RMSD calculations and PC analysis result in a clustering of OBPs into subgroups that reflects, in part, phylogenetic classifications. The foregoing analysis identifies a conserved structural 'core' that represents 15-20% of the total sequence length of the OBPs. The 'core' residues, which include C 2 , C 3 , C 5 and C 6 , are located in the helical regions with the least structural variance (Figures 5 and 6). In particular 14 of the 17 'core' residues are found in the carboxy-terminal helix. The high deformation energies of the residues of these helical regions (Figure 8) are indicative of interactions with spatially proximal residues [24]. Further evidence of such interactions was derived from evolutionary coupling analysis. Evolutionary coupling interactions have a strong bearing on the overall architecture of the fold [31,48]. Generally, the similarity of 3D structure amongst the members of a particular fold is the result the constrained evolutionary boundaries ensuring conservation of function. This, in turn, implies the co-evolution of residues in spatial proximity in order to maintain energetically favourable interactions. Consequently, the high density of coupling interactions is indicative of 'hotspots' in the evolution of proteins and, as such, may be of functional importance [31]. From the preceding, it is not unreasonable to suggest that the residues adjacent to C 2 , C 3 , C 5 and C 6 may play an important role in ligand binding or recognition.
The covariance similarity analysis of the low-frequency motions of the OBPs (Figure 7) shows that their relatedness in terms of intrinsic dynamics mirrors that observed from the RMSD calculations and PC analysis (Figures 2 and 5). Further insights were obtained from the analysis of the motions of the individual proteins. The analysis showed that in some OBPs the motions are highly collective and thus may be significant in proteinligand interactions (see examples in Section 3.4.3) [44,49]. Structural flexibility ensures the predisposition of proteins to attain alternative conformations. The pool of accessible conformations allows proteins to achieve their biological function in the presence of ligand binding, which, in turn, depends on changes in external conditions shifting the equilibrium between conformational states. Thus, although the examples described in Section 3.4.3 point to the importance of the intrinsic motions in determining the functional dynamics of the OBPs, it would be unwise to make broad generalisations in the absence of knowledge of the physiological milieu in which these proteins function.

Conclusions
In this review, the "classical" insect OBPs were compared in terms of their structural features, and intrinsic dynamics. It was found that the structural differentiation of OBPs into subgroups is reflected in their low-frequency motions. The small dataset of available crystal structures was a limiting factor in the analysis. Without a sufficiently large dataset, it is not possible to make an in-depth evaluation of the relation between the conservation of structural dynamics to the evolution of sequence and structure in insect OBPs. Nevertheless, the study of OBPs, individually or collectively, by means of NMA using EN models can provide useful insights into the interplay between structure and dynamics at little computational cost. NMA using EN models has been applied to other heterogenous protein families of which the globin family is a case in point. The globin fold is an all-α fold made up of 6-7 helices that define the architecture of a well-defined haem pocket [50]. Although the divergent sequence evolution of the family induced changes in the relative disposition of helices, these had small effect on their packing. This is because the position of the haem group is essential for protein function [51]. In the case of OBPs, it is tempting to speculate that natural selection operated in the opposite direction by allowing promiscuous changes in architecture of the fold in order to permit fast adaptation to the diverse chemical ecologies of insect.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/biom12020282/s1. Archived files (1) evolutionary_couplings.zip; (2) cross-correlation analysis.zip; Video S1.zip; Video S2.zip.   constantly. Consequently, the shape of binding site(s) may also change giving rise to more than one stable structures. The conformational space of a naturally occurring polypeptide is represented by a multidimensional potential energy surface. Briefly, the potential energy is a function of the positions R, where represents the coordinates of N atoms, each with coordinates (x,y,z). Figure A5. Representation of potential energy surface. A deep minimum in the potential energy surface indicates a stable structure (Panel a). One-dimensional cross-section through the multidimensional energy landscape of a protein. A conformational state is defined as a minimum in the energy surface. The difference in the Gibbs free energy between two states determines the rate of interconversion. Any change in the system, such as ligand binding, external conditions or mutation, will shift the equilibrium between states (Panel b).
The dynamics of proteins can be modelled means of molecular dynamics (MD). The application of MD assumes (i) a starting structure (existing X-ray or NMR model), (ii) assignment of a set of vectors for the coordinates of each atom in the structure, (iii) assignment of initial forces and velocities, and (iv) a timestep. Newton's equations are used to acquire a new position at each time interval, after which a new force is calculated for the determination of the next position, etc. The thermal energy allows the protein model to cross energy barriers thus sampling its entire conformational space. Given enough time, MD can sample the entire conformational space ( Figure A5). MD provides detailed and realistic simulation of atomic movements but requires significant computational effort, particularly for nano-scale motions where demanding statistical analyses are required over a number of simulation trials.
In Normal Modes Analysis (NMA), protein motion is restricted to a single minimum by applying a harmonic approximation. At equilibrium, the thermal movement becomes harmonic and the structure is trapped in an energy minimum. When the harmonic approximation is applied at an energy minimum point, the potential surface is represented as a parabolic multidimensional surface ( Figure A6). Based on this approximation, NMA is formulated and performed to study the dynamic properties of proteins. For a polyatomic system, such as protein, the Newton equations of motions can be expressed as in terms of the time evolution of a set of so-called massweighted Cartesian coordinates comprising of potential energy and kinetic energy terms. Assuming the harmonic approximation condition, normal mode analysis yields analytical solutions to the equations of motion. These solutions are given in terms of normal mode vectors (aka eigenvectors) that describe the direction and the displacement of each particle in the system with respect to the others, and eigenvalues that are the squares of the so-called normal mode vibrational frequencies. Assuming that the system undergoes some form of sinusoidal time evolution, and dividing the vibrational energy equally across all modes, the average amplitude of motion of any given mode scales with the inverse of the oscillation frequency. Thus, high frequency modes result in energetically more expensive localised displacements involving only a few directly bonded atoms. At low frequencies, the modes represent "collective" movements, that is they involve cooperative movements of large portions of the structure, such as domain motions or structural rearrangements.
NMA is computationally very expensive. For this reason, several coarse-grain methods have emerged in the form of elastic network models (ENMs), in which proteins are represented as assemblies of point-like particles joined by linear springs ( Figure A7). In the simplest form, the backbone carbon atoms constitute the particles and the behaviour of the springs is linear-elastic, that is it obeys Hooke's law. The connectivity of the particles is determined by calculating the distance between each pair of particles in the reference structure and using only particles falling within a cut-off distance, usually around 10 Å. NMA is then used to determine fluctuations or structural transitions in ENMs. The vibrational modes are orthogonal or normal and thus independent of each other. This means that the general motion of the system is described by the superposition of all modes.
The advantages and disadvantages of these models, as well as their application in structural biology have been reviewed extensively in references [34,52,53]. Figure A7. Representation of an Elastic Network Model of AgamOBP1 (a). Simplified diagram of the elastic network model (ENM). Each spring (i,j) obeys Hooke's law. Ri(0), Rj(0) are the cartesian coordinates of particles i, j at equilibrium. Ri, Rj are the particle coordinates following perturbation of the system. The elastic force Fi on particle i is the sum of forces Fij exerted by all springs connected to the particle (not shown) and is proportional to its perturbation from the equilibrium length (b).

Principal Components Analysis
PCA is an unsupervised statistical technique. PCA forms the basis of multivariate data analysis [54]. The basic premise of the technique is that if some variable in the dataset are highly correlated, it is possible to replace them by a smaller number of variables referred to as Principal Components. PCA takes data points from an original variable space (the dataset) and maps these onto a PC space. This allows a drastic dimensionality reduction of the original variable space with little or no loss of information.
PCA performed on protein structure datasets takes as an input the set of coordinates in each structure, usually the coordinates of C a atoms, and computes the average position of each point in the structure as <r i >. The covariance for all pairs of points i, j is computed according to: ci, j = (r i − r i ) r j − r j The brackets denote averages over the entire dataset of structures. Often a correlation matrix is used instead of the covariance one. This ensures normalisation of the variable to prevent large atomic displacements from skewing the results. Assuming a data model of n members (e.g., OBPs) and each member consisting of k variables (e.g., coordinates of Cα atoms), the data model can be represented by the correlation matrix (C) in the l.h.s of the equation below. Each of the observed variables x nk can be decomposed to a complete set of orthogonal vectors (eigenvectors, t n ), each with a corresponding eigenvalue p n (variance) that characterises the displacement.
Notice that for a cartesian coordinate system, the C matrix of the equation above is three dimensional (3n × 3n) symmetrical matrix. Matrix decomposition leads to a dimensionality reduction with 3n eigenvectors. Component loading vectors are the result of multiplication of each eigenvector with the square root of the associated eigenvalue.
If each observation in the data model (row of the C matrix) is projected onto the direction of, say PC1, the score value of that observation is the distance from the origin up to the projection point. Considering a simplified example of four PCs, the number of resultant score plots would be six, one for each pairwise combination of PCs. Score plots are useful in that they allow separation of the data into clusters.