Three-Dimensional Interaction Homology: Deconstructing Residue–Residue and Residue–Lipid Interactions in Membrane Proteins

A method is described to deconstruct the network of hydropathic interactions within and between a protein’s sidechain and its environment into residue-based three-dimensional maps. These maps encode favorable and unfavorable hydrophobic and polar interactions, in terms of spatial positions for optimal interactions, relative interaction strength, as well as character. In addition, these maps are backbone angle-dependent. After map calculation and clustering, a finite number of unique residue sidechain interaction maps exist for each backbone conformation, with the number related to the residue’s size and interaction complexity. Structures for soluble proteins (~749,000 residues) and membrane proteins (~387,000 residues) were analyzed, with the latter group being subdivided into three subsets related to the residue’s position in the membrane protein: soluble domain, core-facing transmembrane domain, and lipid-facing transmembrane domain. This work suggests that maps representing residue types and their backbone conformation can be reassembled to optimize the medium-to-high resolution details of a protein structure. In particular, the information encoded in maps constructed from the lipid-facing transmembrane residues appears to paint a clear picture of the protein–lipid interactions that are difficult to obtain experimentally.


Introduction
Understanding and exploiting protein structure has been a major goal of a dedicated and large group of biologists for nearly three-quarters of a century.In 1951, Linus Pauling suggested both the α-helix and β-strand motifs [1] almost a decade before they were experimentally observed by John Kendrew [2,3] and Max Perutz [4,5] in the crystal structures of myoglobin and hemoglobin.These secondary structure motifs are certainly crucial to protein structure, but many other features, both short and long ranges, significantly contribute to the diversity of observed protein structures.In other words, there is far more to protein structure than hydrogen bonding.In particular, the close examination of residue-residue interactions, such as that by Juswinder Singh and Janet Thornton [6], reveals a compelling collection of interaction types.The interactions designated as being hydrophobic are perhaps most interesting: while, at first, they are "obvious", this characterization belies their complexity as they are actually emergent properties involving enthalpy, entropy, and solvation [7].It should be noted that Irving Klotz [8] highlighted the importance of hydrophobic phenomena in proteins before the first crystal structures were reported.
Nearly all of the solved crystal structures for the first fifty or so years of protein structural biology were for soluble proteins as they can be (relatively) easily crystallized with conventional methods.Membrane protein structural biology has a number of additional challenges, not least of which is that their structure (and function) is likely to be highly affected by the integrity of the lipid bilayer in which they are embedded.Underscoring this importance, Qin et al. [9] showed nearly two decades ago that the lipids surrounding a membrane-bound binding site have conserved "binding sites".A second revolution in structural biology was ignited as various therapeutically important G protein-coupled receptor (GPCR) and ion channel structures were solved by a number of brilliant strategies to circumvent Nature.Notably, many of these techniques involved the detergent-based extraction of the protein from its native membrane and reinsertion into a substitute simulated bilayer [10][11][12][13].The current excitement for Cryo-EM structural biology is based in large part on the promise of more native-like membrane structures.However, even Cryo-EM has similar issues as protein extraction is still a fraught process.Thus, native or even reasonably similar lipids performing the same structural roles are rarely present in crystals or cryo-EM particles, and misinterpretations of reported structures have been published [14][15][16][17].The emerging ability to examine single particles, potentially encapsulated in a native-like "bundle" of lipids, is exciting [18].A recent review by Levental and Lyman [19] describes the current state of understanding of protein-lipid interactions, while another two reviews from 2019 by Corradi et al. [20] and 2020 by Duncan, Song, and Sansom [21] focus on what has been learned from experimental and simulated structures.
A further exciting development is the maturation of computational de novo protein structure prediction tools, such as AlphaFold [22,23], Rosetta [24,25], I-TASSER [26], and ESMFold [27].These tools incorporate substantive information from protein sequence homology wrapped around sophisticated interaction scoring algorithms.Very impressive predictive structure models have been reported [28], although smaller-scale predictions, such as rotamer conformation, etc., are not yet handled well enough to reliably apply such predicted structures for drug discovery.Also significant is that these programs have been developed mostly based on soluble proteins, and when membrane proteins were included in the training, those structures did not contain lipid bilayers, etc., and can conceivably even be flawed due to the aforementioned issues with detergent extraction [14].Very recent computational methods have hybridized AlphaFold2 and other deep learning/modeling methods, such as Rosetta, I-TASSER, etc., with Cryo-EM density maps to build better protein structure models, e.g., DeepMainmast [29], DEMO-EM2 [30], and DeepTracer [31].None of these methods, however, have any facility for modeling membrane lipids.
Our contribution to modeling and understanding protein structure, which we have been developing over the past decade, has as its hypothesis that it is the residue character, especially their three-dimensional interaction networks, that is the determinant of protein structure.While self-evident, our hypothesis de-emphasizes sequence homology in favor of a residue-level hydropathic valence, which is the set of interactions each residue ideally would make: involving interaction type (both favorable-like hydrophobic and hydrogen bonds-and unfavorable-like repulsive Coulombic and desolvation), interaction strength, and their 3D spatial arrangement.We encoded these interaction sets in 3D contourable hydropathic interaction maps calculated using a feature of the HINT (Hydropathic INTeractions) modeling system [32,33].Crucially, after clustering, there are a limited number of unique maps and they are dependent on the residue type and backbone angles.Our development of this paradigm was evolutionary, and we reported our progress with a series of articles targeting single residue types or families, starting with tyrosine [34], followed by alanine [35]; phenylalanine, tyrosine, and tryptophan [36]; aspartic acid, glutamic acid, and histidine [37]; serine and cysteine [38]; and alanine, isoleucine, leucine, proline, and valine [39].We extended the scope of this paradigm in the latter pair of studies by profiling a second dataset of membrane proteins that were exhaustively optimized, studied, and made available in the MemProtMD database [40] of intrinsic membrane protein structures from the Protein Data Bank, followed by molecular dynamics optimization within simulated lipid bilayers (dipalmytoylphospha-tidylcholine, DPPC).We applied several filters from the MemProtMD database to populate three residue subsets: first, residues, especially in intracellular and extracellular loops, helices, etc., do not interact with the bilayer and are quite similar to their analogues in soluble proteins; second, residues at the core of the trans-membrane region, in channels or GPCR binding sites, also have minimal direct interactions with lipids; and third, a set constituting the residues that do interact with the lipids molecules in the bilayer.We termed these three residue subsets (or environments) as mS (soluble), mC (core), and mL (lipid-facing), respectively.
In the 2023 report [39], we only considered the aliphatic hydrophobic residues and characterized their interactions in soluble proteins and in the three structural environments described above.We also characterized the roles of these residues by applying the GETAREA [41] algorithm to obtain the sidechain solvent-accessible surface area (SASA), supplemented with a relatively simple adaptation to calculate a new parameter: the lipidaccessible surface area (LASA) for the mL residue subset [39].Lastly, we calculated the hydropathic interaction maps for the mL subset in two different ways: (1) including the interactions with the lipid molecules (mL), which was the basis for map clustering, and (2) by turning off those interactions (mN).The difference between the maps created using those two protocols would, then, represent only residue-lipid interactions.We hypothesized that such information would be invaluable in building molecular models for membrane proteins that include bilayer molecules, perhaps to the extent that they could supplement current protein structure prediction methods in membrane protein cases.
In this paper, we report that the membrane protein calculations are complete for all residue types.We describe the scope of the data, including metrics illuminating the SASA, LASA, and hydropathic interaction character and preferences of the transmembrane residues by type.For this work, we also completed residue characterization by calculating a metric that quantitates the interaction character for each residue map cluster (and residue type) in the various datasets.We illustrate the information content of these maps, which are a deconstruction of sets of interactions and forces leading to protein structure in a quantitative, visual, and intuitive manner for both soluble and membrane-bound proteins.The volume of data involved in this work allows us only to present sample results, but significantly more information is available in Supporting Information.

Characteristics of the Datasets
Our goal was to explore a variety of environments within proteins that represent the structural motifs found in soluble and membrane-bound proteins.In earlier work, we suggested that dividing the 2π × 2π Ramachandran plot for protein backbone angles into sixty-four π/4 × π/4 "chess squares" was an appropriate methodology for determining and cataloguing backbone dependence for sidechain conformational differences, but more importantly sidechain interaction preferences [34].We later implemented [36,37] a chess square parsing scheme for larger residues to filter sidechain conformations into 60 ± 30 • , 180 ± 30 • , and 300 ± 30 • χ 1 bins for ASN, ASP, CYS, HIS, ILE, LEU, MET, PHE, SER, THR, TRP, and TYR, and these were further and similarly subdivided into (9) χ 1 /χ 2 bins for ARG, GLN, GLU, and LYS.PRO residues were parsed into two bins centered around χ 1 = +30 • and -30 • , while ALA, GLY, and VAL were not parsed.Our nomenclature is that chess squares are named a1-h8, from lower left to upper right on the Ramachandran plot, and are always written in bold italic font.The parses are named (for all residues except PRO) .60,.180,and .300,and are appended to the chess square to identify a bin (for ARG, GLN, GLU, and LYS, these names are .60.60, .60.180, etc.).PRO parses are named as .30mand .30p[36].The -S-S-bridged cysteine (CYX) was treated as described earlier [38].
Later, as mentioned above, we added a second collection of proteins to our studies that focused on those membrane-bound by abstracting a portion of the MemProtMD database [40] (http://memprotmd.bioch.ox.ac.uk (accessed on 22 May 2023)) and used the filters thus provided to build three subsets of residues: (1) extracellular, i.e., in the soluble domain, that we termed the mS dataset; (2) intracellular and lipid-facing, termed the mL dataset; and (3) intracellular and core-facing, termed the mC dataset.An alternative interaction calculation method, wherein interactions between the residues and lipids are ignored, was termed the mN dataset.
Table 1 summarizes the data in terms of the numbers of residues in each dataset.The soluble dataset is the largest, encompassing nearly 750,000 residues with occurrence frequencies generally in line with the accepted norm.Altogether, there are about half as many residues in the membrane-bound protein datasets, but these are split unevenly: the mS dataset contains 54% of the residues, the mC dataset contains only 7% of the residues and the mL dataset contains the remaining 39%.In agreement with the expectation that the mS dataset comprises residues much like those found in soluble proteins, the mS occurrence frequencies are in reasonable agreement with those found in the soluble dataset.However, the mC dataset shows some significant and interesting differences (compared to mS): (1) It is much less likely to find an ARG, ASP, GLU, LYS, or PRO as a core-facing residue.This is interesting because these, other than PRO, are most of the residues associated with some sort of regulatory function (PRO is probably just ill-suited for the structural constraints of the core); (2) The most prominent increase in occurrence in the mC dataset is exhibited by PHE, and to a lesser extent by TYR, suggesting a necessary π-stacking interaction role here; and (3) There are no CYX residues and only a small number of CYS residues.Even more dramatic, but largely expected, are the differences between relative populations in the mL dataset compared to mS.The hydrophobic residues, ALA, ILE, LEU, PRO, and VAL, along with PHE, are now 57% of the residues found as lipid-facing (vs.40% in the mS dataset).In the mL dataset, the four formally-charged residues ARG, ASP, GLU, and LYS now comprise just 7% (vs.25% in the mS dataset).a Fraction of membrane protein residues found in the particular subset, i.e., in soluble domain (mS), within membrane and core-facing (mC), or within membrane and lipid-facing (mL).

Three-Dimensional Interaction Maps and Map Clustering
To catalogue the array of interactions for each residue's sidechain-that shape its conformation and role in protein structure at all scales-we calculated three-dimensional maps that enumerated the interactions between that residue and its environment, i.e., the rest of the protein and cofactors, such as water.Hydrophobic, hydrophobic-polar, favorable polar such as hydrogen bonding and acid-base, and unfavorable polar such as acid-acid and base-base interactions were scored such that their strengths are mapped in space [34,35].These maps were binned by the residue type, chess square, and, as appropriate (vide supra), χ 1 and χ 2 parses so that their sidechains were aligned [37].
Next, our pairwise similarity metric [34] was applied to each map pair in its bin, and these metrics were set into a two-dimensional matrix in preparation for clustering.We used the k-means unified gap method [42] based on the testing of a numerous other methods [34], and manually set the maximum number of clusters based on our early experiences to maximize the likelihood that the residue types and/or chess squares we wished to compare would have corresponding cluster counts [39].Table 1 indicates the number of clusters found for each residue type in each dataset.These vary over a wide range and are dependent on the number of residues present (weakly populated datasets will have fewer), the number of parses (ALA has one, while LYS has nine), and the structural complexity that in turn dictates the maximum allowed (ALA-4, ARG-18, ASN-12, ASP-12, CYS-6, CYX-12, GLN-12, GLU-12, GLY-9, HIS-12, ILE-9, LEU-9, LYS-18, MET-18, PHE-12, PRO-6, SER-6, THR-6, TRP-12, TYR-12, and VAL-9).It was our hypothesis that these clusterderived and weighted-average map sets represent a significant reduction in data compared to the original structural data (at least 50-fold in the soluble dataset), and that we have sampled enough that it is unlikely that many new motifs would appear with larger protein datasets.For classification, each cluster is identified by the most representative residue member (in an ordinal list), which is called the exemplar, and is the residue (map) closest to the cluster's centroid.Our nomenclature is that cluster names are written in bold font.Thus, in our approach, one member within the set of average maps in the specified backbone conformation very likely captures the constellation of interactions between any residue and its environment.In other words, this is the "hydropathic valence" of the residue type in its secondary structure (chess square).To summarize, these maps are conserved motifs that are information-rich backbone-dependent rotamer and interaction libraries.

Solvent-Accessible and Lipid-Accessible Surface Areas
To understand further the role each residue plays in structure, we calculated the solvent-accessible surface area (SASAs) for all residue sidechains using the GETAREA algorithm and server [41].The SASA is a metric representing the degree of exposure/buriedness of a residue's sidechain and at its basic level reveals the residue's position within the protein, i.e., surface or not.It becomes a more powerful structure-understanding tool when the identity of the residue is considered: high solvent exposure in a hydrophobic residue has an entirely different meaning than high solvent exposure in a polar residue.This is particularly relevant in considering membrane-bound proteins because lipid-facing residues are generally hydrophobic rather than polar but have large calculated SASAs because the membrane bilayer is not per se protein and thus not recognized by GETAREA.For this reason, we implemented an adjustment, which we call lipid-accessible surface-area (LASA), where accessible surface area is categorized as SASA or LASA based on the ratios of score sums involving atoms in the DPPC molecules to the total score.Table 2 sets out the average solvent-accessible and lipid-accessible surface areas by residue type for the sidechains of the four datasets.Also listed are the reference random coil SASAs for Gly-X-Gly tripeptides [41].Notes: (1) As GLY does not have a sidechain, its actual SASA is 0.0.We did all of our calculations on its full structure and the random coil value listed is that value; (2) Bridging cysteine (CYX) is not recognized by GETAREA, and our calculations actually consider each half of CYX to be -CB-SG-SG'-CB'.The random coil number presented in Table 2 is arbitrarily 150% of the CYS random coil SASA.These two values will only be used as normalization factors in the calculations below.
While most indications are that the structural characteristics and the residues themselves are very similar between the soluble dataset and the mS dataset, there are notable differences in the average SASAs between the two.While we cannot be definitive, it seems likely that this is due to the different ways the structure models within the two datasets were collected and processed.In general, for smaller residues (ALA, CYS, ILE, LEU, SER, and VAL), the SASAs are largely the same, but for most of the larger residues (ARG, GLN, GLU, HIS LYS, TRP, and TYR), the mS dataset residues are less solvent-exposed.Since the latter set was subjected to extensive molecular dynamics, it may be that the larger sidechains were optimized into more compact conformations that may be less native-like.Fraczkiewicz and Braun [41] as surface areas for fully exposed residue sidechains; d Accessible surface area is the sum of SASA and LASA; e Accessible surface area is the same as SASA for the mN datasets; f GETAREA does not recognize S-S bridged cystines coded as CYX: this value is 150% of CYS random coil SASA; g GLY has no sidechain: this value is the random coil SASA for the full glycine residue.
For the mC and mL datasets, we calculated both residue-average SASAs and LASAs.The mC LASA data account for 4% to 17% (average 9%) of the total accessible surface areas for these residues.This must represent residues whose backbones are in the core region but whose sidechains make contact with lipid molecules.The three residue types with the largest LASAs are somewhat unexpected: HIS, MET, and PHE.Perhaps, the sidechains of aromatic planar residues (TRP but not TYR also has a high LASA) have affinity and access to the lipid.Interestingly, only the MET of the long chain residues reaches the lipids: ARG, GLN, GLU, and LYS have below-average LASAs, likely because they are less hydrophobic.
In the mL dataset, the LASA captures 72% of the accessible surface area, with largely expected trends.The larger hydrophobic residues (including PHE and TRP) show between 84% and 89% LASA, while in contrast, ASN, ASP, GLN, and GLU show between 49% and 60% LASA.ALA and PRO likely have less prominent LASAs, 76% and 77%, respectively, because they are making a larger fraction of their interactions with neighboring residues.In the following section, we will examine the interaction character for each residue type and dataset.
Examining the SASA and LASA metrics on this residue-by-residue basis is a gross oversimplification of the data we have generated.Supplementary Tables S1-S5 list these data on a cluster-by-cluster basis for: the soluble residue dataset (Table S1), soluble domain of the membrane dataset (mS, Table S2), the core-facing residues of the membrane dataset (mC, Table S3), the lipid-facing residues (lipid interactions on) of the membrane dataset (mL, Table S4), and the lipid-facing residues (lipid interactions off) of the membrane dataset (mN, Table S5).For discussion, a very small subset of these data (~0.3%) is provided in Tables 3 and 4, for the soluble proteins, the soluble domain of the membrane proteins (mS), and the core-facing residues of the membrane proteins (mC) (Table 3) and the mL and mN residue sets (Table 4).Data for one bin of residue data for five somewhat diverse residues, ALA (c5), ASP (c5.300),ILE (c5.300),PHE (c5.300), and THR (c5.300), are listed.(The c5 chess square is in the α-helix region of the Ramachandran plot.)Note that the ASP maps were calculated at a pH that evenly divides the ASP residues into charged, deprotonated aspartates and neutral protonated aspartic acids using concepts and procedures described earlier [37].It is clear from Tables 3 and 4 that the SASA/LASA values are not monotonic from residue to residue, or even within the same residue and chess square.They are, in fact, a distinguishing feature of each cluster.However, we showed earlier that the clustered maps can be numerically compared with similarity metrics and that there can be very similar 3D map motifs generated in different chess squares for the same residue, e.g., for ALA [35], or between different datasets, e.g., soluble proteins and soluble domains of membrane proteins [39].We are not going to repeat that exercise here, but do highlight that ALA c5 3020 and ALAmS c5 905 were shown to have similar profiles, and as expected, their SASA values are very nearly the same.Also described earlier is that there is a degree of similarity between the core-facing membrane residue cluster maps (mC) and those of the soluble set [39], but the significantly smaller sample size of the former (~30-fold) is a confounding factor.The clusters of core-facing residue interaction maps show a similar diversity of SASAs as other cluster sets, but are generally more buried probably due to the confined space available in that environment.
Table 4 focuses on the mL vs. mN results, where the former has interactions with both other protein residues and the lipids, and thus has both LASA and SASA data, while the latter has only interactions with the protein (i.e., ignoring the lipid) and has only SASA data.Not surprisingly, the residues in this environment generally have much more significant propensity for interaction with lipids (LASA) than they do with H 2 O (SASA).However, this varies by residue type, chess square and cluster.For example, in ALA, the cluster 518 shows about 56% SASA, while 18 shows about 13%.Of the five residue types highlighted here, ASP has three clusters with higher SASA than LASA (11, 72 and 98), while ILE and PHE have none.While these results are fairly intuitive based on the general expectations of residue properties, the next section enumerates the actual average interaction character of each cluster.

Hydropathic Interaction Characters
The HINT formalism considers four interaction types: favorable polar, Pol(+), such as acid-base and hydrogen bonding; unfavorable polar, Pol(−), such as acid-acid and base-base; favorable hydrophobic, Hyd(+); and unfavorable hydrophobic, Hyd(−), i.e., hydrophobic-polar representing desolvation and related effects [32,43].Each individual residue's interaction map set was analyzed by summing its grid point values by interaction type; these are recorded as the interaction map character [44].Table 5 reports these on a residue type basis for all five datasets.For the purposes of clarity and to aid possible comparisons between the residues, the data in Table 5 were normalized by the random coil SASA dataset out in Table 2.These results are largely as expected, with hydrophobic residue sidechains being dominated by hydrophobic interactions, with the tendency for larger sidechains to have a larger fraction of their interactions to be favorable.The most productive residues for favorable polar interactions are the formally charged acids (ASP and GLU), amides (ASN and GLN), ARG (again, formally charged), and SER.GLY is an outlier because its data throughout this study are for the entire residue structure, not just the sidechain.As above for the SASA and LASA data, Supplementary Tables S1-S5 contain the interaction character data at the chess square and cluster levels (but are not normalized).Table 6 lists the (normalized) interaction character for the same set of residues in the soluble and mS datasets described in Table 3. Table 7 lists the (normalized) interaction character for the same set of residues in the mL and mN datasets described in Table 4.
Comparisons between the SASA/LASA data and interaction character data (e.g., Table 3 vs.Tables 4 and 6 vs. Table 7) should be particularly informative.First, residue clusters with a high SASA + LASA make fewer measurable interactions (|Hyd(−)| + |Hyd(+)| + |Pol(−)| + |Pol(+)|), although interactions with water are incompletely measured: not all crystal structures in the soluble dataset have confirmed explicit waters.Second, even polar residues with a high LASA can have significant, often favorable, interactions with the DPPC lipid; see, for example, the ASP clusters 44 and 101 that have ~25-50% higher favorable polar interactions in the mL calculation than in the mN calculation.The THR clusters 6, 33 and 55 have similar trends.Such residues must be interacting with the lipid's polar head groups.Third, as expected, hydrophobic residues, especially ILE in Tables 4 and 7, make significant hydrophobic interactions with the lipid in clusters with a high LASA, but the relationship is multifactorial.
The key point is that the 3D interaction maps that we generated in our study, 14,158 for the soluble proteins and a total of 19,802 for the three subsets of the membrane proteins, are packed with information about the interaction preferences of the residues by type, backbone angle, and location.For those that are membrane-bound, we should have a handle on residue-lipid interactions as well as residue-residue interactions.Probing these maps by SASA/LASA or interaction character is inadequate to assess them.

Visualization of 3D Residue Interaction Maps
We selected a few maps, from the mL and mN datasets, for each of the residue sets listed in Tables 4 and 7 for display in isovalued contour surfaces.Also shown are difference maps calculated between the cluster maps in the mL and mN datasets.Each map is presented in two side-by-side orientations.The orientation on the left has the z-axis (generally the CA-CB) bond directed out of the plane of the page, while the orientation on the right has the z-axis in the plane of the page directed to the right.The contour levels shown are as follows: red (opaque, -24, translucent, -6), blue (opaque, +24, translucent, +6), purple (opaque, -24, translucent, -6), and green (opaque, +12, translucent, +3,-except for PHE, where this contour level is +1) for unfavorable polar, favorable polar, unfavorable hydrophobic, and favorable hydrophobic interactions, respectively.

Alanine
In Figure 1 are plotted the clusters 18 (a) and 393 (b) in the c5 chess square for ALAmL, ALAmN, and their ALAmL-ALAmN difference maps.To review, mL maps include the inter-residue and residue-lipid interactions and mN maps include only the inter-residue interactions, and thus the difference maps illustrate the role of lipid interactions.As it can be inferred from the close numerical match in the interaction characters (Table 7) of all ALAmL and ALAmN residues, alanine does not appear to make much of an impact in terms of protein-lipid interactions (at least in this conformation), which can be visualized in Figure 1 as the similarity of the analogous mL and mN maps (green and purple contours, favorable and unfavorable hydrophobic, respectively) and the very weak, but favorable, hydrophobic interactions seen in the difference maps.Alanine is simply too small to reach and associate with the membrane lipid molecules.It is worth noting, however, that the small patches of hydrophobic interaction density in the difference maps are at different locations (~10 o'clock in 18 and ~5 o'clock in 393), thus showing diversity even in alanine's small interaction with lipids.
Molecules 2024, 29, x FOR PEER REVIEW 13 of 32 a handle on residue-lipid interactions as well as residue-residue interactions.Probing these maps by SASA/LASA or interaction character is inadequate to assess them.

Visualization of 3D Residue Interaction Maps
We selected a few maps, from the mL and mN datasets, for each of the residue sets listed in Tables 4 and 7 for display in isovalued contour surfaces.Also shown are difference maps calculated between the cluster maps in the mL and mN datasets.Each map is presented in two side-by-side orientations.The orientation on the left has the z-axis (generally the CA-CB) bond directed out of the plane of the page, while the orientation on the right has the z-axis in the plane of the page directed to the right.The contour levels shown are as follows: red (opaque, -24, translucent, -6), blue (opaque, +24, translucent, +6), purple (opaque, -24, translucent, -6), and green (opaque, +12, translucent, +3,-except for PHE, where this contour level is +1) for unfavorable polar, favorable polar, unfavorable hydrophobic, and favorable hydrophobic interactions, respectively.

Alanine
In Figure 1 are plotted the clusters 18 (a) and 393 (b) in the c5 chess square for ALAmL, ALAmN, and their ALAmL-ALAmN difference maps.To review, mL maps include the inter-residue and residue-lipid interactions and mN maps include only the inter-residue interactions, and thus the difference maps illustrate the role of lipid interactions.As it can be inferred from the close numerical match in the interaction characters (Table 7) of all ALAmL and ALAmN residues, alanine does not appear to make much of an impact in terms of protein-lipid interactions (at least in this conformation), which can be visualized in Figure 1 as the similarity of the analogous mL and mN maps (green and purple contours, favorable and unfavorable hydrophobic, respectively) and the very weak, but favorable, hydrophobic interactions seen in the difference maps.Alanine is simply too small to reach and associate with the membrane lipid molecules.It is worth noting, however, that the small patches of hydrophobic interaction density in the difference maps are at different locations (~10 o'clock in 18 and ~5 o'clock in 393), thus showing diversity even in alanine's small interaction with lipids.From the left, the first pair (mL) are the maps for interactions between alanine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between alanine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: leftz-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are Figure 1.Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing alanines in the c5 chess square for (a) cluster 18 and (b) cluster 393.From the left, the first pair (mL) are the maps for interactions between alanine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between alanine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment; purple contours represent unfavorable hydrophobic interactions.

Aspartic Acid
Figure 2 illustrates the residue clusters 44 (a), 82 (b), and 101 (c) for ASPmL, ASPmN, and ASPmL-ASPmN in the 300 • parse of the c5 chess square, i.e., c5.300.Unsurprisingly, aspartate/aspartic acid residues are dominated by polar interactions (red and blue) and unfavorable hydrophobic (purple), although favorable interactions with the CB methylene are strongly evident in cluster 82, weakly in 101, and barely appear in 44.All three ASPmL clusters were calculated by our algorithm [37] to be protonated, and thus neutral (less polar), in keeping with their expected interactions with the hydrophobic membrane lipids.All of the mN maps, where interactions with lipids are turned off, show the expected interactions of an aspartic acid sidechain, making a few hydrogen bonds (blue) and detecting a few polar (red) and hydrophobic (purple) clashes [37].The mL maps, and by inference the difference maps, are much more feature rich, reflecting the interesting diversities of interactions.Clearly, unfavorable hydrophobic (purple) interactions will be important here as this very polar sidechain interacts with the nonpolar lipids, but there are a large number of new red and blue interaction contours, especially in cluster 101.This likely is due to aspartic acids/aspartates, which are not exceedingly rare in the lipid-facing region, fulfilling a structural role by anchoring the protein to the lipids through interactions with the DPPC head groups (or analogous molecules in Nature).From the left, the first pair (mL) are the maps for interactions between aspartic acid/aspartate and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between aspartic acid/aspartate and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment; purple contours represent unfavorable hydrophobic interactions; blue contours represent favorable polar interactions; and red contours represent unfavorable polar interactions.

Isoleucine
In Figure 3, the residues displayed are for the 6 (a) and 17 (b) map clusters of ILEmL, ILEmN, and ILEmL-ILEmN (c5.300).It is clear that isoleucine's major impact on structure is by providing favorable hydrophobic interactions.The ILEmN clusters illustrate the interactions between this residue and its neighboring residues, i.e., it is playing a fairly similar role as alanine within the protein.The ILEmL maps are, however, completely dominated by interactions with the lipids, and the ILEm-ILEmN difference maps emphasize how important this residue is to structurally maintaining protein/lipid ensembles.From the left, the first pair (mL) are the maps for interactions between isoleucine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between isoleucine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment; purple contours represent unfavorable hydrophobic interactions.

Phenylalanine
Figure 4 sets out residue sidechain cluster maps 110 (a) and 202 (b) for the phenylalanine datasets PHEmL, PHEmN and PHEmL-PHEmN (c5.300).Note that, although these two clusters show significantly different orientations of the aromatic ring (χ 2 = 97 • for residue 110 and χ 2 = 151 • for residue 202), the calculation and clustering protocol correctly binned these as distinct clusters possessing different conformations as well as unique interaction sets.Also, while there are significant favorable hydrophobic interactions produced by phenylalanine, they are spread out over a larger volume; so, as noted above, the favorable hydrophobic maps (green) are contoured at a lower value for just this figure.The HINT formalism considers π-π stacking as a favorable hydrophobic interaction [36,43]; so, the only interaction types that are significant are favorable hydrophobic.There is no indication from Table 5 or Table 7 that any different result would be expected in other clusters or chess squares in lipid-facing phenylalanine datasets.Of particular interest are the (PHEmL-PHEmN) difference maps: the two results shown here expose different edges to the lipids.Cluster 110 is exposing only one of its CE atoms, while cluster 202 is exposing a crescent of interaction potential spanning CE1-CZ-CE2.Both, however, are showing edge-on rather than stacking interactions.

Threonine
Two examples of threonine interaction cluster maps, 33 (a) and 52 (b), are displayed in Figure 5.In this case, the two clusters are quite different from each other.Cluster 33 makes significant favorable polar interactions with the lipid, associated with the OG1 atom, while its methyl (CG2) appears to be embedded within the hydrophobic portions of the lipid.In contrast, cluster 52's interactions with the lipid molecules are wholly hydrophobic, albeit both favorable and unfavorable.Profiles for the other four threonine cluster maps, 6, 55, 60, and 71, can be largely surmised from Tables 4 and 7. Map 6 shows the largest difference in character (Table 7) for unfavorable hydrophobic, with more modest differences in favorable hydrophobic and polar, suggesting that the OG1 atom is interacting favorably with a polar atom in a DPPC and likely unfavorably with other atoms in it.These differences for cluster 55 are similar, except that of the favorable hydrophobic is somewhat larger than the unfavorable hydrophobic.Clusters 60 and 71 are different in that their overall interactions are weaker for all four types, as should be expected by the their substantially larger accessible surface areas-implying more empty space surrounding these clusters.

Composite 3D Interaction Maps for Proteins
With the tens of thousands of backbone-, residue-, and environment-dependent distinct map clusters generated for this study, we can assemble an overall picture of noncovalent interactions within a protein, between proteins, between proteins and ligands, and of special interest here, between proteins in transmembrane regions and their lipid support.For this discussion, we arbitrarily chose one protein/lipid structure from the training set, pdbid: 4o6y, which is the X-ray diffraction structure (resolution of 1.70 Å) of cytochrome b561 from Arabidopsis thaliana, a eukaryotic transmembrane ascorbate-dependent oxidoreductase [45].
We traced each residue in the protein through our calculational procedure to identify its membrane protein subset, chess square, parse, and cluster membership.Next, we constructed three-dimensional grids large enough to encage the entire protein, 99 × 95 × 111 for mC residues, 153 × 127 × 113 for mL residues, and 135 × 125 × 145 for mS residues, all with 0.5 Å grid spacing.Then, the 3D clustered interaction maps for each residue in the protein were frame-shifted to that residue's frame, and the interpolated values for each grid point in the residue-level maps (for favorable and unfavorable, hydrophobic and polar interactions) were accumulated in the protein level maps.Figure 6 sets out these results.From the left, the first pair (mL) are the maps for interactions between phenylalanine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between phenylalanine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment.
the largest difference in character (Table 7) for unfavorable hydrophobic, with more modest differences in favorable hydrophobic and polar, suggesting that the OG1 atom is interacting favorably with a polar atom in a DPPC and likely unfavorably with other atoms in it.These differences for cluster 55 are similar, except that of the favorable hydrophobic is somewhat larger than the unfavorable hydrophobic.Clusters 60 and 71 are different in that their overall interactions are weaker for all four types, as should be expected by the their substantially larger accessible surface areas-implying more empty space surrounding these clusters.From the left, the first pair (mL) are the maps for interactions between threonine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between threonine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: From the left, the first pair (mL) are the maps for interactions between threonine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between threonine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment; purple contours represent unfavorable hydrophobic interactions; and blue contours represent favorable polar interactions.
First, in Figure 6a, the structure for the cytochrome b561 protein is shown; it has two chains, each possessing six transmembrane helices.There are a relatively small number of extramembrane residues, largely on the extracellular side (top, Figure 6a).Also, the gap between the two chains is noticeably smaller on the extracellular side than on the intracellular side.In Figure 6b, the composite interaction maps for the mS residues outside the membrane surfaces (extracellular on top and intracellular on the bottom) are shown (see caption for display details).As suggested by the relative number of extramembrane residues (top and bottom), the extracellular region has a much richer set of interactions than the intracellular.Each displayed contour corresponds to one or more interactions between the residue sidechains that impact the structure: the blue isovalue surfaces represent favorable polar interactions like hydrogen bonds, just as described above for the individual residue-level maps; red = unfavorable polar; green = favorable hydrophobic; and purple = unfavorable hydrophobic.In the transmembrane (helix) region, we first display the interactions between the core-facing mC residues and their neighbors in Figure 6c.It is important to point out that all of these map figures (Figures 1-6) were manually contoured, i.e., the isovalues were chosen to maximize reader viewability and information accessibility.The digital information within the clustered map grids themselves is the key result.With that in mind, Figure 6c could be contoured such that every atom pair is associated with an interaction or that no interactions are shown at all.Thus, this figure shows that there are numerous favorable and relatively significant interactions at or near the junction between the two six-helical bundles.Of the 17 residues identified as mC in this protein, there are 2 ARGmC, 4 HISmC, 4 LYSmC, and 1 TRPmC, all very polar and large.More research is called for, beyond the scope of this work, but these results are suggestive that the mapping strategy may provide mechanistic insights into the structure/function in membrane proteins.For example, our pH-dependent map data may be particularly informative.
Figure 6d focuses on the "internal" residue-residue interactions between the lipidfacing residues, i.e., the mL dataset, excluding interactions with lipids, which is designated mN.It highlights the hydrogen bonds and favorable/unfavorable hydrophobic interactions and very few unfavorable polar interactions, which characterize this protein's structure.Clearly, there are a large number of hydrogen bonds, and as should be expected since over half of the residues in mL are hydrophobic, a substantial volume of hydrophobic interactions is also seen.The prominent unfavorable hydrophobic, although technically an energetic cost, always tracks with the favorable hydrophobic due to interactions between hydrophobic sidechains and nearby polar residues or ubiquitous backbone atoms within the protein.
Shown in Figure 6e is the map for residue-lipid interactions, i.e., mL-mN.This figure also includes the molecular model of the associated DPPC lipids (with green-rendered carbons) extracted from the simulation reported in the MemProtMD database [40].Unsurprisingly, the contours shown are dominated by those that are favorable hydrophobic, especially in the middle where the lipid tails are found.Clearly, however, the lipid headgroups are also recognized by the protein's transmembrane helixes, and their sequence is quite obviously designed by Nature to anchor within the cellular membrane with both polar and hydrophobic interactions.There are literally thousands of specific protein-lipid interactions comprising this map.To illustrate the information content (and complexity) of this composite map, we isolated just one favorable hydrophobic contour in Figure 6f.This shows that the lipid tails of DPPC 448 (nomenclature from MemProtMD) interact with the hydrophobic sidechains of Phe 211, Val 216, Val 219, and Phe 414.
While this protein's analysis is of an existing structure, there clearly is very significant scope for using this information in structure prediction applications.Strategies for this are described in the next section.

Reconstruction of Protein Structure Using 3D Interaction Maps
The approach to reconstruct the maps into a structure prediction is simply to match the maps' patterns such that as many structural interaction features within sidechains as possible are reinforced.Thus, as each individual residue interaction map encodes a specific set of interactions representing a plausible environment, the "ideal" interacting residue or residues would have map features with the same character at the same points in space when all residue maps are aligned in the "protein frame".As was seen in many of the individual residue interaction maps (Figures 1-5), as well as in the 4o6y composite map (Figure 6), this does not mean that all interactions in a stable protein structure are favorable.
Most maps include several to a dozen distinct interaction features that are, in principle, at the midpoint between the two (or more) interacting functional groups or atoms.This suggests a surfeit of information for self-consistently reassembling the protein's sidechain conformations.The advantage here over conventional technology, such as backbonedependent rotamer libraries [46][47][48][49], and codes, such as SCWRL [50][51][52], is that this optimization is not just geometric, or including just hydrogen bonds, but inventories the entire hydropathic interaction network in its optimization.
However, reconstruction remains a combinatorial problem, albeit a smaller one: (1) instead of a near infinite number of backbone conformations, our chessboard schema defines only sixty-four, of which, except for GLY, less than half are significantly populated; (2) instead of a near infinite number of sidechain conformations for most residues, we systematized each to an accessible number-between one (no χ 1 ) and nine (χ 1 and χ 2 ), for each of which there are between four and eighteen maps; and (3) as the interaction map clusters are quite unevenly distributed amongst the residue types, backbone, and sidechain conformations, dictated by the populations of residues in the experimental structural data, consequently predicting structure for new cases will follow the same rules, i.e., the more common residues and conformations will have more extensive and validated training data than rare or potentially Ramachandran-disallowed backbone conformations.
To proceed, 3D grids enclosing the protein or its region of interest can be constructed, much as described above for the creation of Figure 6.In this case, the grids will contain "scores" assessing the quality of interaction feature matches at the positions in space associated with each grid point.Encoding match scores as well as the molecular interaction features in 3D maps has operational computing advantages.They are easily parallelizable, perhaps even more than "embarrassingly" so, and are similarly completely amenable to GPU computing.Many popular scoring functions for docking use grid-based algorithms [53][54][55].

Building a Structural Model from Interaction Maps
Applying our strategy for building a sidechain-optimized structural model by matching interaction features is illustrated by the cartoon of Figure 7.The score grid, similar to the map composite grid, holds at each grid point (xyz) the sum of all interaction maps from residues (i, j, k l, . ..) that have been frame-shifted to align with the protein's backbone.The total score (S model ) would be the sum of all values (s xyz ) in the score grid.The difference between this score grid and the composite display maps is that S model is a fitness function that must be optimized to define the best set of sidechain backbone conformations.Since SER has six maps per parse, PHE has twelve, each ASP has twelve, and ALA has four, the cartoon example suggests that there are 41,472 unique models to be scored, which is a problem amenable to various machine learning or evolutionary algorithm solutions, especially for the more biologically relevant cases of the tens-to-hundreds of residues to be modeled.We are working on a genetic algorithm solution to this problem [56].

Underlying a Residue's Molecular Structure
Each 3D interaction cluster map is associated with two molecular structures or coordinate sets: one being that of the specific residue responsible for the cluster exemplar and the other an average of all members of the cluster, weighted for the Euclidian distance from the cluster's centroid [34].For each cluster's average structure, we have access to RMSDs for each of its atoms as well as average sidechain dihedral angles with standard deviations.Thus, depending on the scope of the problem and the quality of experimental data, the number of residue maps to be explored can be controlled.For example, if the sidechains are well resolved, the structural filters taking into account the now lesser uncertainties of atomic positions and dihedral angles can reduce the pool size of residue maps to be auditioned.In contrast, a poorly resolved structure or one missing a few or all sidechains might require auditioning larger sets of residue maps, e.g., the trial map set for serine may include all three χ 1 parses.

Fitting Lipids to a Membrane Protein Structure
A relatively small number of experimental membrane protein structures deposited in the protein data bank or other databases include any or more than a small handful of structural lipids in the transmembrane region, let alone their native set [19,20,[57][58][59][60][61].This experimental deficit has led to a number of proposed methodologies to simulate the lipid structures [62][63][64] or to predict or simulate protein structure within implicit [65][66][67][68] or explicit [69][70][71][72] membrane environments.

Underlying a Residue's Molecular Structure
Each 3D interaction cluster map is associated with two molecular structures or coordinate sets: one being that of the specific residue responsible for the cluster exemplar and the other an average of all members of the cluster, weighted for the Euclidian distance from the cluster's centroid [34].For each cluster's average structure, we have access to RMSDs for each of its atoms as well as average sidechain dihedral angles with standard deviations.Thus, depending on the scope of the problem and the quality of experimental data, the number of residue maps to be explored can be controlled.For example, if the sidechains are well resolved, the structural filters taking into account the now lesser uncertainties of atomic positions and dihedral angles can reduce the pool size of residue maps to be auditioned.In contrast, a poorly resolved structure or one missing a few or all sidechains might require auditioning larger sets of residue maps, e.g., the trial map set for serine may include all three χ 1 parses.

Fitting Lipids to a Membrane Protein Structure
A relatively small number of experimental membrane protein structures deposited in the protein data bank or other databases include any or more than a small handful of structural lipids in the transmembrane region, let alone their native set [19,20,[57][58][59][60][61].This experimental deficit has led to a number of proposed methodologies to simulate the lipid structures [62][63][64] or to predict or simulate protein structure within implicit [65][66][67][68] or explicit [69][70][71][72] membrane environments.
The observation by Qin et al. [9] that there were conserved lipid binding sites on transmembrane proteins led us to believe that, just as the three-dimensional hydropathic interaction maps that we have been developing, and illustrated here, represent conserved interaction motifs possessing conserved character, loci, strength of interactions, etc., they may also include lipid interaction information if constructed from an appropriate training set.Furthermore, they would represent a means to develop an alternative, structure-based, approach for building reliable protein-lipid assemblies at a modest cost compared to many molecular dynamics-based approaches.
Simply, referring back to Figure 6 and Section 2.7.1, if the lipid-facing amino acid residues in the transmembrane region are modeled and optimized with the mN dataset (similar, visually, to Figure 6d), their mL (or mL-mN, as in Figure 6e) analogues spotlight where hydrophobic, acid, base, etc. features on lipid molecules need to be placed to produce the indicated interactions indicated by the maps.

Soluble Protein Dataset
As previously described, our dataset was comprised of 2703 randomly selected proteins from the RCSB Protein Data Bank, excluding those structures that contained ligand and/or cofactors [34].Each residue type was extracted and studied independently from all structures, except N-and C-terminal residues.Selection criteria have been outlined previously [34], where we provided extensive documentation on our selection criteria for this protein library, but briefly: (1) We did not filter for redundancy, believing that more common motifs (and residue environments) should be sampled more in less in line with their population in Nature.This, of course, assumes that the PDB structures are themselves representative, which is probably only partly true; and (2) We did not filter for resolution, believing that many low-resolution structures have unique features and interaction motifs not present in the highest resolution PDB set.Hydrogen atoms were added as needed based on their hybridization, which was followed by conjugate the gradient minimizations of their positions using Sybyl X.2.1.No further structure optimizations were performed, as retaining the integrity of the experimental data was paramount.
The presence or absence of water molecules in PDB structures is, nevertheless, a concern: we have commented on this issue extensively in the past [7,[73][74][75], but there are few-only poor-options to rectify this after a structure is solved and deposited.However, our methods will usually highlight where water molecules (or other molecules playing a similar interaction role) should be.Unfulfilled polar interaction contours are clear indicators of a missing water molecule.The map clustering we performed does recognize hydrated and water-free residues as different, with distinct interaction profiles.It is important to note that our methods are silent regarding waters associated with the protein's surface unless they make interactions.
Molecular models for four residue types-aspartic acid, cysteine, glutamic acid, and histidine-were built such that their ionization states fit pH levels previously determined (pH 50 ) [37] to evenly divide each of those residues sets into protonated and deprotonated sets.The detailed procedure for this operation was reported previously [37,38].Simply, the HINT interaction score between the residue of interest and its neighbors was calculated for both the protonated and deprotonated cases for ASP, CYS, and GLU (protonated cases were optimized for the highest scoring -XH angle) residues and for cases representing all eight [37] possible versions of HIS.Each of these scores were energetically corrected for pH-pK a nominal and the case with the highest resulting HINT score defined the most likely ionization state for that residue.These pH 50 values are 3.3449 for soluble ASP, 9.5814 for soluble CYS, 4.2240 for soluble GLU, and 5.1743 for soluble HIS.For all the membrane protein datasets, analogous pH 50 s were used: 3.9331 for ASP, 9.4506 for CYS, 4.2021 for GLU, and 5.4206 for HIS.Residues whose actual environment at the specified pH 50 is basic were modeled as deprotonated and those in an acidic environment were modeled as protonated.

Membrane Protein Dataset
Similarly, we randomly extracted 362 membrane protein structures from the Grazhdankin et al. (2020) [76] dataset that is a subset of the MemProtMD database [40].These structures were available as pre-oriented, lipid-"solvated" and optimized with ~1 µs of coarse-grain molecular dynamics [58].In previous work [38,44], our dataset was slightly larger, but the Supplementary Files (vide infra) for seven proteins (pdbids: 3fb5, 3wxv, 4xwk, 5f1c, 5jsz, 5llu, and 5m94) we used are not currently available.No effort was made to filter the membrane protein structures for redundancy or resolution.We removed lipids more than 6 Å away from the protein and added missing hydrogen atoms that were energy minimized as above.The MemProtMD dataset structures do not include water molecules, ions, or other cofactors.
As described earlier, we further binned the residues into sets that described their locations within the membrane proteins [39].Briefly, we adapted two of the Supplementary Files in MemProtMD found with each protein-lipid model.Snapshots of the average lipid phosphate surfaces (over the final 800 ns of simulation) are the "distortions" coordinates (named pdbid_default_dppc-distortions.pdb)[40] that represent the membrane region's extents.We thus defined the average z-coordinates as upper and lower planes and assigned all residues with all three of their backbone atoms, N, CA, and C, between those planes to be in the membrane region.Residues outside these planes were binned being in the (mS) soluble domain.Those in the membrane region were binned as lipid-facing (mL) or core-facing (mC) with the MemProtMD "residue-wise analysis" file (pdbid_default_dppcby-resid.csv) that defines each residue to be a constituent (or not) of the pore inner surface.The mN datasets were identical to the mL datasets but the downstream map calculations were different (vide infra).
All individual residues (and their protein) were aligned to a stub residue at the center of each chess square, such that the Cartesian origin is at the CA atom, the CA-CB bond corresponds to the z-axis, and the CA-HA bond is in the yz-plane [34].All calculated maps for that residue result from its interaction with its environment result from and can be aligned with and compared to all other residues of that type binned into that chess square.
To simplify nomenclature for this work, each residue was assigned a number in a list of residues within each chess square or, as needed, χ 1 parse and χ 2 parse, e.g., the first alanine in the a1 chess square is 1, the third isoleucine in the c5.300 parse is 3, etc.These codes can be unpacked using data freely available upon request from the author.

HINT Scoring Function
Interatomic interactions were scored with the HINT forcefield [32,33].The fundamental parameter, the hydrophobic atom constant, is an atom-level logP o/w that uses the defined fragments and factors of the Hansch and Leo methodology [77,78].These interactions were calculated for the residue of interest with respect to all other residue types and water, but for the mL lipid datasets, atoms from the DPPC lipid molecules were also considered in the calculations.

HINT Basis Interaction Maps
Three-dimensional boxes large enough to fit the structure of each residue type in all expected conformations, with an added 5 Å on each dimension, were calculated, all with a grid point spacing of 0.5 Å on each axis.Interaction maps illustrating in 3D all interactions of that residue-more or less a recasting of pairwise HINT scores into 3D objects-were calculated.These maps encode the position, intensity, and type of atomatom interactions between the residue and other residues in contact or near-contact to it.Mathematical expressions have been set out in previous works [30,33,35].Map data were calculated for sidechain atoms of all residues in this study with individual maps for the four interaction classes: favorable polar, unfavorable polar, favorable hydrophobic, and unfavorable hydrophobic.

Calculation of Map-Map Correlation Metrics and Clustering
The calculations of map-map correlations, i.e., comparisons of map pairs, was performed using an algorithm that defines the similarity between them, as previously described in detail [30].To cluster the pairwise map similarity matrices, we utilized k-means clustering available as a supplement to the freely available R programing language and environment [79].We opted to set a uniform maximum number of clusters for each chess square or chess square/parse to improve consistency and facilitate comparisons, as described earlier [73].One issue with k-means is that it does not form singleton clusters; we backfilled as necessary such missing (singleton) clusters.Only chess square/parses with five or more maps were clustered, and those with fewer were instead averaged to create a pseudo 1-cluster case.Each cluster is named for the cluster member closest to its centroid and cluster names are presented in bold, e.g., 123.
To represent each cluster, average maps were calculated by Gaussian weighting (w) each map's contribution based on its Euclidean distance from the cluster centroid, as described earlier [34].This weighting was performed so that maps closer to the centroid contribute more to the average map.The "exemplar" is the residue datum closest to the centroid of each cluster.

Solvent-Accessible and Lipid-Accessible Surface Area Calculations
The GETAREA algorithm [41] was used to calculate the solvent-accessible surface areas (SASAs) for all residue sidechains in our datasets.The program on its on-line server was run with default settings.For residues in the mC and mL datasets, the calculated SASAs were due to contact with water (both explicit and presumed) and with lipids.To quantify this latter effect, we calculated the ratio of the score sums involving lipid atoms to all atoms.If that ratio was greater than 0.1, we defined the resulting surface areas for such residues as LASAs or lipid-accessible surface areas.

Creation of Composite and Score Maps
The methods to build both of these maps are similar.First, the region of interest is defined by calculating the minimum and maximum x, y, and z coordinates of the set of residues and lipids (if included); the maxima are increased by 5 Å, and the minima are decreased by 5 Å.A grid box is superimposed over that volume with spacing between grid points of 0.5 Å.For creating these final maps, the following temporary grids are created using this grid box: four floating point grid maps for each of the interaction types (favorable and unfavorable, hydrophobic, and polar) to hold the composite map data; one Boolean (integer) grid map for each of the residues of interest in the dataset recording whether each point of the overall map is relevant for its associated residue; eight integer maps for each of the residues of interest in the dataset that hold map point indices relating data points in the residue's map to the overall map; and eight floating point grid maps for each of the residues of interest in the dataset holding the interpolation rules (values) for the alignment between the overall protein coordinate frame and each residue of interest's coordinate frame.
Each of these residues in the protein are aligned to their centroids, as presented above in Section 3.3 and previously [34].The alignment matrix thus calculated is "unapplied" to the standard defined grid box associated with that residue type, which has now been expanded to explicitly enumerate all grid point coordinates in the box, rather than only the default eight corners.This positions the residue's cluster maps in frame with the protein.
For each point in the overall map that is encaged by eight grid points in the rotated and expanded residue map, interpolation rules are defined for that point by calculating the associated eight distances to the residue map's grid point, normalizing them to the volume of the 8-point grid cube-in this case, 0.5 3 = 0.125-and then recording those values in the eight interpolation rules grids for that specific protein residue.This pre-calculation, which is CPU-and memory-intensive, enables the very efficient population of the composite or numerous score maps downstream, without further rotations or interpolations.

Summary and Conclusions
In this study, we reported the residue interaction environments of all amino acid residues previously undocumented in earlier articles [36][37][38][39].One objective was to characterize residues performing unique structural roles in membrane proteins based on a classification scheme inspired by concepts and parameters available in the MemProtMD database [40].We defined three membrane protein residue sets: (1) in soluble domains, (2) transmembrane and lipid-facing, and (3) transmembrane and core-facing.Generally, the residue interaction profiles as seen in the three-dimension hydropathic interaction maps from soluble domains in membrane proteins are quite similar to those of the same residues in soluble proteins.
Treating the three membrane protein datasets as unique provides scope for more nuanced analyses of specific structural features as well as broader interpretations of structure.An in toto protein structure is, in our approach, the composite sum of nodes in a three-dimensional network where each hydropathic interaction residue map is a puzzle piece.The puzzle pieces for reassembling lipid-facing residue structure in the helices are particularly easy to interpret in this paradigm: the favorable hydrophobic and favorable polar interactions dominate the transmembrane protein-lipid contact surface volume.On the other hand, the transmembrane core region is more complex and suffers from sparser data-there are closer to thirty-fold fewer residues sampled in this dataset than in the soluble protein set.More raw structural data may have to be sampled here.However, as mentioned above, and one important point included in two of our earlier manuscripts [37,38], our methods include the ability to vary pH, which could be particularly interesting in building an understanding of structure-function relationships for membrane proteins.
The three-dimensional hydropathic interaction maps provide a method to deconstruct the protein + water + lipid structure into residue-based units that carry enough information to reassemble such structures.However, these maps add value over simple molecular mechanics or homology methods of building proteins because they: incorporate indirect structural effects like the pi-pi stacking and pi-cation interactions of aromatic residues [39], model the role of ionization states in structure for ionizable residues [37,38], and even have strong potential to discover, define, and characterize binding sites on protein surfaces, e.g., for lipids as discussed here, but likely for ligands or cofactors.We term our methods "3D interaction homology" because the maps are focused on the three-dimensional arrangement of interactions and their types, and not the specific neighboring residue types.Further work on this paradigm will likely involve algorithms for the intelligent reassembly of the residue-level puzzle pieces into the full structure.

Figure 1 .
Figure 1.Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing alanines in the c5 chess square for (a) cluster 18 and (b) cluster 393.From the left, the first pair (mL) are the maps for interactions between alanine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between alanine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: leftz-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are

Figure 2 .
Figure 2. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing aspartic acids/aspartates in the c5 chess square for (a) cluster 44, (b) cluster 82, and (c) cluster 101.From the left, the first pair (mL) are the maps for interactions between aspartic acid/aspartate and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between aspartic acid/aspartate and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue

Figure 2 .
Figure 2. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing aspartic acids/aspartates in the c5 chess square for (a) cluster 44, (b) cluster 82, and (c) cluster

Molecules 2024 ,
29,  x FOR PEER REVIEW 16 of 32 similar role as alanine within the protein.The ILEmL maps are, however, completely dominated by interactions with the lipids, and the ILEm-ILEmN difference maps emphasize how important this residue is to structurally maintaining protein/lipid ensembles.

Figure 3 .
Figure 3. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing isoleucines in the c5 chess square for (a) cluster 6 and (b) cluster 17.From the left, the first pair (mL) are the maps for interactions between isoleucine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between isoleucine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment; purple contours represent unfavorable hydrophobic interactions.

Figure 3 .
Figure 3. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing isoleucines in the c5 chess square for (a) cluster 6 and (b) cluster 17.From the left, the first pair (mL) are the maps for interactions between isoleucine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between isoleucine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored:

Figure 4 .
Figure 4. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing phenylalanines in the c5 chess square for (a) cluster 110 and (b) cluster 202.From the left, the first pair (mL) are the maps for interactions between phenylalanine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between phenylalanine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment.

Figure 4 .
Figure 4. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing phenylalanines in the c5 chess square for (a) cluster 110 and (b) cluster 202.From the left, the first pair (mL) are the maps for interactions between phenylalanine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between phenylalanine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment.

Figure 5 .
Figure 5. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing threonines in the c5 chess square for (a) cluster 33 and (b) cluster 52.From the left, the first pair (mL) are the maps for interactions between threonine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between threonine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations:

Figure 5 .
Figure 5. Three-dimensional clustered hydropathic interaction maps for the sidechains of the lipidfacing threonines in the c5 chess square for (a) cluster 33 and (b) cluster 52.From the left, the first pair (mL) are the maps for interactions between threonine and all neighboring species, including the DPPC artificial lipids; the second pair (mN) are the maps for interactions between threonine and all neighboring species, excluding lipids; the third pair are the difference (mL-mN) maps, which thus characterize residue-lipid interactions.Each residue/map in a pair is displayed in two orientations: left-z-axis (CA-CB bond) directed out of the page, and right-z-axis directed upwards.Atoms are colored: C-white, H-cyan, N-blue, O-red.Green contours represent favorable hydrophobic interactions between the residue sidechain and its environment; purple contours represent unfavorable hydrophobic interactions; and blue contours represent favorable polar interactions.

Figure 6 .
Figure 6.Structures and hydropathic interactions of cytochrome b561 from Arabidopsis thaliana (pdbid: 4o6y).(a) The hydrogen-suppressed molecular structure with superimposed tube (orange) illustrating the backbone trace.Atoms are colored by type (white = carbon).(b) Hydropathic interactions between each residue's sidechain and environment for atoms in the mS soluble domains of the protein.Green contours represent favorable hydrophobic interactions between the residue

Figure 6 .
Figure 6.Structures and hydropathic interactions of cytochrome b561 from Arabidopsis thaliana (pdbid: 4o6y).(a) The hydrogen-suppressed molecular structure with superimposed tube (orange) illustrating the backbone trace.Atoms are colored by type (white = carbon).(b) Hydropathic interactions between each residue's sidechain and environment for atoms in the mS soluble domains of the protein.Green

Figure 7 .
Figure 7. Cartoon depicting a method to optimize sidechain interactions and conformations.An overall score grid (black regular-spaced dots) is superimposed over the region of interest in the coordinate frame of the protein.Members (i, j, k, l) of the sidechain interaction map set for each residue are individually auditioned by frame-shifting onto the protein, followed by interpolation onto the score grid.Each point (xyz) in the score grid (*) is the sum (sxyz) of the values (vxyz) from each residue's auditioned map.The total score for this model (Smodel) is then the sum over all grid points in the grid.

Figure 7 .
Figure 7. Cartoon depicting a method to optimize sidechain interactions and conformations.An overall score grid (black regular-spaced dots) is superimposed over the region of interest in the coordinate frame of the protein.Members (i, j, k, l) of the sidechain interaction map set for each residue are individually auditioned by frame-shifting onto the protein, followed by interpolation onto the score grid.Each point (xyz) in the score grid (*) is the sum (s xyz ) of the values (v xyz ) from each residue's auditioned map.The total score for this model (S model ) is then the sum over all grid points in the grid.

Table 1 .
Enumeration of the datasets.

Table 2 .
Solvent-accessible a and lipid-accessible b surface areas (SASA and LASA, respectively) of the datasets [all surface areas (Å 2 )].
[41]ess square.parsenotationdescribes the residue's backbone angle and χ 1 bin (see text); b Clusters are named for the exemplar residue, i.e., closest to cluster centroid; c Number of residues of that type member of a specified cluster; d From GETAREA[41]; e Adapted from GETAREA results as described in text.
[41]ess square.parsenotationdescribes the residue's backbone angle and χ 1 bin (see text); b Clusters are named for exemplar residue, i.e., closest to cluster centroid; c Number of residues of that type member of a specified cluster; d From GETAREA[41]; e Adapted from GETAREA results as described in the text.

Table 5 .
Average interaction character a for the datasets by the residue type.

Table 6 .
Interaction character a by residue and cluster for the soluble and mS datasets.
a Interaction character of a cluster is calculated as described in the text: Hyd(−) is unfavorable hydrophobic, Hyd(+) is favorable hydrophobic, Pol(−) is unfavorable polar, and Pol(+) is favorable polar; b Chess square.parsenotation describes the residue's backbone angle and χ 1 bin (see text); c Clusters are named for the exemplar residue, i.e., closest to cluster centroid.

Table 7 .
Interaction character a by residue and cluster for the mL and mN datasets.