A Comparative Study of the Second-Order Hydrophobic Moments for Globular Proteins: The Consensus Scale of Hydrophobicity and the CHARMM Partial Atomic Charges

In this paper, the second-order hydrophobic moment for fifteen globular proteins in 150 nonhomologous protein chains was performed in a comparative study involving two sets of hydrophobicity: one selected from the consensus scale and the other derived from the CHARMM partial atomic charges. These proteins were divided into three groups, based on their number of residues (N) and the asphericity (δ). Proteins in Group I were spherical and those in Groups II and III were prolate. The size of the proteins is represented by the mean radius of gyration (Rg ), which follows the Flory scaling law, Rg ∝ Nν. The mean value of v was 0.35, which is similar to a polymer chain in a poor solvent. The spatial distributions of the second-order moment for each of the proteins, obtained from the two sets of hydrophobicity, were compared using the Pearson correlation coefficient; the results reveal that there is a strong correlation between the two data sets for each protein structure when the CHARMM partial atomic charges, |qi| ≥ 0.3, assigned for polar atoms, are used. The locations at which these distributions vanish and approach a negative value are at approximately 50% of the percentage of solvent accessibility, indicating that there is a transition point from hydrophobic interior to hydrophilic exterior in the proteins. This may suggest that there is a position for the proteins to determine the residues at exposed sites beyond this range.


Introduction
Protein structures are stabilized by several non-covalent interactions, such as hydrophobic, van der Waals, electrostatic, and hydrogen bonding interactions [1,2]. Of all the interactions that take place in proteins, hydrophobic interaction is a dominant force that drives protein folding. The interaction between non-polar amino-acid residues and the aqueous environment provides a strong hydrophobic force for protein folding [1,3], forming a hydrophobic core in the protein interior. Thus, the variation in hydrophobicity from the hydrophobic core to hydrophilic exterior is the spatial variation for most globular proteins. To quantify the spatial transition of hydrophobicity between the interior and the surface of globular proteins of known structure, Silverman [4] introduced the second-order hydrophobic moment, using the hydrophobicity consensus scale of Eisenberg [5]. The second-order hydrophobic moment, or quadrupole moment, is similar to the electrostatic quadrupole, and it was first proposed by Eisenberg [5] to describe the hydrophobic distributions of proteins. A second-order hydrophobic moment has also been applied to the systems [6] including native and decoy protein structures, and multi-domain proteins [7]. In contrast, the first-order hydrophobic moment, or hydrophobic dipole moment, is similar to the dipole moment of a molecule. The first-order hydrophobic moment has been used to measure the amphiphilicity of primary and secondary structures either in globular proteins [5,[8][9][10] or in transmembrane proteins [11,12]. Moreover, the zeroth-order hydrophobic moment is the total hydrophobicity of the amino-acid residues in a protein, and it is similar to a net molecular charge.
Residues classified as buried or exposed are conventionally described by a geometric parameter calculated using the solvent-accessible surface area (ASA) [13], which is generated by rolling a spherical probe with a radius of 1.4 Å over the surface of a protein. The ASA values obtained are in absolute values, and these can be changed to relative values, which are also known as the percentage of solvent accessibility p (%) [14,15]. These values have been calculated using the ASA value of each amino-acid residue in the native state normalized with respect to the ASA value of the corresponding residue X in the extended state either of Gly-X-Gly [16,17] or of Ala-X-Ala [14,15]. Recently, the prediction of solvent accessibility, based on protein sequences, has also been developed using support vector regression (SVR) [18][19][20][21]. Gromiha et al. [14,15] have classified these relative values to the locations of protein residues as follows: 0-2% as completely buried, 2-20% as a location between buried and partially buried, 20-50% as partially buried, and greater than 50% as completely exposed. This classification is used to study the stability changes for protein mutants.
Silverman [4] used an ellipsoidal representation for the shapes of proteins. The extent of an ellipsoid was defined by a distance, d, which was calculated from the molecular moments of geometry [4]. The second-order hydrophobic moment per residue in a protein was calculated from a small value of d to a larger value by increasing the value of d until all residues in the protein have been collected to obtain its spatial distribution of the second-order hydrophobic moment. In contrast, the percentage of solvent accessibility, p (%), is adopted in this work, and we employ successive increases of solvent accessibility; that is, increases from 0 to 100% are used to study the spatial distributions of the secondorder hydrophobic moment of the protein structures. The second-order hydrophobic moment per residue is calculated from a space defined by a small value of p (%), which collected the residues at the hydrophobic core of a protein to the larger value of p (%) at which all residues will be collected. To investigate spatial characteristics, the distances from the origin of a protein to the centroids of the protein residues in a space defined by p (%) are also considered. Here, we use the center of mass of each protein as the origin. In this work, two sets of hydrophobic parameters for each of the amino-acid residues in the protein structures are used in a comparative study of the second-order hydrophobic moments: one is based on the hydrophobicity consensus scale of Eisenberg [5], and the other is derived from the partial atomic charges assigned to atoms of a protein in the CHARMM program [22].
In this work, a different approach from the works of Silverman [4] is used to define the spaces in a protein; i.e., we employ p (%) to define the spaces to calculate the spatial distributions of the secondorder hydrophobic moment of a protein. The purpose of this study is to examine whether the hydrophobicity of each residue in a protein can be obtained using molecular modeling, and whether the values of hydrophobicity derived from the partial atomic charges assigned to atoms of a protein in the CHARMM program [22] are comparable with those of the consensus scale of Eisenberg [5] by using the spatial distributions of the second-order hydrophobic moment in the new definition of spaces in the protein. This work may provide an alternative way to calculate the hydrophobicity of each residue in a protein. Since the hydrophobicity of an amino-acid residue cannot be defined and measured easily, it is usually obtained from the free energy changes calculated by transferring amino-acid side chains from aqueous to non-aqueous media [23,24] or from non-aqueous to aqueous media [25].

Materials and Methods
A set of 150 nonhomologous protein chains was randomly selected from PDBSELECT [26], which included more than 4,000 protein chains. The sequence identity in this set of proteins is lower than 25%, and their single-chain protein crystal structures have been determined at a resolution of less than 1.8Å and at an R-factor of less than 0.18. All the structures of the proteins were obtained from the PDB [27]. A figure of asphericity (δ) and the index of 150 proteins are shown in Figure 1. In order to select protein structures with diverse sizes and shapes, fifteen structures, as shown in Table 1, were selected from these 150 proteins and divided into three groups, depending on the number of residues (N) and the extent of asphericity (δ).  The shape of a protein can be characterized in terms of eigenvalues of the radius of gyration tensor. The radius of gyration tensor T is defined as follows [28,29]: r r (2) where N r is the number of residues in the protein, and r i is the position vector for the centroids of the amino-acid residues. The matrix T can be diagonalized to obtain the three eigenvalues, α i (i = 1, 2, 3), of this tensor. The asphericity (δ) [28][29][30][31] has been used to characterize protein shapes, and  is computed as If δ = 0, the shape is a perfect sphere, the extent of asphericity can be referred as 0 < δ < 1, and if δ = 1, the shape is a rod. The shape parameter (S) [30,31] has been used to quantify the whole shape of a protein, and S is computed as If S = 0, the shape is a perfect sphere, if S < 0, the shape is oblate ellipsoid, and if S > 0 the shape is prolate ellipsoid. The shape of proteins can also be represented by semiaxes [32]; i.e., a = [5(α 1 +α 2 )/2] 1/2 and b = (5α 3 ) 1/2 , where α i was sorted according to increasing magnitude. If a ≈ b, the shape is close to a sphere, and if a < b the shape is a prolate ellipsoid. The molecular size of a protein can be probed using its mean square radius of gyration, 2 g R , defined as follows [33]: where N r , r i , and r CM are defined in Equation (2).
The solvent accessible surface area (ASA) for each atom or for each residue in the proteins of interest was obtained using the computer program ASC [34] with default parameters. The percentage of solvent accessibility, p (%), was calculated as follows: where ASA X,folded is the ASA of X residue in the folded state of a given protein and ASA X,G-X-G is the ASA of the corresponding residue X in tripeptide (Gly-X-Gly) with the extended state. The extended state (φ = Ψ = 180°) coordinates were generated using InsightII program [35]. The Gly residues to either side of interest residue X can provide steric screening effects of neighboring residues in the simulated models, and such steric screening effects can reduce the values of ASAs calculated for the X residues in the extended state. The results of the ASA for each residue type X are shown in Table 2  158.0 0.54 a Solvent-accessible surface area (ASA) of each residue type X in the tripeptide (Gly-X-Gly) at the extended state, calculated using ASC [34]; b The hydrophobicity consensus scale of Eisenburg [5]. Polar atoms were defined on the basis of the partial charges, which were assigned to atoms in a protein by the CHARMM program [22]. In this work, three sets of partial charges, namely, i q ≥ 0.25, 0.27, or 0.3 were assigned as polar atoms, and the remaining atoms were considered apolar. These three sets of partial charges were used to examine which one has the best correlation with the hydrophobicity consensus scale of Eisenberg [5]. Polar and apolar ASAs of each residue were calculated by summing the ASA of the respective polar and apolar atoms in the residue. The hydrophobicity, h i , for each amino-acid residue, i, in a protein was calculated as follows: where ASA apo and ASA po are the apolar and polar ASAs, respectively, in the residue. For a given protein, the h i value of each residue, either selected from the hydrophobicity consensus scale of Eisenberg [5], as shown in Table 2, or calculated according to Equation (8) was normalized by the following equation: where h is the mean of the h i for all residues in the protein, and σ(h) is their standard deviation. In this work, the second-order hydrophobic moment, H 2 (d i ), per residue is similar to the works of Silverman [4], given by where n p is the number of residues collected in the space defined by the percentage of solvent accessibility p (%), and d i is the distance from the protein centroid to the centroid of residue i, in the space. Here, the protein centroid is the coordinates of its center of mass (CM). In this work, the space surface is dependent on p (%). When one selects a particular value of P (%), the number of residues n p , is collected within the space. When a value of p (%) increases, the volume of the space is also increased and the number of residues n p , residing in the space increases as well. The increment of p (%) is from 0 to 100%. Therefore, the spatial distribution of the second-order hydrophobic moment per residue, H 2 (d i ), is represented as a function of p (%). The Pearson correlation coefficient [36], R, was used to measure the correlation between the second-order moments as follows: where x denotes the second-order hydrophobic moment calculated with the h i selected from the consensus scale (Table 2), and y denotes the second-order hydrophobic moment calculated with the h i derived from the CHARMM partial atomic charges using Equation (8). These x and y were normalized by Equation (9). x and y denote the mean of x and y, respectively. In this work, this equation is also used to test whether the correlation coefficient between two data sets is statistically significant to determine the equation of the regression line. The correlation coefficient calculated with Equation (11) is a linear correlation between the two data sets. If the correlation coefficient is close to 1, there is a very strong correlation between the two data sets.

Molecular Shape and Size
Fifteen protein structures (Table 1) in 150 nonhomologous protein chains were divided into three groups, depending on their asphericity (δ) as shown in Tables 3: Group I ranged from 1.49 × 10 −2 to 2.48 × 10 −2 , Group II from 1.05 × 10 −1 to 1.81 × 10 −1 , and Group III from 2.23 × 10 −1 to 3.54 × 10 −1 . This division was also dependent on the number of residues (N), as shown in Table 1, to ensure a diverse range in molecular size. A suitable representation of molecular shape may be the gyration tensor, as shown in Equation (1). Diagonalization [36] of the gyration tensor gives three eigenvalues (α 1 , α 2 , and α 3 ) as shown in columns 2, 3, and 4 of Table 3, respectively. Columns 5 and 6 of Table 3 show asphericity (δ) and the shape parameter (S), respectively. The values of δ were calculated with Equation (3), and those of S were calculated with Equation (4). The data show that the values of δ and of S for proteins in Group I are close to zero, suggesting that the shape of these proteins is close to a sphere. The shape of proteins can also be represented by semiaxes [32]. As shown in columns 7 and 8 of Table 3, the values of a and b in Group I are very close to each other, indicating that the molecular shape can also be identified as a sphere. As shown in Table 3, α 3 is the largest eigenvalue for each protein structure in Groups II and III, compared with 1  and 2  , so that the shape of proteins in Groups II and III can be concluded as a prolate ellipsoid [32]. Based on δ > 0 and S > 0, and a < b, one can also infer that the shape of the proteins in Groups II and III is a prolate ellipsoid. The last column in Table 3 lists the square roots of the mean square radius of gyration calculated according to Equation (6). The size of folded protein structures can be represented as the mean radius of gyration (R g ), which may follow the Flory scaling law [37][38][39], namely, , where N is the number of amino-acid residues in a protein and v is a scaling exponent. As shown in Figure 2, the values of the mean radius of gyration (R g ) and the number of residues (N) were linearized by applying the Flory scaling law on a double logarithmic scale and were well fitted to linear lines (y = mx + b) by using the least-squares method. The Pearson correlation coefficient between each pair of parameters (logN and logR g ) was calculated using Equation (11). The values for the correlation coefficient are 0.99, 0.98, and 0.98 for proteins in Group I, Group II, and Group III, respectively, indicating that the correlation between N and R g is highly significant. The values of v, obtained using the least-squares method, are 0.34, 0.39, and 0.32 for protein structures in Groups I, II, and III, respectively. These values are found to have a mean value of 0.35, which is close to 0.34 as predicted by the Flory scaling law for a polymer chain in a poor solvent [40][41][42]. Thus, these polypeptide chains may form a collapsed globule to minimize contact with the solvent molecules [42].  (1). The values were sorted according to increasing magnitude; b The asphericity (δ), calculated according to Equation

Distance Distribution Functions
The distances d (Å) from the protein centroid to the centroids of amino-acid residues in a protein, were grouped into 10 bins. The proportion of distances in each bin was plotted against d (Å) to obtain the distance distribution W (d), as shown in (Figures 3a, 3b, and 3c). Depicted in the figures are the results for protein structures in Groups I, II, and III, respectively. This scheme has been used to investigate the end-to-end distance distributions for poly(oxyethylene) chains [43]. The distance distributions may associate with the persistence length of a polymer chain described by the wormlike chain model [31]. The molecular size can also be characterized by the distance distributions. As shown in (Figures 3a, 3b, and 3c), these distributions all have a similar appearance, showing a single Gaussian-like distribution. These distributions, fluctuated about their most probable values, can be denoted as d d   (Å). A mean value, d (Å), which is associated with the mean radius of gyration, denotes the precise location of the most probable appearance of a distance distribution, and a value of d  (Å) represents the breadth of its fluctuation. An increased extent of protein asphericity is correlated with increased fluctuations. For example, as shown in Table 1 Table 1.

The Second-Order Hydrophobic Moment
The detailed results for the second-order moment per residue, calculated according to Equation (10) with hydrophobicity i h selected from the consensus scale of Eisenberg (Table 2), are represented in Table 4 and Figure 4. Table 2 was normalized by Equation (9). Table 4 shows the details of the spatial distribution of the second-order hydrophobic moment per residue for protein 3PYP (see Table1) at a 2% resolution. The spatial distribution was calculated from the space collecting the residues that are completely buried (0%) to that collecting the residues that are completely exposed (100%) to the solvent. Figure 4 shows the distribution of the second-order hydrophobic moment per residue, H 2 (d i ), as shown in Table 4, as a function of p (%); it is clear that the distribution transfers from the hydrophobic core with positive values of H 2 (d i ) to the hydrophilic sites with negative values of H 2 (d i ).  . The spatial distribution of the second-order hydrophobic moment per residue, H 2 (d i ), for the protein 3PYP, plotted as a function of the percentage of solvent accessibility, p (%). The second-order hydrophobic moment was calculated using i h selected from the consensus scales (Table 2).   Table 5 shows the results of the Pearson correlation coefficient calculated using Eq. (11) for the correlation between two data sets of the second-order hydrophobic moments, calculated according to Equation (10) using two different sets of hydrophobic parameters: one is the hydrophobicity, i h , selected from the consensus scale (Table 2) and the other is the i h calculated according to Equation (8) using the CHARMM partial atomic charges assigned to polar atoms with i q ≥ 0.25, 0.27, or 0.3. The both data sets were normalized by Equation (9). Comparing the values in each row of Table 5 clearly indicates that the values in the last column of Table 5 give the highest correlations, which were calculated with i q ≥ 0.3, for individual proteins. The mean values of correlation coefficient shown in the last column of Table 5 are 0.91 ± 0.02, 0.85 ± 0.03, and 0.86 ± 0.07 for proteins in Groups I, II, and III, respectively, showing that Group I with the lowest values of asphericity gives the strongest correlation.  (Table 2) and the i h calculated according to Equation (8) with the CHARMM partial atomic charges, i q ≥ 0.3, assigned for polar atoms. The Pearson correlation coefficients between the spatial distributions of the second-order hydrophobic moments (solid and dashed lines) are 0.91, 0.91, and 0.90 for proteins IE70, 1JAE, and 1LAM, respectively. Figures 5a, 5b, and 5c show the comparative plots for the spatial distributions of the second-order hydrophobic moment for protein structures IE70, 1JAE, and 1LAM, respectively, using Equations (10) and (11). Two data sets of second-order hydrophobic moments depicted in each of these figures were calculated using the hydrophobicity, i h , selected from the consensus scale (Table 2) and the i h calculated according to Equation (8) using the CHARMM partial atomic charges assigned to polar atoms with i q ≥ 0.3, respectively. As shown in (Figures 5a, 5b, and 5c), the values of i h calculated using the CHARMM partial atomic charges overestimate the values of the second-order hydrophobic moments when p (%) is around zero. This is due to the fact that the denominator (ASA apo + ASA po ) of Equation (8) yields small values at the hydrophobic core. In contrast, the other areas of p (%) give rise to a very strong Pearson correlation, having the values are 0.91, 0.91, and 0.90. Thus, i h calculated using the CHARMM partial atomic charges may provide an alternative way to assign the hydrophobicity for each amino-acid residue in a protein. Moreover, as shown in (Figures 5a, 5b, and 5c), the second-order hydrophobic moments give an overall shape of spatial distributions from the hydrophobic interior to the hydrophilic exterior. These results are in agreement with those of Silverman [4]. Table 6 shows the locations at which the spatial distribution of the second-order hydrophobic moments for the fifteen proteins vanishes and approaches a negative value. These locations are denoted as  p (%) or  d (Å) for the space defined by the percentage of solvent accessibility, p (%), or the average distance from the protein centroid to the centroids of the amino-acid residues in the same space, respectively. These locations show the transition from the hydrophobic core of the proteins to their hydrophilic exterior. Data listed in columns 2 and 3 of Table 6 are the values of the locations,  p (%), calculated from the CHARMM partial atomic charges, i q ≥ 0.3, assigned for polar atoms, and from the consensus scale (Table 2), respectively, using the spaces defined by the percentage of solvent accessibility, p (%). The two data sets were normalized by Equation (9). Values in columns 4 and 5 of Table 6 are the average distances,  d (Å), calculated using the same parameters of i h as those shown in columns 2 and 3 of Table 6, in the space corresponding to the percentage of solvent accessibility,  p (%), shown for reference. As shown in Table 6, the values of  p (%) in column 2 have an average value of 46.5 ± 7.9 %, whereas those of  p (%) in column 3 have an average value of 57.9 ± 8.3 %. Thus, the hydrophobicity, i h , calculated with the consensus scale yields a higher average value than that calculated with i q ≥ 0.3. The result of 46.5 ± 7.9 % may provide a criterion to determine whether residues in a protein are buried or exposed to a solvent. Gromiha et al. [14,15] have studied mutations in different solvent accessibilities and have considered the mutations that are exposed when the percentage of solvent accessibility, p (%), is greater than 50%.

Conclusions
A set of 150 nonhomologous protein chains has been randomly selected from PDBSELECT [26], which consisted of more than 4,000 protein chains. Fifteen protein structures have been selected from a set of 150 nonhomologous protein chains, based on the number of residues (N) and the extent of asphericity (δ), to ensure diversity of the molecular sizes and shapes. These fifteen protein structures have been the subject of a comparative study using second-order hydrophobic moments. This may provide an alternative approach to the assignment of hydrophobicity to each amino-acid residue in the proteins by using the CHARMM partial atomic charges. The Pearson correlation coefficient for each protein structure has been used to compare the spatial distributions of second-order hydrophobic moments calculated using the hydrophobicity , h i , obtained from the consensus scale of Eisenberg [5] with that calculated using Equation (8) for the h i derived from CHARMM partial atomic charges assigned to atoms in a protein. These two data sets were normalized by Equation (9). The Pearson correlation coefficient between the two data sets shows a strong correlation for each of these fifteen protein structures when the absolute values of the CHARMM partial atomic charges are greater than 0.3, assigned for polar atoms. These comparative results suggest that the hydrophobicity of each type of amino-acid residue in the proteins can probably be obtained using the CHARMM partial charges assigned to atoms in amino-acid residues. The spatial distributions of the second-order hydrophobic moment have the overall shape of the transition from hydrophobic interior to hydrophilic exterior. The transition at a position where the value of the second-order hydrophobic moment vanishes and approaches a negative value was denoted as  p (%), and the average value of  p (%) is 46.5 ± 7.9%, which has been calculated with CHARMM partial atomic charges, i q ≥ 0.3, assigned for polar atoms.
This may give the value of the transition point from the buried sites to exposed sites for amino-acid residues in the protein structures.