Statistical Estimation of the Protein-Ligand Binding Free Energy Based On Direct Protein-Ligand Interaction Obtained by Molecular Dynamics Simulation

We have developed a method for estimating protein-ligand binding free energy (ΔG) based on the direct protein-ligand interaction obtained by a molecular dynamics simulation. Using this method, we estimated the ΔG value statistically by the average values of the van der Waals and electrostatic interactions between each amino acid of the target protein and the ligand molecule. In addition, we introduced fluctuations in the accessible surface area (ASA) and dihedral angles of the protein-ligand complex system as the entropy terms of the ΔG estimation. The present method included the fluctuation term of structural change of the protein and the effective dielectric constant. We applied this method to 34 protein-ligand complex structures. As a result, the correlation coefficient between the experimental and calculated ΔG values was 0.81, and the average error of ΔG was 1.2 kcal/mol with the use of the fixed parameters. These results were obtained from a 2 nsec molecular dynamics simulation.

There have been several reports on protein-compound docking and free energy calculation by molecular dynamics (MD) simulation. Even if a protein-ligand complex structure is unknown, ab initio MD docking simulations show protein-ligand complex structures and free energy landscapes [11][12][13][14]. Generalized ensemble methods have been adopted for wide conformational searches [15][16][17][18].
In an explicit water model, if a protein-ligand complex structure is known, the binding free energy and the potential of mean force (PMF) along the dissociation path can be obtained by using the filling potential (FP) method [18], the meta-dynamics method [19,20], the smooth reaction path generation (SRPG) method [21], or Jarzynski's method [22]. We previously proposed the FP and SRPG methods [18,21], each of which generates a reaction path (dissociation path) of the ligand and calculates the free energy surface along the path based on ab initio MD simulation. The other trend is the application of Jarzynski's equation [22]. In this method, a harmonic potential that restrains the ligand at a particular position moves slowly and leads the ligand from the binding state to the dissociation state, and the free energy profile is calculated. Among these methods, MP-CAFFE has been applied to various species, and the G estimation error was almost 1 kcal/mol [23].
The molecular-mechanics Poisson-Boltzmann surface-area (MMPBSA) method [24] and the linear interaction energy (LIE) method [25] have successfully been used to reproduce the trend of Gs for a single target protein. These methods are much faster than the ab initio MD methods described above. In the LIE method, G is evaluated based on the average van der Waals (vdW) energy and the average electrostatic energy. The weight parameters of the vdW and electrostatic terms are optimized for each target. To apply the LIE method, multiple active compounds and their docking poses are necessary in order to optimize the parameters for each target protein.
The COMBINE method is based on the assumption that biological activities can be correlated with a linear combination of a subset of the van der Waals and electrostatic terms of the interaction energies between a ligand and its surrounding protein residues (such as the target receptor) [26,27]. The protein-ligand binding free energy G is given by: where E i vdw and E i ele are the van der Waals and electrostatic terms of the interaction energies, respectively, between the ligand and the i-th residue of a protein (the target protein) and c is a constant. w i vdw and w i ele are parameters to be determined to reproduce the experimental data.
The coefficients of Equation (1) (w i vdw and w i ele ) could be determined by partial least squares (PLS) analysis. As is the case with the LIE, to apply the COMBINE method, multiple active compounds and their docking poses are necessary in order to optimize the parameters for each target protein. It has been shown that the COMBINE analysis predicts binding free energies with good accuracy and also identifies important amino acid residues for the improvement of affinity [26,[28][29][30][31][32].
In the present study, we propose a G estimation method based on the direct protein-ligand interaction obtained by molecular dynamics simulation. We introduced the entropy term and the local effective dielectric constant, and modified the van der Waals potential to improve the accuracy of the present method so that it does not require multiple active compounds to predict the G value.

Theoretical Background
G is calculated by Zwanzig equation as follows [33]: where U b , U u , and < > b represent the potential of the protein-ligand bound and unbound states and the average over the bound-state trajectory, respectively.
Kubo's cumulant expansion gives the following equation excluding the log and exp functions as [34]: The first term (linear term) corresponds to the enthalpy, and the higher-order term corresponds to the entropy. The second term becomes: When we assume: Then, the second term of Equation (3) becomes: And Equation (3) becomes: The linear term (the first and second terms; <U b > b − <U u > u ) of Equation (7) corresponds to the LIE approximation, when the energy difference is due to the receptor-ligand and solvent-ligand interaction energies. The LIE approximation calculates G by: where E vdW X and E ele X are the protein-ligand van der Waals interaction energy and the electrostatic interaction energy between the ligand and the surrounding molecules, R-L/S-L represents the interaction of the protein-ligand complex system/solute-ligand system, and the brackets (< > X ) represent the average over simulation of the protein-ligand complex system (RL) or the solute-ligand system (SL). The LIE equation includes two parameters:  and . These parameters are known to reproduce the experimental data for each target protein.
The COMBINE approximation calculates the G value by: where the E vdW (i) and E ele (i) are the protein-ligand van der Waals interaction energy and the electrostatic interaction energy between the i-th residue of the protein and the ligand, and w is the parameter. The simulation of the ligand-solution system is not necessary. Both the LIE approximation without solvent and the COMBINE approximation with the residue-independent w parameter gave the same equation:

Entropy Term
In the present study, we introduced the entropy term in Equation (10) as follows. We call this method the direct interaction approximation without solvent (DIAV) method: where E vdW (i) and E ele (i) are the vdW and electrostatic interactions between the i-th residue of the protein and the ligand, respectively. Svdw(i) and Sele(i) are fluctuations of E vdW (i) and E ele (i) during the molecular dynamics simulation, respectively. The *S x term represents the energy fluctuation of the system corresponding to the second-order term of Equation In Equation (11), the *S x term is the fluctuation of energy, but we found that the energy fluctuation itself is not suitable for evaluating G. Instead of the energy, S x is the fluctuation of a property x that is related to the energy. The properties x in the current study are the accessible surface area (x = ASA), the dihedral angles (x = DIH), the vdW potential (x = vdW), and the electrostatic potential (x = ELE) of the protein-ligand complex structure. In the present study, we determine which property is best for estimating G. There are five parameters: , 2, , 2, and .

Modification of van der Waals Potential Term
To represent the van der Waals (vdW) interaction, a Lennard-Jones (LJ) 12-6-type function is used. In the docking score, the vdW interaction (lipophilic atom contact) term represents both the vdW interaction and the cavity formation energy in solvent; in water, the latter is 10 times greater than the vdW interaction. This function gives very large values when atomic conflicts occur. To reduce these conflicts, an LJ 9-6-type function has been used in a protein-ligand docking study [3]. In general, the absolute value of the vdW interaction is much smaller than the G value. The LJ 12-6 value represents the atomic contact and its hydrophobic interaction. Thus, in the present study, we apply LJ 12-6, LJ 9-6, LJ 8-4, and LJ 6-3-type functions as follows: ] where R e is the equilibrium distance. The R e and the well depth values are set to the same values obtained from AMBER param99 [35] and the general AMBER force field (GAFF) [36]. The data-sampling MD simulation is performed with the conventional AMBER force field (LJ 12-6 potential), and the analysis is performed using Equations (12)-(15).

Effective Dielectric Constant
In the ligand-binding pocket, the effective dielectric constant ( eff ) should be different at each point, since the  eff values of proteins are 2-4 and the  eff of water is 78.5. The E ele (i) should be scaled by the  eff . We introduced the modification of the electrostatic interaction as follows (we call this method the direct interaction approximation with solvent (DIAS) method): where E mod ele (i) is the E ele (i) value scaled by the  eff . The  eff value could be calculated from the ratio between the electrostatic force calculated in the explicit water model and that in vacuum, as follows: where E j ele (i) is the electrostatic interaction between the i-th residue and the j-th atom of the ligand in vacuum. The following scale factor might be a candidate: where F i real and F i vac are the electrostatic force acting on the i-th atom of the protein considering the solvent and not considering the solvent in the explicit water model, respectively. The F real and F vac were calculated by the molecular dynamics simulation in the explicit water model and in vacuum, respectively. The scale factor ' eff i by Equation (18) or (19)could be unrealistically large when the denominators of Equations (12) and (13) are nearly zero. Thus, we introduce a parameter x and the scale function as follows: The value of  eff i in Equation (20) is 1 <  eff i , while the actual  eff value could be less than 1. But we introduced parameter  in Equation (16), thus the actual  eff parameter is  eff i /. In the following analysis, the factor  eff i in Equation (20)is used as the actual scale factor.

Examination of Entropy Term
We applied the DIAV method (Equation (11)) to the protein-ligand complex structures to examine the entropy property term Sx and performed the leave-one-out cross-validation test, as summarized in Table 1, which also summarizes the optimized parameters. The vdW potential is the LJ 12-6 type. a: property (x) of Equation (11). Here 2 = 2 = 0. The energies are presented in kcal/mol, and R represents the correlation coefficient.
In the leave-one-out cross-validation test, one data is selected as the test data that is to be predicted and the other data are used as the teaching data to generate the prediction model equation. The test data is selected one after another in the given data set until all data are selected as the test data. The vdW energy term was set to an LJ 12-6 function, and the dielectric constant was set to 1. The values of the parameters 2 and 2 were set to zero. The ASA parameters (atomic solvation parameter and radius of each atom) were obtained from a previous study [37]. In the present study, the parameters of Equation (11) were optimized by the least-squares deviation error of the G values. Compared to the results obtained by the simplified version of the COMBINE method (Equation (10)), the DIAV method (Equation (11)) slightly improved the accuracy of G.

Examination of vdW Term
We examined the DIAV method with the vdW term using Equations (12)- (15). The results and the optimized parameters are summarized in Table 2. The  value was set to 1. The entropy property x was set to the ASA. We also examined the case of x = DIH, and the result was quite similar to that obtained in the case of x = ASA. The LJ 8-4-type function gave the best result among the four functions (LJ 12-6, LJ 9-6, LJ 8-4, and LJ 6-3) in the leave-one-out cross-validation test, while the accuracy obtained was similar among the functions. Thus, the LJ6-3 and LJ4-2−type functions were not used in the following study; instead we focused on the LJ8-4 function. The energies are presented in kcal/mol, and R represents the correlation coefficient.
The vdW parameters represent both the protein-ligand vdW interaction and the hydrophobic interaction. In the present study, however, the number of data were limited to the optimization of the parameters, and then we used just the original vdW parameters.

Examination of 2 and 2 Parameters
We examined the parameters 2 and 2 of the DIAV method (Equation (11)). The vdW potential was set to the LJ 8-4-type function, and the dielectric constant was set to 1. The entropy property x was set to the ASA. We also examined the case of x = DIH; the result was quite similar to that obtained in the case of x = ASA. The leave-one-out cross-validation results and the optimized parameters are summarized in Table 3. The optimized 2 and 2 were about 0.01 and −0.0013, respectively, and the modulated vdW and electrostatic energy values were close to the original (intact) values. Actually, the parameters 2 and 2 improved the G estimation accuracy, and the equation includes five parameters (, , , 2, and 2). The two additional parameters (2 and 2) slightly improved the average accuracy.

Examination of Effective Dielectric Constant Term
We applied the idea of the effective dielectric constant. We applied the DIAS method (Equation (16)) to the estimation of G using the  eff defined by Equations (18) and (19). The leave-one-out crossvalidation results and the optimized parameters are summarized in Table 4. The vdW potential is the LJ 8-4 type. The energies are presented in kcal/mol, and R represents the correlation coefficient. Table 4. Cross-validation results obtained by Equation (10), the DIAV (Equation (11)), and the DIAS (Equation (16)) methods.

G DIAV (Equation (11)) (kcal/mol)
G DIAS (Equation (16) The vdW potential is the LJ 8-4 type. The property x of Sx is the ASA. The energies are presented in kcal/mol, and R represents the correlation coefficient. As with the results described above, the best property x among the four properties (ASA, DIH, vdW, and ELE) was the ASA. The DIAS results obtained by Equation (19) were better than those obtained by Equation (18). The DIAS results in Table 4 were obtained by using Equation (19). The consideration of  eff slightly improved the G estimation. As a result, the correlation coefficient between the experimental and the calculated G values was 0.81, and the average error of G was 1.2 kcal/mol. This result greatly improved the results obtained by Equation (10). Figure 1 shows the correlation between experimental and calculated G values obtained by the DIAS method (Equation (16)). We examined the time-dependency of the G obtained by the DIAS method. After 1 nsec MD simulation for equilibration, the sampling runs of 0.5 nsec, 1 nsec, 1.5 nsec and 2 nsec were performed.
The G values did not depend on the sampling-time length so much. Namely, the average error over the 34 target protein-ligand complexes were 1.43 kcal/mol, 1.22 kcal/mol, 1.23kcal/mol and 1.23 kcal/mol for the 0.5 nsec, 1 nsec, 1.5 nsec and 2nsec sampling times, respectively. The initial structures of these simulations were the experimentally obtained protein-compound complex structures. Thus, the protein-compound interaction did not depend on the sampling-time length so much.
The results showed that the current method worked well for various target proteins. This method could be extended as: where ,  2 , ,  2 , and  x are the parameters. This extension is one of the generalized forms of Equation (16). We examined the combination of two properties out of five. The averaged error was increased by the combination of two entropy terms. Thus, Equation (16) is simple and accurate compared to Equation (21). We applied the generalized Born surface area (GBSA) method [37][38][39] for the G calculation to the same protein-compound set used in the current study. The average error and the correlation coefficient between the experimental and calculated G values were 51.7 kcal/mol and 0.03 that showed very weak correlation, respectively. The GBSA method is good to reproduce the trend of G values of many ligands for one target protein. In the current study, each target protein has only one or a few ligands. The error of the G obtained by the GBSA method is large, thus, the GBSA method could not reproduce the trend of the G value in the current study. In this examination, the DIAS/DIAV methods showed the better results than the GBSA method.

Application to Docking-Pose Prediction
To evaluate the present method, we applied the DIAS method to the protein-ligand docking pose prediction. Usually, only 20%-30% of the docking poses generated by the protein-ligand docking program are correct (RMSD < 2 Å) in the cross-docking test, whereas 50%-70% of the docking poses generated by the protein-ligand docking program are correct (RMSD < 2 Å) in the self-docking test [7]. Of course, the cross-docking test is necessary for practical evaluation of the protein-ligand docking. In this section we mimicked the cross-docking test. We selected the docking poses by both the DIAS method (Equation (16)) and the docking program (Sievgene/myPresto [7]), then compared the results. The docking score of Sievgene was determined as: (22) where N rot , E ASA , E vdW , E ele , E hyd , and E intra-vdW represent the number of rotatable bonds of the ligand molecule, the hydrophobic energy due to the accessible surface area, the vdW energy, the protein-ligand Coulombic potential, the hydrogen bond energy, and the intramolecular vdW energy of the ligand for Sievgene [7]. Also, c rot , c AV , c ele , c hyd , and c intra-vdW are the optimized coefficient for each energy term. For each atom type, the sum of E ASA and E vdW gives a grid potential, and both energy terms are always simultaneously calculated. Thus, these two terms share the same coefficient, c AV . Sievgene utilizes the grid potential to calculate each energy term except for the intramolecular interactions. In this study, a mesh size of 60 × 60 × 60 was adopted.
In this test, we prepared three types of protein structures: (model 1) the intact protein structure prepared in Section 2, (model 2) the energy-minimized structure of apo protein in water, and (model 3) the final structure of 2-nsec MD simulation of apo protein in water. The Sievgene docking program generated five docking poses for each target protein of the three prepared structures (models 1-3). Then each protein-ligand complex structure was evaluated by the DIAS (Equation (16)) with the fixed parameter described in Table 5 in the same manner described in the previous section (the vdW function was the LJ 8-4 type function, and the property x of Sx was the ASA). The best score poses were selected by Sievgene based on docking score, and the best G poses were selected by DIAS. The results are summarized in Table 5. When the energy-minimized structures (model 2) were used, the results obtained by the DIAS method were much better than the Sievgene results. The DIAS method selected the correct poses at a rate of 73% (RMSD < 2 Å). Even if the DIAS method selected the best docking poses among the five poses generated by Sievgene, 93% of the five generated poses satisfy the RMSD < 2 Å. Thus, the DIAS method selected 78% (73% out of 93%) of the correct poses. This shows that the DIAS method is useful for practical pose prediction in drug design.
When the initial structures (model 1) were used, the Sievgene results were better than the results obtained by the DIAS method. This is a trivial self-docking test, and the MD simulations for energy calculation should slightly change the ligand coordinates from the crystal structures by thermal fluctuation. When the final structures of the MD simulation (model 3) were used, only 33.3% of the docking poses were correct (RMSD < 2 Å) by Sievgene and the DIAS method. Still, the results obtained by the DIAS method were better than those obtained by Sievgene. The shapes of the ligand-binding pockets should be changed from their suitable structures after the MD simulations. This model structure is not suitable for docking studies.

Data Preparation
To determine the coefficients for the G score, we performed a protein-ligand docking simulation based on the known complex structures registered in the Protein Data Bank. Here, 34 complexes accompanied by the experimental binding free-energy values were selected from the database that was used to determine the G scores of the PRO_LEADS [6]. The PDB identifiers and the names are summarized in Table 6. In the test dataset, the metalloproteins were removed from the present analysis. Metal atoms (Zn and Fe atoms) formed covalent bonds with O and S atoms of the ligands, and the classical force field that we applied could not represent the covalent bond. Thus, the present method cannot calculate G values for metalloproteins with high precision. The structural ensembles generated from the PDB structure given by MD in explicit water were prepared as follows. All target proteins were prepared with ligands (protein-ligand complex structure). The force fields and the charges of the protein atoms originated from AMBER parm99 [35]. The atomic charge of each ligand was determined by the restricted electrostatic point charge (RESP) procedure using HF/6-31G*-level quantum chemical calculations [40]. We used Gaussian98 to perform the quantum chemical calculations [41]. The whole structure of each protein was embedded in a sphere of TIP3P [42] water (CAP water), including ion particles of 0.1% Na + and Cl − , in order to neutralize the total charge of the systems. The center of the sphere was set at the mass center of the protein. The shortest distance between the protein atom and the CAP sphere wall was set to 10 Å. Before an MD calculation was performed for the entire system, an MD calculation for only the solvent parts (solvent water and counter ions) was performed with the protein, ligand, and metal ion coordinates fixed, so as to bring the solvent parts sufficiently close to an equilibrium state. The SHAKE method was used to constrain covalent bonds between heavy and hydrogen atoms in any molecule in the system [43]. MD simulations of the entire system were performed using 2.0 fsec time steps with the temperature set at 310 K; the fast multipole method [44] was used to calculate the Coulombic interaction. The cutoff distance of the van der Waals interaction was 12.0 Å. The MD simulations were performed by using cosgene/myPresto [18]. After equilibration steps of 1,000 psec, the protein coordinates were sampled every 1 psec. Finally, we obtained 1,000 structures for each target protein in the 1,000 psec production run. The software program myPresto version 4 [45] was used for the simulation.

Conclusions
We have developed the direct interaction approximation (DIA) method and examined both the direct interaction approximation without solvent (DIAV) and with solvent (DIAS) methods. The results showed that the inclusion of the fluctuation of the ASA/dihedral angle terms drastically improved the accuracy of G. The DIAV method (Equation (16)) was the final form for the simple and accurate estimation of G. The effective dielectric constant should be calculated by Equations (19) and (20), and the vdW potential should be the LJ 8-4-type function. This equation included six parameters: ,  ,  and x. The six optimized parameters could be applied to all of the target proteins.
In the explicit water model, the DIA (DIAV and DIAS) methods required only the MD simulation of the protein-ligand complex. The DIA method with the LJ 8-4-type function improved the accuracy of the calculated G value drastically: the correlation coefficient between the experimental and the calculated G values was improved to 0.8 as obtained by the DIAV method, from 0.7 as obtained by the simplified COMBINE method without the entropy term (Equation (10)), and the average error of G was improved to 1.2 kcal/mol as obtained by the DIAS method, from 1.9 kcal/mol as obtained by Equation (10).