A Physics-Guided Neural Network for Predicting Protein–Ligand Binding Free Energy: From Host–Guest Systems to the PDBbind Database

Calculation of protein–ligand binding affinity is a cornerstone of drug discovery. Classic implicit solvent models, which have been widely used to accomplish this task, lack accuracy compared to experimental references. Emerging data-driven models, on the other hand, are often accurate yet not fully interpretable and are also prone to overfitting. In this research, we explore the application of Theory-Guided Data Science to the study of protein–ligand binding. A hybrid model is introduced by integrating a Graph Convolutional Network (data-driven model) with the GBNSR6 implicit solvent model (physics-based model). The proposed physics-data model is tested on a dataset of 368 complexes from the PDBbind refined set and 72 host–guest systems. Results demonstrate that the proposed Physics-Guided Neural Network successfully improves the “accuracy” of the pure data-driven model. In addition, the “interpretability” and “transferability” of our model have improved compared to the purely data-driven model. Further analyses include evaluating model robustness and understanding the relationships between the physical features.


Introduction
Proteins perform biological functions through interaction with other biomolecules, including ligands, which are (small) molecules capable of binding to a target protein, often with high affinity and specificity [1]. Protein-ligand interaction is central to several biological processes, e.g., cellular signal transduction, viral invasion, DNA replication, and cellular energy production [2]. It also has vast applications in the early stages of drug discovery [3]. One key component of protein-ligand interactions is the binding free energy change, ∆∆G, that occurs upon the ligand's attachment to the protein. This physicochemical quantity heavily dictates how strongly a protein and ligand interact and is particularly useful for drug design [4]. While wet-lab experiments accurately estimate ∆∆G, they are significantly slow, costly, and laborious. On the other hand, computational simulations enable significantly faster estimation of ∆∆G and shed light on the binding mechanisms of various structures [5].
A wide range of computational methods trades off between physical rigor and computational time for ∆∆G calculations. On one side of the spectrum, there is molecular docking [6,7], which provides on-the-fly determination of the best poses of a ligand based on different measurements, including ∆∆G. Such methods have vast usage in high-throughput virtual screening and lead optimization, where a quick ranking of candidate drugs is central. However, molecular docking fails to calculate ∆∆G accurately due to many rough approximations, e.g., of receptor flexibility, strain energies, and various entropies. Results of the proposed hybrid model have been compared with reference experiments as well as with pure data-driven and physics-based models.

Physics-Based Model: GBNSR6
Accurate calculation of free energy is highly dependent on precise measurements of electrostatic interactions in an aqueous environment. To accomplish this, biomolecules are often placed in a solvent that is usually water accompanied by counter-ions to ensure a neutral environment. This model is called the explicit solvent model, where the water molecules are explicitly measured in calculating the electrostatic energy of the system. Although this method proved to be highly accurate, it suffers from substantial computational costs, especially for large structures. One alternative solution is implicit solvent modeling, where explicit water molecules are replaced with an infinite continuum medium that has equivalent dielectric properties as water [19].
Binding free energy (∆∆G) of a molecular system is calculated as

∆∆G = ∆H − T∆S,

where ∆H is the enthalpy change, ∆S is the entropy change of the system, and T is the temperature in Kelvin. Enthalpy is a property of a thermodynamic system, defined as the sum of the system's internal energy and the product of its pressure and volume. In the implicit solvent framework, ∆H is decomposed into the gas-phase molecular mechanics energy, ∆E_MM, and the solvation free energy, ∆G_solv. ∆E_MM consists of the changes in internal energy, van der Waals energy, and electrostatic energy. ∆G_solv consists of polar and nonpolar components. ∆G_pol (the largest component of the total energy for biomolecules) is calculated with either a PB or a GB model. GB is chosen in this research since it has been shown to be an efficient approximation of PB with quite accurate results [40]. A GB model with the ALPB correction [46] has the following formula:

∆G_pol ≈ −(1/2) (1/ε_in − 1/ε_out) [1/(1 + αβ)] Σ_ij q_i q_j (1/f_ij^GB + αβ/A),

where ε_in and ε_out are the dielectric constants of the solute and the solvent, respectively, β = ε_in/ε_out, α = 0.571412, and A is the electrostatic size of the molecule, which is essentially the overall size of the structure and can be computed analytically [46]. We employ the most widely used functional form [47],

f_ij^GB = [r_ij² + R_i R_j exp(−r_ij²/(4 R_i R_j))]^(1/2),

where r_ij is the distance between atomic charges q_i and q_j, and R_i, R_j are the so-called effective Born radii of atoms i and j, which represent each atom's degree of burial within the solute. In GBNSR6, effective Born radii are calculated with the "R6" equation [40]. The grid-based implementation of GBNSR6 is freely available in the AMBER suite of biomolecular simulation programs [48]. Entropy is a measure of the molecular disorder or randomness of a system. It is associated with the conformational energy loss when a free-state ligand binds to the corresponding unbound-state receptor. Standard methods for estimating the entropic component are normal mode analysis (NMA) [49] and the quasi-harmonic approximation [50].
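As an illustration of the ALPB-corrected GB model and the pairwise functional form discussed above, the polar solvation energy can be evaluated numerically. The sketch below is ours, not GBNSR6's implementation: the charges, Born radii, and electrostatic size A passed in are illustrative placeholders, and no unit conversion is applied.

```python
import numpy as np

def gb_alpb_polar_energy(coords, charges, born_radii,
                         eps_in=1.0, eps_out=78.5, A=20.0, alpha=0.571412):
    """Polar solvation energy under the ALPB-corrected GB model (sketch).

    coords: (N, 3) atomic positions; charges: (N,) partial charges;
    born_radii: (N,) effective Born radii; A: electrostatic size of the
    molecule (placeholder value). Energy is in the same arbitrary units
    as q_i*q_j/r, with no conversion to kcal/mol.
    """
    beta = eps_in / eps_out
    # All pairwise squared distances, (N, N)
    r2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    RiRj = born_radii[:, None] * born_radii[None, :]
    # Widely used functional form:
    # f_ij = sqrt(r_ij^2 + R_i R_j exp(-r_ij^2 / (4 R_i R_j)))
    f_gb = np.sqrt(r2 + RiRj * np.exp(-r2 / (4.0 * RiRj)))
    qq = charges[:, None] * charges[None, :]
    prefac = -0.5 * (1.0 / eps_in - 1.0 / eps_out) / (1.0 + alpha * beta)
    # Sum over all pairs (i = j terms give the Born self-energies)
    return prefac * np.sum(qq * (1.0 / f_gb + alpha * beta / A))
```

Note that the energy is quadratic in the charges, which the assertions below exploit as a sanity check.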
Due to its computational complexity, though, the entropy calculation in many studies of free energy is ignored. In this study, entropy is estimated as a new feature by subtracting the enthalpy calculated by the physics-based component from the experimental reference values. See Materials and Methods for more information.

Data-Driven Model: GCN
Graph convolutional network (GCN) is a specialized convolutional neural network that accepts graphs as input and applies several layers of filters to learn patterns [51,52]. Specifically, a GCN takes as input a feature description x_i for every node i, summarized in an N × D feature matrix X, where N is the number of nodes and D is the number of input features per node, together with a representative description of the graph structure in matrix form, typically an adjacency matrix A. The output is a node-level N × F feature matrix Z, where F is the number of output features per node. A graph-level output, z, can be modeled by introducing some form of pooling operation. With that, every neural network layer can be written as a nonlinear propagation function H^(l+1) = f(H^(l), A), with H^(0) = X and H^(L) = Z (or z for graph-level outputs), L being the number of layers. Specific models then differ only in how f(·,·) is chosen and parameterized.
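As a concrete instance of the propagation rule H^(l+1) = f(H^(l), A), the sketch below implements the common Kipf–Welling parameterization with a mean-pool readout. This is one choice of f among many, offered for illustration, not necessarily the parameterization used by the models in this work.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN propagation step H^(l+1) = f(H^(l), A), Kipf-Welling style.

    H: (N, D) node feature matrix; A: (N, N) adjacency matrix;
    W: (D, F) trainable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^(-1/2)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.tanh(A_norm @ H @ W)            # nonlinearity

def graph_readout(Z):
    """Graph-level output z via a simple pooling (mean over nodes)."""
    return Z.mean(axis=0)
```

Stacking L such layers with H^(0) = X yields the node-level output Z; `graph_readout` then collapses it to the graph-level z.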
In this work, two GCNs available in DeepChem [31,53] are employed: the GraphConv model and the AtomicConv model. The GraphConv model implements the graph convolutional model presented in [54]. These graph convolutions start with a set of per-atom descriptors for each atom in a molecule, then combine and recombine the descriptors over convolutional layers. The AtomicConv model functions as a variant of graph convolution [31]; the difference is that the "graph" in this model is the nearest-neighbors graph in 3D space. The AtomicConv model leverages these connections in 3D space to train models that learn to predict energetic states starting from the spatial geometry of the complex. These two models have been utilized as the reference data-driven models in comparison with our hybrid model. Due to its flexibility, the GraphConv model has been chosen as the GCN component of the proposed PGNN model.

Featurization and Parameterization
Model training is performed with a combination of structure-based and physics-based features. Structure-based features are directly extracted from PDB files. ConvMolFeaturizer [31] from DeepChem, an implementation of the Duvenaud graph convolutions [54], is employed to represent atom features in the form of a graph. Every protein-ligand complex is featurized based on each atom's neighborhood and is transformed into a 2D matrix X of order N × D, where N is the number of atoms in a complex and D is the number of features for each atom. ConvMolFeaturizer extracts 75 features for each atom: a binary representation of the atom type, hybridization type, implicit valence, aromaticity, degree, number of hydrogens, number of radical electrons, and formal charge.
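The per-atom binary featurization can be illustrated with a toy re-implementation. The vocabularies below are hypothetical, heavily reduced stand-ins for the categories ConvMolFeaturizer actually encodes (which total 75 features per atom); the function name and group sizes are ours.

```python
import numpy as np

# Hypothetical, reduced vocabularies for illustration only; the real
# featurizer one-hot encodes many more categories.
ATOM_TYPES = ["C", "N", "O", "S", "H"]
HYBRIDIZATIONS = ["SP", "SP2", "SP3"]

def featurize_atom(symbol, hybridization, degree, num_h, is_aromatic,
                   max_degree=4, max_h=4):
    """Binary (one-hot) feature vector for one atom, mimicking the style
    of ConvMolFeaturizer's per-atom descriptors."""
    feats = []
    feats += [1.0 if symbol == t else 0.0 for t in ATOM_TYPES]
    feats += [1.0 if hybridization == h else 0.0 for h in HYBRIDIZATIONS]
    feats += [1.0 if degree == d else 0.0 for d in range(max_degree + 1)]
    feats += [1.0 if num_h == n else 0.0 for n in range(max_h + 1)]
    feats += [1.0 if is_aromatic else 0.0]
    return np.array(feats)
```

Stacking one such vector per atom yields the N × D matrix X that enters the network.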
Physics-based features, P, of each molecule are calculated using GBNSR6, available in Ambertools 2020 [55]. In short, the GBNSR6 model is executed on the complex, protein, and ligand structures. Electrostatic energy (EELEC), electrostatic energy for 1-4 bonded atoms (1-4-eel), non-polar solvation energy (ESURF), polar solvation energy (EGB), and van der Waals energy (VDWAALS) are extracted accordingly. (The PBSA model [56,57] is employed to calculate the van der Waals energy, since this calculation is not implemented in GBNSR6.) The total enthalpy is subtracted from the experimental ∆∆G values to obtain an entropy estimate, which is added to the model as the last physics-based feature. See Table 1 for more details.
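The entropy feature described above amounts to simple arithmetic on the extracted energy terms. A sketch, assuming (our convention) that the per-structure terms are collected into dictionaries keyed by the feature labels used in the text; the helper names are ours, and the sign convention of the stored feature follows the description "experimental ∆∆G minus total enthalpy".

```python
def total_enthalpy(terms):
    """Sum the per-structure energy components into an enthalpy estimate."""
    return sum(terms[k] for k in ("EELEC", "1-4-eel", "ESURF", "EGB", "VDWAALS"))

def entropy_feature(ddg_exp, complex_terms, protein_terms, ligand_terms):
    """Entropy estimate: experimental ddG minus the enthalpy change
    dH = H(complex) - H(protein) - H(ligand)."""
    dH = (total_enthalpy(complex_terms)
          - total_enthalpy(protein_terms)
          - total_enthalpy(ligand_terms))
    return ddg_exp - dH
```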

Hybrid Model: PGNN
The proposed Physics-Guided Neural Network (PGNN) is a GCN with integrated physics-based features. The architecture of this model is shown in Figure 1. The model employs a GCN [31] to capture spatial features of the structures in 3D space. The PGNN model consists of a pair of GraphConv layers of fixed channel size (training epochs: 100, learning rate: 0.001). The activation function used for the GraphConv layers is the Tanh function, which provides output in the continuous range of [−1, 1]. This layer combines the features of each atom with those of its ten nearest neighbor atoms and creates a new feature vector for each atom. The output of GraphConv is a matrix of order N × 75 × 10. A batch normalization layer is applied to improve the learning process, followed by a one-dimensional max pooling to reduce the feature space. Finally, the GraphGather layer is used to combine the data from all the nodes. The output of this layer is the model variable M, which is later concatenated with the vector P containing the physics-based features. The new vector, (M, P), is fed to the final dense layer to estimate ∆∆G. The activation function used in the last two dense layers is ReLU. The loss function is designed to minimize the empirical error or, in other words, to minimize the RMSE of the calculated ∆∆G with respect to the experiments.
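A minimal NumPy sketch of the forward pass just described. All layer sizes and weights below are illustrative placeholders (the real model's channel sizes, batch normalization, and training loop are omitted); it only shows the data flow GraphConv → pooling/gather → concatenation of M with P → dense layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_conv(H, A, W):
    """GraphConv-style layer: mix each atom's features with its
    neighbors', then apply tanh (output bounded in [-1, 1])."""
    A_hat = A + np.eye(A.shape[0])
    return np.tanh((A_hat @ H) @ W)

def pgnn_forward(X, A, P, params):
    """Sketch of the PGNN forward pass: two GraphConv layers ->
    max pooling / graph gather -> concatenate physics features P ->
    dense ReLU -> scalar ddG prediction."""
    H = graph_conv(X, A, params["W1"])
    H = graph_conv(H, A, params["W2"])
    M = H.max(axis=0)                                       # model variable M
    v = np.concatenate([M, P])                              # joint vector (M, P)
    h = np.maximum(0.0, v @ params["Wd"] + params["bd"])    # dense ReLU
    return float(h @ params["Wo"] + params["bo"])           # predicted ddG

# Illustrative dimensions: 6 atoms, 75 atom features, 6 physics features
N, D, F, Pdim, Hdim = 6, 75, 32, 6, 16
params = {
    "W1": rng.normal(scale=0.1, size=(D, F)),
    "W2": rng.normal(scale=0.1, size=(F, F)),
    "Wd": rng.normal(scale=0.1, size=(F + Pdim, Hdim)),
    "bd": np.zeros(Hdim),
    "Wo": rng.normal(scale=0.1, size=Hdim),
    "bo": 0.0,
}
```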
The following thermodynamic equation is integrated into the learning process of the proposed PGNN model by initializing the weights of the ∆H terms to +1 for the complex and −1 for the protein and the ligand, and the weight of the entropy term to −1:

∆∆G = ∆H(complex) − ∆H(protein) − ∆H(ligand) − T∆S. (3)

Figure 1. Workflow of the proposed PGNN model. The sample input structure, the complex of the BIR domain of MLIAP and GDC0152, is selected from the PDBbind refined set [58]. After featurization, the structure-based features, along with the adjacency matrix A, enter the network. The result of the first network iteration is the model variable M. The output of the physics-based model is the vector P, which is concatenated with M. The resulting vector goes through several iterations before entering the final dense layer.

Datasets
PDBbind Refined Set v.2015. The primary dataset used in this work is acquired from the PDBbind refined set v.2015 [58]. This dataset is a comprehensive collection of experimentally measured ∆∆G values for 3706 protein-ligand complexes. However, due to the limited scalability of the DeepChem featurizer, a subset of 368 complexes of various sizes was selected for training and testing the proposed model. As illustrated in Figure 2, the subset has been selected with reference to the ∆∆G values of the original PDBbind dataset to ensure comprehensive sampling. The size of the protein-ligand complexes in the sample dataset varies between 1200 and 43,000 atoms, which replicates the wide range of structure sizes in the original dataset. Extensive structure optimization and force field assignment are carried out using the protocol explained in [59]. Water molecules are eliminated from the structural (PQR) files using the CPPTRAJ library in Ambertools 2020. The ANTECHAMBER program is used to generate custom residue topology files (prep files) using the general Amber force field (GAFF) [60] for ligands and FF14SB [61,62] for protein structures. The TLEAP program is employed to create the coordinate and topology files, which are then used to run GBNSR6. The dataset was split into training and test sets with a ratio of 3:1. Model training and testing were performed on San Diego Supercomputing Center (SDSC) clusters with 20 CPU cores and 242 GB of memory, taking approximately 40 min.
Host-Guest Systems. Another dataset used in this work is a collection of small structures acquired from the host-guest benchmark [4] and the SAMPL challenges [63]. This dataset consists of seven distinct hosts: Octa Acid (OA), tetra-endomethyl octa acid (TEMOA), Alpha-Cyclodextrin (aCD), and Beta-Cyclodextrin (bCD) from the host-guest benchmark [4]; OctaAcidH (OAH) [64] and OctaAcidMe (OAMe) [65] from the SAMPL5 challenge; and Cucurbit[8]uril (CB8) from the SAMPL6 challenge [66]. The hosts are small molecules containing fewer than 100 atoms. These hosts bind their guests the same way proteins bind their ligands, so they can be considered simple test cases for computational models of noncovalent binding. In addition, their small size and, in many cases, their rigidity make it feasible to efficiently estimate ∆∆G values. Several guests are provided for each host, comprising 72 rigid complexes in total. The raw PDB files and pre-processed topology and coordinate files of the host-guest benchmark are freely available at https://github.com/mobleylab/benchmarksets (accessed on 7 April 2017). SAMPL5 and SAMPL6 structure files are taken from the SAMPL Challenges GitHub repository. The enthalpic and entropic components of ∆∆G for these molecules have been experimentally measured. Complex topology and coordinate files were further processed to strip water molecules and counter-ions and then split into host and guest using the CPPTRAJ library in Ambertools 2020. The topology and coordinate files of hosts, guests, and complexes were then used to run GBNSR6. The dataset was split into training and test sets with a ratio of 3:1.

Accuracy of the PGNN Model
This section examines the accuracy of the proposed PGNN model compared to the two data-driven models: the GraphConv model and the AtomicConv model. It should be noted that calculating the total ∆∆G with the selected implicit solvent alone is not possible, since GBNSR6 accounts only for the enthalpic component. Because enthalpic and entropic values have not been measured separately in the PDBbind refined set, the accuracy of the PGNN model is compared only with the data-driven models. Both models have been trained for 100 epochs with 4-fold cross-validation, which leads to 276 data points for training and 92 data points for validation and testing. The mean squared error (MSE) is used as the regression loss function for the learning process of the two models, and RMSE is used to measure the accuracy of each model; see Table 2. The following conclusions can be drawn from Table 2: First, the GraphConv model outperforms the PGNN model on the training set but fails to do so on the test set. This observation implies overfitting of GraphConv on the training set. It also indicates that the physics-based features in PGNN played an essential role in guiding the neural network on unseen data. The AtomicConv model performs significantly better on the test set than on the training set; the inconsistency in estimating ∆∆G values on the two sets, though, raises concern about the transferability of this model. PGNN, on the other hand, shows consistently more accurate results on the training and test sets.
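The evaluation protocol (4-fold cross-validation scored by RMSE) can be sketched generically. The helper functions below are ours and work with any train/predict pair; with 368 complexes and k = 4 they reproduce the 276/92 split mentioned above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between predictions and references."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def kfold_indices(n, k=4, seed=0):
    """Shuffle n sample indices and split them into k folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(train_fn, predict_fn, X, y, k=4):
    """Per-fold validation RMSE: train on k-1 folds, score the held-out one."""
    scores = []
    for fold in kfold_indices(len(y), k):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        model = train_fn(X[mask], y[mask])
        scores.append(rmse(y[fold], predict_fn(model, X[fold])))
    return scores
```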
The average loss per epoch of the 4-fold cross-validation is illustrated in Figure 3. Figure 3a shows that the GraphConv model converges more quickly than the PGNN model, with a significant gap between the training and validation results. This observation, again, implies overfitting of this data-driven model during the training process. Despite applying the early stopping rule in a separate scenario (not demonstrated here), this gap never closed. Figure 3b demonstrates the poor performance of the AtomicConv model on the training set and a large gap between the training loss and the validation loss, which does not close after 100 epochs. In Figure 3c, the PGNN model shows more fluctuations throughout the learning process, indicating that the model explores the solution space more effectively and learns the relations between the features more accurately. The model is more successful than the data-driven models in closing the gap between training and validation loss and minimizing the error. This observation demonstrates that the PGNN model is less likely to be overfitted on the training set and can predict ∆∆G more accurately on unseen datasets.

Transferability of the PGNN Model
To evaluate the transferability of the proposed model, in addition to the PDBbind dataset, PGNN was trained and tested on the host-guest systems. According to Table 3, compared to GraphConv, PGNN is slightly less accurate on the training set but more accurate on the test set. Aligned with the results in Table 2, this observation implies overfitting of the GraphConv model. GBNSR6 was utilized for calculating the enthalpic values of the binding free energy. Entropic values were borrowed from experiment to account for the total ∆∆G values. Since host-guest systems are small and rigid, this strategy does not significantly affect the accuracy of the calculations. It can be seen that the physics-based model performs poorly in comparison to the other two models. This inaccuracy is the main motivation for proposing the PGNN model, which estimates the ∆∆G values of the complexes about 6 kcal/mol more accurately. It should be noted that the AtomicConv model was not tested on this dataset, since the input PDBbind dataset is hard coded in this model and could not be replaced with the host-guest systems. As shown in Figure 4, and similar to Figure 3, the GraphConv model (Figure 4a) converges more quickly than the PGNN model (Figure 4b), with a significantly larger gap between the training and validation results, which implies overfitting.

Interpretability of the PGNN Model
Calculating the binding free energy of biomolecules is a physics-based problem by nature. Therefore, it is crucial to interpret the relevant ML models by analyzing the model parameters compared to the physical laws. In the GraphConv model, the sum of the binary (atomic) features is directly related to ∆∆G values. However, the exact relation between these features and binding free energy is not straightforward due to the complicated architecture of the neural network. The proposed PGNN model utilizes physics-based features to learn the relationship between structural features. These features are used to analyze the interpretability of the proposed model.
According to Equation (3), physics-based features of the complex have positive coefficients, physics-based features of the protein and the ligand have negative coefficients, and entropy has a negative coefficient. Accordingly, we initialized the weights of the complex parameters to +1, the protein, ligand, and entropy parameters to −1, and the model variable to 0.5. Table 4 shows that the model has retained the thermodynamic equation (i.e., Equation (3)), since the coefficients of the physics-based features are almost the same before and after training. In addition, the coefficient of the model variable, which is derived from the binary features, has decreased from 0.5 to 0.45. In other words, the presented PGNN model has decreased the weight of the model variable in order to converge to a smaller gap between the predicted and actual values of ∆∆G.
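The sign-based weight initialization described above can be sketched as follows. The layout is ours and purely illustrative: we assume five physics features per structure (complex, protein, ligand), one entropy feature, and the scalar model variable M, concatenated in that order.

```python
import numpy as np

def init_physics_weights(n_terms=5, model_var_init=0.5):
    """Initial final-layer weights encoding Equation (3):
    +1 for each complex energy term, -1 for protein and ligand terms,
    -1 for the entropy feature, and 0.5 for the model variable M.
    n_terms is the number of physics features per structure (illustrative)."""
    w_complex = np.full(n_terms, 1.0)
    w_protein = np.full(n_terms, -1.0)
    w_ligand = np.full(n_terms, -1.0)
    w_entropy = np.array([-1.0])
    w_model = np.array([model_var_init])
    return np.concatenate([w_complex, w_protein, w_ligand, w_entropy, w_model])
```

After training, comparing the learned vector against this initialization (as in Table 4) is what makes the model's physics interpretable.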

Robustness Analysis
Noise injection [67] is a standard practice in machine learning for studying the robustness of a model. This technique can also be used in the training phase of a neural network, where adding noise prevents memorization of the training samples and leads to a more robust model with lower generalization error. In this study, we tested the robustness of the PGNN model by training and testing it on the original dataset and on a noisy version of it. Entropy was selected for the robustness analysis since GBNSR6 does not calculate it directly; instead, entropy values are given to the model as the difference between the experimental ∆∆G values and the enthalpic component calculated by GBNSR6. Gaussian noise, N(µ, σ²), was added to the entropy feature, where µ is the mean of the entropic values over the entire dataset and σ their standard deviation. According to Figure 5, the accuracy of the model changes by only 0.1 kcal/mol on the (noisy) training set and 0.5 kcal/mol on the (noisy) test set. These small changes in the RMSE of the ∆∆G values confirm that the PGNN model is robust against small noise in the entropy feature. Therefore, it is concluded that more accurate entropy values, i.e., those calculated with NMA or the quasi-harmonic approximation, would not significantly affect the final results.
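The noise-injection protocol can be sketched directly from the description above: noise drawn from N(µ, σ²) with µ and σ taken from the entropy feature itself over the dataset. The function name and seed handling are ours.

```python
import numpy as np

def inject_gaussian_noise(feature, seed=0):
    """Perturb a feature column with N(mu, sigma^2) noise, where mu and
    sigma are the feature's own mean and standard deviation over the
    dataset, following the robustness test described in the text."""
    feature = np.asarray(feature, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=feature.mean(), scale=feature.std(),
                       size=feature.shape)
    return feature + noise
```

Retraining on the perturbed column and comparing RMSE before and after is then a direct measure of robustness to errors in the entropy estimate.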

Feature Analysis
The correlation heatmap between the physics-based features of protein-ligand complexes in the PDBbind dataset is shown in Figure 6. This figure provides insight into the relationship between the selected features, mainly whether they could be grouped or eliminated to avoid redundancy. It is demonstrated that van der Waals energy (vdwaals) has no linear relationship with other features. Therefore, keeping this feature as an independent indicator of short-range non-covalent energy is essential. The other features, though, show strong correlations: polar (egb) and non-polar (esurf) free energy components are negatively correlated. For future studies, it is worth merging the two and considering the total solvation free energy as a new feature. In addition, electrostatic energy for 1-4 bonded atoms (1-4-eel) strongly correlates with all the features, except for vdwaals. It is recommended to eliminate this feature and focus on other atomic interactions.
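The heatmap analysis reduces to a Pearson correlation matrix over the physics-based feature columns. A sketch, with a helper (ours, including the threshold value) that flags strongly correlated pairs as candidates for merging or elimination:

```python
import numpy as np

def correlation_matrix(X):
    """Pearson correlation matrix between the feature columns of X
    (n_samples x n_features); the basis of the heatmap in Figure 6."""
    return np.corrcoef(np.asarray(X, dtype=float), rowvar=False)

def redundant_pairs(C, names, threshold=0.9):
    """Feature pairs whose absolute correlation exceeds a threshold and
    are therefore candidates for merging or elimination."""
    n = len(names)
    return [(names[i], names[j])
            for i in range(n) for j in range(i + 1, n)
            if abs(C[i, j]) >= threshold]
```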

Conclusions
In this study, we have proposed a hybrid data-physics model to estimate the binding free energy of protein-ligand complexes with a wide range of structure sizes. This novel model inherits “interpretability” and “transferability” from the underlying physics-based model and “accuracy” from the data-driven model. As the first step, we have trained and tested the model on 368 protein-ligand structures selected from the PDBbind refined set and on 72 host-guest systems. Results show that the combination of structural features extracted by the GraphConv model and physics-based features calculated by GBNSR6 enables the model to predict the binding free energy more accurately than the two data-driven models, i.e., the GraphConv model and the AtomicConv model. Compared to these models, the PGNN model consistently performs better on both training and test sets, reflecting its transferability between different datasets. Furthermore, analyzing the coefficients of the physics-based features before and after training makes it possible to compare the results with physical laws and interpret them accordingly.
The robustness of the PGNN model has been evaluated by adding a noise function to a selected feature. Similar results before and after noise injection confirm the robustness of the model. A closer look at the physical features and their relations demonstrates that the van der Waals energy is the only feature that does not correlate with the others. While it is recommended to keep this feature, eliminating or grouping the other features should be tested in future work. The main objective of this paper was to introduce physics-data hybrid models and the corresponding neural network architecture. Extensive testing of this model on larger datasets, including newer versions of PDBbind, is our immediate next step. A further extension of this work will be to run short molecular dynamics (MD) simulations embedded in MM/GBSA to take the dynamics of protein-ligand interactions into account.