A New Hybrid Neural Network Deep Learning Method for Protein–Ligand Binding Affinity Prediction and De Novo Drug Design

Accurately predicting ligand binding affinity in a virtual screening campaign is still challenging. Here, we developed two hybrid neural network (HNN) deep learning methods, HNN-denovo and HNN-affinity, that combine a 3D convolutional neural network (3D-CNN) and a feed-forward neural network (FFNN) in a single hybrid framework. The HNN-denovo method uses the protein pocket structure and protein–ligand interactions as input features. The HNN-affinity method uses protein sequences and ligand features as input features. In both cases, the HNN framework combines the CNN and FFNN architectures to model the protein structure or protein sequence together with the ligand descriptors. To train the models, the HNN methods used thousands of known protein–ligand binding affinity data points retrieved from the PDBBind database. We also developed Random Forest (RF), Gradient Boosting (GB), Decision Tree with AdaBoost (DT), and consensus models, and compared their results with those of the HNN methods. We further compared the HNN results with the literature-reported deep learning protein–ligand binding affinity predictions made by DLSCORE, KDEEP, and DeepAtom. The predictive performance of the HNN methods (maximum Pearson's R achieved was 0.86) was consistently better than or comparable to that of the DLSCORE, KDEEP, and DeepAtom deep learning methods for both balanced and unbalanced data sets. HNN-affinity can be applied to protein–ligand affinity prediction even in the absence of protein structure information, as it considers the protein sequence as a standalone feature in addition to the ligand descriptors. The HNN-denovo method can be efficiently implemented in structure-based de novo drug design campaigns. The HNN-affinity method can be used as a standalone tool or in conjunction with deep learning molecular docking protocols.
Further, it can be combined with conventional molecular docking methods in a multistep approach to rapidly screen billions of diverse compounds. The HNN methods are highly scalable on cloud ML platforms.


Introduction
Accurate methods of computing ligand affinity with a biological target are strongly desirable in drug discovery. Millions of compounds can be classified as 'active' or 'inactive' for a specific target using different virtual screening approaches. A ligand binds to the target protein and forms a protein-ligand complex that produces effects such as inhibition or activation of the protein. Predicting the binding affinity and the binding interaction strength between the ligand and the target protein is a key step in drug discovery, as determining the binding affinity helps identify the best ligand as a potential drug candidate. Computational approaches to predicting binding affinity, used instead of in vitro experiments, significantly reduce drug discovery time and cost. The early phase of small molecule drug discovery deals with rational drug design for a protein target, where potential small molecules are selected as hits based on their binding affinity. Several computational

Results and Discussion
In this study, we developed two hybrid neural network deep learning methods, called 'HNN-denovo' and 'HNN-affinity', for de novo design and binding affinity prediction. We applied the HNN methods to rapidly screen a precompiled known library of protein–ligand complexes and predicted their binding affinity. The overall workflow schema of the HNN method is displayed in Figure 1. The HNN method was developed as a hybrid deep neural network model in Python using the Keras API with TensorFlow in the backend (Figure 2). The HNN model consists of a convolutional neural network (CNN) for deep learning based on structure attributes (SMILES) and a multilayer perceptron (MLP)-type feed-forward neural network (FFNN) for learning based on the protein binding pocket, protein sequence, and protein–ligand interaction descriptors (Figures 1 and 2). HNN-denovo uses both the amino acid sequence and structure information of the protein and the SMILES of the ligand for designing novel small molecules. The HNN-denovo method uses target protein binding pocket information such as ligand-binding site interactions, ligand-specific descriptors, and fingerprints. The protein and ligand structural interaction descriptors are calculated by BINANA [31], which examines the docked ligand poses inside the protein pocket to identify non-covalent and covalent molecular interactions, such as hydrogen bonds, salt bridges, and pi interactions, that contribute to ligand binding. The HNN-affinity method uses the proteins and their binding pocket sequences together with the ligand structure as inputs. In both cases, the SMILES of the ligands were used for the ligand descriptor calculations.
We used HNN learning rather than stacking learning for the following reasons. Stacking considers several models, based on different algorithms, as weak learners: the models are developed on the same training set, predictions are made for the same test set, and those predictions are then stacked and used as the meta-model's input features to make a better final prediction. In contrast, HNN learning first builds a CNN that considers the ligand features and an FFNN that considers the protein feature descriptors, and then concatenates the CNN and FFNN models themselves, not their individual predictions, to make the final prediction.

Figure 1. The overall workflow schema of the experimental design, the data set, and the training and testing process of the deep learning-based hybrid neural network methods HNN-affinity and HNN-denovo and the other machine learning models that we developed in this study.

We included a new SMILES feature representation method in the HNN framework by modifying our previous 3D array representation of 1D SMILES simulated by the convolutional neural network (CNN) [32,33]. We assembled protein-ligand binding data, namely Kd, Ki, and combined Kd and Ki values, from a general set and a refined set obtained from the PDBBind database. To train the machine learning models, we computed a total of 653 protein- and ligand-specific molecular descriptors that were modeled by the FFNN, and the SMILES as ligand chemical features that were modeled by the CNN. We also developed models based on RF, GB, and DT. We then predicted ligand binding affinity for the precompiled known library of protein-ligand complexes obtained from the PDBBind database. Since the HNN model suffers from higher variance than other machine learning methods, such as RF, we made predictions multiple times and evaluated the performance as the average of ten simulations.
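The branch-and-concatenate idea described above can be sketched as a toy forward pass. This is a minimal numpy illustration, not the paper's Keras/TensorFlow implementation; all layer sizes, weights, and input dimensions here are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def cnn_branch(smiles_ints, n_filters=8, kernel=3, embed_dim=16, vocab=95):
    """Toy CNN branch: embed the integer-encoded SMILES, apply a 1D
    convolution, ReLU, and a global max pool to get a fixed-length vector."""
    emb = rng.standard_normal((vocab, embed_dim)) * 0.1     # embedding table
    w = rng.standard_normal((n_filters, kernel, embed_dim)) * 0.1
    x = emb[smiles_ints]                                    # (L, embed_dim)
    L = x.shape[0]
    out = np.empty((L - kernel + 1, n_filters))
    for i in range(L - kernel + 1):
        window = x[i:i + kernel]                            # (kernel, embed_dim)
        out[i] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return relu(out).max(axis=0)                            # (n_filters,)

def ffnn_branch(descriptors, hidden=16):
    """Toy FFNN branch over the protein/interaction descriptor vector."""
    w1 = rng.standard_normal((descriptors.size, hidden)) * 0.1
    return relu(descriptors @ w1)                           # (hidden,)

def hnn_predict(smiles_ints, descriptors):
    """Concatenate the branch feature vectors (not their predictions)
    and map the joint representation to one affinity value (pKd/pKi)."""
    feats = np.concatenate([cnn_branch(smiles_ints), ffnn_branch(descriptors)])
    w_out = rng.standard_normal(feats.size) * 0.1
    return float(feats @ w_out)

# Toy inputs: a 20-character encoded SMILES and 348 BINANA-style descriptors.
pred = hnn_predict(rng.integers(1, 95, size=20), rng.standard_normal(348))
```

In the real model, both branches and the final layers are trained jointly end to end, which is what distinguishes the HNN from stacking, where only the branch predictions would be combined.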
Though other machine learning methods, such as RF, have low variance across multiple simulations, to be consistent we ran ten simulations for the RF, GB, and DT methods as well. The averages of the RF, GB, and DT model predicted values were used for the consensus prediction. Next, we compared the HNN method-predicted protein-ligand affinity results with those of the RF, GB, and DT methods. We also compared the HNN-denovo and HNN-affinity performance metrics with literature-reported deep learning methods such as DLScore, KDEEP, and DeepAtom. The results are discussed in the following sections.
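The three ensemble regressors and their consensus average can be sketched with scikit-learn. This is a hedged illustration on synthetic data; the actual models were trained on the BINANA descriptors, and the hyperparameters shown here are assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor)
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the BINANA descriptor matrix and pKd/pKi labels.
X, y = make_regression(n_samples=200, n_features=30, noise=0.5, random_state=0)
X_train, y_train, X_test = X[:160], y[:160], X[160:]

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    # "DT" in the paper is a decision tree boosted with AdaBoost.
    "DT": AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), random_state=0),
}
preds = {name: m.fit(X_train, y_train).predict(X_test) for name, m in models.items()}

# Consensus prediction = per-complex mean of the three model predictions.
consensus = np.mean([preds["RF"], preds["GB"], preds["DT"]], axis=0)
```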


Model Evaluation
The performance of the models was evaluated as the average of 10 simulation results based on the root mean squared error (RMSE), mean absolute error (MAE), and Pearson correlation coefficient (PCC). The RMSE is the square root of the average of the squared errors, where the errors are the differences between the predicted and actual values. The MAE is the average absolute difference between the predicted and the actual values. The PCC measures the strength of the relationship between the predicted and actual values.
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \quad MAE = \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i|, \quad PCC = \frac{\mathrm{cov}(\hat{Y}, Y)}{\sigma_{\hat{Y}}\,\sigma_{Y}}

where \hat{y}_i is the predicted value of the ith dependent variable, y_i is the ith observed dependent variable, cov(\hat{Y}, Y) is the covariance of the predicted and the observed data, and \sigma_{\hat{Y}}\sigma_{Y} is the product of their standard deviations.
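These three metrics can be computed directly, for example (a small numpy sketch with made-up predicted and observed pK values; sample standard deviations, ddof=1, are assumed for the PCC):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared prediction error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean absolute prediction error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def pcc(y_true, y_pred):
    """Pearson correlation: cov(Yhat, Y) / (std(Yhat) * std(Y))."""
    cov = np.cov(y_pred, y_true)[0, 1]
    return float(cov / (np.std(y_pred, ddof=1) * np.std(y_true, ddof=1)))

# Illustrative observed vs. predicted pKd values.
y_true = np.array([5.2, 6.1, 7.4, 4.8, 8.0])
y_pred = np.array([5.0, 6.5, 7.1, 5.1, 7.6])
```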

Validation with PDBBind Protein-Ligand Complex Refined Set
Models based on combined Kd and Ki binding data: Case 1 data set.
The case 1 binding data (see the methods section) are the combination of the Kd and Ki binding affinity data, compiled from 4850 protein-ligand complexes. The models were developed with 3925 complexes in the training set and 925 complexes in the test set. The HNN-denovo model input features are the descriptors that represent the protein binding pocket structure and the protein-ligand interactions computed by the BINANA algorithm [31], together with the SMILES of the ligand. We implemented only the BINANA descriptors as input features for the RF, GB, and DTBoost models, as they cannot handle SMILES input features. The PDB codes of the complexes used in the training and test sets are provided in the Supplementary Material. The statistical performance metrics reported were averages of the 10 neural network simulations. The Pearson correlation coefficients (PCCs) for the HNN, RF, GB, DT, and consensus methods were 0.82, 0.80, 0.80, 0.81, and 0.81, respectively (Figure 3). The RMSEs were 1.142, 1.149, 1.144, 1.104, and 1.108, and the MAEs were 0.924, 0.932, 0.908, 0.879, and 0.89 for the HNN, RF, GB, DT, and consensus methods, respectively. The root mean square errors (RMSEs), i.e., the binding affinity value prediction errors, and the mean absolute errors (MAEs), i.e., the average magnitudes of the errors in the binding affinity predictions, were similar among the models. Overall, for the combined Kd and Ki binding affinity data, the HNN-denovo model consistently performed equally well or better than the RF, GB, and DT methods (Figure 3).


Models based on Kd binding data: Case 2 data set.
Next, we sought to test the binding affinity prediction for a smaller data set. We considered the Kd and Ki binding data as separate data sets, case 2 and case 3, respectively. The case 2 binding data set (see the methods section) comprised the Kd binding affinity data compiled from 2464 protein-ligand complexes. The models were developed on the 2004 complexes with Kd values in the training set, and the test set contained 460 complexes for which the binding affinity was predicted. The HNN-denovo model used the 348 BINANA protein-ligand interaction descriptors, the protein pocket structure, and the SMILES of the ligands as input features, whereas the RF, GB, and DT models used the BINANA descriptors as input features. The PDB codes of the complexes in the training and test sets for case 2 are provided in the Supplementary File. The statistical performance metrics reported were the averages of the 10 neural network simulations. The HNN, RF, GB, DT, and consensus methods predicted the binding affinity with average PCCs of 0.80, 0.77, 0.77, 0.79, and 0.79 (Figure 4). Notably, the protein-ligand binding affinity prediction correlation of the HNN-denovo for the Kd binding data is equal to or higher than that of the RF, GB, and DT (Figure 4). The RMSEs of the predictions were 1.075, 1.101, 1.09, 1.051, and 1.054, and the MAEs were 0.857, 0.891, 0.859, 0.836, and 0.841 for the HNN, RF, GB, DT, and consensus methods, respectively; the RMSEs and MAEs were similar among the models. Taken together, for the Kd data, the HNN-denovo model outperformed the other machine learning methods, RF, GB, and DT (Figure 4).

Models based on Ki binding data: Case 3 data set.
Next, we sought to test the binding affinity prediction with the Ki binding data. The case 3 binding data set (see the methods section) comprised the Ki binding affinity data compiled from 2386 protein-ligand complexes. The models were developed on the 1936 complexes with Ki values in the training set, and the binding affinity was predicted for the 450 complexes in the test set. The HNN-denovo model used the 348 BINANA descriptors, the protein pocket structure, and the SMILES of the ligands as input features, whereas the RF, GB, and DT models used only the BINANA descriptors as input features. The PDB codes of the complexes in the training and test sets for case 3 are provided in the Supplementary File. The statistical performance metrics reported were the averages of the 10 neural network simulations. The predicted ligand affinity correlation metric PCCs were 0.83, 0.80, 0.80, 0.82, and 0.81 for the HNN, RF, GB, DT, and consensus methods, respectively (Figure 5). The RMSEs were 1.156, 1.192, 1.182, 1.146, and 1.15, and the MAEs were 0.938, 0.95, 0.928, 0.905, and 0.911 for the HNN, RF, GB, DT, and consensus methods, respectively. Again, for the Ki data, the HNN-denovo model performed optimally in predicting the ligand binding affinity, while the RMSEs and MAEs were similar among the models (Figure 5).


Validation with PDBBind Protein-Ligand Complex General Set
The next validation set we used was the PDBbind v2019 general set, consisting of binding affinity data for 12,800 protein-ligand complexes. We compiled and combined the Kd and Ki data in this general set and obtained a total of 6081 complexes. We processed these complexes and computed the protein-ligand interaction descriptors and the ligand SMILES descriptors. The models were developed with the 5031 complexes in the training set, and the binding affinity was predicted for the 1050 test set complexes. Here, the HNN-denovo model used the 348 BINANA protein-ligand interaction descriptors, the protein pocket structure, and the SMILES of the ligands as input features, whereas the RF, GB, and DT models used only the BINANA descriptors as input features. The predicted ligand affinity correlation metric PCCs were 0.77, 0.77, 0.74, 0.78, and 0.78 for the HNN, RF, GB, DT, and consensus methods, respectively (Figure 6). The RMSEs were 1.134, 1.098, 1.154, 1.066, and 1.077, and the MAEs were 0.908, 0.873, 0.915, 0.831, and 0.848 for the HNN, RF, GB, DT, and consensus methods, respectively. The HNN-denovo model performed comparably to the RF and GB models, with DTBoost achieving slightly higher prediction accuracy (Figure 6). The RMSEs and MAEs were similar among the models.


Validation with PDBBind Protein-Ligand Complex Refined General Set
We next sought to test the PDBBind refined-general data set. It contains 10,931 complexes, with 4085 complexes obtained from the refined set and 6081 obtained from the general set. We used 8931 complexes in the training set, and 2000 randomly selected complexes were used as the test set; the binding affinity was predicted for these 2000 complexes. The HNN-denovo model used the 348 BINANA descriptors, the protein pocket structure, and the SMILES of the ligands as input features, whereas the RF, GB, and DT models used only the BINANA descriptors as input features. The PDB codes of the complexes used in the training and test sets are provided in the Supplementary Material. The statistical performance metrics were reported as the averages of the ten neural network simulations. The computed PCCs were 0.79, 0.79, 0.78, 0.81, and 0.81 for the HNN, RF, GB, DT, and consensus methods, respectively (Figure 7). The RMSEs were 1.144, 1.137, 1.152, 1.086, and 1.098, and the MAEs were 0.932, 0.915, 0.921, 0.861, and 0.88 for the HNN, RF, GB, DT, and consensus methods, respectively. The HNN-denovo model performed optimally among the machine learning methods, except for DTBoost, which performed slightly better on this data set (Figure 7). The RMSEs, i.e., the binding affinity value prediction errors, and the MAEs, i.e., the average magnitudes of the errors in the binding affinity predictions, were similar among the models.


HNN-Affinity Predictions Based on the Protein Sequence and the Ligand SMILES as Input Features
Next, we developed the HNN-affinity method to use only the protein binding pocket amino acid sequences, instead of protein structural features, and the ligand SMILES as input features. Protein sequence-based ligand affinity prediction is useful when the protein structural complex is unavailable. The binding pocket amino acid sequences, number encoded (see the methods section), were used as input for the CNN, and the ligand SMILES were used as input for the FFNN. We used the BINANA descriptors as input features for the RF, GB, and DTBoost models, as they cannot handle SMILES input features. We used the case 1 binding data of the PDBbind v2019 refined set (see the methods section), which is the combination of the Kd and Ki binding affinity data compiled from a total of 4850 protein-ligand complexes. The models were developed with 3860 complexes in the training set, and the ligand binding affinity was predicted for the 990 complexes in the test set. The PDB codes of the complexes used in the training and test sets are provided in the Supplementary Material. The statistical performance metrics reported were the averages of the 10 neural network simulations. We compared the HNN-affinity results with the HNN-denovo, RF, GB, and DTBoost models (Table 1). The HNN-affinity and HNN-denovo predictive performances are close to each other; the HNN-denovo predicted with ~2% higher accuracy with the inclusion of the BINANA descriptors. The binding affinity value prediction errors (RMSE) and the average magnitudes of the errors (MAE) among the models were similar. Evidently, the HNN-denovo performed well compared to the HNN-affinity, RF, GB, and DTBoost models: the inclusion of protein and ligand structural features and protein-ligand interaction features contributed to the increased prediction accuracy of the HNN-denovo.
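The number encoding of pocket sequences might look like the following sketch. The exact alphabet, index assignment, and maximum length used by HNN-affinity are described in the methods section, so every choice here (one index per standard amino acid, zero padding, a length of 120) is an illustrative assumption:

```python
# Assumed alphabet: the 20 standard amino acids, one-letter codes.
AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AA)}  # 0 is reserved for padding

def encode_pocket_sequence(seq, max_len=120):
    """Map a binding-pocket amino acid sequence to a fixed-length integer
    vector: truncate to max_len, then zero-pad on the right."""
    codes = [AA_INDEX.get(aa, 0) for aa in seq.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))

vec = encode_pocket_sequence("GAVLIMF")  # a toy 7-residue pocket sequence
```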
Next, we sought to evaluate the performance of our HNN methods with respect to the literature-reported ligand affinity prediction deep learning methods [13-16]. We compared the HNN-affinity and HNN-denovo results with the protein-ligand binding affinity prediction deep learning methods reported in the literature, such as DeepAtom [13,14], KDEEP [15], and DLScore [16]. The HNN-denovo and HNN-affinity performance metrics PCC and RMSE, with respect to the other deep learning methods DeepAtom [13,14], KDEEP [15], and DLScore [16], for the PDBbind 2019 refined set are presented in Table 2. Hassan et al. [16] reported a PCC of 0.82 for the DLSCORE method, 0.21 for NNScore 2.0, and 0.15 for AutoDock Vina for the PDBbind v2016 data with 3199 complexes in the training set and 300 complexes in the test set. The reported RMSEs were 1.15, 2.78, and 3.17, and the MAEs were 0.86, 2.03, and 2.5 for DLSCORE, NNScore 2.0, and Vina, respectively. Jiménez et al. [15] reported a PCC of 0.82 with an RMSE of 1.27 for the KDEEP method for the PDBbind v2016 data with 3767 complexes in the training set and 290 complexes in the test set. Li et al. [14] reported a PCC of 0.81 with an RMSE of 1.31 for the DeepAtom method, with a training set containing 3390 complexes and a test set of 377 complexes obtained from the PDBbind v2016 data. Further, Li et al. [14] also reported a PCC of 0.83 with an RMSE of 1.23 for the DeepAtom method, with a large training set consisting of 9363 complexes and a test set of 1000 complexes, combined data obtained from the PDBbind v2016 set plus the MOAD database. In contrast, with only 3860 complexes in the training set, the HNN-affinity achieved higher PCCs of 0.83 and 0.82 and RMSEs of 1.04 and 1.06 for the unbalanced (i.e., a smaller number of test complexes) data set of 300 complexes and the balanced test data set of 797 complexes, respectively (Table 2).
Similarly, for the 4357 complexes in the training set, with the same test set sizes of 300 (unbalanced) and 797 (balanced), our HNN-denovo achieved a higher PCC of 0.86 with an RMSE of 1.11, and a PCC of 0.84 with an RMSE of 0.98, respectively (Table 2). Notably, even with the higher training-to-test set ratios (15.4 and 17.1), our HNN-denovo method outperformed the other deep learning methods in the binding affinity prediction rate (PCC 0.82 and 0.84), i.e., correctly predicting the native ligands for a given test protein (Table 2). An unbalanced data set usually boosts the prediction performance more than a balanced data set, which is reflected in our method predictions. This is also true for the other deep learning methods, such as KDEEP and DeepAtom, which used a small test set, ~10% of the data (Table 2). Further, the HNN-denovo achieved the highest prediction rate with a PCC of 0.86, whereas it was 0.83 for the HNN-affinity. Taken together, our HNN methods consistently performed better than or equally well compared to the other deep learning methods for the balanced and unbalanced data sets.

Data Sets
The 'refined set' and the 'general set' consisting of experimentally measured binding affinity data for biomolecular protein-ligand complexes reported in the Protein Data Bank (PDB) were downloaded from the PDBbind database [34].
Refined Set: The refined set from PDBbind v2019 consists of 4852 protein-ligand complexes annotated with ligand affinity data, provided as Kd or Ki binding constant values. From a total of 4850 protein-ligand complexes, 2464 complexes with Kd data and 2386 complexes with Ki data were processed to calculate the protein- and ligand-centric descriptors of the complexes. The binding affinities Kd and Ki were converted to pKd and pKi (the logarithm of the inverse of Kd and Ki), so that the machine learning methods predict the ligand affinity as pKd or pKi.
Case 1. The case 1 data set consists of the combined Kd and Ki binding affinity data, with a total of 4850 protein-ligand complexes. Models were developed with a training set of 3925 protein-ligand complexes containing both Kd and Ki data, and 925 complexes were used as a test set.
Case 2. The case 2 data set consists of 2464 protein-ligand complexes with Kd data. Models were developed with a training set of 2004 protein-ligand complexes with Kd binding affinity data, and the remaining 460 complexes were used as a test set.
Case 3. The case 3 data set consists of 2386 protein-ligand complexes with Ki data. Models were developed with a training set of 1936 protein-ligand complexes with Ki binding affinity data, and 450 complexes were used as a test set.
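The Kd/Ki to pKd/pKi conversion is a negative base-10 logarithm of the affinity constant. The sketch below assumes the constants have already been normalized to molar units (PDBbind lists affinities in mixed units such as nM and uM, so a unit-conversion step would precede this):

```python
import math

def p_affinity(k_molar):
    """Convert a Kd or Ki value given in molar units to pKd/pKi = -log10(K)."""
    return -math.log10(k_molar)

pkd_nanomolar = p_affinity(1e-9)   # a 1 nM binder -> pKd of about 9
pkd_micromolar = p_affinity(1e-6)  # a 1 uM binder -> pKd of about 6
```

Higher pKd/pKi therefore means tighter binding, which gives the regression target a convenient, roughly linear scale.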
General Set: The PDBbind v2019 general set consists of binding affinity data for 12,800 protein-ligand complexes. The Kd and Ki data were processed and combined, yielding a total of 6081 protein-ligand complexes. Models were developed with 5031 complexes in the training set and 1050 complexes in the test set.
Refined-General Set: The refined-general set contains a total of 10,931 complexes, obtained by combining 4850 complexes from the refined set and 6081 complexes from the general set. We used 8931 complexes for the training set, and the remaining 2000 complexes were randomly selected for the test set.

Computation of BINANA Descriptors as Protein-Ligand Interaction Structural Features
We used the protein pocket structure and the protein-ligand interactions modeled by BINANA descriptors as input features for the HNN-denovo method. Additionally, for the HNN-denovo method, protein sequences (described in the section below) were used as additional features. The protein and ligand files in .pdb and .mol2 format were first converted to .pdbqt format by the prepare_receptor4.py and prepare_ligand4.py scripts, respectively, from MGLTools [35]. The BINANA (BINding ANAlyzer) algorithm [31] was used to calculate the descriptors of the protein-ligand complexes from the proteins and ligands in .pdbqt format. A total of 348 BINANA descriptors were calculated for each protein and its ligand, covering covalent and noncovalent interactions such as electrostatic interactions, ligand atom types, rotatable bonds, salt bridges, hydrogen bonds, 'π' interactions, and binding pocket flexibility.

SMILES as Ligand Features
For the deep learning method, SMILES were used as one of the features of the protein-ligand complexes. The SMILES of the ligands were calculated from their .mol2 files using Schrodinger's structconvert utility. SMILES are in text format and were encoded as numbers to be used as input for the CNN. Unique indexes were created for the 94 characters from '!' to '~' in the ASCII table to represent all possible SMILES characters in any format. The mapping of each character to a unique index was stored in a Python dictionary, Dsmi = {'!':1, '"':2, '#':3, '$':4, . . . , 'C':35, . . . , '~':94}, which formed the vocabulary of possible characters in the SMILES. The SMILES of each ligand were encoded with unique numbers and either truncated or padded with 0s to the maximum allowed length L. The SMILES for the input compounds were converted to a 2D matrix of size K × L, where K was the number of input SMILES, and L = 325 was the allowed maximum length of the SMILES string used in the model.
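As a minimal sketch of this encoding (the helper function name and the example SMILES are illustrative, not from the original code), the character-to-index dictionary, truncation, and zero-padding can be written as:

```python
# Maximum allowed SMILES length used in the model (from the text)
L_MAX = 325

# Map the 94 printable ASCII characters '!'..'~' to indices 1..94;
# index 0 is reserved for padding. This reproduces Dsmi with 'C' -> 35.
Dsmi = {chr(c): i for i, c in enumerate(range(ord('!'), ord('~') + 1), start=1)}

def encode_smiles(smiles: str, max_len: int = L_MAX) -> list:
    codes = [Dsmi[ch] for ch in smiles[:max_len]]   # encode and truncate
    return codes + [0] * (max_len - len(codes))     # pad with 0s to max_len

# K input SMILES become a K x L integer matrix
matrix = [encode_smiles(s) for s in ["CCO", "c1ccccc1O"]]
```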

Protein Pocket Sequence as Input Features
We also used protein pocket sequences as input features for the HNN-affinity method. The sequences of the amino acids were extracted from the files in the .pdb format provided by PDBbind for the binding pockets of the protein-ligand complexes. The pocket sequences, also in text format, were number-encoded to be used as input for the CNN. Each of the 20 alphabetical characters representing the 20 amino acids was mapped to a unique index by a dictionary, Dseq = {'A':1, 'R':2, . . . , 'V':20}. The sequence of each pocket was encoded with unique numbers and either truncated or padded with 0s to the maximum allowed length M. The sequences were converted to a 2D matrix of size K × M, where K was the number of input sequences, and M = 150 was the allowed maximum length of the sequence used in the model.
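The pocket-sequence encoding follows the same pattern; a minimal sketch is below, where the ordering of amino acids in the string (beyond 'A':1, 'R':2, and 'V':20 stated above) and the example sequences are assumptions:

```python
# Maximum allowed pocket-sequence length used in the model (from the text)
M_MAX = 150

# Map the 20 standard one-letter amino-acid codes to indices 1..20;
# index 0 is reserved for padding. Consistent with Dseq = {'A':1, 'R':2, ..., 'V':20}.
AMINO = "ARNDCQEGHILKMFPSTWYV"
Dseq = {aa: i for i, aa in enumerate(AMINO, start=1)}

def encode_pocket(seq: str, max_len: int = M_MAX) -> list:
    codes = [Dseq[aa] for aa in seq[:max_len]]   # encode and truncate
    return codes + [0] * (max_len - len(codes))  # pad with 0s to max_len

# K input pocket sequences become a K x M integer matrix
pocket_matrix = [encode_pocket(s) for s in ["MKT", "GAVL"]]
```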

Hybrid Neural Network Model (HNN)
The deep learning-based hybrid neural network model (Figure 2) was developed using the Keras API in Python with TensorFlow as the backend. The HNN model consists of two convolutional neural networks (CNNs) and one multilayer perceptron (MLP)-type fully connected neural network (FCNN). One CNN (CNN-1) was used to train the model on the structure of the ligand in each complex using the number-encoded ligand SMILES of length L = 325. The second CNN (CNN-2) was used to train the model on the protein pocket sequence in each complex using the number-encoded pocket sequence of length M = 150. The MLP-type FCNN was used to train the model on the 348 BINANA descriptors that characterized the protein-ligand interactions. The L-length number-encoded SMILES was converted to a dense vector of size L × 100, and the M-length number-encoded pocket sequence to a dense vector of size M × 20, by the embedding layer of Keras. The CNNs used Conv1D convolution with a filter of size 3 to extract features from the inputs. The model used the ReLU activation function in both types of networks, with dropout and L2 regularization to prevent overfitting. GlobalMaxPooling1D was used in the CNNs to downsample the feature maps. The outputs of the pooling layers of the CNNs and the output of the FCNN were merged and processed by a final fully connected layer to predict the outcome using the linear activation function.
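A condensed sketch of this three-branch architecture in Keras is shown below. The input lengths (L = 325, M = 150, 348 descriptors), embedding sizes (100 and 20), kernel size 3, ReLU, GlobalMaxPooling1D, dropout, L2 regularization, and the final linear output follow the description above; the filter counts, dense-layer width, dropout rate, and L2 strength are placeholder assumptions, not the published hyperparameters.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

L, M, N_DESC = 325, 150, 348  # SMILES length, pocket length, BINANA descriptors
l2 = regularizers.l2(1e-4)    # assumed regularization strength

# CNN-1: number-encoded ligand SMILES (vocabulary 1..94, 0 = padding)
smiles_in = layers.Input(shape=(L,), name="smiles")
x1 = layers.Embedding(input_dim=95, output_dim=100)(smiles_in)
x1 = layers.Conv1D(32, 3, activation="relu", kernel_regularizer=l2)(x1)
x1 = layers.GlobalMaxPooling1D()(x1)

# CNN-2: number-encoded pocket sequence (vocabulary 1..20, 0 = padding)
seq_in = layers.Input(shape=(M,), name="pocket_seq")
x2 = layers.Embedding(input_dim=21, output_dim=20)(seq_in)
x2 = layers.Conv1D(32, 3, activation="relu", kernel_regularizer=l2)(x2)
x2 = layers.GlobalMaxPooling1D()(x2)

# MLP-type FCNN: 348 BINANA protein-ligand interaction descriptors
desc_in = layers.Input(shape=(N_DESC,), name="binana")
x3 = layers.Dense(64, activation="relu", kernel_regularizer=l2)(desc_in)
x3 = layers.Dropout(0.2)(x3)

# Merge the three branches; final dense layer with linear activation
merged = layers.Concatenate()([x1, x2, x3])
out = layers.Dense(1, activation="linear")(merged)

model = keras.Model([smiles_in, seq_in, desc_in], out)
model.compile(optimizer="adam", loss="mse")
```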

Parameter Optimization
Hyperparameter optimization was implemented using the tree-structured Parzen estimator (TPE) method within the Python hyperopt package to search for optimal parameters and improve the model's performance (Table 3). This approach requires defining the objective function that the fmin() function minimizes, the parameter space over which the search is performed, and the maximum number of evaluations to run.

Other Machine Learning Algorithms
To compare with the HNN-affinity prediction results, we also developed regression models based on the random forest (RF), gradient boosting (GB), and decision tree with AdaBoost (DT) machine learning methods. A consensus prediction was made based on the average prediction of the three methods, excluding the HNN.
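A minimal sketch of these three baseline regressors and the consensus average, using scikit-learn on synthetic data (the feature matrix, tree depth, and estimator counts are placeholders, not the study's settings):

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor)
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the descriptor matrix and pKd/pKi targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    # Decision tree as the AdaBoost base learner
    "DT": AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), random_state=0),
}

preds = {}
for name, m in models.items():
    m.fit(X, y)
    preds[name] = m.predict(X)

# Consensus: average of the three methods' predictions (HNN excluded)
consensus = np.mean([preds[n] for n in models], axis=0)
```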

Conclusions
In this study, we developed, implemented, and validated the HNN-denovo and HNN-affinity methods for ligand binding affinity prediction in de novo design campaigns. The HNN-denovo and HNN-affinity confer a higher prediction rate and, most of the time, successfully retrieve the cognate ligands for the proteins. The present version of the HNN-denovo method reached the highest prediction accuracy, with a Pearson correlation coefficient (PCC) of 0.86 between predicted and experimental affinities, based on the ~9000 protein-ligand complexes obtained from the PDBbind database. The HNN-denovo can generate small molecule hits specific to a protein binding site under the guidance of a drug-target binding affinity prediction structure model. The HNN-affinity can generate small molecule hits specific to a protein binding site under the guidance of protein binding site amino acid sequence patterns. The known limitation of structure-based machine learning methods in predicting ligand binding affinity is that they require an experimental protein-ligand structural complex or an accurately docked ligand structural complex. The absence of such a protein-ligand structural complex, or an inaccurate one, will lead to an inaccurate ligand affinity prediction. Consequently, a machine learning prediction method that includes a protein sequence-based prediction framework is needed. Our HNN-affinity method can predict ligand affinity based on the protein amino acid sequence as input features in the absence of protein structural information. Notably, HNN-affinity is particularly useful when the protein-ligand complex structure is unavailable, when dealing with undruggable targets, and when targeting intrinsically disordered proteins. Further, the HNN-affinity can predict different binding affinities such as KD, KI, and IC50, as it is trained on thousands of known protein-ligand complexes with known KD, KI, and IC50 values.
In summary, we have demonstrated that the HNN method shows promising performance for diverse data sets and outperforms the other deep learning methods for balanced and unbalanced data sets. These results suggest that our HNN method could be implemented in combination with molecular docking-based virtual screening campaigns, particularly when screening billions of compounds, and we are adapting it to a molecular docking-based large-scale virtual screening pipeline. The HNN method is highly scalable and uses Keras and TensorFlow to generate and train the neural networks. We ran it on our GPU Linux server, and the HNN virtual screening simulations can be scaled to high-performance GPUs on cloud ML platforms to screen billions of diverse compounds rapidly while reducing computational time.

Limitations
The HNN is complex when applied with a very large number of parameters for the CNN and FCNN. As a result, the HNN method suffers from higher variance in comparison to other machine learning methods such as RF, which is visible in the box plots in Figures 3-7 as a wider interquartile range (IQR). Further, in this study, we did not apply a chi-square goodness-of-fit test, because the binding affinity prediction models give continuous outcomes, whereas chi-square tests apply to data with categorical outcomes.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.