Machine Learning Models for the Classiﬁcation of CK2 Natural Products Inhibitors with Molecular Fingerprint Descriptors

: Casein kinase 2 (CK2) is considered an important target for anti-cancer drugs. Given the structural diversity and broad spectrum of pharmaceutical activities of natural products, numerous studies have been performed to prove them as valuable sources of drugs. However, there has been little study relevant to identifying structural factors responsible for their inhibitory activity against CK2 with machine learning methods. In this study, classiﬁcation studies were conducted on 115 natural products as CK2 inhibitors. Seven machine learning methods along with six molecular ﬁngerprints were employed to develop qualitative classiﬁcation models. The performances of all models were evaluated by cross-validation and test set. By taking predictive accuracy(CA), the area under receiver operating characteristic (AUC), and (MCC)as three performance indicators, the optimal models with high reliability and predictive ability were obtained, including the Extended Fingerprint-Logistic Regression model (CA = 0.859, AUC = 0.826, MCC = 0.520) for training test andPubChem ﬁngerprint along with the artiﬁcial neural model (CA = 0.826, AUC = 0.933, MCC = 0.628) for test set. Meanwhile, the privileged substructures responsible for their inhibitory activity against CK2 were also identiﬁed through a combination of frequency analysis and information gain. The results are expected to provide useful information for the further utilization of natural products and the discovery of novel CK2 inhibitors.


Introduction
Casein kinase 2 (CK2) is involved in multiple cellular processes through phosphorylation of various substrates [1,2]. Deregulated CK2 activity is related to a variety of solid tumors, including lung, liver, and gastric cancers. A growing amount of evidence suggests that CK2 is involved in the infection of SARS-CoV2 and vaccinia and promotes rapid cell-to-cell spread [3,4]. Thus, the pharmacological intervention of CK2 has been considered a potential strategy for anti-cancer and anti-viral therapy.
The heterotetrameric CK2 holoenzyme is composed of two catalytic subunits (CK2α or CK2α') and two regulatory subunits (CK2β). Like other kinases, the ATP-binding pocket of CK2α has been considered as the orthosteric site to design ATP-competitive inhibitors with diverse scaffolds including benzimidazole, anthraquinone, tricyclic quinolone, and natural products [5,6]. However, except for CX-4945, which has advanced through clinical trials [7], most inhibitors are precluded from being drug candidates because of cytotoxicity, genotoxicity, and other pharmaceutics deficiencies [8,9]. Attempts to optimize inhibitors presenting polycyclic scaffolds have faced challenges overcoming these drawbacks. A strategy was proposed to discover more potent CK2 inhibitors with novel non-polycyclic scaffolds that permit the exploration of diverse pharmacophoric fragments. Therefore, it would be meaningful to identify key groups that make dominant contributions to CK2 inhibitory activity.
Natural products are regarded as valuable sources of drug leads [10][11][12]. Different classes of natural products, namely coumarins, flavones, anthraquinones, etc., have been identified as CK2 inhibitors. Recently, the method of computer-aided drug design (CADD) has accelerated the discovery of CK2 natural product inhibitors [13,14]. Crystal structures of CK2α-inhibitors complexes (available from Protein Data Bank) and molecular modeling studies elucidated the binding modes of compounds with CK2α. The ATP-binding pocket of CK2 is composed of a hydrophobic pocket, a positive area, and a hinge region [15,16]. More specifically, heterocyclic scaffolds of inhibitors (2-3 aromatic rings) aresandwiched in the hydrophobic site consisting of Leu45, Val53, Val66, Ile95, Phe113, and Ile174. Meanwhile, it can be found that amino, hydroxyl, and nitrogen heterocycles tend to establish hydrogen bonds (H-bonds) with Glu114 or Val116, as well as halogen bonds between halogen atoms and carbonyl oxygen atoms of Glu114 and Val116. In contrast, only a few groups, such as hydroxyl and carboxylate groups, make electrostatic interactions with Lys68 of the positive area. As the most used ligand-based drug design methods, Quantitative Structure-Activity Relationship (QSAR/3D-QSAR) and pharmacophore screening are powerful tools to identify structural features and properties of inhibitors that are strictly connected with their biological activities. Zhang et al. built the ligand-based and receptor-based 3D-QSAR models of coumarin [17], which provided structural clues for the optimization of CF 3 substituted coumarinderivates [18]. However, 3D-QSAR studies were based on compounds with one certain scaffold, and thus the generated models only gave exclusive hints for the specific scaffold. As a complementary method to overcome the applicability domain of QSAR, pharmacophore hypothesis-based virtual screening can be used to identify novel inhibitors by mapping the pharmacophoric features generated by the training set [19,20], but it cannot take into account inhibitors with different mechanisms of action at the same time. Since machine learning methods are considered as useful tools of drug discovery [21,22], classification studies of machine learning methods along with molecular features have the advantage of taking compounds with diverse chemical scaffolds as data samples, as well as considering multiple inhibition mechanisms at the same time, and have been widely used in drugsvirtual screening [23] and the prediction of kinase inhibitors binding modes [24].
At present, most studies are focused on the QSAR studies and optimization of one type of natural product. Since natural products possess diverse chemical structures and broad-spectrum bioactivities, it is appropriate to perform classification studies to identify structural groups related to their inhibitory activities and provide comprehensive structural clues for the discovery of novel CK2 inhibitors. In this study, classification models of 115 natural products were established by 7 machine learning methods combined with 6 molecular fingerprints, and privileged substructures responsible for CK2 inhibitory activity were also identified. It is expected that this study will offer an initiative for the development of novel CK2 inhibitors as anti-cancer and anti-viral drugs.

Data Collection and Chemical Space Distributions
With the consideration of molecular scaffolds as diverse as possible, 115 CK2 natural products inhibitors were collected from published literature [25][26][27][28]. Based on the inhibitory activity distribution of these inhibitors, a cut-off value (IC 50 = 10 µM) was used as a threshold to define the compounds as "P" (≤10 µM, active inhibitors) and "N" (>10 µM, inactive inhibitors). Then all these compounds were randomly divided into a training set and a test set at a ratio of 4:1 as listed in Table S1 (Supplementary Materials).
In order to evaluate chemical space distributions of the entire data set, 22 2D molecular descriptor groups (e.g., constitutional indices, charge descriptors, ring descriptors, topological indices, connectivity indices, etc.) were calculated by DRAGON 7.0 [29], and 3822 molecular descriptors were further evaluated by principal components analysis to identify the featured molecular descriptors. The five descriptors of Lipinski rules were also plotted into a radar chart to observe the chemical space distribution oftheentire dataset. In addition, the complexity of a molecule (FMF), sum of the atomic polarizabilities (apol), topological polar surface area based on fragment contributions (TopoPSA), kappa shape indices (Kier), topological charge (JGT), van der Waals volume (VABC), relative molecular mass (MW), and lipid water partition coefficient (ALogP) wasalso explored to plot scatter plots in data with different labels in Figure S1 (Supplementary Materials). As Euclidian distances could be taken as an evaluator to reflect the molecular similarity of compounds, a heat map of Euclidian distance metrics based on PubChem fingerprints wasalso constructed.

Molecular Fingerprints and Machine Learning Methods
Molecular fingerprints are generally presented as a fixed-length string of numbers 1 and 0 to characterize a molecule which could be characterized by a binary string of structural information. The number 1 indicates that a substructure exists in the given molecule, while the number 0 shows the exact opposite. In this study, six fingerprints [30], namely Extended fingerprint (Ext 1024 bits), Estate fingerprint (Est 79 bits), Molecular Access Systemfingerprint (MACCS 166 bits), PubChem fingerprint (PubChem 881 bits), CDK graph only fingerprint (Graph 1024 bits) and Substructure fingerprint (Sub 307 bits) were used. All six fingerprints were calculated using the PaDEL-Descriptor software [31].
KNN: Based on the idea of maximum likelihood estimation [32], this method is used to find the distance between new input samples and training samples in feature spaces. In this work, the nearness was determined by Euclidean distance and distance weighted parameters with a value of five.
LR: This method is commonly used for classifying a binary response to minimize the loss function of the regression to obtain the unknown parameters [33]. Two possible response values were labeled with the symbols "0" and "1", and the predicted value range was mapped to [0, 1], and finally, the classification was realized.
NB: NB algorithm is based on the principle of probability [34]. According to the known prior probability, the Bayes formula is aimed at finding the posterior probability that a sample belongs to a certain class and then selecting the class with the highest posterior probability as the class to which the sample belongs.
ANN: ANN is used to identify the complicated non-linear relationship between the dependent and the independent variable for classification and regression [35]. The network includes an input layer, an output layer, and a hidden layer. In this study, the hidden layer was set to 200, and other parameters were default.
SVM:SVMis a kernel-based algorithm toper form binary classification with the principle of structural risk minimization [36]. The input variables in the low-dimensional space are mapped to the high-dimensional space by changing the kernel function, and then a support vector (a binary string of each sample) is used to find the maximum interval in the new space. Finally, an optimal classification hyperplane is generated to discriminate samples from different categories. In our study, the Gaussian Radial basis function (RBF) kernel was selected, and the cost was set to 1.00.
RF: RF is an ensemble learning method for classification and regression [37]. The forest is assembled by trees, and each tree is formed from a bootstrapped sample of the training set. The process of RF establishment is to construct multiple decision trees by randomly extracting different features and different samples. The classification results depend on the majority of the individual tree's output. In this study, the number of trees was set as 20 in the forest.
Tree:The basic idea of a Tree is the recursive division of independent variable spaces [38]. A decision tree includes decision nodes, branches, and leaves, for a categorical dependent variable described by one or more predictor variables. The key to constructing a decision tree model is the branching criteria of the tree, the conditions under which the tree stops growing, and the pruning of the tree. In this study, the minimal number of instances in leaves was set to three, and stop splitting nodes with instances less than five.

Model Performance Evaluation
A 10-fold cross-validation and test set validation were performed to evaluate the performance of models. The parameters, including true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), were considered to evaluate the quality of each model. Furthermore, the preferred indicators, predictive accuracy of "P" class (sensitivity, SE), "N" class (specificity, SP), and overall predictive accuracy (CA) were calculated by the following formulas [39]. Taking into account all of the confusion matrix elements, positive predictive value (PP), negative predictive value (NP), and the Matthews correlation coefficient (MCC) were also explored to measure the correlation between the true class labels and the predicted labels.
In addition, TP and FP rates were also used to plot the receiver operating characteristic (ROC) curve, the area under which (AUC) was another parameter to evaluate the quality of the model ranging from 0.5 (no discriminative ability) to 1 (perfect classifier) [40].

Identification of Privileged Substructures
The privileged substructures referred to pharmacophoric groups that were responsible for their CK2 inhibitory activity. Information gain (IG) along with substructure frequency [41] was analyzed to identify the privileged groups, which offered the potential determinant fragments for novel CK2 inhibitors discovery. In case a substructure frequently appeared in the "P" category, this fragment was considered a privileged substructure of CK2 inhibitors. The frequency of fragments is defined below [42].
where N p f ragment , N is the number of compounds in "P" class containing the fragment, N is the total number of compounds in the data set, N f ragment , N is the number of compounds having the fragment in the data set, and N P , N represents the total number of "P" compounds in the data sets.
In order to elucidate the roles of privileged substructures binding to CK2, molecular docking was performed to identify the potential position and orientation of compounds 2 and 73 in the active pocket using Genetic Optimization for Ligand Docking (GOLD) version 4.0 [43]. The active site of CK2 was defined as the collection of amino acids enclosed within a 6.5 Å radius sphere centered on Ellagic [44]. TheGOLD score and default genetic algorithm (GA) parameters were applied to predict the binding modes between CK2 and inhibitors.

Dataset Analysis
The reported 115 natural product inhibitors were randomly divided into a training set and test set with a ratio of 4:1. According to the classification criteria node IC 50 = 10 µM, all molecules were split into "P"(88) and "N" (27), among which 92 training set compounds (73 inhibitors and 19 non-inhibitors)and 23 test set compounds (15 inhibitors and 8 non-inhibitors), respectively. The experimental pIC 50 values for the entire dataset were mainly distributed between −6.5 and -4.5 (shown in Figure 1A). Therefore, each group presented the roughly balanced distribution of "P" inhibitors (training group = 79%, testing group = 65.2%), which was suitable to evaluate the predictive performance of the models. Chemical diversity is a determinative factor for the development of robust and predictable models. Here, chemical diversity was characterized by the chemical space (defined by PCA analysis of featured molecular descriptors) and a heat map of molecular similarity constructed by Euclidian distance metrics (calculated by PubChem fingerprint). As shown in Figure 1B, the top 3 most important distributions were used to generate a three-dimensional distribution plot for all 115 compounds. Since these 3 components explained 57% of the featured descriptor variance in this dataset, Figure 1B can be viewed as the representation of chemical space covered by all compounds, which indicates the training set and test set are basically covered in the same spatial distribution. Additionally, the radar chart displayed the five descriptors of Lipinski rules of molecules used. It can be seen from Figure 1C that the molecular property of the entire dataset was similar and did not exhibit preference. Therefore, it is relatively reliable to evaluate the performance of models constructed by the training set using the prediction results of the test set. Furthermore, molecular diversity was analyzed by the heat map constructed by Euclidian distance metrics. Red (1) and blue (0) indicate the highest and lowest diversity of molecules, respectively. As illustrated in Figure 1D, most plots were distributed in the green area (around 0.4), which meansthe data set presented highdiversity. To a certain extent, the dataset possessed similar chemical spaces and high structural diversity, which meets the basic requirement of a reliable classification model.

Performance of 10-Fold Cross-Validation
The performances of 42 models were first evaluated with a 10-fold cross-validation method, which was performed 10 times based on the split training set with the ratio of 1:9. The results of cross-validation (CA, AUC, SE, and SP values) are shown in Figure 2. Considering the CA and AUC values as two key indicators of a reliable model, 10 models were roughly defined as the first 10 rankings, with the CA ranging from 0.783 to 0.859 and AUC ranging from 0.760 to 0.875. However, their SE values and SP values were in the range of 0.88 to 0.99 and 0 to 0.53, respectively, which means that these models had a good predictive ability for the active inhibitors other than the inactive inhibitors. It may be speculated that the higher ratio of active inhibitors to inactive inhibitors in the data set was the underlying reason.
In order to better understand the top 10 models, Table 1 also lists their detailed performance for the training set. From the results of the 10-fold cross-validation, we can see that the same molecular fingerprint generated models with the prediction capabilities to a different extent when combined with different machine learning algorithms. No matter which machine learning method was used, both Ext and PubChem fingerprints generally yielded the optimal results. Among these models, Ext-LR and PubChem-RF gave the top two predictive results by considering AUC and MCC as indicators.
PubChem fingerprint (881 bits) is based on the well-defined structural fragments dictionary, which is full of structural information. It is a public, freely accessible platform for mining biological information resources of small molecules [45]. Ext fingerprint (1024 bits) is an extension of the CDK fingerprint, which takes into account the nature of the ring, including rich structural information [46]. In this study, the data set processed the polycyclic structure features and aromatic rings. Consequently, PubChem and Ext fingerprints may well characterize their structural information.

Performances of Test Set
The test set was used for testing the predictive ability of the 10 best models. As shown in Table 2, their CA values and AUC values ranged from 0.652 to 0.826 and 0.654 to 0.933. Interestingly, the higher SE values relative to SP, similar to those of the 10-fold cross-validation results, indicated that all models had good predictive performance for "P" inhibitors, especially the highest accuracy of 100% for "P" Inhibitors of the Ext and PubChem fingerprint-based models. However, the Sub-LR model presented the best accuracy for non-inhibitors compounds (SP = 0.63). The higher prediction accuracy of inhibitor compounds could be explained by the uneven distribution of inhibitors and non-inhibitors in the test set. Among these models, the PubChem-ANN model (CA = 0.826, AUC = 0.933, MCC = 0.628) yielded the best performance, followed byExt-LR (CA = 0.826, AUC = 0.800, MCC = 0.628), and Ext-SVM (CA = 0.652, AUC = 0.850, MCC = 0.652) models for the test set.     (1) FP and FN represent frequencies of substructure presenting in "P" and "N" class, respectively.
Based on the performances of models for the training set and test set, both PubChem and Ext were involved in the optimal models for the entire dataset. Compared with other machine learning algorithms, LR and ANN produced the preferred classification models for CK2 natural products inhibitors with robustness and good predictive ability

Identification of Privileged Substructures
Information gain (IG) and substructure frequency analysis were performed to identify the privileged pharmacophoric fragments based on PubChem fingerprint. The higher IG values, the greater roles of these substructures responsible for their inhibitory activity. As listed in Table 2, 9 privileged substructures were found to appear in active inhibitors ("P") more frequently than inactive inhibitors ("N"). Specifically, the former 7 fragments were present in active inhibitors, and the latter two substructures were identified in 10 (corresp. 16) active inhibitors and one inactive inhibitor. Consequently, these fragments played the determinant roles for their inhibitory activity based on their presence in each class and also could be used as structural signs to discover and screen novel CK2 inhibitors. Meanwhile, reprehensive compounds were docked into the CK2 binding pocket to elucidate the binding modes of privileged substructures with key residues.
For example, the >= 3 saturated or aromatic benzene rings (PubChemFP193) appeared in most active natural products as the planar aromatic scaffolds were sandwiched in the hydrophobic area of the CK2α binding site to anchor its scaffold. This is consistent with the aromatic scaffolds found in most known CK2α inhibitors. The halogenated benzene substituted with a hydroxyl (PubChemFP505, PubChemFP551, and PubChemFP827) or an alkoxyl (PubChemFP806 and PubChemFP807) were found in compounds 4 (pIC 50 = 7.74) and 5 (pIC 50 = 7.70), in which the electronic attractive substituent-halogen atoms at R 4 promoted the neighbored hydroxyl to be an ionizable negative oxygen atom which formed polar interactions with Lys68 of the positive area of CK2α ( Figure 3A). This conclusioncould explain the activity variations of different structures. By comparing the structures and activities of compounds 73 and 93, a chlorine atom at R 4 generated a 30-fold increased inhibitory activityin contrast to a hydroxyl at the corresponding position. This is consistent with molecular modeling studies of Non-R2 carboxylate-substituted tricyclic Quinoline compounds, which elucidated inappropriate electrostatic interactions between the Non-R2 carboxylate group and the positive region followed by the reorientation of tricyclic skeletons [47]. Another privileged substructure is the halogen-substituted benzene fragment which is referred to as PubChemFP38, PubChemFP815, and PubChemFP785. As indicated from compound 2, this substructure pointed to the hinge region and formed the halogen bond with Glu114 or Val116 ( Figure 3B). Therefore, it is not strange that compound 2 (pIC 50 = 8.00) substituted with 2 halogen atoms at R 7 displayed a 100-fold higher activity than compound 34 (pIC 50 = 6.10) with 2 hydrogen atoms at the same position. Halogen bonds have been considered as a significant interaction involved between ligands and receptors [48]. Additionally, amide groups (PubChemFP439) were identified as pharmacophoric groups and were also included in the design of 2-propenone derivatives as CK2α inhibitors(IC 50 = 0.6 µM) [49]. By introducing the mentioned privileged substructures on the non-polycyclic2-propenone scaffold, more potent CK2 inhibitors are expected to be identified (Undergoing work).

Conclusions
In this study, seven machine learning methods, along with six molecular fingerprints, were employed to establish classification models of 115 CK2α natural product inhibitors. After 10-fold cross-validation and external test set evaluation, the accuracy of the training set and test set ranged from 0.783 to 0.859 and from 0.652 to 0.826, respectively, and the optimal model was obtained using Ext fingerprint combined with the LR algorithm (for the training set) and PubChem fingerprint along with the ANN method(for the test set). Furthermore, information gain and substructure frequency analysis were performed to identify pharmacophoric substructures related to their inhibitory activity. In summary, our research provided optimization clues for the further discovery of novel CK2 inhibitors.