Identification of Pharmacophoric Fragments of DYRK1A Inhibitors Using Machine Learning Classification Models

Dual-specific tyrosine phosphorylation regulated kinase 1 (DYRK1A) has been regarded as a potential therapeutic target of neurodegenerative diseases, and considerable progress has been made in the discovery of DYRK1A inhibitors. Identification of pharmacophoric fragments provides valuable information for structure- and fragment-based design of potent and selective DYRK1A inhibitors. In this study, seven machine learning methods along with five molecular fingerprints were employed to develop qualitative classification models of DYRK1A inhibitors, which were evaluated by cross-validation, test set, and external validation set with four performance indicators of predictive classification accuracy (CA), the area under receiver operating characteristic (AUC), Matthews correlation coefficient (MCC), and balanced accuracy (BA). The PubChem fingerprint-support vector machine model (CA = 0.909, AUC = 0.933, MCC = 0.717, BA = 0.855) and PubChem fingerprint along with the artificial neural model (CA = 0.862, AUC = 0.911, MCC = 0.705, BA = 0.870) were considered as the optimal modes for training set and test set, respectively. A hybrid data balancing method SMOTETL, a combination of synthetic minority over-sampling technique (SMOTE) and Tomek link (TL) algorithms, was applied to explore the impact of balanced learning on the performance of models. Based on the frequency analysis and information gain, pharmacophoric fragments related to DYRK1A inhibition were also identified. All the results will provide theoretical supports and clues for the screening and design of novel DYRK1A inhibitors.


Introduction
Protein kinases are implicated in cellular functions by transferring a chemical addition of phosphate group to proteins [1]. Dual-specific tyrosine phosphorylation regulated kinases (DYRKs), which include class I (DYRK1A and DYRK1B) and class II (DYRK2, DYRK3 and DYRK4) possess a dual specificity capability to phosphorylate tyrosine residue Y321 in its own activation loop as well as to phosphorylate its substrates at serine or threonine residues [2]. Among the mammalian DYRKs, DYRK1A is involved in the proliferation and differentiation of the central nervous system (CNS) ranging from early embryogenesis to late aging. Consequently, DYRK1A has been considered as a potential therapeutic target of neurodegenerative diseases such as Alzheimer's disease (AD) and Down's syndrome (DS) [3]. Inhibition of DYRK1A attenuates cognitive dysfunctions in animal models for both AD and DS disease [4]. among which 88 training set compounds (69 potent inhibitors and 19 non-potent inhibitors with of the ratio of 3.5:1) and 29 test set compounds (20 potent inhibitors and nine non-potent inhibitors with the ratio of 2:1), respectively. Given the roughly balanced distribution of "P" inhibitors (training set = 78.4%, test set = 68.9%), each group was suitable to evaluate the predictive performance of the models.
Chemical diversity is an important index of similarity among molecules to build a robust and predictable model. A heat map constructed by Euclidian distance metrics (calculated by PubChem fingerprint) was used to characterize the chemical diversity of molecules. As indicated from Figure 1B, red (1) and blue (0) illustrated the highest and lowest diversity of molecules, respectively. Most plots were distributed in the green area (around 0.4), which indicated that the dataset presented high diversity, and the models trained based on such data can have strong generalization ability. Chemical spaces of the whole dataset were investigated based on PCA analysis of featured molecular descriptors, four descriptors of Lipinski rules, and the number of rotatable bonds. Since 51% featured descriptor variance of this dataset was explained by the top three most principal components, the training set and test could be regarded as in the similar chemical spaces ( Figure  1C). Additionally, there is no remarkable preference of four descriptors of Lipinski's rules of five and the number of rotatable bonds of the entire dataset ( Figure 1D). Consequently, in light of the chemical spaces and structural diversity of the whole dataset, the basic requirements of a reliable classification model were qualified. Chemical diversity is an important index of similarity among molecules to build a robust and predictable model. A heat map constructed by Euclidian distance metrics (calculated by PubChem fingerprint) was used to characterize the chemical diversity of molecules. As indicated from Figure 1B, red (1) and blue (0) illustrated the highest and lowest diversity of molecules, respectively. Most plots were distributed in the green area (around 0.4), which indicated that the dataset presented high diversity, and the models trained based on such data can have strong generalization ability. Chemical spaces of the whole dataset were investigated based on PCA analysis of featured molecular descriptors, four descriptors of Lipinski rules, and the number of rotatable bonds. Since 51% featured descriptor variance of this dataset was explained by the top three most principal components, the training set and test could be regarded as in the similar chemical spaces ( Figure 1C). Additionally, there is no remarkable preference of four descriptors of Lipinski's rules of five and the number of rotatable bonds of the entire dataset ( Figure 1D). Consequently, in light of the chemical spaces and structural diversity of the whole dataset, the basic requirements of a reliable classification model were qualified.

Five-Fold Cross Validation Results
A 5-fold cross-validation for the training set was performed to evaluate the performance of all developed models. The top ten ranking modes were screened by taking overall predictive classification accuracy (CA) and the area under the ROC curve (AUC) as indicators. As shown in Figure 2A, all the models yielded CA and AUC values higher than 0.6. Interestingly, regardless of which ML method was employed, Ext fingerprint yielded the lowest CA and AUC values under 0.8, which were smaller than those of the other models with CA and AUC values (ranging from 0.8 to 0.9). Meanwhile, in light of the higher predictive accuracy of "P" class (sensitivity, SE) than "N" class (specificity, SP) values shown in Figure 2B, it can be speculated that all modes exhibited a better predictive ability for "P" inhibitors than "N" inhibitors. This may be due to the unbalancing "P" inhibitors in the training set with a ratio of 0.78.

Five-Fold Cross Validation Results
A 5-fold cross-validation for the training set was performed to evaluate the performance of all developed models. The top ten ranking modes were screened by taking overall predictive classification accuracy (CA) and the area under the ROC curve (AUC) as indicators. As shown in Figure 2A, all the models yielded CA and AUC values higher than 0.6. Interestingly, regardless of which ML method was employed, Ext fingerprint yielded the lowest CA and AUC values under 0.8, which were smaller than those of the other models with CA and AUC values (ranging from 0.8 to 0.9). Meanwhile, in light of the higher predictive accuracy of "P" class (sensitivity, SE) than "N" class (specificity, SP) values shown in Figure 2B, it can be speculated that all modes exhibited a better predictive ability for "P" inhibitors than "N" inhibitors. This may be due to the unbalancing "P" inhibitors in the training set with a ratio of 0.78.  Table 1 lists the detailed performances of ten models. Both PubChem and Sub fingerprints generated the optimal results in general, especially five modes derived from Pub-Chem fingerprint combined with any ML algorithm. Taking the CA, AUC, BA (average of SE and SP), and Matthews's correlation coefficient (MCC) as indicators, PubChem-SVM and PubChemFP-ANN were the top two models with high predictive ability.   Table 1 lists the detailed performances of ten models. Both PubChem and Sub fingerprints generated the optimal results in general, especially five modes derived from PubChem fingerprint combined with any ML algorithm. Taking the CA, AUC, BA (average of SE and SP), and Matthews's correlation coefficient (MCC) as indicators, PubChem-SVM and PubChemFP-ANN were the top two models with high predictive ability.

Performance of the Test Set
The predictive ability of ten models was further evaluated by a test set. Similar to those of the five-fold cross-validation results, AUC and CA values were in the range of 0.8 to 0.9 in ten models for the test set, whereas the SE and SP values exhibited different behaviors such as the higher SE than SP in Sub-, MACCS-, and Estate-based and PubChem-RF models, and similar values of SE and SP (0.800 or 0.850 and 0.899, respectively) in four PubChem-based models. Consequently, in light of the higher CA, AUC, MCC, and BA values, PubChem-SVM and PubChem-ANN yielded good performances for the test set.

Predicted Results of External Validation Set
Although several models based on the PubChem and Sub fingerprint generated good performance for the training and test set, these models were still needed to be further evaluated by an external validation set. The external validation set contained different types of compounds collected from the relevant literature including benzofuranyl, indole, pyrrolidinyl, and carbazolyl derivatives, which were not included in the training and test set. The predicted results of the external validation set are shown in Table 2. In light of the AUC, CA, and MCC values as the performance indicators, the Sub-ANN and PubChem-ANN models gave the top three predictive results for the external validation set in general. It could also be found that all models obtained higher SE values than SP values, which reflected that these models generated better predictive ability for "P" compounds of the external validation set.

Improved Performance of Balanced Models
Given the high degree of imbalance of the datasets, a hybrid balancing method SMOETL was applied to explore the impact of balanced learning on the performance of models. Here, the comparisons between balanced and imbalanced modes are discussed. As indicated in Table 2, the performances of balanced models were remarkably improved in contrast to those of the imbalanced models for the training set, especially the similar values of SE and SP and the approximate BA values of 0.95, which implied that these models have predictive capacity for both "P" and "N" samples. For the test set, except for the higher SE than SP occurring in all balanced modes (vs. the similar SE and SP of four imbalanced models), there were no observable changes (AUC, CA, MCC, and BA values) between the imbalanced and balanced models. All the balanced models were sensitive to predict potent compounds for the test set. Regarding the external validation set, balanced models such as MACCSFP-LR and Sub-LR were the most predictive models for the external validation set. Compared to the statistical parameters of the imbalanced models, the reduced gap between SE and SP values occurred in the balanced models, which meant that the SMOTETL balancing method indeed enhanced the performances of models for the external validation test. However, since four compounds (123, 125, 126, and 131) in the external validation set possessed a unique chemical structure with four heterocyclic rings connected by fusing or double bonds, which were different to the structural features of the training set composed of three heterocyclic rings with a single bond topological connection, the classifier developed based on the training set generated a weaker performance on the external set. Meanwhile, due to the smaller sample in the external validation set (10 "P" and 5 "N"), balanced learning only generated a slight improvement in the prediction capacity. Interestingly, in light of the high ratio of positive compounds in the external validation set, which shared the similar chemical scaffolds with training set, both imbalanced and balanced models were applicable to predict the "P" class.
Based on the performances of models for the training set, test set, and external test set, thee PubChem fingerprint was involved in the best model for the entire dataset. ANN and SVM were considered as optimal methods for establishing the classification prediction model of DYRK1A heterocyclic inhibitors. As PubChem fingerprint (881 bits) is a welldefined structural fragment dictionary with a variety of different substructures and features, the chemical fragments responsible for DYRK1A inhibition will be analyzed based on the PubChem fingerprint.

Identification and Analysis of Feature Substructures
In order to identify chemical fragments responsible for DYRK1A inhibition, IG and frequency analysis were performed to screen the feature substructures based on the PubChem fingerprint. The higher the IG values, the greater contribution of the feature substructures to DYRK1A inhibition. Sixteen positive and 10 negative fingerprints that occur more frequently in "P" inhibitors and "N" inhibitors responsible for DYR1KA modula-tion/inhibition are presented in Table S4. Since positive fingerprints could be used as the structural signs to discover and screen novel potent DYRK1A inhibitors, representative substructures (Table 3) presented in "P" class with high ratios were focused on and discussed in detail.      (2) PubchemFP703      (2) PubchemFP703 Molecules 2022, 27, x FOR PEER REVIEW 8 of 15                     Fragment-based drug design including fragment-based growing and/or linking strategies and fragment hybridization strategies has been widely used in drug discovery by assembling fragments into novel molecules. Here, the privileged substructures abovementioned provide structural elements for the discovery of DYRK1A inhibitors. AlNajjar et al. [21] reported that 6-hydroxybenzothiazole urea derivatives b27 displayed the highest potency against DYRK1A with an IC50 of 20 nM, which could be considered as an acetamide group-based linking compound by connecting 6-hydroxybenzothiazole and a phenyl acetamide together. Docking results indicated that hydrogen bonds were formed between b27 and key residues Leu241, Glu239, and Lys188 and/or Asp307 in the active site of DYRK1A including acetamide groups with the hinge region backbone of Leu241 and Glu239 as well as between the OH of 6-hydroxybenzothiazole and the conserved Lys188 and/or Asp307. Meanwhile, the above-mentioned substructures such as pyridine and hydroxyl-phenyl ring were also presented in 1H-pyrazolo [3,4-b]pyridine derivatives 8 h (IC50 = 5 nM) [22]. Among them, the 1H-pyrazolo [3,4-b]pyridine scaffold were involved in the interactions with the Glu239 backbone CO and Leu241 backbone NH in the hinge region and the phenol ring formed electrostatic interaction with Lys188.
Besides the validation of classification modes by the reported inhibitors, a hybrid virtual screening of natural products including pharmacophore hypothesis, classification models, and molecular docking was also performed to identity DYRK1A inhibitors with new scaffolds [undergoing work]. As shown in Figure 4, five compounds (CP1, CP4, CP11, CP12, and CP23) were screened as the top five theoretical hits, which will be further evaluated and tested by the kinase assay. Interestingly, five hits possessed the privileged fragments derived from classification models, which were predicted to produce polar interactions with the hinge region and/or positive area as the above-mentioned inhibitors form ( Figure 5). For instance, 2-benzyloxy of CP1 and CP12 may have polar interactions with a side chain NH2 group with a Lys188 and Leu241 backbone NH, respectively. It is possible for the acetamide group of CP4 to interact with Lys188. Imidazole (CP111) and isoxazole (CP23) are also expected to form hydrogen bonds with Leu241 and/or Ser242 of the hinge  Fragment-based drug design including fragment-based growing and/or linking strategies and fragment hybridization strategies has been widely used in drug discovery by assembling fragments into novel molecules. Here, the privileged substructures abovementioned provide structural elements for the discovery of DYRK1A inhibitors. AlNajjar et al. [21] reported that 6-hydroxybenzothiazole urea derivatives b27 displayed the highest potency against DYRK1A with an IC50 of 20 nM, which could be considered as an acetamide group-based linking compound by connecting 6-hydroxybenzothiazole and a phenyl acetamide together. Docking results indicated that hydrogen bonds were formed between b27 and key residues Leu241, Glu239, and Lys188 and/or Asp307 in the active site of DYRK1A including acetamide groups with the hinge region backbone of Leu241 and Glu239 as well as between the OH of 6-hydroxybenzothiazole and the conserved Lys188 and/or Asp307. Meanwhile, the above-mentioned substructures such as pyridine and hydroxyl-phenyl ring were also presented in 1H-pyrazolo [3,4-b]pyridine derivatives 8 h (IC50 = 5 nM) [22]. Among them, the 1H-pyrazolo [3,4-b]pyridine scaffold were involved in the interactions with the Glu239 backbone CO and Leu241 backbone NH in the hinge region and the phenol ring formed electrostatic interaction with Lys188.
Besides the validation of classification modes by the reported inhibitors, a hybrid virtual screening of natural products including pharmacophore hypothesis, classification models, and molecular docking was also performed to identity DYRK1A inhibitors with new scaffolds [undergoing work]. As shown in Figure 4, five compounds (CP1, CP4, CP11, CP12, and CP23) were screened as the top five theoretical hits, which will be further evaluated and tested by the kinase assay. Interestingly, five hits possessed the privileged fragments derived from classification models, which were predicted to produce polar interactions with the hinge region and/or positive area as the above-mentioned inhibitors form ( Figure 5). For instance, 2-benzyloxy of CP1 and CP12 may have polar interactions with a side chain NH2 group with a Lys188 and Leu241 backbone NH, respectively. It is possible for the acetamide group of CP4 to interact with Lys188. Imidazole (CP111) and isoxazole (CP23) are also expected to form hydrogen bonds with Leu241 and/or Ser242 of the hinge region. Most DYRK1A inhibitors possessed a heterocyclic scaffold corresponding to the privileged fragment of ≥3 hetero-aromatic rings (PubChemFP260), which was strongly associated with the protein by hydrophobic interactions with Ile165, Val173, Ala186, Leu241, Leu294, and Val306 side chains. Furthermore, the hetero-aromatic pyridine (PubchemFP187 and PubchemFP188) was also involved to produce a hydrogen bond with a backbone NH of Leu241, and a pyridyl nitrogen of 6-azaindole (PubchemFP499, PubchemFP547, Pub-chemFP569, and PubchemFP611) forms electrostatic interactions with the side chain NH 2 group of Lys188 in the positive area. Therefore, it is reasonable to understand the potency of 6-azaindole derivatives against DYRK1A with IC 50 values ranging from 0.0062 to 0.315. Another remarkable privileged substructure, 2-hydroxy or 2-alkoxyl benzothiazole (Pub-chemFP691, PubchemFP702 and PubchemFP703, or PubchemFP720 and PubchemFP783) was found in compounds 1 (IC 50 = 0.056 µM) and 24 (IC 50 = 0.0938 µM), in which the hydoxyl (alkoxyl) group formed two hydrogen bonds (single H-bond) with Leu241 in the hinge region. Additionally, an alternative binding mode was also found for 2-alkoxyl compounds that flipped and switched the binding interaction from the hinge region (Leu241) to the positive area (Lys188). By comparing the structures and activities of compounds 46 (IC 50 = 28.1 µM) and 47 (IC 50 = 0.8 µM), an alkoxyl at R2 generated a 35-fold increased inhibitory activity in contrast to an alkoxyl at the R3 position. Acetamide groups (Pub-ChemFP645 and PubChem646) were also identified as a pharmacophoric group since it produced polar interactions either with the hinge region or the positive area, depending on the chemical scaffolds substituted on. For example, compound 73 (IC 50 = 0.301 µM) with the acetamide amide made an additional hydrogen bond with backbone carbonyl oxygen atoms of Leu241 ( Figure 3A). However, the acetamide carbonyl substituted on pyridine of compound 23 formed polar interactions with the side chain NH 2 group of Lys188 or the backbone NH of Leu241 in the positive area alternatively ( Figure 3B). Furthermore, the additional pyridine linked to benzothiazole might be helpful to form the π-interaction with the phenyl of Phe238, which provides a structural basis for the improved potency of compound 23 (IC 50 = 0.15 µM) in contrast to compound 53 (IC 50 = 3.9 µM) with acetamide.
Fragment-based drug design including fragment-based growing and/or linking strategies and fragment hybridization strategies has been widely used in drug discovery by assembling fragments into novel molecules. Here, the privileged substructures abovementioned provide structural elements for the discovery of DYRK1A inhibitors. AlNajjar et al. [21] reported that 6-hydroxybenzothiazole urea derivatives b27 displayed the highest potency against DYRK1A with an IC 50 of 20 nM, which could be considered as an acetamide group-based linking compound by connecting 6-hydroxybenzothiazole and a phenyl acetamide together. Docking results indicated that hydrogen bonds were formed between b27 and key residues Leu241, Glu239, and Lys188 and/or Asp307 in the active site of DYRK1A including acetamide groups with the hinge region backbone of Leu241 and Glu239 as well as between the OH of 6-hydroxybenzothiazole and the conserved Lys188 and/or Asp307. Meanwhile, the above-mentioned substructures such as pyridine and hydroxyl-phenyl ring were also presented in 1H-pyrazolo [3,4-b]pyridine derivatives 8 h (IC 50 = 5 nM) [22]. Among them, the 1H-pyrazolo [3,4-b]pyridine scaffold were involved in the interactions with the Glu239 backbone CO and Leu241 backbone NH in the hinge region and the phenol ring formed electrostatic interaction with Lys188. Fragment-based drug design including fragment-based growing and/or linking strategies and fragment hybridization strategies has been widely used in drug discovery by assembling fragments into novel molecules. Here, the privileged substructures abovementioned provide structural elements for the discovery of DYRK1A inhibitors. AlNajjar et al. [21] reported that 6-hydroxybenzothiazole urea derivatives b27 displayed the highest potency against DYRK1A with an IC50 of 20 nM, which could be considered as an acetamide group-based linking compound by connecting 6-hydroxybenzothiazole and a phenyl acetamide together. Docking results indicated that hydrogen bonds were formed between b27 and key residues Leu241, Glu239, and Lys188 and/or Asp307 in the active site of DYRK1A including acetamide groups with the hinge region backbone of Leu241 and Glu239 as well as between the OH of 6-hydroxybenzothiazole and the conserved Lys188 and/or Asp307. Meanwhile, the above-mentioned substructures such as pyridine and hydroxyl-phenyl ring were also presented in 1H-pyrazolo [3,4-b]pyridine derivatives 8 h (IC50 = 5 nM) [22]. Among them, the 1H-pyrazolo [3,4-b]pyridine scaffold were involved in the interactions with the Glu239 backbone CO and Leu241 backbone NH in the hinge region and the phenol ring formed electrostatic interaction with Lys188.
Besides the validation of classification modes by the reported inhibitors, a hybrid virtual screening of natural products including pharmacophore hypothesis, classification models, and molecular docking was also performed to identity DYRK1A inhibitors with new scaffolds [undergoing work]. As shown in Figure 4, five compounds (CP1, CP4, CP11, CP12, and CP23) were screened as the top five theoretical hits, which will be further evaluated and tested by the kinase assay. Interestingly, five hits possessed the privileged fragments derived from classification models, which were predicted to produce polar interactions with the hinge region and/or positive area as the above-mentioned inhibitors form ( Figure 5). For instance, 2-benzyloxy of CP1 and CP12 may have polar interactions with a side chain NH2 group with a Lys188 and Leu241 backbone NH, respectively. It is possible for the acetamide group of CP4 to interact with Lys188. Imidazole (CP111) and isoxazole (CP23) are also expected to form hydrogen bonds with Leu241 and/or Ser242 of the hinge region. Besides the validation of classification modes by the reported inhibitors, a hybrid virtual screening of natural products including pharmacophore hypothesis, classification models, and molecular docking was also performed to identity DYRK1A inhibitors with new scaffolds [undergoing work]. As shown in Figure 4, five compounds (CP1, CP4, CP11, CP12, and CP23) were screened as the top five theoretical hits, which will be further evaluated and tested by the kinase assay. Interestingly, five hits possessed the privileged fragments derived from classification models, which were predicted to produce polar interactions with the hinge region and/or positive area as the above-mentioned inhibitors form ( Figure 5). For instance, 2-benzyloxy of CP1 and CP12 may have polar interactions with a side chain NH 2 group with a Lys188 and Leu241 backbone NH, respectively. It is possible for the acetamide group of CP4 to interact with Lys188. Imidazole (CP111) and isoxazole (CP23) are also expected to form hydrogen bonds with Leu241 and/or Ser242 of the hinge region.

Data Collection and Chemical Diversity
By considering the diversity of the chemical scaffold and the distribution of inhibitory activity, 117 DYRK1A heterocyclic inhibitors were used as the whole dataset [23][24][25][26][27][28][29], which were divided into a training set and a test set with a ratio of 3:1 (Table S1). In addition, an external validation set including 15 compounds (10 "P" and 5 "N", Table S2) was also used to further evaluate the robustness and reliability of classification models. The detailed information of whole dataset was listed in Table 4.
A heat map of the Euclidian distance metrics based on PubChem fingerprints was employed as an indicator of molecular similarity of compounds. In order to evaluate chemical space covered by the entire dataset, 825 2D molecular descriptor groups (e.g., constitutional indices, charge descriptors, ring descriptors, topological indices, connectivity indices, etc.) were calculated by DRAGON 7.0 [30]. In order to avoid the over-fitting possibility of these descriptors, a preliminary screening was applied to exclude the constant and nearly constant variance (>80% compounds sharing the same values with a descriptor) and descriptors with high inter-correlation (pair-wise correlations among all pairs of descriptors >95%). Then, 634 descriptors were further evaluated by principal component analysis to identify the featured molecular descriptors. The descriptors of Lipinski's rules of five [31] were also plotted into a radar chart to observe the chemical space distribution of the entire dataset.

Molecular Fingerprints and Machine Learning Methods
Molecular fingerprint belongs to a kind of chemical structure feature, which is widely used in similarity search, clustering, or recursive partition [32]. The two-dimensional or three-dimensional characteristics of molecules are compiled into binary values (0 for none, 1 for have) or counts, and the chemical structure is converted into a data format that can be understood by a computer. In this study, five molecular fingerprints namely Molecular

Data Collection and Chemical Diversity
By considering the diversity of the chemical scaffold and the distribution of inhibitory activity, 117 DYRK1A heterocyclic inhibitors were used as the whole dataset [23][24][25][26][27][28][29], which were divided into a training set and a test set with a ratio of 3:1 (Table S1). In addition, an external validation set including 15 compounds (10 "P" and 5 "N", Table S2) was also used to further evaluate the robustness and reliability of classification models. The detailed information of whole dataset was listed in Table 4.

Non-Potent Inhibitors (N) Total
Train set  69  19  88  Test set  20  9  29  Validation set  10  5  15  Total  99  33  132 A heat map of the Euclidian distance metrics based on PubChem fingerprints was employed as an indicator of molecular similarity of compounds. In order to evaluate chemical space covered by the entire dataset, 825 2D molecular descriptor groups (e.g., constitutional indices, charge descriptors, ring descriptors, topological indices, connectivity indices, etc.) were calculated by DRAGON 7.0 [30]. In order to avoid the over-fitting possibility of these descriptors, a preliminary screening was applied to exclude the constant and nearly constant variance (>80% compounds sharing the same values with a descriptor) and descriptors with high inter-correlation (pair-wise correlations among all pairs of descriptors >95%). Then, 634 descriptors were further evaluated by principal component analysis to identify the featured molecular descriptors. The descriptors of Lipinski's rules of five [31] were also plotted into a radar chart to observe the chemical space distribution of the entire dataset.

Molecular Fingerprints and Machine Learning Methods
Molecular fingerprint belongs to a kind of chemical structure feature, which is widely used in similarity search, clustering, or recursive partition [32]. The two-dimensional or three-dimensional characteristics of molecules are compiled into binary values (0 for none, 1 for have) or counts, and the chemical structure is converted into a data format that can be understood by a computer. In this study, five molecular fingerprints namely Molecular ACCess System (MACCS, 166 bits), Extended (Ext, 79 bits), Estate (Est, 1024 bits), Pub-Chem (881 bits), and Substructure (Sub, 307 bits) were calculated using PaDEL-Descriptors software [31].
In this study, the classification models were built using the support vector machine (SVM), logistic regression (LR), k-nearest neighbor (KNN), artificial neural network (ANN), naïve Bayes (NB), random forest (RF) (with a tree number of 20 and the maximum tree depth of 15), and decision tree (DT). An in-depth description of the application of these methods in drug discovery can be obtained from some excellent studies and research papers [33,34]. All of these calculations were integrated with Orange Canvas 3.11 software (freely available at https://orange.biolab.si/, accessed on 8 March 2018).
In order to explore the impact of balanced learning on the model's performance, SMOTETL was applied on the original training set to balance the number of potent and non-potent samples. The test set and external validation set were kept unbalanced. This hybrid technique combines oversampling and under sampling techniques. The Tomek link consists of two samples that are the nearest neighbor but do not belong to the same class. This under sampling technique eliminates the observations of the majority class. The SMOTE technique oversamples the original dataset and then detects and removes those samples that compose the Tomek link [35].

Model Performance Evaluation
In order to obtain a comprehensive evaluation of models, the five-fold cross validation method, a test set and an external test set were employed to evaluate the developed classification models based on statistical parameters including TP, TN, FP, FN, SE, SP, CA, and balanced accuracy (average of SE and SP) [36]. MCC was also explored to measure the correlation between the true class labels and the predicted labels.
In addition, the receiver operating characteristic (ROC) curve was also plotted based on the TP and FP rates. AUC ranging from 0.5 to 1.0 was also used to evaluate the accuracy performance of the classification models. If the AUC value is 1.0, it is considered as a perfect classifier. If the AUC is 0.5, it is considered as a classifier without discriminative ability [37].

Identification of Privileged Substructures
Information gain value (IG) and substructure frequency contribution were explored to identify privileged groups of DYRK1A inhibitors. If a fragment appears more frequently in the "P" class, this fragment is regarded as a privileged substructure of the DYRK1A inhibitor. The formula is defined as follows: Frequency of a substructure = N I f ragment × N total N f ragment−total × N I (6) where N I f ragment is the number of compounds in the "P" class containing a fragment; N total is the number of inhibitors in the entire dataset; N f ragment−total is the number of inhibitors in the entire dataset containing a certain fragment; and N I is the total number of "P" inhibitors in the whole dataset.

Molecular Docking
In order to predict the binding modes of theoretical hits with DYRK1A, molecular docking was performed by using AutoDock Vina v1.2.0 [38]. The active site of the receptor was defined as a three-dimensional grid of (50 × 50 × 50) points with a grid spacing of 0.375Å at the center of mass of the ligand (PDB ID: 3ANR) [39]. The Lamarckian genetic algorithm (LGA) was employed as the conformational search method to explore the binding modes between DYRK1A and the inhibitors.

Conclusions
In this study, classification studies of 117 DYRK1A inhibitors were explored using machine learning methods along with molecular fingerprints. Based on the performances of models evaluated by 5-fold cross validation and the test set, the PubChem fingerprint was involved in the best model for the training set and test set with an accuracy of 0.933 and 0.911 when combined with the SVM and ANN algorithm, respectively. Furthermore, pharmacophoric substructures related to their inhibitory activity were also identified using information gain and substructure frequency analysis. All these results provide the theoretical basis to understand key groups responsible for DYRK1A inhibition as well as the valuable hints for the discovery of novel DYRK1A inhibitors.
Recently, ML algorithms have widely been used in fragment-based drug discovery. The statistical ML models are able to develop the categorical or continuous correlations between molecular features and compound activity/property and make predictions for new chemical entities. In particular, the categorical correlation derived from classification models offers an attractive approach for exploring the pharmacophoric fragments to bind a target protein. Continuous correlations of QSAR models can be used to filter the fragment-based optimization compounds with the desired activities and properties in silicon. In light of the limited learning capability of ML algorithms, the developed models may be insufficient to generalize well across different structures and identify the exclusive functional groups with over-dependence on the training set. However, the developed models exhibited excellent prediction ability to the compounds that shared similar molecular information to the training set, and also used to identify the novel DYRK1A inhibitors from the natural product dataset. In the future, with the increase in storage capacity and the size of the dataset available, a subfield of ML called deep learning (DL)-based models will be applicable in drug discovery by means of not only learning from a dataset but also generating new data in a multidimensional way. Therefore, DL-derived models generalize well across compounds with more diverse chemical scaffolds and produce an entire prediction in more broad application domains.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/molecules27061753/s1, Table S1: Structure and inhibitory activity of DYRK1A inhibitors of the training and test set; Table S2: Structure and inhibitory activity of DYRK1A inhibitors of the external validation set; Table S3: Performance of 35 classification models for the training set and test set; Table S4 PubChem fingerprints of inhibitors (16) and non-inhibitors (10) responsible for DYR1KA modulation/inhibition.