Article

Modeling of New Agents with Potential Antidiabetic Activity Based on Machine Learning Algorithms

1 Department of Pharmacy, Bukovinian State Medical University, 58002 Chernivtsi, Ukraine
2 College of Pharmacy, University of Manitoba, Winnipeg, MB R3E 0T5, Canada
* Author to whom correspondence should be addressed.
AppliedChem 2025, 5(4), 30; https://doi.org/10.3390/appliedchem5040030
Submission received: 5 August 2025 / Revised: 23 September 2025 / Accepted: 16 October 2025 / Published: 27 October 2025

Abstract

Type 2 diabetes mellitus (T2DM) is a growing global health challenge, expected to affect over 600 million people by 2045. The discovery of new antidiabetic agents remains resource-intensive, motivating the use of machine learning (ML) for virtual screening based on molecular structure. In this study, we developed a predictive pipeline integrating two distinct descriptor types: high-dimensional numerical features from the Mordred library (>1800 2D/3D descriptors) and categorical ontological annotations from the ClassyFire and ChEBI systems. These encode hierarchical chemical classifications and functional group labels. The dataset included 45 active compounds and thousands of inactive molecules, depending on the descriptor system. To address class imbalance, we applied SMOTE and created balanced training and test sets while preserving independent validation sets. Thirteen ML models—including regression, SVM, naive Bayes, decision trees, ensemble methods, and others—were trained using stratified 12-fold cross-validation and evaluated on the training, test, and validation sets. Ridge Regression showed the best generalization (MCC = 0.814), with Gradient Boosting following (MCC = 0.570). Feature importance analysis highlighted the complementary nature of the descriptors: Ridge Regression emphasized ontological classes such as CHEMONTID:0000229 and CHEBI:35622, while Mordred-based models (e.g., Random Forest) prioritized structural and electronic features like MAXsssCH and ETA_dEpsilon_D. This study is the first to systematically integrate and compare structural and ontological descriptors for antidiabetic compound prediction. The framework offers a scalable and interpretable approach to virtual screening and can be extended to other therapeutic domains to accelerate early-stage drug discovery.

1. Introduction

Diabetes mellitus is a chronic metabolic disorder characterized by persistent hyperglycemia due to various pathophysiological mechanisms. It includes type 1 diabetes (T1DM), caused by autoimmune β-cell destruction, and type 2 diabetes (T2DM), linked to insulin resistance and deficiency, with T2DM accounting for over 90% of cases [1,2,3]. The global prevalence reached 463 million adults in 2019 and is projected to rise to 700 million by 2045, driven by aging populations, sedentary lifestyles, poor diets, and obesity [1,2]. Most cases occur in low- and middle-income countries [1].
Diabetes leads to acute metabolic complications and long-term microvascular and macrovascular damage [1,2,3], yet nearly half of cases remain undiagnosed [4]. Addressing this global burden requires better screening, prevention, and treatment strategies. Machine learning (ML) offers promising tools for early detection and risk stratification.
The investigation of chemical properties related to antidiabetic activity is gaining traction [5,6,7,8,9,10]. Predicting such activity using ML can accelerate drug discovery and reduce costs [11,12,13,14,15]. Computational methods like ligand-based virtual screening and QSAR modeling are increasingly used to identify effective and safer agents, such as novel DPP-4 inhibitors [16].
Recent work by Bustamam et al. introduced a QSAR framework combining Rotation Forest, Deep Neural Networks, and CatBoost for feature selection, applied to 1020 DPP-4 inhibitors selected using K-modes clustering and Levenshtein distance [16]. Their models achieved over 70% accuracy, with innovations like SPCA and fingerprint analysis improving performance and interpretability.
Kavakiotis et al. further reviewed ML applications in diabetes, covering diagnosis, complications, genetic interactions, and healthcare management, highlighting the dominance of supervised methods like SVMs and the role of clinical datasets [17].
Together, these studies illustrate how cheminformatics, ML, and domain knowledge are converging to improve the discovery of antidiabetic agents and optimize care in T2DM.

1.1. ML for DPP-4 Inhibitor Discovery

Dipeptidyl peptidase-4 (DPP-4) inhibition is a well-established therapeutic strategy for type 2 diabetes mellitus (T2DM). In recent years, machine learning (ML) techniques have been widely used to identify novel or repurposed compounds with DPP-4 inhibitory activity. Several notable studies illustrate the state of the art:
Hermansyah et al. [18] developed classification models (XGBoost 2.1.4, Random Forest, SVM) trained on 5098 ChEMBL DPP-4 assay data points. The best model, XGBoost, achieved 81.64% accuracy and was applied to screen 2096 FDA-approved drugs. Filtering with DUD-E decoys and Lipinski’s Rule of Five yielded 29 candidate compounds.
Hermansyah et al. [19] expanded this work using a QSAR-based workflow with five regression and four classification models. Support Vector Regression (SVR) achieved R² = 0.78, while Random Forest classification reached 92.2% accuracy. Screening over 10 million molecules identified CH0002 as a selective DPP-4 inhibitor.
De La Torre et al. [20] curated a high-quality dataset to build predictive models, with Model M36 achieving Q²CV = 0.813 and Q²EXT = 0.803. Screening of DrugBank and DiaNat libraries produced hits including DB07272, Bergenin, and Skimmin, validated by molecular docking and molecular dynamics (MD) simulations.
Devaraji and Sivaraman [21], although targeting a different enzyme (α-amylase), demonstrated a high-performing ML workflow for inhibitor design, achieving an average model score of 0.8216, Pearson r = 0.827, and Q²EXT = 0.835, with less than 1.3% deviation between training and external validation—highlighting the importance of benchmarking internal and external performance for assessing generalization ability.

1.2. Ensemble Learning and Optimization

Several studies have demonstrated the utility of regression-based modeling approaches. For instance, multiple linear regression (MLR), combined with feature selection and genetic algorithms (GA), has been employed to model inhibition data for potent tyrosinase inhibitors in the treatment of hyperpigmentation [22]. Similarly, multiple logistic regression (MLogR) has been applied to construct mathematical models describing the inhibitory activity of compounds against cyclooxygenase-2 (COX-2) [23]. To enhance prediction accuracy and mitigate overfitting in modeling anticancer drug responses, various regularized linear regression techniques have been integrated into ensemble learning frameworks, including a low-rank matrix completion model combined with ridge regression [24].
In the domain of cancer research, algorithms such as correlation adaptive LASSO (CorrLASSO), TG-LASSO [25,26], and Elastic-Net [27,28] have been utilized. Comparative analyses of standard machine learning methods—including decision trees (XGBoost, LightGBM) and neural network architectures (MLP, CNN)—have confirmed their efficacy in drug screening applications [29]. The reliability of identifying significant molecular fragments using Random Forest, XGBoost, and LightGBM has also been demonstrated [30]. Additionally, classification of AXL kinase inhibitors for cancer therapy has been performed using the XGBoost algorithm coupled with Bayesian optimization [31].
Support Vector Machines (SVMs) are consistently highlighted as effective tools for compound classification, property prediction, and virtual screening [32,33,34]. Further studies have employed Naive Bayes classifiers (NB) [35] and the k-Nearest Neighbors (k-NN) method [31,35,36,37].
The integration of these machine learning approaches with large-scale datasets enhances the accuracy of biological activity prediction, rendering them highly promising for drug development. Notably, the combination of multiple algorithms and feature selection strategies has yielded the most effective outcomes, expediting drug discovery and reducing associated experimental burdens.
Collectively, these methodological advancements hold considerable promise for improving the efficiency and cost-effectiveness of drug development, particularly for widespread diseases such as diabetes. The comprehensive application of machine learning techniques in pharmaceutical research is therefore of both theoretical and practical importance.

2. Materials and Methods

The dataset was constructed using DrugBank [38] and ClassyFire [39]. A custom Python 3.11 parser employing the BeautifulSoup [40] and requests [41] libraries was developed to extract chemical structures (SMILES) and Anatomical Therapeutic Chemical (ATC) classification data from DrugBank. The extracted records were stored in an SQLite database.
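As a minimal illustration of the storage and labeling step, the sketch below inserts parsed (DrugBank ID, SMILES, ATC code) records into SQLite and derives the ±1 class labels from ATC group A10; the two records and the one-table schema are illustrative stand-ins, not the paper's actual parser output.

```python
import sqlite3

# Illustrative records standing in for entries extracted from DrugBank
# with the BeautifulSoup/requests parser described in the text.
records = [
    ("DB00331", "CN(C)C(=N)N=C(N)N", "A10BA02"),        # metformin
    ("DB00945", "CC(=O)OC1=CC=CC=C1C(O)=O", "B01AC06"),  # aspirin
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compounds (drugbank_id TEXT PRIMARY KEY, smiles TEXT, atc TEXT)"
)
conn.executemany("INSERT INTO compounds VALUES (?, ?, ?)", records)

# ATC group A10 (antidiabetic agents) -> label 1, all other groups -> -1
labels = {
    dbid: (1 if atc.startswith("A10") else -1)
    for dbid, atc in conn.execute("SELECT drugbank_id, atc FROM compounds")
}
print(labels)
```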
In total, 17,287 DrugBank entries were retrieved, of which 3403 compounds had ATC annotations. From these, 46 substances labeled under ATC code A10 (antidiabetic agents) were selected as the positive class (label = 1, Table 1), while 2883 compounds from other ATC categories with valid annotations formed the negative class (label = −1).
To generate structural descriptors, the ClassyFire tool was employed. It assigned hierarchical taxonomic classes to each compound based on molecular structure, linking DrugBank IDs with ChemOnt and ChEBI classifications [39]. This yielded categorical descriptors reflecting structural and chemical properties.
In parallel, numerical descriptors were computed using the Mordred library [42], generating over 1800 features per compound, including topological, physicochemical, and geometric parameters. Three-dimensional structures were prepared using Open Babel for SMILES conversion and optimized via the PM6 method in Gaussian 16 [43].
To address class imbalance, SMOTE [44] was applied to the training and test sets, excluding the validation set. For Mordred descriptors, the dataset (45 active, 2461 inactive; active class defined in Table 1) was partitioned into a validation set (14 active, 738 inactive), a test set (9 active, 518 inactive), and a SMOTE-balanced training set (1205 active/inactive). For ClassyFire descriptors (46 active, 2883 inactive), the splits were a validation set (14 active, 865 inactive), a test set (9 active, 606 inactive), and a SMOTE-balanced training set (1412 active/inactive).
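The interpolation at the heart of SMOTE can be sketched in NumPy alone (the study used a library implementation [44]; the neighbor count k, the 10-dimensional toy features, and the padding target below are illustrative).

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic point is a random
    interpolation between a minority sample and one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)
        # distances from sample a to every other minority sample
        d = np.linalg.norm(X_min - X_min[a], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # skip the sample itself
        b = rng.choice(nbrs)
        lam = rng.random()              # interpolation factor in [0, 1)
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

# 45 active compounds padded up to the 1205-per-class training target
X_active = np.random.rand(45, 10)
synthetic = smote_sample(X_active, n_new=1205 - 45, k=5, rng=0)
print(synthetic.shape)  # (1160, 10)
```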
Thirteen classification algorithms from scikit-learn [45,46] were employed, covering a range of approaches: linear models (Logistic Regression, Ridge Regression, Lasso Regression, Elastic Net), probabilistic models (Bernoulli Naive Bayes, Multinomial Naive Bayes), tree-based methods (Decision Tree, Random Forest, Gradient Boosting), and other classifiers (k-Nearest Neighbors, Linear Discriminant Analysis, Linear SVM, SGDClassifier). All models were trained using 12-fold stratified cross-validation and evaluated by Accuracy, Precision, Recall, F1 Score, Balanced Accuracy, ROC AUC, and Matthews Correlation Coefficient (MCC), ensuring robustness for imbalanced classification tasks. The models were implemented with the following parameter settings: Logistic Regression with solver = ‘liblinear’; BernoulliNB with default settings; Multinomial Naive Bayes wrapped in a MinMaxScaler(); Linear SVM, k-NN, Random Forest, LDA, and SGD Classifier with default configurations; Lasso Regression using penalty = ‘l1’ and solver = ‘saga’; Ridge Regression with alpha = 1.0 and solver = ‘auto’; Elastic Net with penalty = ‘elasticnet’, l1_ratio = 0.5, and solver = ‘saga’; Decision Tree using criterion = ‘gini’ and max_depth = None; and Gradient Boosting with n_estimators = 100 and learning_rate = 0.1. At this stage of the study, hyperparameter tuning was not conducted; default or representative configurations were applied to allow for consistent baseline comparison across all models.
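The cross-validation protocol can be sketched with scikit-learn as follows. The synthetic balanced dataset and the use of RidgeClassifier as a stand-in for the paper's "Ridge Regression" model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic balanced stand-in for the SMOTE-balanced descriptor matrix
X, y = make_classification(n_samples=600, n_features=50, random_state=0)

# 12-fold stratified CV, scored by Matthews Correlation Coefficient
cv = StratifiedKFold(n_splits=12, shuffle=True, random_state=0)
scores = cross_val_score(RidgeClassifier(alpha=1.0), X, y, cv=cv,
                         scoring=make_scorer(matthews_corrcoef))
print(f"MCC = {scores.mean():.3f} ± {scores.std():.3f}")
```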
To compare group differences in the distribution of molecular descriptors between compound classes, Pearson’s chi-square (χ²) test was applied. For each descriptor, a 2 × 2 contingency table was constructed to compare the frequency of presence/absence of the descriptor (non-zero values) between the two study groups (y = 1 and y = −1). Differences were considered statistically significant when the corrected p-value (q-value) was less than 0.05.
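The per-descriptor test can be sketched with SciPy; the contingency counts below are illustrative, not the paper's actual tallies.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table for one descriptor:
# rows = descriptor present / absent, columns = class (y = 1 / y = -1).
# Counts are illustrative only.
table = np.array([[28, 720],     # descriptor present
                  [18, 2163]])   # descriptor absent
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```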
All data used in this study are publicly available and can be accessed, reproduced, and reused from the GitHub repository https://github.com/pruhlo/appliedchem-3624702 (accessed on 9 July 2025). The repository contains raw and processed data for both antidiabetic and non-antidiabetic compounds, including molecular structures in SMILES and 3D formats, descriptor matrices in CSV and Pickle formats (Mordred and ClassyFire), annotated ontological classifications, trained machine learning models (in .pkl and .pt formats), and evaluation metrics such as feature importance and confusion matrices. Additionally, executable Jupyter notebooks are provided to reproduce descriptor generation, 3D structure conversion, classification modeling, and SMILES generation using RNNs. Visualization files and system requirements are also included, ensuring full reproducibility of all computational steps reported in this work.

3. Results and Discussion

Upon analyzing the frequency of descriptors for compounds with antidiabetic activity, it was found that most of these compounds contained descriptors such as CHEBI:24431, CHEBI:36963, CHEBI:51143, and CHEBI:35352. Moreover, 40 out of 46 compounds contained CHEBI:33836 (benzenoid aromatic compound), and 27 and 25 compounds, respectively, were classified as organic heterocyclic compounds and organosulfur compounds (CHEBI:24532, CHEBI:33261). It should be noted that the entire input dataset consisted exclusively of approved pharmaceutical agents, each of which is assigned to a specific class within the Anatomical Therapeutic Chemical (ATC) classification system.
Following the principles of phenotypic classification (ATC), the study’s results highlight several key advantages of this approach. Specifically, the PDATC-NCPMKL [47] model demonstrated its suitability for broad compound screening by enabling the prediction of ATC codes without requiring prior knowledge of a drug’s specific mechanism of action. This is particularly valuable in scenarios where multiple or unknown pathways mediate pharmacological effects. Such capability is consistent with the core concept of phenotypic models, which focus on observable effects rather than target-specific interactions [47].
Moreover, by predicting ATC code associations, the model facilitates the identification of novel therapeutic applications for existing drugs. This drug repositioning capability operates independently of predefined molecular targets, making it particularly effective for uncovering previously unrecognized drug effects and broadening the therapeutic landscape [47].
Despite its strong performance, the phenotypic classification approach as implemented in the PDATC-NCPMKL model exhibits inherent limitations. One notable constraint is the lack of target-specific biological interpretability. Since ATC codes represent phenotypic-level drug classifications rather than molecular mechanisms, the associations predicted by the model do not elucidate the specific biochemical targets or pathways involved. Consequently, while the model excels at identifying potential therapeutic applications, it provides limited guidance for rational drug optimization or mechanistic validation. Furthermore, although multiple drug and ATC kernels were integrated to enhance prediction, the model’s reliance on predefined similarity measures and known associations may restrict its applicability to novel or undercharacterized compounds with sparse annotation data.
To further characterize the ontological features associated with antidiabetic activity, the presence frequencies of categorical descriptors derived from the ChEBI and ChemOnt ontologies were compared between active and inactive compounds. Several descriptors exhibited markedly higher prevalence in antidiabetic agents, indicating potential structural or functional relevance.
The strongest differentiators included CHEBI:76983 and its corresponding ChemOnt class CHEMONTID:0000490, both present in 36.96% of active compounds but nearly absent in inactives (0.07%, Δ = 0.369). This was followed by CHEBI:33261 (organosulfur compounds, Δ = 0.367), and several ChemOnt categories such as CHEMONTID:0000270, 0000031, and 0004233, all with presence differences exceeding 0.34. These classes correspond to biologically relevant compound types such as heterocyclic organosulfur derivatives, which may underlie mechanisms of glucose regulation or enzyme modulation.
Among compound-level entities, CHEBI:22712 (organic acid) was found in over 80% of antidiabetic agents compared to 48% of non-antidiabetics (Δ = 0.325), while CHEBI:35358, CHEBI:33552, and CHEBI:35850 also exhibited over threefold enrichment in the active class. Notably, CHEBI:33836 (benzenoid aromatic compound) remained highly prevalent across both classes but still showed a significant difference (84.8% vs. 63.6%, Δ = 0.211), suggesting its relevance in maintaining core aromatic scaffolds commonly found in antidiabetic drugs.
ChemOnt superclasses, including CHEMONTID:0000364 (aromatic heteropolycyclic compounds) and CHEMONTID:0001831 (organonitrogen compounds), also showed moderate discriminative power (Δ > 0.22), further reinforcing the role of nitrogen-containing heterocycles. Similarly, descriptors such as CHEMONTID:0001925, 0003964, and 0000284 indicated notable enrichment among actives, capturing less frequent but structurally distinct motifs.
In general, this analysis highlights that antidiabetic agents share common ontological features related to aromaticity, heteroatom composition, and sulfur/nitrogen functionality, as defined by structured chemical taxonomies. These findings provide additional evidence that ontological classification can complement numerical descriptors in identifying pharmacologically relevant molecular patterns.
Additionally, frequency-based analysis of molecular features calculated using the Mordred descriptor library revealed a set of structural and physicochemical characteristics enriched in antidiabetic compounds. Among the top differentiating descriptors were nS (sulfur atom count in the molecule, frequency: 53.3% vs. 21.3%, Δ = 0.32), GGI10 and JGI10 (topological charge and information indices of order 10, both 86.7% vs. 56.7%, Δ = 0.30), and NssNH/SssNH (frequency: 62.2% vs. 35.8%, Δ = 0.26), reflecting the presence of nitrogen and sulfur atoms in specific hybridization states. The descriptor n6ARing (aromatic 6-membered ring count) also showed notable enrichment (64.4% vs. 38.8%, Δ = 0.26), highlighting the role of aromatic scaffolds in antidiabetic agents.
Several hydrogen bond-related and electronic surface area descriptors were also more prevalent in active compounds, including nHBDon (hydrogen bond donor count, 97.8% vs. 74.0%, Δ = 0.24), SlogP_VSA10 and PEOE_VSA13, indicating differentiated lipophilic and electronic surface profiles. Additionally, C2SP3 (number of secondary sp3-hybridized carbon atoms), NsssCH, SsssCH, and ring descriptors such as nARing and n6aRing further contributed to discriminating the active class.
The descriptor ETA_dEpsilon_D, related to electronic topology and hydrogen bonding potential, was present in all active compounds (100%) versus 77.3% of inactives, further suggesting differences in molecular interaction capacity. Taken together, the most discriminative Mordred features point to a consistent pattern among antidiabetic drugs, characterized by the presence of aromatic rings, heteroatoms (especially sulfur and nitrogen), specific surface properties, and defined topological indices. These findings support the hypothesis that certain substructural and electronic features are characteristic of bioactive antidiabetic molecules and may be useful in predictive modeling.

3.1. Development, Evaluation, and Application of Machine Learning Models Using Categorical Ontological Descriptors

The following algorithms were used for model training: logistic regression, naive Bayes classifiers (Multinomial and Bernoulli models), support vector machine (SVM) for linearly separable classes, k-nearest neighbors (k-NN), random forest, decision tree, linear discriminant analysis (LDA), stochastic gradient descent (SGD) classifier, Lasso regression, Ridge regression, Elastic Net, and gradient boosting.
A detailed analysis of the model training results based on categorical ontological annotations from ClassyFire and ChEBI systems (Supplementary Materials, Table S2) revealed exceptionally high performance metrics for most models. The best overall performance was achieved by Random Forest with accuracy (0.9979 ± 0.0027) and ROC AUC (1.0000 ± 0.0000), followed by Linear SVM and SGD Classifier with similarly high metrics.
The most stable models (lowest standard deviations) were Random Forest with accuracy 0.9975 ± 0.0027, Gradient Boosting with accuracy 0.9968 ± 0.0035, and Linear SVM with accuracy 0.9989 ± 0.0025. These models demonstrated not only high performance but also high reproducibility of results.
Remarkably, several models achieved perfect Recall (1.0000 ± 0.0000), including Logistic Regression, Linear SVM, k-NN, LDA, Lasso Regression, Ridge Regression, and Elastic Net. This indicates a complete absence of false negatives (FN) for these models on cross-validation data.
Random Forest demonstrated perfect precision (1.0000 ± 0.0000) during cross-validation, meaning a complete absence of false positives. On test data, this model also showed excellent results with precision 1.0000 and F1 Score 0.8000.
The lowest performance among all models was shown by k-NN with accuracy 0.9487 ± 0.0101 and the lowest precision (0.9072 ± 0.0168). Nevertheless, even this model demonstrated high performance compared to results obtained with other types of descriptors.
Analysis of the confusion matrix (Table 1) on test data confirms the conclusions about high model performance. Gradient Boosting showed the lowest number of false positives (FP = 2), while Random Forest achieved an ideal result with no false positives (FP = 0).
Most models produced minimal false negatives (FN = 2).
When tested on the validation dataset that did not participate in training and data balancing (Supplementary Materials, Table S3), the results differed substantially, indicating possible overfitting or high model specificity to training data. Ridge Regression showed the best results on validation data with accuracy 0.997, precision 1.000, recall 0.786, and F1 Score 0.88.
Linear SVM and SGD Classifier, despite excellent results on test data, showed extremely low performance on the validation set with precision and recall of 0.000.
Random Forest maintained high precision (1.000) on validation data but showed low recall (0.214), resulting in an F1 Score of 0.353. Nevertheless, this model demonstrated a good balance between precision and generalization ability.
Multinomial Naive Bayes and BernoulliNB showed the most stable performance between test and validation data, although their overall metrics were lower than those of the top-performing models.
Ridge Regression analysis revealed that the most important features are predominantly ChEBI ontological terms, with CHEBI:35622 showing the highest importance (0.587717). This was followed by CHEMONTID:0000229 (0.556824) and CHEMONTID:0002286 (0.483765). The top-ranking features demonstrate a clear hierarchy of discriminative power, with importance values ranging from 0.587717 to 0.181740 for the most significant descriptors (Table 2).
Notably, ChEBI terms dominated the top features, accounting for seven of the most important descriptors. This suggests that ChEBI’s chemical entity classifications provide particularly strong discriminative power for the classification task. The presence of both CHEBI and CHEMONTID features in the top rankings indicates that both classification systems contribute complementary information for molecular discrimination.
The feature importance distribution shows a gradual decline rather than sharp cutoffs, with several features showing moderate importance (0.2–0.4 range), suggesting that the model relies on a combination of multiple ontological annotations rather than a few dominant features.
Random Forest demonstrated a more distributed feature importance pattern compared to Ridge Regression, with the highest-ranking feature CHEMONTID:0000270 showing an importance of 0.030681 (Table 3). This more uniform distribution is characteristic of Random Forest algorithms, which tend to distribute importance across multiple features rather than heavily weighting a few dominant ones.
The top features in Random Forest included a more balanced representation of both ChEBI and CHEMONTID annotations, with CHEBI:35358 (0.028784), CHEBI:22712 (0.023023), and CHEMONTID:0004233 (0.021056) among the most important. Interestingly, several features that were highly ranked in Ridge Regression also appeared prominently in Random Forest, including CHEMONTID:0002286, CHEBI:76983, and CHEBI:50492.
The relatively low maximum importance value (0.030681) compared to Ridge Regression reflects Random Forest’s ensemble nature, where importance is distributed across many weak learners, and multiple features contribute to the final decision.

3.2. Development, Evaluation, and Application of Machine Learning Models Using Mordred Descriptors

In this section, the predictive performance of various machine learning classifiers trained on Mordred molecular descriptors is assessed using cross-validation, test, and independent validation datasets (Table 4, Table 5 and Table 6). Feature importance analyses for Random Forest and Gradient Boosting models are also included (Table 7, Table 8 and Table 9).
Across CV, ensemble methods achieved near-perfect scores: Random Forest, Gradient Boosting, and Decision Tree attained ≥0.989 accuracy and F1, with Gradient Boosting yielding the highest MCC (0.998 ± 0.004). However, their generalization differed on the test set: Gradient Boosting preserved high accuracy (0.989) but its recall dropped to 0.333, while Random Forest, despite perfect test precision, recalled only 11% of positives, indicating overfitting to the majority class. Classical linear baselines (Logistic Regression, Linear SVM) maintained perfect recall but suffered from severe precision deficits, reflecting the dataset’s imbalance. Regularized linear models (Ridge) and k-NN provided a compromise, achieving balanced-accuracy scores of 0.759 and 0.588, respectively, with moderate MCC values.
Naive Bayes variants delivered consistent CV performance (F1 ≈ 0.90) yet under-performed on the test set, whereas sparsity-oriented models (Lasso, Elastic Net) failed to learn discriminative features (CV F1 = 0). Overall, Gradient Boosting offered the most favorable bias–variance trade-off, combining high test accuracy with the largest test MCC (0.574) among models that did not sacrifice recall completely.
Logistic Regression and Linear SVM predict every positive correctly (TP = 9, FN = 0) but at the cost of 487 false alarms, yielding a specificity of only 6% and precision of 1.8%. Such behavior is common when the decision threshold is left at 0.5 in a severely imbalanced setting.
Random Forest and Gradient Boosting output essentially no false positives (FP = 0) yet miss 88–89% of the true positives, producing perfect or near-perfect precision but poor recall. Their high overall accuracy (>0.98) therefore reflects correct classification of the majority class rather than balanced performance.
Ridge Regression, k-NN, and LDA strike intermediate positions. Ridge, for instance, identifies five of nine positives while limiting FP to 20, giving the highest MCC (0.315) among models that do not collapse recall. LDA admits more false positives (64) but captures six true positives, leading to the best test balanced accuracy (0.772).
Lasso and Elastic Net record only one true positive apiece and two false positives, mirroring their near-zero F1 scores in cross-validation and indicating that the applied regularization parameters were overly aggressive for the available signal. Detailed results are presented in Table 5.
These observations emphasize that, under extreme class imbalance, relying on a single scalar metric (e.g., accuracy) obscures critical type-I/II error trade-offs.
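For example, the Logistic Regression test-set counts quoted above (TP = 9, FP = 487, FN = 0, TN = 31) give perfect recall yet an MCC close to zero, which a lone accuracy or recall figure would not reveal:

```python
from math import sqrt

def confusion_metrics(tp, fp, fn, tn):
    """Recall, specificity, balanced accuracy, and MCC from a 2x2 confusion matrix."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_acc = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return recall, specificity, balanced_acc, mcc

recall, spec, bal_acc, mcc = confusion_metrics(tp=9, fp=487, fn=0, tn=31)
print(f"recall={recall:.3f} specificity={spec:.3f} "
      f"balanced_acc={bal_acc:.3f} MCC={mcc:.3f}")
# perfect recall, but MCC ≈ 0.033: barely better than chance
```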
The validation set (Supplementary Materials, Table S4) contains 14 positives and 738 negatives (prevalence ≈ 1.9%). Logistic Regression achieves the highest recall (92.9%) but triggers 705 false alarms, reducing precision to 1.8% and total accuracy to 6.1%. Linear SVM, Ridge, and SGD predict only the majority class (TP = 0), yielding superficially high accuracy (≥0.981) and ROC-AUC (up to 0.914) while providing zero utility for the minority class (F1 = 0, MCC = 0). Among probabilistic models, Bernoulli NB and k-NN trade limited precision (<7.1%) for moderate recall (64.3%), reaching the best balanced accuracy for non-ensemble methods (0.742).
Tree ensembles dominate overall correlation with ground truth: Gradient Boosting attains the largest Matthews correlation coefficient (MCC = 0.651) by combining perfect precision, zero false positives, and a recall of 42.9%, while Random Forest delivers a similar balanced-accuracy (0.714) at the expense of one false positive (precision = 85.7%). Decision Tree achieves a more even trade-off (precision = 44.4%, recall = 28.6%), but its MCC (0.347) trails the boosted ensemble.
The top molecular descriptors driving model decisions for Random Forest (Table 6) and Gradient Boosting (Table 7) were distinct:
Table 6 lists the seven most influential descriptors in the Random-Forest (RF) model. The RF importance profile is comparatively flat: the leading atom-type electrotopological descriptor ATS4s accounts for only 1.8% of the total split-gain, followed closely by ring-substitution feature MAXsssCH (1.4%) and the charge-weighted surface area WNSA2 (1.3%). Three additional surface/shape indices—VSA_EState5, SpMAD_Dzare, and SpMAD_Dzpe—each contribute ≈ 1.2%, suggesting that the forest leverages a broad set of weak signals rather than a few dominant cues.
Table 7 shows the corresponding ranking for the Gradient-Boosting (GB) model. In contrast to RF, GB is heavily driven by a single descriptor: the electron-state partitioned van der Waals surface EState_VSA10 explains 26.9% of the model’s cumulative gain. The next descriptor, VSA_EState5, contributes 15.5%, after which importance values drop sharply (<5%). Notably, two descriptors—VSA_EState5 and ATS4s—appear in the top list of both ensembles, indicating consistent predictive value across learning paradigms. The steep importance decay in GB implies that model interpretability can be focused on a compact subset of surface-charge and autocorrelation features without substantial loss of explanatory power.

3.3. Generation of New Molecules Based on SMILES Representation

Molecular structures were represented using the Simplified Molecular Input Line Entry System (SMILES), a linear string format that encodes molecular graphs as character sequences. This representation enables the formulation of molecular generation as a character-level language modeling task, where the objective is to predict the next character in a SMILES sequence given the preceding context.
A dataset of 46 antidiabetic SMILES sequences (Supplementary Materials, Table S1) was used. To generate syntactically distinct yet structurally equivalent representations, a stochastic non-canonical SMILES generation algorithm was applied. Each original SMILES string was first converted into a molecular graph using the Chem.MolFromSmiles() function from the RDKit library. Subsequently, multiple non-canonical SMILES strings were generated via the MolToSmiles() function with parameters canonical = False and doRandom = True, disabling canonicalization and enabling random graph traversal.
This procedure was repeated for 10,000 iterations per molecule. Duplicate sequences were removed using a set data structure, resulting in a dataset of 227,924 unique non-canonical SMILES sequences. This dataset is suitable for training machine learning models, augmenting chemical databases, and analyzing molecular representation equivalence.
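The augmentation procedure described above can be reproduced with a few RDKit calls. The snippet below is a minimal sketch: metformin's SMILES is used purely as an example input, and the iteration count is reduced for brevity.

```python
from rdkit import Chem

def randomized_smiles(smiles, n_iter=10000):
    """Enumerate unique non-canonical SMILES for one molecule by random
    graph traversal; a set automatically removes duplicate sequences."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    variants = set()
    for _ in range(n_iter):
        # canonical=False and doRandom=True disable canonicalization
        # and randomize the atom traversal order
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
    return variants

# example molecule: metformin
variants = randomized_smiles("CN(C)C(=N)NC(=N)N", n_iter=200)

# every syntactic variant decodes back to the same canonical structure
canon = Chem.MolToSmiles(Chem.MolFromSmiles("CN(C)C(=N)NC(=N)N"))
assert all(Chem.MolToSmiles(Chem.MolFromSmiles(s)) == canon for s in variants)
```

The final assertion illustrates why the variants are "syntactically distinct yet structurally equivalent": each string parses to the same molecular graph.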
Following standard procedures [48], a vocabulary of all unique characters in the dataset was constructed. Character-to-index and index-to-character mappings were defined, and SMILES strings were converted into integer sequences and subsequently one-hot encoded. A batching function divided the data into fixed-length mini-batches, with target sequences offset by one character for next-character prediction training.
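These preprocessing steps can be sketched as follows; a toy two-molecule corpus is assumed here, whereas the real vocabulary is built over all 227,924 sequences.

```python
import numpy as np

smiles_data = "CC(=O)Oc1ccccc1C(=O)O\nCN(C)C(=N)NC(=N)N\n"  # toy corpus (assumed)

# vocabulary of unique characters and the two index mappings
chars = sorted(set(smiles_data))
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for c, i in char2idx.items()}

# integer-encode the corpus, then one-hot encode it
encoded = np.array([char2idx[c] for c in smiles_data])
one_hot = np.eye(len(chars))[encoded]

def get_batches(arr, batch_size, seq_len):
    """Yield (input, target) mini-batches; targets are inputs shifted by one
    character, as required for next-character prediction."""
    n_chars = batch_size * seq_len
    n_batches = len(arr) // n_chars
    arr = arr[: n_batches * n_chars].reshape(batch_size, -1)
    for i in range(0, arr.shape[1], seq_len):
        x = arr[:, i : i + seq_len]
        y = np.roll(arr, -1, axis=1)[:, i : i + seq_len]
        yield x, y

x, y = next(get_batches(encoded, batch_size=2, seq_len=8))
assert x.shape == (2, 8) and (x[:, 1:] == y[:, :-1]).all()
```

The closing assertion checks the key property of the batching function: each target column equals the following input column.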
A multi-layer Long Short-Term Memory (LSTM) network was adopted, consistent with the reference implementation [48]. The architecture included dropout for regularization and a fully connected output layer to project hidden states to the character space. This configuration is well-suited for modeling long-range dependencies in SMILES syntax, such as ring closures.
Training was conducted using the Adam optimizer with a cross-entropy loss function. The training process followed the original protocol, including hidden state initialization, one-hot input encoding, and periodic validation to monitor convergence. GPU acceleration was utilized when available.
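A minimal PyTorch sketch of this architecture and a single optimization step is shown below. The hidden size, layer count, and dropout rate are assumptions for illustration, not the exact hyperparameters of the reference implementation [48].

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Multi-layer LSTM over one-hot SMILES characters with dropout
    and a fully connected projection back to the character space."""
    def __init__(self, n_chars, n_hidden=256, n_layers=2, drop_prob=0.5):
        super().__init__()
        self.lstm = nn.LSTM(n_chars, n_hidden, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(n_hidden, n_chars)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(x, hidden)
        out = self.dropout(out)
        return self.fc(out), hidden

n_chars = 40                      # vocabulary size (assumed)
model = CharRNN(n_chars)
x = torch.zeros(4, 16, n_chars)   # batch of 4 one-hot sequences of length 16
logits, hidden = model(x)
assert logits.shape == (4, 16, n_chars)

# one training step: Adam optimizer + cross-entropy on next-character targets
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
targets = torch.zeros(4, 16, dtype=torch.long)  # dummy targets for the sketch
loss = criterion(logits.reshape(-1, n_chars), targets.reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

On a CUDA-capable machine the model and tensors would additionally be moved to the GPU with `.to("cuda")`, matching the acceleration mentioned above.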
Molecule generation was performed using an autoregressive sampling approach. Beginning with a user-defined prime string, the model iteratively predicted and appended characters. Top-k sampling was used to balance structural diversity and chemical plausibility in the generated sequences.
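Top-k sampling itself is straightforward to implement. The sketch below (plain NumPy, function name hypothetical) restricts sampling to the k highest-scoring characters and renormalizes their probabilities:

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Sample a character index from the k highest-scoring entries only,
    after a numerically stable softmax over those k logits."""
    rng = rng or np.random.default_rng()
    top = np.argsort(logits)[-k:]                 # indices of the k largest logits
    p = np.exp(logits[top] - logits[top].max())   # stable softmax over the top k
    p /= p.sum()
    return int(rng.choice(top, p=p))

logits = np.array([0.1, 2.5, 0.3, 1.9, -1.0])
idx = top_k_sample(logits, k=3)
assert idx in (1, 2, 3)  # only the three highest-scoring characters are eligible
```

In the autoregressive loop, the sampled character is appended to the sequence and fed back as the next input; smaller k yields more conservative, higher-probability strings, while larger k increases structural diversity.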
The validity of the generated SMILES was confirmed using cheminformatics tools, ensuring both syntactic and chemical correctness. Consistent with the findings of the original study, the model produced structurally valid molecules that reflected the chemical characteristics of the training dataset.
Molecular structure generation was implemented using the approach described in the reference article “Generating Molecules Using a Char-RNN in PyTorch” [48], which details a recurrent neural network (Char-RNN) method for SMILES-based molecule generation.
The neural network was trained on the augmented dataset derived from the 46 source molecules, resulting in the generation of 1179 new SMILES sequences (Figure 1).

3.4. Selection of Generated Molecules Using the Developed Prediction Models

After generating the SMILES sequences, a dataset was prepared for predicting the likelihood of antidiabetic activity in the generated molecules, following the same descriptor-generation procedure used earlier with Mordred and ClassyFire.
The predicted probabilities of antidiabetic activity for the generated compounds were then analyzed across the different machine-learning models. To select the top 10 most promising compounds, three criteria were applied: a high average predicted probability across all models, consistency of high scores between models, and the highest scores in the largest number of models (Table 8).
As a result, the most promising compounds in terms of predicted antidiabetic activity were
  • No. 967—Gradient Boosting (0.99998), Random Forest (0.95), Decision Tree (1.0), LDA (0.999944)
  • No. 540—Gradient Boosting (0.99998), Random Forest (0.93), Decision Tree (1.0), LDA (1.0)
Slightly lower predicted probabilities of antidiabetic activity were observed for the following SMILES sequences (see Table 8):
  • No. 447—Gradient Boosting (0.99998), Random Forest (0.92), Decision Tree (1.0), LDA (0.999743)
  • No. 628—Gradient Boosting (0.99998), Random Forest (0.92), Decision Tree (1.0), LDA (0.999743)
  • No. 1163—Gradient Boosting (0.99998), Random Forest (0.92), Decision Tree (1.0), LDA (0.999743)
  • No. 52—Gradient Boosting (0.872455), Random Forest (0.81), Linear SVM (0.553283), Logistic Regression (0.93189)
  • No. 108—Gradient Boosting (1.0), Random Forest (0.65), Logistic Regression (0.65996)
  • No. 20—Gradient Boosting (0.99998), Random Forest (0.92), Decision Tree (1.0), LDA (0.999742)
  • No. 451—Gradient Boosting (0.872455), Random Forest (0.81), Linear SVM (0.553283), Logistic Regression (0.93189)
  • No. 843—Gradient Boosting (0.872455), Random Forest (0.81), Linear SVM (0.553283), Logistic Regression (0.93189)
Because some of the generated sequences may correspond to compounds that are already known, the shortlisted molecules were checked against the Reaxys and Manifold databases in the next stage of selection.
The generated molecules were further screened using the AI-driven chemistry platform ChemAIRS® (https://www.chemical.ai/, accessed on 6 June 2025).
During the study, it was found that Molecule No. 20 is registered in neither Reaxys nor the ChemAIRS® platform.
According to the conducted analysis, Molecules 52, 108, 447, 451, and 843 were identified in the ChemAIRS® platform and are available for order from several commercial suppliers.
Molecule No. 540 was likewise not registered in the ChemAIRS® platform or Reaxys.
Similarly, Molecule No. 967 was not found in the ChemAIRS® platform.
There are several potential mechanisms of action for the generated compounds; for instance, molecule No. 967 shares motifs with DPP-4 inhibitors. This observation offers a rational hypothesis for future molecular docking and bioassay testing.

3.5. Biochemical and Pharmacophoric Patterns Revealed by Feature Importance Analysis

An in-depth analysis of the top-ranking features identified by Ridge Regression and Random Forest models reveals several key biochemical and pharmacophoric patterns that may underlie the antidiabetic activity of compounds in the dataset.
Lipophilic Fragments and Fatty Acid Derivatives. The highest-ranked feature in the Ridge Regression model, CHEMONTID:0001729 (fatty acids and derivatives), underscores the significance of lipophilic moieties in mediating antidiabetic activity. This observation aligns with the established structure–activity relationships of PPAR agonists, such as thiazolidinediones and fibrates, which incorporate long-chain hydrophobic elements that facilitate receptor binding and metabolic regulation.
As demonstrated in a related study [49], a five-drug class model utilizes phenotypic predictors derived from nine routinely collected clinical features to estimate the relative glycemic effectiveness of commonly used glucose-lowering therapies. While this approach enables broad applicability and implementation across real-world settings, it does not explicitly incorporate mechanistic or target-specific information (e.g., drug-target interaction with SGLT2 or PPARγ pathways). The observed differential long-term outcomes—including risks of glycemic failure, renal progression, and cardiovascular events—suggest underlying heterogeneity in drug response that may benefit from further mechanistic stratification. Therefore, future integration of target-specific modeling frameworks, alongside this phenotypic model, may enhance predictive precision by aligning physiological drug actions with patient-level molecular or biomarker profiles, ultimately supporting more personalized and biologically informed treatment decisions [49].
Aromatic Heterocycles and Lactams. Several highly ranked ontological classes, including CHEMONTID:0000229 (heteroaromatic compounds), CHEMONTID:0001894 (aromatic heteropolycyclic compounds), and CHEMONTID:0001819 (lactams), indicate the prevalence of these structural motifs in known antidiabetic agents. Aromatic heterocycles frequently serve as scaffolds in kinase inhibitors, DPP-4 inhibitors, and sulfonylureas, while lactam rings are often included to enhance metabolic stability and receptor binding affinity.
Convergent Features in Ridge and Forest Models. The joint identification of CHEMONTID:0000490 (organic oxygen compounds) and CHEBI:76983 (L-pipecolate) among the top features in both Ridge Regression and Random Forest models suggests the robustness of these descriptors. These compounds may contribute to modulating oxidative stress and mitochondrial function, processes critically involved in diabetes pathogenesis.
Amino Acid Derivatives and Bioactive Peptides. The inclusion of features such as CHEBI:48277 (Cysteinylglycine) emphasizes the relevance of peptide-like or amino acid–derived moieties. These fragments may mimic endogenous signaling peptides or influence redox modulation and insulin sensitization mechanisms.
Neuroactive Structures and Alkaloid Derivatives. The presence of CHEMONTID:0004343 (alkaloids and derivatives) implies that certain antidiabetic candidates may also exert neuromodulatory effects or interact with neurotransmitter systems. This may be pertinent to diabetes-associated cognitive dysfunction and the emerging involvement of the gut–brain axis in metabolic control.
Collectively, these findings highlight the multidimensional pharmacophoric landscape of antidiabetic compounds and underscore the predictive value of integrating chemical ontology (ChemOnt) and compound-specific identifiers (ChEBI) in machine learning–driven drug discovery.

3.6. Applicability Domain and Prediction Reliability

To improve the interpretability and reliability of the developed models, we assessed their applicability domain (AD)—the chemical space within which the model is expected to produce reliable predictions. AD was determined using the Mahalanobis distance method, which quantifies how far a new sample lies from the centroid of the training data in descriptor space.
For each descriptor set, we calculated the Mahalanobis distance between validation/test samples and the centroid of the training set. A cutoff threshold was derived based on the 95th percentile of the Mahalanobis distances within the training set, using the chi-squared distribution with degrees of freedom equal to the number of descriptors.
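This AD check can be sketched as follows. The data are synthetic and the function name is hypothetical; the cutoff uses the chi-squared quantile with one degree of freedom per descriptor, as described above.

```python
import numpy as np
from scipy.stats import chi2

def applicability_domain(X_train, X_new, alpha=0.95):
    """Flag new samples inside the AD: squared Mahalanobis distance to the
    training centroid at or below the chi-squared cutoff at percentile alpha."""
    mu = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    diff = X_new - mu
    # per-row quadratic form: d_i^2 = diff_i^T @ cov_inv @ diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    threshold = chi2.ppf(alpha, df=X_train.shape[1])
    return d2 <= threshold, threshold

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))                      # toy descriptor matrix
X_new = np.vstack([rng.normal(size=(10, 5)),             # in-domain samples
                   rng.normal(loc=10.0, size=(3, 5))])   # far outliers
inside, thr = applicability_domain(X_train, X_new)
assert not inside[10:].any()  # the outliers fall outside the AD
```

In practice the same function would be applied to the Mordred or ClassyFire descriptor matrices, with the validation compounds as `X_new`.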
Applicability domain (Mahalanobis distance) for the Mordred-based models: threshold = 43.8922; validation samples within the AD: 752 of 752.
Applicability domain (Mahalanobis distance) for the ClassyFire-based models: threshold = 49.8229; validation samples within the AD: 185 of 879.
These results indicate that Mordred-based models (Figure 2) exhibit excellent descriptor coverage across the chemical space of the validation set, as all external compounds fall within the AD. In contrast, ClassyFire-based models (Figure 3) have a much narrower applicability domain, with only 185 of 879 validation compounds (~21%) falling within the reliable prediction space. This observation is consistent with the higher structural diversity and sparsity of categorical ontological annotations used in the ClassyFire and ChEBI systems.
To aid interpretability, we also visualized the AD boundaries using Principal Component Analysis (PCA). Compounds falling outside the AD were clearly separated from the training distribution in reduced two-dimensional space, further reinforcing the importance of AD evaluation prior to applying the model to novel chemical entities.
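The PCA-based visualization can be reproduced along the following lines (synthetic descriptors; with matplotlib one would additionally scatter the two projected point clouds):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))        # training descriptor matrix (toy)
X_out = rng.normal(loc=6.0, size=(5, 8))   # compounds flagged outside the AD

# fit PCA on the training distribution, then project both sets into 2-D
pca = PCA(n_components=2).fit(X_train)
Z_train, Z_out = pca.transform(X_train), pca.transform(X_out)

# out-of-domain points sit far from the training cloud in reduced space
centroid = Z_train.mean(axis=0)
d_train = np.linalg.norm(Z_train - centroid, axis=1).mean()
d_out = np.linalg.norm(Z_out - centroid, axis=1).mean()
assert d_out > d_train
```

The separation of the out-of-domain points from the training centroid in two dimensions is what makes the AD boundary visually interpretable.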
In summary, integrating AD analysis allows identification of regions where model predictions are reliable and highlights scenarios in which predictions should be treated with caution or further validated. This is especially critical when deploying models in real-world applications such as virtual screening or drug repurposing (Table 9).
The AD filter retained 190 structures (2 actives, 188 inactives). When restricted to this chemically homogeneous sub-space, nine of the thirteen classifiers—including Logistic Regression, Random Forest, LDA, Lasso, Elastic Net, Decision Tree, Gradient Boosting, and the baseline k-NN (the latter with two residual false positives)—achieved perfect or near-perfect recognition of both classes (ACC, F1, MCC, balanced accuracy = 1.0). This behavior indicates that, within the multivariate region defined by the Mahalanobis metric, the actives are well separated from the bulk of the inactives and are linearly separable.
By contrast, Naive Bayes variants suffered pronounced overprediction: BernoulliNB and MultinomialNB produced 19 and 33 false positives, respectively, lowering precision to below 10% and MCC to at most 0.293, even though recall remained 1.0. Their assumption of descriptor independence appears ill-suited to the tightened feature covariance structure imposed by the AD.
Finally, margin-based linear methods (Linear SVM, SGD, Ridge) collapsed to the majority class, failing to identify either active (recall = 0, MCC = 0, balanced-accuracy = 0.5). This outcome suggests that their decision boundaries, optimized on the full training distribution, shrink inside the AD to exclude the minority region altogether.
Overall, Table 9 demonstrates that AD pruning can dramatically amplify model discrimination—turning several previously moderate performers into perfect classifiers—while simultaneously exposing methods that rely on unrealistic distributional assumptions or overly rigid margins. In the present setting, ensemble and regularized logistic models provide the most reliable in-domain predictions, whereas independence-based and hard-margin techniques require additional calibration or hybrid AD strategies to remain competitive.

3.7. Retrosynthesis and Molecule Synthesis Using AI

The retrosynthesis (Figure 4), performed using the ChemAIRS® platform https://www.chemical.ai/retrosynthesis (accessed on 6 June 2025), incorporates classical and modern organic transformations, including amide bond formation, heterocyclization, electrophilic aromatic substitution, and late-stage functionalization. The synthetic route comprises several key stages, labeled as [Step 1A] through [Step 7A]. The entire synthetic plan, carried out with the assistance of ChemAIRS®, is characterized by atom- and step-economy, avoids unnecessary protection-deprotection steps, and provides a practical framework for constructing drug-like heterocyclic scaffolds. Its design reflects both synthetic feasibility and medicinal relevance, supporting future applications in drug discovery and chemical biology.

3.8. Limitations and Applicability to Drug Discovery

A performance drop between cross-validation and independent-validation metrics was analyzed, focusing on balanced accuracy and the Matthews correlation coefficient (MCC). The analysis revealed the following key observations:
Simple models, such as Logistic Regression, Linear SVM, and SGD Classifier, showed minimal changes in performance, with balanced accuracy differences within ±4% and moderate MCC reductions. This suggests that these models generalize well to unseen data.
In contrast, complex models (e.g., Random Forest, Gradient Boosting, Ridge Regression) exhibited larger drops, with declines exceeding 30% in balanced accuracy and up to 95% in MCC. However, these differences are more plausibly attributed to distributional shifts and test set instability rather than classical overfitting.
Specifically, the training data was balanced using SMOTE (50:50), while the test set remained highly imbalanced, with only 9 positive samples among 527 total (~1.7%). Such a class distribution mismatch introduces considerable metric variability, particularly in MCC, which is highly sensitive to small changes in rare class predictions.
No signs of data leakage were detected, and data splits were performed appropriately.
These findings indicate that the observed performance changes are more consistent with experimental design constraints and class distribution effects than with model overtraining.
A major limitation of the present study lies in the composition of the dataset. The models were trained and evaluated exclusively on compounds with established pharmacological use as antidiabetic agents, specifically classified under ATC class A10 (Anatomical Therapeutic Chemical classification). The availability of such compounds is limited, and new antidiabetic drugs are introduced very infrequently. As a result, the dataset cannot be readily expanded, and the experimental design cannot be restructured without access to newly approved or discovered agents.
Despite these constraints, the developed models retain value as in silico tools for early-stage drug discovery, particularly in the following contexts:
  • Prioritization of large virtual libraries to identify candidates with predicted antidiabetic activity;
  • Hypothesis generation and compound ranking in cheminformatics pipelines;
  • Selection of compounds for further in vitro or in vivo validation.
The use of these models should be restricted to compounds that fall within the applicability domain (AD) defined by the training data. Applicability can be assessed using techniques such as Mahalanobis distance or PCA-based visualization. Predictions for compounds outside this domain should be interpreted with caution.
Given the limited number of positive samples, the high class imbalance, and the constraints of real-world drug availability, these models should not be used as standalone decision-making tools. Instead, they are best employed as pre-screening instruments that complement experimental and domain-based evaluation strategies.

4. Conclusions

This study presents a systematic approach to modeling novel chemical compounds with potential antidiabetic activity using machine learning algorithms. We begin by analyzing current trends in the application of artificial intelligence in pharmaceutical research, highlighting the increasing role of machine learning in accelerating drug discovery and improving prediction accuracy. The review underscores that ML techniques offer clear advantages over traditional methodologies, particularly in the context of identifying drug candidates with favorable biochemical profiles.
To support model development, we constructed a well-curated dataset that includes information on 46 known antidiabetic drugs and over 2400 other pharmaceutical compounds. Data preprocessing and categorization were carried out using reproducible protocols to ensure the quality and usability of the data. Molecular feature engineering was then performed using both the ClassyFire taxonomic framework and Mordred descriptors, resulting in a comprehensive set of over 1800 molecular features suitable for classification tasks.
Recognizing the challenges posed by class imbalance, we applied appropriate balancing strategies to prepare the training data. Several machine learning classifiers were subsequently trained and evaluated, confirming their capacity to distinguish antidiabetic agents from non-antidiabetic compounds with reasonable performance. In parallel, we adapted a character-level recurrent neural network (Char-RNN) to generate novel molecular structures in the form of SMILES strings. These generated molecules were evaluated using the trained classifiers, allowing us to identify candidates with high predicted antidiabetic potential.
While the findings demonstrate the feasibility of combining ML-based prediction and molecular generation for antidiabetic drug discovery, certain limitations remain. Notably, the generalizability of the models to chemically diverse compounds needs further investigation. Moreover, the biological activity of newly generated molecules has yet to be validated experimentally. Future research should focus on expanding the chemical space through additional data sources, incorporating docking simulations and bioassays, and integrating explainable AI tools to enhance model transparency and applicability in regulatory settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/appliedchem5040030/s1, Table S1: Antidiabetic agents for building machine learning models; Table S2: Comparative characteristics of machine learning model metrics based on descriptors according to ClassyFire classification; Table S3: Comparative characteristics of machine learning model metrics based on descriptors according to the ClassyFire library (on the validation dataset); Table S4: Evaluation of the models performance on an external, unseen validation dataset.

Author Contributions

Conceptualization: Y.P. and I.I.; methodology: Y.P.; software: A.T.; validation: Y.P. and A.T.; formal analysis: A.T.; investigation: Y.P.; writing—original draft preparation: Y.P.; writing—review and editing: I.I.; visualization: Y.P.; supervision: Y.P.; project administration: I.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this study are publicly available and can be accessed, reproduced, and reused from the GitHub repository https://github.com/pruhlo/appliedchem-3624702 (accessed on 4 August 2025).

Acknowledgments

During the preparation of this manuscript, the authors used Cursor (https://cursor.com, accessed on 6 July 2025) and Claude AI (https://claude.ai/). The authors used these programs for drafting, reviewing, and refining the manuscript content. All AI-generated output was thoroughly reviewed and edited by the authors, who take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Forouhi, N.G.; Wareham, N.J. Epidemiology of diabetes. Medicine 2022, 50, 638–643. [Google Scholar] [CrossRef]
  2. Forouhi, N.G.; Wareham, N.J. Epidemiology of diabetes. Medicine 2019, 47, 22–27. [Google Scholar] [CrossRef]
  3. Mekala, K.C.; Bertoni, A.G. Chapter 4—Epidemiology of diabetes mellitus. Transplant. Bioeng. Regen. Endocr. Pancreas 2020, 1, 49–58. [Google Scholar] [CrossRef]
  4. Saeedi, P.; Petersohn, I.; Salpea, P.; Malanda, B.; Karuranga, S.; Unwin, N.; Colagiuri, S.; Guariguata, L.; Motala, A.A.; Ogurtsova, K.; et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas, 9th edition. Diabetes Res. Clin. Pract. 2019, 157, 107843. [Google Scholar] [CrossRef] [PubMed]
  5. Schetz, J.A. Structure-Activity Relationships: Theory, Uses and Limitations. In Reference Module in Biomedical Sciences; Elsevier: Amsterdam, The Netherlands, 2015; pp. 1–12. [Google Scholar] [CrossRef]
  6. Guha, R. On Exploring Structure–Activity Relationships. In In Silico Models for Drug Discovery; Kortagere, S., Ed.; Methods in Molecular Biology; Humana Press: Totowa, NJ, USA, 2013; Volume 993, pp. 81–94. [Google Scholar] [CrossRef]
  7. Rueda-Zubiaurre, A.; Yahiya, S.; Fischer, O.J.; Hu, X.; Saunders, C.N.; Sharma, S.; Straschil, U.; Shen, J.; Tate, E.W.; Delves, M.J.; et al. Structure–Activity Relationship Studies of a Novel Class of Transmission Blocking Antimalarials Targeting Male Gametes. J. Med. Chem. 2020, 63, 2240–2262. [Google Scholar] [CrossRef] [PubMed]
  8. Laiolo, J.; Lanza, P.A.; Parravicini, O.; Barbieri, C.; Insuasty, D.; Cobo, J.; Vera, D.M.A.; Enriz, R.D.; Carpinella, M.C. Structure Activity Relationships and the Binding Mode of Quinolinone-Pyrimidine Hybrids as Reversal Agents of Multidrug Resistance Mediated by P-Gp. Sci. Rep. 2021, 11, 16856. [Google Scholar] [CrossRef]
  9. Sanket, B.; Bhatshankar, A.S.T. Quantitative Structure-Activity Relationship and Group-Based Quantitative Structure-Activity Relationship: A Review. Int. J. Pharm. Sci. Res. 2023, 14, 1131–1148. [Google Scholar]
  10. Sippl, W.; Ntie-Kang, F. Editorial to Special Issue—“Structure-Activity Relationships (SAR) of Natural Products”. Molecules 2021, 26, 250. [Google Scholar] [CrossRef]
  11. Olaokun, O.O.; Zubair, M.S. Antidiabetic Activity, Molecular Docking, and ADMET Properties of Compounds Isolated from Bioactive Ethyl Acetate Fraction of Ficus Lutea Leaf Extract. Molecules 2023, 28, 7717. [Google Scholar] [CrossRef]
  12. Onikanni, S.A.; Lawal, B.; Munyembaraga, V.; Bakare, O.S.; Taher, M.; Khotib, J.; Susanti, D.; Oyinloye, B.E.; Noriega, L.; Famuti, A.; et al. Profiling the Antidiabetic Potential of Compounds Identified from Fractionated Extracts of Entada Africana toward Glucokinase Stimulation: Computational Insight. Molecules 2023, 28, 5752. [Google Scholar] [CrossRef]
  13. Adinortey, C.A.; Kwarko, G.B.; Koranteng, R.; Boison, D.; Obuaba, I.; Wilson, M.D.; Kwofie, S.K. Molecular Structure-Based Screening of the Constituents of Calotropis Procera Identifies Potential Inhibitors of Diabetes Mellitus Target Alpha Glucosidase. Curr. Issues Mol. Biol. 2022, 44, 963–987. [Google Scholar] [CrossRef]
  14. Odugbemi, A.I.; Nyirenda, C.; Christoffels, A.; Egieyeh, S.A. Artificial Intelligence in Antidiabetic Drug Discovery: The Advances in QSAR and the Prediction of α-Glucosidase Inhibitors. Comput. Struct. Biotechnol. J. 2024, 23, 2964–2977. [Google Scholar] [CrossRef] [PubMed]
  15. Adeniji, S.E.; Uba, S.; Uzairu, A. Multi-Linear Regression Model, Molecular Binding Interactions and Ligand-Based Design of Some Prominent Compounds against Mycobacterium Tuberculosis. Netw. Model. Anal. Health Inform. Bioinform. 2020, 9, 8. [Google Scholar] [CrossRef]
  16. Bustamam, A.; Hamzah, H.; Husna, N.A.; Syarofina, S.; Dwimantara, N.; Yanuar, A.; Sarwinda, D. Artificial intelligence paradigm for ligand-based virtual screening on the drug discovery of type 2 diabetes mellitus. J. Big Data 2021, 8, 74. [Google Scholar] [CrossRef]
  17. Kavakiotis, I.; Tsave, O.; Salifoglou, A.; Maglaveras, N.; Vlahavas, I.; Chouvarda, I. Machine Learning and Data Mining Methods in Diabetes Research. Comput. Struct. Biotechnol. J. 2017, 15, 104–116. [Google Scholar] [CrossRef]
  18. Hermansyah, O.; Rahmawati, S.; Masrijal, C.D.P.; Permasari, R.I.; Slamet, S. Identification of DPP-4 Inhibitor Active Compounds Using Machine Learning Classification. Int. J. Chem. Biochem. Sci. 2023, 24, 674–681. [Google Scholar]
  19. Hermansyah, O.; Bustamam, A.; Yanuar, A. Virtual screening of dipeptidyl peptidase-4 inhibitors using quantitative structure–activity relationship-based artificial intelligence and molecular docking of hit compounds. Comput. Biol. Chem. 2021, 95, 107597. [Google Scholar] [CrossRef]
  20. De La Torre, S.; Cuesta, S.A.; Calle, L.; Mora, J.R.; Paz, J.L.; Espinoza-Montero, P.J.; Flores-Sumoza, M.; Márquez, E.A. Computational approaches for lead compound discovery in dipeptidyl peptidase-4 inhibition using machine learning and molecular dynamics techniques. Comput. Biol. Chem. 2024, 112, 108145. [Google Scholar] [CrossRef]
  21. Devaraji, V.; Sivaraman, J. Exploring the potential of machine learning to design antidiabetic molecules: A comprehensive study with experimental validation. J. Biomol. Struct. Dyn. 2024, 42, 13290–13311. [Google Scholar] [CrossRef]
  22. Bazl, R.; Ganjali, M.R.; Derakhshankhah, H.; Saboury, A.A.; Amanlou, M.; Norouzi, P. Prediction of Tyrosinase Inhibition for Drug Design Using the Genetic Algorithm–Multiple Linear Regressions. Med. Chem. Res. 2013, 22, 5453–5465. [Google Scholar] [CrossRef]
  23. Billones, L.T.; Gonzaga, A.C. Multiple Logistic Regression Modeling of Compound Class as Active or Inactive Against COX-2 and Prediction on Designed Coxib Derivatives and Similar Compounds. Chem-Bio Inform. J. 2022, 22, 63–87. [Google Scholar] [CrossRef]
  24. Liu, C.; Wei, D.; Xiang, J.; Ren, F.; Huang, L.; Lang, J.; Tian, G.; Li, Y.; Yang, J. An Improved Anticancer Drug-Response Prediction Based on an Ensemble Method Integrating Matrix Completion and Ridge Regression. Mol. Ther.—Nucleic Acids 2020, 21, 676–686. [Google Scholar] [CrossRef] [PubMed]
  25. Datta, S.; Dev, V.A.; Eden, M.R. Using Correlation Based Adaptive LASSO Algorithm to Develop QSPR of Antitumour Agents for DNA–Drug Binding Prediction. Comput. Chem. Eng. 2019, 122, 258–264. [Google Scholar] [CrossRef]
  26. Huang, E.W.; Bhope, A.; Lim, J.; Sinha, S.; Emad, A. Tissue-Guided LASSO for Prediction of Clinical Drug Response Using Preclinical Samples. PLoS Comput. Biol. 2020, 16, e1007607. [Google Scholar] [CrossRef]
  27. Rydzewski, N.R.; Peterson, E.; Lang, J.M.; Yu, M.; Laura Chang, S.; Sjöström, M.; Bakhtiar, H.; Song, G.; Helzer, K.T.; Bootsma, M.L.; et al. Predicting Cancer Drug TARGETS–TreAtment Response Generalized Elastic-neT Signatures. NPJ Genom. Med. 2021, 6, 76. [Google Scholar] [CrossRef]
  28. Toussi, C.A.; Haddadnia, J.; Matta, C.F. Drug Design by Machine-Trained Elastic Networks: Predicting Ser/Thr-Protein Kinase Inhibitors’ Activities. Mol. Divers. 2021, 25, 899–909. [Google Scholar] [CrossRef]
  29. Pu, Q.; Li, Y.; Zhang, H.; Yao, H.; Zhang, B.; Hou, B.; Li, L.; Zhao, Y.; Zhao, L. Screen Efficiency Comparisons of Decision Tree and Neural Network Algorithms in Machine Learning Assisted Drug Design. Sci. China Chem. 2019, 62, 506–514. [Google Scholar] [CrossRef]
  30. Li, B.; Wang, Y.; Yin, Z.; Xu, L.; Xie, L.; Xu, X. Decision Tree-based Identification of Important Molecular Fragments for Protein-ligand Binding. Chem. Biol. Drug Des. 2024, 103, e14427. [Google Scholar] [CrossRef]
  31. Noviandy, T.R.; Idroes, G.M.; Hardi, I. Machine Learning Approach to Predict AXL Kinase Inhibitor Activity for Cancer Drug Discovery Using Bayesian Optimization-XGBoost. J. Soft Comput. Data Min. 2024, 15, 46–56. [Google Scholar] [CrossRef]
  32. Janairo, J.I.B. Support Vector Machine in Drug Design. In Cheminformatics, QSAR and Machine Learning Applications for Novel Drug Development; Elsevier: Amsterdam, The Netherlands, 2023; pp. 161–179. [Google Scholar] [CrossRef]
  33. Rodríguez-Pérez, R.; Bajorath, J. Evolution of Support Vector Machine and Regression Modeling in Chemoinformatics and Drug Discovery. J. Comput. Aided Mol. Des. 2022, 36, 355–362. [Google Scholar] [CrossRef]
  34. Maltarollo, V.G.; Kronenberger, T.; Espinoza, G.Z.; Oliveira, P.R.; Honorio, K.M. Advances with Support Vector Machines for Novel Drug Discovery. Expert Opin. Drug Discov. 2019, 14, 23–33. [Google Scholar] [CrossRef] [PubMed]
  35. Mandal, L.; Jana, N.D. A Comparative Study of Naive Bayes and K-NN Algorithm for Multi-Class Drug Molecule Classification. In Proceedings of the 2019 IEEE 16th India Council International Conference (INDICON), Rajkot, India, 13–15 December 2019; IEEE: New York, NY, USA, 2019; pp. 1–4. [Google Scholar] [CrossRef]
  36. Jimenes-Vargas, K.; Perez-Castillo, Y.; Tejera, E.; Munteanu, C.R. Exploring Target Identification for Drug Design with K-Nearest Neighbors’ Algorithm. In Artificial Intelligence and Soft Computing; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14126, pp. 219–227. [Google Scholar] [CrossRef]
  37. Zhang, H.; Mao, J.; Qi, H.-Z.; Xie, H.-Z.; Shen, C.; Liu, C.-T.; Ding, L. Developing Novel Computational Prediction Models for Assessing Chemical-Induced Neurotoxicity Using Naïve Bayes Classifier Technique. Food Chem. Toxicol. 2020, 143, 111513. [Google Scholar] [CrossRef] [PubMed]
  38. Knox, C.; Wilson, M.; Klinger, C.M.; Franklin, M.; Oler, E.; Wilson, A.; Pon, A.; Cox, J.; Chin, N.E.; Strawbridge, S.A.; et al. DrugBank 6.0: The DrugBank Knowledgebase for 2024. Nucleic Acids Res. 2024, 52, D1265–D1275. [Google Scholar] [CrossRef]
  39. Feunang, Y.D.; Eisner, R.; Knox, C.; Chepelev, L.; Hastings, J.; Owen, G.; Fahy, E.; Steinbeck, C.; Subramanian, S.; Bolton, E.; et al. ClassyFire: Automated Chemical Classification with a Comprehensive, Computable Taxonomy. J. Cheminform. 2016, 8, 61. [Google Scholar] [CrossRef]
  40. Richardson, L. Beautiful Soup Documentation. Available online: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (accessed on 25 May 2025).
  41. Requests: HTTP for Humans—Requests 2.0.0 Documentation. Available online: https://docs.python-requests.org/en/v2.0.0/ (accessed on 3 January 2025).
  42. Moriwaki, H.; Tian, Y.S.; Kawashita, N.; Takagi, T. Mordred: A Molecular Descriptor Calculator. J. Cheminform. 2018, 10, 4. [Google Scholar] [CrossRef]
  43. Armendariz, N.L.D.; Vázquez, N.A.R.; Brazon, E.M. Semi-Empirical PM6 Method Applied in the Analysis of Thermodynamics Properties and Molecular Orbitals at Different Temperatures of Adsorption Drugs on Chitosan Hydrogels for Type 2 Diabetes. Polym. Bull. 2019, 76, 3423–3435. [Google Scholar] [CrossRef]
  44. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  45. Scikit-Learn: Machine Learning in Python Documentation. Available online: https://scikit-learn.org/stable/api/sklearn.linear_model.html (accessed on 3 January 2025).
  46. Géron, A. Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 3rd ed.; O’Reilly Media, Inc.: Beijing, China, 2023. [Google Scholar]
  47. Chen, L.; Xu, J.; Zhou, Y. PDATC-NCPMKL: Predicting drug’s Anatomical Therapeutic Chemical (ATC) codes based on network consistency projection and multiple kernel learning. Comput. Biol. Med. 2024, 169, 107862. [Google Scholar] [CrossRef]
  48. Choudhary, S. Generating Molecules using a Char-RNN in Pytorch. Medium. Available online: https://medium.com/@sunitachoudhary103/generating-molecules-using-a-char-rnn-in-pytorch-16885fd9394b (accessed on 19 March 2025).
  49. Dennis, J.M.; Young, K.G.; Cardoso, P.; Güdemann, L.M.; McGovern, A.P.; Farmer, A.; Holman, R.R.; Sattar, N.; McKinley, T.J.; Pearson, E.R.; et al. A five-drug class model using routinely available clinical features to optimise prescribing in type 2 diabetes: A prediction model development and validation study. Lancet 2025, 405, 701–714. [Google Scholar] [CrossRef]
Figure 1. Structures of selected generated molecules based on SMILES.
Figure 2. Applicability domain (AD) of the validation samples based on Mordred descriptors: (a) Mahalanobis distances; (b) samples visualized outside the AD.
Figure 3. Applicability domain (AD) of the validation samples based on ClassyFire descriptors: (a) Mahalanobis distances; (b) samples visualized outside the AD.
Figure 4. Retrosynthetic analysis of compound No. 967 using ChemAIRS®.
Table 1. Comparison of classification models based on confusion-matrix data for ClassyFire descriptors.
| Model | Data Type | True Positives (TP) | False Positives (FP) | True Negatives (TN) | False Negatives (FN) |
|---|---|---|---|---|---|
| Logistic Regression | Test Data | 7 | 7 | 599 | 2 |
| BernoulliNB | Test Data | 6 | 36 | 570 | 3 |
| Multinomial Naive Bayes | Test Data | 7 | 33 | 573 | 2 |
| Linear SVM | Test Data | 7 | 1 | 605 | 2 |
| k-NN | Test Data | 7 | 63 | 543 | 2 |
| Random Forest | Test Data | 6 | 0 | 606 | 3 |
| LDA | Test Data | 7 | 26 | 580 | 2 |
| SGD Classifier | Test Data | 7 | 6 | 600 | 2 |
| Lasso Regression | Test Data | 7 | 8 | 598 | 2 |
| Ridge Regression | Test Data | 7 | 9 | 597 | 2 |
| Elastic Net | Test Data | 7 | 6 | 600 | 2 |
| Decision Tree | Test Data | 7 | 8 | 598 | 2 |
| Gradient Boosting | Test Data | 5 | 2 | 604 | 4 |
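The raw counts in the confusion-matrix tables determine every derived metric reported in this paper. As a sanity check, a minimal Python sketch (the helper name `confusion_metrics` is ours, not part of the paper's pipeline) recomputes precision, recall, F1, and the Matthews correlation coefficient (MCC) from one table row:

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Recompute standard classification metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Matthews correlation coefficient: (TP*TN - FP*FN) / sqrt of the product
    # of the four marginal totals
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Ridge Regression row of Table 1 (ClassyFire test data): TP=7, FP=9, TN=597, FN=2
ridge = confusion_metrics(7, 9, 597, 2)
```

Applying the same helper to the Gradient Boosting test-data counts for Mordred descriptors (TP=3, FP=0, TN=518, FN=6, Table 5) reproduces the MCC of 0.5740 reported in Table 4.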
Table 2. Feature importance (top 7) of the Ridge Regression model based on ClassyFire descriptors.
| Feature | Importance |
|---|---|
| CHEBI:35622 | 0.587717 |
| CHEMONTID:0000229 | 0.556824 |
| CHEMONTID:0002286 | 0.483765 |
| CHEMONTID:0000490 | 0.475671 |
| CHEBI:76983 | 0.475671 |
| CHEMONTID:0001211 | 0.471702 |
| CHEMONTID:0001861 | 0.392302 |
Table 3. Feature importance (top 7) of the Random Forest model based on ClassyFire descriptors.
| Feature | Importance |
|---|---|
| CHEMONTID:0000270 | 0.030681 |
| CHEBI:35358 | 0.028784 |
| CHEBI:22712 | 0.023023 |
| CHEMONTID:0004233 | 0.021056 |
| CHEBI:33261 | 0.017623 |
| CHEMONTID:0000031 | 0.017038 |
| CHEBI:33552 | 0.016370 |
Table 4. Comparative characteristics of machine learning model metrics based on Mordred descriptors.
| Model | Data Type | Accuracy | Precision | Recall | F1 Score | ROC AUC | Balanced Acc. | MCC |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | Cross-Validation Data | 0.5257 ± 0.0121 | 0.5133 ± 0.0066 | 1.0000 ± 0.0000 | 0.6783 ± 0.0058 | 0.6924 ± 0.0390 | 0.5257 ± 0.0122 | 0.1581 ± 0.0394 |
| Logistic Regression | Test Data | 0.0759 | 0.0181 | 1.0000 | 0.0356 | 0.5506 | 0.5299 | 0.0330 |
| BernoulliNB | Cross-Validation Data | 0.8622 ± 0.0273 | 0.7867 ± 0.0348 | 0.9975 ± 0.0043 | 0.8792 ± 0.0207 | 0.9614 ± 0.0152 | 0.8623 ± 0.0273 | 0.7533 ± 0.0439 |
| BernoulliNB | Test Data | 0.7362 | 0.0423 | 0.6667 | 0.0795 | 0.7335 | 0.7021 | 0.1180 |
| Multinomial Naive Bayes | Cross-Validation Data | 0.8992 ± 0.0259 | 0.8495 ± 0.0370 | 0.9726 ± 0.0109 | 0.9065 ± 0.0225 | 0.9263 ± 0.0272 | 0.8992 ± 0.0259 | 0.8078 ± 0.0473 |
| Multinomial Naive Bayes | Test Data | 0.8235 | 0.0532 | 0.5556 | 0.0971 | 0.7591 | 0.6919 | 0.1299 |
| Linear SVM | Cross-Validation Data | 0.5270 ± 0.0134 | 0.5140 ± 0.0072 | 1.0000 ± 0.0000 | 0.6789 ± 0.0063 | 0.6804 ± 0.0364 | 0.5270 ± 0.0137 | 0.1613 ± 0.0432 |
| Linear SVM | Test Data | 0.0759 | 0.0181 | 1.0000 | 0.0356 | 0.5493 | 0.5299 | 0.0330 |
| k-NN | Cross-Validation Data | 0.9220 ± 0.0206 | 0.8678 ± 0.0310 | 0.9975 ± 0.0060 | 0.9278 ± 0.0176 | 0.9759 ± 0.0106 | 0.9220 ± 0.0207 | 0.8544 ± 0.0360 |
| k-NN | Test Data | 0.8330 | 0.0353 | 0.3333 | 0.0638 | 0.6728 | 0.5875 | 0.0617 |
| Random Forest | Cross-Validation Data | 0.9996 ± 0.0014 | 0.9992 ± 0.0027 | 1.0000 ± 0.0000 | 0.9996 ± 0.0014 | 1.0000 ± 0.0000 | 0.9996 ± 0.0014 | 0.9992 ± 0.0027 |
| Random Forest | Test Data | 0.9848 | 1.0000 | 0.1111 | 0.2000 | 0.8897 | 0.5556 | 0.3308 |
| LDA | Cross-Validation Data | 0.9120 ± 0.0192 | 0.8532 ± 0.0265 | 0.9967 ± 0.0047 | 0.9192 ± 0.0162 | 0.9183 ± 0.0274 | 0.9120 ± 0.0193 | 0.8365 ± 0.0340 |
| LDA | Test Data | 0.8729 | 0.0857 | 0.6667 | 0.1519 | 0.8342 | 0.7716 | 0.2073 |
| SGD Classifier | Cross-Validation Data | 0.5889 ± 0.0829 | 0.5838 ± 0.0815 | 0.6475 ± 0.0657 | 0.6131 ± 0.0703 | 0.5875 ± 0.0808 | 0.5889 ± 0.0829 | 0.1790 ± 0.1668 |
| SGD Classifier | Test Data | 0.5598 | 0.0255 | 0.6667 | 0.0492 | 0.5991 | 0.6123 | 0.0585 |
| Lasso Regression | Cross-Validation Data | 0.4992 ± 0.0033 | 0.0000 ± 0.0000 | 0.0000 ± 0.0000 | 0.0000 ± 0.0000 | 0.3291 ± 0.0343 | 0.4992 ± 0.0019 | −0.0118 ± 0.0265 |
| Lasso Regression | Test Data | 0.9810 | 0.3333 | 0.1111 | 0.1667 | 0.5182 | 0.5536 | 0.1847 |
| Ridge Regression | Cross-Validation Data | 0.9793 ± 0.0116 | 0.9630 ± 0.0227 | 0.9975 ± 0.0043 | 0.9798 ± 0.0111 | 0.9795 ± 0.0161 | 0.9793 ± 0.0116 | 0.9595 ± 0.0222 |
| Ridge Regression | Test Data | 0.9545 | 0.2000 | 0.5556 | 0.2941 | 0.7752 | 0.7585 | 0.3151 |
| Elastic Net | Cross-Validation Data | 0.4992 ± 0.0033 | 0.0000 ± 0.0000 | 0.0000 ± 0.0000 | 0.0000 ± 0.0000 | 0.3291 ± 0.0343 | 0.4992 ± 0.0019 | −0.0118 ± 0.0265 |
| Elastic Net | Test Data | 0.9810 | 0.3333 | 0.1111 | 0.1667 | 0.5182 | 0.5536 | 0.1847 |
| Decision Tree | Cross-Validation Data | 0.9892 ± 0.0083 | 0.9815 ± 0.0162 | 0.9975 ± 0.0043 | 0.9894 ± 0.0081 | 0.9892 ± 0.0083 | 0.9892 ± 0.0083 | 0.9787 ± 0.0163 |
| Decision Tree | Test Data | 0.9753 | 0.3000 | 0.3333 | 0.3158 | 0.6599 | 0.6599 | 0.3037 |
| Gradient Boosting | Cross-Validation Data | 0.9992 ± 0.0019 | 0.9984 ± 0.0037 | 1.0000 ± 0.0000 | 0.9992 ± 0.0018 | 1.0000 ± 0.0000 | 0.9992 ± 0.0019 | 0.9983 ± 0.0037 |
| Gradient Boosting | Test Data | 0.9886 | 1.0000 | 0.3333 | 0.5000 | 0.7154 | 0.6667 | 0.5740 |
Table 5. Comparison of classification models based on confusion-matrix data for Mordred descriptors.
| Model | Data Type | True Positives (TP) | False Positives (FP) | True Negatives (TN) | False Negatives (FN) |
|---|---|---|---|---|---|
| Logistic Regression | Test Data | 9 | 487 | 31 | 0 |
| BernoulliNB | Test Data | 6 | 136 | 382 | 3 |
| Multinomial Naive Bayes | Test Data | 5 | 89 | 429 | 4 |
| Linear SVM | Test Data | 9 | 487 | 31 | 0 |
| k-NN | Test Data | 3 | 82 | 436 | 6 |
| Random Forest | Test Data | 1 | 0 | 518 | 8 |
| LDA | Test Data | 6 | 64 | 454 | 3 |
| SGD Classifier | Test Data | 6 | 229 | 289 | 3 |
| Lasso Regression | Test Data | 1 | 2 | 516 | 8 |
| Ridge Regression | Test Data | 5 | 20 | 498 | 4 |
| Elastic Net | Test Data | 1 | 2 | 516 | 8 |
| Decision Tree | Test Data | 3 | 7 | 511 | 6 |
| Gradient Boosting | Test Data | 3 | 0 | 518 | 6 |
Table 6. Feature importance (top 7) of the Random Forest model based on Mordred descriptors.
| Serial Number of the Generated Compound | Feature | Importance |
|---|---|---|
| 763 | ATS4s | 0.018 |
| 234 | MAXsssCH | 0.014 |
| 1547 | WNSA2 | 0.013 |
| 1274 | VSA_EState5 | 0.012 |
| 683 | SpMAD_Dzare | 0.012 |
| 648 | C2SP3 | 0.012 |
| 750 | SpMAD_Dzpe | 0.0116 |
Table 7. Feature importance (top 7) of the Gradient Boosting model based on Mordred descriptors.
| Serial Number of the Generated Compound | Feature | Importance |
|---|---|---|
| 1558 | EState_VSA10 | 0.2694 |
| 1787 | VSA_EState5 | 0.1550 |
| 1547 | ATS4s | 0.05114 |
| 799 | ATSC8c | 0.03562 |
| 1102 | ZMIC2 | 0.03212 |
| 1288 | PNSA2 | 0.03128 |
| 643 | SdS | 0.02764 |
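The "top 7" feature-importance tables are simply the seven largest entries of a model's feature-importance vector. A minimal helper reproducing that selection (the function name and the toy values below are illustrative, not from the paper):

```python
def top_k_features(importances, k=7):
    """Sort a {feature: importance} mapping and keep the k largest entries."""
    return sorted(importances.items(), key=lambda item: item[1], reverse=True)[:k]

# Toy subset of the Mordred importances reported for the Gradient Boosting model
scores = {"EState_VSA10": 0.2694, "VSA_EState5": 0.1550, "ATS4s": 0.05114}
top = top_k_features(scores, k=2)
```

In practice the mapping would be built by zipping descriptor names with a fitted model's `feature_importances_` (tree ensembles) or absolute coefficients (linear models).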
Table 8. Chemical Structures of SMILES Sequences with the Highest Predicted Antidiabetic Activity According to Various ML Models.
| Serial Number of the Generated Compound | SMILES | 2D Structure |
|---|---|---|
| 20 | N(C(=O)Cc1cc(c(OCC)cc1)C(O)=O)CC(C)C | (image i001) |
| 52 | C12(CC3(CC(CC(C3)C2)C1)O)[C@H](N)C(N1[C@@H](C[C@@H]2C[C@@H]21)C#N)=O | (image i002) |
| 108 | c1c(OCC)c(ccc1CC(=O)N[C@H](c1c(cccc1)N1CCCCC1)CC(C)C)C(O)=O | (image i003) |
| 447 | C1CN([C@@H]2C[C@H]21)C(=O)[C@@H](N)C12CC3CC(CC(O)(C3)C1)C2 | (image i004) |
| 451 | C1C2(CC3CC(C2)CC1(C3)O)[C@@H](C(N1[C@H](C#N)C[C@H]2[C@@H]1C2)=O)N | (image i005) |
| 540 | C1C2(CC3CC(CC1(C3)C2)NCC(=O)N1CCC[C@H]1C#N)O | (image i006) |
| 628 | F[C@H](CC(N1CCn2c(nnc2C(F)(F)F)C1)=O)N | (image i007) |
| 843 | C12CC3(CC(O)(CC(C3)C1)C2)[C@H](N)C(=O)N1[C@H](C#N)C[C@@H]2C[C@H]12 | (image i008) |
| 967 | c12c(nnc(C(F)(F)F)c2CCN(C1)C(C[C@@H](Cc1c(cc(F)c(F)c1)F)N)=O)F | (image i009) |
| 1163 | N1(C(C[C@@H](Cc2cc(F)c(cc2F)F)N)=O)Cc2nnc(C(F)(F)F)n2CC1 | (image i010) |
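Generated SMILES strings such as those in Table 8 would normally be validated with a full cheminformatics toolkit such as RDKit. As a lightweight stand-in, the sketch below (our own helper, not part of the paper's pipeline) applies two cheap well-formedness checks — balanced brackets and paired ring-closure digits — that every entry in the table satisfies. It is not a SMILES parser and ignores multi-digit `%nn` ring closures and digits inside atom brackets.

```python
from collections import Counter

def smiles_sanity_check(smiles):
    """Cheap well-formedness checks for a generated SMILES string.

    Verifies only that () and [] are balanced and that every ring-closure
    digit occurs an even number of times; a real pipeline would parse the
    string with RDKit instead.
    """
    if smiles.count("(") != smiles.count(")"):
        return False
    if smiles.count("[") != smiles.count("]"):
        return False
    ring_digits = Counter(ch for ch in smiles if ch.isdigit())
    return all(count % 2 == 0 for count in ring_digits.values())

# Compound No. 20 from Table 8 passes both checks
ok = smiles_sanity_check("N(C(=O)Cc1cc(c(OCC)cc1)C(O)=O)CC(C)C")
```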
Table 9. Mahalanobis-distance applicability-domain (AD) evaluation on the independent ClassyFire set.
| Model | True Positives (TP) | False Positives (FP) | True Negatives (TN) | False Negatives (FN) | Accuracy | Precision | Recall | F1 Score | ROC AUC | Balanced Acc. | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| BernoulliNB | 2 | 19 | 169 | 0 | 0.9 | 0.095 | 1.0 | 0.174 | 1.0 | 0.949 | 0.293 |
| Multinomial Naive Bayes | 2 | 33 | 155 | 0 | 0.826 | 0.057 | 1.0 | 0.108 | 1.0 | 0.912 | 0.217 |
| Linear SVM | 0 | 0 | 188 | 2 | 0.989 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.0 |
| k-NN | 2 | 2 | 186 | 0 | 0.989 | 0.5 | 1.0 | 0.667 | 1.0 | 0.995 | 0.703 |
| Random Forest | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| LDA | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| SGD Classifier | 0 | 0 | 188 | 2 | 0.989 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.0 |
| Lasso Regression | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Ridge Regression | 0 | 0 | 188 | 2 | 0.989 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.0 |
| Elastic Net | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Decision Tree | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Gradient Boosting | 2 | 0 | 188 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
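The applicability-domain screening behind Table 9 and Figures 2–3 flags a sample as out-of-domain when its Mahalanobis distance from the training-set centroid exceeds a threshold. A minimal two-descriptor sketch (the paper works in much higher-dimensional descriptor spaces; the function and variable names here are ours):

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from `mean` under a 2x2 covariance `cov`."""
    dx = (x[0] - mean[0], x[1] - mean[1])
    (a, b), (c, d) = cov
    det = a * d - b * c  # assumed non-zero, i.e. cov is invertible
    inv = ((d / det, -b / det), (-c / det, a / det))
    # quadratic form dx^T cov^-1 dx
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# With identity covariance the measure reduces to Euclidean distance;
# a larger variance along the first descriptor shrinks the distance.
d_euclid = mahalanobis_2d((1.0, 1.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
d_scaled = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
```

Because the distance rescales each descriptor by its variance and accounts for correlations, it gives a single out-of-domain criterion even when descriptors live on very different scales.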
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Pruhlo, Y.; Iurchenko, I.; Tomenko, A. Modeling of New Agents with Potential Antidiabetic Activity Based on Machine Learning Algorithms. AppliedChem 2025, 5, 30. https://doi.org/10.3390/appliedchem5040030

