XGBPred-ACSM: A Hybrid Descriptor-Driven XGBoost Framework for Anticancer Small Molecule Prediction
Abstract
1. Introduction
2. Results and Discussion
2.1. Dataset Analysis and Feature Selection
2.2. Performance of Classification Models
2.2.1. Analysis of 2D/FP Descriptors
2.2.2. Analysis of Hybrid Features
2.2.3. Performance Under Scaffold-Based Validation
2.3. Feature Analysis
3. Materials and Methods
3.1. Dataset Construction
3.2. Feature Calculation
3.3. Feature Selection
3.4. Machine Learning Algorithms and Model Construction
3.5. Cross-Validation and Performance Metrics
3.6. Scaffold-Based Validation
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References






CV: 10-fold cross-validation. Test: test set.

| Classifier | CV Accuracy (%) | CV Sensitivity (%) | CV Specificity (%) | CV AUC | CV MCC | Test Accuracy (%) | Test Sensitivity (%) | Test Specificity (%) | Test AUC | Test MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 74.72 | 76.88 | 72.57 | 0.82 | 0.49 | 72.50 | 82.78 | 62.22 | 0.79 | 0.45 |
| XGB | 79.90 | 82.92 | 76.88 | 0.89 | 0.59 | 75.42 | 92.78 | 58.06 | 0.85 | 0.54 |
| GB | 76.60 | 79.58 | 73.61 | 0.86 | 0.53 | 74.31 | 88.89 | 59.72 | 0.81 | 0.50 |
| ET | 74.97 | 78.82 | 71.11 | 0.83 | 0.50 | 72.36 | 87.78 | 56.94 | 0.80 | 0.47 |
| AdaBoost | 78.09 | 80.14 | 76.04 | 0.87 | 0.56 | 70.97 | 89.17 | 52.27 | 0.83 | 0.45 |
| LightGBM | 79.17 | 80.90 | 77.43 | 0.87 | 0.58 | 71.67 | 87.78 | 55.56 | 0.82 | 0.45 |
| Voting (XGB + GB) | 79.24 | 82.36 | 76.11 | 0.88 | 0.58 | 75.42 | 91.67 | 59.17 | 0.84 | 0.53 |

| Classifier | CV Accuracy (%) | CV Sensitivity (%) | CV Specificity (%) | CV AUC | CV MCC | Test Accuracy (%) | Test Sensitivity (%) | Test Specificity (%) | Test AUC | Test MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 78.30 | 77.85 | 78.75 | 0.87 | 0.56 | 72.64 | 90.00 | 55.28 | 0.85 | 0.48 |
| XGB | 75.94 | 78.12 | 73.75 | 0.86 | 0.51 | 72.22 | 87.78 | 56.67 | 0.83 | 0.46 |
| GB | 79.83 | 80.28 | 79.37 | 0.89 | 0.59 | 73.89 | 91.94 | 55.83 | 0.86 | 0.51 |
| ET | 77.08 | 76.60 | 77.57 | 0.86 | 0.54 | 75.00 | 89.17 | 60.83 | 0.84 | 0.52 |
| AdaBoost | 74.20 | 76.18 | 72.22 | 0.82 | 0.48 | 72.64 | 86.94 | 58.33 | 0.82 | 0.47 |
| LightGBM | 75.94 | 76.81 | 75.07 | 0.85 | 0.51 | 74.44 | 89.44 | 59.44 | 0.84 | 0.51 |
| Voting (GB + XGB) | 78.78 | 80.00 | 77.57 | 0.88 | 0.57 | 73.06 | 90.28 | 55.83 | 0.85 | 0.49 |

| Classifier | CV Accuracy (%) | CV Sensitivity (%) | CV Specificity (%) | CV AUC | CV MCC | Test Accuracy (%) | Test Sensitivity (%) | Test Specificity (%) | Test AUC | Test MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 74.20 | 75.56 | 72.85 | 0.82 | 0.48 | 71.11 | 81.11 | 61.11 | 0.78 | 0.43 |
| XGB | 82.05 | 83.82 | 80.28 | 0.90 | 0.64 | 79.11 | 92.78 | 59.44 | 0.88 | 0.55 |
| GB | 76.77 | 80.07 | 73.47 | 0.85 | 0.53 | 73.75 | 87.78 | 59.72 | 0.81 | 0.49 |
| ET | 75.69 | 79.37 | 72.01 | 0.85 | 0.51 | 69.44 | 85.00 | 53.89 | 0.81 | 0.40 |
| AdaBoost | 77.57 | 80.42 | 74.72 | 0.87 | 0.55 | 71.53 | 88.33 | 54.72 | 0.83 | 0.45 |
| LightGBM | 79.55 | 81.81 | 77.29 | 0.88 | 0.59 | 69.86 | 87.78 | 51.94 | 0.81 | 0.42 |
| Voting (GB + XGB) | 81.18 | 83.54 | 78.82 | 0.90 | 0.62 | 75.69 | 92.22 | 59.17 | 0.87 | 0.54 |

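
The "Voting" rows in the tables above combine the GB and XGB classifiers. The text available here does not state whether hard or soft voting was used, so the following is only a minimal sketch that assumes soft voting, i.e. averaging the two models' positive-class probabilities and thresholding the mean. The function name `soft_vote`, the 0.5 threshold, and the probability values are all hypothetical and chosen for illustration.

```python
def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two models' positive-class probabilities and threshold the mean.

    Assumption: equal weights for both models (not confirmed by the source).
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("probability lists must have the same length")
    averaged = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical per-compound probabilities from two models (illustrative only).
gb_probs = [0.9, 0.4, 0.2, 0.6]
xgb_probs = [0.7, 0.7, 0.1, 0.3]

avg_probs, voted = soft_vote(gb_probs, xgb_probs)
print(voted)
```

With soft voting, a confident prediction from one model can outweigh a borderline prediction from the other, which is one reason an ensemble may behave differently from either member alone.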
| Metric | Value (XGB model) |
|---|---|
| Precision | 0.67 |
| Recall | 0.92 |
| F1-score | 0.77 |
| PR-AUC | 0.88 |
| AUC (95% CI) | 0.88 [0.85, 0.90] |
| MCC (95% CI) | 0.55 [0.49, 0.61] |
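
The threshold-based metrics reported in the tables above (accuracy, sensitivity or recall, specificity, precision, F1-score, and MCC) can all be derived from the four confusion-matrix counts. As a minimal sketch of those standard definitions, the snippet below computes them from hypothetical counts; the function name `binary_metrics` and the example counts are assumptions for illustration and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall, true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced set of 100 actives and 100 inactives.
example = binary_metrics(tp=90, fp=40, tn=60, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

On a class-balanced set such as this hypothetical one, accuracy equals the mean of sensitivity and specificity, which gives a quick consistency check when reading rows that report all three.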
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Balaji, P.D.; Selvam, S.; Thiagarajan, A.; Sohn, H.; Madhavan, T. XGBPred-ACSM: A Hybrid Descriptor-Driven XGBoost Framework for Anticancer Small Molecule Prediction. Pharmaceuticals 2026, 19, 635. https://doi.org/10.3390/ph19040635
