Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Sample and Power Analysis
2.2. Sample Collection and Storage
2.3. Metabolomics Profiling
2.4. Data Analyses
2.5. Machine Learning Pipeline
2.6. Hyperparameter Optimization
- LightGBM: learning_rate = 0.05; num_leaves = 31; max_depth = 7; n_estimators = 500.
- AdaBoost: base estimator = decision stump (a depth-1 decision tree); n_estimators = 100; learning_rate = 0.8.
- Random Forest: n_estimators = 300; max_depth = 10; max_features = "sqrt".
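The settings listed above can be written down as plain parameter dictionaries and passed to the corresponding estimators. The following is a minimal sketch, not the authors' code: it assumes the scikit-learn and LightGBM Python packages, the helper name `build_models` and the `random_state` argument are hypothetical, and the `estimator` keyword of `AdaBoostClassifier` assumes scikit-learn 1.2 or later (older releases call it `base_estimator`).

```python
# Hyperparameters exactly as listed in Section 2.6.
LIGHTGBM_PARAMS = {
    "learning_rate": 0.05,
    "num_leaves": 31,
    "max_depth": 7,
    "n_estimators": 500,
}
ADABOOST_PARAMS = {"n_estimators": 100, "learning_rate": 0.8}
RANDOM_FOREST_PARAMS = {
    "n_estimators": 300,
    "max_depth": 10,
    "max_features": "sqrt",
}


def build_models(random_state=0):
    """Instantiate the three classifiers with the listed settings.

    `random_state` is an assumption added for reproducibility; the section
    does not report a seed. Libraries are imported here so the parameter
    dictionaries can be inspected without them installed.
    """
    from lightgbm import LGBMClassifier
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    # A decision stump is a decision tree limited to a single split.
    stump = DecisionTreeClassifier(max_depth=1)

    return {
        "LightGBM": LGBMClassifier(random_state=random_state, **LIGHTGBM_PARAMS),
        "AdaBoost": AdaBoostClassifier(
            estimator=stump, random_state=random_state, **ADABOOST_PARAMS
        ),
        "Random Forest": RandomForestClassifier(
            random_state=random_state, **RANDOM_FOREST_PARAMS
        ),
    }
```

Calling `build_models()` returns unfitted estimators keyed by model name, each of which can then be trained with its usual `fit(X, y)` method.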
2.7. Class Imbalance Mitigation
2.8. SHAP Analysis for Interpretability
3. Results
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Methods | Accuracy | Sensitivity | Specificity | F1-Score | AUC |
---|---|---|---|---|---|
MI | 0.993240 | 0.997359 | 0.978788 | 0.995667 | 0.990043 |
sPLS | 0.992567 | 0.998225 | 0.972727 | 0.995265 | 0.986991 |
MOFS | 0.995934 | 0.999091 | 0.984848 | 0.997399 | 0.993939 |
Boruta | 0.995260 | 0.999091 | 0.981818 | 0.996976 | 0.991970 |
Metric/Model | LightGBM | AdaBoost | Random Forest |
---|---|---|---|
Accuracy | 0.866 (0.819–0.913) | 0.837 (0.786–0.888) | 0.802 (0.747–0.857) |
F1-Score | 0.870 (0.823–0.916) | 0.839 (0.788–0.890) | 0.804 (0.749–0.859) |
Sensitivity | 0.891 (0.813–0.944) | 0.851 (0.767–0.914) | 0.812 (0.722–0.883) |
Specificity | 0.842 (0.756–0.907) | 0.822 (0.733–0.891) | 0.792 (0.700–0.866) |
AUC | 0.916 (0.866–0.965) | 0.891 (0.836–0.946) | 0.861 (0.802–0.921) |
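To make the metrics in the table above concrete, the sketch below derives accuracy, sensitivity, specificity, and F1-score from confusion-matrix counts and attaches a normal-approximation (Wald) interval to the accuracy. Several things here are assumptions rather than facts from the article: the parenthesized ranges are read as 95% confidence intervals, the counts (TP = 90, FN = 11, TN = 85, FP = 16) are hypothetical values chosen only because they round to the reported LightGBM figures, and the interval method is a guess. The reported sensitivity and specificity ranges are asymmetric, so they were presumably computed with a different method that this sketch does not reproduce.

```python
import math


def metrics_from_counts(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a proportion, clipped to [0, 1]."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical counts (not reported in the article), picked to be
# consistent with the rounded LightGBM column of the table.
counts = {"tp": 90, "fn": 11, "tn": 85, "fp": 16}

metrics = metrics_from_counts(**counts)
n_total = sum(counts.values())
acc_low, acc_high = wald_interval(metrics["accuracy"], n_total)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"accuracy interval: {acc_low:.3f} to {acc_high:.3f}")
```

With these assumed counts the accuracy comes out at 0.866 with an interval of 0.819 to 0.913, which agrees with the LightGBM row; that agreement is suggestive but does not confirm how the published intervals were obtained.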
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Published by MDPI on behalf of the Lithuanian University of Health Sciences. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Guldogan, E.; Yagin, F.H.; Ucuzal, H.; Alzakari, S.A.; Alhussan, A.A.; Ardigò, L.P. Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models. Medicina 2025, 61, 1112. https://doi.org/10.3390/medicina61061112