A Comprehensive Machine Learning Approach for COVID-19 Target Discovery in the Small-Molecule Metabolome
Abstract
1. Introduction
- We propose a stacking-based ensemble learning approach, evaluated with five-fold cross-validation on the publicly available LC–MS/MS dataset of the nasopharyngeal metabolome of COVID-19;
- Top features were selected using the Random Forest algorithm, and statistical analyses (the chi-square test, t-test, and rank-sum test) were performed;
- The proposed method was applied to the following classification scenarios to discover significant metabolites in each case: (A) Control vs. RSV, (B) Control vs. Influenza A, (C) Control vs. COVID-19, (D) Control vs. All respiratory viruses, and (E) COVID-19 vs. Influenza A/RSV;
- SHAP analysis was used to evaluate the contribution of the significant features in each case and to identify the most important metabolites.
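The five-fold cross-validation named in the first contribution can be illustrated with a minimal sketch. The splitter below is not the authors' code: the stratification by class, the function name `stratified_k_fold`, the seed, and the label counts are all assumptions made for illustration. It only shows how each sample ends up in exactly one held-out fold, which is also the property a stacking ensemble relies on when base-model predictions on held-out folds are passed to the meta-learner.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class shares similar per fold."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical labels (made-up counts): 0 = control, 1 = respiratory virus.
labels = [0] * 20 + [1] * 80
splits = list(stratified_k_fold(labels, k=5, seed=42))
```

With these made-up counts, every test fold holds 20 samples, 4 of which are controls, so the class imbalance is the same in each fold.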
2. Related Works
3. Methods
3.1. Dataset Description
3.2. Statistical Analysis
3.3. Dataset Preprocessing
3.4. Classification Model Development
3.4.1. Random Forest Classifier
3.4.2. Linear Discriminant Analysis
3.4.3. XGBoost Classifier
3.4.4. Logistic Regression
3.4.5. ExtraTreesClassifier
3.4.6. KNeighborsClassifier
3.4.7. ElasticNet
3.4.8. Stacking Ensemble Approach
3.5. Evaluation Metrics
3.6. Model Explainability
4. Results and Discussion
4.1. Feature Ranking
4.2. Classification Model Results
4.3. Model Explainability According to SHAP Values
4.4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Gallo, O.; Locatello, L.G.; Mazzoni, A.; Novelli, L.; Annunziato, F. The central role of the nasal microenvironment in the transmission, modulation, and clinical progression of SARS-CoV-2 infection. Mucosal Immunol. 2021, 14, 305–316.
- Palese, P. Influenza: Old and new threats. Nat. Med. 2004, 10, S82–S87.
- Centers for Disease Control and Prevention. Symptoms of COVID-19. Available online: https://www.cdc.gov/coronavirus/2019-ncov/index.html (accessed on 23 May 2023).
- WHO. Influenza. Available online: https://www.who.int/teams/health-product-policy-and-standards/standards-and-specifications/norms-and-standards/vaccine-standardization/influenza (accessed on 23 May 2023).
- Jha, A.; Jarvis, H.; Fraser, C.; Openshaw, P. Respiratory syncytial virus. In SARS, MERS and other Viral Lung Infections; European Respiratory Society: Lausanne, Switzerland, 2016.
- Schreckenberger, P.C.; McAdam, A.J. Point-counterpoint: Large multiplex PCR panels should be first-line tests for detection of respiratory and intestinal pathogens. J. Clin. Microbiol. 2015, 53, 3110–3115.
- Somerville, L.K.; Ratnamohan, V.M.; Dwyer, D.E.; Kok, J. Molecular diagnosis of respiratory viruses. Pathology 2015, 47, 243–249.
- Tan, S.K.; Burgener, E.B.; Waggoner, J.J.; Gajurel, K.; Gonzalez, S.; Chen, S.F.; Pinsky, B.A. Molecular and culture-based bronchoalveolar lavage fluid testing for the diagnosis of cytomegalovirus pneumonitis. In Open Forum Infectious Diseases; Oxford University Press: New York, NY, USA, 2015; p. ofv212.
- Phan, T. Genetic diversity and evolution of SARS-CoV-2. Infect. Genet. Evol. 2020, 81, 104260.
- Haljasmägi, L.; Salumets, A.; Rumm, A.P.; Jürgenson, M.; Krassohhina, E.; Remm, A.; Sein, H.; Kareinen, L.; Vapalahti, O.; Sironen, T. Longitudinal proteomic profiling reveals increased early inflammation and sustained apoptosis proteins in severe COVID-19. Sci. Rep. 2020, 10, 20533.
- Valdés, A.; Moreno, L.O.; Rello, S.R.; Orduña, A.; Bernardo, D.; Cifuentes, A. Metabolomics study of COVID-19 patients in four different clinical stages. Sci. Rep. 2022, 12, 1650.
- Antonelli, G. Emerging new technologies in clinical virology. Clin. Microbiol. Infect. 2013, 19, 8–9.
- Mancone, C.; Ciccosanti, F.; Montaldo, C.; Perdomo, A.; Piacentini, M.; Alonzi, T.; Fimia, G.M.; Tripodi, M. Applying proteomic technology to clinical virology. Clin. Microbiol. Infect. 2013, 19, 23–28.
- Burke, T.W.; Henao, R.; Soderblom, E.; Tsalik, E.L.; Thompson, J.W.; McClain, M.T.; Nichols, M.; Nicholson, B.P.; Veldman, T.; Lucas, J.E. Nasopharyngeal protein biomarkers of acute respiratory virus infection. EBioMedicine 2017, 17, 172–181.
- Nalbantoglu, S. Metabolomics: Basic principles and strategies. Mol. Med. 2019, 10, 137–150.
- Bennet, S.; Kaufmann, M.; Takami, K.; Sjaarda, C.; Douchant, K.; Moslinger, E.; Wong, H.; Reed, D.E.; Ellis, A.K.; Vanner, S. Small-molecule metabolome identifies potential therapeutic targets against COVID-19. Sci. Rep. 2022, 12, 10029.
- Shen, B.; Yi, X.; Sun, Y.; Bi, X.; Du, J.; Zhang, C.; Quan, S.; Zhang, F.; Sun, R.; Qian, L. Proteomic and metabolomic characterization of COVID-19 patient sera. Cell 2020, 182, 59–72.
- Bardanzellu, F.; Fanos, V. Metabolomics, Microbiomics, machine learning during the COVID-19 pandemic. Pediatr. Allergy Immunol. 2022, 33, 86–88.
- Sindelar, M.; Stancliffe, E.; Schwaiger-Haber, M.; Anbukumar, D.S.; Adkins-Travis, K.; Goss, C.W.; O’Halloran, J.A.; Mudd, P.A.; Liu, W.-C.; Albrecht, R.A. Longitudinal metabolomics of human plasma reveals prognostic markers of COVID-19 disease severity. Cell Rep. Med. 2021, 2, 100369.
- de Fátima Cobre, A.; Surek, M.; Stremel, D.P.; Fachi, M.M.; Borba, H.H.L.; Tonin, F.S.; Pontarolo, R. Diagnosis and prognosis of COVID-19 employing analysis of patients’ plasma and serum via LC-MS and machine learning. Comput. Biol. Med. 2022, 146, 105659.
- Liebal, U.W.; Phan, A.N.; Sudhakar, M.; Raman, K.; Blank, L.M. Machine learning applications for mass spectrometry-based metabolomics. Metabolites 2020, 10, 243.
- Galal, A.; Talal, M.; Moustafa, A. Applications of machine learning in metabolomics: Disease modeling and classification. Front. Genet. 2022, 13, 1017340.
- Beirnaert, C.; Peeters, L.; Meysman, P.; Bittremieux, W.; Foubert, K.; Custers, D.; Van der Auwera, A.; Cuykx, M.; Pieters, L.; Covaci, A. Using expert driven machine learning to enhance dynamic metabolomics data analysis. Metabolites 2019, 9, 54.
- Mendez, K.M.; Reinke, S.N.; Broadhurst, D.I. A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification. Metabolomics 2019, 15, 1–15.
- Kantz, E.D.; Tiwari, S.; Watrous, J.D.; Cheng, S.; Jain, M. Deep neural networks for classification of LC-MS spectral peaks. Anal. Chem. 2019, 91, 12407–12413.
- Delafiori, J.; Navarro, L.C.; Siciliano, R.F.; de Melo, G.C.; Busanello, E.N.B.; Nicolau, J.C.; Sales, G.M.; de Oliveira, A.N.; Val, F.F.A.; de Oliveira, D.N. Covid-19 automated diagnosis and risk assessment through metabolomics and machine learning. Anal. Chem. 2021, 93, 2471–2479.
- Hogan, C.A.; Rajpurkar, P.; Sowrirajan, H.; Phillips, N.A.; Le, A.T.; Wu, M.; Garamani, N.; Sahoo, M.K.; Wood, M.L.; Huang, C. Nasopharyngeal metabolomics and machine learning approach for the diagnosis of influenza. EBioMedicine 2021, 71, 103546.
- Hasan, M.R.; Suleiman, M.; Perez-Lopez, A. Metabolomics in the Diagnosis and Prognosis of COVID-19. Front. Genet. 2021, 12, 721556.
- Oropeza-Valdez, J.J.; Padron-Manrique, C.; Vázquez-Jiménez, A.; Soberon, X.; Resendis-Antonio, O. Exploring metabolic anomalies in COVID-19 and post-COVID-19: A machine learning approach with explainable artificial intelligence. Front. Mol. Biosci. 2024, 11, 1429281.
- Lepoittevin, M.; Remaury, Q.B.; Lévêque, N.; Thille, A.W.; Brunet, T.; Salaun, K.; Catroux, M.; Pellerin, L.; Hauet, T.; Thuillier, R. Advantages of Metabolomics-Based Multivariate Machine Learning to Predict Disease Severity: Example of COVID. Int. J. Mol. Sci. 2024, 25, 12199.
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- Rahman, T.; Al-Ishaq, F.A.; Al-Mohannadi, F.S.; Mubarak, R.S.; Al-Hitmi, M.H.; Islam, K.R.; Khandakar, A.; Hssain, A.A.; Al-Madeed, S.; Zughaier, S.M. Mortality prediction utilizing blood biomarkers to predict the severity of COVID-19 using machine learning technique. Diagnostics 2021, 11, 1582.
- Bridge, P.D.; Sawilowsky, S.S. Increasing physicians’ awareness of the impact of statistics on research outcomes: Comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research. J. Clin. Epidemiol. 1999, 52, 229–235.
- Chowdhury, M.E.; Rahman, T.; Khandakar, A.; Al-Madeed, S.; Zughaier, S.M.; Doi, S.A.; Hassen, H.; Islam, M.T. An early warning tool for predicting mortality risk of COVID-19 patients using machine learning. Cogn. Comput. 2021, 16, 1778–1793.
- Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Ferreira, P.; Le, D.C.; Zincir-Heywood, N. Exploring feature normalization and temporal information for machine learning based insider threat detection. In Proceedings of the 2019 15th International Conference on Network and Service Management (CNSM), Halifax, NS, Canada, 21–25 October 2019; pp. 1–7.
- Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222.
- Tharwat, A.; Gaber, T.; Ibrahim, A.; Hassanien, A.E. Linear discriminant analysis: A detailed tutorial. AI Commun. 2017, 30, 169–190.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T. Xgboost: Extreme Gradient Boosting, R package version 0.4-2. 2015. Available online: https://cran.ms.unimelb.edu.au/web/packages/xgboost/vignettes/xgboost.pdf (accessed on 20 May 2023).
- Nusinovici, S.; Tham, Y.C.; Yan, M.Y.C.; Ting, D.S.W.; Li, J.; Sabanayagam, C.; Wong, T.Y.; Cheng, C.-Y. Logistic regression was as good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 2020, 122, 56–69.
- Sharaff, A.; Gupta, H. Extra-tree classifier with metaheuristics approach for email classification. In Advances in Computer Communication and Computational Sciences: Proceedings of IC4S 2018; Springer: Singapore, 2019; pp. 189–197.
- Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In Proceedings of the On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, 3–7 November 2003; pp. 986–996.
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320.
- Dietterich, T.G. Ensemble methods in machine learning. In Proceedings of the International workshop on multiple classifier systems, Cagliari, Italy, 21–23 June 2000; pp. 1–15.
- Hossain, R.; Timmer, D. Machine learning model optimization with hyper parameter tuning approach. Glob. J. Comput. Sci. Technol. D Neural Artif. Intell. 2021, 21, 31.
- Tawsifur, R.; Khandakar, A.; Abir, F.F.; Faisal, M.A.A.; Hossain, M.S.; Podder, K.K.; Abbas, T.O.; Alam, M.F.; Kashem, S.B.; Islam, M.T. QCovSML: A reliable COVID-19 detection system using CBC biomarkers by a stacking machine learning model. Comput. Biol. Med. 2022, 143, 105284.
- Kim, Y.; Kim, Y. Explainable heat-related mortality with random forest and SHapley Additive exPlanations (SHAP) models. Sustain. Cities Soc. 2022, 79, 103677.
- Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874.
- Ogra, P.L. Respiratory syncytial virus: The virus, the disease and the immune response. Paediatr. Respir. Rev. 2004, 5, S119–S126.
- Suarez, D.L. Influenza A virus. Anim. Influenza 2016, 1–30.
- Abu-Farha, M.; Thanaraj, T.A.; Qaddoumi, M.G.; Hashem, A.; Abubaker, J.; Al-Mulla, F. The role of lipid metabolism in COVID-19 virus infection and as a drug target. Int. J. Mol. Sci. 2020, 21, 3544.
- Frank, M.; Drikakis, D.; Charissis, V. Machine-learning methods for computational science and engineering. Computation 2020, 8, 15.
Control vs. All Respiratory Viruses
Feature Name | Control | Respiratory Virus | Total | Technique | p-Value |
---|---|---|---|---|---|
Sex | 25% 75% 0% | 42.77% 47.59% 9.63% | 53.33% 39.04% 7.62% | Chi-square test | <0.05 |
LYSOC18.2 | 0.86 ± 1.05 0.8725 | 1.57 ± 0.97 1.4427 | 1.42 ± 1.03 1.2314 | Rank-sum test | <0.0001 |
Ile | 19.57 ± 15.78 15.50 | 69.76 ± 42.48 66.90 | 59.24 ± 43.54 53.30 | Rank-sum test | <0.0001 |
Met.SO | 1.27 ± 1.97 0.5445 | 6.74 ± 6.29 5.90 | 5.59 ± 6.08 5.02 | Rank-sum test | <0.0001 |
Asp | 54.54 ± 25.06 49.350 | 139.60 ± 58.74 132.50 | 121.78 ± 63.70 116.00 | T-test | <0.0001 |
Phe | 24.54 ± 16.97 21.40 | 85.80 ± 44.40 84.05 | 72.97 ± 47.33 70.40 | Rank-sum test | <0.0001 |
Tyr | 23.24 ± 12.52 22.60 | 72.33 ± 43.25 62.95 | 62.04 ± 43.70 54.90 | T-test | <0.0001 |
Kynurenine | 3.88 ± 2.72 6.224 | 6.85 ± 7.05 5.190 | 6.22 ± 6.50 5.3550 | Rank-sum test | 0.0067 |
Val | 32.43 ± 29.98 26.250 | 122.04 ± 89.86 111.00 | 103.26 ± 88.86 85.85 | Rank-sum test | <0.0001 |
Citric acid | 3.26 ± 1.68 3.840 | 1.76 ± 4.21 1.070 | 2.08 ± 3.86 1.28 | T-test | 0.02169 |
Arg | 42.75 ± 24.27 36.150 | 134.68 ± 73.22 132.00 | 115.42 ± 75.90 92.75 | Rank-sum test | <0.0001 |
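Most rows in the table above rely on the rank-sum test. As a rough illustration of what that test computes, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation in plain Python. It is not the authors' analysis code: the function names and the sample values are made up, and the sketch omits the tie correction and continuity correction that statistical packages usually apply, so its p-values are only approximate for small or heavily tied samples.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test (normal approximation, no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])                      # rank sum of the first group
    mean_w = n_a * (n_a + n_b + 1) / 2        # expected rank sum under H0
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up concentrations for two small groups (illustration only).
control = [1.0, 2.0, 3.0]
infected = [4.0, 5.0, 6.0]
z, p = rank_sum_test(control, infected)
```

A negative `z` here means the first group tends to have lower values than the second, which matches the direction of most metabolites in the table.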
Study | Model | Cases | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|---|
Bennet et al. [16] | Supervised machine learning | Control vs. all respiratory viruses | 96% | 98% | 86% |
Bennet et al. [16] | Supervised machine learning | COVID-19 vs. influenza A/RSV | 85% | 74% | 90% |
Stacking-Based Ensemble Approach | RandomForest (With SMOTE) | Control vs. all respiratory viruses | 98.10% | 98.10% | 94.48% |
Stacking-Based Ensemble Approach | RandomForest (Without SMOTE) | Control vs. all respiratory viruses | 96.67% | 96.66% | 92.44% |
Stacking-Based Ensemble Approach | Logistic Regression (With SMOTE) | COVID-19 vs. influenza A/RSV | 86.14% | 86.14% | 80.3% |
Stacking-Based Ensemble Approach | Logistic Regression (Without SMOTE) | COVID-19 vs. influenza A/RSV | 84.94% | 84.94% | 77.86% |
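The three metrics compared in the table above follow from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name `binary_metrics` and the counts are hypothetical, and the sketch does not reproduce how the authors averaged their scores across folds or classes.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: positives correctly predicted, fn: positives missed,
    tn: negatives correctly predicted, fp: negatives flagged as positive.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,   # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=8, fn=2, tn=9, fp=1)
```

Reporting sensitivity and specificity next to accuracy matters when classes are imbalanced, because a model can reach high accuracy while missing most samples of the smaller class.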
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Sumon, M.S.I.; Hossain, M.S.A.; Al-Sulaiti, H.; Yassine, H.M.; Chowdhury, M.E.H. A Comprehensive Machine Learning Approach for COVID-19 Target Discovery in the Small-Molecule Metabolome. Metabolites 2025, 15, 44. https://doi.org/10.3390/metabo15010044