Explainable Machine Learning for Heat-Related Illness Prediction: An XGBoost–SHAP Approach Using Korean Meteorological Data
Abstract
1. Introduction
2. Related Work
3. Materials and Methods
3.1. Dataset
3.1.1. Data Preprocessing
- Precipitation: Missing entries were assumed to indicate no rainfall and imputed as 0 mm, following standard meteorological practices [3].
- Wind speed, humidity, and solar radiation: Missing values were substituted with the overall training-period mean to prevent bias or artificial variance.
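The two imputation rules above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the field names (`precip_mm`, `wind_ms`, `humidity_pct`, `solar_mj`) and the toy records are hypothetical, and reusing the training-period means for test-period gaps is an assumption of the sketch.

```python
from statistics import fmean

# Hypothetical field names; the paper does not publish its data schema.
MEAN_IMPUTED = ("wind_ms", "humidity_pct", "solar_mj")


def training_means(rows):
    """Mean of each variable over the observed training-period values."""
    return {
        key: fmean(r[key] for r in rows if r[key] is not None)
        for key in MEAN_IMPUTED
    }


def impute(rows, means):
    """Return copies of the rows with both imputation rules applied."""
    filled = []
    for row in rows:
        new = dict(row)
        # Rule 1: a missing precipitation entry is treated as no rainfall.
        if new["precip_mm"] is None:
            new["precip_mm"] = 0.0
        # Rule 2: other gaps take the training-period mean.
        for key in MEAN_IMPUTED:
            if new[key] is None:
                new[key] = means[key]
        filled.append(new)
    return filled


# Toy records (hypothetical values); None marks a missing observation.
train_rows = [
    {"precip_mm": 0.0, "wind_ms": 2.0, "humidity_pct": 60.0, "solar_mj": 20.0},
    {"precip_mm": None, "wind_ms": 4.0, "humidity_pct": None, "solar_mj": 22.0},
    {"precip_mm": 12.5, "wind_ms": None, "humidity_pct": 80.0, "solar_mj": 24.0},
]
test_rows = [
    {"precip_mm": None, "wind_ms": None, "humidity_pct": 75.0, "solar_mj": None},
]

means = training_means(train_rows)        # computed from the training period only
train_filled = impute(train_rows, means)
test_filled = impute(test_rows, means)    # assumption: test gaps reuse training means
```

Computing the means once on the training period and reusing them keeps test-period observations out of the fill values.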
3.1.2. Definition of Variables
3.2. Exploratory Analysis and Baseline Benchmarking
3.2.1. Exploratory Data Analysis (EDA)
3.2.2. Baseline Benchmarking
3.3. Model Construction and Enhancement
3.3.1. Balancing Data
- Cost-sensitive learning using the scale_pos_weight parameter, which up-weights minority (positive) cases in the loss function so that misclassifying them is penalized more heavily. The parameter was set to 2.8, approximating the negative-to-positive HRI sample ratio in the training dataset (2762:990 ≈ 2.79).
- Synthetic oversampling using the ROSE algorithm, which generates balanced synthetic samples via smoothed bootstrapping.
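Both balancing strategies can be illustrated with a short standard-library sketch. The class counts are those reported for the training set; everything else is an assumption made for illustration: the function name `rose_like_sample`, the toy data, and the normal-reference bandwidth formula are not taken from the paper, and the sketch is not the ROSE package itself.

```python
import random
import statistics

# Training-set class counts reported in the paper.
N_NEG, N_POS = 2762, 990

# Cost-sensitive weight: ratio of negative to positive samples.
scale_pos_weight = N_NEG / N_POS  # about 2.79, reported as 2.8


def rose_like_sample(X, y, n_out, rng, shrink=1.0):
    """Smoothed bootstrap: draw synthetic rows around randomly chosen seeds.

    X is a list of feature rows and y a list of 0/1 labels. Each class is
    picked with probability 1/2, a seed row is drawn uniformly from that
    class, and Gaussian noise with a per-feature bandwidth is added.
    """
    d = len(X[0])
    by_class = {c: [row for row, lab in zip(X, y) if lab == c] for c in (0, 1)}
    bandwidth = {}
    for c, rows in by_class.items():
        n = len(rows)
        # Normal-reference bandwidth factor (an assumption of this sketch).
        factor = shrink * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidth[c] = [factor * statistics.stdev(col) for col in zip(*rows)]
    X_new, y_new = [], []
    for _ in range(n_out):
        c = rng.choice((0, 1))
        seed = rng.choice(by_class[c])
        X_new.append([rng.gauss(v, h) for v, h in zip(seed, bandwidth[c])])
        y_new.append(c)
    return X_new, y_new


# Toy imbalanced data (hypothetical): 40 negatives, 10 positives, 2 features.
rng = random.Random(0)
X_toy = [[rng.gauss(25, 3), rng.gauss(60, 10)] for _ in range(40)]
X_toy += [[rng.gauss(32, 2), rng.gauss(75, 8)] for _ in range(10)]
y_toy = [0] * 40 + [1] * 10
X_bal, y_bal = rose_like_sample(X_toy, y_toy, n_out=len(y_toy), rng=rng)
```

Because each synthetic row picks its class with probability 1/2, the output is only approximately balanced, and its size can be kept equal to that of the original sample.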
3.3.2. Algorithmic Enhancement
3.4. Feature Importance Analysis
3.4.1. Explainable Artificial Intelligence (XAI)
3.4.2. Shapley Additive exPlanations (SHAP)
3.5. Performance Evaluations
4. Results
4.1. Baseline Characteristics and Correlation Analysis
4.2. Heat-Related Illness (HRI) Classification Performance
- Cost-sensitive weighting using the scale_pos_weight parameter;
- Synthetic oversampling using the ROSE algorithm.
4.3. Calibration and Explainability Analysis
5. Discussion
5.1. Interpretation of Key Findings
5.2. Implications for Public Health and Policy
5.3. Contribution to the Field
5.4. Limitations and Future Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| HRI | Heat-related illnesses |
| HI | Heat index |
| XAI | Explainable artificial intelligence |
| ML | Machine learning |
| RFs | Random forests |
| SVM | Support vector machine |
| k-NN | k-nearest neighbors |
| XGBoost | Extreme gradient boosting |
| SHAP | Shapley additive explanations |
| ROSE | Random over-sampling examples |
| ROC | Receiver operating characteristic |
| AUC | Area under the curve |
References
- IPCC. Climate Change 2023: Synthesis Report—Summary for Policymakers; IPCC: Geneva, Switzerland, 2023; Available online: https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_SPM.pdf (accessed on 31 October 2025).
- CDC/NIOSH. Heat-Related Illnesses (Overview and Types). 2024. Available online: https://www.cdc.gov/niosh/heat-stress/about/illnesses.html (accessed on 31 October 2025).
- Guidelines for the Quality Control and Statistical Management of Meteorological Observation Data. Korea Meteorological Administration. 2025, pp. 1–47. Available online: https://data.kma.go.kr/resources/images/publication/기상관측데이터 품질 통계 관리 지침(2025.9).pdf (accessed on 31 October 2025).
- NWS/WPC. The Heat Index Equation. 2022. Available online: https://www.wpc.ncep.noaa.gov/html/heatindex_equation.shtml (accessed on 31 October 2025).
- Heaviside, C.; Macintyre, H.; Vardoulakis, S. The Urban Heat Island: Implications for Health in a Changing Environment. Curr. Environ. Health Rep. 2017, 4, 296–305.
- Yoo, C.; Im, J.; Weng, Q.; Cho, D.; Kang, E.; Shin, Y. Diurnal Urban Heat Risk Assessment Using Extreme Air Temperatures and Real-Time Population Data in Seoul. iScience 2023, 26, 108123.
- Lee, J.; Min, J.; Lee, W.; Sun, K.; Cha, W.C.; Park, C.; Kang, C.; Yang, J.; Kwon, D.; Kwag, Y.; et al. Timely Accessibility to Healthcare Resources and Heatwave-Related Mortality in 7 Major Cities of South Korea: A Two-Stage Approach with Principal Component Analysis. Lancet Reg. Health-West. Pac. 2024, 45, 101022.
- Park, J.; Kim, J. Defining Heatwave Thresholds Using an Inductive Machine Learning Approach. PLoS ONE 2018, 13, e0206872.
- Chae, Y.; Park, J. Analysis on Effectiveness of Impact Based Heatwave Warning Considering Severity and Likelihood of Health Impacts in Seoul, Korea. Int. J. Environ. Res. Public Health 2021, 18, 2380.
- Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
- Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432.
- Boudreault, J.; Ruf, A.; Campagna, C.; Chebana, F. Multi-Region Models Built with Machine and Deep Learning for Predicting Several Heat-Related Health Outcomes. Sustain. Cities Soc. 2024, 115, 105785.
- Kim, Y.; Kim, Y. Explainable Heat-Related Mortality with Random Forest and SHapley Additive exPlanations (SHAP) Models. Sustain. Cities Soc. 2022, 79, 103677.
- Xu, H.; Guo, S.; Shi, X.; Wu, Y.; Pan, J.; Gao, H.; Tang, Y.; Han, A. Machine Learning-Based Analysis and Prediction of Meteorological Factors and Urban Heatstroke Diseases. Front. Public Health 2024, 12, 1420608.
- Kan, J.-C.; Vieira Passos, M.; Destouni, G.; Barquet, K.; Ferreira, C.S.S.; Kalantari, Z. Seasonal Heatwave Forecasting with Explainable Machine Learning and Remote Sensing Data. Stoch. Environ. Res. Risk Assess. 2025, 39, 3333–3352.
- Shafiq, F.; Zafar, A.; Khan, M.U.G.; Iqbal, S.; Albesher, A.S.; Asghar, M.N. Extreme Heat Prediction through Deep Learning and Explainable AI. PLoS ONE 2025, 20, e0316367.
- Lee, Y.; Cho, D.; Im, J.; Yoo, C.; Lee, J.; Ham, Y.-G.; Lee, M.-I. Unveiling Teleconnection Drivers for Heatwave Prediction in South Korea Using Explainable Artificial Intelligence. npj Clim. Atmos. Sci. 2024, 7, 176.
- Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Neural Inf. Process. Syst. 2017, 30.
- Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat. Mach. Intell. 2020, 2, 56–67.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016; Association for Computing Machinery: New York, NY, USA; pp. 785–794.
- Kuhn, M. Building Predictive Models in R Using the Caret Package. J. Stat. Softw. 2008, 28, 1–26.
- Korea Disease Control and Prevention Agency. Attributable All-Cause Mortality during Heatwaves in South Korea, 2006–2018. 2019. Available online: https://www.kdca.go.kr (accessed on 31 October 2025).
- Korea Meteorological Administration. Main Site/Services. 2025. Available online: https://www.kma.go.kr (accessed on 31 October 2025).
- Yang, C.; Fridgeirsson, E.A.; Kors, J.A.; Reps, J.M.; Rijnbeek, P.R. Impact of Random Oversampling and Random Undersampling on the Performance of Prediction Models Developed Using Observational Health Data. J. Big Data 2024, 11, 7.
- Vickers, A.J.; Elkin, E.B. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med. Decis. Mak. 2006, 26, 565–574.
- WHO Regional Office for Europe. Planning Heat–Health Action. 2025. Available online: https://www.who.int/europe/activities/planning-heat-health-action (accessed on 31 October 2025).
- Hood, L.; Friend, S.H. Predictive, Personalized, Preventive, Participatory (P4) Cancer Medicine. Nat. Rev. Clin. Oncol. 2011, 8, 184–187.
- Vickers, A.J.; Calster, B.V.; Steyerberg, E.W. Net Benefit Approaches to the Evaluation of Prediction Models, Molecular Markers, and Diagnostic Tests. BMJ 2016, 352, i6.
- Thiel, J.; Seim, A.; Stephan, B.; Sedlmayr, M.; Prochaska, E.; Henke, E. The Spectrum of Heat-Related Diseases—A Meta-Review. Int. J. Public Health 2025, 70, 1608592.
- Tan, C.W.; Yu, P.-D.; Chen, S.; Poor, H.V. DeepTrace: Learning to Optimize Contact Tracing in Epidemic Networks with Graph Neural Networks. IEEE Trans. Signal Inf. Process. Netw. 2025, 11, 97–113.
- Fei, Z.; Ryeznik, Y.; Sverdlov, O.; Tan, C.W.; Wong, W.K. An Overview of Healthcare Data Analytics With Applications to the COVID-19 Pandemic. IEEE Trans. Big Data 2022, 8, 1463–1480.











| Condition | Clinical Features | Severity |
|---|---|---|
| Heat cramps | Painful muscle spasms, usually in legs or abdomen | Mild |
| Heat exhaustion | Heavy sweating, weakness, nausea, dizziness, headache | Moderate |
| Heat syncope | Sudden dizziness or fainting, usually from prolonged standing | Moderate |
| Heatstroke | High body temperature (>40 °C), confusion, unconsciousness, seizure | Severe, life-threatening |

| Variable | Unit | Definition |
|---|---|---|
| Mean daily temperature | °C | Arithmetic mean of hourly air temperatures over a day |
| Maximum temperature | °C | Highest air temperature recorded within a day |
| Minimum temperature | °C | Lowest air temperature recorded within a day |
| Temperature range | °C | Difference between daily maximum and minimum temperatures |
| Mean daily relative humidity | % | Mean of hourly relative-humidity observations over a day |
| Minimum relative humidity | % | Lowest hourly relative humidity recorded within a day |
| Precipitation | mm | Total daily accumulated rainfall |
| Mean daily wind speed | m/s | Mean wind speed measured over a day |
| Solar radiation | MJ/m² | Total solar energy received per unit area per day |

| Dataset | Positive (HRI = 1) | Negative (HRI = 0) | Positive Rate (%) | Description |
|---|---|---|---|---|
| Training set | 990 | 2762 | 26.3 | Natural distribution |
| Test set | 223 | 323 | 40.8 | Temporal test set |
| Training set (weighted) | 990 | 2762 | 26.3 | Cost-sensitive adjustment |
| Training set (ROSE) | 1865 | 1887 | 49.7 | Synthetic oversampling |

| City | Mean Temperature¹ (°C) | Relative Humidity (%) | Solar Radiation (MJ/m²) |
|---|---|---|---|
| Gwangju | 3.00 | 12.36 | 7.40 |
| Daegu | 3.44 | 12.27 | 7.21 |
| Daejeon | 3.34 | 11.80 | 7.45 |
| Busan | 3.25 | 10.54 | 8.27 |
| Seoul | 3.46 | 11.19 | 7.80 |
| Ulsan | 3.37 | 11.09 | 5.87 |
| Incheon | 3.53 | 11.81 | 7.73 |
| p-value² | <0.001 | <0.001 | 0.0002 |

| City | SD | Median (IQR) | Max | Population (×10⁶)¹ | Mean Daily Rate² |
|---|---|---|---|---|---|
| Gwangju | 0.92 | 0 (0–0) | 6 | 1.40 | 0.0271 |
| Daegu | 1.03 | 0 (0–0) | 7 | 2.36 | 0.0195 |
| Daejeon | 0.75 | 0 (0–0) | 7 | 1.44 | 0.0222 |
| Busan | 1.53 | 0 (0–1) | 12 | 3.25 | 0.0212 |
| Seoul | 3.24 | 0 (0–2) | 27 | 9.32 | 0.0166 |
| Ulsan | 1.29 | 0 (0–1) | 12 | 1.09 | 0.0514 |
| Incheon | 2.71 | 0 (0–1) | 43 | 3.04 | 0.0365 |
| p-value³ | 0.423 |

| Model | AUC | Accuracy | Sensitivity | Specificity | Precision | F1-Score |
|---|---|---|---|---|---|---|
| Logistic¹ | 0.863 | 0.827 | 0.583 | 0.915 | 0.711 | 0.641 |
| RFs² | 0.854 | 0.823 | 0.574 | 0.912 | 0.701 | 0.631 |
| SVM³ | 0.827 | 0.819 | 0.494 | 0.936 | 0.734 | 0.591 |
| k-NN⁴ | 0.824 | 0.804 | 0.529 | 0.903 | 0.661 | 0.588 |
| XGBoost⁵ | 0.860 | 0.820 | 0.572 | 0.909 | 0.692 | 0.626 |
| XGBoost (Weighted) | 0.857 | 0.807 | 0.318 | 0.9593 | 0.770 | 0.512 |
| XGBoost (ROSE⁶) | 0.853 | 0.778 | 0.788 | 0.768 | 0.771 | 0.779 |
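The threshold-based columns in this table are linked by simple identities: given sensitivity, specificity, and the class counts, accuracy, precision, and F1 follow directly. The sketch below recomputes them for the Logistic row. Using the training-set class counts (990 positive, 2762 negative) is an assumption of this example, and the helper name `derived_metrics` is hypothetical.

```python
def derived_metrics(sensitivity, specificity, n_pos, n_neg):
    """Recompute accuracy, precision and F1 from sensitivity and specificity."""
    tp = sensitivity * n_pos      # expected true positives
    tn = specificity * n_neg      # expected true negatives
    fp = n_neg - tn               # expected false positives
    accuracy = (tp + tn) / (n_pos + n_neg)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, f1


# Logistic row: sensitivity 0.583, specificity 0.915.
# Assumption: class counts are those of the training set.
acc, prec, f1 = derived_metrics(0.583, 0.915, n_pos=990, n_neg=2762)
print(f"accuracy={acc:.3f} precision={prec:.3f} f1={f1:.3f}")
```

Under that assumption the recomputed values round to the tabulated 0.827, 0.711, and 0.641. The same function can be applied to the other rows with the class counts of whichever sample they were evaluated on.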
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Im, C.; Kim, W.; Kim, H. Explainable Machine Learning for Heat-Related Illness Prediction: An XGBoost–SHAP Approach Using Korean Meteorological Data. Bioengineering 2025, 12, 1276. https://doi.org/10.3390/bioengineering12111276

