1. Introduction
Gender differences play a critical role in the development and application of machine learning models for disease classification. These differences significantly affect the epidemiology, clinical manifestation, and progression of various medical conditions, including dermatological disorders, autoimmune diseases, gender-specific cancers, and other illnesses documented in hospital records under the International Classification of Diseases, 10th Revision (ICD-10).
Machine learning (ML) has emerged as a transformative tool in the medical field, significantly advancing disease classification and predictive analytics. Various ML algorithms, including Decision Trees (DTs), Support Vector Machines (SVMs), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Artificial Neural Networks (ANNs), have been widely applied to analyze complex medical datasets and predict disease outcomes with high accuracy [1,2,3]. The effectiveness of classification models depends heavily on selecting relevant features, particularly for high-dimensional medical data, which makes feature selection crucial for improving interpretability, reducing computational complexity, and mitigating overfitting. Regularization techniques, such as the least absolute shrinkage and selection operator (LASSO), ridge regression, and elastic net, are widely employed to identify informative features while addressing multicollinearity [4,5,6]. These methods impose penalties on model coefficients, which shrinks them, enhances generalizability, and, for L1-based penalties, produces sparse models. Beyond regularization-based feature selection, feature importance ranking techniques, such as the mean decrease in accuracy (MDA) and the mean decrease in impurity (MDI), are used to refine classification models by evaluating the predictive contribution of each variable [7,8,9,10]. MDA assesses feature significance by measuring the reduction in classification accuracy when a feature’s values are randomly permuted, while MDI quantifies a feature’s contribution to decision tree-based models through impurity reduction across splits. Integrating these techniques with ICD-10 classification facilitates precise disease identification, enabling early detection, optimized resource allocation, and evidence-based policy-making, and it simplifies ML classification for both analysis and implementation. Feature selection is therefore central to the classification performance of an ML model. In this study, hybrid feature selection based on L1 (LASSO) and L2 (Ridge) regularization combined with MDA and MDI was conducted to investigate the implementation of machine learning models on a constrained set of features.
2. Disease Classification
The model in this study was developed from data collected from hospital records. The data were pre-processed to handle missing values and to standardize variables for robust analysis. An RF model with hyper-parameter tuning was adopted to predict disease categories. Model performance was evaluated, and feature importance was analyzed using Shapley additive explanations (SHAP).
2.1. Data Collection
Ethical approval was granted under an exemption category by the Rangsit University Ethics Committee, in accordance with the Declaration of Helsinki. Secondary ICD-10 data covering 2020 to 2022 were collected from Nong Han Hospital, Udon Thani, Thailand.
2.1.1. ICD-10 Classification
ICD-10 provides a standardized system for categorizing diseases, health conditions, and procedures [11,12,13,14]. Being widely used in clinical documentation and research, it features alphanumeric codes that support consistent data analysis across healthcare systems. We used codes A00–A99, covering infectious and parasitic diseases of significant public health importance.
2.1.2. Data Summary
Focusing on the response variable, the distributions of intestinal infectious diseases (A00–A09), tuberculosis (A15–A19), and other bacterial diseases (A30–A49) were compared between males and females. Females showed a higher proportion of intestinal infectious diseases (72.81%) than males (50.13%). Conversely, tuberculosis and other bacterial diseases were more prevalent in males (26.13% and 23.74%, respectively) than in females (14.81% and 12.38%, respectively), showing gender-specific differences in disease susceptibility.
2.2. Data Pre-Processing
Before analysis, data preprocessing was conducted for data cleansing, transformation, and imputation to ensure consistency and completeness. Three major disease classes were used: intestinal infectious diseases, tuberculosis, and other bacterial diseases. Intestinal infectious diseases were more prevalent among females (72.81%) than males (50.13%), whereas tuberculosis and other bacterial diseases were more frequent in males (26.13 and 23.74%) than in females (14.81 and 12.38%). For model training, the dataset was split into the training set (80%) and the testing dataset (20%). A five-fold cross-validation strategy was applied, where the training set was partitioned into five subsets to enhance model generalizability. To minimize class imbalance, class weights were assigned inversely proportional to class frequencies using

$$w_j = \frac{N}{K \, n_j}, \tag{1}$$

where $N$ is the total sample size, $K$ is the number of classes, and $n_j$ represents the size of class $j$ [4,5,6]. This weighting approach mitigates bias toward majority classes, improving predictive performance for minority classes. The class weights for the male and the female across three disease groups were evaluated as shown in Table 1.
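As a minimal sketch of this weighting rule, the snippet below computes the weight of each class as the total sample size divided by the product of the number of classes and the class size. The function name `balanced_class_weights` and the toy label counts are illustrative assumptions and are not the study's data. The same rule is what scikit-learn applies when `class_weight="balanced"` is set.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: N / (K * n_j)} for a sequence of class labels."""
    counts = Counter(labels)
    n_total = len(labels)    # N: total sample size
    n_classes = len(counts)  # K: number of classes
    return {cls: n_total / (n_classes * n_j) for cls, n_j in counts.items()}


# Hypothetical toy labels (not the hospital data): 6, 3, and 1 cases per class
toy_labels = (["intestinal"] * 6) + (["tuberculosis"] * 3) + (["other_bacterial"] * 1)
weights = balanced_class_weights(toy_labels)

for cls, w in weights.items():
    print(f"{cls}: {w:.4f}")
```

Under this rule the rarest class receives the largest weight, and the weighted class sizes sum back to the total sample size.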
2.3. Feature Selection Techniques
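Penalized regression techniques, such as LASSO, ridge, and elastic net, are widely used for feature selection to improve classification performance in disease prediction and biomedical applications [15,16,17]. As the response variable has three classes, multinomial logistic regression (MLR) was adopted. MLR was used to model the probability of each class $k$ using the following softmax function:

$$P(y_i = k \mid \mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}_k^{\top} \mathbf{x}_i)}{\sum_{j=1}^{3} \exp(\boldsymbol{\beta}_j^{\top} \mathbf{x}_i)}, \tag{2}$$

where $y_i$ is the class label for sample $i$, taking values in {1, 2, 3}, $\mathbf{x}_i$ is the feature vector for sample $i$, and $\boldsymbol{\beta}_k$ is the coefficient vector corresponding to class $k$. The standard objective function (negative log-likelihood) for multinomial logistic regression is defined by

$$\mathcal{L}(\boldsymbol{\beta}) = -\sum_{i=1}^{N} \sum_{k=1}^{3} \mathbb{1}(y_i = k) \, \log P(y_i = k \mid \mathbf{x}_i), \tag{3}$$

where $\mathbb{1}(\cdot)$ denotes an indicator function that equals 1 if $y_i = k$ and 0 otherwise. Subsequently, LASSO applies L1 regularization, enforcing sparsity so that Equation (3) is adjusted to

$$\mathcal{L}_{\mathrm{LASSO}}(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) + \lambda_1 \sum_{k=1}^{3} \lVert \boldsymbol{\beta}_k \rVert_1, \tag{4}$$

where $\lambda_1$ is the L1 regularization parameter. The second term enforces sparsity, making some coefficients exactly zero. Ridge applies L2 regularization to prevent large coefficients. Equation (3) is then modified to

$$\mathcal{L}_{\mathrm{Ridge}}(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) + \lambda_2 \sum_{k=1}^{3} \lVert \boldsymbol{\beta}_k \rVert_2^2, \tag{5}$$

where $\lambda_2$ is the L2 regularization parameter. The L2 penalty prevents large values in $\boldsymbol{\beta}$ but does not enforce sparsity. Combining L1 and L2 regularization, the elastic net is defined as

$$\mathcal{L}_{\mathrm{EN}}(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) + \lambda_1 \sum_{k=1}^{3} \lVert \boldsymbol{\beta}_k \rVert_1 + \lambda_2 \sum_{k=1}^{3} \lVert \boldsymbol{\beta}_k \rVert_2^2, \tag{6}$$

where $\lambda_1$ and $\lambda_2$ control sparsity and shrinkage, respectively. To observe the importance of each feature for males and females, feature importance scores, namely MDA and MDI, were computed on the training set (80%) [8,9,10]. The conceptual idea is described as follows.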
MDA is a feature importance measure used in machine learning, particularly in RF. It quantifies the impact of each predictor by computing the decrease in model accuracy when the values of that feature are randomly permuted. A higher MDA score indicates a more influential feature, and the measure is widely applied in classification and regression tasks.
MDI is a feature importance metric used in tree-based models, particularly RF. It measures the reduction in impurity (e.g., Gini index or entropy) contributed by each feature across all decision trees. A higher MDI score indicates a more influential feature. However, MDI is biased towards variables with more categories or higher cardinality, necessitating careful interpretation in feature selection and model evaluation.
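The two scores can be sketched with scikit-learn as follows. This is an illustrative example and not the study's code: the data are synthetic (generated with `make_classification`), the seven feature names are attached to synthetic columns only for readability, and the RF settings are arbitrary. MDA is approximated here with `permutation_importance`, MDI is read from `feature_importances_`, and the 0.01 cut-off mirrors the threshold reported in Section 3.1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the hospital records (assumption)
feature_names = ["Age", "Temperature", "BMI", "RR", "Pulse", "Occupation", "Status"]
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0)
rf.fit(X_train, y_train)

# MDI: impurity reduction attributed to each feature, averaged over all trees
mdi = rf.feature_importances_

# MDA: mean drop in accuracy when one feature column is randomly permuted,
# computed on the training set as described in the text
mda = permutation_importance(rf, X_train, y_train, scoring="accuracy",
                             n_repeats=10, random_state=0).importances_mean

threshold = 0.01  # cut-off used to retain features
mda_set = [feature_names[i] for i in np.argsort(mda)[::-1] if mda[i] > threshold]
mdi_set = [feature_names[i] for i in np.argsort(mdi)[::-1] if mdi[i] > threshold]
print("MDA-ranked:", mda_set)
print("MDI-ranked:", mdi_set)
```

Because MDI is normalized to sum to one while MDA is expressed in accuracy units, the same numerical threshold can retain different numbers of features under the two scores.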
2.4. Classification Models and Performance
For ICD-10 disease classification on this dataset, we defined three major classes (Table 1). An RF classifier, one of the most effective ML models, was implemented. Hyper-parameter tuning and the evaluation metrics for RF are described as follows.
RF was the main classification model used in this study. It is an ensemble-based machine learning approach frequently employed for classifying infectious and parasitic diseases using ICD-10 codes [2,18,19]. RF generates multiple decision trees during the training phase and aggregates their predictions to enhance accuracy and mitigate overfitting. By constructing each tree from randomly selected subsets of features and samples, RF improves model robustness against noise and imbalanced data.
A grid search strategy was used to systematically evaluate all possible combinations of parameters within a given parameter set, providing a comprehensive exploration of the parameter space [20]. The optimal parameters identified for the male model were a maximum depth of 15, a minimum samples split of 15, and 60 estimators, while the best parameters for the female model were a maximum depth of 15, a minimum samples split of 5, and 100 estimators.
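A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The example below is a hedged illustration on synthetic data: the grid contains the reported optimal values (depth 15, minimum samples split 5 or 15, and 60 or 100 estimators), but the remaining candidate values, the macro F1 scoring criterion, and the random seeds are assumptions, since the full search space is not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic three-class training data (assumption, not the hospital records)
X_train, y_train = make_classification(n_samples=300, n_features=5,
                                       n_informative=4, n_redundant=0,
                                       n_classes=3, random_state=0)

# Candidate values: reported optima plus one assumed alternative for depth
param_grid = {
    "max_depth": [5, 15],
    "min_samples_split": [5, 15],
    "n_estimators": [60, 100],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1_macro",  # assumed selection criterion
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)  # evaluates all 2 x 2 x 2 = 8 combinations

best_params = search.best_params_
print(best_params)
```

Running the search separately on the male and female training sets would yield one parameter set per group, which is how the two sets of optimal values above would arise.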
Feature importance provides information on the key predictors contributing to the model. Class weighting is used to mitigate data imbalance by ensuring equitable representation of all classes. To evaluate the impact of these factors on model performance, the 3 × 3 confusion matrix for ICD-10 disease classification was used to derive essential metrics, including accuracy, precision, recall (sensitivity), specificity, F1-score, balanced accuracy, and the area under the curve (AUC) [10,21,22,23]. The F1-score balances precision and recall, whereas AUC assesses model performance across classification thresholds. Because there are three disease classes, all metrics were averaged across classes before being reported.
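The derivation of these metrics from a multi-class confusion matrix can be sketched as follows. The example treats each class one-vs-rest and takes the unweighted (macro) mean across classes; macro averaging is an assumption, as the text states only that the metrics were averaged. The function name `macro_metrics` and the toy matrix are hypothetical. AUC is not included because it requires predicted probabilities and cannot be recovered from the confusion matrix alone.

```python
import numpy as np


def macro_metrics(cm):
    """Macro-averaged one-vs-rest metrics from a K x K confusion matrix.

    Rows are true classes and columns are predicted classes. Assumes every
    class occurs and is predicted at least once (no zero denominators).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)               # correctly classified cases per class
    fp = cm.sum(axis=0) - tp       # predicted as the class but belong elsewhere
    fn = cm.sum(axis=1) - tp       # belong to the class but predicted elsewhere
    tn = cm.sum() - tp - fp - fn   # everything else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),   # macro recall equals balanced accuracy
        "specificity": specificity.mean(),
        "f1": f1.mean(),
    }


# Hypothetical 3 x 3 confusion matrix (not taken from the study)
toy_cm = [[50, 5, 5],
          [10, 20, 10],
          [5, 5, 30]]
metrics = macro_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```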
2.5. SHAP Analysis
SHAP is widely used in interpreting machine learning models by assessing feature importance through Shapley values, derived from cooperative game theory [1,24,25]. It is used to quantify each feature’s contribution to the model’s predictions, offering information on model behavior. For classifying infectious and parasitic diseases using ICD-10 codes, key diagnostic factors were used in SHAP to improve interpretability and healthcare decision-making in this study. Generally, SHAP analysis is applied to the best-performing model to ensure accurate feature attribution.
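For reference, the Shapley value on which SHAP is based can be written as below. The notation is chosen here for illustration and does not come from the original text: $F$ is the full feature set, $S$ ranges over subsets that exclude feature $j$, $v(S)$ is the expected model output when only the features in $S$ are known, and $\phi_0$ is the expected output when no feature is known.

```latex
\phi_j = \sum_{S \subseteq F \setminus \{j\}}
         \frac{|S|! \, (|F| - |S| - 1)!}{|F|!}
         \bigl[ v(S \cup \{j\}) - v(S) \bigr],
\qquad
f(\mathbf{x}) = \phi_0 + \sum_{j \in F} \phi_j .
```

The second identity is the additivity property: the attributions of all features sum to the difference between the prediction for a sample and the baseline output, which is why mean absolute SHAP values can be compared across features.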
3. Results
Python (version 3.10.9, conda-forge distribution) was used for all analyses. Core libraries included pandas (v1.5.2) for data manipulation, SciPy (v1.9.3) and statsmodels (v0.13.2) for statistical procedures, scikit-learn (v1.6.1) for feature selection and machine learning models, and PyTorch (v2.1.2+cu118) for computational tasks. These tools supported model implementation, feature evaluation, and performance analysis. Gender-specific feature sets were evaluated to identify the most suitable features for optimal classification. The models were assessed using cumulative feature inclusion strategies. Finally, SHAP analysis was conducted to interpret the contribution of individual features and understand the influence of each feature on classification outcomes and the model’s decision-making process.
3.1. Feature Selection Results
The feature selection results from LASSO, Ridge, and Elastic Net regression highlighted variations in gender-specific disease classification. Age, temperature, respiratory rate (RR), pulse, and BMI were selected for both males and females, while occupation and status exhibited lower coefficients. Then, the MDA and MDI scores were evaluated for all significant features in both males and females (Table 2).
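The penalized selection step can be sketched with scikit-learn's `LogisticRegression`, which fits a multinomial model with L1, L2, or elastic-net penalties when the `saga` solver is used. This is an illustrative sketch on synthetic data, not the study's code. Note that scikit-learn parameterizes the penalty through an inverse strength `C` and a mixing ratio `l1_ratio` rather than separate L1 and L2 parameters, and the values of `C` and `l1_ratio` below are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; names are attached for readability only (assumption)
feature_names = ["Age", "Temperature", "BMI", "RR", "Pulse", "Occupation", "Status"]
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

penalties = {
    "lasso": dict(penalty="l1", C=0.1),
    "ridge": dict(penalty="l2", C=0.1),
    "elastic_net": dict(penalty="elasticnet", l1_ratio=0.5, C=0.1),
}

coef_strength = {}
for name, kwargs in penalties.items():
    # Standardize first so that the penalty treats all features equally
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver="saga", max_iter=5000, **kwargs),
    )
    model.fit(X, y)
    coef = model[-1].coef_  # shape: (n_classes, n_features)
    # Summarize each feature by its largest absolute coefficient over classes
    coef_strength[name] = np.abs(coef).max(axis=0)

for name, strength in coef_strength.items():
    ranked = [feature_names[i] for i in np.argsort(strength)[::-1]]
    print(name, ranked)
```

Features whose coefficients are zero (L1, elastic net) or consistently small (L2) across all three penalties are natural candidates for exclusion before the MDA and MDI ranking.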
In Figure 1, the MDA and MDI feature importance scores highlight gender-specific differences. For MDA, age and temperature had the highest scores for males, while age and BMI were important for females. For MDI, age and BMI dominated for both genders, with pulse also important for females. With a threshold value of 0.01, MDA retained five features for both males and females, while MDI retained six features for males and seven for females (Status appears only for females). Therefore, the ordered sets of features were identified as follows.
MDA_Male = {Age, Temperature, BMI, RR, Pulse};
MDA_Female = {Age, Temperature, BMI, Pulse, RR};
MDI_Male = {Age, BMI, Temperature, Pulse, Occupation, RR};
MDI_Female = {BMI, Age, Pulse, Temperature, RR, Occupation, Status}.
To investigate the impact of feature inclusion on model performance, features were added to the RF models in the order of the predefined importance rankings. At each step, the model was re-evaluated after adding the next feature from the corresponding ordered set, starting with the most important features. This cumulative inclusion strategy enabled the assessment of how each successive feature contributed to classification performance.
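The cumulative inclusion strategy can be sketched as a simple loop over a ranked feature list. In the example below, the ranking follows the MDA_Male order given above and the RF settings follow the reported male-model optimum, but the data are synthetic, column $i$ of the synthetic matrix merely stands in for the $i$-th ranked feature, and starting from a single feature with macro F1 as the score are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

ranked = ["Age", "Temperature", "BMI", "RR", "Pulse"]  # MDA_Male order

# Synthetic stand-in data: column i plays the role of ranked[i] (assumption)
X, y = make_classification(n_samples=600, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

results = []
for k in range(1, len(ranked) + 1):
    cols = list(range(k))  # the top-k ranked features
    rf = RandomForestClassifier(n_estimators=60, max_depth=15,
                                min_samples_split=15,
                                class_weight="balanced", random_state=0)
    rf.fit(X_train[:, cols], y_train)
    score = f1_score(y_test, rf.predict(X_test[:, cols]), average="macro")
    results.append((ranked[:k], score))

for features, score in results:
    print(f"{len(features)} features: macro F1 = {score:.4f}")
```

Comparing consecutive rows of such a table shows whether a newly added feature helps, has no effect, or slightly degrades test performance.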
3.2. Model Performance
RF models were implemented separately for the male and female groups. The feature sets were first selected by LASSO, Ridge, and Elastic Net; these features were then ranked using MDA and MDI scores, and the ranked sets were subsequently used to train the models. As presented in Table 3 and Table 4, which report results based on MDA, the classification performance of the RF models on the test set improved as more features were included, particularly in males. Although not shown, the models using MDI-ranked features exhibited a similar pattern. The notation ** marks the best-performing models for each evaluation metric, highlighting improvements in accuracy, precision, recall, F1-score, specificity, and AUC.

The male group (Table 3) consistently showed higher accuracy, recall, and specificity than the female group, while females displayed slightly lower precision (Table 4). The F1-score showed a similar pattern, improving with the addition of features in both groups. The AUC values remained robust across all scenarios, indicating consistent classification performance. Specificity was notably high in all models, reflecting strong negative class prediction. The results underscored the importance of feature selection, as combinations of features enhanced model robustness and predictive reliability. For example, under MDA with five features, {Age, Temperature, BMI, RR, Pulse}, the F1-score reached 0.6422 in males and 0.5997 in females. The AUC was 0.8124 for males and 0.8075 for females, showing better performance of the male model. Similarly, under MDI with six features, {BMI, Age, Pulse, Temperature, RR, Occupation}, the male F1-score was 0.6293 and the female F1-score was 0.6185, with corresponding AUC values of 0.8186 and 0.8067. The male models generally performed better, although the F1-score gap narrowed under MDI (0.0108 versus 0.0425 under MDA), reflecting improved female model performance.

Interestingly, female performance patterns differed from those of males: whereas male models improved with the inclusion of more features, female models achieved their highest values for some metrics (e.g., accuracy and precision) using only three features. This indicates that additional features did not consistently improve female model performance and, in some cases, slightly reduced it. In other words, identifying the most crucial features in predictive modeling can be more effective than simply adding more features. Notably, BMI and Pulse appeared in the top-performing MDA and MDI sets, suggesting their consistent contribution to model discrimination. The sustained high specificity in both groups supported the reliability of these models in identifying negative cases. However, differences in F1-score and AUC revealed gender-based variations in classification effectiveness.
3.3. SHAP for Feature Importance
To interpret the results obtained from the best-performing classifier, SHAP analysis was used to identify the most influential features driving the model’s predictions across the training and test datasets, providing explainable insights into class-level predictions. For the male group, the SHAP analysis highlighted age, RR, and temperature as the most influential features for the model’s predictions (Figure 2). Age showed the highest mean SHAP value, significantly impacting classification, followed by RR and temperature. The class-based breakdown suggested that feature importance varied by category, with pulse and BMI having lesser but notable contributions. Higher feature values were correlated with greater SHAP values, while RR, age, and BMI showed strong influence. Pulse contributed the least. For the female group, the SHAP analysis showed that BMI, RR, and temperature were the most influential features (Figure 3). BMI showed the highest mean SHAP value, followed by RR and temperature, indicating their significant impact on classification. Pulse and age also contributed notably, while status and occupation had lower but observable effects across different classes.
4. Discussion
The capability of the developed model in the classification of the three major groups of diseases based on ICD-10 codes was assessed by feature selection, feature importance, and SHAP. Regarding the gender-specific risk factors, seven features were identified in the dataset.
4.1. Features and Classification
Feature selection plays a crucial role in improving classification performance. The inclusion of age, temperature, BMI, RR, and pulse from MDA and BMI, age, pulse, and temperature from MDI demonstrated their importance in distinguishing disease cases. Higher F1-scores and AUC values in male models indicated more robust classification, possibly due to feature variations. The specificity was high, indicating strong negative class detection. While male models exhibited better recall, female models indicated the necessity of feature selection to optimize disease classification based on gender-specific patterns.
4.2. Model Interpretation
Understanding model interpretation is essential for assessing the influence of different features on disease classification. The SHAP analysis results presented key predictors and showed that age, BMI, RR, and body temperature were the most influential factors. Age and RR showed stronger effects in males, while BMI and pulse played critical roles in female classifications. MDA and MDI scores confirmed these findings, reinforcing their significance in predictive modeling. The results supported model interpretability, ensuring clinically relevant and explainable predictions.
4.3. Implementation in Transitional Healthcare
The integration of machine learning-based classification models in transitional healthcare settings enhances early disease detection and patient management. By leveraging ICD-10 classifications and feature selection, hospitals can implement automated screening systems to prioritize high-risk cases [2,3,14]. Key features such as BMI, age, and pulse aid in predictive modeling for infectious diseases. The observed gender-specific variations highlighted the need for personalized healthcare strategies. These models enable timely interventions to reduce diagnostic delays and improve clinical decision-making. It is necessary to integrate real-time data to enhance the adaptability of ML models in diverse healthcare environments.
5. Conclusions
We developed a machine learning-based disease classification framework using ICD-10 codes, with an emphasis on gender-specific risk factors. Feature selection was conducted using LASSO, Ridge, and Elastic Net, followed by MDA and MDI scoring to determine feature importance. The RF classifier was trained using an 80:20 split into training and testing sets together with five-fold cross-validation, incorporating class weights to address data imbalance. Age, BMI, RR, and body temperature were key predictors for both male and female groups, while pulse had a stronger impact on female models. SHAP analysis results confirmed the differential importance of these features. Models trained with appropriate feature selection exhibited improved accuracy, recall, and F1-score, underscoring the importance of balancing predictive power and interpretability. The inclusion of gender-specific variations enhances the model’s applicability to personalized medicine, supporting data-driven healthcare decision-making. Future work should explore real-time implementations and the integration of the model into clinical workflows. By refining predictive modeling for infectious disease classification, the model developed in this study has the potential to improve early diagnosis, optimize resource allocation, and contribute to better patient outcomes in clinical settings.
Author Contributions
P.B. and T.S. contributed to the conceptualization, methodology, and formal analysis. P.B. and S.S. prepared the original draft, while T.S. and P.B. reviewed and edited the manuscript. S.S. and T.S. were responsible for validation. J.Y. performed the programming, investigation, and visualization. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
This study received ethical approval under an exemption category from the Ethics Committee at Rangsit University, Thailand (DPE. No. RSUERB2024-032), in compliance with the principles outlined in the Declaration of Helsinki.
Informed Consent Statement
As the research involved de-identified secondary data, informed consent was waived by the Ethics Committee at Rangsit University and Nong Han Hospital.
Data Availability Statement
The data are available from the corresponding author upon request.
Acknowledgments
The authors extend their sincere gratitude to Nong Han Hospital, Udon Thani, for generously providing the medical dataset utilized in this study. Additionally, we deeply appreciate the insightful and constructive feedback from the anonymous reviewers, which has significantly contributed to enhancing the overall quality and clarity of this manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Nohara, Y.; Matsumoto, K.; Soejima, H.; Nakashima, N. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput. Methods Programs Biomed. 2022, 214, 106584.
2. Zikos, D.; DeLellis, N. Comparison of the Predictive Performance of Medical Coding Diagnosis Classification Systems. Technologies 2022, 10, 122.
3. Akarajarasroj, T.; Wattanapermpool, O.; Sapphaphab, P.; Rinthon, O.; Pechprasarn, S.; Boonkrong, P. Feature Selection in the Classification of Erythemato-Squamous Diseases using Machine Learning Models and Principal Component Analysis. In Proceedings of the Biomedical Engineering International Conference (BMEiCON), Tokyo, Japan, 28–31 October 2023; pp. 1–5.
4. Zhu, M.; Xia, J.; Jin, X.; Yan, M.; Cai, G.; Yan, J.; Ning, G. Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access 2018, 6, 4641–4652.
5. Yang, C.; Zhou, F. Imbalanced bearing fault diagnosis based on adaptive cost-sensitive neural network. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; pp. 6514–6519.
6. Baldota, S.; Aggarwal, D. A Novel Weighted Extreme Learning Machine for Highly Imbalanced Multiclass Classification. In Congress on Intelligent Systems; Springer: Singapore, 2022; pp. 817–830.
7. Gwetu, M.V.; Tapamo, J.-R.; Viriri, S. Exploring the impact of purity gap gain on the efficiency and effectiveness of random forest feature selection. In Proceedings of the Computational Collective Intelligence, Hendaye, France, 4–6 September 2019; pp. 340–352.
8. Chaibi, M.; Tarik, L.; Berrada, M.; El Hmaidi, A. Machine learning models based on random forest feature selection and Bayesian optimization for predicting daily global solar radiation. Int. J. Renew. Energy Dev. 2022, 11, 309.
9. Scornet, E. Trees, forests, and impurity-based variable importance in regression. Ann. Inst. Henri Poincaré Probab. Statist. 2023, 59, 21–52.
10. Simmachan, T.; Wongsai, S.; Lerdsuwansri, R.; Boonkrong, P. Impact of COVID-19 Pandemic on Road Traffic Accident Severity in Thailand: An Application of K-Nearest Neighbor Algorithm with Feature Selection Techniques. Thail. Stat. 2025, 23, 129–143.
11. WHO. ICD-10 Version:2019. Available online: https://icd.who.int/browse10/2019/en (accessed on 11 January 2025).
12. Fung, K.W.; Xu, J.; Bodenreider, O. The new International Classification of Diseases 11th edition: A comparative analysis with ICD-10 and ICD-10-CM. J. Am. Med. Inform. Assoc. 2020, 27, 738–746.
13. CDC. Classification of Diseases, Functioning, and Disability. Available online: https://www.cdc.gov/nchs/icd/icd-10-cm/index.html (accessed on 11 January 2025).
14. Almagro, M.; Unanue, R.M.; Fresno, V.; Montalvo, S. ICD-10 coding of Spanish electronic discharge summaries: An extreme classification problem. IEEE Access 2020, 8, 100073–100083.
15. Bedoui, A.; Lazar, N.A. Bayesian empirical likelihood for ridge and lasso regressions. Comput. Stat. Data Anal. 2020, 145, 106917.
16. Suprapto, S.; Nikmah, Y.L. Ridge and Lasso Regression for Feature Selection of Overlapping Ibuprofen and Paracetamol UV Spectra. Moroc. J. Chem. 2023, 11, 221–229.
17. Ghasemi, A.; Najarzadeh, D.; Khazaei, M. Seemingly unrelated penalized regression models. Commun. Stat.-Simul. Comput. 2024, 1–20.
18. Rajendran, N.A.; Vincent, D.R. Heart disease prediction system using ensemble of machine learning algorithms. Recent Pat. Eng. 2021, 15, 130–139.
19. Le Lay, J.; Alfonso-Lizarazo, E.; Augusto, V.; Bongue, B.; Masmoudi, M.; Xie, X.; Gramont, B.; Clarier, T. Prediction of hospital readmission of multimorbid patients using machine learning models. PLoS ONE 2022, 17, e0279433.
20. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059.
21. Tengjongdee, S.; Chaitan, C.; Hanmanop, S.; Jongsiri, T.; Khongtan, M.; Pechprasarn, S.; Boonkrong, P. Comparative Analysis of Deep Learning Networks for COVID-19 and Pneumonia Identification: Grad-CAM Visualization of Chest X-Ray Images. In Proceedings of the Biomedical Engineering International Conference (BMEiCON), Chon Buri, Thailand, 21–24 November 2024; pp. 1–5.
22. Simmachan, T.; Boonkrong, P. Effect of Resampling Techniques on Machine Learning Models for Classifying Road Accident Severity in Thailand. J. Curr. Sci. Technol. 2025, 15, 99.
23. Boonkrong, P.; Simmachan, T.; Sittimongkol, R.; Lerdsuwansri, R. Data-Driven Approach in Provincial Clustering for Sustainable Tourism Management in Thailand. Thail. Stat. 2025, 23, 481–500.
24. Li, X.; Zhou, Y.; Dvornek, N.C.; Gu, Y.; Ventola, P.; Duncan, J.S. Efficient Shapley explanation for features importance estimation under uncertainty. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020; Springer: Cham, Switzerland, 2020; pp. 792–801.
25. Ponce-Bobadilla, A.V.; Schmitt, V.; Maier, C.S.; Mensing, S.; Stodtmann, S. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clin. Transl. Sci. 2024, 17, e70056.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).