Explainable Ensemble Learning for Robust Severity Stratification of Carpal Tunnel Syndrome from Clinical Data
Abstract
1. Introduction
- To verify the effectiveness of our data augmentation scheme, the findings were compared with the baseline values reported by Park et al. [6] in the original work introducing the dataset. Specifically, with no oversampling and using only a classical ML algorithm (XGBoost), the highest accuracy in their analysis was 76.6%. In contrast, our ADASYN-based data augmentation raised accuracy to 90.45%, an improvement of approximately 14 percentage points over the baseline.
- In contrast to existing stacking-based ensembles, a novel stacking ensemble framework that combines XGBoost, Random Forest, LightGBM, and CatBoost with calibrated probability aggregation was developed and optimized. This model obtained 91.15% accuracy while maintaining balanced performance across all three CTS severity levels (Mild/Moderate/Severe). The method thus directly addresses the need for multi-class severity stratification in CTS diagnosis, which previous studies concentrating primarily on binary classification have overlooked.
- Clinically motivated feature engineering, including polynomial transformations, interaction terms, and categorical binning, is incorporated into the stacking ensemble. It improves the model's ability to distinguish between severity levels and makes the results more interpretable for clinicians.
- To address the lack of quantitative consistency evaluation between global and local explainable AI methods, both SHAP and LIME are applied and compared. Consistent results from both methods indicate that clinically relevant features, such as cross-sectional area (CSA), symptom duration, and pain intensity, are the most influential factors in the model's decision-making process.
- Beyond accuracy, we also analyzed macro-averaged metrics, ROC/PR AUC, the Matthews Correlation Coefficient (MCC), and Cohen's Kappa. Dimensionality-reduction visualizations (PCA and t-SNE) and feature importance analysis helped confirm the model's stability.
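The three feature-engineering families named in the contributions (polynomial terms, interaction terms, categorical binning) can be illustrated with a minimal sketch. The field names follow dataset columns (CSA, Duration, NRS, BMI), but the record values, the choice of which features to square or multiply, and the use of WHO BMI categories as bin edges are all assumptions made for illustration, not the transformations actually used in the study.

```python
from itertools import combinations

# Hypothetical patient record; values are made up for illustration.
record = {"CSA": 15.0, "Duration": 5.0, "NRS": 4.0, "BMI": 24.4}


def engineer_features(row, numeric=("CSA", "Duration", "NRS")):
    """Return the original fields plus squared, interaction, and binned terms."""
    out = dict(row)
    # Polynomial terms: square each selected numeric feature.
    for name in numeric:
        out[f"{name}^2"] = row[name] ** 2
    # Interaction terms: product of every pair of selected features.
    for a, b in combinations(numeric, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    # Categorical binning: WHO BMI categories, an assumed example of
    # clinically motivated cut-points (not the paper's own bins).
    bmi = row["BMI"]
    if bmi < 18.5:
        out["BMI_bin"] = "underweight"
    elif bmi < 25.0:
        out["BMI_bin"] = "normal"
    elif bmi < 30.0:
        out["BMI_bin"] = "overweight"
    else:
        out["BMI_bin"] = "obese"
    return out


features = engineer_features(record)
print(len(features), features["CSA*Duration"], features["BMI_bin"])
```

Four input fields expand to eleven here (three squares, three pairwise products, one bin), which shows how a small clinical feature set can grow quickly under this kind of expansion.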
Related Works
2. Materials and Methods
2.1. Dataset
2.2. Data Augmentation Methods
2.3. Machine Learning (ML)
2.4. Evaluation Metrics
3. Experimental Results
3.1. Dataset Preparation
3.2. Feature Engineering
3.3. Descriptive Statistics of the Dataset
- Age shows a near-normal distribution centered around 58 years, consistent with the clinical profile of the studied population.
- BMI is slightly right-skewed, with most values concentrated between 20 and 30, reflecting typical patient BMI ranges.
- CSA (Cross-Sectional Area) has a moderate right skew, with the majority of values clustered between 10 and 20 mm².
- PB (Palmar Bowing) exhibits a highly skewed distribution with a concentration of values at lower ranges, indicating potential outliers.
- Duration (symptom duration) is heavily right-skewed, as most patients present within the first year, with fewer cases reporting longer symptom periods.
- NRS (pain score) is distributed across the full scale, with peaks around moderate pain levels (scores 4–6).
3.4. Statistical Association Between Features and Target Variable
- Original Features: Among the original features, BMI, CSA (Cross-Sectional Area), PB (Palmar Bowing), and symptom duration differed significantly among severity classes (p < 0.05 in both tests). In contrast, the demographic feature Age showed no statistically significant association.
- Processed Features: The tests on the processed dataset were also statistically significant, indicating that feature engineering preserved, and in some cases strengthened, the importance of clinical and sonographic features for classifying CTS severity.
- Stacking Meta-Features: All stacking meta-features, from meta_0 to meta_n, showed very strong statistical significance (p < 0.001 in both ANOVA and Kruskal–Wallis tests). This indicates that the proposed meta-model captures features that discriminate between the classes.
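For readers who want to see what the Kruskal–Wallis comparison across the three severity classes computes, the following is a minimal standard-library sketch. It uses the usual chi-square approximation to the H statistic; with three groups the test has two degrees of freedom, for which the chi-square survival function reduces to exp(-H/2). The sample values are hypothetical symptom durations, not data from the study.

```python
import math
from collections import Counter


def kruskal_wallis(groups):
    """Return (H, p) with tie correction; the p-value form assumes 3 groups."""
    if len(groups) != 3:
        raise ValueError("closed-form p-value below assumes 3 groups (df = 2)")
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    # Average rank for each distinct value (tied values share the mean rank).
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n_total + 1)
    # Tie correction factor.
    ties = Counter(pooled).values()
    h /= 1 - sum(t**3 - t for t in ties) / (n_total**3 - n_total)
    p = math.exp(-h / 2)  # chi-square survival function for df = 2
    return h, p


# Hypothetical symptom durations (months) for Mild, Moderate, Severe.
mild = [1, 2, 3, 4, 5]
moderate = [6, 7, 8, 9, 10]
severe = [11, 12, 13, 14, 15]
h_stat, p_value = kruskal_wallis([mild, moderate, severe])
print(round(h_stat, 3), round(p_value, 5))
```

Because the statistic is rank-based, it suits the heavily skewed variables in this dataset (PB, Duration) better than a comparison of raw means would.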
3.5. Dimensionality Reduction and Visualization
4. Machine Learning Classification Results
4.1. Individually Optimized Base Model Performances
4.2. Final Stacking Model
- Meta-feature creation: The first layer of the stacking model is built from meta-features that include the following.
  - Out-of-Fold predictions: Probability predictions produced by four calibrated base models (calibrated_xgb, calibrated_rf, calibrated_lgbm, and calibrated_catboost) through 5-fold Stratified K-Fold cross-validation (stacking_cv_main) on the training set. For the test set, probabilities were calculated using models trained on the complete training set.
  - One-vs-Rest (OvR) features: Probability predictions generated by a calibrated XGBoost model (ovr_base_model_for_meta) for each class using the OvR procedure.
  - Passthrough original features: When the passthrough_original_features flag is set to True, the preprocessed original features are passed to the meta-learner alongside the predictions above. The meta_X_train and meta_X_test datasets were generated in this manner.
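The out-of-fold mechanism described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the study's pipeline: `CentroidModel` is a hypothetical stand-in for the four calibrated base learners, the round-robin `stratified_folds` helper replaces a library splitter, the OvR features are omitted, and the toy data are made up. Only the names `meta_X_train` and the passthrough idea come from the text.

```python
import math
from collections import defaultdict


def stratified_folds(y, k):
    """Assign each sample to one of k folds, round-robin within each class."""
    fold_of, seen = [0] * len(y), defaultdict(int)
    for i, label in enumerate(y):
        fold_of[i] = seen[label] % k
        seen[label] += 1
    return fold_of


class CentroidModel:
    """Stand-in base learner: softmax over negative distance to class centroids."""

    def fit(self, X, y):
        sums, counts = defaultdict(lambda: [0.0] * len(X[0])), defaultdict(int)
        for row, label in zip(X, y):
            counts[label] += 1
            sums[label] = [s + v for s, v in zip(sums[label], row)]
        self.classes_ = sorted(counts)
        self.centroids_ = [[s / counts[c] for s in sums[c]] for c in self.classes_]
        return self

    def predict_proba(self, X):
        out = []
        for row in X:
            scores = [-math.dist(row, c) for c in self.centroids_]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            out.append([e / sum(exps) for e in exps])
        return out


def out_of_fold_proba(make_model, X, y, k=5):
    """Each row's probabilities come from a model that never saw that row."""
    fold_of = stratified_folds(y, k)
    oof = [None] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if fold_of[i] != fold]
        held = [i for i in range(len(X)) if fold_of[i] == fold]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(held, model.predict_proba([X[i] for i in held])):
            oof[i] = p
    return oof


# Toy data: 3 classes, 2 features (illustrative values only).
X = [[c + 0.1 * j, c - 0.1 * j] for c in (0.0, 3.0, 6.0) for j in range(5)]
y = [label for label in (0, 1, 2) for _ in range(5)]

oof = out_of_fold_proba(CentroidModel, X, y, k=5)
# Passthrough: append the original features to the base-model probabilities.
meta_X_train = [p + row for p, row in zip(oof, X)]
# Test-time meta-features would come from a model fit on the full training set.
full_model = CentroidModel().fit(X, y)
print(len(meta_X_train), len(meta_X_train[0]))
```

Building the meta-learner's training matrix from held-out predictions keeps it from learning on probabilities that the base models produced for rows they had already memorized. In scikit-learn, `cross_val_predict(..., method="predict_proba")` with a `StratifiedKFold` splitter provides the same mechanism.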
4.3. SHAP–LIME Consistency Analysis
4.4. Clinical Decision Impact Matrix
4.5. Ablation Study
4.6. Overfitting Analysis
4.7. Statistical Comparison of Classifiers

5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Padua, L.; Monaco, M.L.; Padua, R.; Gregori, B.; Tonali, P. Neurophysiological classification of carpal tunnel syndrome: Assessment of 600 symptomatic hands. Ital. J. Neurol. Sci. 1997, 18, 145–150.
- Levine, D.W.; Simmons, B.P.; Koris, M.J.; Daltroy, L.H.; Hohl, G.G.; Fossel, A.H.; Katz, J.N. A self-administered questionnaire for the assessment of severity of symptoms and functional status in carpal tunnel syndrome. JBJS 1993, 75, 1585–1592.
- Cartwright, M.S.; Hobson-Webb, L.D.; Boon, A.J.; Alter, K.E.; Hunt, C.H.; Flores, V.H.; Walker, F.O. Evidence-based guideline: Neuromuscular ultrasound for the diagnosis of carpal tunnel syndrome. Muscle Nerve 2012, 46, 287–293.
- Roll, S.C.; Case-Smith, J.; Evans, K.D. Diagnostic Accuracy of Ultrasonography vs. Electromyography in Carpal Tunnel Syndrome: A Systematic Review of Literature. Ultrasound Med. Biol. 2011, 37, 1539–1553.
- Klauser, A.S.; Halpern, E.J.; Zordo, D.; Feuchtner, G.M.; Arora, R.; Gruber, J.; Martinoli, C.; Löscher, W.N. Carpal tunnel syndrome assessment with US: Value of additional cross-sectional area measurements of the median nerve in patients versus healthy volunteers. Radiology 2009, 250, 171–177.
- Park, D.; Kim, B.H.; Lee, S.E.; Kim, D.Y.; Kim, M.; Kwon, H.D.; Lee, J.W. Machine learning-based approach for disease severity classification of carpal tunnel syndrome. Sci. Rep. 2021, 11, 17464.
- Elseddik, M.; Mostafa, R.R.; Elashry, A.; El-Rashidy, N.; El-Sappagh, S.; Elgamal, S.; El-Bakry, H. Predicting CTS Diagnosis and Prognosis Based on Machine Learning Techniques. Diagnostics 2023, 13, 492.
- Inui, A.; Takase, F.; Lucchina, S.; Kanatani, T. Prediction of Electrophysiological Severity and Carpal Tunnel Syndrome Instrument Changes After Carpal Tunnel Release Using Machine Learning Model. Appl. Sci. 2025, 15, 1812.
- Kharat, P.P.; Al Majmaie, S.; Ghajari, G.; Amsaad, F.; Ibrahem, M.I. A Secure and Robust ML Framework for Sequence Classification and Adversarial Evaluation in a Bilateral Carpal Tunnel Syndrome Crossover Dataset. Information 2026, 17, 293.
- Misch, M.; Medani, K.; Rhisheekesan, A.; Manjila, S. Artificial Intelligence and Carpal Tunnel Syndrome: A systematic review and contemporary update on imaging techniques. Hand Surg. Rehabil. 2025, 44, 102264.
- Bakalis, D.; Kontogiannis, P.; Ntais, E.; Simos, Y.V.; Tsamis, K.I.; Manis, G. Carpal Tunnel Syndrome Automated Diagnosis: A Motor vs. Sensory Nerve Conduction-Based Approach. Bioengineering 2024, 11, 175.
- Thakur, N.V.; Yenurkar, G.K.; Aherrao, A.; Aherrao, A.; Landge, S.; Katre, S. Medical Image Fusion Using Discrete Wavelet Transform: In view of Deep Learning. In Proceedings of the 1st DMIHER International Conference on Artificial Intelligence in Education and Industry 4.0, IDICAIEI, Wardha, India; IEEE: Piscataway, NJ, USA, 2023.
- Elseddik, M.; Alnowaiser, K.; Mostafa, R.R.; Elashry, A.; El-Rashidy, N.; Elgamal, S.; El-Bakry, H. Deep Learning-Based Approaches for Enhanced Diagnosis and Comprehensive Understanding of Carpal Tunnel Syndrome. Diagnostics 2023, 13, 3211.
- Yetiş, M.; Kocaman, H.; Canli, M.; Yildirim, H.; Yetiş, A.; Ceylan, I. Carpal tunnel syndrome prediction with machine learning algorithms using anthropometric and strength-based measurement. PLoS ONE 2024, 19, e0300044.
- Sasaki, T.; Koyama, T.; Kuroiwa, T.; Nimura, A.; Okawa, A.; Wakabayashi, Y.; Fujita, K. Evaluation of the Existing Electrophysiological Severity Classifications in Carpal Tunnel Syndrome. J. Clin. Med. 2022, 11, 1685.
- Pellicer-Valero, O.J.; Martín-Guerrero, J.D.; Fernández-de-las-Peñas, C.; De-la-Llave-Rincón, A.I.; Rodríguez-Jiménez, J.; Navarro-Pardo, E.; Cigarán-Méndez, M.I. Spectral Clustering Reveals Different Profiles of Central Sensitization in Women with Carpal Tunnel Syndrome. Symmetry 2021, 13, 1042.
- Wei, Y.; Zhang, W.; Gu, F. Towards Diagnosis of Carpal Tunnel Syndrome Using Machine Learning. ACM Int. Conf. Proc. Ser. 2020, 7, 76–82.
- Lyu, S.; Zhang, M.; Yu, J.; Zhu, J.; Zhang, B.; Gao, L.; Chen, Q. Application of radiomics model based on ultrasound image features in the prediction of carpal tunnel syndrome severity. Skelet. Radiol. 2024, 53, 1389–1397.
- Öten, E.; Aygün Bilecik, N.; Uğur, L. Use of machine learning methods in diagnosis of carpal tunnel syndrome. Comput. Methods Biomech. Biomed. Eng. 2026, 29, 838–848.
- Shinohara, I.; Inui, A.; Mifune, Y.; Nishimoto, H.; Yamaura, K.; Mukohara, S.; Kuroda, R. Using deep learning for ultrasound images to diagnose carpal tunnel syndrome with high accuracy. Ultrasound Med. Biol. 2022, 48, 2052–2059.
- Ardakani, A.A.; Afshar, A.; Bhatt, S.; Bureau, N.J.; Tahmasebi, A.; Acharya, U.R.; Mohammadi, A. Diagnosis of carpal tunnel syndrome: A comparative study of shear wave elastography, morphometry and artificial intelligence techniques. Pattern Recognit. Lett. 2020, 133, 77–85.
- Horng, M.H.; Yang, C.W.; Sun, Y.N.; Yang, T.H. DeepNerve: A New Convolutional Neural Network for the Localization and Segmentation of the Median Nerve in Ultrasound Image Sequences. Ultrasound Med. Biol. 2020, 46, 2439–2452.
- Wu, C.H.; Syu, W.T.; Lin, M.T.; Yeh, C.L.; Boudier-Revéret, M.; Hsiao, M.Y.; Kuo, P.L. Automated segmentation of median nerve in dynamic sonography using deep learning: Evaluation of model performance. Diagnostics 2021, 11, 1893.
- Wang, Y.W.; Chang, R.F.; Horng, Y.S.; Chen, C.J. MNT-DeepSL: Median nerve tracking from carpal tunnel ultrasound images with deep similarity learning and analysis on continuous wrist motions. Comput. Med. Imaging Graph. 2020, 80, 101687.
- Wang, Y.W.; Chang, R.F.; Horng, Y.S.; Chen, C.J. A screening method using anomaly detection on a smartphone for patients with carpal tunnel syndrome: Diagnostic case-control study. JMIR Mhealth Uhealth 2021, 80, 101687.
- Tsamis, K.I.; Kontogiannis, P.; Gourgiotis, I.; Ntabos, S.; Sarmas, I.; Manis, G. Automatic electrodiagnosis of carpal tunnel syndrome using machine learning. Bioengineering 2021, 8, 181.
- Harrison, C.J.; Geoghegan, L.; Sidey-Gibbons, C.J.; Stirling, P.H.C.; McEachan, J.E.; Rodrigues, J.N. Developing Machine Learning Algorithms to Support Patient-centered, Value-based Carpal Tunnel Decompression Surgery. Plast. Reconstr. Surg. Glob. Open 2022, 10, e4279.
- Zhou, H.; Bai, Q.; Hu, X.; Alhaskawi, A.; Dong, Y.; Wang, Z.; Lu, H. Deep CTS: A deep neural network for identification MRI of carpal tunnel syndrome. J. Digit. Imaging 2022, 35, 1433–1444.
- Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 1–48.
- Yuce, E.; Sahin, M.E.; Ulutas, H.; Erkoç, M.F. Efficient Cerebral Infarction Segmentation Using U-Net and U-Net3+ Models. J. Imaging Inform. Med. 2026, 39, 1253–1264.
- Fernández, A.; García, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); IEEE: Piscataway, NJ, USA, 2008; pp. 1322–1328.
- Çubukçu, H.C.; Topcu, D.İ.; Yenice, S. Machine learning-based clinical decision support using laboratory data. Clin. Chem. Lab. Med. 2023, 62, 793–823.
- Masood, A.; Naseem, U.; Rashid, J.; Kim, J.; Razzak, I. Review on enhancing clinical decision support system using machine learning. CAAI Trans. Intell. Technol. 2024, 1–14.
- Sahin, M.E. Image processing machine learning-based bone fracture detection classification using X-ray images. Int. J. Imaging Syst. Technol. 2023, 33, 853–865.
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2019, 21, 6.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 97–101.
- Sirocchi, C.; Bogliolo, A.; Montagna, S. Medical-informed machine learning: Integrating prior knowledge into medical decision systems. BMC Med. Inform. Decis. Mak. 2024, 24, 186.
- Bomrah, S.; Uddin, M.; Upadhyay, U.; Komorowski, M.; Priya, J.; Dhar, E.; Syed-Abdul, S. A scoping review of machine learning for sepsis prediction- feature engineering strategies and model performance: A step towards explainability. Crit. Care 2024, 28, 180.
- Kruskal, W.H.; Wallis, W.A. Use of Ranks in One-Criterion Variance Analysis. J. Am. Stat. Assoc. 1952, 47, 583–621.
- Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. Available online: https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf (accessed on 11 May 2026).

| Ref. | Method | Dataset | Evaluation Metrics |
|---|---|---|---|
| [6] | Machine learning, random up-sampling and SMOTE, 10-fold cross-validation | The dataset contains 1037 CTS hands described by 11 features: 507 mild, 276 moderate, and 254 severe. | XGB overall accuracy 76.6% (71.2–81.5%), the maximum reported |
| [20] | DL | 50 healthy (22 males and 28 females) and 50 patients (19 males and 31 females). | Specificity of 1.00 |
| [21] | CNN and SVM 10-fold cross-validation | 100 CTS and 100 normal wrist images. | CNN 0.980 AUC, SVM 0.943 AUC |
| [22] | DeepNerve method ConvLSTM + U-Net + MaskTrack 4-fold cross-validation | Twenty-four image sequences were captured from six male participants. Each image sequence consists of approximately 420 frames. | F-score 0.9015 |
| [23] | DeepLabV3+, U-Net, FPN and Mask-R-CNN | The dataset included 52 participants, of whom 36 were used for training and 16 for testing based on their varying visual characteristics. | DeepLabV3+ and Mask R-CNN IoU 0.83 |
| [24] | MNT-DeepSL | Ultrasound images were collected from 100 hands (left and right) of 50 individuals, with six wrist movements recorded for each hand. | MNT-DeepSL Accuracy 0.9 |
| [25] | Machine Learning (autoencoder (AE)) | There were 36 participants, 36 hands with CTS and 27 hands without CTS. | AUC of 0.86 |
| [26] | LR, SVM, kNN, DT, NB Feature extraction, K-fold cross-validation (leave-one-out cross-validation). | Data generated by the signals of 65 hands. The set (46 CTS and 19 control) comprised 302 features. | The SVM model achieved 95% neurophysiological and 89% clinical diagnostic accuracy |
| [27] | KNN, SVM, XGB, ANN; 75/25 split, SMOTE, grid search optimization | Data were obtained from a registry comprising records of 1919 consecutive patients who underwent carpal tunnel decompression (CTD) for CTS. | XGB accuracy 0.759 |
| [28] | U-Net and Deep CTS, image processing, 5-fold cross-validation | The dataset consists of 415 pairs of images and labels, including both left and right hands. | Intersection over union (IoU) 0.63 |
| Our study | Machine learning, ADASYN data augmentation, 5-fold cross-validation | 1521 samples in 3 classes | Final stacking model accuracy 0.9115 |
| Variable | Mild (n = 507) | Moderate (n = 276) | Severe (n = 254) | p-Value |
|---|---|---|---|---|
| Age (years, mean ± SD) | 57.3 ± 10.6 | 59.2 ± 10.8 | 57.8 ± 11.2 | 0.069 |
| Male (%) | 39.2 | 44.6 | 32.7 | 0.183 |
| BMI (kg/m²) | 24.2 ± 3.4 | 24.7 ± 3.0 | 25.8 ± 3.7 | <0.001 |
| Diabetes prevalence (%) | 9.3 | 16.3 | 21.6 | <0.001 |
| Symptom duration (months) | 4.3 ± 5.0 | 8.5 ± 8.2 | 15.9 ± 12.8 | <0.001 |
| NRS pain score | 3.3 ± 1.3 | 4.9 ± 1.5 | 6.1 ± 1.5 | <0.001 |
| Thenar weakness/atrophy (%) | 0.2 | 8.7 | 66.5 | <0.001 |
| Stage | Total N | Mild | Moderate | Severe | Features | Note |
|---|---|---|---|---|---|---|
| 1. Raw Dataset [6] | 1037 | 507 | 276 | 254 | 11 | Public dataset |
| 2. After Preprocessing | 1037 | 507 | 276 | 254 | 11 | No missing values |
| 3. Train/Test Split (80/20) | 732 | 358 | 195 | 179 | 11 | Test set: 305 samples (held out) |
| 4. After ADASYN (train only) | 1216 | 406 | 405 | 405 | 11 | Synthetic minority oversampling |
| 5. After Feature Engineering | 1216 | 406 | 405 | 405 | 55 | Polynomial + interaction + binning |
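To make the ADASYN stage in the table above more concrete, the sketch below follows the general scheme of He et al. (listed in the references) for a single minority class: minority rows whose neighbourhoods contain more rows from other classes receive more synthetic samples, which are then interpolated towards a minority neighbour. It is a simplified, hypothetical illustration. The function name, the toy points, `k`, and the rounding of per-row counts are assumptions, and the study's own multi-class implementation is not shown in the source.

```python
import math
import random


def adasyn(X, y, minority, n_new, k=3, seed=0):
    """Generate about n_new synthetic rows for class `minority`."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]

    def neighbours(i, pool):
        # k nearest rows to X[i] among the indices in `pool`, excluding i.
        others = [j for j in pool if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

    # Step 1: hardness r_i = share of other-class rows among the k neighbours.
    all_idx = range(len(X))
    r = [sum(y[j] != minority for j in neighbours(i, all_idx)) / k for i in min_idx]
    total = sum(r)
    if total == 0:  # no minority row borders another class
        return []
    # Step 2: allocate synthetic rows in proportion to hardness.
    counts = [round(ri / total * n_new) for ri in r]
    # Step 3: interpolate between each row and a random minority neighbour.
    synthetic = []
    for i, g in zip(min_idx, counts):
        for _ in range(g):
            j = rng.choice(neighbours(i, min_idx))
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy data: five majority rows (label 0) and three minority rows (label 1).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [3, 3], [3, 4], [1.5, 1.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1]
new_rows = adasyn(X, y, minority=1, n_new=5)
print(len(new_rows))
```

In this toy set the minority point at (1.5, 1.5) sits next to the majority cluster, so it receives three of the five synthetic rows while the two well-separated minority points receive one each. This adaptive weighting is what distinguishes ADASYN from uniform SMOTE-style oversampling.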
| Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max | Skew | Kurtosis | N_Missing | Pct_Missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1216 | 58.0559 | 10.5921 | 19.0 | 52.0 | 58.0 | 64.0 | 87.0 | −0.0991 | 0.2468 | 0 | 0.0 |
| BMI | 1216 | 24.8325 | 3.1938 | 17.7095 | 22.6222 | 24.4076 | 26.7782 | 42.5980 | 0.7086 | 1.1311 | 0 | 0.0 |
| CSA | 1216 | 15.5201 | 4.2296 | 7.0 | 13.0 | 15.0 | 17.6907 | 37.0 | 1.2229 | 2.7762 | 0 | 0.0 |
| PB | 1216 | 2.5877 | 1.7483 | 0.5 | 1.8 | 2.3767 | 3.0080 | 30.0 | 8.9990 | 111.2309 | 0 | 0.0 |
| Duration | 1216 | 8.8133 | 9.2346 | 0.0 | 3.0 | 5.0 | 12.0 | 60.0 | 1.8296 | 3.4730 | 0 | 0.0 |
| NRS | 1216 | 4.5953 | 1.7032 | 1.0 | 3.0 | 5.0 | 6.0 | 10.0 | 0.1787 | −0.1683 | 0 | 0.0 |
| Sex | 1216 | 0.5189 | 0.4998 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | −0.0758 | −1.9975 | 0 | 0.0 |
| Side | 1216 | 0.3700 | 0.4830 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5388 | −1.7124 | 0 | 0.0 |
| Diabetes | 1216 | 0.0970 | 0.2961 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.7259 | 5.4398 | 0 | 0.0 |
| NP | 1216 | 0.4580 | 0.4984 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.1685 | −1.9748 | 0 | 0.0 |
| Weakness | 1216 | 0.1973 | 0.3981 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.5225 | 0.3188 | 0 | 0.0 |
| Severity | 1216 | 0.9991 | 0.8170 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | | | 0 | 0.0 |

| Feature Type | Feature | Test Applied | p-Value | Significance (α = 0.05) |
|---|---|---|---|---|
| Numerical | Age | Kruskal–Wallis | 0.3165 | Not Significant |
| Numerical | BMI | Kruskal–Wallis | <0.0001 | Significant |
| Numerical | CSA | Kruskal–Wallis | <0.0001 | Significant |
| Numerical | PB | Kruskal–Wallis | <0.0001 | Significant |
| Numerical | Duration | Kruskal–Wallis | <0.0001 | Significant |
| Numerical | NRS | Kruskal–Wallis | <0.0001 | Significant |
| Categorical | Sex | Chi-square | 0.0011 | Significant |
| Categorical | Side | Chi-square | 0.0001 | Significant |
| Categorical | Diabetes | Chi-square | 0.0007 | Significant |
| Categorical | NP | Chi-square | <0.0001 | Significant |
| Categorical | Weakness | Chi-square | <0.0001 | Significant |

| Model | Accuracy (Mean ± SD) | F1-Macro (Mean ± SD) | Status |
|---|---|---|---|
| LightGBM ★ | 0.8890 ± 0.0329 | 0.8894 ± 0.0323 | SELECTED—top accuracy |
| XGBoost ★ | 0.8808 ± 0.0284 | 0.8812 ± 0.0280 | SELECTED—consistent performance |
| Random Forest ★ | 0.8799 ± 0.0433 | 0.8810 ± 0.0420 | SELECTED—diversity for ensemble |
| MLP | 0.8652 ± 0.0237 | 0.8649 ± 0.0238 | Not selected—lower accuracy |
| CatBoost ★ | 0.8561 ± 0.0369 | 0.8571 ± 0.0359 | SELECTED—categorical handling advantage |
| Decision Tree | 0.7541 ± 0.0320 | 0.7568 ± 0.0292 | Not selected |
| SVM (RBF) | 0.7385 ± 0.0331 | 0.7404 ± 0.0334 | Not selected |
| Logistic Regression | 0.6957 ± 0.0436 | 0.6966 ± 0.0427 | Not selected |
| K-Nearest Neighbors | 0.6891 ± 0.0421 | 0.6903 ± 0.0421 | Not selected |
| Naive Bayes | 0.5164 ± 0.0194 | 0.4803 ± 0.0271 | Not selected |

Hyperparameter Summary for Models

| Models | Parameters | Values |
|---|---|---|
| XGBoost | learning_rate | 0.0496 |
| XGBoost | n_estimators | 450 |
| XGBoost | max_depth | 8 |
| XGBoost | subsample | 0.765 |
| XGBoost | colsample_bytree | 0.543 |
| Random Forest | n_estimators | 550 |
| Random Forest | max_depth | 26 |
| Random Forest | max_features | 'log2' |
| LightGBM (base model) | learning_rate | 0.0798 |
| LightGBM (base model) | n_estimators | 500 |
| LightGBM (base model) | num_leaves | 40 |
| LightGBM (base model) | max_depth | 6 |
| LightGBM (base model) | subsample | 0.6005 |
| CatBoost (base model) | learning_rate | 0.0707 |
| CatBoost (base model) | iterations | 450 |
| CatBoost (base model) | depth | 7 |
| Meta-Learner (Stacking LGBM) | learning_rate | 0.00394 |
| Meta-Learner (Stacking LGBM) | n_estimators | 600 |
| Meta-Learner (Stacking LGBM) | max_depth | 9 |
| Meta-Learner (Stacking LGBM) | num_leaves | 70 |
| Meta-Learner (Stacking LGBM) | subsample | 0.782 |

| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1-Score (Macro) | Balanced Accuracy | ROC AUC (Macro) | MCC | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.9049 | 0.9051 | 0.9049 | 0.9048 | 0.9049 | 0.9695 | 0.8576 | 0.8574 |
| Random Forest | 0.9016 | 0.9018 | 0.9017 | 0.9009 | 0.9017 | 0.9731 | 0.8533 | 0.8525 |
| LightGBM | 0.9049 | 0.9057 | 0.9049 | 0.9051 | 0.9049 | 0.9671 | 0.8576 | 0.8574 |
| CatBoost | 0.9049 | 0.9044 | 0.9049 | 0.9046 | 0.9049 | 0.9774 | 0.8575 | 0.8574 |
| Final Stacking | 0.9115 | 0.9112 | 0.9115 | 0.9113 | 0.9115 | 0.9708 | 0.8672 | 0.8672 |
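The MCC and Cohen's Kappa columns above are both derived from the confusion matrix. A minimal sketch of the multi-class formulas follows, with MCC in its multi-class generalization. The 3×3 matrix is a hypothetical example, not the study's confusion matrix, which is not reproduced in this section.

```python
import math


def mcc_and_kappa(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    s = sum(map(sum, cm))                                     # total samples
    c = sum(cm[i][i] for i in range(n))                       # correct predictions
    t = [sum(cm[i]) for i in range(n)]                        # true count per class
    p = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted count per class
    tp = sum(tk * pk for tk, pk in zip(t, p))
    denom = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    mcc = (c * s - tp) / denom if denom else 0.0
    p_o, p_e = c / s, tp / (s * s)   # observed and chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return mcc, kappa


# Hypothetical confusion matrix (rows: true Mild/Moderate/Severe).
toy_cm = [[50, 5, 0], [4, 40, 6], [1, 3, 46]]
mcc, kappa = mcc_and_kappa(toy_cm)
print(round(mcc, 4), round(kappa, 4))
```

The two statistics share the same numerator and differ only in how the denominator normalizes it, which is why they come out nearly identical when the predicted and true class distributions are similar, as in the table above.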
| Model | Mild P | Mild R | Mild F1 | Mod P | Mod R | Mod F1 | Sev P | Sev R | Sev F1 |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.877 | 0.921 | 0.899 | 0.879 | 0.853 | 0.866 | 0.960 | 0.941 | 0.951 |
| Random Forest | 0.857 | 0.891 | 0.874 | 0.822 | 0.814 | 0.818 | 0.960 | 0.931 | 0.945 |
| LightGBM | 0.877 | 0.921 | 0.899 | 0.880 | 0.863 | 0.871 | 0.970 | 0.941 | 0.955 |
| CatBoost | 0.811 | 0.891 | 0.849 | 0.846 | 0.755 | 0.798 | 0.952 | 0.961 | 0.956 |
| Stacking (Proposed) | 0.839 | 0.931 | 0.883 | 0.884 | 0.824 | 0.853 | 0.969 | 0.931 | 0.950 |
| Meta-Learner | Accuracy (Mean ± SD) | F1-Macro (Mean ± SD) | Selection Rationale |
|---|---|---|---|
| Logistic Regression | 0.8849 ± 0.0203 | 0.8855 ± 0.0196 | Linear—may underfit ensemble interactions |
| Random Forest | 0.8939 ± 0.0202 | 0.8944 ± 0.0194 | Highest mean accuracy; higher variance |
| XGBoost | 0.8882 ± 0.0039 | 0.8883 ± 0.0036 | Very low variance but moderate accuracy |
| LightGBM (Selected) | 0.8865 ± 0.0139 | 0.8864 ± 0.0137 | Optimal accuracy–stability balance; selected |
| Category | Metric/Value |
|---|---|
| Total Samples | 1521 |
| Features (Final) | 55 |
| Classes | 3 |
| Training Samples | 1216 |
| Test Samples | 305 |
| Final Model Accuracy | 91.15% |
| Final Model F1-Score | 91.13% |
| Final Model ROC AUC | 0.9708 |
| Best Individual Model | XGBoost (90.49% Acc) |
| Misclassified Samples | 27 |
| Metric | Value | p-Value | Top-15 Overlap | N Instances |
|---|---|---|---|---|
| Spearman ρ (SHAP vs. LIME global ranks) | 0.2808 | 0.036 * | 5/15 | 50 |
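The two agreement measures in the table above, Spearman's ρ between global importance rankings and top-k overlap, can be computed as in the following sketch. The feature names come from the dataset, but the importance scores are hypothetical, and the source does not state how local LIME weights were aggregated into a global ranking.

```python
def average_ranks(values):
    """Rank values descending (1 = most important); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for pos in range(i, j):
            ranks[order[pos]] = (i + 1 + j) / 2
        i = j
    return ranks


def spearman_rho(a, b):
    """Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def top_k_overlap(names, a, b, k):
    """Number of features that appear in the top k of both score lists."""
    def top(scores):
        return {n for _, n in sorted(zip(scores, names), reverse=True)[:k]}
    return len(top(a) & top(b))


# Hypothetical global importance scores for five features.
names = ["CSA", "Duration", "NRS", "BMI", "Age"]
shap_scores = [0.40, 0.25, 0.20, 0.10, 0.05]
lime_scores = [0.30, 0.10, 0.35, 0.20, 0.05]

rho = spearman_rho(shap_scores, lime_scores)
overlap = top_k_overlap(names, shap_scores, lime_scores, k=3)
print(round(rho, 3), overlap)
```

Reporting both numbers is useful because they answer different questions: ρ measures agreement over the whole ranking, while top-k overlap measures agreement only on the features a clinician is most likely to inspect.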
| Error Type | Count | Rate (%) | Clinical Implication |
|---|---|---|---|
| Correct classification | 273 | 89.5% | No clinical impact |
| Moderate impact (adjacent class) | 29 | 9.5% | Minor treatment adjustment needed |
| Critical error (Severe ↔ Mild) | 3 | 1.0% | Surgical patient missed or overtreated |
| Severe predicted as Mild (worst case) | 3 | 1.0% | Surgery delayed—highest clinical cost |
| Mild predicted as Severe | 0 | 0.0% | Unnecessary surgery referral—none occurred |
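The error categories in the table above follow directly from a confusion matrix over ordered classes: off-diagonal cells one step from the diagonal are adjacent-class errors, and the two corner cells are the critical Severe ↔ Mild errors. A minimal sketch, using a hypothetical matrix rather than the study's own:

```python
def impact_summary(cm):
    """cm[i][j]: true class i predicted as j, ordered Mild(0), Moderate(1), Severe(2)."""
    total = sum(map(sum, cm))
    counts = {"correct": 0, "adjacent": 0, "severe_as_mild": 0, "mild_as_severe": 0}
    for true in range(3):
        for pred in range(3):
            n = cm[true][pred]
            if true == pred:
                counts["correct"] += n
            elif abs(true - pred) == 1:
                counts["adjacent"] += n
            elif true == 2:
                counts["severe_as_mild"] += n   # surgery delayed
            else:
                counts["mild_as_severe"] += n   # unnecessary referral
    # Return (count, rate in percent) for each category.
    return {k: (v, round(100 * v / total, 1)) for k, v in counts.items()}


# Hypothetical confusion matrix (rows: true Mild/Moderate/Severe).
toy_cm = [[50, 5, 0], [4, 40, 6], [1, 3, 46]]
summary = impact_summary(toy_cm)
print(summary)
```

Splitting the critical errors by direction matters clinically because the two directions carry different costs: a Severe case predicted as Mild delays surgery, whereas the reverse triggers an unnecessary referral.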
| Configuration | Accuracy | F1-Macro | MCC | ROC-AUC |
|---|---|---|---|---|
| C1: Baseline (No ADASYN, No FE, No Stacking) | 0.756 ± 0.030 | 0.730 ± 0.033 | 0.600 ± 0.043 | 0.890 ± 0.014 |
| C2: +ADASYN (No FE, No Stacking) | 0.755 ± 0.020 | 0.733 ± 0.022 | 0.610 ± 0.033 | 0.884 ± 0.016 |
| C3: +Feature Engineering (No Stacking) | 0.887 ± 0.027 | 0.888 ± 0.026 | 0.835 ± 0.038 | 0.973 ± 0.007 |
| C4: +Stacking (Proposed—Full Pipeline) | 0.884 ± 0.031 | 0.885 ± 0.030 | 0.827 ± 0.045 | 0.945 ± 0.013 |
| Strategy | Accuracy | F1-Macro | Severe Recall | MCC |
|---|---|---|---|---|
| No Augmentation | 0.748 ± 0.023 | 0.722 ± 0.026 | 0.776 ± 0.026 | 0.596 ± 0.039 |
| Class-Weight (Balanced) | 0.747 ± 0.035 | 0.727 ± 0.036 | 0.791 ± 0.044 | 0.598 ± 0.058 |
| SMOTE | 0.747 ± 0.023 | 0.725 ± 0.026 | 0.795 ± 0.015 | 0.597 ± 0.039 |
| ADASYN (Proposed) | 0.755 ± 0.022 | 0.734 ± 0.024 | 0.811 ± 0.035 | 0.611 ± 0.036 |
| Model | Train Acc (Mean ± SD) | CV Acc (Mean ± SD) | Gap |
|---|---|---|---|
| XGBoost | 1.0000 ± 0.0000 | 0.8873 ± 0.0266 | 0.1127 |
| RandomForest | 0.9883 ± 0.0026 | 0.8684 ± 0.0468 | 0.1199 |
| LightGBM | 1.0000 ± 0.0000 | 0.8923 ± 0.0290 | 0.1077 |
| CatBoost | 0.9852 ± 0.0026 | 0.8676 ± 0.0306 | 0.1176 |
| Stacking (meta) | 0.9992 ± 0.0004 | 0.8923 ± 0.0163 | 0.1069 |
| Model | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | SD |
|---|---|---|---|---|---|---|---|
| LightGBM | 0.8770 | 0.8436 | 0.9095 | 0.9053 | 0.9259 | 0.8923 | 0.0290 |
| XGBoost | 0.8893 | 0.8436 | 0.9012 | 0.8930 | 0.9218 | 0.8873 | 0.0257 |
| Stacking (Proposed) | 0.8893 | 0.8313 | 0.8724 | 0.9136 | 0.9136 | 0.8840 | 0.0306 |
| RandomForest | 0.8730 | 0.7778 | 0.8930 | 0.8889 | 0.9095 | 0.8684 | 0.0468 |
| CatBoost | 0.8648 | 0.8148 | 0.8889 | 0.8642 | 0.9053 | 0.8676 | 0.0306 |
| Test | χ² Statistic | p-Value | Decision |
|---|---|---|---|
| Friedman Test (k = 5 classifiers, n = 5 folds) | 12.939 | 0.0116 | Reject H0 (p < 0.05) |
| Model | Average Rank | Interpretation |
|---|---|---|
| LightGBM | 1.700 | Best-ranked classifier |
| XGBoost | 2.000 | Second best |
| Stacking (Proposed) | 2.700 | Third—consistently competitive |
| RandomForest | 4.000 | Fourth |
| CatBoost | 4.600 | Fifth (NaN-affected folds imputed) |
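The average ranks and the Friedman statistic can be recomputed from the fold-wise accuracies in the cross-validation table above. The sketch below assumes the standard tie-corrected Friedman formula and ranks models within each fold (1 = best, ties share the mean rank). The closed-form p-value is valid only for four degrees of freedom, which is the case here with k = 5 classifiers.

```python
import math

# Fold-wise accuracies copied from the cross-validation table above.
folds = {
    "LightGBM":     [0.8770, 0.8436, 0.9095, 0.9053, 0.9259],
    "XGBoost":      [0.8893, 0.8436, 0.9012, 0.8930, 0.9218],
    "Stacking":     [0.8893, 0.8313, 0.8724, 0.9136, 0.9136],
    "RandomForest": [0.8730, 0.7778, 0.8930, 0.8889, 0.9095],
    "CatBoost":     [0.8648, 0.8148, 0.8889, 0.8642, 0.9053],
}


def rank_row(scores):
    """Rank scores descending within one fold; return ranks and the tie term."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks, tie_term, i = [0.0] * len(scores), 0, 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        for pos in range(i, j):
            ranks[order[pos]] = (i + 1 + j) / 2   # mean rank for tied scores
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    return ranks, tie_term


models = list(folds)
k, n = len(models), 5
rank_sums, ties = [0.0] * k, 0
for f in range(n):
    ranks, tie_term = rank_row([folds[m][f] for m in models])
    rank_sums = [s + r for s, r in zip(rank_sums, ranks)]
    ties += tie_term
avg_rank = {m: s / n for m, s in zip(models, rank_sums)}

chi2 = 12 * n / (k * (k + 1)) * (
    sum(r * r for r in avg_rank.values()) - k * (k + 1) ** 2 / 4
)
chi2 /= 1 - ties / (n * k * (k * k - 1))   # tie correction
# Chi-square survival function; this closed form holds for df = k - 1 = 4 only.
p = math.exp(-chi2 / 2) * (1 + chi2 / 2)
print(avg_rank, round(chi2, 3), round(p, 4))
```

Under these assumptions the sketch reproduces the reported average ranks (1.7, 2.0, 2.7, 4.0, 4.6), χ² ≈ 12.939, and p ≈ 0.0116. The uncorrected statistic would be 12.68, so the tie correction for the two tied folds accounts for the difference.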
| Model | XGBoost | RandomForest | LightGBM | CatBoost | Stacking |
|---|---|---|---|---|---|
| XGBoost | — | 0.266 | 0.998 | 0.070 | 0.957 |
| RandomForest | 0.266 | — | 0.145 | 0.975 | 0.691 |
| LightGBM | 0.998 | 0.145 | — | 0.031 * | 0.855 |
| CatBoost | 0.070 | 0.975 | 0.031 * | — | 0.317 |
| Stacking | 0.957 | 0.691 | 0.855 | 0.317 | — |
| Comparison | ΔAcc | Wilcoxon W | p (Wilc.) | t-Stat | p (t-Test) | Cohen’s d |
|---|---|---|---|---|---|---|
| Stacking vs. XGBoost | −0.0058 | 3.0 | 0.465 | −0.713 | 0.516 | −0.356 |
| Stacking vs. RandomForest | +0.0156 | 3.0 | 0.313 | 1.284 | 0.268 | 0.642 |
| Stacking vs. LightGBM | −0.0082 | 3.0 | 0.313 | −0.934 | 0.403 | −0.467 |
| Stacking vs. CatBoost | +0.0164 | 2.0 | 0.188 | 1.533 | 0.200 | 0.766 |

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Sahin, M.E.; Ulutas, H.; Korkmaz, M.; Ozbay Karakus, M.; Er, O.; Unluel, H. Explainable Ensemble Learning for Robust Severity Stratification of Carpal Tunnel Syndrome from Clinical Data. Diagnostics 2026, 16, 1604. https://doi.org/10.3390/diagnostics16111604

