Comparative Analysis of Resampling Techniques for Class Imbalance in Financial Distress Prediction Using XGBoost
Abstract
1. Introduction
2. Related Works
2.1. Oversampling Techniques
2.2. Undersampling Techniques
2.3. Hybrid Resampling Techniques
2.4. Prediction Models with Built-In Resampling Techniques
2.5. Summary
3. Materials and Methods
3.1. Data
3.2. Experimental Design
3.2.1. Data Normalization
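A minimal sketch of this step, assuming scikit-learn. The choice of min-max scaling to [0, 1] is an assumption (the paper does not fix a specific scaler here), and `X_raw` is a hypothetical placeholder for the raw indicator matrix:

```python
from sklearn.preprocessing import MinMaxScaler

# Normalize all financial indicators to the [0, 1] range.
# MinMaxScaler is an assumed choice; any standard scaler fits the pipeline shape.
scaler = MinMaxScaler()
X = scaler.fit_transform(X_raw)
```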
3.2.2. Stratified Splitting of the Dataset
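A minimal sketch of the stratified split, assuming scikit-learn. The 70/30 proportion is inferred from the class counts reported in the Results (training: 16,241 vs. 2227; test: 6961 vs. 954); `X` and `y` are placeholders for the normalized features and distress labels:

```python
from sklearn.model_selection import train_test_split

# Stratified 70/30 split: both subsets keep the original ~7.29:1 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```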
3.2.3. Resampling and Hybrid Techniques
- (1) Synthetic Minority Oversampling Technique (SMOTE)
- (2) Borderline-SMOTE
- (3) Adaptive Synthetic Sampling (ADASYN)
- (4) Random Undersampling (RUS)
- (5) Tomek Links
- (6) SMOTE-Tomek
- (7) SMOTE-Edited Nearest Neighbors (SMOTE-ENN)
- (8) Bagging-SMOTE
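Except for Bagging-SMOTE, which is a composite procedure (a sketch of it follows the technique summary table later in this article), the techniques listed above have reference implementations in the imbalanced-learn library. The following is a minimal sketch of how they might be instantiated; `X_train` and `y_train` are assumed placeholders for the stratified training split:

```python
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN

samplers = {
    "SMOTE": SMOTE(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
    "Tomek Links": TomekLinks(),
    "SMOTE-Tomek": SMOTETomek(random_state=42),
    "SMOTE-ENN": SMOTEENN(random_state=42),
}

# Every sampler shares the fit_resample interface; only the training set is resampled.
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    classes, counts = np.unique(y_res, return_counts=True)
    print(name, dict(zip(classes, counts)))
```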
3.2.4. XGBoost Prediction Model
- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives
- TPR (True Positive Rate), also known as sensitivity or recall, is defined as TPR = TP / (TP + FN).
- FPR (False Positive Rate) is defined as FPR = FP / (FP + TN).
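As a concrete check, the sketch below recomputes the headline metrics from raw confusion-matrix counts; the input numbers are the RUS entries from the confusion-matrix table in the Results (TN = 6040, FP = 921, FN = 117, TP = 837):

```python
import math

def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # TPR (sensitivity)
    fpr = fp / (fp + tn)                         # false positive rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": round(accuracy, 2), "precision": round(precision, 2),
            "recall": round(recall, 2), "f1": round(f1, 2),
            "fpr": round(fpr, 2), "mcc": round(mcc, 2)}

# RUS row of the confusion-matrix table: reproduces accuracy 0.87, precision 0.48,
# recall 0.88, F1 0.62, and MCC 0.58 reported in the results table.
print(metrics(tp=837, tn=6040, fp=921, fn=117))
```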
4. Results
4.1. Impact of Resampling Techniques Toward the Training Set
4.2. Impact of Resampling Techniques Toward Prediction Model
5. Discussion
6. Conclusions
- Resampling is critical for improving minority-class detection. Traditional oversampling methods, especially SMOTE, showed meaningful improvements in F1-score (up to 0.73) and MCC (up to 0.70), indicating enhanced model robustness without excessive overfitting. Boundary-aware techniques such as SMOTE-Tomek and Borderline-SMOTE further improved recall while largely maintaining precision, highlighting their value in refining decision boundaries and reducing class overlap. Importantly, these methods also had modest computational demands, with sampling times of roughly 2–6 s and training times of approximately 70 s, making them practical for real-world applications.
- Trade-offs among recall, precision, and computational efficiency must be weighed in context. Aggressive techniques such as RUS yielded the highest recall (0.88), but at the cost of poor precision (0.48) and diminished generalization, mainly due to the loss of informative majority-class samples. RUS was also the most computationally efficient method, with a sampling time of 1.6 s and a training time of 14.9 s, making it suitable for scenarios where detection sensitivity and speed are prioritized. In contrast, Tomek Links offered a strong balance between precision (0.82) and efficiency (sampling: 2.7 s, training: 41.0 s), making it attractive for applications where computational speed and minimizing false positives are critical. SMOTE-ENN achieved high recall (0.77) at a moderate precision level (0.63), suggesting a tendency toward increased false positives, at a reasonable computational cost (sampling: 6.1 s, training: 58.1 s).
- Bagging combined with stratified SMOTE (Bagging-SMOTE) achieved a strong overall balance across evaluation metrics. Applying stratified SMOTE with an appropriate sampling ratio (0.15) to each bootstrap sample and aggregating predictions through bagging led to competitive results: AUC of 0.96, F1-score of 0.72, PR-AUC of 0.80, and MCC of 0.68. However, this method required a much longer training time (400.3 s), which may limit its practicality in time-sensitive applications. These results show that ensemble-based resampling can improve both sensitivity and generalization, which is important in financial applications.
- The choice of resampling technique should be tailored to application goals. For early warning systems that prioritize recall and timely identification of financially distressed companies, techniques such as Bagging-SMOTE, Borderline-SMOTE, SMOTE-Tomek, and SMOTE are recommended. In contrast, when minimizing false positives is important, such as in regulatory flagging or risk-adjusted decision-making, more conservative methods, such as Tomek Links, may be more appropriate.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Song, Y.; Jiang, M.; Li, S.; Zhao, S. Class-imbalanced Financial Distress Prediction with Machine Learning: Incorporating Financial, Management, Textual, and Social Responsibility Features into Index System. J. Forecast. 2024, 43, 593–614. [Google Scholar] [CrossRef]
- Engin, U. Financial Distress Prediction from Time Series Data Using XGBoost: BIST100 of Borsa Istanbul. Doğuş Üniversitesi Derg. 2023, 24, 589–604. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Wang, B.; Mei, C.; Wang, Y.; Zhou, Y.; Cheng, M.-T.; Zheng, C.-H.; Wang, L.; Zhang, J.; Chen, P.; Xiong, Y. Imbalance Data Processing Strategy for Protein Interaction Sites Prediction. IEEE/ACM Trans. Comput. Biol. Bioinf. 2021, 18, 985–994. [Google Scholar] [CrossRef]
- Cai, T. Breast Cancer Diagnosis Using Imbalanced Learning and Ensemble Method. Appl. Comput. Math. 2018, 7, 146. [Google Scholar] [CrossRef]
- Elreedy, D.; Atiya, A.F. A Novel Distribution Analysis for SMOTE Oversampling Method in Handling Class Imbalance. In Computational Science–Proceedings of the ICCS 2019: 19th International Conference, Faro, Portugal, 12–14 June 2019; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 236–248. [Google Scholar] [CrossRef]
- Alex, S.A.; Nayahi, J.J.V. Classification of Imbalanced Data Using SMOTE and AutoEncoder Based Deep Convolutional Neural Network. Int. J. Unc. Fuzz. Knowl. Based Syst. 2023, 31, 437–469. [Google Scholar] [CrossRef]
- Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing; Springer Berlin Heidelberg: Berlin, Heidelberg, 2005; Volume 3644, pp. 878–887. ISBN 978-3-540-28226-6. [Google Scholar] [CrossRef]
- Chen, C.; Shen, W.; Yang, C.; Fan, W.; Liu, X.; Li, Y. A New Safe-Level Enabled Borderline-SMOTE for Condition Recognition of Imbalanced Dataset. IEEE Trans. Instrum. Meas. 2023, 72, 1–10. [Google Scholar] [CrossRef]
- Glazkova, A. A Comparison of Synthetic Oversampling Methods for Multi-Class Text Classification. arXiv 2020, arXiv:2008.04636. [Google Scholar] [CrossRef]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: Hong Kong, China, 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
- Pristyanto, Y.; Nugraha, A.F.; Dahlan, A.; Wirasakti, L.A.; Ahmad Zein, A.; Pratama, I. Multiclass Imbalanced Handling Using ADASYN Oversampling and Stacking Algorithm. In Proceedings of the 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM), Seoul, Republic of Korea, 3 January 2022; IEEE: Seoul, Republic of Korea, 2022; pp. 1–5. [Google Scholar] [CrossRef]
- Arifin, M.A.S.; Stiawan, D.; Yudho Suprapto, B.; Susanto, S.; Salim, T.; Idris, M.Y.; Budiarto, R. Oversampling and Undersampling for Intrusion Detection System in the Supervisory Control and Data Acquisition IEC 60870-5-104. IET Cyber-Phys. Syst. Theory Appl. 2024, 9, 282–292. [Google Scholar] [CrossRef]
- Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
- Kaushik, M.M.; Mahmud, S.M.H.; Kabir, M.A.; Nandi, D. The Effects of Class Rebalancing Techniques on Ensemble Classifiers on Credit Card Fraud Detection: An Empirical Study. In Proceedings of the Applied Data Science and Smart Systems, Rajpura, India, 4–5 November 2022; AIP Publishing LLC: Rajpura, India, 2023; p. 030011. [Google Scholar] [CrossRef]
- Zang, J.; Li, H. Abnormal Traffic Detection Based on Data Augmentation and Hybrid Neural Network. In Proceedings of the 2024 2nd International Conference on Signal Processing and Intelligent Computing (SPIC), Guangzhou, China, 20 September 2024; IEEE: Guangzhou, China, 2024; pp. 249–253. [Google Scholar] [CrossRef]
- Putra, L.G.R.; Marzuki, K.; Hairani, H. Correlation-Based Feature Selection and Smote-Tomek Link to Improve the Performance of Machine Learning Methods on Cancer Disease Prediction. Eng. Appl. Sci. Res. 2023, 50, 577–583. [Google Scholar] [CrossRef]
- Yang, F.; Wang, K.; Sun, L.; Zhai, M.; Song, J.; Wang, H. A Hybrid Sampling Algorithm Combining Synthetic Minority Over-Sampling Technique and Edited Nearest Neighbor for Missed Abortion Diagnosis. BMC Med. Inform. Decis. Mak. 2022, 22, 344. [Google Scholar] [CrossRef]
- Wang, W.; Liang, Z. Financial Distress Early Warning for Chinese Enterprises from a Systemic Risk Perspective: Based on the Adaptive Weighted XGBoost-Bagging Model. Systems 2024, 12, 65. [Google Scholar] [CrossRef]
- Liu, W.; Fan, H.; Xia, M.; Pang, C. Predicting and Interpreting Financial Distress Using a Weighted Boosted Tree-Based Tree. Eng. Appl. Artif. Intell. 2022, 116, 105466. [Google Scholar] [CrossRef]
- Wu, C.; Chen, X.; Jiang, Y. Financial Distress Prediction Based on Ensemble Feature Selection and Improved Stacking Algorithm. Kybernetes, 2024; ahead-of-print. [Google Scholar] [CrossRef]
- Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern. A 2010, 40, 185–197. [Google Scholar] [CrossRef]
- Díez López, C.; Montiel González, D.; Vidaki, A.; Kayser, M. Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning. Front. Microbiol. 2022, 13, 886201. [Google Scholar] [CrossRef]
- Shaikh, S.; Daudpota, S.M.; Imran, A.S.; Kastrati, Z. Towards Improved Classification Accuracy on Highly Imbalanced Text Dataset Using Deep Neural Language Models. Appl. Sci. 2021, 11, 869. [Google Scholar] [CrossRef]
- Kotb, M.H.; Ming, R. Comparing SMOTE Family Techniques in Predicting Insurance Premium Defaulting Using Machine Learning Models. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 621–629. [Google Scholar] [CrossRef]
- Hairani, H.; Anggrawan, A.; Priyanto, D. Improvement Performance of the Random Forest Method on Unbalanced Diabetes Data Classification Using Smote-Tomek Link. JOIV Int. J. Inform. Vis. 2023, 7, 258. [Google Scholar] [CrossRef]
| Set | Category | Brief Description |
|---|---|---|
| Financial Indicators | Solvency | Ability to repay short- and long-term debt using enterprise assets. |
| | Disclosed Index | Indicators directly reported to reflect company operations. |
| | Ratio Structure | Financial structure based on the proportions of key financial indicators. |
| | Operating Capacity | Efficiency in utilizing assets for business operations. |
| | Earning Capacity | Ability to generate profits. |
| | Cash Flow | Cash-related indicators derived from financial statement ratios. |
| | Risk Level | Risk of financial instability due to weak structure or poor financing practices. |
| | Growth Capability | Potential for future expansion and performance improvement. |
| | Index per Share | Financial condition measured on a per-share basis. |
| | Relative Value Index | Derived from comparisons among related indicators. |
| | Dividend Distribution | Metrics derived from comparisons among related financial indicators. |
| Bankruptcy Reorganization | Business Risk | Risk profile of bankrupt or restructured listed firms. |
| Technique | Type | Description |
|---|---|---|
| SMOTE | Oversampling | Generates synthetic minority-class samples via linear interpolation between neighbors. |
| ADASYN | Oversampling | Focuses synthesis on difficult minority samples by weighting based on local imbalance. |
| Borderline-SMOTE | Oversampling | Oversamples minority instances near the decision boundary. |
| Random Undersampling | Undersampling | Randomly removes majority-class samples to balance the distribution. |
| Tomek Links | Undersampling | Removes overlapping majority-class samples to improve class separation. |
| SMOTE-Tomek | Hybrid | Applies SMOTE, then removes noisy samples using Tomek Links. |
| SMOTE-ENN | Hybrid | Generates synthetic samples, then removes misclassified instances via Edited Nearest Neighbors. |
| Bagging-SMOTE | Hybrid | Draws stratified bootstrap samples, applies SMOTE to each, and aggregates predictions via bagging. |
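Bagging-SMOTE has no single off-the-shelf implementation; the following is a minimal sketch of the procedure under stated assumptions (stratified bootstrap resampling, per-bag SMOTE with the paper's sampling ratio of 0.15, probability averaging, xgboost ≥ 1.6). The helper names and `n_bags` value are hypothetical:

```python
import numpy as np
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def fit_bagging_smote(X, y, n_bags=10, sampling_ratio=0.15, seed=42):
    """Fit one XGBoost model per stratified bootstrap sample, after per-bag SMOTE."""
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(n_bags):
        # Stratified bootstrap: sample with replacement, preserving class proportions.
        X_boot, y_boot = resample(X, y, replace=True, stratify=y,
                                  random_state=rng.randint(2**31 - 1))
        # Per-bag SMOTE; sampling_strategy=0.15 targets a minority:majority ratio
        # of 0.15:1, matching the 16,241:2436 distribution reported in the Results.
        X_res, y_res = SMOTE(sampling_strategy=sampling_ratio,
                             random_state=seed).fit_resample(X_boot, y_boot)
        models.append(XGBClassifier(n_estimators=300, random_state=seed,
                                    eval_metric="logloss").fit(X_res, y_res))
    return models

def predict_bagging_smote(models, X, threshold=0.5):
    """Average the per-model minority-class probabilities, then threshold."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= threshold).astype(int)
```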
| Parameter | Value | Description |
|---|---|---|
| random_state | 42 | Ensures reproducibility of experimental results |
| n_estimators | 300 | Maximum number of boosting rounds (trees) |
| eval_metric | logloss, AUC | Metrics used to evaluate model performance during training |
| early_stopping_rounds | 20 | Stops training early to prevent overfitting |
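A minimal sketch of how this configuration might be passed to XGBoost's scikit-learn interface, assuming xgboost ≥ 1.6 (where eval_metric and early_stopping_rounds are constructor arguments). The validation split fraction and the `X_res`/`y_res` placeholders (a resampled training set) are assumptions:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hold out a validation set for early stopping (the 80/20 fraction is an assumption).
X_fit, X_val, y_fit, y_val = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=42)

model = XGBClassifier(
    n_estimators=300,                # maximum number of boosting rounds
    random_state=42,                 # reproducibility
    eval_metric=["logloss", "auc"],  # metrics tracked on the eval_set
    early_stopping_rounds=20,        # stop after 20 rounds without improvement
)
model.fit(X_fit, y_fit, eval_set=[(X_val, y_val)], verbose=False)
```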
| Methods | Class 0 | Class 1 | Class Ratio (0:1) |
|---|---|---|---|
| Original data | 16,241 | 2227 | 7.29:1 |
| SMOTE | 16,241 | 16,241 | 1:1 |
| ADASYN | 16,241 | 16,273 | 1:1 |
| Borderline-SMOTE | 16,241 | 16,241 | 1:1 |
| RUS | 2227 | 2227 | 1:1 |
| Tomek Links | 16,021 | 2227 | 7.19:1 |
| SMOTE-Tomek | 16,229 | 16,229 | 1:1 |
| SMOTE-ENN | 13,702 | 16,018 | 0.86:1 |
| Bagging-SMOTE | 16,241 | 2436 | 6.67:1 |
| Methods | Accuracy | Precision | Recall | F1 | AUC | PR-AUC | MCC | Sampling Time (s) | Training Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| Original data | 0.94 | 0.85 | 0.60 | 0.71 | 0.96 | 0.82 | 0.68 | - | 42.6 |
| SMOTE | 0.94 | 0.77 | 0.69 | 0.73 | 0.96 | 0.80 | 0.70 | 2.4 | 70.0 |
| ADASYN | 0.94 | 0.76 | 0.68 | 0.72 | 0.96 | 0.80 | 0.68 | 2.3 | 70.4 |
| Borderline-SMOTE | 0.94 | 0.77 | 0.69 | 0.73 | 0.96 | 0.80 | 0.69 | 2.4 | 71.4 |
| RUS | 0.87 | 0.48 | 0.88 | 0.62 | 0.94 | 0.71 | 0.58 | 1.6 | 14.9 |
| Tomek Links | 0.94 | 0.82 | 0.61 | 0.70 | 0.96 | 0.81 | 0.68 | 2.7 | 41.0 |
| SMOTE-Tomek | 0.94 | 0.75 | 0.69 | 0.72 | 0.95 | 0.80 | 0.69 | 5.9 | 68.5 |
| SMOTE-ENN | 0.92 | 0.63 | 0.77 | 0.69 | 0.95 | 0.76 | 0.65 | 6.1 | 58.1 |
| Bagging-SMOTE | 0.93 | 0.74 | 0.69 | 0.72 | 0.96 | 0.80 | 0.68 | - | 400.3 |
Test set composition: Class 0 = 6961, Class 1 = 954.

| Methods | TN | FP | FN | TP |
|---|---|---|---|---|
| Original data | 6859 | 102 | 379 | 575 |
| SMOTE | 6763 | 198 | 293 | 661 |
| ADASYN | 6751 | 210 | 301 | 653 |
| Borderline-SMOTE | 6759 | 202 | 295 | 659 |
| RUS | 6040 | 921 | 117 | 837 |
| Tomek Links | 6833 | 128 | 370 | 584 |
| SMOTE-Tomek | 6746 | 215 | 295 | 659 |
| SMOTE-ENN | 6528 | 433 | 217 | 737 |
| Bagging-SMOTE | 6734 | 227 | 291 | 663 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).