Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset
Abstract
:1. Introduction
2. Related Works
3. Preliminaries
3.1. Imbalanced Data Problem in Binary Classification
3.2. Oversampling Techniques
3.2.1. Synthetic Minority Oversampling Technique
3.2.2. Adaptive Synthetic Sampling
3.2.3. Borderline-SMOTE
3.2.4. Oversampling Followed by Data Cleaning Techniques
3.3. ROC Curve
4. Research Design
4.1. Dataset
4.2. Experiment Setup
4.2.1. The Bankruptcy Prediction Framework
4.2.2. Novel Features from Transaction Dataset
| Algorithm 1. A novel algorithm for feature extraction. |
![]() |
5. Experimental Results
5.1. Results of Bankruptcy Prediction
5.2. Result of Bankruptcy Prediction with Mixed Dataset
6. Conclusions
Acknowledgments
Author Contributions
Conflicts of Interest
References
- Kieu, T.; Vo, B.; Le, T.; Deng, Z.H.; Le, B. Mining top-k co-occurrence items with sequential pattern. Expert Syst. Appl. 2017, 85, 123–133. [Google Scholar] [CrossRef]
- Le, T.; Vo, B.; Baik, S.W. Efficient algorithms for mining top-rank-k erasable patterns using pruning strategies and the subsume concept. Eng. Appl. Artif. Intell. 2018, 68, 1–9. [Google Scholar] [CrossRef]
- Le, T.; Vo, B. The lattice-based approaches for mining association rules: A review. WIREs Data Min. Knowl. Discov. 2016, 6, 140–151. [Google Scholar] [CrossRef]
- Vo, B.; Le, T.; Nguyen, G.; Hong, T.P. Efficient algorithms for mining erasable closed patterns from product datasets. IEEE Access 2017, 5, 3111–3120. [Google Scholar] [CrossRef]
- Vo, B.; Le, T.; Pedrycz, W.; Nguyen, G.; Baik, S.W. Mining erasable itemsets with subset and superset itemset constraints. Expert Syst. Appl. 2017, 69, 50–61. [Google Scholar] [CrossRef]
- Vo, B.; Pham, S.; Le, T.; Deng, Z.H. A novel approach for mining maximal frequent patterns. Expert Syst. Appl. 2017, 73, 178–186. [Google Scholar] [CrossRef]
- Pham, H.P.; Le, H.S. Linguistic Vector Similarity Measures and Applications to Linguistic Information Classification. Int. J. Intell. Syst. 2017, 32, 67–81. [Google Scholar]
- Nguyen, D.T.; Ali, M.; Le, H.S. A Novel Clustering Algorithm in a Neutrosophic Recommender System for Medical Diagnosis. Cogn. Comput. 2017, 9, 526–544. [Google Scholar]
- Le, H.S.; Pham, H.T. Some novel hybrid forecast methods based on picture fuzzy clustering for weather nowcasting from satellite image sequences. Appl. Intell. 2017, 46, 1–15. [Google Scholar]
- Dang, T.H.; Le, H.S.; Le, V.T. Novel fuzzy clustering scheme for 3D wireless sensor networks. Appl. Soft Comput. 2017, 54, 141–149. [Google Scholar]
- Abeysinghe, C.; Li, J.; He, J. A Classifier Hub for Imbalanced Financial Data. In Proceedings of the Australasian Database Conference, Sydney, Australia, 28–29 September 2016; pp. 476–479. [Google Scholar]
- Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. In Proceedings of the Conference on AI in Medicine in Europe: Artificial Intelligence Medicine, Cascais, Portugal, 1–4 July 2001; pp. 63–66. [Google Scholar]
- Li, Y.; Zhang, S.; Yin, Y.; Xiao, W.; Zhang, J. A Novel Online Sequential Extreme Learning Machine for Gas Utilization Ratio Prediction in Blast Furnaces. Sensors 2017, 17, 1847. [Google Scholar] [CrossRef] [PubMed]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Siegel, J.; Bhattacharyya, R.; Kumar, S.; Sarma, S.E. Air filter particulate loading detection using smartphone audio and optimized ensemble classification. Eng. Appl. Artif. Intell. 2017, 66, 104–112. [Google Scholar] [CrossRef]
- Wei, L.; Xiong, X.; Zhang, W.; He, X.Z.; Zhang, Y. The effect of genetic algorithm learning with a classifier system in limit order markets. Eng. Appl. Artif. Intell. 2017, 65, 436–448. [Google Scholar] [CrossRef]
- Bao, F.; Deng, Y.; Dai, Q. ACID: Association correction for imbalanced data in GWAS. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018, 15, 316–322. [Google Scholar] [CrossRef] [PubMed]
- Wilk, S.; Stefanowski, J.; Wojciechowski, S.; Farion, K.J.; Michalowski, W. Application of Preprocessing Methods to Imbalanced Clinical Data: An Experimental Study. Inf. Technol. Med. 2016, 471, 503–515. [Google Scholar]
- Barboza, F.; Kimura, H.; Altman, E. Machine learning models and bankruptcy prediction. Expert Syst. Appl. 2017, 83, 405–417. [Google Scholar] [CrossRef]
- Kim, M.J.; Kang, D.K.; Kim, H.B. Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction. Expert Syst. Appl. 2015, 42, 1074–1082. [Google Scholar] [CrossRef]
- Wang, M.; Chen, H.; Li, H.; Cai, Z.N.; Zhao, X.; Tong, C.; Li, J.; Xu, X. Grey wolf optimization evolving kernel extreme learning machine: Application to bankruptcy prediction. Eng. Appl. Artif. Intell. 2017, 63, 54–68. [Google Scholar] [CrossRef]
- Zelenkov, Y.; Fedorova, E.; Chekrizov, D. Two-step classification method based on genetic algorithm for bankruptcy forecasting. Expert Syst. Appl. 2017, 88, 393–401. [Google Scholar] [CrossRef]
- Zakaryazad, A.; Duman, E. A profit-driven Artificial Neural Network (ANN) with applications to fraud detection and direct marketing. Neurocomputing 2016, 175, 121–131. [Google Scholar] [CrossRef]
- Tan, M.; Tan, L.; Dara, S.; Mayeux, C. Online defect prediction for imbalanced data. In Proceedings of the 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE), Florence, Italy, 16–24 May 2015; pp. 99–108. [Google Scholar]
- Folino, G.; Pisani, F.S.; Sabatino, P. An Incremental Ensemble Evolved by using Genetic Programming to Efficiently Detect Drifts in Cyber Security Datasets. In Proceedings of the GECCO (Companion), Denver, CO, USA, 20–24 July 2016; pp. 1103–1110. [Google Scholar]
- Li, Y.; Guo, H.; Liu, X.; Li, Y.; Li, J. Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowl.-Based Syst. 2016, 94, 88–104. [Google Scholar] [CrossRef]
- Kim, H.J.; Jo, N.O.; Shin, K.S. Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction. Expert Syst. Appl. 2016, 59, 226–234. [Google Scholar] [CrossRef]
- Batista, G.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar]
- Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2008), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
- Zieba, M.; Tomczak, S.K.; Tomczak, J.M. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Syst. Appl. 2016, 58, 93–101. [Google Scholar] [CrossRef]
- Pietruszkiewicz, W. Dynamical systems and nonlinear Kalman filtering applied in classification. In Proceedings of the 7th IEEE International Conference on Cybernetic Intelligent Systems (CIS 2008), London, UK, 9–10 September 2008. [Google Scholar]
- Zhao, D.; Huang, C.; Wei, Y.; Yu, F.; Wang, M.; Chen, H. An Effective Computational Model for Bankruptcy Prediction Using Kernel Extreme Learning Machine Approach. Comput. Econ. 2017, 49, 325–341. [Google Scholar] [CrossRef]
- Kang, P.; Cho, S. EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems. In Proceedings of the International Conference on Neural Information Processing (ICONIP 2006), Hong Kong, China, 3–6 October 2006; pp. 837–846. [Google Scholar]
- Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
- Lemaitre, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. J. Mach. Learn. Res. 2017, 18, 17. [Google Scholar]





| ID | Description |
|---|---|
| F1 | Current assets |
| F2 | Non-current assets, fixed assets, or fixed capital property |
| F3 | Total assets |
| F4 | Current liabilities within one year |
| F5 | Non-current liabilities that are over one-year terms. |
| F6 | Total liabilities |
| F7 | Capital |
| F8 | Earned surplus |
| F9 | Total capital |
| F10 | Total capital after liabilities |
| F11 | Sales revenue |
| F12 | Cost of sales |
| F13 | Gross profit |
| F14 | Sales and administrative expenses |
| F15 | Operating profit that refers to the profits earned through business operations |
| F16 | Non-operating income |
| F17 | Non-operating expenses |
| F18 | Income and loss before income taxes |
| F19 | Net income |
| Oversampling Techniques | Bankruptcy Prediction Model | AUC (%) |
|---|---|---|
| None | Random Forest | 82.4 ± 0.5 |
| Decision Tree | 76.2 ± 0.6 | |
| Multi-Layer Perceptron | 51.8 ± 0.2 | |
| SVM | 52.4 ± 1.7 | |
| SMOTE | Random Forest | 84.1 ± 0.4 |
| Decision Tree | 81.9 ± 0.5 | |
| Multi-Layer Perceptron | 71.4 ± 0.8 | |
| SVM | 53.1 ± 1.5 | |
| Borderline-SMOTE | Random Forest | 83.1 ± 0.5 |
| Decision Tree | 75.6 ± 0.6 | |
| Multi-Layer Perceptron | 67.7 ± 0.6 | |
| SVM | 52.1 ± 2.5 | |
| ADASYN | Random Forest | 83.1 ± 0.4 |
| Decision Tree | 80.3 ± 0.5 | |
| Multi-Layer Perceptron | 68.9 ± 0.5 | |
| SVM | 51.2 ± 2.1 | |
| SMOTE + Tomek | Random Forest | 84.1 ± 0.4 |
| Decision Tree | 81.9 ± 0.5 | |
| Multi-Layer Perceptron | 69.8 ± 0.4 | |
| SVM | 53.5 ± 1.2 | |
| SMOTE + ENN | Random Forest | 84.2 ± 0.5 |
| Decision Tree | 81.2 ± 0.5 | |
| Multi-Layer Perceptron | 72.7 ± 0.5 | |
| SVM | 54.2 ± 1.4 |
| Times | AUC (%) | AUC (%) for Mixed Dataset |
|---|---|---|
| 1 | 83.9 | 84.3 |
| 2 | 84.4 | 84.3 |
| 3 | 84.2 | 84.4 |
| 4 | 84.2 | 84.9 |
| 5 | 84.3 | 84.3 |
| 6 | 84.5 | 84.1 |
| Average AUC | 84.2 | 84.4 |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Le, T.; Lee, M.Y.; Park, J.R.; Baik, S.W. Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset. Symmetry 2018, 10, 79. https://doi.org/10.3390/sym10040079
Le T, Lee MY, Park JR, Baik SW. Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset. Symmetry. 2018; 10(4):79. https://doi.org/10.3390/sym10040079
Chicago/Turabian StyleLe, Tuong, Mi Young Lee, Jun Ryeol Park, and Sung Wook Baik. 2018. "Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset" Symmetry 10, no. 4: 79. https://doi.org/10.3390/sym10040079
APA StyleLe, T., Lee, M. Y., Park, J. R., & Baik, S. W. (2018). Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset. Symmetry, 10(4), 79. https://doi.org/10.3390/sym10040079


