# A Cluster-Based Boosting Algorithm for Bankruptcy Prediction in a Highly Imbalanced Dataset


## Abstract


## 1. Introduction

## 2. Preliminaries

#### 2.1. Class Imbalance Problem in Bankruptcy Prediction

#### 2.2. Undersampling Approach Using IHT

For a training sample $(x_i, y_i)$, $p(y_i \mid x_i, h)$ is the conditional probability of label $y_i$ given by the weak learner $h$ for the input feature vector $x_i$. The smaller the value of $p(y_i \mid x_i, h)$, the less correctly $h$ classifies the sample. The instance hardness (IH) of the training sample $(x_i, y_i)$ with respect to $h$ is as follows:

$$IH_h(x_i, y_i) = 1 - p(y_i \mid x_i, h)$$

Averaging over a set of weak learners $\{h_1, \ldots, h_L\}$, the IH of $(x_i, y_i)$ is determined as in Equation (3):

$$IH(x_i, y_i) = \frac{1}{L} \sum_{j=1}^{L} \left(1 - p(y_i \mid x_i, h_j)\right)$$
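
To make Equation (3) concrete, instance hardness can be estimated from out-of-fold probabilities obtained by cross-validation. The sketch below uses scikit-learn (cited in the references); the function name, the random-forest estimator, and the parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def instance_hardness(X, y, estimator=None, cv=5):
    """Estimate IH(x_i, y_i) = 1 - p(y_i | x_i) from out-of-fold predicted probabilities."""
    if estimator is None:
        estimator = RandomForestClassifier(n_estimators=100, random_state=0)
    proba = cross_val_predict(estimator, X, y, cv=cv, method="predict_proba")
    classes = np.unique(y)                       # column order used by predict_proba
    true_col = np.searchsorted(classes, y)       # column index of each sample's true label
    p_true = proba[np.arange(len(y)), true_col]
    return 1.0 - p_true                          # large IH = sample is hard to classify correctly
```

Majority-class samples whose IH exceeds a chosen threshold can then be discarded until the desired balancing ratio is reached, which is the idea behind IHT undersampling.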

## 3. Materials and Methods

#### 3.1. The Experimental Dataset

#### 3.2. Cluster-Based Boosting Algorithm

Here, $d(x_i)$ refers to the Euclidean distance between data point $x_i$ and the nearest center point. Equation (4) ensures that the majority-class samples close to the center points, together with the minority-class samples, receive higher weight values than the majority-class samples that lie farther from the centers. Next, CBoost in Line 6 normalizes these values into an initial weight distribution using the following equation:

$$W_1(i) = \frac{w_i}{\sum_{j=1}^{n} w_j}$$

where $w_i$ is the value assigned to $x_i$ by Equation (4) and $n$ is the number of training samples. At each boosting iteration $t$, CBoost finds the weak classifier $h_t(x)$ that gives the lowest weighted classification error ($\epsilon_t$) in Line 8, calculates the weight of the $t$-th weak classifier ($\alpha_t$) in Line 10, and updates the weight distribution $W_{t+1}$ for the next iteration in Line 11. The final classifier $H$ computes the sign of a weighted combination of the weak learners as in Equation (6):

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

where $h_t(x)$ refers to the $t$-th weak learner and $\alpha_t$ is the corresponding weight. In short, CBoost is a greedy algorithm that finds and adds one weak learner per iteration, optimizes its weight, and updates the weighted distribution for the next iteration. In the final step, the algorithm combines the weak learners as in Equation (6) to create a stronger learner as the final classifier.
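
The following is a minimal sketch of the boosting loop described above, built from scikit-learn components. It is not the authors' implementation: the exact form of Equations (4) and (5) is only assumed here (minority samples and majority samples near a cluster center get larger initial weights), and the depth-one decision trees, cluster count, and number of rounds are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def cboost_fit(X, y, n_clusters=10, n_rounds=50):
    """Rough sketch of a cluster-based boosting loop following Section 3.2."""
    y_pm = np.where(y == 1, 1, -1)                   # labels in {-1, +1}; assume 1 = minority (bankrupt)
    centers = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X).cluster_centers_
    d = np.min(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    # Assumed Eq. (4): minority samples and majority samples close to a center get larger weights.
    w = np.where(y_pm == 1, 1.0, 1.0 / (1.0 + d))
    w = w / w.sum()                                  # Eq. (5): normalize to a weight distribution
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w[pred != y_pm])                # weighted classification error (eps_t)
        if err == 0 or err >= 0.5:                   # stop if the stump is perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)        # weight of the t-th weak classifier (alpha_t)
        w = w * np.exp(-alpha * y_pm * pred)         # update W_{t+1}
        w = w / w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def cboost_predict(X, learners, alphas):
    """Final classifier: sign of the weighted combination of weak learners (Eq. (6))."""
    score = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(score)
```

Apart from the cluster-based initial distribution $W_1$, the loop follows the familiar AdaBoost update, so the final prediction is simply the sign of the weighted vote in Equation (6).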

#### 3.3. RFCI Framework

## 4. Results

#### 4.1. Experimental Setting

#### 4.2. Identifying k Value Experiment

#### 4.3. Bankruptcy Prediction Results

#### 4.4. Time Analysis

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Acknowledgments

## Conflicts of Interest

## References

1. Cu, N.G.; Le, H.S.; Chiclana, F. Dynamic structural neural network. J. Intell. Fuzzy Syst. **2018**, 34, 2479–2490.
2. Dang, L.M.; Hassan, S.I.; Im, S.; Mehmood, I.; Moon, H. Utilizing text recognition for the defects extraction in sewers CCTV inspection videos. Comput. Ind. **2018**, 99, 96–109.
3. Dang, L.M.; Syed, I.H.; Suhyeon, I.; Sangaiah, A.; Mehmood, I.; Rho, S.; Seo, S.; Moon, H. UAV based wilt detection system via convolutional neural networks. Sustain. Comput. Inform. Syst. **2018**, in press.
4. Le, T.; Nguyen, A.; Huynh, B.; Vo, B.; Pedrycz, W. Mining constrained inter-sequence patterns: A novel approach to cope with item constraints. Appl. Intell. **2018**, 48, 1327–1343.
5. Bui, H.; Vo, B.; Nguyen, H.; Nguyen-Hoang, T.A.; Hong, T.P. A weighted N-list-based method for mining frequent weighted itemsets. Expert Syst. Appl. **2018**, 96, 388–405.
6. Vo, B.; Le, T.; Coenen, F.; Hong, T.P. Mining frequent itemsets using the N-list and subsume concepts. Int. J. Mach. Learn. Cybern. **2016**, 7, 253–265.
7. Le, T.; Vo, B.; Baik, S.W. Efficient algorithms for mining top-rank-k erasable patterns using pruning strategies and the subsume concept. Eng. Appl. Artif. Intell. **2018**, 68, 1–9.
8. Kim, D.; Yun, U. Efficient algorithm for mining high average-utility itemsets in incremental transaction databases. Appl. Intell. **2017**, 47, 114–131.
9. Vo, B. An Efficient Method for Mining Frequent Weighted Closed Itemsets from Weighted Item Transaction Databases. J. Inf. Sci. Eng. **2017**, 33, 199–216.
10. Mai, T.; Vo, B.; Nguyen, L. A lattice-based approach for mining high utility association rules. Inf. Sci. **2017**, 399, 81–97.
11. Kim, B.; Kim, J.; Yi, G. Analysis of Clustering Evaluation Considering Features of Item Response Data Using Data Mining Technique for Setting Cut-Off Scores. Symmetry **2017**, 9, 62.
12. Soleimani, H.; Tomasin, S.; Alizadeh, T.; Shojafar, M. Cluster-head based feedback for simplified time reversal prefiltering in ultra-wideband systems. Phys. Commun. **2017**, 25, 100–109.
13. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
14. Tajiki, M.M.; Akbari, B.; Shojafar, M.; Mokari, N. Joint QoS and Congestion Control Based on Traffic Prediction in SDN. Appl. Sci. **2017**, 7, 1265.
15. Roan, T.N.; Ali, M.; Le, H.S. δ-equality of intuitionistic fuzzy sets: A new proximity measure and applications in medical diagnosis. Appl. Intell. **2018**, 48, 499–525.
16. Singh, K.; Singh, K.; Le, H.S.; Aziz, A. Congestion control in wireless sensor networks by hybrid multi-objective optimization algorithm. Comput. Netw. **2018**, 138, 90–107.
17. Le, T.; Vo, B.; Duong, T.H. Personalized Facets for Semantic Search Using Linked Open Data with Social Networks. In Proceedings of the 2012 Third International Conference on Innovations in Bio-Inspired Computing and Applications, Kaohsiung, Taiwan, 26–28 September 2012; pp. 312–337.
18. Nguyen, D.T.; Ali, M.; Le, H.S. A Novel Clustering Algorithm in a Neutrosophic Recommender System for Medical Diagnosis. Cogn. Comput. **2017**, 9, 526–544.
19. Lu, T.C. Interpolation-based hiding scheme using the modulus function and re-encoding strategy. Signal Process. **2018**, 142, 244–259.
20. Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. **2017**, 409, 17–26.
21. Zakaryazad, A.; Duman, E. A profit-driven Artificial Neural Network (ANN) with applications to fraud detection and direct marketing. Neurocomputing **2016**, 175, 121–131.
22. Herndon, N.; Caragea, D. A Study of Domain Adaptation Classifiers Derived from Logistic Regression for the Task of Splice Site Prediction. IEEE Trans. NanoBiosci. **2016**, 15, 75–83.
23. Luo, J.; Xiao, Q. A novel approach for predicting microRNA-disease associations by unbalanced bi-random walk on heterogeneous network. J. Biomed. Inform. **2017**, 66, 194–203.
24. Kim, M.J.; Kang, D.K.; Kim, H.B. Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction. Expert Syst. Appl. **2015**, 42, 1074–1082.
25. Zieba, M.; Tomczak, S.K.; Tomczak, J.M. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Syst. Appl. **2016**, 58, 93–101.
26. Barboza, F.; Kimura, H.; Altman, E. Machine learning models and bankruptcy prediction. Expert Syst. Appl. **2017**, 83, 405–417.
27. Bennin, K.E.; Keung, J.; Phannachitta, P.; Monden, A.; Mensah, S. MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction. IEEE Trans. Softw. Eng. **2018**, 44, 534–550.
28. Le, T.; Lee, M.Y.; Park, J.R.; Baik, S.W. Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset. Symmetry **2018**, 10, 79.
29. Batista, G.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explor. Newsl. **2004**, 6, 20–29.
30. Smith, M.R.; Martinez, T.R.; Giraud-Carrier, C.G. An instance level analysis of data complexity. Mach. Learn. **2014**, 95, 225–256.
31. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. **1997**, 55, 119–139.
32. Lemaitre, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. J. Mach. Learn. Res. **2017**, 18, 17:1–17:5.
33. Thorndike, R.L. Who Belongs in the Family? Psychometrika **1953**, 18, 267–276.

**Figure 1.** An example of the undersampling approach using the Instance Hardness Threshold (IHT) with several different balancing ratios.
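
For reference, the imbalanced-learn toolbox cited in the references ships an IHT-based undersampler; a sketch like the one below can produce resampled training sets at different target ratios such as those illustrated in Figure 1. The estimator, the ratio values, and the `X_train`/`y_train` variables are illustrative, and depending on the library version the achieved ratio is only approximate.

```python
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.ensemble import RandomForestClassifier

# Undersample the majority class at several target minority/majority ratios (illustrative values).
for ratio in (0.25, 0.5, 1.0):
    iht = InstanceHardnessThreshold(
        estimator=RandomForestClassifier(n_estimators=100, random_state=0),
        sampling_strategy=ratio,  # desired N_minority / N_majority after resampling
        cv=5,
        random_state=0,
    )
    X_res, y_res = iht.fit_resample(X_train, y_train)
```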

**Figure 4.** The variation of the distortion value for each value of k. (**A**) shows the variation for k from 0 to 250. (**B**) shows the variation for k from 0 to 140 for better display.
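
The distortion values plotted in Figure 4 can be computed, for example, as the within-cluster sum of squared distances reported by k-means (scikit-learn's `inertia_`). Whether the clustering is run on the majority class or the full training set, and the exact grid of k values, are assumptions in the sketch below; `X_majority` is an illustrative variable name.

```python
import numpy as np
from sklearn.cluster import KMeans

def distortion_curve(X, k_values):
    """Within-cluster sum of squared distances (k-means inertia) for each candidate k."""
    distortions = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        distortions.append(km.inertia_)
    return np.array(distortions)

# Example: scan k over a grid and look for the "elbow" where the curve flattens out.
# distortions = distortion_curve(X_majority, range(2, 251, 10))
```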

Feature | Description |
---|---|
F1 | Current assets of the enterprise |
F2 | Non-current assets, i.e., fixed capital assets |
F3 | Total assets (the sum of current and non-current assets) |
F4 | Current debts that must be paid within the year |
F5 | Long-term debts |
F6 | Total debts (the sum of current and long-term debts) |
F7 | Capital |
F8 | Earned surplus |
F9 | Total capital |
F10 | Total capital after debts |
F11 | Revenue from sales activities |
F12 | Cost of sales activities |
F13 | Gross profit from sales activities |
F14 | Management costs |
F15 | Operating profit, i.e., the profit earned through business operations |
F16 | Non-operating income |
F17 | Non-operating costs |
F18 | Income and loss before taxes |
F19 | Net income |

Method | Resampling Approach | Classifier | AUC (%) |
---|---|---|---|
[20] | Undersampling method based on clustering technique | MLP | 46.3 ± 0.3 |
 | | Decision Tree | 53.4 ± 0.1 |
 | | Random Forest | 57.7 ± 0.2 |
 | | AdaBoost | 52.7 ± 0.5 |
[28] | Oversampling method using SMOTEENN | MLP | 72.7 ± 0.5 |
 | | Decision Tree | 81.2 ± 0.5 |
 | | Random Forest | 84.2 ± 0.5 |
 | | AdaBoost | 84.8 ± 0.4 |
[24] | None | GMBoost | 75.3 ± 0.6 |
RFCI | Undersampling method using the IHT concept | CBoost | 86.8 ± 0.3 |

Method | Resampling Approach | Classifier | Training Time (s) | Testing Time (s) |
---|---|---|---|---|
[20] | Undersampling method based on clustering technique | MLP | 134.2 ± 9.5 | 0.03 |
 | | Decision Tree | 133.2 ± 9.9 | 0.002 |
 | | Random Forest | 134.0 ± 8.9 | 0.01 |
 | | AdaBoost | 135.7 ± 9.3 | 0.15 |
[28] | Oversampling method using SMOTEENN | MLP | 48.3 ± 3.0 | 0.02 |
 | | Decision Tree | 36.2 ± 1.0 | 0.003 |
 | | Random Forest | 36.7 ± 0.9 | 0.02 |
 | | AdaBoost | 66.4 ± 1.0 | 0.31 |
[24] | None | GMBoost | 13.7 ± 0.1 | 0.3 |
RFCI | Undersampling method using the IHT concept | CBoost | 39.4 ± 0.7 | 0.15 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
