Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset

In recent years, weakened by slowing economic growth, many enterprises have fallen into crises caused by financial difficulties. Bankruptcy prediction, typically framed as a machine learning task, is of great utility to financial institutions, fund managers, lenders, governments, and other economic stakeholders. Because bankrupt companies are far fewer than non-bankrupt companies, bankruptcy prediction faces the problem of imbalanced data. This study first presents a bankruptcy prediction framework. Then, five oversampling techniques are used to deal with the imbalance problem on an experimental dataset collected from Korean companies over two years, from 2016 to 2017. Experimental results show that using oversampling techniques to balance the dataset in the training stage can enhance the performance of bankruptcy prediction. The best overall Area Under the Curve (AUC) of this framework reaches 84.2%. Next, the study extracts additional features by combining the financial dataset with a transaction dataset, which raises the AUC to 84.4%.


Introduction
Nowadays, with a very large amount of data being generated every day, mining that data to create knowledge for use in intelligent systems has become increasingly important. Data mining includes several common tasks, such as association rule learning [1][2][3][4][5][6], classification [7], and clustering [8][9][10]. Classification [11][12][13][14][15][16] has received significant attention from the research and development community. In machine learning, classification, a form of supervised learning, is the problem of identifying the class to which a new observation belongs, based on a training set of observations whose classes are known. It has many practical applications in various domains.
In real-world datasets, class distribution is commonly imbalanced. For instance, in binary classification, the minority class contains a small number of data samples, and the majority class contains a very large number of data samples. Consider a dataset where 95% of the data samples belong to the majority class and only 5% belong to the minority class: a classifier can achieve an accuracy of 95% just by predicting that all data samples belong to the majority class. Such a model is not a good classifier, because it never identifies a minority sample. The class imbalance problem has been encountered in various domains, such as chemical and biomedical engineering, business management, information technology, and energy management. In chemical and biomedical engineering, protein detection [17] and disease diagnosis [18] are the most common topics related to imbalanced data. In business management, bankruptcy prediction [19][20][21][22] and fraud detection [11,23] are two very attractive topics. Bankruptcy prediction forecasts the fate of firms and is of great utility to all economic stakeholders. Fraud detection includes several sub-problems, such as e-payment fraud, credit and plastic card fraud [23], and loan default prediction [11]. For example, Abeysinghe et al. [11] in 2016 presented a dataset that contains 30,000 loan records collected from an online P2P system. Each record has 225 features covering the lender's personal information, network behavior information, and social network information. In this dataset, 27,802 cases repaid money on time while 2198 cases did not. In information technology, software defect detection [24] and network intrusion detection [25] are carried out under imbalanced scenarios. In [25], the authors introduced the ISCX IDS dataset collected by the Information Security Centre of Excellence of the University of New Brunswick. To obtain this dataset, the authors captured seven days of network traffic. Most of the flows are normal traffic, while only a small number of malicious activities were found. In energy management, several problems, such as planning [26] and the operation of energy production and energy consumption units, are related to imbalanced data.
In machine learning, bankruptcy prediction [20,27] is treated as a binary classification problem in which the number of bankrupt firms is usually much smaller than that of non-bankrupt firms. For imbalanced data, most learning algorithms are not able to induce meaningful classifiers. One of the reasons is that learning algorithms tend to focus on the majority class to maximize classification accuracy and ignore the minority class. Therefore, many approaches have been proposed to overcome the problem of class imbalance [12]. The most commonly used are sampling methods [11,28,29,30,31], which are utilized to obtain a balanced class distribution from imbalanced datasets. Sampling methods can be divided into two groups: undersampling and oversampling techniques. Undersampling techniques remove data points from the majority class, while oversampling methods generate synthetic data points belonging to the minority class; both aim at a desirable balancing ratio. Several approaches that follow oversampling with a data cleaning step [28] have also been introduced to achieve better performance.
This study utilizes several oversampling techniques to forecast bankruptcy in a Korean case study. The main contributions of this study are as follows: (i) a bankruptcy prediction framework is presented; (ii) a Korean dataset covering the last two years was collected for use in this framework; (iii) a first experiment compares the performance of oversampling techniques for forecasting bankruptcy on the Korean dataset; and (iv) to enhance performance, a second experiment combines the financial dataset with a transaction dataset. Experimental results show that the best overall AUC of the framework on the financial dataset reaches 84.2%. Additionally, combining the financial dataset with the transaction dataset further increases the performance of bankruptcy prediction.
The remainder of the paper is organized as follows: Section 2 presents related works on bankruptcy prediction; Section 3 summarizes the preliminaries, including the problem of imbalanced data, oversampling techniques, and performance measures for imbalanced data; Section 4 presents the research design of the two experiments; Section 5 presents the experimental results; and, finally, Section 6 gives the conclusions and offers some future research issues.

Related Works
In 2015, Kim et al. [20] proposed a geometric mean (GM)-based boosting algorithm, named GMBoost, to address imbalanced data in bankruptcy prediction. GMBoost uses the GM of both classes in its error rate and accuracy calculations so that learning takes both the majority and minority classes into account. The method was verified on a dataset collected from a Korean commercial bank. This dataset includes 500 bankrupt companies and 2500 non-bankrupt companies from 2002 to 2005, with 30 financial ratios covering debt coverage, leverage, profitability, capital structure, activity, liquidity, size, etc.
Later, Kim et al. [27] proposed the cluster-based evolutionary undersampling (CBEUS) method to address imbalanced data in bankruptcy prediction. The first step of CBEUS clusters the non-bankrupt companies using k-means clustering and computes the Euclidean distance between each instance and the centroid of its cluster. The second step uses Genetic Algorithms (GA) to determine, for each cluster, the threshold on the distance from the centroid. This approach was successfully applied to a bankruptcy prediction dataset with 106 financial indicators for 22,500 externally non-audited small- and medium-sized Korean manufacturing firms from 2002 to 2007. Of these, 1350 firms filed for bankruptcy and 21,150 did not.
In 2016, a novel approach for bankruptcy prediction that applies eXtreme Gradient Boosting (XGB) to learn an ensemble of decision trees was proposed in [32]. XGB is a boosting method that optimizes a Taylor expansion of the loss function to achieve good performance on many kinds of data. In addition, the authors introduced a new concept, named synthetic features, which are generated by randomly selecting two existing features and an arithmetic operation to combine them. This approach was evaluated on the financial condition of Polish companies, using data from 2007 to 2013 for bankrupt companies and from 2000 to 2012 for still-operating companies, with 64 financial indicators.
Next, Zelenkov et al. [22] proposed a two-step classification method for bankruptcy prediction. In the first stage, individual classifiers are trained and an adequate feature set is selected for each classifier. In the second stage, a voting ensemble with a majority voting rule is built from the classifiers trained in the first stage. This method was demonstrated on a balanced dataset of Russian firms, which consists of 912 observations (456 bankrupt and 456 successful companies). Each firm has 55 features on liquidity, financial stability, turnover, and profitability.
Then, Wang et al. [21] proposed a new kernel extreme learning machine (KELM) model, named GWO-KELM, which tunes its parameters with a novel swarm intelligence algorithm, grey wolf optimization (GWO). The authors used two balanced datasets, the Wieslaw dataset [33] and a Japanese dataset (JPNBDS), to evaluate the effectiveness of GWO-KELM for bankruptcy forecasting. The Wieslaw dataset has 240 real companies, including 112 bankrupt and 128 non-bankrupt companies, with 30 financial ratios. JPNBDS, collected from 1995 to 2009, has 76 non-bankrupt and 76 bankrupt firms with only 10 financial ratios for each company.
Next, a KELM (kernel extreme learning machine)-based bankruptcy prediction model was introduced by Zhao et al. in [34]. In this method, a two-step grid search strategy that combines a coarse search with a fine search is used to optimize the parameters of the proposed model. The Wieslaw dataset [33] was used in the experiments to evaluate the proposed model against five existing models: Support Vector Machine (SVM), Extreme Learning Machine, Random Forest, Particle Swarm Optimization Enhanced Fuzzy k-Nearest Neighbor (PSOFKNN), and Logit models.
Most recently, in an overview article, Barboza et al. [19] implemented and tested several classification models, including SVM with linear and radial basis function (RBF) kernels, artificial neural networks (ANN), logistic regression, boosting, Random Forest, and bagging, to predict bankruptcy. The authors trained on a balanced dataset of 449 bankrupt firms and 449 non-bankrupt firms covering 1985 to 2005. The validation set is an imbalanced dataset covering 2006 to 2013, with 133 bankrupt firms and 13,300 non-bankrupt firms. However, most studies use data from the same period for both the training and validation stages. Such an evaluation says little about practical use, because in reality data from previous and current years are used to train a model that predicts bankruptcy in following years. Additionally, using balanced datasets to predict bankruptcy, as several studies do, is impractical because real-world bankruptcy data are imbalanced.

Preliminaries
In this section, we first present the notation and the imbalanced data problem in binary classification. Then, we summarize several common oversampling techniques for handling imbalanced data, including the synthetic minority oversampling technique (SMOTE) [29], borderline-SMOTE [30], Adaptive Synthetic Sampling (ADASYN) [31], the integration of SMOTE with the Edited Nearest Neighbor rule (SMOTE + ENN) [28], and the integration of SMOTE with Tomek links (SMOTE + Tomek) [28]. Finally, the most popular measure for the imbalanced domain, the ROC curve, is summarized.

Imbalanced Data Problem in Binary Classification
Let $\chi$ be an imbalanced dataset in binary classification. The dataset contains a minority class and a majority class, denoted by $\chi_{min}$ and $\chi_{maj}$, respectively. The balancing ratio $br_{\chi}$ of $\chi$ is determined as:

$$br_{\chi} = \frac{|\chi_{min}|}{|\chi_{maj}|} \tag{1}$$

where $|\cdot|$ denotes the number of elements of a set.
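As a quick check of Equation (1), the following minimal sketch computes the balancing ratio from the class counts that this paper reports for the Korean dataset (307 bankrupt and 120,048 normal companies). The function name `balancing_ratio` is chosen here for illustration only.

```python
def balancing_ratio(n_minority: int, n_majority: int) -> float:
    """Return |X_min| / |X_maj|, the balancing ratio of Equation (1)."""
    if n_majority <= 0:
        raise ValueError("the majority class must be non-empty")
    return n_minority / n_majority


# Class counts reported for the Korean dataset in the Dataset section.
br_korean = balancing_ratio(307, 120_048)
print(round(br_korean, 4))
```

A perfectly balanced dataset has a ratio of 1, so a value this close to 0 indicates a severe imbalance.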
To illustrate this problem, Kang and Cho [35] created six datasets with different balancing ratios (1:1, 1:3, 1:5, 1:10, 1:30, and 1:50) to show the effect of the balancing ratio on the classification performance of the SVM algorithm. Their experimental results showed that the accuracy on the minority class decreased rapidly as the balancing ratio decreased. The main reason is that, when the balancing ratio is low, the performance on the majority class influences simple accuracy much more than the performance on the minority class.
For handling the imbalanced data problem, the most commonly used approach is sampling, which resamples the dataset $\chi$ into a new dataset $\chi_{res}$ such that $br_{\chi_{res}} > br_{\chi}$. Sampling techniques are summarized in Section 3.2. To evaluate models on imbalanced data, the receiver operating characteristic (ROC) curve was proposed to overcome the above shortcomings of accuracy; it is surveyed in Section 3.3.
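The weakness of simple accuracy can be made concrete with the 95%/5% example from the Introduction. The sketch below is illustrative only: the labels are synthetic and the helper name `accuracy_and_minority_recall` is an assumption, not part of the framework.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label=1):
    """Return (overall accuracy, recall on the minority class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    minority_total = sum(t == minority_label for t in y_true)
    minority_hits = sum(
        t == p == minority_label for t, p in zip(y_true, y_pred)
    )
    return correct / len(y_true), minority_hits / minority_total


# Synthetic labels: 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

accuracy, recall = accuracy_and_minority_recall(y_true, y_pred)
print(accuracy, recall)
```

The trivial classifier scores 95% accuracy while finding none of the minority samples, which is why a threshold-independent measure such as the AUC is preferred here.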

Synthetic Minority Oversampling Technique
The SMOTE algorithm generates synthetic data points based on the feature-space similarities between real minority examples. For each example $x_i \in \chi_{min}$, SMOTE considers its k-nearest neighbors (denoted by $K_{x_i}$) based on the Euclidean distance, where k is a given input. Figure 1A shows the four nearest neighbors of $x_i$. To create a synthetic sample for $x_i$, the algorithm randomly selects an element $\hat{x}_i$ such that $\hat{x}_i \in K_{x_i}$ and $\hat{x}_i \in \chi_{min}$ (Figure 1A). The feature vector of the new data point is the sum of the feature vector of $x_i$ and the vector difference between $\hat{x}_i$ and $x_i$ multiplied by a random number $\delta \in [0, 1]$:

$$x_{new} = x_i + (\hat{x}_i - x_i) \times \delta \tag{2}$$

According to Equation (2), the synthetic sample is a point on the line segment joining $x_i$ and the randomly selected $\hat{x}_i \in K_{x_i}$. Figure 1B shows an example of SMOTE: the new sample $x_{new}$ lies on the line between $x_i$ and $\hat{x}_i$.
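The interpolation step of Equation (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the imbalanced-learn implementation used in the experiments; it assumes that neighbors are searched among the minority samples only, and the function name `smote_sample` and the random data are hypothetical.

```python
import numpy as np


def smote_sample(x_min, i, k, rng):
    """Create one synthetic point for minority example i via Equation (2)."""
    x_i = x_min[i]
    dist = np.linalg.norm(x_min - x_i, axis=1)  # Euclidean distances
    dist[i] = np.inf                            # exclude x_i itself
    neighbors = np.argsort(dist)[:k]            # indices of the k nearest
    x_hat = x_min[rng.choice(neighbors)]        # randomly selected neighbor
    delta = rng.random()                        # random number in [0, 1)
    return x_i + (x_hat - x_i) * delta


rng = np.random.default_rng(0)
x_min = rng.normal(size=(10, 2))  # synthetic minority samples (assumption)
x_new = smote_sample(x_min, i=0, k=4, rng=rng)
print(x_new)
```

Because `delta` stays between 0 and 1, `x_new` never leaves the segment between the chosen pair of minority points.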

Adaptive Synthetic Sampling
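SMOTE generates the same number of synthetic data instances for each $x_i \in \chi_{min}$, which can prevent the classification model from achieving good performance. To overcome this limitation, ADASYN was proposed; it uses a systematic method to adaptively create different amounts of synthetic data for different minority examples based on their distributions. ADASYN first determines the number of synthetic data examples to generate for the whole dataset:

$$G = (|\chi_{maj}| - |\chi_{min}|) \times \beta \tag{3}$$

where $\beta \in [0, 1]$ is a given parameter that indicates the balance level after the data are balanced by ADASYN, $|\chi_{maj}|$ is the number of data instances belonging to the majority class, and $|\chi_{min}|$ is the number of data samples of the minority class. Next, ADASYN finds the k-nearest neighbors $K_{x_i}$ (in the same way as the SMOTE algorithm) and calculates the ratio $r_i$ for each example $x_i \in \chi_{min}$:

$$r_i = \frac{\Delta_i}{k} \tag{4}$$

where $\Delta_i$ is the number of samples in $K_{x_i}$ that belong to $\chi_{maj}$. To ensure that the $r_i$ sum to 1, each $r_i$ is normalized to $\hat{r}_i$:

$$\hat{r}_i = \frac{r_i}{\sum_{i=1}^{|\chi_{min}|} r_i} \tag{5}$$

so that $\sum_{i=1}^{|\chi_{min}|} \hat{r}_i = 1$. Then, the number of synthetic data samples to generate for each $x_i \in \chi_{min}$ is determined as:

$$g_i = \hat{r}_i \times G \tag{6}$$

Finally, ADASYN generates $g_i$ synthetic data samples for each $x_i \in \chi_{min}$ using Equation (2). The main idea of ADASYN is to use the density distribution $\hat{r}_i$ to determine the number of synthetic samples generated for each minority example $x_i$. In other words, ADASYN generates many synthetic samples for the data points near the class border and few for the rest.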
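The allocation of synthetic samples described above can be sketched as follows. This is an illustrative reading of the ADASYN steps, not the library code: neighbors are searched in the whole dataset, the per-example counts are rounded to integers (an assumption, since the text does not specify rounding), and the toy data and the name `adasyn_allocation` are hypothetical.

```python
import numpy as np


def adasyn_allocation(x, y, k=5, beta=1.0, minority_label=1):
    """Return the number of synthetic samples for each minority example."""
    x_min = x[y == minority_label]
    n_min = len(x_min)
    n_maj = len(x) - n_min
    g_total = (n_maj - n_min) * beta            # total samples to generate
    r = np.empty(n_min)
    for idx, x_i in enumerate(x_min):
        dist = np.linalg.norm(x - x_i, axis=1)
        # Position 0 of the sorted distances is x_i itself (distance 0),
        # assuming the dataset holds no exact duplicates of x_i.
        nn = np.argsort(dist)[1:k + 1]
        r[idx] = np.sum(y[nn] != minority_label) / k   # share of majority
    if r.sum() == 0:
        return np.zeros(n_min, dtype=int)       # no minority point near border
    r_hat = r / r.sum()                         # normalize so the sum is 1
    return np.rint(r_hat * g_total).astype(int)  # rounding is an assumption


# Toy data: 8 majority points near the origin, 1 minority point next to
# them, and 3 minority points in a distant, clean cluster.
x = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [1, 0.5], [0.5, 1], [0, 0.5],
    [1.2, 1.2], [5, 5], [5.2, 5], [5, 5.2],
], dtype=float)
y = np.array([0] * 8 + [1] * 4)

g = adasyn_allocation(x, y, k=3, beta=1.0)
print(g)
```

In this toy case all of the synthetic samples are assigned to the single minority point that sits among majority points, which matches the intuition that ADASYN concentrates on examples near the border.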

Borderline-SMOTE
Borderline-SMOTE is another method to overcome the limitation of SMOTE. This algorithm also finds the k-nearest neighbors $K_{x_i}$ for each example $x_i \in \chi_{min}$. Then it selects each $x_i$ whose $\Delta_i$ satisfies:

$$\frac{k}{2} \le \Delta_i < k \tag{7}$$

where $\Delta_i$ is the number of samples in $K_{x_i}$ that belong to $\chi_{maj}$. The set of $x_i$ satisfying this condition is called DANGER and is then passed to SMOTE to generate synthetic data instances using Equation (2). The main difference between SMOTE and Borderline-SMOTE is that SMOTE generates synthetic data samples for all examples $x_i \in \chi_{min}$, while Borderline-SMOTE only generates synthetic data samples for the examples in DANGER (the examples near the border between the two classes).
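The selection of the DANGER set can be sketched as below. The code is a minimal illustration under the condition that at least half, but not all, of the k nearest neighbors belong to the majority class; the one-dimensional toy data and the function name `danger_set` are hypothetical.

```python
import numpy as np


def danger_set(x, y, k=5, minority_label=1):
    """Indices of minority examples with k/2 <= (majority neighbors) < k."""
    danger = []
    for idx in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(x - x[idx], axis=1)
        dist[idx] = np.inf                       # exclude the point itself
        nn = np.argsort(dist)[:k]                # k nearest in the whole set
        n_majority = int(np.sum(y[nn] != minority_label))
        if k / 2 <= n_majority < k:
            danger.append(int(idx))
    return danger


# Toy 1-D data: majority points on the left, a minority cluster on the
# right, one minority point near the border (3.4), and one minority point
# fully surrounded by majority points (1.5).
x = np.array([[0.0], [1.0], [2.0], [3.0], [3.4], [5.0], [5.3], [5.5], [1.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

danger = danger_set(x, y, k=3)
print(danger)
```

The point at 1.5 is excluded because all of its neighbors are majority samples, so it is treated as noise rather than as a borderline example.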

Oversampling Followed by Data Cleaning Techniques
Consider a pair $(x_i, x_j)$ where $x_i \in \chi_{min}$ and $x_j \in \chi_{maj}$, and let $d(\cdot, \cdot)$ be the Euclidean distance between two data points. The pair $(x_i, x_j)$ is called a Tomek link [28] if and only if there is no sample $x_k$ such that $d(x_i, x_k) < d(x_i, x_j)$ or $d(x_j, x_k) < d(x_i, x_j)$. If two samples $x_i$ and $x_j$ form a Tomek link, either one of them is noise or both are near the class border. Removing Tomek links after the oversampling step cleans unwanted overlap between the classes, which yields better-defined classification rules and improves classification performance. The integration of SMOTE with Tomek links (SMOTE + Tomek) [28] uses SMOTE for the oversampling step to balance the dataset and then uses Tomek links to remove overlapping samples.
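The Tomek link definition is equivalent to saying that the two samples are each other's nearest neighbor and carry different class labels. The sketch below detects such pairs on hypothetical one-dimensional data; the name `tomek_links` is chosen for illustration, and the pairwise distance matrix makes it suitable for small examples only.

```python
import numpy as np


def tomek_links(x, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)        # a point is not its own neighbor
    nearest = dist.argmin(axis=1)         # nearest neighbor of every point
    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors from opposite classes.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# Toy 1-D data: the points at 1.9 (class 0) and 2.1 (class 1) overlap.
x = np.array([[0.0], [1.0], [1.9], [2.1], [3.5], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

links = tomek_links(x, y)
print(links)
```

The points at 3.5 and 4.5 are also mutual nearest neighbors, but they share a class and therefore do not form a Tomek link.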
Another approach to cleaning unwanted overlap between classes is the neighborhood cleaning rule [12] based on the edited nearest neighbor (ENN) rule, which removes any sample whose class differs from that of at least two of its three nearest neighbors. Like SMOTE + Tomek, SMOTE + ENN [28] uses SMOTE for the oversampling step and then uses ENN to remove the overlapping examples.
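The ENN rule can be sketched as a mask over the dataset. The code assumes the common reading that a sample is removed when the majority of its k nearest neighbors (two out of three for k = 3) has a different class; the toy data and the name `enn_keep_mask` are hypothetical.

```python
import numpy as np


def enn_keep_mask(x, y, k=3):
    """True for samples that agree with the majority of their k neighbors."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # ignore the sample itself
    nn = np.argsort(dist, axis=1)[:, :k]         # k nearest neighbors
    disagree = (y[nn] != y[:, None]).sum(axis=1)
    # For k = 3 a sample is kept if fewer than 2 neighbors disagree.
    return disagree < (k + 1) // 2


# Toy 1-D data: the last sample (0.5, class 1) lies inside the class-0 region.
x = np.array([[0.0], [0.3], [0.7], [1.1], [5.0], [5.3], [5.7], [6.1], [0.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

keep = enn_keep_mask(x, y, k=3)
x_clean, y_clean = x[keep], y[keep]
print(keep)
```

Only the misplaced class-1 sample is dropped; the class-0 samples around it each have a single disagreeing neighbor and are kept.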

ROC Curve
In binary classification, classification performance can be represented by a confusion matrix (see Figure 2). Two evaluation metrics, the true positive rate (TPR) and the false positive rate (FPR), are defined as:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} \tag{8}$$

where TP, FN, FP, and TN are true positives, false negatives, false positives, and true negatives, respectively. The ROC curve [36] is created by plotting TPR against FPR at various threshold settings, and any point on the ROC curve corresponds to the performance of a single classifier on a given distribution (see Figure 3). To compare two ROC curves (R1 and R2), the AUC [36], the area under the ROC curve, was proposed. If R1 provides a larger AUC value than R2 (Figure 3), the classifier associated with R1 performs better than the classifier associated with R2.
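To make the construction of the ROC curve concrete, the sketch below sweeps the decision threshold over a small set of scores, computes TPR and FPR at each threshold, and integrates the curve with the trapezoidal rule. The labels and scores are hypothetical, and the function names are chosen for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = bankrupt) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc(roc_points(y_true, scores))
print(area)
```

In this example eight of the nine positive/negative pairs are ranked correctly, so the area equals 8/9.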

Dataset
The experimental dataset was collected from a Korean financial company over the last two years. The dataset consists of 307 bankrupt companies and 120,048 normal companies, giving a balancing ratio of 0.0026. This ratio is too small for a standard classifier to predict bankruptcy correctly. From the financial statements, each company has 19 financial features (ratios) that have frequently been applied in previous corporate bankruptcy prediction studies, such as assets, liabilities, capital, and profit. These ratios are shown and described in Table 1. To put the ratios on a common scale, we standardized them over the whole dataset by removing the mean and scaling to unit variance, creating new vectors $x'$ by Equation (9), and we use the standardized dataset in the experiments:

$$x' = \frac{x - \bar{x}}{\sigma} \tag{9}$$

where $x$ is the original feature vector, $\bar{x}$ is the mean of that feature vector, and $\sigma$ is its standard deviation. To understand the dataset, we apply principal component analysis (PCA) to visualize it in three-dimensional space. Figure 4 shows the Korean dataset after PCA reduces the 19 features to three components (PC1, PC2, and PC3). The bankrupt companies, marked in red, are concentrated in one region of the chart, which suggests that bankrupt and non-bankrupt companies may be separable.
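The preprocessing described above, standardization followed by a projection onto three principal components, can be sketched with NumPy alone. The random matrix stands in for the 19 financial ratios, which are not public here; the function names and the SVD-based PCA are illustrative choices, not necessarily the tools used by the authors.

```python
import numpy as np


def standardize(x):
    """Column-wise standardization: subtract the mean, divide by the std."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # keep constant features at zero, avoid dividing by 0
    return (x - mean) / std


def pca_project(x, n_components=3):
    """Project centered data onto its leading principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Synthetic stand-in for 200 companies with 19 features (assumption).
features = rng.normal(loc=5.0, scale=3.0, size=(200, 19))

z = standardize(features)
pcs = pca_project(z, n_components=3)  # columns play the role of PC1..PC3
print(pcs.shape)
```

Each row of `pcs` gives the coordinates of one company in the three-dimensional space used for a plot such as Figure 4.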

The Bankruptcy Prediction Framework
In the first experiment, we present the bankruptcy prediction framework (Figure 5). We use five-fold cross-validation to split the data into training and testing sets. In the first case, we pass the original training dataset directly to the bankruptcy prediction module; in other words, the system does not use any oversampling technique to balance the dataset. In the second case, the five oversampling techniques summarized in the previous section (SMOTE, Borderline-SMOTE, ADASYN, SMOTE + Tomek, and SMOTE + ENN) are used in turn in the resampling module. In the bankruptcy prediction module, we use four classification models, namely Random Forest, Decision Tree, Multi-Layer Perceptron, and SVM, to predict bankruptcy. The purpose of this experiment is to show the effectiveness of oversampling for the bankruptcy prediction task.

Novel Features from Transaction Dataset
To increase the performance of bankruptcy prediction for the Korean case study, we collected a transaction dataset that consists of all outcome transactions of each company in the original dataset over the same period as the financial statements. From the transaction dataset, we first identify, for each company (C) in the original Korean dataset, its largest partner (P), that is, the company with the largest total transaction amount with company C. Then, the 19 features of company P are merged with the 19 original features of company C to create the 38 features used for company C.
The 38-feature dataset is denoted as the Korean mixed dataset. The above steps are shown in Algorithm 1. For example, suppose company A has transactions with companies B, C, and D in the transaction dataset. Over the period covered by the financial dataset, the transactions of companies B, C, and D with A total 100, 500, and 800 USD, respectively, as determined by Line 3 of Algorithm 1. Then, only company D is selected in Line 4. Finally, the features of company D are appended to the features of company A in Line 5 of Algorithm 1.
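The feature extraction steps can be sketched as follows. This is an illustrative reconstruction from the description above, not the authors' Algorithm 1: the data layout (a dictionary of feature lists and a list of `(company, partner, amount)` records), the function name, and the decision to skip companies without a known partner are all assumptions. Two features per company are used instead of 19 to keep the example short.

```python
def add_partner_features(features, transactions):
    """Append the features of each company's largest partner.

    features: {company: list of feature values}
    transactions: iterable of (company, partner, amount) records
    """
    # Total transaction amount per (company, partner) pair.
    totals = {}
    for company, partner, amount in transactions:
        per_partner = totals.setdefault(company, {})
        per_partner[partner] = per_partner.get(partner, 0.0) + amount

    mixed = {}
    for company, own in features.items():
        # Only partners with known financial features can be merged.
        partners = {
            p: a for p, a in totals.get(company, {}).items() if p in features
        }
        if not partners:
            continue  # assumption: companies without a partner are skipped
        largest = max(partners, key=partners.get)
        mixed[company] = own + features[largest]
    return mixed


# Hypothetical features (2 per company) and the example from the text:
# B, C, and D transact with A for totals of 100, 500, and 800 USD.
features = {"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]}
transactions = [
    ("A", "B", 100.0), ("A", "C", 300.0), ("A", "C", 200.0), ("A", "D", 800.0),
]

mixed = add_partner_features(features, transactions)
print(mixed)
```

Company D has the largest total, so its features are appended to those of company A, which doubles the feature count in the same way that the 19 original features become 38.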
In this experiment, only the best oversampling technique and classifier from the first experiment (SMOTE + ENN and the Random Forest classifier) were used in the bankruptcy prediction framework. Five-fold cross-validation was used to report the AUC values for both the Korean and Korean mixed datasets.

Experimental Results
The framework was implemented in Python 2.7 and performed on a standard PC (8 GB DDR3L RAM, 40 GB of flash storage, 3.40 GHz × 2 cores Intel Core i7-2600 processor, and Ubuntu 16.04 LTS as an operating system).
To achieve the best performance for each classifier, we ran the framework several times to choose the best parameters for each one. In the oversampling module, all the techniques (SMOTE, Borderline-SMOTE, ADASYN, SMOTE + Tomek, and SMOTE + ENN) come from the imbalanced-learn Python package [37]. All of the above sampling techniques use k_neighbors = 7 and ratio = 'auto'. In the bankruptcy prediction module, the Multi-Layer Perceptron classifier uses the rectified linear unit (ReLU) as the activation function and learning_rate = 0.01, with default values for the other parameters. For the SVM classifier, we use Support Vector Classification (SVC) with probability = True and other default parameters. For the Decision Tree classifier, we set max_depth to 4. Finally, we set max_depth to 4 and n_estimators to 20 for the Random Forest classifier.

Results of Bankruptcy Prediction
This section shows the performance of the bankruptcy prediction framework with various oversampling techniques and classifiers (Table 2). Each AUC value is the average, reported with its standard deviation, over six runs of five-fold cross-validation. As shown in Table 2, the Multi-Layer Perceptron and SVM do not perform well (most of their AUCs do not exceed 70%) regardless of the oversampling technique they are combined with. Meanwhile, without oversampling, Random Forest is the best classifier for bankruptcy prediction with an AUC of 82.4%, while Decision Tree reaches only 76.2%. Oversampling techniques increase the performance of the framework by different amounts. For example, SMOTE increases the AUC of Random Forest by 1.7 percentage points to 84.1%. Similarly, Borderline-SMOTE, ADASYN, SMOTE + Tomek, and SMOTE + ENN increase the AUC of Random Forest by 0.7, 0.7, 1.7, and 1.8 percentage points, respectively. Therefore, SMOTE + ENN with Random Forest is the best combination of oversampling technique and classifier for the bankruptcy prediction framework, yielding an AUC of 84.2% on the original Korean dataset.

Result of Bankruptcy Prediction with Mixed Dataset
This section shows the performance of bankruptcy prediction with SMOTE + ENN in the resampling module and Random Forest in the bankruptcy prediction module for both the Korean dataset and the Korean mixed dataset, using five-fold cross-validation repeated six times (see Table 3). The average AUC of this model is 84.2% for the Korean dataset and 84.4% for the Korean mixed dataset. Therefore, combining the financial dataset with the transaction dataset helps the bankruptcy prediction framework achieve better performance than using the original financial dataset alone. To verify this improvement, a paired t-test was used to compare the mean performance on the two datasets. We ran the framework on the Korean dataset and the Korean mixed dataset in 30 batches (each batch was run six times and the average AUC value was obtained) and obtained a p-value of 0.0062 for the two sets of results. At the usual significance level of 0.05, we can reject the null hypothesis that the two mean performances are equal. Therefore, using the Korean mixed dataset in the above bankruptcy prediction framework is better than using the Korean dataset. In other words, features from the transaction dataset can improve the performance of the bankruptcy prediction framework.
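The statistic behind the paired t-test can be sketched with the standard library alone. The values below are hypothetical batch averages chosen only to show the computation; they are NOT results from this study, and the function name is illustrative.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic of the paired differences a_i - b_i (n - 1 d.o.f.)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)       # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n))


# Hypothetical batch-average AUC values, not taken from the paper.
mixed = [0.845, 0.843, 0.846, 0.844, 0.842]
plain = [0.842, 0.842, 0.843, 0.841, 0.842]

t_stat = paired_t_statistic(mixed, plain)
print(t_stat)
```

The p-value follows from comparing `t_stat` with a Student t distribution with n - 1 degrees of freedom (29 for the 30 batches described above); a statistics package can supply that distribution.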
Because the Korean mixed dataset has more features than the Korean dataset, the training time of the bankruptcy prediction framework on the Korean mixed dataset (Case A) is longer than on the Korean dataset (Case B). In detail, Case A requires 47.9 s on average while Case B requires 46.5 s for the oversampling and training steps of each fold. The time gap between Case A and Case B is negligible. In addition, the trained model can be saved to a file and then loaded to make predictions many times.

Conclusions
This study utilizes several oversampling techniques to deal with the imbalance problem on a financial dataset collected from Korean companies from 2016 to 2017. Experimental results show that oversampling techniques can improve the performance of the bankruptcy prediction framework; SMOTE + ENN in the resampling module combined with Random Forest in the bankruptcy prediction module achieved the best AUC, 84.2%. Next, the study combines the financial dataset with a transaction dataset from the same period to extract more features from each company's largest partner. These features are added to the original dataset to improve the performance of bankruptcy prediction. The best overall AUC of the framework reaches 84.4% on the Korean mixed dataset.
For future work, several related issues will be studied. First, we will try to improve performance on this task by using other techniques for the imbalanced data problem. Second, an online classifier will be studied so that previous results can be reused.

Figure 1. (A) Example of the four nearest neighbors of $x_i$; and (B) $x_{new}$ generated by SMOTE based on the Euclidean distance.

Figure 2. Confusion matrix for performance evaluation.

Figure 4. Three-dimensional visualization of the Korean dataset using PCA.

Algorithm 1. A novel algorithm for feature extraction.

Table 1. The set of features extracted from financial statements.

Table 2. Results of the bankruptcy prediction framework for the Korean dataset.

Table 3. Results of bankruptcy prediction for the Korean dataset and the Korean mixed dataset in five-fold validation, six times.