Bankruptcy Prediction Using Machine Learning Techniques

Abstract: In this study, we apply several advanced machine learning techniques, including extreme gradient boosting (XGBoost), support vector machines (SVM), and a deep neural network, to predict bankruptcy using easily obtainable financial data of 3728 Belgian small and medium enterprises (SMEs) during the period 2002–2012. Using these techniques, we predict bankruptcies with a global accuracy of 82–83% using only three easily obtainable financial ratios: the return on assets, the current ratio, and the solvency ratio. While the prediction accuracy is similar to that of several previous models in the literature, our model is very simple to implement and represents an accurate and user-friendly tool to discriminate between bankrupt and non-bankrupt firms.


Introduction
Bankruptcy detection is a major topic in finance. Indeed, for obvious reasons, many actors such as shareholders, managers or banks are interested in the likelihood of bankruptcy of firms. Consequently, many studies have been carried out on the topic of bankruptcy prediction. In the late 1960s, Beaver (1966) introduced a univariate analysis, providing the first statistical justification for the ability of financial ratios to account for default. Then, Altman (1968) developed the Z-score model, which uses five financial ratios to predict the bankruptcy of U.S. firms. In his paper, Altman employed multiple discriminant analysis (MDA) to determine the probability of bankruptcy on a sample of firms. Altman's Z-score model has been popular and widely used by auditors, accountants, courts, banks, and other creditors. The MDA technique assumes that variables follow a normal distribution, and this methodology was later adopted by many other researchers (Deakin 1972; Edmister 1972; Altman et al. 1977; Laitinen 1991; Grice and Ingram 2001). The hypothesis of multinormality of variables was then questioned in favor of the hypothesis that the explanatory variables have different distributions. Consequently, the logit (Ohlson 1980) and probit (Zmijewski 1984) models were frequently used in failure prediction.
The second stage of the story began in the 1990s with artificial intelligence algorithms, specifically from the machine learning branch, such as neural networks (Lennox 1999) or genetic algorithms (Shin and Lee 2002). They produced convincing results in terms of forecasting without requiring any statistical restriction. Indeed, Barboza et al. (2017) tested five machine learning models and compared their bankruptcy prediction power against traditional statistical techniques (discriminant analysis and logistic regression) using North American firms' data from 1985 to 2013. Their study found a substantial improvement in bankruptcy prediction accuracy using machine learning techniques compared to traditional methods. Adnan and Dar (2006) reach the same conclusion in their extensive literature review. More recently, in the 21st century, new types of learning machines such as extreme gradient boosting (XGBoost), support vector machines, random forests or deep neural networks were developed and often provided better accuracy than statistical techniques. The literature review by Shi and Li (2019) reports that logit and neural network models are the most frequently used and studied models in the area of bankruptcy prediction. Mai et al. (2018) compared traditional machine learning models with convolutional neural networks on a large database of US public companies and report that simpler models are more effective, while Hosaka (2019), working with a smaller sample of delisted Japanese companies, reports that convolutional neural networks yield better predictions. So far, no consensus exists regarding the use of convolutional neural networks.
In addition to the choice of the model, bankruptcy prediction accuracy depends on the financial ratios that are chosen to run the model. To this end, statisticians have developed methods such as principal components analysis (Zavgren 1985; Wang 2004; Tang and Chi 2005; Pompe and Bilderbeek 2005) or the LASSO technique (Meir et al. 2008; Tian et al. 2015) for selecting subsets of variables from an initial list of explanatory variables to identify the most relevant predictors (Fan and Li 2001). These lists may include 50 ratios (du Jardin 2015) that are built using detailed information from the balance sheet and the income statement. Out of these lists, 5 to 10 ratios are generally retained to run the model. Du Jardin (2006) reports several variable selection methods. Even though most studies use annual data from one year prior to bankruptcy, Baldwin and Glezen (1992) resorted to quarterly data to feed their model, and Reznakova and Karas (2014) calculated ratios averaged over several years before bankruptcy. Some studies include nonfinancial variables (in addition to financial ratios) in bankruptcy prediction models. These variables may refer to market valuation (Campbell et al. 2008; Tian et al. 2015), corporate governance (Ciampi 2015), relational data (Tobback et al. 2017) or textual data (Mai et al. 2018). Other specificities may exist. Pompe and Bilderbeek (2005) investigated model accuracy in a period of economic decline. Bankruptcy prediction models can be related to SMEs (Brédart and Cultrera 2016) or to listed firms (Sfakianakis 2012). Nevertheless, the availability of this detailed information for some firms may be questioned.
Similar to many other western countries, Belgium has recorded many bankruptcies in the recent past. Brédart's (2014) study used Belgian bankruptcy data to predict bankruptcies with the neural network method and reached a good prediction accuracy of above 80%. In this paper, we replicate Brédart's study using a simple neural network with one hidden layer and four cutting-edge machine learning techniques, namely a deep feed-forward network with six hidden layers, Random Forests, a Support Vector Machine with a radial basis function kernel, and Extreme Gradient Boosting (XGBoost), and compare their performances. Even though Brédart used only three variables in his model, the prediction accuracy was fairly good. In our study, we used the same financial ratios that were considered by Brédart, as they have a good bankruptcy prediction rate. Our objective is to determine whether sophisticated machine learning techniques predict better on Brédart's Belgian data than his neural network model.
Our results report prediction accuracy rates of more than 80 percent, using only three easily obtainable financial ratios: the return on assets, the current ratio, and the solvency ratio. Even though Barboza et al.'s (2017) study reported significantly higher prediction accuracy using machine learning techniques compared to traditional methods, our study did not show any improvement in prediction from the machine learning techniques compared to the shallow neural network method with a single hidden layer. All four methods gave similar results of about 81% prediction accuracy. The graphical plots of the data show a significant overlap between the features of the bankrupt and solvent firms. This overlap limits the ability of advanced models to produce better predictions. Nevertheless, the ease of collecting the information feeding our models makes them very attractive for decision makers such as bankers. Our study contributes to the academic literature on bankruptcy prediction because it shows that simple models using easily obtainable and common information can be reliable and help to make adequate decisions.

Data
In this study, we used the same dataset of Belgian firms previously used by Brédart (2014). This dataset consists of a sample of 3728 Belgian firms, with bankruptcies declared between 2002 and 2012, which we use to predict bankruptcy with the newer prediction techniques. We used the same three financial ratios that were considered by this author, as they are simple, easily available and provided good classification rates.
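As an illustration of how little input the three ratios require, the following minimal sketch computes them from a handful of balance-sheet and income-statement items. The formulas are common textbook definitions and are an assumption here, since the text does not spell out the exact definitions used in the dataset; the `FirmAccounts` container, its field names and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FirmAccounts:
    """Hypothetical container for annual accounts (all amounts in one currency)."""
    net_income: float
    total_assets: float
    current_assets: float
    current_liabilities: float
    equity: float


def financial_ratios(firm: FirmAccounts) -> dict:
    """Compute the three ratios under assumed textbook definitions."""
    return {
        # Profitability: net income generated per unit of assets.
        "return_on_assets": firm.net_income / firm.total_assets,
        # Liquidity: short-term assets available per unit of short-term debt.
        "current_ratio": firm.current_assets / firm.current_liabilities,
        # Solvency: share of the assets financed by equity (assumed definition).
        "solvency_ratio": firm.equity / firm.total_assets,
    }


# Made-up figures, for illustration only.
example = FirmAccounts(net_income=50.0, total_assets=1000.0,
                       current_assets=400.0, current_liabilities=200.0,
                       equity=300.0)
ratios = financial_ratios(example)
print(ratios)
```

Each firm is then represented by a three-component feature vector, which is the only input the classifiers discussed below would need.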

Methodology
To predict bankruptcy, we utilized several of the most advanced machine learning techniques, including a deep neural network, a support vector machine (SVM) and the extreme gradient boosted tree method (XGBoost). These techniques are described hereafter.

Deep Feedforward Neural Networks
Deep feedforward artificial neural networks are advanced supervised machine learning methods that learn patterns in the input data through compositions of mathematical functions in order to map input data to corresponding outputs (Goodfellow et al. 2016). Deep feedforward networks are capable of performing both regression and classification tasks. As shown in Figure 1, a typical feedforward network consists of three types of layers: an input layer, which receives the data; an output layer, which represents the network output; and a number of hidden layers, which perform the task of mapping the input features to the network output. Each hidden layer consists of a set of non-interacting neurons, which process the data in parallel. The width of the network is determined by the number of neurons in the hidden layers, whereas the depth of the network is a measure of the number of its hidden layers. A shallow neural network has at most a few hidden layers, whereas a deep network usually has a larger number of hidden layers. A feedforward network is said to be fully connected if each neuron in a specific layer receives the outputs of all the neurons in the previous layer. As illustrated in Figure 1 (lower panel), each neuron performs a two-step mathematical operation. In the first step, it computes the scalar product between its input vector and its local weight vector, and the result is shifted by the addition of a scalar bias. In the second step, the result is transformed through the activation function of the neuron.
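The two-step operation of a neuron described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the network used in the study: the sigmoid activation, the layer sizes, and all weights and input values are assumptions chosen only for demonstration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation (an assumed choice; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    # Step 1: scalar product of inputs and weights, shifted by the bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: transform the result through the activation function.
    return activation(z)


def dense_layer(inputs, weight_rows, biases, activation=sigmoid):
    """Fully connected layer: every neuron receives all inputs."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Illustrative feature vector (CR, SR, PR) and arbitrary weights.
features = [1.2, 0.3, 0.05]
hidden = dense_layer(features,
                     weight_rows=[[0.5, -1.0, 2.0], [-0.3, 0.8, 1.5]],
                     biases=[0.1, -0.2])
probability = neuron_output(hidden, weights=[1.0, -1.0], bias=0.0)
print(hidden, probability)
```

Stacking more calls to `dense_layer` increases the depth of the network, while longer `weight_rows` lists increase its width.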
The training stage of the network starts with the data being individually propagated forward through the network to generate the corresponding error signal, which arises from the discrepancy between the network's predicted output $f(x_i)$ and the ground truth $y_i$. Next, a backward propagation step is initiated in which the gradient descent optimization method is applied to adjust the weights in order to reduce the network output error. In "online learning", this two-step learning process is applied to each datum individually, whereas in "batch learning", a batch of m input data is propagated through the network before their collective error is computed and the weights are adjusted (Goodfellow et al. 2016).
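The difference between online and batch updates can be made concrete with a small sketch. It assumes a single linear unit and a mean squared error loss, neither of which is specified in the text; the function names, the toy data and the learning rate are likewise illustrative.

```python
def predict(weights, bias, x):
    return sum(w * xj for w, xj in zip(weights, x)) + bias


def batch_error(weights, bias, xs, ys):
    """Mean squared error over a batch of examples (assumed loss)."""
    return sum((predict(weights, bias, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def gradient_step(weights, bias, xs, ys, learning_rate):
    """One gradient-descent update on the examples given.

    Passing one example gives an online update; passing m > 1
    examples gives a batch update on their collective error.
    """
    m = len(xs)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        err = predict(weights, bias, x) - y
        for j, xj in enumerate(x):
            grad_w[j] += 2.0 * err * xj / m
        grad_b += 2.0 * err / m
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - learning_rate * grad_b


xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1

# Batch learning: all four examples contribute to each update.
weights, bias = [0.0], 0.0
initial_error = batch_error(weights, bias, xs, ys)
for _ in range(500):
    weights, bias = gradient_step(weights, bias, xs, ys, learning_rate=0.05)
final_error = batch_error(weights, bias, xs, ys)

# Online learning: one pass, updating after every single example.
w_online, b_online = [0.0], 0.0
for x, y in zip(xs, ys):
    w_online, b_online = gradient_step(w_online, b_online, [x], [y], 0.05)
online_error = batch_error(w_online, b_online, xs, ys)
print(final_error, online_error)
```

In a deep network the gradients would be obtained by backpropagation through all layers, but the update rule applied to each weight has the same form.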
The data is usually divided into three subsets, named the "training set", the "development set", and the "testing set". The training set is the largest of the three (60-70% of the data) and is used to optimize the network by repeatedly passing it through the network until the desired accuracy is reached. The development set is then used to test the accuracy of the network on previously unseen data. This step is crucial to diagnose over-fitting and related issues. Any significant discrepancy between the network's accuracy on the training and development sets leads to repeating the training process, possibly after adjusting the network hyperparameters (batch size, learning rate, etc.) or structure (number of layers, etc.), or after introducing a regularization scheme. After the network reaches the desired accuracy on the development set, the testing set is used to determine the final accuracy of the network. Several popular packages offer implementations of deep neural networks, the most prominent of which are the Python-based TensorFlow (Google) and PyTorch (Facebook). These packages offer multiple tools for the implementation and regularization of neural networks, such as drop-out and Lp-norm methods. In this study, we used the TensorFlow implementation of a deep feedforward network to predict bankruptcy or solvency from the financial ratios.
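A three-way split of the kind described above can be sketched with the standard library alone. The 70/15/15 proportions and the fixed random seed are assumptions for illustration; the text only states that the training set takes 60-70% of the data.

```python
import random


def split_dataset(records, train_frac=0.7, dev_frac=0.15, seed=0):
    """Shuffle the records and cut them into training, development and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test


# Stand-in for the firm records: one integer identifier per firm.
firms = list(range(3728))
train, dev, test = split_dataset(firms)
print(len(train), len(dev), len(test))
```

Because the three subsets are disjoint, accuracy measured on the development and testing sets reflects performance on firms the model has not seen during training.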

Support Vector Machines (SVMs)
Support vector machines (SVMs) are powerful supervised machine learning methods that are used mainly for classification. They are a class of optimal margin classifiers that find a boundary hypersurface (or plane) with maximal margin between data clusters belonging to different classes (Boser et al. 1992; Cortes and Vapnik 1995). SVMs classify high-dimensional data efficiently, even in cases where different classes overlap significantly. Mathematically, the method finds a d-dimensional hyperplane satisfying the inequality, written here in the standard soft-margin form:

$$ y_i \left( \sum_{j=1}^{d} w_j x_{ij} + b \right) \geq 1 - \zeta $$

for the d-dimensional data point $x_i$ corresponding to output $y_i$, where $w_j$ is the jth component of the weight vector, $b$ is the offset of the hyperplane, and $\zeta$ ($0 \leq \zeta \leq 1$) is a soft margin slack variable used to adjust the boundary between classes in such a way as to minimize the classification error while allowing misclassification of overlapping data points. If the data is linearly separable, the method separates the classes exactly and generalizes well to previously unseen data (Abe 2005). If the data is not linearly separable, it is mapped into a higher-dimensional scalar product space through a set of functional transformations, and the dot products in the above inequality are replaced by products of functions of the data. The above inequality then takes the new form:

$$ y_k \left( \sum_{j} w_j \phi_j(x_k) + b \right) \geq 1 - \zeta $$

for $1 \leq k \leq N$, where the $\phi_j$ are functions that cast the data into a high-dimensional feature space that eliminates or reduces the degree of overlap between the different classes in an attempt to improve the generalization accuracy of the model.
In practice, to find the optimal decision boundary between the data classes, algorithms implementing the SVM model use a two-step iterative optimization process. In the first step, the data is cast into a high-dimensional space to find a decision hyperplane. In the second step, the distance between the resulting hyperplane and the closest data points is adjusted in order to maximize the margin between the decision boundary and the nearest data points. The power of the SVM classifier lies in the fact that it always finds a decision boundary, even when there is significant overlap between the different data classes. After it finds such an optimal boundary, the SVM model is ready to classify new data points according to the side of the decision boundary on which their coordinates lie. The SVM algorithm is implemented in various programming languages, including Python. The Python implementation of the SVM is included in the open-source machine learning package Scikit-Learn. The ROC curve of the SVM model is shown in Figure 2.
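The role of the slack variable and of the feature mapping can be illustrated with a short sketch. It uses the standard per-point hinge slack, which is an assumption relative to the single bounded slack variable in the text: here the slack is zero for points outside the margin, between 0 and 1 for points inside the margin but on the correct side, and larger than 1 for misclassified points. The weights, points and the `gamma` value are made up, and the radial basis function is included only because that kernel is the one named in the introduction.

```python
import math


def decision_value(weights, bias, x):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(w * xj for w, xj in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Assign the class according to the side of the decision boundary."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


def slack(weights, bias, x, y):
    """Smallest slack for which y * (w . x + b) >= 1 - slack holds."""
    return max(0.0, 1.0 - y * decision_value(weights, bias, x))


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel, replacing the plain dot product."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


weights, bias = [1.0, -1.0], 0.0
points = [([2.0, 0.0], 1),    # outside the margin, correct side
          ([0.5, 0.0], 1),    # inside the margin, correct side
          ([0.0, 0.5], 1),    # wrong side: misclassified
          ([0.0, 2.0], -1)]   # outside the margin, correct side
slacks = [slack(weights, bias, x, y) for x, y in points]
labels = [classify(weights, bias, x) for x, _ in points]
print(slacks, labels)
```

Training an SVM amounts to choosing the weights and offset that keep the margin wide while keeping the total slack small; library implementations such as the one in Scikit-Learn solve this optimization internally.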

Extreme Gradient Boosting
Boosting is a powerful ensemble-based learning method that combines a set of easily learnable weak classifiers into a much more powerful classifier (Kearns and Valiant 1989; Schapire 1990, 1999). Extreme gradient boosting (XGBoost) is a variant of gradient boosting with superior performance that uses a more regularized model formalization to control overfitting (Chen and Guestrin 2016). Alongside deep learning, XGBoost is one of the most successful methods for large-scale data classification and is the method of choice for many winning entries in Kaggle machine learning competitions. XGBoost is implemented in many programming languages, including Python, where the standalone xgboost package offers an interface compatible with Scikit-Learn.
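The idea of combining weak learners can be sketched with plain gradient boosting on one-dimensional decision stumps. This is deliberately not XGBoost: it omits the regularization terms and tree-growing machinery that distinguish that method, and it assumes a squared error loss. The toy feature values, labels, number of rounds and learning rate are all illustrative.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially, each one to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: one feature, label 1 for one class and 0 for the other.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 1, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys)
baseline_mse = mse(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mse(model, xs, ys)
print(baseline_mse, boosted_mse)
```

Each stump on its own is a poor classifier, but the sum of many small corrections fits the training data far better than the constant baseline, which is the essence of boosting.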

Results
Table 1 shows the correlations between the variables that are used in the models to predict the bankruptcy of Belgian firms. All the correlation coefficients are relatively small. The accuracy of the different classification algorithms in predicting whether new inputs are bankrupt or not is limited to the 81-82% range. This limitation in accuracy is due to a significant overlap between the two classes, as shown in Figure 3.
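The distinction between global accuracy and the accuracy within each class, which underlies the comparisons reported in this section, can be made concrete with a short sketch. The label convention (1 for bankrupt, 0 for solvent) and the example labels are assumptions for illustration and are not results from the study.

```python
def class_and_global_accuracy(y_true, y_pred):
    """Per-class and overall accuracy; labels: 1 = bankrupt, 0 = solvent (assumed)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_bankrupt = sum(1 for t in y_true if t == 1)
    n_solvent = len(y_true) - n_bankrupt
    return {
        # Share of bankrupt firms correctly flagged as bankrupt.
        "bankrupt_accuracy": true_pos / n_bankrupt,
        # Share of solvent firms correctly recognised as solvent.
        "solvent_accuracy": true_neg / n_solvent,
        # Share of all firms classified correctly.
        "global_accuracy": (true_pos + true_neg) / len(y_true),
    }


# Made-up labels and predictions for ten firms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
accuracies = class_and_global_accuracy(y_true, y_pred)
print(accuracies)
```

Two models with the same global accuracy can differ markedly in how well they recognise each class, which is why both kinds of figures are worth reporting.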

Accuracy Comparisons of Different Models
Table 2 shows the accuracy of the models used in this study. The different models provide roughly the same level of global accuracy of about 85% for correctly predicting whether a specific firm is bankrupt. While the global accuracy is only slightly better (2 percent more) than that obtained by the shallow network used by Brédart (2014), our current models show a 17% improvement in correctly classifying healthy corporations. One of the limitations of these prediction techniques is that the financial data of the two classes are not cleanly separable, as shown in Figure 3. The algorithms would have achieved a higher prediction accuracy if the data were more separable.
The Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR), is shown in Figure 4 for the feedforward neural network model. A higher value on the y-axis indicates a higher number of true positives relative to false negatives; it represents how well the model predicts the positive class when the actual outcome is positive. Better prediction performance can be expected the closer the curve lies to the top-left corner of the plot. The area below the ROC curve is called the Area Under the Curve (AUC), which is the probability that a randomly chosen bankrupt instance receives a higher predicted score than a randomly chosen non-bankrupt instance. As shown in Figure 4, this area is close to 0.85 for the feedforward neural network model, which indicates a skilful model. The ROC curve of the SVM model is shown in Figure 2, and the area under this curve is also close to 0.85.
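The ROC curve and its area can be computed directly from predicted scores, as the following minimal sketch shows. It evaluates the area in two ways, by the trapezoidal rule over the curve and as the fraction of bankrupt and non-bankrupt pairs that are ranked correctly, to illustrate that the two readings of the AUC agree. The labels and scores are made up and are not taken from the study.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold downwards."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_pairwise(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = bankrupt) and predicted bankruptcy scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
curve = roc_points(y_true, scores)
area_curve = auc_trapezoid(curve)
area_pairs = auc_pairwise(y_true, scores)
print(area_curve, area_pairs)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value close to 0.85 means that most bankrupt firms receive a higher score than most solvent firms.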

Conclusions
Bankruptcy prediction models may use many different data and techniques. The latest studies on bankruptcy prediction used different kinds of variables (Tobback et al. 2017; Mai et al. 2018) and cutting-edge techniques (Mai et al. 2018; Hosaka 2019). In this study, by applying an optimized neural network with six hidden layers, a support vector machine and the XGBoost classification algorithm to the financial data of 3728 Belgian enterprises, we achieve a global bankruptcy prediction accuracy of 82-83%. Compared to Brédart's 2014 analysis of the same dataset with a shallow neural network, we achieve a slight 2% improvement in the bankruptcy cases and a 17% improvement in the solvency cases. We attribute the limitation in prediction accuracy to the significant overlap in the feature space between the financial variables of bankrupt and solvent companies. However, our study does not report significant differences in prediction accuracy, regardless of the technique used. Moreover, a substantial prediction accuracy rate is achieved by using only three financial ratios that are easily obtainable for most firms. In addition to its contribution to the academic literature, this study is of high interest for bankers who want to assess the probability of bankruptcy (and therefore of non-reimbursement) of firms requesting loans without having to compute many financial ratios and collect non-financial data.
Figure 1 shows a schematic diagram of a fully connected feed-forward network. The upper figure shows the general structure of the network, whereas the lower figure shows the schematic diagram of the artificial neuron.

Figure 1. Upper figure: fully connected feed-forward artificial neural network (CR = Current ratio; SR = Solvency ratio; PR = Profitability ratio); lower figure: schematic of the computational unit (artificial neuron) of the network.

Figure 2. ROC curve of support vector machine model.

Figure 3. An illustration of the overlap between bankrupt and solvent cases in the profit-solvency space.

Figure 4. ROC curve of feedforward neural network model.

Table 2. Class and model accuracies.