Corporate Default Predictions Using Machine Learning: Literature Review

Corporate default predictions play an essential role in each sector of the economy, as highlighted by the global financial crisis and the increase in credit risk. This study reviews the corporate default prediction literature from the perspectives of financial engineering and machine learning. We define three generations of statistical models: discriminant analyses, binary response models, and hazard models. In addition, we introduce three representative machine learning methodologies: support vector machines, decision trees, and artificial neural network algorithms. For both the statistical models and machine learning methodologies, we identify the key studies used in corporate default prediction. By comparing these methods with findings from the interdisciplinary literature, our review suggests some new tasks in the field of machine learning for predicting corporate defaults. First, a corporate default prediction model should be a multi-period model in which future outcomes are affected by past decisions. Second, the stock price and the corporate value determined by the stock market are important factors to use in default predictions. Finally, a corporate default prediction model should be able to suggest the cause of default.


Introduction
Forecasts of corporate defaults are used in various fields across the economy. Corporations can diagnose their current statuses based on prediction models and establish their strategies. Executives can run their businesses more stably by managing key indicators that affect corporate default risk. Investors can revise their strategies and improve their portfolios by examining the likelihood of corporate defaults. Additionally, governments can establish macroprudential policies and improve related financial regulations using corporate default predictions. In these ways, default prediction models help in designing and improving the financial system. Moreover, by employing machine learning algorithms and statistical models, corporate default predictions are at the cutting edge of advanced financial engineering. The recent global financial crisis and the increase in credit risk highlight the importance of this field. Because of their importance, corporate default predictions have been extensively studied since the work of Beaver [1].
Thus far, several structural models have been used to explain corporate defaults. Merton [2] develops the "distance-to-default" measure of corporate default risk using a Black-Scholes-type pricing model. In a recent study, Jessen and Lando [3] confirm that the distance-to-default measure can robustly detect corporate default risk. They also present a distance-to-default measure with an adjustment using the stochastic volatility of the value of assets. Glover [4] proposes a structural model to calculate a corporation's expected default costs. Brogaard et al. [5] use the distance-to-default approach to show that enhanced stock liquidity reduces corporate default risk. Hillegeist et al. [6] argue that a market-based measure based on the Black-Scholes-Merton model performs better than Altman's [7] Z-score and Ohlson's [8] O-score when assessing a discrete hazard model. Duffie et al. [9] propose a doubly stochastic model to estimate the term structure of corporate default risk. Recently, however, reduced-form models with statistical approaches and machine learning algorithms that predict the likelihood of a corporate default provide more satisfactory results than structural models in general.
In this study, we review the corporate default prediction literature from the perspectives of financial engineering and machine learning simultaneously. We attempt to identify new opportunities in the field of machine learning for predicting corporate defaults by comparing these two strands of literature. This study therefore examines the research on corporate default forecasts thus far and introduces representative methodologies and major studies by categorizing statistical approaches and machine learning techniques. We define three generations of the main statistical models for corporate default forecasting as discriminant analyses, binary response models, and hazard models, and then we study each generation. Among machine learning algorithms, classification methodologies are mainly used in this setting, with support vector machines (SVMs), decision trees, and artificial neural network algorithms being typical. Several studies already review the field of bankruptcy prediction [10][11][12][13]; in this study, however, we focus on discovering new research topics and challenges. The development of machine learning methodologies is expected to further accelerate innovation in the financial sector, leading to the emergence of new financial services and the rise of a data economy based on the distribution of data.
Before starting the review, we need to set the scope of the investigation. In this study, we define the financial literature as academic journals covering the following areas: accounting, finance, financial economics, economics, and econometrics. The machine learning literature includes academic papers covering computer science, statistics, and operations research. We search the Google Scholar and EBSCOhost databases to collect the related literature exhaustively. The keywords used for the search are "corporate default prediction" or "bankruptcy prediction." We also include the keyword "machine learning" when searching for the machine learning literature. After constructing our own classification system, we use the name of each methodology as an additional keyword. The citation count (e.g., Journal Citation Reports-Clarivate Analytics) is mainly used for study selection, but our subjective evaluations are also considered.
This study explores various techniques and algorithms used for corporate default predictions. The remainder of the paper is organized as follows. Section 2 focuses on statistical approaches for corporate default predictions, and Section 3 reviews several machine learning techniques. Section 4 concludes the study.

Corporate Default Prediction Using Statistical Approaches
Corporate default forecasts using statistical models can be largely classified into three generations. Table 1 shows these three primary methodologies and their representative studies. Various studies have been conducted as each methodology has expanded, and these methods are actively used to this day.

Discriminant Analysis
Studies in the first generation of the corporate default literature use discriminant analysis, a default prediction methodology that has been widely used since the work of Beaver [1] and Altman [7]. These studies build reduced-form default prediction models using discriminant analysis and provide ordinal rankings of default risk by generating credit scores. The famous Altman Z-score is one example, and subsequent studies use this methodology [14,15]. Discriminant analysis selects the variables that best discriminate between bankrupt and non-bankrupt companies and calculates the discriminant function as a linear combination of these variables, as follows:

$$D = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m,$$

where $D$ is the discriminant score calculated by the discriminant function, $\beta_0$ is a constant, $\beta_m$ is an estimated coefficient, and $X_m$ is an explanatory variable. An observation is classified as normal if the discriminant score is below a certain threshold and is classified as being in default if the score is above that threshold. However, discriminant analysis requires the assumptions that the independent variables follow a multivariate normal distribution and that the covariance matrices of the normal and default groups are identical [16]. Discriminant analysis does have the advantage that corporations can be ranked according to their degree of default risk.
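As an illustration of such a linear discriminant score, the sketch below uses the published coefficients of Altman's [7] Z-score; the firms' ratio values are hypothetical, and note that for the Z-score the sign convention is reversed (lower scores indicate higher default risk):

```python
import numpy as np

# Altman's [7] Z-score: a linear discriminant function over five financial ratios.
# The coefficients are the published ones; the firms' ratios below are fabricated.
COEFFS = np.array([1.2, 1.4, 3.3, 0.6, 1.0])

def z_score(ratios):
    """ratios = [WC/TA, RE/TA, EBIT/TA, MVE/TL, Sales/TA]."""
    return float(COEFFS @ np.asarray(ratios))

def classify(z):
    # Altman's cut-offs: below 1.81 -> distress zone, above 2.99 -> safe zone.
    if z < 1.81:
        return "distress"
    if z > 2.99:
        return "safe"
    return "grey zone"

healthy = z_score([0.4, 0.3, 0.15, 1.5, 2.0])      # -> 4.295, "safe"
troubled = z_score([0.05, -0.1, -0.02, 0.2, 0.8])  # -> 0.774, "distress"
```

The score itself is an ordinal ranking of default risk, which is the advantage of discriminant analysis noted above.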

Binary Response Models
The second generation of corporate default predictions uses binary response models. A binary response model defines a corporation's state as either normal (= 0) or in default (= 1). It estimates the probability of a default using explanatory variables and applies a logistic or probit function in most cases. A representative example is the O-score with a logistic function, introduced by Ohlson [8], whereas Zmijewski [17] tests corporate default risk using a probit model. These binary models allow bankruptcy probabilities to be calculated for the next period. As Foreman [18] and Charitou et al. [16] explain, the binary response model defines the probability that corporation n = 1, . . . , N defaults, $P_n$, as follows:

$$P_n(y_n = 1) = F(\beta_1 X_{1,n} + \beta_2 X_{2,n} + \cdots + \beta_m X_{m,n}),$$

where $F$ is the link function (the logistic function for a logit model or the standard normal cumulative distribution function for a probit model), $y_n = 1$ if corporation n defaults and 0 otherwise, $P_n(y_n = 1)$ is the probability of default for corporation n, $\beta_1, \beta_2, \ldots, \beta_m$ are slope coefficients, and $X_{1,n}, X_{2,n}, \ldots, X_{m,n}$ are explanatory variables for corporation n.
A binary response model has several advantages over discriminant analysis for corporate default forecasting. First, a binary response model does not require any assumptions about the probability of default or the distributions of the predictor variables. Second, it can test the significance of individual independent variables. Lastly, it can be used to calculate the probability of default in the next period. Campbell et al. [19] extend the binary response model by investigating corporate defaults using a multiple logit model. This approach allows us to calculate corporate default probabilities and predict default risk over several periods. Aretz et al. [20] adopt Campbell et al.'s [19] methodology and use data for non-U.S. companies to identify a significantly positive default risk premium. Kukuk and Rönnberg [21] extend the binary response model by proposing a mixed logit model that allows stochastic parameters and non-linearities in the regressor variables. Bonfim [22] tests macroeconomic and financial data using a probit model and argues that corporate defaults are driven by multiple firm-specific factors. However, the results show that macroeconomic factors are also important in estimating default risk over time.
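A minimal numeric sketch of the logit version of the binary response model follows; the coefficients and ratios are hypothetical values for illustration, not estimates from Ohlson [8] or Zmijewski [17]:

```python
import numpy as np

def default_probability(x, beta, beta0=0.0):
    """P(y_n = 1) under a logit link: 1 / (1 + exp(-(beta0 + beta'x)))."""
    z = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: leverage raises default risk, profitability lowers it.
beta = np.array([2.5, -4.0])  # weights on [debt/assets, ROA]

p_risky = default_probability([0.9, -0.05], beta, beta0=-2.0)  # levered, loss-making
p_safe = default_probability([0.2, 0.10], beta, beta0=-2.0)    # low leverage, profitable
```

Unlike a discriminant score, the output is a probability for the next period, and the significance of each coefficient can be tested directly.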

Hazard Models
The third generation of corporate default predictions includes studies using hazard models. Shumway [23] uses a duration analysis with a hazard model and shows that this approach predicts corporate defaults better than traditional single-period models do. The hazard model is also referred to as survival analysis and can be used to calculate the probability of a corporate default over time. This default prediction methodology uses Cox's [24] hazard regression model by defining a corporation's status as either normal (= 0) or in default (= 1). The corporation's status is no longer observed once a default event occurs. Assuming that T is the time at which a company defaults, the company's survival function at time t can be expressed as follows:

$$S(t) = P(T > t).$$

The hazard function, λ(t), indicates the instantaneous failure rate at time t and is defined as follows:

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}.$$

The Cox proportional hazard model defines the instantaneous failure rate as being proportional to the unspecified baseline hazard rate, λ_0(t), as follows:

$$\lambda(t \mid X_n) = \lambda_0(t) \exp(\beta' X_n),$$

where β is a column vector composed of regression coefficients, and X_n is the set of explanatory variables for the nth company. The Cox model is a semi-parametric model consisting of a non-parametric baseline risk rate λ_0(t) and a parametric factor exp(β'X_n). The partial likelihood function for the regression coefficient β is calculated as follows:

$$L(\beta) = \prod_{i=1}^{N} \left[ \frac{\exp(\beta' X_i)}{\sum_{j=1}^{N} Y_{ij} \exp(\beta' X_j)} \right]^{\delta_i},$$

where Y_ij is an indicator variable equal to one if t_j ≥ t_i and zero otherwise, and δ_i is an indicator variable equal to one when an observation is not censored and zero otherwise. The parameters are estimated such that the partial likelihood function is maximized. The hazard model is developed further by many subsequent studies. Chava and Jarrow [25] confirm that the hazard model exhibits superior prediction performance, and they address the importance of industry effects and market variables. Nam et al. [26] extend Shumway's [23] analysis by including time-varying covariates to incorporate macroeconomic dependencies.
Dakovic et al. [27] fit a generalized linear mixed model, including unobserved heterogeneity between industry sectors, into a discrete hazard model, and show that the new model outperforms conventional models with Altman's [7] variables. Duan, Sun, and Wang [28] develop a forward intensity model from the hazard model and predict corporate default probabilities over multiple periods. Traczynski [29] extends the hazard model by taking a Bayesian model-averaging approach and shows that this approach leads to a better prediction performance relative to other typical models. Figlewski et al. [30] evaluate the effect of macroeconomic conditions on corporate default risk using reduced-form Cox intensity models. Tian et al. [31] use a variable selection technique with a discrete hazard model and demonstrate that the variables selected using the least absolute shrinkage and selection operator can improve prediction performance.
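The Cox partial likelihood can be evaluated directly from its definition; the sketch below is a minimal numpy implementation on fabricated toy data (four firms, one censored), not data from any cited study:

```python
import numpy as np

def cox_log_partial_likelihood(beta, times, events, X):
    """Log partial likelihood of the Cox model.

    Each uncensored observation i (delta_i = 1) contributes
    beta'X_i - log( sum_{j: t_j >= t_i} exp(beta'X_j) ),
    where the risk-set sum corresponds to the Y_ij indicator.
    """
    eta = X @ beta
    ll = 0.0
    for i in range(len(times)):
        if events[i] == 1:                    # delta_i = 1: default observed
            risk_set = times >= times[i]      # Y_ij = 1 when t_j >= t_i
            ll += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return ll

# Toy data: one covariate (e.g., leverage); the third firm is censored.
times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 1, 0, 1])
X = np.array([[0.8], [0.5], [0.3], [0.1]])

# With beta = 0 each event contributes -log(risk set size): sizes 4, 3, and 1.
ll0 = cox_log_partial_likelihood(np.zeros(1), times, events, X)
```

In practice, β is estimated by maximizing this function numerically; packaged survival-analysis implementations additionally handle tied event times.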

Corporate Default Prediction Using Machine Learning Techniques
Samuel [32] proposes the concept of machine learning and defines it as "a discipline that gives computers the ability to learn without a clear program". Mitchell [33] further develops this notion of machine learning, saying, "a computer program is said to learn from experience E with respect to some class of tasks T and performance measures P, if its performance at tasks in T, as measured by P, improves with experience." In this regard, a machine learning algorithm for corporate default prediction can be described as a series of processes that improve the default indicator (P) to perform the task of predicting corporate credit risk (T) using actual corporate credit information (E). Research on corporate default prediction using machine learning techniques has been conducted in various ways, especially in the field of computer science, and Barboza et al. [34] argue that machine learning models exhibit better performance in predicting corporate bankruptcy. Table 2 shows three important methodologies and their representative studies. In most cases, a default prediction using machine learning is framed as a classification problem that categorizes a company's status as being in one of two or more states, defined as normal (= 0) and in default (= 1), and calculates the probability that the company is in a specific state. Thus, machine learning algorithms for solving classification problems are primarily used; representative examples include SVMs, decision trees, and artificial neural network algorithms.

Support Vector Machines
The SVM classification algorithm is widely used in various fields, including corporate default predictions. This algorithm uses a separating hyperplane to classify n observations, each of which has p features. Each observation can be classified as having one of two statuses, defined as y_i ∈ {−1, 1}, where y_i represents the status of the ith observation. The SVM algorithm seeks the separating hyperplane with the largest margin, and it allows some misclassifications to avoid the overfitting problem. The optimal separating hyperplane is found by solving the following problem:

$$\max_{\beta_0, \beta_1, \ldots, \beta_p, \varepsilon_1, \ldots, \varepsilon_n} M$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1,$$
$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \varepsilon_i),$$
$$\varepsilon_i \ge 0, \quad \sum_{i=1}^{n} \varepsilon_i \le C,$$

where β_0, β_1, . . . , β_p are hyperplane parameters and ε_1, . . . , ε_n are slack variables that allow for some misclassified observations. M denotes the margin, the distance between the separating hyperplane and the observations. The tuning parameter, C, is greater than zero and determines the limits of the classification errors [35].
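A minimal sketch of such a soft-margin SVM classifier, assuming scikit-learn is available; the firm features ([debt ratio, return on assets]) and labels are fabricated for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Fabricated features: defaulting firms cluster at high leverage / negative ROA.
X_default = rng.normal(loc=[0.9, -0.05], scale=0.05, size=(30, 2))
X_normal = rng.normal(loc=[0.3, 0.08], scale=0.05, size=(30, 2))
X = np.vstack([X_default, X_normal])
y = np.array([1] * 30 + [-1] * 30)  # 1 = default, -1 = normal

# C plays the role of the tuning parameter bounding the slack variables.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

preds = clf.predict([[0.85, -0.02], [0.25, 0.10]])  # near each cluster center
```

Nonlinear kernels (e.g., `kernel="rbf"`) extend the same formulation when the two statuses are not linearly separable.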
Shin et al. [36] apply an SVM to corporate default prediction and show that the SVM performs better than a back-propagation neural network model. Chen [37] compares several default prediction models and claims that the SVM has high accuracy and performs well for short-and long-term default predictions. Liang et al. [38] also show that the SVM yields the best performance when predicting bankruptcy using financial ratios and corporate governance indicators. Lu et al. [39] extend the SVM methodology using hybrid switching particle swarm optimization.

Decision Trees and Random Forests
The decision tree (DT) algorithm is a methodology for solving regression or classification problems by charting decision rules in a tree structure. The DT algorithm divides a feature space, composed of a combination of p explanatory variables X_1, X_2, . . . , X_p, into J non-overlapping regions R_1, R_2, . . . , R_J, and it makes the same prediction for all observations belonging to the same region. The following Gini index measures the quality of a separation:

$$G = \sum_{j=1}^{J} \hat{p}_{mj}(1 - \hat{p}_{mj}),$$

where J is the number of states and $\hat{p}_{mj}$ is the proportion of observations of state j in region R_m [35]. The pruning process, which establishes individual decision trees, is performed using the Gini index. The DT algorithm has the advantage that the model is intuitive and easy to interpret, but it has the limitation that overfitting is likely to occur in the process of dividing the feature space or producing branches, reducing prediction accuracy as a result. The random forest (RF) algorithm is a machine learning algorithm that uses multiple DTs. The RF algorithm chooses a pre-determined number of explanatory variables through randomization when it creates a new DT. Generally, the number of variables selected is given by the square root of p, the total number of explanatory variables. If we denote this number as k, then the RF algorithm generates multiple DTs, each with k randomly selected explanatory variables. For a classification problem, the model's prediction is the most commonly predicted result across the DTs. Olson et al. [40] argue that the DT algorithm is more understandable and accurate than other machine learning algorithms. Tsai et al. [41] compare DT ensembles with SVMs and multilayer perceptron neural networks and claim that DT ensembles perform the best. Zięba et al. [42] extend the DT algorithms by using extreme gradient boosting and synthetic feature generation.
Notably, du Jardin [43] proposes a corporate default prediction model with ensembles of Kohonen maps and argues that this approach is highly efficient.
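The Gini index and the k = √p feature-subsampling rule can be sketched as follows; the data are fabricated, and scikit-learn's `max_features="sqrt"` option is assumed to implement the √p rule:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gini_index(labels):
    """G = sum_j p_mj * (1 - p_mj) over the states present in one region R_m."""
    labels = np.asarray(labels)
    props = np.array([np.mean(labels == j) for j in np.unique(labels)])
    return float(np.sum(props * (1.0 - props)))

# A pure region scores G = 0; a 50/50 mixed region scores G = 0.5.

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 9))            # p = 9 features -> k = 3 per split
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # fabricated default indicator

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=1)
forest.fit(X, y)  # majority vote across the 200 trees gives the prediction
```

Averaging over many randomized trees is what counteracts the single tree's overfitting problem described above.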

Artificial Neural Networks
An artificial neural network is a machine learning algorithm devised according to the process by which real human brains operate. This algorithm solves complex problems by mimicking the structure of the brain and connecting artificial neurons using simple structures. Neurons, which are the basic units making up the human brain and spinal cord, are responsible for transmitting the signals that they receive to other connected neurons. Neurons transmit signals to other neurons only when the intensity of the received signal is above a certain threshold.
Artificial neurons simulate the roles of these actual neurons through mathematical models. Each artificial neuron receives multiple signals, x_1, x_2, . . . , x_j, composed of zeroes and ones, and it calculates the weighted sum of the signals that it receives according to their weights, w_1, w_2, . . . , w_j. Depending on the model, the received signals may instead belong to the set (−∞, ∞) or the set (0, ∞). A signal is then transmitted to the next artificial neuron only when the weighted sum of the received signals is above a certain intensity or threshold. The weights and thresholds of each neuron are determined by the combination that leads to the best results based on past experience or data. An artificial neuron with threshold θ can be expressed as follows:

$$y = \begin{cases} 1 & \text{if } \sum_{j} w_j x_j \ge \theta, \\ 0 & \text{otherwise.} \end{cases}$$

An artificial neural network is a machine learning methodology that can solve complex problems using a combination of simple artificial neurons. The network is composed of connections among multiple layers of artificial neurons; the layers are divided into an input layer, an output layer, and hidden layers between them. The process of training or optimizing the network involves determining the weights and thresholds for each artificial neuron to obtain the best results and therefore requires strong computational power.
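The threshold unit described above fits in a few lines; the AND-gate weights below are a standard textbook illustration, not taken from the cited works:

```python
import numpy as np

def artificial_neuron(x, w, threshold):
    """Fire (output 1) only when the weighted input sum reaches the threshold."""
    return 1 if np.dot(w, x) >= threshold else 0

# With w = [1, 1] and threshold 1.5, the neuron computes a logical AND:
# it fires only when both input signals are 1.
w, theta = np.array([1.0, 1.0]), 1.5
outputs = [artificial_neuron(x, w, theta)
           for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
# outputs == [0, 0, 0, 1]
```

A network layers many such units, and training searches for the weights and thresholds jointly, which is where the computational cost arises.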
Among the various possible structures of artificial neural networks, a model with several hidden layers is called a deep neural network. The term "deep learning" refers to the use of such deep neural networks for machine learning. This approach is intended to solve the problem of local minima that arises in the earlier artificial neural network methodology. It is characterized by continuously learning from the data to improve the network's problem-solving ability.
Default predictions using artificial neural networks date back to the 1990s. Yang et al. [44] explore several algorithms and show that Fisher discriminant analysis and probabilistic neural networks have the best prediction performance. Recently, following the pioneering work of Hinton, Osindero, and Teh [45], artificial neural networks have re-emerged as a technique for corporate default predictions. Falavigna [46] predicts the default risks of small Italian companies with insufficient accounting information using an artificial neural network algorithm. López Iturriaga and Sanz [47] estimate and visualize banks' default risks by combining multilayer perceptrons and self-organizing maps. Azayite and Achchab [48] improve a neural network's default prediction model by incorporating discriminant variables. Geng et al. [49] show that the neural network approach performs better than other classifier algorithms. In addition, many studies have attempted to improve the prediction performance by revising the neural network algorithm [50][51][52].
In recent years, deep learning has used convolutional neural network (CNN) and recurrent neural network (RNN) algorithms to solve the overfitting problems that arise in the learning process and to improve performance. CNN algorithms divide the information used for learning into multiple domains and analyze the highly relevant domains using limited information; thus, these algorithms perform well in image recognition and are widely used in that field. RNN algorithms process and analyze data sequentially rather than independently, and they are used to predict time series data and to recognize the context of text data. Currently, deep learning is used to understand the context of information and to perform various tasks, such as pattern recognition, natural language processing, and autonomous driving.

Other Studies
Because a corporate default is a rare event, the training datasets are typically highly imbalanced. Quantitative models that use the full sample may not be appropriate because they may result in biased predictions [53,54]. To overcome this problem, Zhou [55] compares several sampling techniques (random oversampling with replication, the synthetic minority oversampling technique, random undersampling, and undersampling based on clustering from the nearest neighbor) across several machine learning algorithms (i.e., discriminant analysis, logistic regression, artificial neural networks, and SVM). The results show that random undersampling and SVM perform well in most cases, but the number of defaults in the training dataset can impact which methodology is most effective. Similarly, Veganzones and Séverin [56] also show that an imbalanced dataset can disturb prediction performance and that the SVM method is less sensitive than other methodologies. Kim et al. [57] suggest using the optimization of cluster-based undersampling to solve the imbalance problem. Piri et al. [58] use a synthetic informative minority oversampling algorithm to enhance SVM performance with an imbalanced dataset. Tian et al. [59] claim that different sampling techniques are required depending on the purpose of the study. Song and Peng [60] suggest a multi-criteria decision making-based approach.
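Random undersampling, one of the techniques compared by Zhou [55], can be sketched in numpy; the 95/5 class split below is fabricated to mimic the rarity of defaults:

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Drop majority-class rows at random until all classes are equally sized."""
    rng = rng or np.random.default_rng(0)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Imbalanced toy sample: 95 normal firms (0), 5 defaults (1).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = random_undersample(X, y)  # 5 observations of each class remain
```

Discarding majority-class data loses information, which is why the oversampling and hybrid techniques above are compared against it.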

Discussion
In this study, we investigate previous studies on corporate default prediction and categorize them from the perspectives of financial engineering and machine learning. The findings are as follows. First, in much of the machine learning literature, corporate default predictions are treated as classification problems. In contrast, structural and hazard models in the field of financial engineering understand and analyze corporate defaults as sequences of events over time. These approaches help us to understand how time-varying changes in the variables affect a company's default risk. Second, macroeconomic factors are rarely used, or are often neglected, in corporate bankruptcy prediction studies using machine learning methodologies. Of course, corporate defaults are mainly affected by the financial conditions of individual companies, but macroeconomic conditions also have an important effect. Corporate default predictions using classification methodologies require the assumption that individual events are independent. However, when macroeconomic conditions change, as during a financial crisis, corporate defaults are no longer independent events. Third, in many cases, the machine learning approaches show superior corporate default forecasting performance relative to the financial engineering approaches. However, machine learning methodologies do not provide a meaningful answer regarding the cause of a corporate default. While such forecasts can be useful for investors or credit rating agencies, they do not help executives who want to improve their businesses.
Our findings also suggest some new tasks in the field of machine learning when predicting corporate defaults. First, it is important to keep in mind that a corporate default prediction model is a multi-period model in which the future is affected by past decisions. Nevertheless, many machine learning models use single-period financial statements to avoid complexity. Corporate defaults are generally not caused by one excessive loss or huge debt; rather, inadequate decision making over several periods often results in corporate default. Because context is important for corporate default predictions, multi-period models are more appropriate for explaining corporate defaults; the RNN methodology can be a good example. In addition, it is necessary to use financial statement items from multiple periods at the same time. Second, the stock price and the corporate value evaluated in the stock market are important factors. However, the daily updated stock price is rarely used in machine learning approaches because its data cycle is inconsistent with that of financial statements. Even when the stock price is used, only the price at the time of the financial statements is incorporated. However, many financial studies confirm that a company's stock price is an important explanatory variable for predicting corporate default. Finally, a corporate default prediction model should be able to suggest the cause of default. Predicting a corporate default is not simply a cats-or-dogs classification problem. Corporate executives and government officials can obtain a large amount of information from corporate default prediction models. However, if the model is a black box that cannot identify the issues leading to a default prediction, the model's usefulness is limited. Corporate default forecasts should not stop at classification but rather should offer clues on how to avoid defaults.
This study has several limitations. We do not cover all topics related to corporate default predictions. Explanatory variables that can affect corporate default are not considered. Variable selection techniques that can improve forecasting performances are also not covered. While we classified the related literature into statistical and machine learning approaches, we do not cover the entirety of the literature but present representative studies that are important in each category. Through this concentration, we derive meaningful findings and insights associated with this topic.

Conclusions
This review paper investigates the progress of research related to corporate default predictions and examines the main research methodologies by classifying them as either statistical models or machine learning algorithms. In addition, it forecasts the financial sector innovation brought about by the development of machine learning. In doing so, this study aims to provide clues regarding the future convergence of research in the management science and computer science fields and to lay the foundation for expanding financial engineering methodologies to predict corporate default risks.
Our findings and suggestions for corporate default predictions are even more meaningful at this particular time. Owing to a series of technological developments referred to as the Fourth Industrial Revolution, the need to apply new methodologies to the field of financial engineering, including corporate default forecasting, is emerging. In particular, the big data analysis methodologies presented in this study suggest the need for enterprise-wide data governance to diagnose business conditions and help investors make accurate decisions. Not just large companies but also small and medium-sized enterprises now need to lay the foundation to easily and rapidly introduce new technologies by accurately identifying the type, size, and frequency of management data and continuously implementing quality management. The increase in the value of utilizing big data associated with the development of machine learning methodologies is accompanied by the need to strengthen the protection of personal information. Although de-identification measures are taken to protect personal information in the process of using big data, personal information has been leaked, in some cases through re-identification by combining big data with other information disclosed on the Internet. It is necessary to actively utilize the latest technologies, such as homomorphic encryption, to reduce the risk of information leakage and to perform big data analysis and processing without disclosing personal information.
Corporate default predictions using machine learning demand attention because the calculation process used to generate predictions may be a black box, depending on the algorithm. Thus, although such methodologies can be used to calculate corporate default risk, they face the limitation that they cannot provide strategies for improving a company's management to reduce default risk. When performing corporate default forecasting, it is therefore necessary to select a methodology that can provide information suited to the purpose of the prediction, which requires a detailed understanding of the appropriate use of each methodology.