Machine Learning for Bankruptcy Prediction in the American Stock Market: Dataset and Benchmarks

Abstract: Predicting corporate bankruptcy is one of the fundamental tasks in credit risk assessment. In particular, since the 2007/2008 financial crisis, it has become a priority for most financial institutions, practitioners, and academics. The recent advancements in machine learning (ML) have enabled the development of several models for bankruptcy prediction. The most challenging aspect of this task is dealing with the class imbalance due to the rarity of bankruptcy events in the real economy. Furthermore, a fair comparison in the literature is difficult to make because bankruptcy datasets are not publicly available and because studies often restrict their datasets to specific economic sectors, markets, and/or time periods. In this work, we investigated the design and the application of different ML models to two different tasks related to default events: (a) estimating survival probabilities over time; (b) default prediction using time-series accounting data of different lengths. The entire dataset used for the experiments has been made available to the scientific community for further research and benchmarking purposes. The dataset pertains to 8262 different public companies listed on the American stock market between 1999 and 2018. Finally, in light of the results obtained, we critically discuss the most interesting metrics as proposed benchmarks for future studies.


Introduction
Since the 2007/2008 financial crisis, most financial institutions, lenders, and academics have become interested in predicting corporate bankruptcy. Usually, corporate bankruptcy costs spread to the whole economy, resulting in cascade effects that impact many companies [1,2].
Although different research works have already demonstrated the ability of machine learning (ML) to assess the likelihood of companies' default, making a fair comparison among all the approaches proposed in the literature remains challenging for several reasons: (a) most of the datasets are not publicly available or are only related to specific economic scenarios, such as private companies in different countries [3,4]. For private companies, little information is generally available, which makes it difficult to exploit other sources of information that may improve bankruptcy prediction performance (e.g., textual disclosures [5], annual reports [6], stock market data) and that can be used by more complex models; (b) bankruptcy prediction actually involves different tasks: the default prediction task, which predicts the company status for the next year using past data, and the survival probability prediction task, which aims to predict the probability that a company will face financial distress within k years. Most datasets do not permit performing both tasks, and this is a clear limitation to the development of intelligent models that aim to generalize; (c) bankruptcy prediction models are usually trained on imbalanced data including few examples of the bankruptcy class, and there is still no generally accepted metric to assess bankruptcy prediction performance with machine learning. Indeed, prediction accuracy can be misleading since it gives the same cost to false positives and false negatives.
We collected a dataset for bankruptcy prediction considering 8262 different companies in the American stock market in the time interval between 1999 and 2018. The main contributions of this work are the following:
1.
The dataset has been made public (https://github.com/sowide/bankruptcy_dataset, accessed on 14 August 2022) for the scientific community for further investigations as a benchmark, and it can thus be easily enriched with data coming from other sources pertaining to the same companies. Our dataset faithfully follows the FAIR Data Principles [7]: (a) Findable: our data is indexed in a searchable resource and has a unique and persistent identifier. (b) Accessible: our data is understandable to humans and machines and is deposited in a trusted repository. (c) Interoperable: we used a formal, accessible, shared, and broadly applicable language for knowledge representation. (d) Reusable: we provide accurate information on provenance and clear usage licenses.

2.
We investigated two different bankruptcy prediction tasks. The first task (T1) is the default prediction task, which aims to predict the company status in the next year using time series of accounting data or just the last available fiscal year. The second task (T2) is the survival probability prediction task, in which the model tries to predict the company status over the next k years.
3.
In light of the results achieved, we critically discuss the most interesting metrics as proposed benchmarks for future studies, such as: Area Under the Curve (AUC), precision, recall (sensitivity), type I error, type II error, and the macro- and micro-F1 scores for each class.
The paper is organized as follows: In Section 2, we provide an overview of the state-of-the-art approaches for bankruptcy prediction. In Section 3, we describe in detail the dataset used for this study. In Section 4, we review and describe all the machine-learning algorithms used in our experiments. In Section 5, we introduce the metrics used for the imbalanced scenario encountered in this study. In Section 6, we describe the first task evaluated in this work, concerning the prediction of a company's health status based only on data from previous years. In Section 7, we present our second task, in which we performed a survival probability prediction on the companies within the dataset. In Section 8, we show and describe the experimental results, and finally, in Section 9, we summarize the results with a critical discussion of the metrics.

Related Works
The recent advancements in machine learning (ML) have led to new, innovative, and functional approaches [8,9]. Moreover, they have enabled the development of intelligent models that try to assess the likelihood of companies' default by looking for relationships among different types of financial data and the financial status of a firm in the future [10][11][12][13][14][15][16][17]. Different ML algorithms and techniques, such as Support Vector Machines (SVM) [18], boosting techniques [19], discriminant analysis [20], and Neural Networks [5], have been used in the literature for this task. Moreover, different architectures have been evaluated to identify effective decision boundaries for this binary classification problem, such as the least absolute shrinkage and selection operator [21], the dynamic slacks-based model [22], and two-stage classification [13]. However, although default prediction models have been studied for decades, several issues remain. Interestingly, some new issues have even been introduced with the recently increased exploitation of machine-learning models. Indeed, since the Z-Score model was proposed by Altman in 1968 [23], research has mainly focused on accounting-based ratios as markers to detect and understand whether a firm is likely to face financial difficulties, such as bankruptcy. Scoring-based models use discriminant analysis to provide ordinal rankings of default risk but are often computed from small datasets using statistical and probabilistic models that focus more on explainability and interpretability but lack generalization over time and across different sectors [24]. Other examples are the Kralicek quick test [25] and the Taffler model [26].
In [27], a step towards modern machine learning was made by introducing a binary response model that uses explanatory variables and applies a logistic function for bankruptcy prediction [28]. However, the main goal of this class of models is not to identify a decision boundary in the feature space but only to make a decision based on an output threshold that was statistically significant in the past for the specific sector. For example, Altman suggested two thresholds, 1.81 and 2.99.
Specifically, an Altman's Z-score above the 2.99 threshold means that firms are not expected to default in the next two years, below 1.81 that they are expected to default, while the interval between the two thresholds is named the "zone of ignorance" where no clear decision can be taken. However, even though many practitioners use this threshold, in Altman's view, this is an unfortunate practice since over the past 50 years, credit-worthiness dynamics and trends have changed so dramatically that the original zone cutoffs are no longer relevant [24].
Moreover, we still lack a definite theory for the bankruptcy prediction task [18,29] and in particular, a generally accepted performance metric is missing along with a formal theoretical framework. As a consequence, the most common methodology in bankruptcy prediction tasks is identifying discriminant features using a trial and error approach with various accounting-based ratios [15,16].
Machine-learning models usually need large datasets to be trained and suffer when class imbalance is strong as in bankruptcy, since default events are quite rare. Learning from imbalanced data requires dealing with several challenges, especially when the most important class that should be recognized is exactly the one that is least represented in the dataset. This issue is strongly related to the lack of a general performance metric.
Machine-learning techniques like ensemble methods were first explored for default prediction by Nanni et al. [30], who showed much better performance for the ensembles compared to standalone classifiers; their results were later confirmed by Kim et al. [31]. Wang et al. further analyzed the performance of ensemble models, finding that bagging outperformed boosting for all credit databases in terms of average accuracy, as well as type I and type II error [32]. In [33], Barboza et al. showed that, on average, machine-learning models exhibit 10% more accuracy than scoring-based ones. Specifically, in that study, Support Vector Machines (SVM), Random Forests (RF), as well as bagging and boosting techniques, were tested for predicting bankruptcy events and were compared with results from discriminant analysis, Logistic Regression, and Neural Networks. The authors found that bagging, boosting, and RF outperformed all other models. However, since the dataset has not been released and the models' hyper-parameters are not reported, it is difficult to replicate such results and understand whether the performance improves because of the quality of the models or because the authors took some other financial variables into account as features.
Considering that the comparisons of the models are still inconclusive, new studies exploring different models, contexts, and datasets are relevant. A firm's failure is likely to be caused by difficulties experienced over time and not just the previous year. In light of this, the dynamic behavior of firms should be considered in a theoretical framework for bankruptcy prediction such as growth measures and changes in some variables [33]. Several research works investigated this aspect, but the results are again inconclusive and often not reproducible. In [28], the authors show that firms exhibit failure tendencies as much as five years prior to the actual event. On the other hand, Mossman et al. [34] pointed out that the models are only capable of predicting bankruptcy two years prior to the event, which improves to three years if used for multiperiod corporate default prediction [35]. In most studies, ratios are analyzed backward in time starting with the bankruptcy event and going back until the model becomes unreliable or inaccurate. Moreover, most of the bankruptcy prediction models in the literature do not take advantage of the sequential nature of the financial data. This lack of multi-period models is also emphasized in a literature review by Kim et al. [36].

Dataset
Since most of the bankruptcy models are evaluated on private datasets or small publicly available ones, we provide a novel dataset for bankruptcy prediction related to the public companies in the American stock market (New York Stock Exchange and NASDAQ). We collected accounting data from 8262 different companies in the period between 1999 and 2018. The stock market is dynamic, with new companies becoming public every year, changing ownership and names, or being removed or suspended from the market as a result of acquisitions or regulatory action. For this reason, we consider the same companies used in [6,37], since this set of firms has proven to be a fair approximation of the American stock market for each year in that time interval. According to the Securities and Exchange Commission (SEC), a company in the American market is considered bankrupt in two cases:

•	If the firm's management files Chapter 11 of the Bankruptcy Code to "reorganize" its business: management continues to run the day-to-day business operations, but all significant business decisions must be approved by a bankruptcy court.
•	If the firm's management files Chapter 7 of the Bankruptcy Code: the company stops all operations and goes completely out of business.
When one of these events occurs, we label the fiscal year preceding the chapter filing as "bankruptcy" (1) for the next year. Otherwise, the company is considered healthy (0). In light of this, our dataset enables learning how to predict bankruptcy at least one year before it happens; as a consequence, it is possible to deal with the default prediction task using time series and also with the survival probability task looking ahead. Figure 1 shows the percentage of companies' defaults for each year in the dataset. This value may be underestimated due to the exclusion of some companies in the past because of their small market capitalization, but it appears to agree with the literature, which usually reports that fewer than 1% of the available firms in the market are likely to default every year under normal conditions. However, in some periods, bankruptcy rates have been higher than usual, for example during the Dot-com Bubble in the early 2000s and the Great Recession in 2007-2008. Our dataset distribution reflects this condition; see Table 1.
For all the companies and for each year, we selected 18 accounting and financial variables. Features were selected according to the most frequently used ratios and accounting information referred to in the literature [21,23,38]. The dataset has no missing values and no synthetic or imputed values. Finally, the resulting dataset of 78,682 firm-year observations is divided into three subsets according to the time period: a training set, a validation set, and a test set. We used data from 1999 until 2011 for training, data from 2012 until 2014 for validation and model comparison, and the remaining years from 2015 to 2018 as a test set to prove the ability of the models to predict bankruptcy in real, previously unseen cases and macroeconomic conditions. Table 2 shows the full list of the 18 features available in the dataset and their description.
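As a minimal sketch of this temporal split, using a tiny synthetic table (the column names here are illustrative assumptions, not the dataset's actual schema), the three subsets can be derived from the fiscal-year column:

```python
import pandas as pd

# Toy firm-year observations; "company", "year", and "label" are hypothetical names
df = pd.DataFrame({
    "company": ["A", "A", "B", "B", "C"],
    "year": [2005, 2013, 2010, 2016, 2017],
    "label": [0, 0, 0, 1, 0],
})

train = df[df["year"] <= 2011]               # 1999-2011: training
valid = df[df["year"].between(2012, 2014)]   # 2012-2014: validation / model comparison
test = df[df["year"] >= 2015]                # 2015-2018: held-out test set

# Every observation falls in exactly one subset
assert len(train) + len(valid) + len(test) == len(df)
```

Splitting by year rather than at random preserves the temporal ordering, so models are always evaluated on economic conditions they have never seen.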

X1 (Current assets): All the assets of a company that are expected to be sold or used as a result of standard business operations over the next year.
X2 (Cost of goods sold): The total amount a company paid as a cost directly related to the sale of products.
X3 (Depreciation and amortization): Depreciation refers to the loss of value of a tangible fixed asset over time (such as property, machinery, buildings, and plant); amortization refers to the loss of value of intangible assets over time.
X4 (EBITDA): Earnings before interest, taxes, depreciation, and amortization; a measure of a company's overall financial performance, used as an alternative to net income.
X5 (Inventory): The accounting of items and raw materials that a company either uses in production or sells.
X6 (Net Income): The overall profitability of a company after all expenses and costs have been deducted from total revenue.
X7 (Total Receivables): The balance of money due to a firm for goods or services delivered or used but not yet paid for by customers.
X8 (Market value): The price of an asset in a marketplace. In our dataset, it refers to the market capitalization, since the companies are publicly traded in the stock market.
X9 (Net sales): The sum of a company's gross sales minus its returns, allowances, and discounts.
X10 (Total assets): All the assets, or items of value, a business owns.
X11 (Total Long-term debt): A company's loans and other liabilities that will not become due within one year of the balance sheet date.
X13 (Gross Profit): The profit a business makes after subtracting all the costs related to manufacturing and selling its products or services.
X14 (Total Current Liabilities): The sum of accounts payable, accrued liabilities, and taxes such as bonds payable at the end of the year, salaries, and commissions remaining.
X15 (Retained Earnings): The amount of profit a company has left over after paying all of its direct costs, indirect costs, income taxes, and dividends to shareholders.
X16 (Total Revenue): The amount of income that a business has made from all sales before subtracting expenses. It may include interest and dividends from investments.
X17 (Total Liabilities): The combined debts and obligations that the company owes to outside parties.
X18 (Total Operating Expenses): The expenses a business incurs through its normal business operations.

In light of this, the dataset can be used to build and validate different ML models for both of the two main bankruptcy prediction tasks we present in this research work. Moreover, since the dataset has a temporal dimension, several time-series analysis techniques can be exploited, as well as unsupervised methodologies.

Machine-Learning Models
In this section, we briefly review and describe all the machine-learning algorithms we used for the experiments described in the next sections.

Support Vector Machine
Support Vector Machine is one of the oldest ML algorithms and aims to identify the decision boundary as the maximum-margin hyperplane separating two classes. The hyperplane equation is given by Equation (1):

f(x) = ω^T x + b = 0, (1)

where ω is the normal vector and b the bias. Allowing a slack ξ_i for each training example, the objective function of the SVM can be expressed as:

min (1/2)‖ω‖² + C Σ_i ξ_i,

where ξ_i is the deviation between f(x_i) and the target y_i.
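As an illustrative sketch on toy, linearly separable data (not the paper's setup), a linear SVM in Scikit-Learn exposes exactly these hyperplane parameters ω and b:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy classes in 2D
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.2, 1.9]])
y = np.array([0, 0, 1, 1])

# C controls the trade-off between margin width and slack penalties
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w = clf.coef_[0]       # normal vector ω of the separating hyperplane
b = clf.intercept_[0]  # bias term b
```

A point's predicted class depends on which side of the hyperplane ω·x + b = 0 it falls.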

Random Forest
Random Forest is an ensemble learning algorithm developed by Breiman [39]. Ensemble learning is a way to combine different basic classifiers ("weak" classifiers) to compose a new one (a strong learner) that is more complex, more efficient, and more precise. The weak classifiers should make independent errors in their predictions; thus, a strong classifier can be composed of different algorithms or, if the same algorithm is used, of models trained on different subsets of the training set.
Random Forest is an ensemble bagging tree-based learning algorithm. In particular, the Random Forest Classifier is a set of decision trees that are trained using randomly selected subsets of the training set and randomly selected subsets of features.
A Random Forest Classifier is composed of a collection of classification trees (estimators):

{h(x, Θ_k), k = 1, . . . , K},

where the Θ_k represent identically and independently distributed random vectors, and each tree casts a unit vote for the most likely class at input x. Each tree in the collection votes (only once) to assign the sample to a class, considering the feature vector x. The final choice is to attribute the example to the class that obtained the majority of votes. A graphic representation of the algorithm is presented in Figure 2.
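A minimal sketch of the idea, on synthetic imbalanced data rather than the paper's dataset: each tree is trained on a bootstrap sample of the rows and a random subset of the features, and the forest aggregates the votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: 18 features, ~10% positive class
X, y = make_classification(n_samples=200, n_features=18,
                           weights=[0.9, 0.1], random_state=0)

# 500 estimators, mirroring the configuration used in our experiments
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# predict_proba returns the fraction of trees voting for each class
probs = rf.predict_proba(X)[:, 1]
```

The averaged votes in `predict_proba` are also what the ROC/AUC analysis in the following sections is computed from.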

Boosting Algorithms
Boosting is a subset of ensemble methods where a collection of models are trained sequentially to permit every model to improve and compensate for the weakness of its predecessor.
Boosting algorithms differ in how they create and aggregate weak learners during the sequential stacking process. In our work, we used various boosting algorithms:
•	AdaBoost [40]: This was the first boosting algorithm developed for classification and regression tasks. It fits a sequence of weak learners on differently weighted training data, giving incorrect predictions more weight in subsequent iterations and correct predictions less weight. In this way, it forces the algorithm to "focus" on observations that are harder to predict. The final prediction comes from a weighted majority vote or sum. The algorithm begins by forecasting on the original dataset, giving the same weight to each observation; if the first "weak" learner predicts an observation incorrectly, the algorithm gives that observation a higher weight. This procedure is iterated until the model reaches a predefined accuracy. AdaBoost is typically easy to use because it does not need complex parameter tuning and shows low sensitivity to overfitting; moreover, it is able to learn incrementally from a small set of features. However, AdaBoost is sensitive to noisy data and abnormal values.
•	Gradient Boosting [41]: This algorithm uses a set of weak predictive models, typically decision trees. Gradient Boosting trains many models sequentially, which are then composed using the additive modeling property. In each training epoch, a new learner is added to improve on the previous one, and each model gradually minimizes the whole system's loss function using the Gradient Descent algorithm.
•	XGBoost (Extreme Gradient Boosting) [42]: XGBoost is an optimized distributed Gradient Boosting library designed to be highly efficient, flexible, and portable. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function, based on the difference between the predicted and target outputs, and a penalty term for model complexity. Training proceeds by adding new trees that predict the residuals or errors of prior trees, which are then combined with the previous trees to make the final prediction (4):

F_i(X) = F_{i−1}(X) + α_i h_i(X, r_{i−1}), (4)

where α_i and r_i are the regularization parameters and residuals computed with the i-th tree, respectively, and h_i is a function trained to predict the residuals r_i using X for the i-th tree. α_i is computed using the residuals r_i by solving the following optimization problem:

α_i = argmin_α Σ_j L(y_j, F_{i−1}(x_j) + α h_i(x_j, r_{i−1})),

where L(Y, F(X)) is a differentiable loss function.
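The additive update of Equation (4) can be observed with Scikit-Learn's GradientBoostingClassifier, whose staged predictions trace F_i(X) after each added tree. This is a toy sketch, not the configuration used in our experiments:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# 50 sequential trees; each new tree fits the residuals of the current ensemble
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                random_state=1)
gb.fit(X, y)

# Training accuracy of the partial ensembles F_1, F_2, ..., F_50
stage_acc = [np.mean(pred == y) for pred in gb.staged_predict(X)]
```

Each element of `stage_acc` corresponds to the ensemble after one more additive update, so the trace shows the sequential error compensation that boosting relies on.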

Logistic Regression
Binary Logistic Regression models the relationship between a set of independent variables and a binary dependent variable. The goal is to find the best model that describes the relationship between the dependent variable and multiple independent variables.
The Logistic Regression's dependent variable could be binary or categorical and the independent ones could be a mixture of continuous, categorical, and binary variables.
The general form of Logistic Regression is as follows:

p = 1 / (1 + e^(−z)), with z = b_1 x_1 + b_2 x_2 + . . . + b_m x_m,

where x_1, x_2, . . . , x_m is the feature vector and z is a linear combination function of the features. The parameters b_1, b_2, . . . , b_m are the regression coefficients to be estimated. The output is between 0 and 1, and, usually, if the output is above the threshold of 0.5, the model predicts class 1 (positive) and otherwise class 0 (negative).
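A minimal numeric sketch of this computation, with illustrative coefficients (not estimated from any data):

```python
import numpy as np

def logistic(z):
    """Map the linear combination z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients for a 2-feature model
b0 = -1.0
b = np.array([2.0, -0.5])
x = np.array([1.0, 0.4])

z = b0 + b @ x        # linear combination of the features: z = 0.8
p = logistic(z)       # probability of the positive class, ~0.69
pred = int(p > 0.5)   # threshold at 0.5 -> class 1
```

During training, the coefficients are chosen to maximize the likelihood of the observed labels; here they are fixed by hand purely to show the thresholding step.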

Artificial Neural Network
An Artificial Neural Network (ANN) is a non-linear function approximator. It consists of an input layer of neurons, an unspecified number of hidden layers, and a final output layer. Each neuron performs a weighted sum of its inputs and then applies an activation function that determines its output. When the activation function is a Sigmoid function, a single neuron works as a Logistic Regression without the classification threshold. Figure 3 shows the general structure of an ANN with the input layer, two hidden layers, and the final output layer, whose structure strongly depends on the task it should perform. The most common architecture is the feed-forward ANN, where each neuron is linked to every neuron in the next layer without any intra-layer connection among neurons belonging to the same layer. Each layer can be seen as a partial approximation of the final function. Each connection has a weight ω that is randomly initialized at the beginning. The output h_i of a neuron i in the hidden layer is:

h_i = σ( Σ_{j=1}^{N} ω_ij x_j + b_i ),

where σ(·) is the activation function, N the number of input neurons of the layer, ω_ij the weights, x_j the inputs of the neuron, and b_i the bias term of the hidden neuron. The goal of the activation function is to bound the value of the neuron so that the Neural Network does not get stuck because of divergent neurons. Weight estimation for each connection is the main goal of the ANN's training. This step is usually performed using the backpropagation algorithm [43], which minimizes an objective function that measures the distance between the desired and actual output of the network. Inputs and outputs of a Neural Network can be binary or even symbols when the data are appropriately encoded. This feature confers a wide range of applicability to Neural Networks.
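A minimal forward-pass sketch with random, untrained weights; the 18-input/9-hidden shape mirrors the 18 accounting features of our dataset, but the weights themselves are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    """Rectified Linear Unit activation for the hidden layer."""
    return np.maximum(0.0, v)

def sigmoid(v):
    """Sigmoid activation bounding the output neuron to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

x = rng.normal(size=18)                            # one sample of 18 features
W1, b1 = rng.normal(size=(9, 18)), np.zeros(9)     # hidden layer weights/biases
W2, b2 = rng.normal(size=(1, 9)), np.zeros(1)      # output layer weights/biases

h = relu(W1 @ x + b1)            # h_i = σ(Σ_j ω_ij x_j + b_i) with ReLU as σ
out = sigmoid(W2 @ h + b2)[0]    # Sigmoid output, read as a class probability
```

Training would adjust W1, b1, W2, b2 via backpropagation; this sketch only shows how an input vector flows through the weighted sums and activations.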

Metrics for Imbalanced Bankruptcy Prediction Tasks
The two bankruptcy prediction tasks we propose in the following sections have been implemented as binary prediction tasks where the positive class (1) indicates bankruptcy, while the negative class (0) means that a company has been classified as healthy. To compare our models and investigate their effectiveness, we used different metrics, considering the imbalance of the validation and test sets. These metrics will be critically discussed in Section 8 in light of our results for the two tasks. According to the binary classification formulation of the bankruptcy prediction problem, the following variables represent: true positives (TP), the bankrupt companies correctly classified as bankrupt; true negatives (TN), the healthy companies correctly classified as healthy; false positives (FP), the healthy companies incorrectly classified as bankrupt; and false negatives (FN), the bankrupt companies incorrectly classified as healthy. Since the validation and test sets are both imbalanced for both tasks, with a prevalence of healthy companies (∼96-97%), we did not compare the models in terms of accuracy. Indeed, the proportion of correct matches would be insufficient in assessing the model performance. We first use the Area Under the Curve (AUC) for all the comparisons, as it is commonly used in the literature to compare the performance of models on imbalanced datasets and, specifically, to evaluate bankruptcy models. AUC measures the ability of a classifier to distinguish between classes and is used as a summary of the Receiver Operating Characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
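As a toy sketch of how AUC summarizes ranking quality independently of any threshold, consider six firms scored by a hypothetical classifier:

```python
from sklearn.metrics import roc_auc_score

# Toy labels (1 = bankrupt) and hypothetical classifier scores
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

# AUC = probability that a random bankrupt firm is scored above a
# random healthy firm; here 7 of the 8 bankrupt/healthy pairs are
# ranked correctly, so AUC = 7/8 = 0.875
auc = roc_auc_score(y_true, scores)
```

Because AUC only depends on the ordering of the scores, it is unaffected by the ∼96-97% class imbalance that makes plain accuracy misleading.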
In addition, we investigated other important metrics that can be used to clarify the models' performance depending on the target stakeholders. These are the precision, recall, and F1 scores for each class. The precision achieved for a class is the accuracy of that class' predictions. The recall (sensitivity) is the ratio of the class instances that are correctly detected by the classifier. The F1 score is the harmonic mean of precision and recall: whereas the regular mean treats all values equally, the harmonic one gives more weight to low values. As a consequence, a high F1 score for a certain class is achieved only if both its precision and recall are high. For the positive class, these quantities are computed as:

precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 · (precision · recall) / (precision + recall).

The definitions for the negative class are exactly the same, inverting positives with negatives.
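These per-class quantities follow directly from the confusion counts; a toy sketch for the positive class:

```python
# Toy predictions: 1 = bankrupt (positive class), 0 = healthy
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

precision = TP / (TP + FP)                       # 2/3
recall = TP / (TP + FN)                          # 2/3
f1 = 2 * precision * recall / (precision + recall)
```

Swapping the roles of 0 and 1 in the counts yields the same three metrics for the negative (healthy) class, and averaging the per-class F1 scores gives the macro-F1 discussed next.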
Moreover, we computed and reported two other global metrics for the classifier that have been selected because they enable an overall evaluation of the classifier on both classes: • The macro-F 1 score: The macro-F 1 score is computed as the arithmetic mean of the F 1 score of all the classes. • The micro-F 1 score: It is used to assess the quality of multi-label binary problems. It measures the F 1 score of the aggregated contributions of all classes.
Finally, we used two other metrics that are often evaluated in bankruptcy prediction models. Because bankruptcy is a rare event, using the classification accuracy to measure a model's performance is misleading, since it assumes that type I error (Equation (12)) and type II error (Equation (13)) are equally costly. Actually, an error has a different cost depending on the class that has been incorrectly predicted: for a financial institution, the cost of predicting a company that will default as healthy is much higher than the cost of predicting a healthy company as going into default. In light of this, we explicitly computed and reported type I and type II errors and compared the classifiers focusing, in particular, on the type II error and the recall of the positive class.
Type I error = FP / (FP + TN) (12)

Type II error = FN / (TP + FN) (13)

Task T1: Default Prediction with Historical Time Series
The first task we performed using the dataset is training each ML model presented in Section 4 for default prediction with historical time-series accounting data. Our first step was to perform the most classical task of predicting a company's health status for the next year based solely on data from the previous year. Furthermore, we attempted to answer the open question in the literature regarding the number of years that should be considered in order to maximize the performance of the bankruptcy prediction model. To achieve this, we define a Window Length (WL) variable that refers to the number of fiscal years considered as a temporal window for the prediction. We trained all the models using data between 1999 and 2011, and we made the first comparison using the Validation set (2012-2014) in terms of AUC.
Finally, we report the results on the test set using the best models identified on the validation set. This last step is necessary to verify the ability of the models to generalize. The average number of years available for each company in the dataset is 8. However, we evaluated a WL ranging from one to five years for two main reasons:
•	Five years is generally the maximum number of years found to be useful in the literature.
•	When increasing the WL, the most recent companies are excluded because they do not have enough available data; in general, considering more years leads to smaller training and test sets. This could introduce a statistical bias, forcing the analysis to consider only the more structured and stable companies that have existed for several years while ignoring the relatively new companies with smaller market capitalization, which are riskier and have a higher probability of default, particularly during periods of economic decline.
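A sketch of how WL-year windows can be built per company (the column names are hypothetical; note how a company with fewer than WL fiscal years produces no window and is therefore excluded):

```python
import pandas as pd

# Toy firm-year table with one accounting variable; names are illustrative
df = pd.DataFrame({
    "company": ["A"] * 4 + ["B"] * 2,
    "year": [2000, 2001, 2002, 2003, 2002, 2003],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

WL = 3  # window length: number of consecutive fiscal years per sample
windows = []
for _, g in df.sort_values("year").groupby("company"):
    vals = g["x1"].to_numpy()
    # Slide a WL-year window over the company's history
    for i in range(len(vals) - WL + 1):
        windows.append(vals[i:i + WL])
# Company A (4 years) yields 2 windows; company B (2 years) yields none
```

This is exactly the mechanism that shrinks the training and test sets as WL grows: younger firms simply have no complete window.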

Models Comparison and Selection for Default Prediction
All the models were trained using the same training set (1999-2011) and compared using the same validation set (2012-2014). However, learning from an imbalanced training set led to unsatisfactory results for the bankruptcy class in the validation set. For this reason, we decided to use a random balanced training set that is evaluated for different runs. Indeed, every model is evaluated over 100 independent and different runs: for every run, the training set is balanced with exactly the same number of bankruptcy examples and a random choice of healthy examples from the same period. The number of features changes according to the number of available variables for the temporal window length selected.
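One such balancing run can be sketched as follows (the helper below is a hypothetical illustration of random undersampling, not our exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_training_set(X, y):
    """One run: keep every bankruptcy example, draw an equal number of
    healthy examples at random (undersampling the majority class)."""
    pos = np.where(y == 1)[0]
    neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

# Toy imbalanced set: 3 bankruptcies, 17 healthy firm-years
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 3 + [0] * 17)

Xb, yb = balanced_training_set(X, y)  # 3 positives + 3 sampled negatives
```

Repeating this draw 100 times, each with a fresh random choice of healthy examples, yields the 100 independent runs over which the models are averaged.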
A binary classification task is implemented in each model, where the positive class (1) represents a bankruptcy case in the next year and the negative class (0) represents a healthy case. For RF, AdaBoost, Gradient Boosting, and XGBoost, we used 500 estimators for a fair comparison, while the other model-specific parameters are the default ones provided in the Scikit-Learn implementations. The ANN is a multi-layer perceptron with three layers. In the first layer, the number of neurons is equal to the number of input features. The hidden layer has half as many neurons as the first layer, and, finally, the output layer uses a Sigmoid function to produce a binary prediction. All neurons except those in the output layer use the Rectified Linear Unit (ReLU) activation function. Each Neural Network is trained for up to 1000 epochs using the early-stopping regularization technique, which prevents overfitting by stopping training when the validation loss keeps increasing for a given number of epochs (the patience).
We compared all the models using the average Area Under the Curve (AUC) over 100 runs. Figure 5 reports the results achieved when predicting bankruptcy in the most common setting, which exploits only the accounting variables of the last available fiscal year. On average, the ensemble tree-based ML models perform better than the others, except for AdaBoost. The best average result is achieved by the Random Forest (AUC = 0.748), although the Neural Network achieves the best AUC in a single run (AUC = 0.856). However, the ANN exhibits the highest variability in its results, which leads to a lower average AUC (0.653). Figure 6 reports the comparison among the models when using more than one fiscal year of data (WL from two to five years). From this experiment, we can assert that, on average, increasing the number of fiscal years in the input does not significantly affect the average AUC: the highest average value is reached by Random Forest with a window length of five years. As in the first case, the ensemble models generally exhibit better performance, but the best absolute result is again reached by the Neural Network with AUC = 0.85. It should be highlighted that the Neural Network's high variability can be attributed to the random weight initialization in every run; performing several runs over the same training set would likely reduce this problem. Finally, Table 3 shows the variance for each model for every window length.

Results for Default Prediction
Once we compared the algorithms over 100 runs on the validation set, ranked the average AUC achieved for every temporal window, and selected the best model on the validation set, we performed a final evaluation on the test set (2015-2018). These data were never used for model training or comparison. Evaluating on a test set from a different temporal period is important because it refers to a different economic cycle: since we trained on data until 2011, which include the 2007/2008 sub-prime crisis and the 2010-2011 European debt crisis, the models could have learnt a biased notion of what a company likely to go bankrupt looks like. Evaluating all the models only on the validation set is not enough, since we used that set to select the hyper-parameters. If a model performs well both on the validation set (2012-2014) and on the never-used test set (2015-2018), we can assert that it effectively generalizes with respect to bankrupt companies. The best models on the validation set are, in order, Random Forest, Gradient Boosting, XGBoost, and Logistic Regression. Figure 7 presents the final results we achieved for this task on the American stock market. We also added the results achieved by the best Neural Network model, which actually outperforms all the other machine-learning techniques in terms of AUC. This result appears to contradict the current literature, which reports that bagging and boosting ensemble models perform better. The results achieved by the ANN on the never-seen test set confirm what is currently known for other machine-learning applications: when a neural network is properly designed, trained, and fine-tuned, at higher computational cost, and when the best model is identified with random and grid searches, neural networks usually outperform all the machine-learning baselines thanks to their ability to capture non-linear dependencies.
We also included the ANN because, although it yields the worst average AUC on the validation set, its best model selected on the validation set achieves the best absolute performance on the test set.
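The temporal split used throughout these experiments can be sketched with pandas; the `fiscal_year` column name is an illustrative assumption:

```python
import pandas as pd

# One placeholder row per fiscal year, standing in for the real dataset.
df = pd.DataFrame({"fiscal_year": range(1999, 2019), "x": range(20)})

train = df[df.fiscal_year <= 2011]                             # 1999-2011
val = df[(df.fiscal_year >= 2012) & (df.fiscal_year <= 2014)]  # 2012-2014
test = df[df.fiscal_year >= 2015]                              # 2015-2018
```

Splitting by year, rather than randomly, guarantees the test set covers a later economic cycle than the one the models were trained on.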

Task T2: Survival Probability Task
The second task we performed using the dataset is predicting bankruptcy events several years in advance. This task is implemented by considering the event in the dataset and looking ahead a number of years, which we named the Look-Ahead Window (LAW). In practice, default prediction with WL = 1 is exactly the same as prediction with LAW = 1; for this reason, we investigated LAW values ranging from two to five years. The main difference from task T1 is that for task T2 the models exploit only a single year of accounting variables, placed in the past according to the LAW parameter. For example, a LAW of three years means that the model learns to predict a company's status three years ahead. This way of predicting bankruptcy is usually adopted to estimate the survival probability of a company within some years. All the experiments were conducted with the same methodological framework as for task T1. We trained all the models over 100 different runs, randomly undersampling the healthy examples in the training set to obtain a balanced set. We compared all the models on the validation set, varying the LAW parameter to select the best ones. Model settings are the same as for task T1. Figure 8 shows the model comparison. We remind readers that the validation set is imbalanced, which is why performance is measured in terms of the AUC. The results are similar to task T1: AdaBoost achieves the worst average performance, along with the Support Vector Machine, for all LAW values. The other ensemble models outperform the remaining models and exhibit a small variance when changing the training set across runs. The Neural Network achieves the best absolute result for each window but presents a high performance variance.
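One way to build the LAW labels is to pair the accounting variables of year t with the company status at year t + LAW. This is a sketch under assumed column names (`company`, `year`, `status`), not the paper's actual pipeline:

```python
import pandas as pd

def make_law_dataset(df, law):
    """df: one row per (company, year) with a `status` column (1 = bankrupt).
    Returns rows pairing year-t features with the status at year t + law;
    rows without a known status `law` years ahead are dropped."""
    future = df[["company", "year", "status"]].copy()
    future["year"] -= law   # align the status at t+law with the features at t
    return df.drop(columns=["status"]).merge(future, on=["company", "year"])

# Toy example: company A goes bankrupt in 2009.
toy = pd.DataFrame({
    "company": ["A"] * 5,
    "year":    [2005, 2006, 2007, 2008, 2009],
    "roa":     [0.3, 0.2, 0.1, -0.1, -0.3],
    "status":  [0, 0, 0, 0, 1],
})
law3 = make_law_dataset(toy, law=3)   # 2005 -> status 2008, 2006 -> status 2009
```

With `law=3`, only the 2005 and 2006 feature rows survive, since later years have no status three years ahead; this mirrors how larger LAW values shrink the usable dataset.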
However, one should highlight that, in general, all the models achieve a better average AUC on the survival probability task than on the classical bankruptcy prediction task (T1). Table 3 shows the variance of the AUC values on the validation set for each look-ahead window.

Results for the Survival Probability Prediction
As already carried out for task T1, we selected the best models on the validation set and measured their generalization ability on the test set (2015-2018). We compared GB, XGB, LR, and RF. As expected from the previous model comparison, the best among these is again Random Forest. However, also for this task, the best ANN model found on the validation set clearly outperforms all the other models, providing further evidence that this category of models achieves better performance when properly designed. In general, the results of both the validation and test experiments suggest that machine learning can better predict a company's status over a long horizon (from three to five years ahead) than over a short one (one or two years). Results are presented in Figure 9.

Results
In this section, we present a deeper analysis of the results for both tasks T1 and T2, also discussing the other available metrics. In Sections 6 and 7, we compared all the models in terms of the AUC and identified the best model for both tasks using this metric. This metric was chosen for two reasons: the imbalanced test set, and the fact that it is the most commonly used in the literature for this task. We now analyze the same results through the other metrics in order to draw more robust conclusions about the models for the two tasks. Tables 4 and 5 show all the results achieved by all the models on the test set (2015-2018).
We show precision and recall for both the bankruptcy class and the healthy one. From these, we computed the micro- and macro-F1 scores for the overall classifier and, finally, the type I and type II errors. All the results were computed for all the temporal windows (WLs and LAWs) of the two tasks.
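All of these per-class and aggregate metrics can be reproduced from the predictions with Scikit-Learn. The counts below are illustrative (roughly matching the 3% bankruptcy rate of the test set), not taken from Tables 4 and 5:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical predictions: 30 bankruptcies out of 1000 firms (~3%),
# a classifier with recall 0.8 on bankruptcies and many false alarms.
y_true = np.array([1] * 30 + [0] * 970)
y_pred = np.array([1] * 24 + [0] * 6          # TP = 24, FN = 6
                  + [1] * 194 + [0] * 776)    # FP = 194, TN = 776

prec_bkr = precision_score(y_true, y_pred, pos_label=1)  # bankruptcy precision
rec_bkr = recall_score(y_true, y_pred, pos_label=1)      # bankruptcy recall
f1_micro = f1_score(y_true, y_pred, average="micro")     # equals accuracy here
f1_macro = f1_score(y_true, y_pred, average="macro")     # mean of per-class F1
```

Even with this fabricated confusion matrix, the pattern discussed below emerges: high recall, very low precision on the minority class, and a macro-F1 far below the micro-F1.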
Finally, we report the complete confusion matrices of the experiments, which we have not seen reported in other papers, because we strongly believe this supports correct comparisons and their use as benchmarks for future investigations.

The Most Suitable Metric
The first question we aim to answer with the experiments is the following: does a high AUC value mean that the model is better at predicting bankruptcy or survival probability over time? There is no unique and simple answer. We compared the overall performance of the models using the AUC and the micro-F1 score, and the two metrics yield similar results, both with high values: models that exhibit a high AUC also have a high micro-F1 score. This direct proportionality is also evident with the macro-F1 score, which, we remind readers, is the arithmetic average of the per-class F1 scores, although its value is usually markedly lower.
The reason the macro-F1 score is much lower is the low precision that the models achieve on the bankruptcy class. All the best models selected by highest AUC exhibit high recall and low precision for the bankruptcy class, and this is not evident from the AUC value alone, which is often reported as if it were similar to accuracy. The precision-recall trade-off is well known in the literature: there is an inverse relationship between the two metrics, since precision accounts for false positives and recall for false negatives. If a model detects true positives with few false negatives, it will often predict negative samples as positive (many false positives, hence low precision). Moreover, in our dataset, as in most bankruptcy datasets, the class imbalance is very pronounced: on average, 97% of the samples in the test set are healthy (negative) versus only 3% bankrupt (positive). For this reason, the precision on the bankruptcy class strongly depends on the absolute number of negative examples in the test set. The reader may observe this dependence in the confusion matrices reported in the two tables.
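A small arithmetic illustration of this dependence: holding a classifier's recall (0.8) and false-positive rate (0.2) fixed, the bankruptcy-class precision collapses as the share of healthy companies grows. The class counts are hypothetical.

```python
def precision(n_pos, n_neg, recall=0.8, fpr=0.2):
    """Bankruptcy-class precision for fixed recall and false-positive rate."""
    tp = recall * n_pos        # expected true positives
    fp = fpr * n_neg           # expected false positives
    return tp / (tp + fp)

p_balanced = precision(500, 500)      # 50/50 classes
p_imbalanced = precision(30, 970)     # ~3% bankruptcies, as in our test set
```

The same classifier scores a precision of 0.8 on balanced classes but only about 0.11 at the 97/3 imbalance of the test set, which is why precision alone says little without the class distribution.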
At the same time, this condition can be considered optimal for some financial stakeholders: if a model has high recall on the bankruptcy class, it will make some wrong predictions on healthy companies, but it ensures that most of the risky companies are detected and avoided in their investments.
The type II error is the other metric most related to the recall of the bankruptcy class, but it considers the ratio of false negatives (defaulting companies wrongly classified as healthy): minimizing it means reducing the number of false negatives, while maximizing recall means increasing the share of true positives; indeed, the type II error equals one minus the recall on the bankruptcy class. These two metrics should always be reported along with the AUC because they provide concrete insight into the model's ability to correctly identify companies that are going to face financial trouble. Indeed, a high AUC value is largely driven by the high precision and recall the models achieve on the healthy (negative) class, since it is over-represented.
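Both error types follow directly from the confusion matrix; the counts below are hypothetical, chosen to match the ~3% positive rate used earlier:

```python
# Hypothetical confusion-matrix counts: 30 bankruptcies, 970 healthy firms.
tp, fn, fp, tn = 24, 6, 194, 776

type_i = fp / (fp + tn)       # healthy firms wrongly flagged as bankrupt
type_ii = fn / (fn + tp)      # bankrupt firms wrongly classified as healthy
recall_bkr = tp / (tp + fn)   # type II error = 1 - bankruptcy recall
```

Here both error rates happen to be 0.2, yet the bankruptcy precision (24/218) is near 0.11, again showing that the two error types alone do not reveal the false-alarm burden on an imbalanced test set.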
The final user, particularly a regulatory body responsible for monitoring the status of companies, should not blindly trust the AUC value, since models selected on this basis alone may produce many false alarms (low precision on the bankruptcy class, i.e., high numbers of false positives). In this case, they should design and compare the ML models using the macro-F1 score first and then the AUC.

Best Model and Temporal Windows
In light of our results, we can assert that Neural Networks should be preferred among the ML models for bankruptcy prediction tasks, since they show the best ability to generalize to new and unseen cases under every metric adopted. However, training and designing ANNs is usually harder and requires more computation time, higher costs, and more experiments than the other models. When computation time and cost are binding constraints, or when the model must operate in highly dynamic contexts, the Random Forest algorithm should be preferred: it delivers almost similar performance but requires fewer parameters, and thus takes less time to design and train.
The temporal window analysis for tasks T1 and T2 led to the following conclusions:
• For the default prediction task (T1), performance generally increases when considering more than one year of accounting variables, for both ANNs and Random Forest. Indeed, the five-year temporal window exhibits the best results in terms of AUC. However, in light of the discussion about the metrics, the three-year temporal window achieves the best trade-off between AUC and macro-F1 score while using fewer variables.
• For the survival probability prediction task (T2), the best performance is achieved when predicting the company status three years in advance (LAW = 3), with an AUC of 0.87 for the ANN. It should be highlighted that the ANN also reached a considerable AUC of 0.86 for LAW = 5. However, all the models exhibit very low precision on the bankruptcy class except for SVM and XGBoost; indeed, the best model in terms of macro-F1 score is XGBoost (LAW = 5).
In general, learning from temporal variables seems to lead to better performance, especially when the model has enough capacity to learn complex patterns, as happens with Neural Networks.

Conclusions
In this research work, we deeply investigated the performance of several machine-learning techniques for predicting bankruptcy in the American stock market. We compared the models over two different tasks: (a) default prediction using time-series accounting data; (b) survival probability prediction. We performed the tasks using a dataset of 8262 companies covering the period between 1999 and 2018. The dataset is also one of the contributions of this paper, since it has been publicly released. We used a temporal criterion to divide the dataset into training, validation, and test sets. For both tasks, Neural Networks achieve the best absolute results despite exhibiting a large variance across runs, leading to the conclusion that these models can be superior only when properly designed and trained at higher computational cost. Finally, we critically discussed the general use of the Area Under the Curve as a common metric to evaluate bankruptcy prediction, since in most cases reporting per-class precision and recall and the macro-F1 score would better characterize the models' performance. Moreover, we highlighted that using more fiscal years as input can improve the performance on both tasks, as had previously been shown only on small datasets. In light of this, future work should explore Recurrent Neural Networks and attention models to better exploit the time-series information, considering the possible trade-offs of using deep learning models with such short time series and relatively small datasets. Moreover, bankruptcy prediction could also be evaluated with unsupervised approaches such as Isolation Forest and other anomaly detection models; the current dataset could also support that case. Finally, future work should address the possible limitations of this research.
The main issue to be investigated further concerns the temporal dimension of the study: we were able to collect reliable data until 2018, and testing on previously unseen examples has been promising. However, it would be interesting to evaluate whether the current models can also generalize to different economic situations, such as those arising from the COVID-19 pandemic and the resources crisis. Another limitation of this work concerns the class imbalance: several effective sampling techniques, as well as synthetic data generation, should be considered in order to evaluate a balanced scenario.

Conflicts of Interest:
The authors declare no conflict of interest.