Machine Learning Models and Data-Balancing Techniques for Credit Scoring: What Is the Best Combination?

Forecasting the creditworthiness of customers is a central issue of banking activity. This task requires the analysis of large datasets with many variables, for which machine learning algorithms and feature selection techniques are a crucial tool. Moreover, the percentages of "good" and "bad" customers are typically imbalanced, such that over- and undersampling techniques should be employed. In the literature, most investigations tackle these three issues individually. Since there is little evidence about their joint performance, in this paper, we try to fill this gap. We use five machine learning classifiers, and each of them is combined with different feature selection techniques and various data-balancing approaches. According to the empirical analysis of a retail credit bank dataset, we find that the best combination is given by random forests, random forest recursive feature elimination and random oversampling.


Introduction
The financial crises observed in the first two decades of the current century have been the subject of unprecedented attention from financial institutions, especially concerning credit risk. Credit assessment has become a building block of credit risk measurement and management (Huang et al. 2007).
Lending money is a traditional banking activity whose analysis is based on several variables. Banks can assess borrowers' abilities to repay the loan through the design of a credit scoring process, aimed at classifying the applicants into categories corresponding to good and bad credit quality, according to their capability to honor financial obligations. Since applicants with bad creditworthiness have a high probability of defaulting, the accuracy of credit scoring is critical to financial institutions' profitability. Even a one percent improvement in the estimation accuracy of credit scoring of "bad" applicants may significantly decrease the losses of a financial institution (Hand and Henley 1997).
A credit scoring model (Anderson 2007, p. 6; Bolder 2018) is usually defined as a statistical model aimed at estimating the probability of default of the counterparties in a credit portfolio, according to the values of the explanatory variables or features. Credit scores are often divided into classes that represent rating categories, where the rating essentially means the level of creditworthiness. Credit scoring was originally determined subjectively according to personal judgment. Later on, it was based on the so-called "5Cs": the character of the consumer, the capital, the collateral, the capacity and the economic conditions. However, with the tremendous increase in the number of applicants, it has become impossible to carry out a manual screening.
Nowadays, because of the high demand in the management of large loan portfolios and the regulatory prescriptions, quantitative credit assessment models are routinely used for credit admission evaluation. Credit scoring models are built according to features such as income and loan payment history, as well as data about previously accepted and rejected applicants (Chen and Huang 2003). The advantages of quantitative credit assessment include reducing the cost of credit analysis, enabling faster decisions, insuring credit collections and diminishing possible risks (West 2000).
Two categories of automatic credit scoring approaches (i.e., statistical techniques and artificial intelligence (AI)) have been studied in the credit risk literature (Huang et al. 2004). Various statistical methods have been applied, and we mention here linear discriminant analysis (LDA; Karels and Prakash 1987; Reichert et al. 1983), logistic regression (LR; Thomas 2000; West 2000) and multivariate adaptive regression splines (Friedman 1991). However, a common weakness of most statistical approaches in this set-up is that some assumptions, such as the multivariate normality of the predictors in LDA, are frequently violated in practice, which makes these techniques theoretically invalid (Huang et al. 2004).
More recently, many studies have demonstrated that AI methods such as artificial neural networks (ANNs; Desai et al. 1996; West 2000), decision trees (DT; Hung and Chen 2009; Makowski 1985), case-based reasoning (Buta 1994; Shin and Han 2001), and support vector machines (SVM; Baesens et al. 2003; Huang et al. 2007; Schebesch and Stecking 2005) are effective tools for credit scoring. Unlike statistical approaches, AI techniques automatically extract knowledge from training samples without assuming specific data distributions. Previous investigations suggest that AI often outperforms statistical methods in dealing with credit scoring problems, especially for nonlinear classification patterns (Huang et al. 2004; Van Gestel and Baesens 2009).
However, there is no overall best AI technique, for what is best depends on the details of the problem, the data structure, the predictors used, the extent to which it is possible to segregate the classes by using those predictors, and the goal of the classification analysis (Hand and Henley 1997;Yu et al. 2008).
An additional issue often encountered in practice is the so-called imbalanced data problem, caused by the fact that the number of observations belonging to the two classes is often not the same. This difficulty is particularly serious in credit risk measurement, where datasets are often strongly imbalanced, since they typically contain many more non-defaulters than defaulters. The effects of imbalanced data can be mitigated by means of under- or oversampling techniques, which diminish the fraction of overrepresented observations or augment the fraction of underrepresented observations, respectively. In this paper, we employ the three most common methods: the synthetic minority oversampling technique (SMOTE; Chawla et al. 2002), possibly cleaned via Tomek's link (SMOTEtomek; Tomek 1976), and the random oversampling algorithm (Baesens et al. 2003; He and Garcia 2009).
Since the effectiveness of such algorithms may depend on the classifiers they are combined with, we also explore the accuracy gains provided by each under- or oversampling technique in relation to various machine learning algorithms. For simplicity, in the rest of the paper, we use the term "sample-modifying" instead of "under- or oversampling".
The last important issue is feature selection. To avoid overfitting and decrease the variance of the estimated models, only important predictors should be included in the models. To this aim, we analyze three selection approaches: random forest recursive feature elimination, chi-squared feature selection, and L1-based feature selection.
To sum up, in this paper, we try to answer the following research question: which combination of machine learning classifier, feature selection method, and data-balancing technique yields the most accurate credit scoring model? To address this issue, in the empirical analysis, we use a publicly available retail credit dataset containing 20 quantitative and qualitative predictors and 1000 applicants. In this dataset, we carry out a comparative assessment of the performance of five machine learning classifiers (decision trees, random forests, K-nearest neighbor, neural networks, and Naïve Bayes) combined with three oversampling techniques (SMOTE, SMOTEtomek, and random oversampling) and three feature selection algorithms (random forest recursive feature elimination, chi-squared feature selection, and L1-based feature selection). The accuracy is measured via four criteria: the area under the curve (AUC), average accuracy, sensitivity (true positive rate), and specificity (true negative rate), as well as by means of both the validation set approach (based on sample splitting) and K-fold cross-validation.
This paper is based on the joint use of (1) machine learning classifiers, (2) feature selection methods, and (3) oversampling techniques. Even though their combined impact on the prediction accuracy is likely to be substantially different from the effect obtained when we focus on only one of them, the credit risk measurement literature lacks an investigation focused on the combination of these three approaches. In order to fill this gap, in this paper we study the joint performance of the techniques (1)-(3) above.
The remainder of the paper is organized as follows. In Section 2, we present the details of the classifiers, the feature selection methods, and the oversampling techniques. In Section 3, we describe the set-up of the empirical analysis and report the results. Based on the outcomes of these experiments, in Section 4, we conclude the paper and outline future research directions.

Machine Learning and Credit Risk: Some Background
The number of applications of machine learning techniques in credit risk has increased dramatically in the last 15 years or so. Here, we give an overview of some significant articles (see Leo et al. 2019 and the references therein for further information).
Tsai and Chen (2010) considered different combinations of hybrid machine learning, clustering, and classification models for credit risk measurement. Their results suggest that a hybrid model based on a combination of different techniques has the best performance. In particular, logistic regression and neural networks provided the highest prediction accuracy.
Trivedi (2020) conducted a comparative analysis of the performance of different feature selection criteria associated with various machine learning classifiers. To measure the performance, several evaluation metrics were considered: accuracy, the F-measure, the false positive rate, the false negative rate, and the training time. The conclusion was that a combination of random forests and chi-squared feature selection appeared to have the best performance, albeit at the price of a higher training time.

De Castro Vieira et al. (2019) considered two models: the first one was based on different time intervals for default prediction, and the second one disregarded the categorical variables (gender, age, and marital status). Their outcomes suggest that the number of days overdue and the accuracy of the model are positively related, especially as the number of days overdue increases. The most accurate models tend to be bagging, random forests, and boosting. Furthermore, removing the categorical predictors preserves the discriminatory power of the credit risk rating system.

Wang et al. (2011) carried out a comparative analysis of the performance of three ensemble approaches (bagging, boosting, and stacking) used to improve the predictive power of four base classifiers (logistic regression, decision trees, artificial neural networks, and support vector machines). According to their results, the ensemble approaches typically enhance individual base learners in a non-negligible measure. Specifically, bagging outperformed boosting across all credit datasets analyzed in the paper, whereas stacking and bagging combined with decision trees were the preferred approaches in terms of classical accuracy measures.

Classification Techniques
Classification algorithms are supervised learning models estimated from the patterns in the training data, whose class membership is known in advance. A classifier tries to estimate the relationship between the inputs (predictors) and the output (indicator of class membership) in the training set. After the model is trained, its performance is tested on new data, usually called a test set (James et al. 2021).
In the following, we denote the training observations as x_i = (x_{i1}, ..., x_{id})^T, i = 1, ..., n, where d represents the dimensionality of the feature space (i.e., the number of predictors in the model). The target variable y is assumed to be a categorical variable that can take M possible values S_1, ..., S_M.
In addition to the machine learning techniques listed in the rest of this section, in the empirical analysis in Section 3, we will also employ regularized logistic regression as a benchmark in the class of statistical techniques.

Decision Trees
A decision tree comprises a series of logical decisions, which are represented as a tree structure (Breiman et al. 1984). The tree consists of nodes indicating a decision based on a feature and leaves where the final decision is made. The decisions are similar to if-then rules, since they take as input a condition described by a set of attributes. If it is satisfied, then it returns a decision, which is the predicted value; otherwise, it carries out further analysis. The probability that an arbitrary sample point (y, x) belonging to the leaf C_s is from the jth class is estimated as follows:

$$\hat{p}(S_j \mid C_s) = \frac{1}{n_{C_s}} \sum_{x_i \in C_s} I(y_i = S_j),$$

where n_{C_s} is the number of training observations in C_s.
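Since the estimate is just the within-leaf class proportion, it can be illustrated with a few lines of Python (the leaf labels below are made up for illustration):

```python
from collections import Counter

def leaf_class_probability(leaf_labels, cls):
    """Estimate P(y = cls | x falls in leaf C_s) as the proportion
    of training observations in the leaf that belong to class cls."""
    n_leaf = len(leaf_labels)        # n_{C_s}: observations in the leaf
    counts = Counter(leaf_labels)    # per-class counts within the leaf
    return counts[cls] / n_leaf

# Labels of the training points that ended up in one hypothetical leaf C_s
labels_in_leaf = ["good", "good", "bad", "good"]
print(leaf_class_probability(labels_in_leaf, "bad"))   # 0.25
```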

Random Forests
Random forests are a generalization of decision trees: a random forest is an ensemble of B decision trees built on bootstrapped samples obtained from the original sample (Breiman 2001; James et al. 2021). Moreover, each tree is developed using a subset of randomly chosen features. Since each decision tree yields a predicted class, for the overall classification, an RF predicts the class by using the majority vote criterion (James et al. 2021, p. 341), taking into account the output of all decision trees. A pseudo-code is given in Algorithm 1.
Algorithm 1 Random Forest. Given training observations (y_i, x_i), i = 1, ..., n, the number of features d_e selected for the ensembles, and the number of trees B in the ensemble, we use B trees to construct the random forest. The following steps are performed:
1. Use bagging (James et al. 2021, Sect. 8.2.1) to create B samples of size n, where each sample is used as training data;
2. For constructing the trees of the random forest, the features are randomly sampled from the d_e features selected in advance, and the trees are grown without pruning;
3. The training samples obtained via bagging in Step 1 are used to train the B decision trees. Prediction is carried out via the majority vote mechanism and used for making a final classification decision.
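The steps of Algorithm 1 correspond to what scikit-learn's RandomForestClassifier (the library used in Section 3) performs internally; a minimal sketch on synthetic, purely illustrative data might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 observations, d = 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

# B = 100 bootstrapped trees, each split considering a random feature subset
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)
pred = rf.predict(X)                     # majority vote over the 100 trees
print(rf.score(X, y))                    # in-sample accuracy
```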

Naïve Bayes
Naïve Bayes classification (Denison et al. 2002) is based on Bayes' theorem, which explicitly gives posterior probabilities of class membership. The central simplifying assumption, made in order to obtain a tractable specification of the joint probability distribution of the features, is that the predictors are independent, since in this case, their joint distribution is equal to the product of the marginal distributions. Naïve Bayes has a light computational burden, but its performance is critically related to the plausibility of the independence assumption.
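To make the independence assumption concrete, the following minimal sketch (with entirely made-up prior and conditional probabilities) computes a Naïve Bayes posterior as the normalized product of a prior and per-feature likelihoods:

```python
# Toy Naive Bayes: posterior proportional to prior * product of per-feature
# likelihoods. All numbers below are made-up illustrative probabilities.
priors = {"good": 0.7, "bad": 0.3}
# P(feature observed | class), one entry per feature (independence assumption)
likelihood = {
    "good": {"owns_property": 0.6, "late_payment": 0.1},
    "bad":  {"owns_property": 0.3, "late_payment": 0.5},
}

def posterior(observed):
    """Normalized posterior over classes for a set of observed features."""
    scores = {}
    for cls in priors:
        p = priors[cls]
        for feat in observed:
            p *= likelihood[cls][feat]   # product of marginal likelihoods
        scores[cls] = p
    z = sum(scores.values())             # normalizing constant
    return {cls: p / z for cls, p in scores.items()}

print(posterior({"owns_property", "late_payment"}))
```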

Artificial Neural Networks
Similar to other classifiers, artificial neural networks (often simply called neural networks) take as input the features x_1, ..., x_d and construct a nonlinear function f(x) aimed at predicting the dependent variable y. The peculiarity of the method is the procedure followed to obtain f. The most common type of neural network consists of three layers of units: the input, hidden, and output layers. Such a structure is usually called a multilayer perceptron. A layer of "input" units is fed to a layer of "hidden" units, which is finally connected to a layer of "output" units (Haykin 2004).
The algorithm is called a neural network because the hidden units are interpreted as neurons in the brain. In the last decade or so, neural networks have experienced extraordinary success, partly related to the availability of large datasets that make it possible to train such complex models effectively.

K-Nearest Neighbor
Consider a feature vector x*, corresponding to a new observation that needs to be classified. The K-nearest neighbor (KNN) algorithm assigns the observation with predictors x* to the class of the majority of the K nearest neighbors of x* in the training dataset. The nearest neighbors are determined by calculating the Euclidean distance between the input feature vector x* and the feature vectors of the training observations, and the flexibility of the algorithm is determined by the "size" of the neighborhood (i.e., by the parameter K). Excessively small values of K should be avoided, because they would lead to overfitting the training data.
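A from-scratch sketch of the KNN rule (synthetic two-cluster data; the choice K = 3 is illustrative only):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_star, K=5):
    """Classify x_star by majority vote among its K nearest
    training observations under the Euclidean distance."""
    dists = np.linalg.norm(X_train - x_star, axis=1)
    nearest = np.argsort(dists)[:K]              # indices of the K nearest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.8, 5.2]), K=3))  # 1
```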

Feature Selection Methods
A feature is an individual measurable property of the process being observed. Feature selection (or variable elimination) is the process of determining which features within the dataset are effective for the resulting prediction. It helps in understanding the data, reducing the computational requirements, easing the effects of the curse of dimensionality, and improving the prediction performance (Chandrashekar and Sahin 2014). In this section, we introduce some feature selection techniques that will be employed in Section 3 to examine how the models behave with different sets of features.

Random Forest Recursive Feature Elimination
Recursive feature elimination (RFE) is a greedy algorithm based on feature-ranking techniques (Zhou et al. 2014). The algorithm measures the classifier performance by eliminating predictors in an iterative manner. In a first step, RFE trains the classifier with all d features, and then it calculates the importance of each feature via the information gain method or the mean reduction in the Gini index (James et al. 2021, p. 336; Ustebay et al. 2018). Subsequently, subsets of progressively smaller sizes m = d, d − 1, ..., 1 are created by iterative elimination of the features. The model is retrained within each subset, and its performance is calculated. Hence, RF-RFE is a feature selection method that combines RFE and random forests (see Ustebay et al. 2018 for details). A step-by-step description is given in Algorithm 2.
Algorithm 2 Random Forest Recursive Feature Elimination. The following steps are performed:
1. Train the model in the training set with all d predictors.
2. Compute the overall performance, and rank the predictors by importance.
3. For each subset size m (m = 1, ..., d), repeat the following steps: (a) train the model by using only the m most important predictors; (b) compute the classification performance, and rank the m predictors by importance using the mean reduction in the Gini index.
4. Use the model based on the optimal number of predictors m*, corresponding to the highest performance at Step 3(b) above.
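Under the assumption that RF-RFE behaves like scikit-learn's generic RFE wrapper around a random forest ranker, the procedure can be sketched as follows (synthetic data and illustrative settings, not the paper's configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))            # 8 candidate features
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only features 0 and 1 carry signal

# RFE drops the least important feature (by RF importance) at each round
selector = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=1),
               n_features_to_select=2, step=1).fit(X, y)
print(selector.support_)   # boolean mask of the surviving features
print(selector.ranking_)   # 1 = kept; higher ranks were eliminated earlier
```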

Chi-Squared Feature Selection
In feature selection, we test the null hypothesis of independence by means of the well-known chi-squared test. In particular, we assess whether the class label (the target) is independent of a given feature (Alshaer et al. 2021). The d test statistics are given by

$$\chi^2_s = \sum_{i=1}^{m_s} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad s = 1, \dots, d,$$

where O_{ij} and E_{ij} are the observed and expected frequencies, respectively, m_s is the number of categories of the sth predictor, and k is the number of classes of the target variable. Continuous features must be discretized.
As usual, large values of χ²_s imply significant evidence against the null hypothesis of independence of y and x_s, and the reference distribution under the null is the chi-squared distribution with (k − 1)(m_s − 1) degrees of freedom. Finally, the features x_s for which independence from y cannot be rejected are eliminated from the analysis.
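The test above is the classical contingency-table chi-squared test, which can be reproduced with scipy.stats.chi2_contingency; the counts below are made up for illustration (note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table O: rows = categories of one feature, cols = target classes.
# Made-up counts for a binary feature vs. non-default / default.
O = np.array([[90, 10],    # category A: 90 non-default, 10 default
              [60, 40]])   # category B: 60 non-default, 40 default

chi2, p_value, dof, E = chi2_contingency(O)
print(chi2, p_value, dof)  # dof = (k - 1)(m_s - 1) = 1 for a 2x2 table

# Keep the feature only if independence is rejected, e.g. at the 1% level
keep = p_value < 0.01
```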

Support Vector Machines and L1-Based Feature Selection
We used support vector machines with linear kernels. The prediction has the general form pred(x) = sign(b + Σ_{i=1}^{n} α_i y_i K(x, x_i)). If the kernel is linear (i.e., K(x, v) = x^T v), the prediction becomes sign(b + w^T x), where w = (w_1, ..., w_d)^T = Σ_{i=1}^{n} α_i y_i x_i is a vector of weights that can be computed explicitly.
This technique classifies a new observation with feature vector x* according to whether the linear combination w^T x* is larger or smaller than the threshold −b (Brank et al. 2002). Hence, in this approach, the jth feature is more likely to be important if the absolute value of its weight w_j is large. This type of feature weighting has an intuitive interpretation, because a predictor with a small |w_j| value has a minor impact on the predictions and can be ignored (Sindhwani et al. 2001).
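A sketch of L1-based selection using scikit-learn's LinearSVC and SelectFromModel (synthetic data; the value C = 0.1 is an illustrative choice, not the paper's setting):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)   # only two informative features

# The L1 penalty drives the weights w_j of uninformative features to zero
svc = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
mask = np.abs(svc.coef_[0]) > 1e-5            # features with non-negligible |w_j|
print(mask)

selector = SelectFromModel(svc, prefit=True)
X_reduced = selector.transform(X)             # keep only the selected columns
print(X_reduced.shape)
```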

Over-and Undersampling Techniques
Imbalanced datasets are a relevant issue commonly observed in real-world applications that can have a significant impact on the classification performance of machine learning algorithms. As pointed out by Ganganwar (2012), the available solutions can be grouped into two categories. At the data level, sample-modifying techniques have been developed. At the algorithmic level, cost-sensitive learning methods have been proposed. Here, we apply three algorithms in the former category, which have been shown to guarantee a robust solution (Batista et al. 2004). See Baesens et al. (2003) and He and Garcia (2009) for further details.
The basic idea consists of resampling the original dataset, either by oversampling the smallest class or undersampling the largest class, until the sizes of the classes are approximately the same. Since undersampling may discard some important information and consequently worsen the performance of the classifiers, oversampling tends to be preferred (Ganganwar 2012).
Random oversampling is one of the simplest methods, as it augments the minority class with randomly repeated copies of minority class observations. A possible disadvantage is that if the dataset is large, it may introduce a significant additional computational burden. Moreover, since it yields exact copies of minority observations, it can increase the risk of overfitting.
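A minimal NumPy sketch of random oversampling (toy data with 7 majority and 3 minority observations):

```python
import numpy as np

def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority-class rows (with replacement)
    until both classes have the same number of observations."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)                 # 10 toy observations
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])     # 7 "good" vs. 3 "bad"
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))                        # [7 7]
```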
The synthetic minority oversampling technique (SMOTE; Chawla et al. 2002) oversamples the minority class by synthetically creating new instances rather than oversampling with replacement, as random oversampling does. The SMOTE forms new minority examples by interpolating between several minority class observations that are "close" to each other.
In more detail, given a minority observation x_i, the K nearest neighbors of the same class are selected. In a second step, some of these nearest training observations are randomly chosen according to a prespecified oversampling rate. Finally, new synthetic examples are generated along the segments joining the minority example and its selected nearest neighbors.
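The interpolation step can be sketched in a few lines of NumPy (a simplified variant that creates one synthetic point per minority observation; the real SMOTE lets the oversampling rate control how many points are generated):

```python
import numpy as np

def smote_sample(X_min, K=2, seed=0):
    """For each minority point, pick one of its K nearest minority
    neighbors and interpolate at a random position on the segment."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_min - x, axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn = np.argsort(d)[:K]             # K nearest same-class neighbors
        j = rng.choice(nn)                 # pick one neighbor at random
        u = rng.uniform()                  # random position along the segment
        synthetic.append(x + u * (X_min[j] - x))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # minority points
X_new = smote_sample(X_min)
print(X_new.shape)   # (3, 2)
```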
SMOTEtomek (Tomek 1976) stands for SMOTE followed by Tomek link cleaning and thus also involves undersampling. After the minority class has been oversampled via the SMOTE, the majority class is undersampled: the Tomek link step discards observations from the most represented class that are close to the least represented class, in order to obtain a training dataset with a more clear-cut separation between the two groups.

Evaluation Criteria
The performance of the models was evaluated based on the established standard measures in the field of credit scoring. These criteria were the area under the ROC curve (AUC; James et al. 2021, Sect. 4.4.2) and the average accuracy (equal to 1 − er, where er is the error rate). To further strengthen the analysis, we also computed the sensitivity (true positive rate = 1 − Type II error) and the specificity (true negative rate = 1 − Type I error) (see James et al. 2021, p. 152). The basic ingredients are usually represented in the so-called confusion matrix reported in Table 1. A default prediction model can misclassify a customer in two ways. First, if the predicted class of a defaulting client is non-default, then the main cost for the bank is the loss of interest and capital. The second error occurs when the model classifies a non-defaulting customer as default and implies the opportunity cost of not lending to a non-defaulting client, which is a missed business opportunity. The cost of the former (i.e., a false negative) is typically higher for a bank.
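From the confusion matrix entries, the criteria above reduce to simple ratios; a small sketch with hypothetical counts (taking "positive" = defaulter):

```python
def scores(tp, fn, fp, tn):
    """Accuracy, sensitivity (true positive rate) and specificity
    (true negative rate) from the entries of a confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # 1 - Type II error
        "specificity": tn / (tn + fp),   # 1 - Type I error
    }

# Hypothetical counts: 40 defaulters caught, 10 missed (false negatives,
# the costlier error), 20 false alarms, 180 correctly cleared clients.
print(scores(tp=40, fn=10, fp=20, tn=180))
```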
Several works (see, for example, Dopuch et al. 1987;Koh 1992;Nanda and Pendharkar 2001) suggest that incorporating sensitivity and specificity into the prediction models can lead to more accurate results, especially when the two types of error are associated with different costs.Hence, for decision-making purposes in a banking framework, if a lender can come up with a measure of the cost of the Type I and Type II errors, then proper estimates of sensitivity and specificity can be more important than accuracy.

Dataset Description
The datasets employed for developing credit-scoring models should contain financial characteristics (income, credit history, balance sheet information, ...), behavioral information (loan payment behavior, credit usage, ...), and categorical variables (age, marital status, ...), which are the inputs of the model. In addition, an outcome variable that describes the status (default or non-default) of the applicant is also known.
In this study, we used a German retail credit dataset downloaded from the UCI machine learning repository. The dataset refers to the years 1973-1975 and contains 1000 instances and 20 attributes, which give information about the financial statuses of the clients. Of the 20 features, 7 are quantitative and 13 are categorical. To mention just a few of them, the features include the statuses of financial records, measures related to advance rates, bank accounts or securities, installment rates as a percentage of disposable income, and information on property, age, and the number of existing credits. In addition to the 20 features, the dataset contains the target variable credit risk, which is the usual binary variable describing non-creditworthy and creditworthy customers, coded as 1 and 0, respectively. Unfortunately, no information about the definition of default used for constructing the target variable is given. We guessed that the dataset employs the usual definition (i.e., payments missed or delayed by at least 90 days) (see, for example, Duffie and Singleton 2003, p. 44 for possible definitions of the concept). The classes are imbalanced because 300 instances correspond to bad counterparties and 700 instances to good counterparties (Groemping 2019). Table 2 gives the full list.

Numerical Details
In the empirical analysis, we used five machine learning classifiers, namely neural networks (NNs), naïve Bayes (NB), decision trees (DTs), random forests (RF), and K-nearest neighbors (KNN), plus regularized logistic regression (LR) as a benchmark. Three feature selection techniques were employed: chi-squared feature selection, random forest recursive feature elimination, and L1-based feature selection. Since there were fewer defaulters than non-defaulters, the implementation of a preprocessing step to balance the classes was a sensible way of proceeding. We used the SMOTE, SMOTEtomek, and random oversampling algorithms outlined in Section 2.4. However, since the imbalance was not strong, we also performed the analysis with the original imbalanced data.
For the purpose of training and evaluating the models, the dataset was randomly split into a training and a test set in proportions of 75% and 25%, respectively. The models were implemented in Python with the default parameters using the following packages: Pandas for data manipulation, Matplotlib and Seaborn for data visualization, and Scikit-learn for data preprocessing and fitting the models. For reproducibility purposes, we recall here explicitly the numerical values of the main inputs.
For random forests, the number of trees in the forest was 100. The mean decrease in the Gini index was the measure of the quality of a split in both random forests and decision trees. In the Naïve Bayes approach, the prior probability of each class, π_k (k = 1, ..., M), was estimated via the relative frequency of the training data in the kth class. As for the univariate distributions of the predictors, they were assumed to be Gaussian in the continuous case, whereas standard nonparametric estimates were used for categorical densities (see, for example, James et al. 2021, p. 156). The KNN algorithm employed a number of neighbors K = 5. In neural networks, the activation function for the hidden layer was a rectified linear unit (ReLU). Finally, in regularized logistic regression, an L2 penalty was used.
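As an illustration of the set-up, the following sketch reproduces the 75/25 split and the stated RF configuration (100 trees, Gini criterion) on synthetic stand-in data; the real German credit dataset is not loaded here, and the feature construction is entirely hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

# Synthetic stand-in: 1000 applicants, 20 features, roughly 30% "bad"
# counterparties (the real dataset's class ratio).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
signal = X[:, 0] + X[:, 1]
y = (signal + rng.normal(scale=0.5, size=1000) >
     np.quantile(signal, 0.7)).astype(int)

# 75/25 train/test split, then the stated RF configuration
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
acc = accuracy_score(y_te, rf.predict(X_te))
print(auc, acc)
```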

Chi-Squared Feature Selection
Table 3 and Figure 1 display the results obtained with the chi-squared feature selection method of Section 2.3.2. Table 3 lists the seven predictors whose p-values were smaller than 0.01. The actual p-values of the tests are shown in Figure 1. Features corresponding to test statistics with p-values smaller than 0.01 were considered to be significant for classifying defaulters and non-defaulters. Table 4 shows the classification performances obtained with the predictors selected via chi-squared feature selection as well as all combinations of the classification algorithms and the sample-modifying approaches used in this paper. For comparison purposes, we also report the outcomes with the original (imbalanced) dataset.
We can see that when we employed RF, the model accuracy, sensitivity, specificity, and area under the curve improved significantly with the sample-modifying techniques with respect to the original dataset.Furthermore, when comparing the performance of RF with the different sample-modifying techniques, the combination of random oversampling and random forest achieved the best performance, with an accuracy of 0.854 and AUC = 0.925.
As for decision trees, when we used the imbalanced dataset, they had poor performance, being only slightly better than random guessing, with AUC = 0.598 and an accuracy approximately equal to 0.68. In addition, the algorithm tended to favor the most represented class. We conjecture that this was probably caused by the imbalance of the data, since the performance increased significantly by applying sample-modifying techniques. The best combination was based on random oversampling, with an accuracy and AUC equal to 0.8114 and 0.811, respectively.
Turning now to Gaussian Naïve Bayes, we found a rather surprising outcome: when we balanced the data, GNB's performance decreased with respect to the imbalanced data case. Hence, the best outcome was achieved with imbalanced data. The accuracy of the model was 0.752, and the area under the curve was 0.795, similar to RF with imbalanced data.

Figures 2-5 show the ROC curves for the original data (Figure 2) and the data balanced via the SMOTE, SMOTEtomek and random oversampling techniques, respectively (Figures 3-5). Each figure displays the ROC curves corresponding to all the classification techniques employed. Overall, the best result was achieved when we combined RF with random oversampling. It is worth noting that all sample-modifying techniques considerably improved the RF algorithm, which was never ranked first when we considered the original imbalanced data but was always clearly the best after balancing the classes. The KNN classifier worked rather well by using the imbalanced data, although the most represented class was favored. The best outcome arose from the combination of KNN with the SMOTEtomek. In this case, the accuracy was equal to 0.744, and AUC = 0.839.
In the neural network case, the performance of the classifier improved significantly with the sample-modifying methods compared with the imbalanced data, and the NN was overall the second-best approach after RF.
Finally, logistic regression performed best when combined with the SMOTEtomek. In this case, the AUC was 0.794. However, the performance with imbalanced data was quite similar.

Random Forest Recursive Feature Elimination
Table 5 lists the 10 most important variables, measured via the mean decrease in the Gini index (James et al. 2021, p. 343), when random forest recursive feature elimination was used as the feature selection method. Note that the tenth variable had a mean decrease equal to 0.042. With only one exception (present age), the 7 features selected by chi-squared feature selection were a subset of the 10 predictors chosen via random forest recursive feature elimination. This suggests that the selection of the relevant predictors was rather robust with respect to the algorithm employed for this goal.
Table 6 reports the results obtained when we used different sample-modifying techniques with the features selected by random forest recursive feature elimination.
Both RF and DT performed better with random oversampling than with the other sample-modifying techniques. It is worth noting that RF with random oversampling was characterized by the highest accuracy and AUC among all methods, whereas DT had the largest specificity. These results were quite similar to those achieved when using the chi-squared feature selection algorithm. Gaussian Naïve Bayes obtained comparable results with the different sample-modifying techniques, but the outcomes were worse than those based on the imbalanced data. KNN performed well with the SMOTE compared with the other techniques, with 0.76 accuracy and an AUC of 0.824. The neural network's performance was similar with all the sample-modifying techniques and better with respect to the performance with the imbalanced data. Finally, the performance of the logistic regression was not significantly enhanced by the sample-modifying techniques, whose outcomes were more or less comparable.
To sum up, in this case, for most classifiers, the performance as measured by the AUC also improved significantly compared with the models based on imbalanced data. As in Section 3.3.1, however, sample modifying did not help when combined with GNB, where the best outcomes were obtained with the original data, or with LR, where the AUC remained approximately the same with and without sample-modifying techniques. Overall, the best performance was given by random forests with random oversampling, with an AUC and accuracy of 0.932 and 0.8429, respectively. The largest specificity (0.927) was obtained by RF with imbalanced data, at the price of a very low sensitivity (0.444). As for sensitivity, DT with random oversampling achieved the largest value, but its AUC was smaller than that of RF with random oversampling.
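Random oversampling itself is straightforward: minority-class rows are drawn with replacement and duplicated until the two classes are balanced, and the classifier is then fitted on the augmented training split. A minimal sketch with synthetic data follows (the names and sizes are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy imbalanced data (roughly 70/30, like the German credit set).
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random oversampling: duplicate minority-class rows (sampled with
# replacement) until both classes have the same size. It is applied to
# the training split only, so the test set stays untouched.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_tr[idx], y_tr[idx]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
```

Balancing only the training data is essential: evaluating on an oversampled test set would inflate the reported sensitivity.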
All in all, the results yielded by random forest recursive feature elimination were similar to those obtained by chi-squared feature selection (see Table 4). Since, as noted above, the predictors selected by chi-squared selection and random forest recursive feature elimination were quite similar, this is not a surprising outcome.
Figures 7-10 show the ROC curves for the original data (Figure 7) and the data balanced via the SMOTE, SMOTETomek, and random oversampling techniques, respectively. Each figure displays the ROC curves corresponding to all the classification techniques employed. The plots are again similar to the results in Section 3.3.1. We observe that the highest AUC resulted from the use of random forests and random oversampling. Moreover, sample modifying was especially beneficial for the RF classifier, and GNB was the only case where sample-modifying methods mostly decreased the predictive accuracy.
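ROC curves of this kind are computed from the predicted class-one probabilities; the following scikit-learn sketch on synthetic data shows the idea (it does not reproduce the paper's actual figures):

```python
# Sketch: ROC curve and AUC for a random forest classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # scores for the "bad" class
fpr, tpr, _ = roc_curve(y_te, proba)   # points of the ROC curve
auc = roc_auc_score(y_te, proba)       # area under that curve
```

Plotting `tpr` against `fpr` for each classifier on the same axes yields figures analogous to Figures 7-10.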

L1-Based Feature Selection
The last feature selection approach employed in this paper is the L1-based criterion introduced in Section 2.3.3. Recall that in this algorithm, the features associated with non-zero coefficients are considered to contribute significantly to the prediction of the target variable. Table 7 lists the nine best default predictors according to the L1-based algorithm.
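A minimal sketch of this criterion, assuming (as is common) an L1-penalised logistic regression as the selector; the data and penalty strength are illustrative, not the paper's settings:

```python
# Illustrative L1-based feature selection: features whose penalised
# coefficients are shrunk to exactly zero are discarded.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X, y)
kept = selector.get_support()  # True for predictors with non-zero weights
```

Smaller values of `C` correspond to a stronger penalty and hence fewer retained predictors.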
Table 8 confirms the results obtained in Sections 3.3.1 and 3.3.2: the random forest classifier with random oversampling had the best performance, with an accuracy equal to 0.8571 and an AUC of 0.925. However, random forest also performed well with the SMOTE and SMOTETomek. Analogously, the decision trees performed best when combined with random oversampling.
Figures 11-14 show the ROC curves for the original data (Figure 11) and the data balanced via the SMOTE, SMOTETomek, and random oversampling techniques, respectively. Each plot displays the ROC curves corresponding to all the classification techniques employed. As in the cases of the two feature selection methods analyzed in the previous two sections, the most striking feature of the plots is the strong improvement of RF when combined with sample-modifying techniques.
As for the training times (see Table 9), the outcomes are not surprising, since it is well known that Naïve Bayes often trades some predictive power for computational efficiency, whereas neural networks are a complex method whose estimation is more time-consuming. In this set-up, all computing times were rather small, so the practical implementation of all the algorithms should not be difficult, even in larger datasets with more predictors.

General Comments about the Results
The main message of the analysis run in the preceding sections is that random forest was the best classifier in terms of average accuracy and AUC, and artificial neural networks also showed good performance with respect to the other ML algorithms. It is worth stressing that a traditional statistical approach such as logistic regression also worked quite well. This outcome suggests that statistical techniques are still effective and reliable tools for credit scoring. Given that LR is also highly interpretable and already well known by banks and practitioners, it may be convenient for banks to use both approaches (statistical and ML-based) when routinely predicting customer defaults in daily business practices.
All the feature-selection algorithms yielded very similar sets of predictors. Not surprisingly, this implies that, with any of them, one would obtain similar classification outcomes. If we look at the sample-modifying techniques, in the best classification approach (i.e., random forest), random oversampling was preferable to the SMOTE and SMOTETomek techniques in terms of both accuracy and AUC. However, the SMOTE or SMOTETomek techniques were sometimes better when we considered other classification algorithms. Overall, the best method was random forest combined with random oversampling.
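The conceptual difference between the two families is that random oversampling duplicates existing minority rows, whereas the SMOTE creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. A bare-bones sketch of that interpolation step follows (our own simplified implementation, for illustration only, not a production one):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    base = rng.integers(len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    u = rng.random((n_new, 1))     # interpolation weight in [0, 1)
    return X_min[base] + u * (X_min[neigh] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(30, 4))
X_new = smote(X_min, n_new=70)  # grow the minority class by 70 points
```

The SMOTETomek variant additionally removes Tomek links (pairs of opposite-class nearest neighbours) after the synthetic points are generated, cleaning the class boundary.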
The performance of almost all classifiers improved significantly with the sample-modifying techniques with respect to the base imbalanced-data case. The only exception was the Naïve Bayes algorithm, which also had comparable performance when the data were imbalanced. Another general result is that sample-modifying techniques are especially beneficial in terms of sensitivity.
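The sensitivity and specificity figures quoted throughout are obtained from the confusion matrix in the usual way; a small worked example with toy labels (not the paper's data) makes the definitions concrete:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# sensitivity = TP / (TP + FN): share of true defaulters caught;
# specificity = TN / (TN + FP): share of good customers correctly kept.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # 3 / 4 = 0.75
specificity = tn / (tn + fp)  # 5 / 6 ≈ 0.833
```

With imbalanced credit data, a classifier can reach high accuracy while missing most defaulters, which is why sensitivity is the metric that balancing techniques chiefly improve.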
Tree-based techniques (i.e., decision trees and random forests) had higher performance when combined with random oversampling. Moreover, they gained more than other classifiers from the use of sample-modifying methods. According to our results, the best combination of algorithms to build a robust, accurate, and sensitive credit-scoring model is random forest combined with random forest recursive feature elimination and the random oversampling technique.
The dataset employed in this section has already been used in ML applications in the past (see, for example, Dea et al. (2001); Ekin et al. (1999); Gonzalez et al. (2001); Ustebay et al. (2018); Wang et al. (2003)). However, Ustebay et al. (2018) performed an unsupervised learning analysis, whereas Wang et al. (2003) studied the impact of modifications of the dataset, so these two studies are not directly comparable to our work. The other three articles employed selection and learning techniques that were different from ours, but some comparison is sensible. Dea et al. (2001) used a network-like algorithm. They selected seven features, analogous to what we obtained in Table 3, and most features were the same in both cases. The accuracy in the test set was 74.25%, essentially equal to what we showed in Table 4. Similar accuracies were also obtained by Ekin et al. (1999) via distance-based methods. Gonzalez et al. (2001) developed a graph-based relational concept learner. In this case, the highest accuracy was 71.52%.
Our results suggest that ML techniques are a powerful tool for credit scoring and default prediction purposes. From the economic point of view, inaccurate estimates of creditworthiness in the banking sector were a key determinant of the two worst economic crises of modern times (i.e., the Great Depression of 1929 and the Great Recession of 2008). The latter in particular was triggered by the so-called subprime mortgage crisis, where the underestimation of default probabilities and easy credit conditions had catastrophic economic consequences. The use of ML approaches such as the ones proposed in the current paper, possibly combined with more traditional statistical techniques and under strict regulatory control, provides an additional safeguard against further crises.
Dealing with a More-Imbalanced Set-Up
In the preceding sections, we studied the performance of the classifiers and the data-balancing techniques in a framework where classes were moderately imbalanced. Since typical credit datasets are more imbalanced, one may wonder whether the results obtained in Section 3.3 can be generalized to a more imbalanced set-up.
To investigate this issue, in this section, we performed the same analysis on an extremely imbalanced dataset. Specifically, we used the Default dataset from the ISLR R package, which contains 10,000 simulated credit card default observations, consisting of a target variable (a default indicator) and three features, with a percentage of defaults equal to 3.33%. For simplicity, we only implemented random forests, neural networks, and logistic regression, since these approaches turned out to be very effective on the German credit dataset. Table 10 illustrates the performance of the classifiers with different data-balancing techniques.
From Table 10, we see that RF combined with random oversampling had the best performance in terms of accuracy (0.9846) and AUC (0.9998), which is in line with the outcomes obtained in Section 3.3. The use of data-balancing techniques yielded mixed results. On the one hand, all methods significantly improved the sensitivity, as in the German credit dataset. On the other hand, in terms of accuracy, the results were worse for the NN and LR and remained approximately the same for RF. The AUC was improved for RF and not significantly different for the NN and LR. As in the German credit dataset, RF was the classifier that gained the most from the combination with data-balancing techniques, especially with random oversampling.

Conclusions
Recently, many studies have shed light on credit scoring, which has become one of the cornerstones of credit risk measurement. In this paper, we tried to identify the most important predictors of credit default for the purpose of constructing machine learning classifiers that identify defaulters and non-defaulters as efficiently as possible.
Since our data were imbalanced, we implemented three sample-modifying algorithms and subsequently assessed the performance improvement of the classification models. The take-home messages are that random forest combined with any feature selection algorithm and with random oversampling is the best classifier, and data-balancing algorithms are beneficial, especially for improving sensitivity.
In terms of classifier performance, similar outcomes were obtained by De Castro Vieira et al. (2019) and Trivedi (2020). Given that, in recent years, there has been an exponentially increasing number of papers employing machine learning for the construction of credit scoring models, it is difficult to give a detailed list of the results in the literature here. A good starting point is the references in Leo et al. (2019, Section 3.1).
A possible limitation of this study is that a moderately large dataset was used in the main application. The second empirical analysis was based on a larger sample and seemed to confirm the results, but the data were simulated. Hence, the use of a larger real dataset should be considered to double-check the accuracy of the models. Another issue open to further research is the use of different credit categories to test the models. In future research, we plan to extend the investigation to corporate defaults. Finally, the impact of sample-modifying techniques in datasets where classes are more imbalanced is also worth further scrutiny.

Figure 1 .
p-values of the chi-squared feature selection procedure.

Table 2 .
Description of all the features in the German credit dataset.

Table 3 .
Variables selected via chi-squared feature selection.

Table 4 .
Chi-squared feature selection with different classification algorithms. Values in bold are the maximum of each column.

Table 5 .
The 10 most important predictors obtained via random forest recursive feature elimination.

Table 6 .
Random forest recursive feature elimination with different classification algorithms. Values in bold are the maximum of each column.

Table 9 summarizes the training computing times. We have only reported the results with random oversampling, since those with the SMOTE and SMOTETomek techniques are very similar. We can see that Naïve Bayes had the lowest training time, while the algorithm with the highest training time was the neural network.

Table 9 .
Running time of the algorithms (in seconds).

Table 10 .
Performance of RF, NN, and LR in the Default dataset. Values in bold are the maximum of each column.